Python for Everybody - Full University Python Course (English subtitles)

1 00:00:06,006 --> 00:00:09,200 Hello, everybody, and welcome to chapter one of Python for Everybody. 2 00:00:09,200 --> 00:00:10,120 I'm Charles Severance. 3 00:00:10,120 --> 00:00:13,600 I'm your instructor, and I welcome you to this class. 4 00:00:13,600 --> 00:00:17,680 The basic goal of this class is to teach everybody how to program, 5 00:00:17,680 --> 00:00:19,000 regardless of your background. 6 00:00:19,000 --> 00:00:20,680 You don't have to be a math whiz. 7 00:00:20,680 --> 00:00:24,760 You don't have to be a computer expert. 8 00:00:24,760 --> 00:00:27,520 No matter how old you are or what your background is, 9 00:00:27,520 --> 00:00:28,840 we want to teach you how to program. 10 00:00:28,840 --> 00:00:30,160 So welcome to the course. 11 00:00:30,160 --> 00:00:32,439 Welcome to chapter one. 12 00:00:32,439 --> 00:00:39,040 So the first thing to understand is that the purpose to learn to program 13 00:00:39,040 --> 00:00:41,160 is because computers want to do things for us. 14 00:00:41,160 --> 00:00:46,800 They are built and created and designed, and their hardware is set up 15 00:00:46,800 --> 00:00:51,560 so that they basically ask us, what do you want to do next? 16 00:00:51,560 --> 00:01:00,560 If you grab your phone, your phone sort of does nothing until you tell it 17 00:01:00,560 --> 00:01:01,200 what to do. 18 00:01:01,200 --> 00:01:04,200 It waits for you, and it's just waiting for you. 19 00:01:04,200 --> 00:01:07,560 And all the hardware computer technology around you 20 00:01:07,560 --> 00:01:09,680 is generally waiting for you. 21 00:01:09,680 --> 00:01:12,400 And we can use this for useful things. 22 00:01:12,400 --> 00:01:14,080 We could play video games. 23 00:01:14,080 --> 00:01:18,200 We could have it help navigate for our cars. 24 00:01:18,200 --> 00:01:21,120 Someday we might even have self-driving cars. 
25 00:01:21,120 --> 00:01:26,320 And it's really, in a sense, in my mind, silly 26 00:01:26,320 --> 00:01:30,400 if you spend your whole life not really understanding this technology. 27 00:01:30,400 --> 00:01:35,360 And I think it's important that we learn to tell these computers what to do, 28 00:01:35,360 --> 00:01:39,400 rather than just let them increasingly control our lives. 29 00:01:39,400 --> 00:01:43,440 And so as we'll see, computers aren't very smart on their own. 30 00:01:43,440 --> 00:01:45,920 We humans are the ones that imbue them with knowledge. 31 00:01:45,920 --> 00:01:48,840 And we need to learn to speak their language. 32 00:01:48,840 --> 00:01:52,160 It is much easier for us to learn to speak their language 33 00:01:52,160 --> 00:01:54,340 than it is for them to learn to speak our language. 34 00:01:54,340 --> 00:01:57,760 Although with these cell phones, we're starting to see little bits 35 00:01:57,760 --> 00:01:59,920 where they can begin to understand. 36 00:01:59,920 --> 00:02:03,520 But you would be amazed at the 40 or 50 years 37 00:02:03,520 --> 00:02:10,520 that it has taken us to build programs to begin to understand. 38 00:02:10,520 --> 00:02:14,400 So I'm bringing you into something where you are going 39 00:02:14,400 --> 00:02:18,560 to learn the ways of programming and the ways of the computer. 40 00:02:18,560 --> 00:02:22,560 Because it's easier to teach you how to program than it is to teach this, 41 00:02:22,560 --> 00:02:23,880 how to work in your world. 42 00:02:23,880 --> 00:02:29,880 Even though, ultimately, the goal is to get this to do work for you. 43 00:02:29,880 --> 00:02:33,360 So part of what I'm trying to do is move you from a user perspective, 44 00:02:33,360 --> 00:02:35,640 where you just look at the computer as something 45 00:02:35,640 --> 00:02:40,960 that someone else has constructed and you are the user of, 46 00:02:40,960 --> 00:02:42,800 to the point where you construct new things. 
47 00:02:42,800 --> 00:02:44,840 Now the first kinds of things that you're going to construct 48 00:02:44,840 --> 00:02:47,360 are actually things to solve your own problems. 49 00:02:47,360 --> 00:02:50,680 And it's very popular now to work on data. 50 00:02:50,680 --> 00:02:54,600 And Python is an excellent programming language for data mining and data 51 00:02:54,600 --> 00:02:55,280 analysis. 52 00:02:55,280 --> 00:02:57,200 And that's a lot of what we're going to do in this course. 53 00:02:57,200 --> 00:02:59,600 Although really, it's a gateway to all kinds of things, 54 00:02:59,600 --> 00:03:03,600 like artificial intelligence, or gaming, or navigation, 55 00:03:03,600 --> 00:03:07,440 or mobile applications, or entertainment, all kinds of things. 56 00:03:07,440 --> 00:03:09,240 But first, we have to learn to program. 57 00:03:09,240 --> 00:03:12,520 We have to move from using the computer as a tool 58 00:03:12,520 --> 00:03:14,640 to using the tools within the computer that 59 00:03:14,640 --> 00:03:19,840 allow us to change how the computer sees the world. 60 00:03:19,840 --> 00:03:22,680 So there's a couple of reasons that you might want to be a programmer. 61 00:03:22,680 --> 00:03:26,200 Some of you are looking to improve your career, 62 00:03:26,200 --> 00:03:28,040 to be paid to work on programming. 63 00:03:28,040 --> 00:03:31,600 I've been a paid programmer most of my life, and I like it. 64 00:03:31,600 --> 00:03:33,080 It's a good job. 65 00:03:33,080 --> 00:03:36,600 You don't have to stand in the mud. 66 00:03:36,600 --> 00:03:38,600 You don't have to lift things. 67 00:03:38,600 --> 00:03:39,840 You have to use your brain. 68 00:03:39,840 --> 00:03:43,880 And I'll just say that it has been nice for my career 69 00:03:43,880 --> 00:03:47,040 to not be exposed to the elements, but to be able to work often 70 00:03:47,040 --> 00:03:49,400 wherever I want. 71 00:03:49,400 --> 00:03:51,160 But that's actually our secondary goal. 
72 00:03:51,160 --> 00:03:54,520 Our first goal is to get you to write programs that solve problems 73 00:03:54,520 --> 00:03:55,440 that you have to solve. 74 00:03:55,440 --> 00:03:58,800 Maybe you have a job as an accountant, or a lawyer, 75 00:03:58,800 --> 00:04:00,000 or something else. 76 00:04:00,000 --> 00:04:02,560 And maybe you run across some data. 77 00:04:02,560 --> 00:04:05,600 Maybe there's some system that logs your time, 78 00:04:05,600 --> 00:04:07,960 and it's not quite giving the report that you want to give. 79 00:04:07,960 --> 00:04:10,520 And so you say, could I just grab the log data myself 80 00:04:10,520 --> 00:04:13,800 and write a program to do some analysis to say, 81 00:04:13,800 --> 00:04:16,399 well, what's the average this versus that, 82 00:04:16,399 --> 00:04:19,199 or the average of some other thing? 83 00:04:19,200 --> 00:04:23,760 And so that's the basic idea, that you'll initially 84 00:04:23,760 --> 00:04:26,920 use computers to serve your own ends. 85 00:04:26,920 --> 00:04:28,800 That makes it a lot easier to write programs, 86 00:04:28,800 --> 00:04:31,720 because you don't have to worry about a million users using 87 00:04:31,720 --> 00:04:32,400 your software. 88 00:04:32,400 --> 00:04:34,620 If it works for you, then we're happy. 89 00:04:34,620 --> 00:04:37,000 And so it takes a little more training 90 00:04:37,000 --> 00:04:38,640 to write software for other people, 91 00:04:38,640 --> 00:04:42,960 or for thousands and thousands of other people. 92 00:04:42,960 --> 00:04:44,440 And so part of what I want to do is 93 00:04:44,440 --> 00:04:47,480 I want to change your perspective. 94 00:04:47,480 --> 00:04:50,680 You look at this from the outside, 95 00:04:50,680 --> 00:04:54,640 and you see it from the outside, and you click on things. 96 00:04:54,640 --> 00:04:56,240 I want to turn this around, and I 97 00:04:56,240 --> 00:05:00,600 want you to be the person inside this looking out at the world. 
98 00:05:00,600 --> 00:05:03,420 And as a programmer, we are making things 99 00:05:03,420 --> 00:05:06,480 inside these computers for the world. 100 00:05:06,480 --> 00:05:10,280 And so we want to pull you into being part of this. 101 00:05:10,280 --> 00:05:14,440 We want you inside this, or thinking inside this. 102 00:05:14,440 --> 00:05:19,520 And what you learn is that if you're inside this computer, 103 00:05:19,520 --> 00:05:22,560 and you are taking your instructions to build programs 104 00:05:22,560 --> 00:05:28,520 to be used by the human outside the computer, 105 00:05:28,520 --> 00:05:31,480 you have things that you need to take advantage of. 106 00:05:31,480 --> 00:05:33,440 There's things like the central processing unit, 107 00:05:33,440 --> 00:05:35,360 the memory of this system, the network connection 108 00:05:35,360 --> 00:05:38,440 of this system, the disk drive, or permanent storage 109 00:05:38,440 --> 00:05:39,400 on this system. 110 00:05:39,400 --> 00:05:41,560 And as a programmer, you are kind 111 00:05:41,560 --> 00:05:44,840 of mediating between all those internal resources 112 00:05:44,840 --> 00:05:48,600 that this has that are not very smart, but highly powerful, 113 00:05:48,600 --> 00:05:51,040 and mediating with what that user wants. 114 00:05:51,040 --> 00:05:54,180 And so we take the end user, and we programmers, 115 00:05:54,180 --> 00:05:57,640 we serve the end user, but the computer serves us. 116 00:05:57,640 --> 00:06:01,520 So together, between us and all the computer's resources, 117 00:06:01,520 --> 00:06:04,460 we can serve the needs of the end user. 118 00:06:04,460 --> 00:06:10,120 And we do this by writing code, or programming. 119 00:06:10,120 --> 00:06:11,040 And what is that? 
120 00:06:11,040 --> 00:06:15,080 Well, programming is a sequence of instructions 121 00:06:15,080 --> 00:06:17,520 where we are giving instructions to the resources 122 00:06:17,520 --> 00:06:19,960 inside the computer in a way to accomplish 123 00:06:19,960 --> 00:06:21,080 the goals of the end user. 124 00:06:21,080 --> 00:06:24,600 And remember, sometimes we are our own end user. 125 00:06:24,600 --> 00:06:30,620 It's not just you're not always doing a startup. 126 00:06:30,620 --> 00:06:33,120 You're not always writing a mobile gaming system. 127 00:06:33,120 --> 00:06:35,000 So sometimes you're writing something for yourself. 128 00:06:35,000 --> 00:06:35,720 But that's OK. 129 00:06:38,880 --> 00:06:42,200 So sometimes you're writing something to solve a problem. 130 00:06:42,200 --> 00:06:43,400 You're like crafting. 131 00:06:43,400 --> 00:06:46,960 You're doing something that you could do by hand or manually. 132 00:06:46,960 --> 00:06:51,600 And you're making some clever little 25 or 100 line program. 133 00:06:51,600 --> 00:06:54,440 And you're putting that in. 134 00:06:54,440 --> 00:06:57,240 Other times, like when I work on the open source learning 135 00:06:57,240 --> 00:07:00,760 management system, Sakai, it is my creativity. 136 00:07:00,760 --> 00:07:01,840 I've got an idea. 137 00:07:01,840 --> 00:07:04,200 And I want to share it with a million users. 138 00:07:04,200 --> 00:07:07,520 And so I write my code for an external audience. 139 00:07:07,520 --> 00:07:10,160 And so code is that sequence of instructions 140 00:07:10,160 --> 00:07:14,520 that the computer itself doesn't know how to hand a roster out. 141 00:07:14,520 --> 00:07:17,000 But I can write code that will hand a roster out 142 00:07:17,000 --> 00:07:20,140 by looking at the data that's inside this computer, 143 00:07:20,140 --> 00:07:22,360 inside this application. 
144 00:07:22,360 --> 00:07:24,240 And so if you think about programs, 145 00:07:24,240 --> 00:07:28,360 we have programs for computers and programs for humans. 146 00:07:28,360 --> 00:07:30,920 And a number of years ago, now I'm starting. 147 00:07:30,920 --> 00:07:34,480 Sooner or later, this will be me showing my age. 148 00:07:34,480 --> 00:07:36,560 This is an example of the Macarena. 149 00:07:36,560 --> 00:07:38,800 And the Macarena is a song that effectively 150 00:07:38,800 --> 00:07:40,720 is a sequence of instructions. 151 00:07:40,720 --> 00:07:42,000 You put your left hand out. 152 00:07:42,000 --> 00:07:43,160 You put your right hand out. 153 00:07:43,160 --> 00:07:44,400 You put it on the shoulder. 154 00:07:44,400 --> 00:07:45,520 You wiggle, wiggle, wiggle. 155 00:07:45,520 --> 00:07:46,440 And you spin around. 156 00:07:46,440 --> 00:07:47,240 And you do things. 157 00:07:47,240 --> 00:07:52,760 And this is a program for people. 158 00:07:52,760 --> 00:07:55,200 And so I want you to take a quick look at this 159 00:07:55,200 --> 00:07:58,500 and see if you can find anything wrong 160 00:07:58,500 --> 00:08:01,200 with this particular program. 161 00:08:01,200 --> 00:08:02,240 So look really closely. 162 00:08:09,520 --> 00:08:13,160 So I'll show you. 163 00:08:13,160 --> 00:08:15,040 It's got some typographical errors in it. 164 00:08:15,040 --> 00:08:20,920 And we as humans are really good at reading or hearing 165 00:08:20,920 --> 00:08:23,800 typographical errors and correcting them automatically 166 00:08:23,800 --> 00:08:26,120 and instantly. 167 00:08:26,120 --> 00:08:28,000 But computers are not. 168 00:08:28,000 --> 00:08:31,120 Computers are extremely literal. 169 00:08:31,120 --> 00:08:34,880 If it saw this ham instead of hand, 170 00:08:34,880 --> 00:08:36,919 it would think, what's a ham? 171 00:08:36,919 --> 00:08:39,339 And why am I going to hit someone in the back of the head 172 00:08:39,340 --> 00:08:40,580 with a ham? 
173 00:08:40,580 --> 00:08:44,400 And why would I take my left hand and hit somebody? 174 00:08:44,400 --> 00:08:45,760 These are all bad things. 175 00:08:45,760 --> 00:08:49,340 But the computer is going to take us very literally. 176 00:08:49,340 --> 00:08:52,560 And so we have to be really precise. 177 00:08:52,560 --> 00:08:54,800 And the computer just doesn't know 178 00:08:54,800 --> 00:08:58,760 the difference between what we mean and what we say. 179 00:08:58,760 --> 00:09:00,600 So we have to be very precise. 180 00:09:00,600 --> 00:09:03,980 And this is one of the great frustrations 181 00:09:03,980 --> 00:09:08,040 that people have when they first start using computers. 182 00:09:08,040 --> 00:09:09,600 And so we have to get this right. 183 00:09:09,600 --> 00:09:11,840 We have to get these little bits of text 184 00:09:11,840 --> 00:09:13,520 exactly the way they are. 185 00:09:13,520 --> 00:09:15,400 Computers will blow up with syntax errors. 186 00:09:15,400 --> 00:09:17,480 And they seem to make quite a fuss 187 00:09:17,480 --> 00:09:19,600 when you make the tiniest of errors. 188 00:09:19,600 --> 00:09:20,760 But you'll get used to that. 189 00:09:20,760 --> 00:09:24,960 I mean, that's not because you're bad or you're 190 00:09:24,960 --> 00:09:26,400 less than awesome. 191 00:09:26,400 --> 00:09:29,440 It just means the computers can't compensate 192 00:09:29,440 --> 00:09:30,760 when you make small mistakes. 193 00:09:30,760 --> 00:09:32,740 And so you've got to get used to the fact 194 00:09:32,740 --> 00:09:35,520 that the computer is sort of intellectually 195 00:09:35,520 --> 00:09:37,000 not as strong as you. 196 00:09:37,000 --> 00:09:39,120 And so it gets confused really easy. 197 00:09:39,120 --> 00:09:40,880 Even though when it gets confused, 198 00:09:40,880 --> 00:09:42,800 it says seemingly mean things to you. 199 00:09:42,800 --> 00:09:45,840 So you'll get used to that. 
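[Editor's example] To make the point about computers being extremely literal concrete, here is a minimal sketch that is not part of the lecture. It assumes two made-up one-line programs and uses Python's built-in compile() to ask whether Python can even parse them. The second one is missing only its closing parenthesis.

```python
# Two hypothetical one-line programs (illustrative, not from the lecture).
# The second differs from the first by a single missing ")".
good_source = "print('put your left hand out')"
bad_source = "print('put your left hand out'"

def check(source):
    """Return 'ok' if Python can parse the text, otherwise 'syntax error'."""
    try:
        # compile() only parses the text here; nothing is actually run.
        compile(source, "<example>", "exec")
        return "ok"
    except SyntaxError:
        return "syntax error"

results = [check(good_source), check(bad_source)]
print(results)
```

A human reader would fill in the missing parenthesis without noticing; Python refuses the whole line, which is the behavior the lecture is warning about.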
200 00:09:45,840 --> 00:09:47,980 OK, so the first thing I want to do 201 00:09:47,980 --> 00:09:49,320 is I want to throw up some text. 202 00:09:49,320 --> 00:09:51,880 And I want you to, while this text is up, 203 00:09:51,880 --> 00:09:55,640 I want you to count the number of each word in this text 204 00:09:55,640 --> 00:10:00,240 and tell me what the most common word is in this text. 205 00:10:00,240 --> 00:10:01,520 OK, so here we go. 206 00:10:01,520 --> 00:10:23,600 OK, so I kind of made that hard on you on purpose 207 00:10:23,600 --> 00:10:26,440 by moving around and distracting you and confusing you. 208 00:10:26,440 --> 00:10:29,080 But even if it's not moving at all, 209 00:10:29,080 --> 00:10:31,720 it's a little bit tricky to do. 210 00:10:31,720 --> 00:10:34,040 You probably stare at it a couple of times. 211 00:10:34,040 --> 00:10:36,440 Your brain is going back and forth and back and forth. 212 00:10:36,440 --> 00:10:39,360 And so text analysis is one of the great things 213 00:10:39,360 --> 00:10:41,960 that computers are very, very good at. 214 00:10:41,960 --> 00:10:45,240 And some of the things that they can translate text, 215 00:10:45,240 --> 00:10:46,620 and that's because they've looked 216 00:10:46,620 --> 00:10:47,620 at a lot of information. 217 00:10:47,620 --> 00:10:49,620 So looking at text is actually something computers 218 00:10:49,620 --> 00:10:51,080 are really good at. 219 00:10:51,080 --> 00:10:54,500 And so if we take a look at the kind of programs 220 00:10:54,500 --> 00:10:57,440 that we're going to write to do this kind of thing, 221 00:10:57,440 --> 00:11:00,000 this is something that humans are not naturally good at, 222 00:11:00,000 --> 00:11:01,440 but computers are super good at. 223 00:11:01,440 --> 00:11:05,240 Now, I'm not going to have you look at this code. 224 00:11:05,240 --> 00:11:07,040 I'm not going to, this code you will understand 225 00:11:07,040 --> 00:11:08,120 in a few weeks. 
226 00:11:08,120 --> 00:11:11,160 But basically, this is a set of instructions 227 00:11:11,160 --> 00:11:14,160 to open a file, read that file, 228 00:11:14,160 --> 00:11:16,160 read all the words in the file, 229 00:11:16,160 --> 00:11:19,120 create a histogram of all the words in the file, 230 00:11:19,120 --> 00:11:21,400 and then search through that histogram 231 00:11:21,400 --> 00:11:22,880 to find the most common word 232 00:11:22,880 --> 00:11:26,940 and tell us what the most common word is in the file. 233 00:11:26,940 --> 00:11:29,800 And in this clown file, the word "the" is the most common. 234 00:11:29,800 --> 00:11:31,040 It happened seven times. 235 00:11:31,040 --> 00:11:33,960 And here's another large file called words.txt, 236 00:11:33,960 --> 00:11:36,640 and the word "to" is the most common thing. 237 00:11:36,640 --> 00:11:38,520 And our goal is to get to the point 238 00:11:38,520 --> 00:11:40,560 where you can write this on your own. 239 00:11:40,560 --> 00:11:43,420 So you can say, you know what, I got a problem to solve. 240 00:11:43,420 --> 00:11:45,920 That is, what's the most common word in this file? 241 00:11:45,920 --> 00:11:49,240 I know how to start, and then I know how to finish. 242 00:11:49,240 --> 00:11:51,120 I know how to do the stuff in the middle. 243 00:11:51,120 --> 00:11:53,840 And we have to learn this kind of weird language. 244 00:11:53,840 --> 00:11:57,600 But when we do, we can count millions of words 245 00:11:57,600 --> 00:11:59,640 as easily as we count 20 words. 246 00:11:59,640 --> 00:12:02,200 So that's the fun of all of this, 247 00:12:02,200 --> 00:12:04,200 is to teach you this language 248 00:12:04,200 --> 00:12:06,420 so that you can solve that problem 249 00:12:06,420 --> 00:12:07,900 so that you don't have to solve it. 
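[Editor's example] The code shown on screen is not reproduced in the subtitles, so the following is only a sketch of the steps just described (read the words, build a histogram, search it for the most common word), not the instructor's actual program. The function name most_common_word and the sample sentences are assumptions made for illustration.

```python
def most_common_word(lines):
    """Count every word in an iterable of text lines and return the top one."""
    counts = {}  # the "histogram": word -> number of times seen
    for line in lines:
        for word in line.split():
            counts[word] = counts.get(word, 0) + 1

    # Search through the histogram for the largest count.
    best_word, best_count = None, 0
    for word, count in counts.items():
        if count > best_count:
            best_word, best_count = word, count
    return best_word, best_count

# Hypothetical sample text standing in for a file on disk.
sample = [
    "the clown ran after the car",
    "and the car ran into the tent",
]
word, count = most_common_word(sample)
print(word, count)
```

Because the function accepts any iterable of lines, an open file can be passed in directly (for example `with open(name) as handle: most_common_word(handle)`, where `name` is whatever file you want to analyze), and the same loop handles twenty words or millions.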
250 00:12:07,900 --> 00:12:09,400 Because you could solve it, 251 00:12:09,400 --> 00:12:12,240 but it's not something that you're naturally good at, 252 00:12:12,240 --> 00:12:13,480 and it's hard work. 253 00:12:14,520 --> 00:12:16,020 So up next, we're gonna talk a little bit 254 00:12:16,020 --> 00:12:17,960 about the hardware architecture 255 00:12:17,960 --> 00:12:22,960 that you're gonna be experiencing as you write programs. 256 00:12:25,800 --> 00:12:28,240 Hello, and welcome back to hardware architecture. 257 00:12:28,240 --> 00:12:30,800 Now, you might ask, why do I tell you 258 00:12:30,800 --> 00:12:32,360 about hardware architecture? 259 00:12:34,360 --> 00:12:36,620 Probably you're not gonna build any hardware, 260 00:12:36,620 --> 00:12:38,120 although it's fun stuff to do. 261 00:12:38,120 --> 00:12:40,040 And if you're gonna become a computer scientist, 262 00:12:40,040 --> 00:12:42,560 which most of you won't want to be, 263 00:12:42,560 --> 00:12:43,760 it's a great thing to study. 264 00:12:43,760 --> 00:12:46,040 And it's those who build our hardware 265 00:12:46,040 --> 00:12:48,160 are amazingly talented individuals, 266 00:12:48,160 --> 00:12:49,760 and it's a really rewarding job. 267 00:12:51,400 --> 00:12:53,800 The reason I like talking to you about hardware 268 00:12:53,800 --> 00:12:57,000 is because I want to be able to use words at some point 269 00:12:57,000 --> 00:12:58,520 and say, oh, secondary storage, 270 00:12:58,520 --> 00:12:59,880 or central processing unit, 271 00:12:59,880 --> 00:13:02,360 or random access memory, 272 00:13:02,360 --> 00:13:05,760 or peripherals, input devices. 273 00:13:05,760 --> 00:13:07,000 And I wanna be able to say those words, 274 00:13:07,000 --> 00:13:08,840 and I want you to be able to understand them. 275 00:13:08,840 --> 00:13:11,120 And so I'll start with a little piece of hardware 276 00:13:11,120 --> 00:13:12,680 called the Raspberry Pi. 
277 00:13:12,680 --> 00:13:16,360 And the Raspberry Pi is a cute little single board computer. 278 00:13:17,320 --> 00:13:19,640 As we go forward, these things are smaller, 279 00:13:19,640 --> 00:13:21,300 and smaller, and smaller. 280 00:13:21,300 --> 00:13:23,160 And the interesting thing is that 281 00:13:23,160 --> 00:13:25,060 the architecture of these stays the same, 282 00:13:25,060 --> 00:13:27,600 but the number of components drops. 283 00:13:27,600 --> 00:13:31,440 So I'm gonna start and give you a block diagram 284 00:13:31,440 --> 00:13:34,100 of sort of a generic computer 285 00:13:34,100 --> 00:13:36,460 and tell you the major parts of it. 286 00:13:36,460 --> 00:13:41,020 Now, I'm gonna show you some really old hardware, 287 00:13:41,020 --> 00:13:42,680 some really new hardware, 288 00:13:42,680 --> 00:13:46,560 and then some hardware that is of medium age. 289 00:13:46,560 --> 00:13:47,760 And the medium age hardware 290 00:13:47,760 --> 00:13:49,560 is probably the easiest one to see. 291 00:13:49,560 --> 00:13:52,520 The architecture is the same, okay? 292 00:13:52,520 --> 00:13:57,520 And so the basic block diagram is that the brains, 293 00:13:57,780 --> 00:13:59,720 if there are brains in computers, 294 00:13:59,720 --> 00:14:01,120 which there really aren't, 295 00:14:01,120 --> 00:14:03,760 the software is the closest thing computers have to brains, 296 00:14:03,760 --> 00:14:07,040 but in hardware, the closest brain a computer has is this, 297 00:14:07,040 --> 00:14:11,240 called a micro-processing unit, or a central processor unit. 298 00:14:11,240 --> 00:14:15,240 And this is designed three billion times a second 299 00:14:17,440 --> 00:14:20,320 to ask the question, what do you want me to do next? 
300 00:14:20,320 --> 00:14:23,620 And these little pins on the back are instructions, 301 00:14:23,620 --> 00:14:26,660 like 32 or 64 of these pins, 302 00:14:26,660 --> 00:14:27,920 three billion times a second, 303 00:14:27,920 --> 00:14:30,200 we send an instruction into these things. 304 00:14:30,200 --> 00:14:35,200 Now, we can't sit there and talk to it, we can't. 305 00:14:35,200 --> 00:14:37,040 And so the instructions we store 306 00:14:37,040 --> 00:14:39,080 in what's called the main memory. 307 00:14:39,080 --> 00:14:40,960 And this memory is really fast, 308 00:14:40,960 --> 00:14:43,480 and the memory sort of feeds this. 309 00:14:43,480 --> 00:14:46,700 And so every time the CPU needs a new instruction, 310 00:14:46,700 --> 00:14:49,200 it asks the memory where that instruction is. 311 00:14:49,200 --> 00:14:51,480 And so the memory feeds the instruction to the CPU, 312 00:14:51,480 --> 00:14:53,800 the CPU does it, says give me another instruction, 313 00:14:53,800 --> 00:14:55,920 CPU does it, says give me another instruction, 314 00:14:55,920 --> 00:14:59,800 and that is the basic essence of programming. 315 00:14:59,800 --> 00:15:01,360 This asks what's next, 316 00:15:01,360 --> 00:15:03,920 and this is where your program is stored, 317 00:15:03,920 --> 00:15:06,820 or a program you purchased or came with your hardware, 318 00:15:06,820 --> 00:15:09,920 where that's all stored, and those are your places. 319 00:15:09,920 --> 00:15:13,000 And so you end up inside, 320 00:15:13,000 --> 00:15:16,320 your programs end up inside this memory. 321 00:15:16,320 --> 00:15:19,040 So then there's a, I mean, 322 00:15:19,040 --> 00:15:23,000 and so in software you tend to program the CPU, 323 00:15:23,000 --> 00:15:25,520 and if you had bought a desktop computer 324 00:15:25,520 --> 00:15:26,800 a number of years back, 325 00:15:26,800 --> 00:15:28,960 it would have this thing called the motherboard. 
326 00:15:28,960 --> 00:15:31,100 And the motherboard is called this 327 00:15:31,100 --> 00:15:34,000 because it kind of connects all the components together. 328 00:15:34,000 --> 00:15:36,860 And so if you buy memory by itself, it does nothing, 329 00:15:36,860 --> 00:15:39,420 but it has a place to plug into the motherboard, 330 00:15:39,420 --> 00:15:41,320 and if you buy a microprocessor, 331 00:15:41,320 --> 00:15:44,080 it has a place to plug into the motherboard. 332 00:15:45,640 --> 00:15:49,520 And if you buy a hard drive, 333 00:15:51,000 --> 00:15:53,400 this is a really old hard drive, 334 00:15:53,400 --> 00:15:55,120 it has a place to plug in on the motherboard, 335 00:15:55,120 --> 00:15:57,880 and so the motherboard sort of connects everything together. 336 00:15:57,880 --> 00:16:00,120 The hard drive is secondary storage. 337 00:16:00,120 --> 00:16:04,280 Now the way, how secondary storage is different 338 00:16:04,280 --> 00:16:08,400 than the main memory, which, there it is. 339 00:16:08,400 --> 00:16:11,600 I gotta unpile this stuff. 340 00:16:11,600 --> 00:16:14,460 So this main memory is really fast, 341 00:16:14,460 --> 00:16:17,640 but as soon as you turn the power off of this memory, 342 00:16:17,640 --> 00:16:19,120 it sort of vanishes. 343 00:16:19,120 --> 00:16:21,880 And so to store files like word processing files 344 00:16:21,880 --> 00:16:24,740 or text files or whatever, 345 00:16:24,740 --> 00:16:25,840 you gotta store it on something 346 00:16:25,840 --> 00:16:27,560 that lasts a little bit longer. 347 00:16:27,560 --> 00:16:31,000 And so that's the purpose of the secondary storage. 348 00:16:31,000 --> 00:16:32,220 It's permanent. 349 00:16:32,220 --> 00:16:34,280 When the power's off, it stores it. 
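[Editor's example] The difference between main memory and secondary storage can be felt from inside a program. This is a loose analogy rather than anything shown in the lecture: a variable stands in for main memory, a file in a temporary folder stands in for the hard drive, and deleting the variable stands in for turning the power off. The file name note.txt and the text stored in it are made up.

```python
import os
import tempfile

# This value lives only in main memory while the program runs.
note = "the average is 42"

# Write a copy to a file, which stands in for secondary storage.
folder = tempfile.mkdtemp()
path = os.path.join(folder, "note.txt")  # hypothetical file name
with open(path, "w") as handle:
    handle.write(note)

# Throw away the in-memory copy, as if the power had gone off.
del note

# The copy on disk is still there and can be read back.
with open(path) as handle:
    recovered = handle.read()
print(recovered)
```

Anything a program wants to keep between runs has to be written out this way, which is why so much of the course deals with reading and writing files.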
350 00:16:34,280 --> 00:16:36,680 Now this one here is in such bad shape 351 00:16:36,680 --> 00:16:38,800 that isn't probably storing anything, 352 00:16:38,800 --> 00:16:40,580 but it's got these little heads, 353 00:16:40,580 --> 00:16:43,800 and it spins around and goes in and out. 354 00:16:43,800 --> 00:16:46,480 And we'll have a video later that shows you 355 00:16:46,480 --> 00:16:48,920 one of these things that's not quite in as bad a shape. 356 00:16:48,920 --> 00:16:51,760 If you look, this has four different platters 357 00:16:51,760 --> 00:16:53,560 that are all spinning around. 358 00:16:53,560 --> 00:16:56,460 And so this is just using magnetic material 359 00:16:56,460 --> 00:16:59,280 and electronics that sort of magnetize 360 00:16:59,280 --> 00:17:00,920 and demagnetize this stuff. 361 00:17:00,920 --> 00:17:02,520 And if you look at a disk, 362 00:17:02,520 --> 00:17:04,560 they're often rated, physical disks, 363 00:17:04,560 --> 00:17:06,440 are rated in revolutions per minute. 364 00:17:06,440 --> 00:17:08,520 And that's how many times this thing spins around. 365 00:17:08,520 --> 00:17:11,599 And if you got an old desktop and you hear it spin up, 366 00:17:11,599 --> 00:17:13,519 this is the thing that's spinning. 367 00:17:13,520 --> 00:17:16,280 And it's the place that your operating system lives, 368 00:17:16,280 --> 00:17:18,940 your files live, your applications live, 369 00:17:18,940 --> 00:17:21,319 while they're stored and while the computer's turned off. 370 00:17:21,319 --> 00:17:24,859 And then they're loaded into this while they're running. 371 00:17:24,859 --> 00:17:29,040 And then, this CPU takes the data from the main memory, 372 00:17:36,880 --> 00:17:41,680 and your program runs at three billion operations per second. 373 00:17:41,680 --> 00:17:45,320 So, let's talk a little bit about something 374 00:17:45,320 --> 00:17:50,400 that this is probably from the 1960s or 70s. 
375 00:17:50,400 --> 00:17:54,120 This actually has, if you're an electrical person, 376 00:17:54,120 --> 00:17:59,120 it has capacitors, those little silver things are capacitors. 377 00:17:59,280 --> 00:18:02,400 These little colored things are resistors, 378 00:18:02,400 --> 00:18:03,480 and that's more capacitors. 379 00:18:03,480 --> 00:18:06,440 And then there's wires, and wires move everything. 380 00:18:06,440 --> 00:18:10,920 And so, when you say like this has millions of transistors, 381 00:18:10,920 --> 00:18:13,800 oh wait, that isn't a capacitor, that's a transistor. 382 00:18:13,800 --> 00:18:15,000 That's a transistor. 383 00:18:15,000 --> 00:18:17,960 When you say that this here has etched, 384 00:18:17,960 --> 00:18:19,160 and if you look closely at this, 385 00:18:19,160 --> 00:18:22,040 go look at a picture of a microprocessor online, 386 00:18:22,040 --> 00:18:24,120 you will see that it has millions of these. 387 00:18:24,120 --> 00:18:29,120 And so, the difference between 1960 and today 388 00:18:29,280 --> 00:18:34,120 is this circuitry of capacitors, resistors, 389 00:18:34,120 --> 00:18:39,120 and transistors has been microized and put onto this. 390 00:18:39,880 --> 00:18:42,320 It's using a photographic process, 391 00:18:42,320 --> 00:18:45,560 and they're tinier and tinier and putting more and more on. 392 00:18:45,560 --> 00:18:48,820 And if you think going from millions of these 393 00:18:48,820 --> 00:18:53,640 to one of these is crazy, the thing that's happening now, 394 00:18:53,640 --> 00:18:56,560 and the reason we have whole computers inside our pocket, 395 00:18:56,560 --> 00:19:00,640 is that everything, all of this, this whole thing, 396 00:19:00,640 --> 00:19:04,520 CPU, memory, everything, all of it connected, 397 00:19:04,520 --> 00:19:07,760 and the storage is being made smaller and smaller. 
398 00:19:07,760 --> 00:19:09,720 And so, this little single board computer 399 00:19:09,720 --> 00:19:12,480 called the Raspberry Pi has one thing in it, 400 00:19:12,480 --> 00:19:15,600 and it has the main memory, and it has the CPU, 401 00:19:15,600 --> 00:19:17,800 it has connections for things like peripherals, 402 00:19:17,800 --> 00:19:19,120 like keyboards and stuff. 403 00:19:19,120 --> 00:19:21,840 Now, it doesn't yet have secondary storage on it. 404 00:19:21,840 --> 00:19:26,120 The secondary storage gets plugged in right here via USB. 405 00:19:26,120 --> 00:19:29,720 And then if you take it one step farther to my phone, 406 00:19:29,720 --> 00:19:31,960 it's got the secondary storage built right in. 407 00:19:31,960 --> 00:19:36,880 And so, this picture goes from the size of cabinets 408 00:19:36,880 --> 00:19:40,680 in the old days all the way down to really tiny. 409 00:19:40,680 --> 00:19:44,280 But, at the end of the day, inside it is a highly 410 00:19:44,280 --> 00:19:47,840 sophisticated piece of circuitry that asks for instructions 411 00:19:47,840 --> 00:19:52,080 one at a time, and main memory that holds the instructions 412 00:19:52,080 --> 00:19:54,080 and feeds them, okay? 413 00:19:54,080 --> 00:19:56,000 Central processor does the thinking, 414 00:19:56,000 --> 00:19:58,760 take a look here, central processor does the thinking, 415 00:19:58,760 --> 00:20:01,520 it runs the program, it's asking what's next, 416 00:20:01,520 --> 00:20:04,520 it's not really smart, but it's really fast. 417 00:20:04,520 --> 00:20:08,480 And so, we compensate for the lack of intelligence 418 00:20:08,480 --> 00:20:11,440 of this thing by us writing really good software 419 00:20:11,440 --> 00:20:12,640 that runs really fast. 
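[Editor's example] The loop described here, where the processor asks for one instruction at a time and main memory feeds it, can be imitated with a toy model. Everything in this sketch is an assumption for illustration: the instruction names LOAD, ADD, PRINT and HALT are invented, and a real CPU works with binary machine instructions, not Python tuples.

```python
# "Main memory": a list of made-up instructions for a toy machine.
memory = [
    ("LOAD", 5),      # put 5 in the accumulator
    ("ADD", 7),       # add 7 to it
    ("PRINT", None),  # record the current value
    ("HALT", None),   # stop asking for more
]

def run(memory):
    """Toy CPU: repeatedly ask memory 'what's next?' and do it."""
    accumulator = 0
    program_counter = 0
    output = []
    while True:
        op, arg = memory[program_counter]  # fetch the next instruction
        program_counter += 1
        if op == "LOAD":
            accumulator = arg
        elif op == "ADD":
            accumulator += arg
        elif op == "PRINT":
            output.append(accumulator)
        elif op == "HALT":
            break
    return output

result = run(memory)
print(result)
```

The toy CPU has no idea what the program is for. It just does the next thing it is handed, very quickly, which is the sense in which the hardware is fast but not smart.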
420 00:20:12,640 --> 00:20:16,920 And so, voice recognition on things like phones is possible 421 00:20:16,920 --> 00:20:20,200 because computers have so much storage and they run so fast 422 00:20:20,200 --> 00:20:23,120 and the algorithms that do voice recognition 423 00:20:23,120 --> 00:20:25,440 are finally starting to work. 424 00:20:25,440 --> 00:20:29,600 Input devices like keyboards and mice and pens and whatever, 425 00:20:29,600 --> 00:20:32,400 they come in, output devices are like the screens 426 00:20:32,400 --> 00:20:35,800 that we see, the main memory is the fast part 427 00:20:35,800 --> 00:20:37,920 of the computer that stores all the programs, 428 00:20:37,920 --> 00:20:41,160 and the secondary memory is the permanent storage. 429 00:20:41,160 --> 00:20:44,880 Increasingly, secondary memory, 430 00:20:44,880 --> 00:20:46,800 do I have any USB sticks in here? 431 00:20:47,720 --> 00:20:48,900 I don't. 432 00:20:48,900 --> 00:20:53,600 Well, increasingly secondary memory is flash RAM 433 00:20:53,600 --> 00:20:58,200 or static RAM with no moving parts. 434 00:20:58,200 --> 00:21:02,000 And so, in a few years you'll not even be able to see 435 00:21:02,000 --> 00:21:04,840 secondary memory with moving parts. 436 00:21:04,840 --> 00:21:07,160 But that's okay, it's still secondary memory, 437 00:21:07,160 --> 00:21:09,040 it's still memory that lasts. 438 00:21:09,040 --> 00:21:13,560 And so, you and where your place is in here 439 00:21:13,560 --> 00:21:15,120 is you live in the main memory. 440 00:21:15,120 --> 00:21:17,160 This is you, you are here. 441 00:21:17,160 --> 00:21:20,600 And so, in a sense, when the CPU asks the question 442 00:21:20,600 --> 00:21:23,160 what next, it is your job to answer that. 443 00:21:23,160 --> 00:21:25,880 And you answer that by writing Python code. 444 00:21:25,880 --> 00:21:28,760 And so, your Python code, you'll write a file in Python code. 
445 00:21:28,760 --> 00:21:30,320 Blah, blah, blah, blah, blah, blah, blah. 446 00:21:30,320 --> 00:21:33,440 And then that Python code sort of gets loaded 447 00:21:33,440 --> 00:21:35,720 into main memory, there's a magic translation process 448 00:21:35,720 --> 00:21:36,560 that happens. 449 00:21:36,560 --> 00:21:40,120 And then your code is actually answering this question 450 00:21:40,120 --> 00:21:41,520 three billion times a second. 451 00:21:41,520 --> 00:21:43,960 Three billion times a second, you're sitting there. 452 00:21:43,960 --> 00:21:45,280 But this is you. 453 00:21:45,280 --> 00:21:48,220 You're really out here, but you then write a file 454 00:21:48,220 --> 00:21:50,500 and the file's loaded in and then the file runs. 455 00:21:50,500 --> 00:21:51,680 And that's how things are at. 456 00:21:51,680 --> 00:21:55,480 And that's your place in the world. 457 00:21:55,480 --> 00:21:58,600 Now, what's actually running is not Python code. 458 00:21:58,600 --> 00:22:01,280 There is, as I said, a translation process. 459 00:22:01,280 --> 00:22:05,440 You write a Python file and then Python itself 460 00:22:05,440 --> 00:22:08,120 translates this into the actual language 461 00:22:08,120 --> 00:22:12,120 known by the microprocessor, which is a series of zeros 462 00:22:12,120 --> 00:22:13,400 and ones called machine language. 463 00:22:13,400 --> 00:22:15,680 Someday I would love to teach you a class 464 00:22:15,680 --> 00:22:17,000 on machine language. 465 00:22:17,000 --> 00:22:18,760 But for now, we're gonna teach you Python 466 00:22:18,760 --> 00:22:20,400 and we're gonna use Python as a crutch. 467 00:22:20,400 --> 00:22:21,840 We don't have to talk machine language, 468 00:22:21,840 --> 00:22:24,600 but you could, if you really wanted to, 469 00:22:24,600 --> 00:22:26,160 you could know how to write machine language. 
470 00:22:26,160 --> 00:22:30,000 But I assure you, Python is far easier to learn 471 00:22:30,000 --> 00:22:31,160 than machine language. 472 00:22:31,160 --> 00:22:33,540 So, Python acts as a translator, 473 00:22:33,540 --> 00:22:35,680 translates what you're doing into machine language, 474 00:22:35,680 --> 00:22:38,960 and then the machine language is what's sent back and forth. 475 00:22:38,960 --> 00:22:40,440 But still, even though it's translated 476 00:22:40,440 --> 00:22:42,520 to machine language, it's you. 477 00:22:42,520 --> 00:22:44,360 It is you answering those questions 478 00:22:44,360 --> 00:22:45,600 and that's what a program is, 479 00:22:45,600 --> 00:22:49,880 is you pre-storing your response to the what next question 480 00:22:49,880 --> 00:22:52,120 over and over again. 481 00:22:52,120 --> 00:22:54,200 So, here's a couple of videos that you can look at 482 00:22:54,200 --> 00:22:56,520 on YouTube about a CPU. 483 00:22:56,520 --> 00:22:59,200 These CPUs, and it looks very much like this CPU 484 00:22:59,200 --> 00:23:01,000 that I've got with me, 485 00:23:01,000 --> 00:23:05,920 these CPUs run extremely high heat 486 00:23:05,920 --> 00:23:08,340 when you put this thing on your computer on your lap 487 00:23:08,340 --> 00:23:09,600 and it starts to heat up. 488 00:23:09,600 --> 00:23:12,160 That means it's thinking really, really hard. 489 00:23:12,160 --> 00:23:15,360 And so, this is a small little old video 490 00:23:15,360 --> 00:23:17,580 from a long time ago that shows what happens 491 00:23:17,580 --> 00:23:19,960 when you take out the cooling capability 492 00:23:19,960 --> 00:23:24,080 of microprocessors and just how hot they can be. 493 00:23:24,080 --> 00:23:28,160 And the other video that I have is a hard disk. 494 00:23:28,160 --> 00:23:31,280 Something like this hard disk that I have 495 00:23:31,280 --> 00:23:34,120 except that it works and they turn the power on. 
496 00:23:34,120 --> 00:23:36,440 Some of them last for a few seconds, 497 00:23:36,440 --> 00:23:38,680 some of them last for a few minutes. 498 00:23:38,680 --> 00:23:40,160 It's never a, 499 00:23:40,160 --> 00:23:41,000 achoo! 500 00:23:43,520 --> 00:23:44,360 Achoo! 501 00:23:44,360 --> 00:23:46,440 I must be allergic to this hard drive. 502 00:23:47,280 --> 00:23:49,760 Or maybe it's because there's dust in this hard drive 503 00:23:49,760 --> 00:23:52,160 and I keep spinning it and I sneeze. 504 00:23:52,160 --> 00:23:56,960 But basically, some of them last for a few seconds, 505 00:23:56,960 --> 00:23:58,320 some of them last for a few minutes. 506 00:23:58,320 --> 00:24:00,120 It's not a good idea to open them up, 507 00:24:00,120 --> 00:24:01,640 but I'm glad somebody opened it up 508 00:24:01,640 --> 00:24:04,020 and then did what they did and then recorded it 509 00:24:04,020 --> 00:24:06,920 so we can all enjoy what it is 510 00:24:06,920 --> 00:24:09,000 that they're capable of doing, okay? 511 00:24:09,000 --> 00:24:12,120 So that's a quick introduction to hardware, 512 00:24:12,120 --> 00:24:15,480 mostly so that I can use those words going forward. 513 00:24:15,480 --> 00:24:16,920 Now, what we're gonna talk about next 514 00:24:16,920 --> 00:24:19,680 is communicating in the language Python. 515 00:24:19,680 --> 00:24:22,640 That is, writing code and putting it into the computer 516 00:24:22,640 --> 00:24:27,640 so that that can execute, okay? 517 00:24:30,040 --> 00:24:33,400 And welcome to my video that shows how to get started 518 00:24:33,400 --> 00:24:38,400 and install Python on Microsoft Windows, okay? 519 00:24:38,600 --> 00:24:40,440 So it's not too hard. 520 00:24:40,440 --> 00:24:42,920 We're gonna both install Python 3 521 00:24:42,920 --> 00:24:45,480 and we're going to install Text Editor. 522 00:24:45,480 --> 00:24:48,240 And so I'm just gonna go into Google 523 00:24:48,240 --> 00:24:52,080 and I'm gonna install Python 3. 
524 00:24:52,080 --> 00:24:54,760 And my top link is downloading Python. 525 00:24:55,680 --> 00:25:00,680 And there is my link for downloading Python 3.5.2. 526 00:25:00,800 --> 00:25:03,040 This version of my class uses Python 3. 527 00:25:03,040 --> 00:25:06,040 I have an earlier class that you may have seen 528 00:25:06,040 --> 00:25:08,400 that uses Python 2, but in this class, 529 00:25:08,400 --> 00:25:09,240 we're going to use Python 3. 530 00:25:09,240 --> 00:25:10,880 Now, it might take you a while to download this. 531 00:25:10,880 --> 00:25:13,100 I've actually already downloaded it. 532 00:25:13,100 --> 00:25:16,960 Now, the other thing we need is a programmer's text editor. 533 00:25:16,960 --> 00:25:20,240 And you can really use any programmer's text editor. 534 00:25:20,240 --> 00:25:23,000 We've used Notepad++ in the past. 535 00:25:23,000 --> 00:25:25,120 We've used jEdit in the past. 536 00:25:25,120 --> 00:25:30,120 I like Atom, atom.io, A-T-O-M dot io, 537 00:25:30,200 --> 00:25:31,760 mostly because it works the same 538 00:25:31,760 --> 00:25:35,120 on Windows and Mac and Linux. 539 00:25:35,120 --> 00:25:39,200 But you can really use any text editor that you like. 540 00:25:39,200 --> 00:25:42,960 Just don't use Word or TextEdit 541 00:25:42,960 --> 00:25:44,400 that comes with the operating system. 542 00:25:44,400 --> 00:25:46,240 You need a programmer's editor 543 00:25:46,240 --> 00:25:50,060 that doesn't mess with weird characters or weird lines 544 00:25:50,060 --> 00:25:52,080 or strange formats. 545 00:25:52,080 --> 00:25:55,520 You must have a real programmer's editor. 546 00:25:55,520 --> 00:25:58,200 And so I've already downloaded this as well. 547 00:25:59,040 --> 00:26:02,880 And so I won't waste the time waiting to download it, 548 00:26:02,880 --> 00:26:04,960 but let's go ahead and do the installation. 549 00:26:04,960 --> 00:26:09,960 So these things ended up in my downloads folder.
550 00:26:15,960 --> 00:26:18,020 So I'm going to downloads. 551 00:26:18,020 --> 00:26:21,280 And I'll start installing Python 3.5.2. 552 00:26:22,760 --> 00:26:24,640 Now it's gonna ask me some things. 553 00:26:25,920 --> 00:26:28,120 Add Python 3.5 to the PATH. 554 00:26:28,120 --> 00:26:29,160 And that's a good idea. 555 00:26:29,160 --> 00:26:30,960 Install the launcher for all users. 556 00:26:30,960 --> 00:26:32,880 I'm going to add that. 557 00:26:32,880 --> 00:26:34,760 Maybe you will, maybe you won't do that. 558 00:26:34,760 --> 00:26:37,840 It's gonna tell me where it's going to install it. 559 00:26:42,000 --> 00:26:42,840 Install now. 560 00:26:44,280 --> 00:26:46,140 Of course, it's going to ask me 561 00:26:46,140 --> 00:26:48,120 for permission to do these things. 562 00:26:49,200 --> 00:26:51,440 And now it's running through the installation. 563 00:26:56,200 --> 00:26:57,980 Okay, so there we go. 564 00:26:57,980 --> 00:26:59,160 You could maybe click on this 565 00:26:59,160 --> 00:27:01,080 online tutorial and documentation. 566 00:27:02,520 --> 00:27:04,280 But we're just gonna close this. 567 00:27:06,400 --> 00:27:09,960 And I'm gonna start and run the Windows command line. 568 00:27:09,960 --> 00:27:14,960 Now, you may have all kinds of fancy ways to run Python, 569 00:27:14,960 --> 00:27:18,200 but I like running the command line, 570 00:27:19,200 --> 00:27:22,120 C-O-M-M-A-N-D. 571 00:27:22,120 --> 00:27:26,720 I like running the command line because after a while, 572 00:27:26,720 --> 00:27:29,800 it's important to know what folder things are being run in. 573 00:27:31,000 --> 00:27:33,040 And so here's this command line. 574 00:27:33,040 --> 00:27:36,080 And I should be able to type python here. 575 00:27:36,080 --> 00:27:38,680 And so now I'm in Python 3.5.2.
576 00:27:38,680 --> 00:27:42,160 And this is, the chevron prompt here 577 00:27:42,160 --> 00:27:43,960 is the Python interpreter, 578 00:27:43,960 --> 00:27:46,040 where it's asking for Python commands. 579 00:27:46,040 --> 00:27:47,520 And I can say print. 580 00:27:51,320 --> 00:27:52,880 Hello world. 581 00:27:52,880 --> 00:27:56,840 Of course, this is what we tend to print all the time. 582 00:27:56,840 --> 00:27:57,980 I can make a mistake. 583 00:27:57,980 --> 00:27:59,280 I can say, 584 00:27:59,280 --> 00:28:00,120 lulululul. 585 00:28:06,720 --> 00:28:08,760 Right, and it'll complain to me. 586 00:28:08,760 --> 00:28:09,600 Now to get out of this, 587 00:28:09,600 --> 00:28:11,620 I can either type control-Z or quit. 588 00:28:11,620 --> 00:28:13,640 In this case, I'm gonna type control-Z. 589 00:28:13,640 --> 00:28:15,280 And I'm back to the prompt. 590 00:28:15,280 --> 00:28:16,700 A couple of things, 591 00:28:17,620 --> 00:28:20,920 I can do a dir to see what folders and files I have. 592 00:28:20,920 --> 00:28:22,520 And that is like my desktop. 593 00:28:23,420 --> 00:28:26,200 And then the cd command tells me 594 00:28:26,200 --> 00:28:28,200 where I'm at in the folder. 595 00:28:28,200 --> 00:28:31,520 That means I'm in the user's directory, Dr. Chuck. 596 00:28:32,440 --> 00:28:35,420 So I have now installed Python. 597 00:28:35,420 --> 00:28:38,120 I ran the Python interpreter to verify it. 598 00:28:38,120 --> 00:28:40,480 I said, print hello world. 599 00:28:40,480 --> 00:28:41,640 And so now what I'm gonna do 600 00:28:41,640 --> 00:28:44,080 is I'm gonna actually install Atom. 601 00:28:44,080 --> 00:28:45,680 And I already had this downloaded. 602 00:28:45,680 --> 00:28:48,420 So let's go ahead and install Atom on my computer. 603 00:29:05,240 --> 00:29:07,400 Okay, so Atom is now installed 604 00:29:07,400 --> 00:29:09,920 and it's kind of telling us what to do. 
605 00:29:09,920 --> 00:29:12,960 So I'm gonna actually just close all these windows, 606 00:29:12,960 --> 00:29:15,440 close this window, close everything. 607 00:29:15,440 --> 00:29:18,040 And I'm gonna create a file. 608 00:29:18,040 --> 00:29:20,280 I'm going to say print. 609 00:29:20,280 --> 00:29:21,400 In this case, 610 00:29:22,320 --> 00:29:26,600 let's see if I can make this bigger. 611 00:29:26,600 --> 00:29:28,420 I can make it bigger. 612 00:29:28,420 --> 00:29:33,260 So I'm gonna type print hello from a file. 613 00:29:34,120 --> 00:29:37,680 Okay, and I'm gonna save this. 614 00:29:37,680 --> 00:29:42,680 I'm gonna say file, save as. 615 00:29:43,960 --> 00:29:47,420 And what I'm gonna do is I'm gonna go to my desktop. 616 00:29:50,960 --> 00:29:55,520 And I'm gonna make a folder on the desktop. 617 00:29:55,520 --> 00:29:59,640 I'm gonna call this folder py4e. 618 00:29:59,640 --> 00:30:01,560 So I now have a folder on the desktop. 619 00:30:02,400 --> 00:30:04,240 Move this here, I'll move this here. 620 00:30:04,240 --> 00:30:06,000 Oops. 621 00:30:06,000 --> 00:30:08,380 And I'm gonna go into py4e. 622 00:30:09,520 --> 00:30:13,920 And then I'm gonna name this file first.py. 623 00:30:17,160 --> 00:30:19,340 And you'll notice that when I save this, 624 00:30:20,280 --> 00:30:22,240 when I save this, 625 00:30:22,240 --> 00:30:24,840 it syntax highlighted it. 626 00:30:24,840 --> 00:30:27,520 That's one of the nice things about a programmer editor. 627 00:30:27,520 --> 00:30:30,860 Okay, and so it says, oh, it's got a suffix of.py. 628 00:30:30,860 --> 00:30:34,120 So therefore it knows that it's supposed to look pretty 629 00:30:34,120 --> 00:30:35,800 with Python and make this one color, 630 00:30:35,800 --> 00:30:37,400 make this another color. 631 00:30:37,400 --> 00:30:38,600 The other thing that you'll notice 632 00:30:38,600 --> 00:30:42,120 is that I now have a folder called py4e. 
633 00:30:42,120 --> 00:30:45,780 And if I am in this command line, 634 00:30:45,780 --> 00:30:47,200 let me just start that up again. 635 00:30:47,200 --> 00:30:49,600 I'll show you how to start the command line again. 636 00:30:52,560 --> 00:30:53,400 Command. 637 00:30:55,400 --> 00:30:58,880 Now, if I do a dir, I see the folders in the folder that I'm in. 638 00:30:58,880 --> 00:31:00,760 And one of the folders that you can see here 639 00:31:00,760 --> 00:31:02,160 is the desktop folder. 640 00:31:02,160 --> 00:31:04,080 So I'm gonna say cd desktop. 641 00:31:06,160 --> 00:31:07,600 And then I'm gonna type the dir command 642 00:31:07,600 --> 00:31:10,240 to see what folders are in the desktop. 643 00:31:10,240 --> 00:31:14,560 These folders are the same as these folders. 644 00:31:14,560 --> 00:31:16,520 These things are kind of virtual folders. 645 00:31:16,520 --> 00:31:19,680 py4e is py4e. 646 00:31:19,680 --> 00:31:24,160 Now I can type cd, which stands for change directory, py4e. 647 00:31:25,360 --> 00:31:28,460 And I can do a dir, and I see first.py. 648 00:31:28,460 --> 00:31:32,920 And that's the same as if I'm diving into this folder. 649 00:31:32,920 --> 00:31:34,980 Here's this file, first.py. 650 00:31:34,980 --> 00:31:36,600 Windows hides the suffix, 651 00:31:36,600 --> 00:31:40,120 which is somewhat annoying and frustrating, 652 00:31:40,120 --> 00:31:44,440 but that suffix is there, that file is there. 653 00:31:44,440 --> 00:31:46,440 And so for me, one of the things 654 00:31:46,440 --> 00:31:47,560 you gotta figure out in Windows 655 00:31:47,560 --> 00:31:51,320 is how to make sure that you are in the same folder, 656 00:31:51,320 --> 00:31:54,780 users, Dr. Chuck, desktop, py4e, 657 00:31:54,780 --> 00:31:58,440 and that's the name of this file, and here as well. 658 00:31:58,440 --> 00:32:00,280 And now I'm gonna run this program. 659 00:32:00,280 --> 00:32:05,120 I'm gonna type python first.py.
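For reference, the file saved a moment ago can be sketched roughly like this. It assumes the file is named first.py and sits in the py4e folder, as in the video. The variable name `message` is made up here for illustration; the video just prints the text directly.

```python
# first.py: a one-line program saved in the py4e folder.
# The name `message` is an illustrative assumption, not from the video.
message = "hello from a file"
print(message)
```

From a command line whose current folder is py4e, typing python first.py should run the file and print the message.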
660 00:32:06,900 --> 00:32:10,460 And you see that it ran the Python code, okay? 661 00:32:10,460 --> 00:32:15,460 Another way you can do this is you can type first.py. 662 00:32:15,960 --> 00:32:18,680 And that's because this file association 663 00:32:18,680 --> 00:32:19,640 has happened in Windows. 664 00:32:19,640 --> 00:32:21,160 This doesn't work in Macintosh. 665 00:32:21,160 --> 00:32:22,720 This only works in Windows. 666 00:32:22,720 --> 00:32:26,640 All files ending in .py are expected to be Python, 667 00:32:26,640 --> 00:32:28,320 and it knows where the Python interpreter is 668 00:32:28,320 --> 00:32:30,160 to run it, okay? 669 00:32:30,160 --> 00:32:33,120 And so I've got Python 3 installed, 670 00:32:33,120 --> 00:32:35,320 and that gets me started, 671 00:32:35,320 --> 00:32:40,200 and so I hope that this little introduction 672 00:32:40,200 --> 00:32:41,560 about getting things started 673 00:32:41,560 --> 00:32:43,640 and writing your first Python program 674 00:32:43,640 --> 00:32:44,900 has been helpful to you. 675 00:32:49,760 --> 00:32:51,160 We're going to actually download 676 00:32:51,160 --> 00:32:55,920 and install Python 3 from python.org on a Macintosh. 677 00:32:55,920 --> 00:32:58,640 Your Macintosh for years has wonderfully 678 00:32:58,640 --> 00:32:59,840 come with Python 2. 679 00:32:59,840 --> 00:33:03,220 So if I type python --version, 680 00:33:04,160 --> 00:33:07,640 then I type that. 681 00:33:07,640 --> 00:33:09,680 I see that I've got Python 2. 682 00:33:09,680 --> 00:33:12,560 What we wanna do is, in addition, install Python 3. 683 00:33:12,560 --> 00:33:16,400 One of these days, Macintosh might upgrade 684 00:33:16,400 --> 00:33:19,520 their distributed version to Python 3, 685 00:33:19,520 --> 00:33:21,240 but there's so many things inside Mac 686 00:33:21,240 --> 00:33:22,760 that depend on Python 2.
687 00:33:22,760 --> 00:33:25,680 I'm gonna expect that it will always be named Python 3, 688 00:33:25,680 --> 00:33:29,840 which is what we're gonna call it in a second. 689 00:33:29,840 --> 00:33:33,320 So here I am at the python.org downloads, 690 00:33:33,320 --> 00:33:36,080 and I'm gonna download Python 3. 691 00:33:36,080 --> 00:33:38,640 You click here, and I've actually got it sitting here 692 00:33:38,640 --> 00:33:42,120 in downloads already, because I always do that. 693 00:33:42,120 --> 00:33:45,360 And so I'm gonna install this. 694 00:33:46,560 --> 00:33:48,640 There is the installer. 695 00:33:48,640 --> 00:33:50,960 I'm gonna say continue, continue, continue. 696 00:33:50,960 --> 00:33:53,680 Of course I agree, I read all that really fast, 697 00:33:53,680 --> 00:33:55,320 and now I'm going to install it. 698 00:33:55,320 --> 00:34:00,320 Okay, so now that means if I run a terminal, 699 00:34:09,360 --> 00:34:12,040 so this of course is start run terminal, 700 00:34:12,040 --> 00:34:15,320 so Python 2 is still there, 701 00:34:15,320 --> 00:34:18,800 but Python 3 is also now there, 702 00:34:18,800 --> 00:34:20,800 so we should have Python 3 installed. 703 00:34:20,800 --> 00:34:24,000 So we installed Python 3.6, and so there we go, 704 00:34:24,000 --> 00:34:26,800 and that's all it takes to install Python 3 705 00:34:26,800 --> 00:34:28,719 on the Macintosh. 706 00:34:28,719 --> 00:34:31,419 So let's write our first little Python program. 707 00:34:32,840 --> 00:34:36,000 I'm going to, I like Atom, 708 00:34:38,159 --> 00:34:40,119 and so I've got this Atom editor. 709 00:34:40,120 --> 00:34:43,800 It's atom.io, right here, atom.io, 710 00:34:43,800 --> 00:34:45,980 download and install the Atom editor. 
711 00:34:45,980 --> 00:34:49,440 I like it because Atom works the same 712 00:34:49,440 --> 00:34:52,679 on Windows, Mac, and Linux, 713 00:34:52,679 --> 00:34:54,039 and it has syntax highlighting, 714 00:34:54,040 --> 00:34:55,840 and so I really like things like that. 715 00:34:55,840 --> 00:35:00,240 So I'm gonna make myself a simple Python program. 716 00:35:02,520 --> 00:35:04,360 Hello world, like we always do. 717 00:35:04,360 --> 00:35:06,720 Now you'll notice that it's not syntax highlighting yet, 718 00:35:06,720 --> 00:35:10,960 but I'm gonna do a file, save, oopsie daisy, 719 00:35:10,960 --> 00:35:15,560 file, save as, and I'm gonna go into my desktop, 720 00:35:15,560 --> 00:35:19,460 and I'm gonna make a folder called py4e. 721 00:35:19,460 --> 00:35:24,460 I'm gonna save this file as hello.py. 722 00:35:27,280 --> 00:35:32,280 Oh crud, gotta rename it, rename it. 723 00:35:33,760 --> 00:35:37,120 I ended up with two dots, hello.py, there we are. 724 00:35:37,120 --> 00:35:40,160 And so now I'm here, and I'm in my home folder. 725 00:35:40,160 --> 00:35:42,240 I can go into my desktop, and I can go into 726 00:35:42,240 --> 00:35:45,000 that new folder I made, Python for Everybody, 727 00:35:45,000 --> 00:35:47,040 and I can see the files. 728 00:35:47,040 --> 00:35:50,200 Now there are other ways to run this, and I really don't, 729 00:35:50,200 --> 00:35:52,680 I really want you to learn the terminal 730 00:35:53,560 --> 00:35:54,840 so that you really know what you're doing. 731 00:35:54,840 --> 00:35:56,940 And so here we are, we are in the folder 732 00:35:56,940 --> 00:35:59,660 that has the Python file, and then all we do to run it 733 00:35:59,660 --> 00:36:04,660 is we say python3 hello.py, and there we go. 734 00:36:05,280 --> 00:36:06,600 And of course this is Python 3 735 00:36:06,600 --> 00:36:08,240 because I'm using parentheses there. 736 00:36:08,240 --> 00:36:11,320 So instead of just double quotes.
737 00:36:11,320 --> 00:36:13,440 But Python 2 is still there, and of course 738 00:36:13,440 --> 00:36:15,880 if you just run python hello.py, 739 00:36:15,880 --> 00:36:19,600 it'll be a syntax error, or not. 740 00:36:19,600 --> 00:36:21,000 Must be they added something. 741 00:36:21,000 --> 00:36:21,840 Ha ha ha. 742 00:36:22,760 --> 00:36:25,420 Yeah, because python is still version, 743 00:36:26,940 --> 00:36:29,160 still version two, but apparently they allow print with parentheses 744 00:36:29,160 --> 00:36:31,600 in the latest version of Python 2. 745 00:36:31,600 --> 00:36:33,580 So away we go. 746 00:36:34,480 --> 00:36:37,680 Okay, so again, thanks for watching. 747 00:36:37,680 --> 00:36:41,160 I hope this was helpful to you to get Python 3 748 00:36:41,160 --> 00:36:46,160 installed on your Macintosh. 749 00:36:47,200 --> 00:36:50,160 Hello, and welcome back to Python as a Language. 750 00:36:50,160 --> 00:36:52,280 You'll notice that I'm wearing a hat. 751 00:36:53,880 --> 00:36:57,600 And part of the story of the hat is that 752 00:36:57,600 --> 00:36:59,880 where I work here at the University of Michigan 753 00:36:59,880 --> 00:37:03,840 School of Information, my office is in this building 754 00:37:03,840 --> 00:37:05,520 called North Quad. 755 00:37:05,520 --> 00:37:09,320 And we call it Quadwarts sometimes 756 00:37:09,320 --> 00:37:11,240 because it's sort of got a square, 757 00:37:11,240 --> 00:37:13,740 it sort of imitates an Oxford quad. 758 00:37:13,740 --> 00:37:17,880 And so it seemed to me to evoke notions of Harry Potter. 759 00:37:17,880 --> 00:37:19,680 And when we first moved into the building, 760 00:37:19,680 --> 00:37:22,680 I joked in one of my classes that 761 00:37:22,680 --> 00:37:25,160 we should have a sorting ceremony for all the students 762 00:37:25,160 --> 00:37:28,360 as they come into North Quad for the first time.
763 00:37:28,360 --> 00:37:32,480 And so that was cool, and I thought that I would belong 764 00:37:32,480 --> 00:37:36,880 in Gryffindor, like everyone wants to be in Gryffindor, 765 00:37:36,880 --> 00:37:38,040 right, they're the good guys. 766 00:37:38,040 --> 00:37:41,860 And my students told me that I couldn't be in Gryffindor, 767 00:37:43,240 --> 00:37:45,340 that I had to be in Slytherin. 768 00:37:45,340 --> 00:37:47,900 So you'll see me drinking tea throughout the course 769 00:37:47,900 --> 00:37:49,120 out of this teacup. 770 00:37:49,120 --> 00:37:51,800 It's my Slytherin teacup. 771 00:37:51,800 --> 00:37:54,120 I picked that up from Harry Potter World. 772 00:37:54,120 --> 00:37:58,120 I went down to Florida and visited Harry Potter World. 773 00:37:58,120 --> 00:38:03,120 And the reason that I was sorted by my students 774 00:38:03,120 --> 00:38:08,120 into Slytherin is also because I teach Python. 775 00:38:09,320 --> 00:38:13,360 And Python is like a snake. 776 00:38:13,360 --> 00:38:17,280 And so if you think about the people from Slytherin, 777 00:38:17,280 --> 00:38:20,560 they are capable of talking to snakes. 778 00:38:20,560 --> 00:38:22,080 And the class that we were doing the sorting 779 00:38:22,080 --> 00:38:23,160 was a Python class. 780 00:38:23,160 --> 00:38:26,160 And so it sort of made perfect sense 781 00:38:26,160 --> 00:38:28,800 that you would have to be in Slytherin 782 00:38:28,800 --> 00:38:30,680 if you were the Python teacher. 783 00:38:30,680 --> 00:38:33,920 And of course, your name is Charles Severance. 784 00:38:33,920 --> 00:38:36,720 And then that sounds kind of like Severus Snape. 785 00:38:36,720 --> 00:38:41,720 And so I just accepted that I'm in Slytherin, okay? 786 00:38:43,720 --> 00:38:46,480 So you all can be in Gryffindor, but I can't. 787 00:38:46,480 --> 00:38:47,320 I'm in Slytherin. 788 00:38:47,320 --> 00:38:49,720 So I'm the bad guy or the good guy. 
789 00:38:49,720 --> 00:38:51,800 Depends on how you look at it, right? 790 00:38:52,800 --> 00:38:56,400 And so what I'm going to do now is I'm going to bring you 791 00:38:56,400 --> 00:38:59,720 into Slytherin as well. 792 00:38:59,720 --> 00:39:04,520 Because I'm going to teach you the Python language. 793 00:39:04,520 --> 00:39:09,120 Python is the language that we Pythonistas talk. 794 00:39:09,120 --> 00:39:11,640 It was invented over 20 years ago 795 00:39:11,640 --> 00:39:13,880 by a fellow named Guido van Rossum. 796 00:39:13,880 --> 00:39:16,400 And away we go. 797 00:39:16,400 --> 00:39:20,160 Now, even though I'm using this whole snake Slytherin thing, 798 00:39:20,160 --> 00:39:22,960 it turns out that Python was not at all named 799 00:39:22,960 --> 00:39:26,000 for Harry Potter because Python was invented 800 00:39:26,000 --> 00:39:29,120 almost two decades before Harry Potter was created. 801 00:39:29,120 --> 00:39:30,880 And it wasn't named for the snake. 802 00:39:30,880 --> 00:39:33,560 It was actually, Monty Python's Flying Circus 803 00:39:33,560 --> 00:39:37,360 was the inspiration for Python, the name Python. 804 00:39:37,360 --> 00:39:40,240 And because Guido van Rossum really wanted 805 00:39:40,240 --> 00:39:41,800 to create a programming language, 806 00:39:41,800 --> 00:39:44,440 that while it was powerful underneath it 807 00:39:44,440 --> 00:39:47,120 in its very nature was a very powerful language, 808 00:39:47,120 --> 00:39:50,000 he wanted it to be a language that was fun. 809 00:39:50,000 --> 00:39:52,440 And he wanted it to be a language that was approachable. 810 00:39:52,440 --> 00:39:55,040 And so that's why Python recently has become 811 00:39:55,040 --> 00:39:59,360 so absolutely popular. 812 00:39:59,360 --> 00:40:02,480 And it's easy to learn. 813 00:40:02,480 --> 00:40:03,560 But it's also powerful.
814 00:40:03,560 --> 00:40:05,240 And that's sort of the magic of Python, 815 00:40:05,240 --> 00:40:09,600 is the ease of learning it, the brevity of the programs, 816 00:40:09,600 --> 00:40:14,320 the shortness of the programs, and the power. 817 00:40:14,320 --> 00:40:18,000 And so we are going to become Pythonistas. 818 00:40:18,000 --> 00:40:21,760 Now, as you learn to be a software developer using 819 00:40:21,760 --> 00:40:24,520 the Python programming language, you 820 00:40:24,520 --> 00:40:27,600 are going to encounter syntax errors. 821 00:40:27,600 --> 00:40:30,840 And I remember when I used to get syntax errors. 822 00:40:30,840 --> 00:40:34,840 And I remember my first programming class. 823 00:40:34,840 --> 00:40:37,680 And I would type on cards. 824 00:40:46,320 --> 00:40:51,800 And I would upload those cards to the computer. 825 00:40:51,800 --> 00:40:56,240 And the computer would say, you're not worthy. 826 00:40:56,240 --> 00:40:58,280 And I'm like, wait a sec, those are pretty good cards. 827 00:40:58,280 --> 00:41:01,280 How could you be so critical of me? 828 00:41:01,280 --> 00:41:02,400 I'd say syntax error. 829 00:41:02,400 --> 00:41:07,000 And I really got sort of a really bad attitude 830 00:41:07,000 --> 00:41:09,800 that somehow this computer didn't like me. 831 00:41:09,800 --> 00:41:12,680 And that I would make cards that would complain. 832 00:41:12,680 --> 00:41:14,320 And I would make changes to the cards. 833 00:41:14,320 --> 00:41:15,280 And it would still complain. 834 00:41:15,280 --> 00:41:17,280 And I make changes that would still complain. 835 00:41:17,280 --> 00:41:20,200 I'm like, how can I win in this situation? 836 00:41:20,200 --> 00:41:21,760 And you're going to feel the same thing. 837 00:41:21,760 --> 00:41:23,760 You're going to absolutely feel the same thing. 838 00:41:23,760 --> 00:41:25,440 You're going to be struggling. 839 00:41:25,440 --> 00:41:28,480 You're going to be like, how come this computer hates me? 
840 00:41:28,480 --> 00:41:31,240 Let me assure you right now the computer doesn't hate you. 841 00:41:31,240 --> 00:41:33,320 The computer actually loves you. 842 00:41:33,320 --> 00:41:36,760 It just is not very good at showing how it loves you 843 00:41:36,760 --> 00:41:40,080 or telling you how or why it loves you. 844 00:41:40,080 --> 00:41:45,120 And so syntax errors are not so much Python telling you 845 00:41:45,120 --> 00:41:47,560 that you're bad or that you're an inadequate programmer 846 00:41:47,560 --> 00:41:49,840 or you should find something else to do. 847 00:41:49,840 --> 00:41:52,680 It's really Python's admission that it doesn't understand 848 00:41:52,680 --> 00:41:54,280 what you're trying to say. 849 00:41:54,280 --> 00:41:56,080 And so you've got to get used to that. 850 00:41:56,080 --> 00:41:58,000 And it's frustrating, but you've got 851 00:41:58,000 --> 00:42:00,680 to get used to the fact that syntax errors are your friend. 852 00:42:00,680 --> 00:42:03,480 Python is saying, hey, I got to line seven. 853 00:42:03,480 --> 00:42:05,120 And I was doing fine up to line seven. 854 00:42:05,120 --> 00:42:08,360 But boy, in line seven, there's some little thing. 855 00:42:08,360 --> 00:42:12,600 I don't know what the word else means in this context. 856 00:42:12,600 --> 00:42:13,840 Or you didn't indent it. 857 00:42:13,840 --> 00:42:15,160 And so I'm kind of confused. 858 00:42:15,160 --> 00:42:15,880 What did you mean? 859 00:42:15,880 --> 00:42:18,560 Please, please, please help me. 860 00:42:18,560 --> 00:42:20,800 And so it's so much easier for you 861 00:42:20,800 --> 00:42:23,880 to learn Python than it is for Python 862 00:42:23,880 --> 00:42:26,840 to figure out what you mean when you're writing code. 863 00:42:26,840 --> 00:42:28,320 So we have a number of different ways 864 00:42:28,320 --> 00:42:30,480 to sort of encode our instructions 865 00:42:30,480 --> 00:42:32,040 when we talk to Python. 
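To make the "I got to line seven and got confused" idea concrete, here is a small sketch. The three-line program in `source` is made up for illustration and is not from the lecture. It puts the word else where Python does not expect it, then asks Python which line it gave up on. The built-in compile() only checks the text; it does not run it.

```python
# A tiny made-up program with a stray `else` on its second line.
source = "x = 1\nelse:\n    print(x)\n"

error_line = None
try:
    # compile() only parses the text; nothing in `source` is executed.
    compile(source, "<example>", "exec")
except SyntaxError as err:
    # Python reports the line where it stopped understanding.
    error_line = err.lineno
    print("Python got confused at line", error_line)
```

The point is that the message is a location and a hint, not a judgment. Python was fine with line 1 and only got lost at the else.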
866 00:42:32,040 --> 00:42:35,120 One is we just run Python interactively on our computer. 867 00:42:35,120 --> 00:42:37,240 Hopefully, by now, you've got it installed. 868 00:42:37,240 --> 00:42:39,640 And you just type Python at a command prompt. 869 00:42:39,640 --> 00:42:42,120 So either a Windows command prompt or a Linux command 870 00:42:42,120 --> 00:42:44,160 prompt or a Macintosh command prompt. 871 00:42:44,160 --> 00:42:47,240 And I got some examples of how to sort of get this all started, 872 00:42:47,240 --> 00:42:50,160 get Python installed, and away you go. 873 00:42:50,160 --> 00:42:52,360 Now, you'll notice when you run the Python interpreter, 874 00:42:52,360 --> 00:42:57,000 the three chevron prompt, Python is asking you what next. 875 00:42:57,000 --> 00:42:57,960 This is you. 876 00:42:57,960 --> 00:43:00,280 It's saying, I want to talk to you. 877 00:43:00,280 --> 00:43:02,960 I want you to tell me some Python to do. 878 00:43:02,960 --> 00:43:04,480 If you know the Python language, you 879 00:43:04,480 --> 00:43:07,080 know what to say right here. 880 00:43:07,080 --> 00:43:09,800 Now, if you know Python, you can type these languages. 881 00:43:09,800 --> 00:43:12,360 You can say, oh, x equals 1, which really means, 882 00:43:12,360 --> 00:43:14,640 go find a little piece of memory, label it x, 883 00:43:14,640 --> 00:43:15,880 and stick 1 in it. 884 00:43:15,880 --> 00:43:17,440 Print x is like, go find that thing 885 00:43:17,440 --> 00:43:19,720 where you labeled it x, and bring me back that number 886 00:43:19,720 --> 00:43:21,120 and tell me what I stored in there. 887 00:43:21,120 --> 00:43:23,560 Now, why you want to do this, that's a different question. 888 00:43:23,560 --> 00:43:24,960 These are very simple things. 889 00:43:24,960 --> 00:43:27,040 It's going to take you a while to get the big picture of why 890 00:43:27,040 --> 00:43:27,720 we're doing this. 
891 00:43:27,720 --> 00:43:31,540 So just trust me that you want to learn these statements. 892 00:43:31,540 --> 00:43:33,520 And then later, we will successfully 893 00:43:33,520 --> 00:43:35,840 turn those into a program. 894 00:43:35,840 --> 00:43:39,280 So x equals x plus 1, the third line there. 895 00:43:39,280 --> 00:43:43,760 x equals x plus 1 is not, as it seems in math, 896 00:43:43,760 --> 00:43:46,320 it basically says, hey, go grab the old value of x, 897 00:43:46,320 --> 00:43:48,240 add 1 to it, and stick it back in x. 898 00:43:48,240 --> 00:43:49,240 That's what that means. 899 00:43:49,240 --> 00:43:52,720 So equal sign really has kind of an arrow to it. 900 00:43:52,720 --> 00:43:54,880 And then we say, hey, go look up that x thing 901 00:43:54,880 --> 00:43:56,720 that we just did, and print that out, 902 00:43:56,720 --> 00:43:58,680 and then we're going to say, quit. 903 00:43:58,680 --> 00:44:00,880 So that's us talking to Python. 904 00:44:00,880 --> 00:44:04,160 Now, you can type just about any crazy stuff you want in here, 905 00:44:04,160 --> 00:44:08,120 and Python will be unhappy and talk to you. 906 00:44:08,120 --> 00:44:10,880 So what we're going to do next is 907 00:44:10,880 --> 00:44:13,520 we're going to start talking about the actual language 908 00:44:13,520 --> 00:44:16,640 of Python and what it is that we have to say 909 00:44:16,640 --> 00:44:19,120 to make Python happy when we're talking to it. 910 00:44:24,080 --> 00:44:25,760 So now we're going to start learning 911 00:44:25,760 --> 00:44:28,380 the actual Python language. 912 00:44:28,380 --> 00:44:30,440 So what do we say? 913 00:44:30,440 --> 00:44:32,640 You can think of this as almost like writing, 914 00:44:32,640 --> 00:44:34,880 almost like writing a story. 915 00:44:34,880 --> 00:44:36,960 We're going to start with a basic vocabulary. 916 00:44:36,960 --> 00:44:40,200 We're going to talk a little bit about lines or sentences. 
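The interactive session described above can be sketched as the following statements, assuming Python 3, where print is called with parentheses. At the three-chevron prompt you would type them one at a time and finish with quit(); the quit() call is left out here so the lines also run as a plain file.

```python
# Each statement is what you would type at the >>> prompt, one line at a time.
x = 1          # find a piece of memory, label it x, and store 1 in it
print(x)       # look up x and show the value stored there
x = x + 1      # take the old value of x, add 1, and store the result back in x
print(x)       # look up x again; it now holds the updated value
```

Reading the equal sign as an arrow pointing left helps here: the right-hand side is worked out first, and only then is the result stored under the name on the left.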
917 00:44:40,200 --> 00:44:41,680 And then we're going to start talking about 918 00:44:41,680 --> 00:44:44,200 how to put those sentences together 919 00:44:44,200 --> 00:44:47,640 to make a coherent paragraph, as it were. 920 00:44:47,640 --> 00:44:49,960 And you just have to accept the fact 921 00:44:49,960 --> 00:44:52,480 that when I start teaching you this stuff, 922 00:44:52,480 --> 00:44:53,800 it's not going to make sense 923 00:44:53,800 --> 00:44:56,800 for about six or seven more chapters. 924 00:44:56,800 --> 00:44:59,440 And so just sort of bear with me, 925 00:44:59,440 --> 00:45:02,520 except, I mean, I remember when I first learned, 926 00:45:02,520 --> 00:45:05,560 it went from me confused, confused, confused, 927 00:45:05,560 --> 00:45:09,120 confused, confused, holy mackerel, this is awesome. 928 00:45:09,120 --> 00:45:12,380 And so I expect many of you will go through that same thing. 929 00:45:12,380 --> 00:45:14,880 So just learn the first parts, 930 00:45:14,880 --> 00:45:17,960 accept the fact that it doesn't necessarily make sense 931 00:45:17,960 --> 00:45:22,960 in a big picture, and just bear with us, okay? 932 00:45:23,040 --> 00:45:24,280 So we'll start with vocabulary, 933 00:45:24,280 --> 00:45:25,440 we'll start to make sentences, 934 00:45:25,440 --> 00:45:29,120 and then we'll have little short stories and paragraphs, okay? 935 00:45:29,120 --> 00:45:31,240 And so this is a short story 936 00:45:31,240 --> 00:45:34,040 about how to count the words in Python. 937 00:45:34,040 --> 00:45:35,400 It's got a couple of paragraphs, 938 00:45:35,400 --> 00:45:39,520 and we are going to look at all of this stuff eventually. 939 00:45:39,520 --> 00:45:42,980 So we start with a set of reserved words. 940 00:45:42,980 --> 00:45:44,780 And what are reserved words? 
941 00:45:44,780 --> 00:45:49,440 Well, they're words that Python expects 942 00:45:49,440 --> 00:45:51,720 when you use these words that they're gonna mean 943 00:45:51,720 --> 00:45:53,760 exactly what Python expects to mean. 944 00:45:53,760 --> 00:45:55,520 And what it really means is you're not allowed 945 00:45:55,520 --> 00:45:57,040 to use them for any other purpose 946 00:45:57,040 --> 00:45:58,440 than the purpose that Python wants. 947 00:45:58,440 --> 00:46:00,440 It's sort of part of the contract. 948 00:46:00,440 --> 00:46:02,720 It's like when you have a dog, 949 00:46:02,720 --> 00:46:07,340 and you go, what did you think of that television program? 950 00:46:07,340 --> 00:46:08,960 And the dog has no idea what you're saying, 951 00:46:08,960 --> 00:46:13,200 and then you say, do you wanna wait until Saturday 952 00:46:13,200 --> 00:46:17,260 to go to the veterinarian? 953 00:46:17,260 --> 00:46:19,040 And the dog still doesn't know what you're saying. 954 00:46:19,040 --> 00:46:22,160 Then you go like, how would you like to take a walk? 955 00:46:22,160 --> 00:46:23,640 And then the dog goes, walk? 956 00:46:23,640 --> 00:46:25,880 I know what that means, and then hits the door, right? 957 00:46:25,880 --> 00:46:29,320 And so the way the dog sees you is blah, blah, blah, 958 00:46:29,320 --> 00:46:31,120 walk, blah, blah, blah, blah, food, 959 00:46:31,120 --> 00:46:34,480 blah, blah, blah, blah, treat, blah, blah, blah, blah, walk. 960 00:46:34,480 --> 00:46:37,200 That's kinda how Python looks at these reserved words. 961 00:46:37,200 --> 00:46:39,040 When you say class, it goes class. 962 00:46:39,040 --> 00:46:40,520 Oh, I know what that means. 963 00:46:40,520 --> 00:46:44,120 Now, if I say zap, it's like, oh, zap's something 964 00:46:44,120 --> 00:46:47,280 that you get to decide, or it's maybe a variable name. 
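One way to see which words Python treats as reserved is to ask the standard library's keyword module. This is a small sketch assuming Python 3; zap is just the made-up name from the lecture, and the value 42 is arbitrary.

```python
import keyword

# keyword.kwlist is the list of words the interpreter reserves for itself.
print(keyword.kwlist)

# "class" is reserved; "zap" is not, so it is free to be used as a variable name.
class_is_reserved = keyword.iskeyword("class")
zap_is_reserved = keyword.iskeyword("zap")
print(class_is_reserved, zap_is_reserved)

zap = 42   # fine: zap is a name we made up, so Python lets us decide what it means
```

Trying to assign to a reserved word instead, for example class = 42, is rejected with a syntax error before the program runs at all.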
965 00:46:47,280 --> 00:46:49,440 So reserved words are simply words 966 00:46:49,440 --> 00:46:52,240 that when you use these words in Python, 967 00:46:52,240 --> 00:46:53,960 and there's only a few of them, 968 00:46:53,960 --> 00:47:00,960 like and, or del, or if, maybe, pass, maybe, in. 969 00:47:02,320 --> 00:47:04,760 A lot of these, you won't end up using them, 970 00:47:04,760 --> 00:47:06,880 it's just these are reserved for Python 971 00:47:06,880 --> 00:47:08,960 and part of the Python vocabulary. 972 00:47:08,960 --> 00:47:11,160 This is Python vocabulary. 973 00:47:11,160 --> 00:47:15,240 Now, when we move from words to sentences, 974 00:47:15,240 --> 00:47:17,760 you see that Python is a series of lines. 975 00:47:17,760 --> 00:47:20,400 A Python program is a series of statements. 976 00:47:20,400 --> 00:47:22,960 They have an order because the computer wants to know 977 00:47:22,960 --> 00:47:24,520 what next, what next, what next. 978 00:47:24,520 --> 00:47:27,880 So, what next is start at the beginning. 979 00:47:27,880 --> 00:47:30,040 So, I already talked about an assignment statement 980 00:47:30,040 --> 00:47:32,320 that basically says x equals two, 981 00:47:32,320 --> 00:47:33,880 this is not a mathematical statement, 982 00:47:33,880 --> 00:47:38,120 this is a directive to say, take this variable two, 983 00:47:38,120 --> 00:47:39,880 this value two, this constant two, 984 00:47:39,880 --> 00:47:42,520 and stick it in a location in your memory, 985 00:47:42,520 --> 00:47:45,400 and remember that I asked you to name it x. 986 00:47:45,400 --> 00:47:47,760 X is a variable, something you made up. 987 00:47:47,760 --> 00:47:52,140 You chose that, but it's Python's job to remember it. 
988 00:47:52,140 --> 00:47:55,080 So, this says go, whatever that x is, 989 00:47:55,080 --> 00:47:58,120 there's a two in there, now pull that x back out, 990 00:47:58,120 --> 00:48:00,200 add two to it, which makes it four, 991 00:48:00,200 --> 00:48:02,600 and stick it back in x, and so that makes this a four. 992 00:48:02,600 --> 00:48:06,280 So, x is a four, and print x says, 993 00:48:06,280 --> 00:48:09,040 go look up that thing that was an x and print it out. 994 00:48:09,040 --> 00:48:12,360 And so, these are like, each line has something to it, 995 00:48:12,360 --> 00:48:15,000 I'm using a reserved word, well actually that's a function, 996 00:48:15,000 --> 00:48:17,460 but it's a reserved word too. 997 00:48:18,480 --> 00:48:21,760 And so, there's reserved words and all these things, 998 00:48:21,760 --> 00:48:24,560 and you combine these, there are operators, 999 00:48:24,560 --> 00:48:26,680 plus is an operator, equals is an operator, 1000 00:48:26,680 --> 00:48:29,800 these things do things, and we'll learn all this stuff 1001 00:48:29,800 --> 00:48:33,980 in time, so the basic building blocks of lines of Python. 1002 00:48:35,920 --> 00:48:38,640 Now, as we take these lines of Python and build them up, 1003 00:48:38,640 --> 00:48:42,000 we end up making paragraphs, programming in paragraphs. 1004 00:48:42,000 --> 00:48:45,200 And so, one of the things that it's important is 1005 00:48:45,200 --> 00:48:47,620 I showed you how to do interactive Python, 1006 00:48:47,620 --> 00:48:49,840 so you just type Python and you type a statement 1007 00:48:49,840 --> 00:48:52,860 and a statement, those get really tiring 1008 00:48:52,860 --> 00:48:54,940 after about three or four lines of Python, 1009 00:48:54,940 --> 00:48:57,840 because you start making mistakes and you have to start over. 
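The three lines discussed above (assign, update, print) are sketched below, assuming Python 3. One detail worth checking for yourself: in Python 3, print is a built-in function and does not appear in the list of reserved words, which the last two lines confirm with the standard keyword module.

```python
import keyword

x = 2            # assignment: store the constant 2 under the name x
x = x + 2        # + is an operator; the result, 4, is stored back in x
print(x)         # print is called as a function, with x as its argument

# Ask Python whether "print" is a reserved word in this version.
print_is_reserved = keyword.iskeyword("print")
print(print_is_reserved)
```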
1010 00:48:57,840 --> 00:49:00,900 So, the better thing to do is to, as your program 1011 00:49:00,900 --> 00:49:03,400 gets a little larger, to write a script, 1012 00:49:03,400 --> 00:49:05,560 put your Python instructions in a file, 1013 00:49:05,560 --> 00:49:08,520 and then tell Python to read from the file, 1014 00:49:08,520 --> 00:49:12,760 and then run the script as it's entered in that file. 1015 00:49:12,760 --> 00:49:15,480 We tend to name these files with .py, 1016 00:49:15,480 --> 00:49:18,300 and I've got a series of videos that you can watch 1017 00:49:18,300 --> 00:49:20,520 to figure out how this all works. 1018 00:49:20,520 --> 00:49:23,060 Like I said, you can type interactively to Python, 1019 00:49:23,060 --> 00:49:25,740 and it's a great way to experiment with Python, 1020 00:49:25,740 --> 00:49:28,240 check to see if a statement does what you think it does, 1021 00:49:28,240 --> 00:49:31,360 but a script is the way, after we are past 1022 00:49:31,360 --> 00:49:33,700 one or two lines of code, we write it in files 1023 00:49:33,700 --> 00:49:35,060 and then run it separately. 1024 00:49:37,240 --> 00:49:39,880 So, there are a couple of basic patterns, 1025 00:49:39,880 --> 00:49:42,320 and it's really important to understand 1026 00:49:42,320 --> 00:49:43,920 each of these patterns, and like I said, 1027 00:49:43,920 --> 00:49:45,920 we'll teach you these patterns separately, 1028 00:49:45,920 --> 00:49:47,880 and then we'll combine them together. 1029 00:49:47,880 --> 00:49:49,360 And when you combine them together is when you say, 1030 00:49:49,360 --> 00:49:51,000 oh, that's what a program is. 1031 00:49:51,000 --> 00:49:53,960 So, you have to suspend disbelief. 1032 00:49:53,960 --> 00:49:55,640 We have a couple of different patterns. 1033 00:49:55,640 --> 00:49:57,640 One is a sequence of steps. 1034 00:49:57,640 --> 00:49:59,160 Do this, then do this, then do this. 1035 00:49:59,160 --> 00:50:01,440 Conditional is like skipping something. 
1036 00:50:01,440 --> 00:50:03,640 Repeated does it over and over and over again. 1037 00:50:03,640 --> 00:50:05,900 Computers are really good at repeating stuff. 1038 00:50:05,900 --> 00:50:06,800 Much better than people. 1039 00:50:06,800 --> 00:50:09,600 People get tired going over and over doing the same thing. 1040 00:50:09,600 --> 00:50:12,800 And then we have store and repeated steps as well. 1041 00:50:12,800 --> 00:50:14,880 And so, if we take a look at this, 1042 00:50:14,880 --> 00:50:19,000 and we take a look at a Python program, this is a piece 1043 00:50:19,000 --> 00:50:20,080 of code, this is a little script. 1044 00:50:20,080 --> 00:50:23,460 If you type this into a code, take this Python code 1045 00:50:23,460 --> 00:50:26,760 into a file and run it, it starts at the beginning, 1046 00:50:26,760 --> 00:50:28,400 and then it goes to the next line, and the next line, 1047 00:50:28,400 --> 00:50:29,240 and the next line. 1048 00:50:29,240 --> 00:50:32,040 And Python executes the scripts as you write them. 1049 00:50:32,040 --> 00:50:36,320 So, it says, stick a variable, find a place called 1050 00:50:36,320 --> 00:50:40,080 in your memory called x, stick two into that, okay. 1051 00:50:40,080 --> 00:50:41,760 Then go to the next one, print that out. 1052 00:50:41,760 --> 00:50:43,640 So, the program is producing output. 1053 00:50:43,640 --> 00:50:46,620 Now, go read x and add two to it, and stick it back in x. 1054 00:50:46,620 --> 00:50:48,800 So, x is four, then print that. 1055 00:50:48,800 --> 00:50:51,200 This side over here, this is called a flow chart. 1056 00:50:51,200 --> 00:50:52,760 I'm not gonna make you draw flow charts. 1057 00:50:52,760 --> 00:50:54,440 I'm only gonna draw them a few times 1058 00:50:54,440 --> 00:50:56,160 in ways that I think will help you. 
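The sequential script walked through above might look like this when saved in a file. The file name steps.py is an assumption for illustration; with Python 3 installed, something like python3 steps.py at a command prompt would typically run it, though the exact command depends on your setup.

```python
# Sequential steps: each line runs once, top to bottom, with no skipping or repeating.
x = 2            # find a place in memory called x and store 2 in it
print(x)         # first line of output
x = x + 2        # read x, add 2, and store the result back in x
print(x)         # second line of output
```

Running it produces two lines of output, one for each print, in the same order the statements appear in the file.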
1059 00:50:56,160 --> 00:50:58,040 But you can think of it as Python, 1060 00:50:58,040 --> 00:51:00,280 when it finishes something, it goes onto the next one, 1061 00:51:00,280 --> 00:51:01,740 unless you tell it otherwise. 1062 00:51:01,740 --> 00:51:03,280 Finishes this, goes onto the next one. 1063 00:51:03,280 --> 00:51:05,160 Finishes this, goes onto the next one. 1064 00:51:05,160 --> 00:51:08,400 Finishes this, and now the program is all done. 1065 00:51:08,400 --> 00:51:10,080 And so, that's sequential steps. 1066 00:51:10,080 --> 00:51:12,840 You just type them in, Python runs it. 1067 00:51:12,840 --> 00:51:15,840 They're important, but sort of uninteresting, 1068 00:51:15,840 --> 00:51:18,800 because you can only get so far. 1069 00:51:18,800 --> 00:51:20,320 And you can't really make them intelligent, 1070 00:51:20,320 --> 00:51:22,200 because it's always gonna do the next one. 1071 00:51:22,200 --> 00:51:24,520 So, the next thing we do is what are called conditional steps. 1072 00:51:24,520 --> 00:51:27,080 And this is where it starts to get intelligent. 1073 00:51:27,080 --> 00:51:30,400 I mean, where you are able to encode your brain 1074 00:51:30,400 --> 00:51:32,360 into the computer, like, oh wait a sec, 1075 00:51:32,360 --> 00:51:34,680 let's only do this step if something is true. 1076 00:51:34,680 --> 00:51:38,160 And the syntax that we tend to use here 1077 00:51:38,160 --> 00:51:42,880 is the reserved word if, if, okay? 1078 00:51:42,880 --> 00:51:46,360 And so, the if is like a little fork in the road. 1079 00:51:46,360 --> 00:51:48,440 You can go one way, or you can go another way, 1080 00:51:48,440 --> 00:51:50,040 and you're asking a question. 1081 00:51:50,040 --> 00:51:52,120 So, inside the if statement, right here, 1082 00:51:52,120 --> 00:51:54,800 there is a question, saying, is x less than 10? 1083 00:51:54,800 --> 00:51:58,600 That's a, that resolves to a true or false. 
1084 00:51:58,600 --> 00:52:00,100 If it's less than 10, that's true. 1085 00:52:00,100 --> 00:52:02,000 If it's greater than 10, it's false. 1086 00:52:02,000 --> 00:52:06,360 And so, then what we do is, if it's less than 10, 1087 00:52:06,360 --> 00:52:07,880 we have this indented block of code. 1088 00:52:07,880 --> 00:52:09,440 There's also this colon that tells us 1089 00:52:09,440 --> 00:52:11,280 we're in the beginning of an indented block of code. 1090 00:52:11,280 --> 00:52:13,120 And so, what it basically says is, 1091 00:52:13,120 --> 00:52:15,440 if this is true, run that code. 1092 00:52:15,440 --> 00:52:17,160 If it's false, skip that code. 1093 00:52:17,160 --> 00:52:18,960 So, it can either run it or skip it, 1094 00:52:18,960 --> 00:52:23,160 depending on this question that's being asked. 1095 00:52:23,160 --> 00:52:24,240 Now, if you look at this code, 1096 00:52:24,240 --> 00:52:27,180 it's pretty obvious what's going on. 1097 00:52:27,180 --> 00:52:29,300 It comes down, x is five. 1098 00:52:29,300 --> 00:52:31,600 If x is less than 10, that's true. 1099 00:52:31,600 --> 00:52:34,480 So, it runs this code and prints out smaller. 1100 00:52:34,480 --> 00:52:37,440 And then, it comes back here, deindents. 1101 00:52:37,440 --> 00:52:38,880 The next basic sequential, 1102 00:52:38,880 --> 00:52:40,920 this ends up being kind of a block. 1103 00:52:40,920 --> 00:52:43,080 If x is greater than 20, 1104 00:52:43,080 --> 00:52:46,080 if x is greater than 20, oh, come back, come back. 1105 00:52:47,640 --> 00:52:50,840 If x is greater than 20, this turns out to be false, 1106 00:52:50,840 --> 00:52:53,120 because x is five, and so it skips this. 1107 00:52:53,120 --> 00:52:54,760 So, the bigger never comes out, 1108 00:52:54,760 --> 00:52:57,120 and then it continues on and prints fini. 1109 00:52:57,120 --> 00:52:58,440 Oops, that's a typographical error. 1110 00:52:58,440 --> 00:53:01,120 Make that a lowercase print, and then prints fini. 
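The two if statements above can be tried with more than one value of x. The sketch below wraps them in a small function so the same questions can be asked several times; the function name check and the exact message strings are assumptions, and functions themselves are outside what this chapter has covered.

```python
def check(x):
    """Return the messages the if statements would print for a given x."""
    messages = []
    if x < 10:                      # question: is x less than 10?
        messages.append("Smaller")  # indented block: runs only when the answer is true
    if x > 20:                      # question: is x greater than 20?
        messages.append("Bigger")   # skipped whenever the answer is false
    messages.append("Finis")        # de-indented, so it runs either way
    return messages

for value in (5, 10, 25):
    print(value, check(value))
```

The value 10 is worth a look: 10 is not less than 10, so the first question is false for exactly 10 as well as for anything greater, and only the final message appears.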
1111 00:53:01,120 --> 00:53:06,040 So, it comes in, runs this, skips this, and then finishes. 1112 00:53:06,040 --> 00:53:10,840 Okay, so here is the last one we'll talk about, 1113 00:53:10,840 --> 00:53:11,700 the repeated steps. 1114 00:53:11,700 --> 00:53:15,460 We'll get back to store and retrieve later, 1115 00:53:15,460 --> 00:53:18,360 but for now, we're just gonna talk about three of the four. 1116 00:53:19,280 --> 00:53:22,040 This is another program, and the key is, 1117 00:53:22,040 --> 00:53:24,880 is that we're gonna use this same choice 1118 00:53:24,880 --> 00:53:25,820 where we're gonna go in, 1119 00:53:25,820 --> 00:53:28,400 but then we're gonna run for a while, 1120 00:53:28,400 --> 00:53:31,080 and then we'll have an exit condition where we get out. 1121 00:53:31,080 --> 00:53:34,760 So, this is repeated over and over and over and over again, 1122 00:53:34,760 --> 00:53:38,320 and this is the essence of how we make computers 1123 00:53:38,320 --> 00:53:40,120 do things that are seemingly difficult, 1124 00:53:40,120 --> 00:53:43,280 while they're more naturally difficult for people, okay? 1125 00:53:43,280 --> 00:53:45,720 And so, how do we encode this notion 1126 00:53:45,720 --> 00:53:49,040 that we wanna do something not forever, but for a while? 1127 00:53:49,040 --> 00:53:51,240 How do we encode that notion? 1128 00:53:51,240 --> 00:53:53,400 And so, we do it in this way. 1129 00:53:53,400 --> 00:53:55,000 So, we have our statement, 1130 00:53:55,000 --> 00:53:57,940 sequentially go to this while, while is a key word, 1131 00:53:57,940 --> 00:53:59,480 and it's asking another question 1132 00:53:59,480 --> 00:54:00,720 that's a true false question. 1133 00:54:00,720 --> 00:54:02,580 Is n greater than zero? 
1134 00:54:02,580 --> 00:54:06,080 I read this as, as long as n remains greater than zero, 1135 00:54:06,080 --> 00:54:07,800 keep doing this indented block, 1136 00:54:07,800 --> 00:54:10,120 and you have a colon at the end, 1137 00:54:10,120 --> 00:54:12,760 and then you have two lines of code that's indented, 1138 00:54:12,760 --> 00:54:14,660 so that tells us what the loop is, 1139 00:54:14,660 --> 00:54:16,800 and then this is now deindented. 1140 00:54:16,800 --> 00:54:20,320 And so, it comes in, and if this is true, 1141 00:54:20,320 --> 00:54:24,180 if this is true, if this is true, it runs these two lines. 1142 00:54:24,180 --> 00:54:26,400 Prints out n, n is five, and then it says, 1143 00:54:26,400 --> 00:54:29,560 n equals n minus one, which makes n be four, 1144 00:54:29,560 --> 00:54:31,400 and it goes back up, and it goes up, 1145 00:54:31,400 --> 00:54:33,120 and it asks this question again. 1146 00:54:33,120 --> 00:54:34,940 Is n greater than zero? 1147 00:54:34,940 --> 00:54:37,740 If it is, continue on, and prints four, 1148 00:54:37,740 --> 00:54:39,640 and then subtracts it, and it does that, 1149 00:54:39,640 --> 00:54:43,440 four, three, two, and prints out one, 1150 00:54:43,440 --> 00:54:45,900 then it comes up, and now, after this, 1151 00:54:45,900 --> 00:54:49,140 n is now zero, n is now zero, 1152 00:54:49,140 --> 00:54:51,320 and n is no longer greater than zero, 1153 00:54:51,320 --> 00:54:54,040 so it takes sort of the exit ramp, and goes down here. 1154 00:54:54,040 --> 00:54:56,400 So, it takes the exit ramp, and goes to here, 1155 00:54:56,400 --> 00:54:58,200 and runs the next line. 1156 00:54:58,200 --> 00:55:01,440 Now, we're gonna cover all this again. 
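The countdown loop described above can be sketched as follows. The printed list and the text of the final message are assumptions added so the result can be inspected afterwards; the lecture only says that the line after the loop runs next.

```python
n = 5
printed = []            # record of what the loop prints, kept only for inspection
while n > 0:            # asked before every pass: is n still greater than zero?
    print(n)
    printed.append(n)
    n = n - 1           # without this line n never changes and the loop never ends
print("Done")           # the exit ramp: the first de-indented line after the loop
```

The line that changes n is what makes this run for a while instead of forever: each pass moves n one step closer to making the question false.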
1157 00:55:01,440 --> 00:55:03,960 So, I'm just trying to give you the big picture, 1158 00:55:03,960 --> 00:55:05,460 next couple of chapters, we're gonna hit 1159 00:55:05,460 --> 00:55:07,260 all these things again, and we're gonna hit them 1160 00:55:07,260 --> 00:55:11,220 in much more detail, with a lot better information. 1161 00:55:11,220 --> 00:55:13,480 This is now sort of like combining these, 1162 00:55:13,480 --> 00:55:18,480 and again, I don't want you to really know this stuff, 1163 00:55:18,900 --> 00:55:21,320 just, you will know this in a couple of weeks, 1164 00:55:21,320 --> 00:55:23,640 you will see this program again, 1165 00:55:23,640 --> 00:55:26,600 but this shows you how we combine those patterns 1166 00:55:26,600 --> 00:55:30,420 of repeated, sequential, and conditional together. 1167 00:55:31,560 --> 00:55:33,520 So, this is a bit of sequential code, 1168 00:55:33,520 --> 00:55:36,080 comes in here, runs this, which happens to ask 1169 00:55:36,080 --> 00:55:38,280 for a file name, then it opens the file, 1170 00:55:38,280 --> 00:55:40,380 it creates a data structure called a dictionary, 1171 00:55:40,380 --> 00:55:42,240 this is all sequential, now the for 1172 00:55:42,240 --> 00:55:45,720 is another form of loop, so this is gonna loop for a while, 1173 00:55:45,720 --> 00:55:47,120 and then this is, within a loop, 1174 00:55:47,120 --> 00:55:49,840 we can even have two indents, and that's another loop, 1175 00:55:49,840 --> 00:55:52,840 so these are like repeated, and then it goes, 1176 00:55:52,840 --> 00:55:54,880 it goes down to the next sequential bit, 1177 00:55:54,880 --> 00:55:57,040 then it does this, here's another loop, 1178 00:55:57,040 --> 00:55:58,680 it's gonna run, and then here's a conditional, 1179 00:55:58,680 --> 00:56:00,800 it's gonna run, and then once all done, 1180 00:56:00,800 --> 00:56:03,000 we print out the last thing, and this is, of course, 1181 00:56:03,000 --> 00:56:08,000 is the program that does 
the, it figures out 1182 00:56:09,800 --> 00:56:12,300 the most common word and prints that most common word out, 1183 00:56:12,300 --> 00:56:16,720 and so this is a Python short story, it reads some data, 1184 00:56:16,720 --> 00:56:18,960 it reads the name of a file, it opens that file, 1185 00:56:18,960 --> 00:56:21,560 it talks about how to make a histogram, 1186 00:56:21,560 --> 00:56:25,160 and then it looks through for the most common word, 1187 00:56:25,160 --> 00:56:26,960 so don't worry too much about this, 1188 00:56:27,880 --> 00:56:30,100 over the next couple weeks, we'll fill in the pieces 1189 00:56:30,100 --> 00:56:31,880 so that you absolutely understand 1190 00:56:31,880 --> 00:56:34,160 every single line of this code. 1191 00:56:35,200 --> 00:56:40,160 So, this is a quick overview, chapter one, stick with us, 1192 00:56:40,160 --> 00:56:42,640 you realize it will be chapter seven 1193 00:56:42,640 --> 00:56:44,520 before this makes too much sense, 1194 00:56:44,520 --> 00:56:47,640 you really have to trust that you are learning 1195 00:56:47,640 --> 00:56:50,740 important things, and that it all makes sense 1196 00:56:50,740 --> 00:56:52,680 when we bring it together like in chapter seven 1197 00:56:52,680 --> 00:56:53,600 in a few weeks. 1198 00:56:57,840 --> 00:57:00,600 Hello, and welcome to chapter two. 1199 00:57:00,600 --> 00:57:02,100 Now we're gonna continue to talk about 1200 00:57:02,100 --> 00:57:04,480 the building blocks of Python, variables, 1201 00:57:04,480 --> 00:57:07,600 constants, statements, expressions, et cetera. 
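Looking back at the most-common-word program from the end of chapter one, here is a minimal sketch of the same idea. It is not the course's exact program: the function name most_common_word and the sample lines are assumptions, and it takes lines from a list instead of prompting for a file name so that it runs on its own.

```python
def most_common_word(lines):
    """Return (word, count) for the most frequent word, or (None, None) if there are no words."""
    counts = dict()                      # histogram: word -> number of occurrences
    for line in lines:                   # outer loop: one pass per line
        for word in line.split():        # inner loop: one pass per word in that line
            counts[word] = counts.get(word, 0) + 1

    bigword, bigcount = None, None
    for word, count in counts.items():   # second loop: scan for the largest count
        if bigcount is None or count > bigcount:   # conditional inside the loop
            bigword, bigcount = word, count
    return bigword, bigcount

sample = ["the quick brown fox", "jumps over the lazy dog", "the end"]
print(most_common_word(sample))
```

All three patterns show up: sequential setup, repeated steps in the loops, and a conditional inside the last loop. An open file can be passed in place of the list, since looping over a file also yields one line at a time.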
1202 00:57:07,600 --> 00:57:09,960 The first thing we have to talk about is constants, 1203 00:57:09,960 --> 00:57:11,800 these are just things we call constants 1204 00:57:11,800 --> 00:57:14,200 because they don't change, they're numbers, 1205 00:57:14,200 --> 00:57:16,520 strings, et cetera, and we use them 1206 00:57:16,520 --> 00:57:19,560 to sort of start calculations, or, you know, 1207 00:57:19,560 --> 00:57:23,360 if something is greater than 40 hours, 1208 00:57:23,360 --> 00:57:24,880 we're gonna do something, and so 40 1209 00:57:24,880 --> 00:57:26,560 is the constant in that situation. 1210 00:57:26,560 --> 00:57:31,360 So, we have 123, we have 98.6, we have hello world, 1211 00:57:31,360 --> 00:57:33,800 which is a string by enclosing it in quotes, 1212 00:57:33,800 --> 00:57:36,360 we pass each of these things to the print function, 1213 00:57:36,360 --> 00:57:38,120 and a side effect of the print function 1214 00:57:38,120 --> 00:57:39,680 is that we see the output. 1215 00:57:39,680 --> 00:57:44,040 So, print 123, prints out 123, print 98.6, prints it out. 1216 00:57:44,040 --> 00:57:47,280 So, these are just really the syntax of constants, 1217 00:57:47,280 --> 00:57:50,360 and without constants, we can't write really much 1218 00:57:50,360 --> 00:57:51,840 of anything. 1219 00:57:51,840 --> 00:57:54,040 The other sort of foundational notion 1220 00:57:54,040 --> 00:57:56,440 of any programming language are the reserved words, 1221 00:57:56,440 --> 00:57:58,680 and like I said before, reserved words are these 1222 00:57:58,680 --> 00:58:02,240 special words where Python is listening for them, 1223 00:58:02,240 --> 00:58:04,280 and they have very special meanings, 1224 00:58:04,280 --> 00:58:07,720 so when Python sees if, it's not just any other word, 1225 00:58:07,720 --> 00:58:11,400 it means how Python implements conditional execution. 
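The three constants mentioned above can be passed to print as shown in this short sketch, assuming Python 3. The last two lines go slightly beyond the lecture at this point: they ask Python what kind of value each constant is.

```python
print(123)             # integer constant
print(98.6)            # floating-point constant
print("Hello world")   # string constant: the quotes mark it as text

# Each constant has a type that Python works out from how it is written.
kinds = [type(123).__name__, type(98.6).__name__, type("Hello world").__name__]
print(kinds)
```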
1226 00:58:13,040 --> 00:58:15,980 Variables are the third building block, 1227 00:58:15,980 --> 00:58:19,320 and that is a way that you can ask Python 1228 00:58:19,320 --> 00:58:22,840 to allocate a piece of memory and then give it a name, 1229 00:58:22,840 --> 00:58:24,400 and you can put stuff in that. 1230 00:58:24,400 --> 00:58:26,840 Sometimes you just put one value, later we'll see, 1231 00:58:26,840 --> 00:58:29,700 when we do collections in chapters eight and nine, 1232 00:58:29,700 --> 00:58:31,340 we will see that more than one value 1233 00:58:31,340 --> 00:58:34,680 can be put into a variable, and the variable, 1234 00:58:34,680 --> 00:58:36,280 how we control the variable is through 1235 00:58:36,280 --> 00:58:38,800 the assignment statement, and as I said before, 1236 00:58:38,800 --> 00:58:41,260 it's important to think of the assignment statement 1237 00:58:41,260 --> 00:58:43,760 as having an arrow to it, so this is not saying 1238 00:58:43,760 --> 00:58:46,480 X for all time is the same as 12.2, 1239 00:58:46,480 --> 00:58:49,500 what it's saying is take 12.2, find a place, 1240 00:58:49,500 --> 00:58:52,600 find some memory in your computer there, Mr. Python, 1241 00:58:52,600 --> 00:58:55,020 give it a label X, we get to choose the X, 1242 00:58:55,020 --> 00:58:57,560 that's the variable part, we chose it, right? 1243 00:58:58,520 --> 00:59:01,600 And then stick 12.2 in it, and then the same is true for 14. 1244 00:59:01,600 --> 00:59:04,440 Go find another spot, name it Y, 1245 00:59:04,440 --> 00:59:08,640 and then put a 14 in there, so think of this as an arrow 1246 00:59:08,640 --> 00:59:10,920 every time you see that equality, 1247 00:59:10,920 --> 00:59:13,960 the assignment in an assignment statement. 
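The assignment statements described above look like this in code. The final assignment of 100 is included to show what happens when the same name is assigned a second time.

```python
x = 12.2      # find a spot in memory, label it x, and store 12.2 there
y = 14        # find another spot, label it y, and store 14 there
print(x, y)

x = 100       # assigning again replaces the old value; the 12.2 is gone
print(x, y)   # y is untouched, because only x was assigned
```

Reading each line right to left, as "take this value and put it under that name", matches what Python does and avoids reading the equal sign as a mathematical claim.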
1248 00:59:15,960 --> 00:59:18,520 Now, these variables hold one value, 1249 00:59:18,520 --> 00:59:22,720 so now if we have these three statements, these two, 1250 00:59:22,720 --> 00:59:25,560 and then the third one executes, it says put 100 into X, 1251 00:59:25,560 --> 00:59:28,800 but that wipes out the old value of 12.2, 1252 00:59:28,800 --> 00:59:31,560 and it rewrites it with 100, and so we can 1253 00:59:31,560 --> 00:59:33,320 change the variables, that's another reason 1254 00:59:33,320 --> 00:59:35,540 that we call them variable. 1255 00:59:37,340 --> 00:59:40,680 There are some names, some rules for making variable names, 1256 00:59:40,680 --> 00:59:43,040 you can start with a letter or an underscore. 1257 00:59:43,040 --> 00:59:46,000 We tend not to, as normal programmers use underscore, 1258 00:59:46,000 --> 00:59:49,240 we tend to reserve those for variables 1259 00:59:49,240 --> 00:59:51,620 that we use to communicate with Python itself, 1260 00:59:51,620 --> 00:59:52,860 so when we're making up a variable, 1261 00:59:52,860 --> 00:59:57,360 we tend not to use underscores as a first character. 1262 00:59:57,360 --> 01:00:00,080 You can have letters and numbers and underscores 1263 01:00:00,080 --> 01:00:02,600 after the first character, and they're case sensitive, 1264 01:00:02,600 --> 01:00:06,400 but it's really a bad idea to use case 1265 01:00:06,400 --> 01:00:07,980 as the only differentiator. 1266 01:00:07,980 --> 01:00:12,200 So, in this case, spam, eggs, spam 23, 1267 01:00:12,200 --> 01:00:14,160 and underscore speed are all totally legit, 1268 01:00:14,160 --> 01:00:15,720 we would probably not use this one 1269 01:00:15,720 --> 01:00:17,320 unless we were actually doing it 1270 01:00:17,320 --> 01:00:20,000 because Python told us to use that variable. 
1271 01:00:20,000 --> 01:00:21,520 23 spam starts with a number, 1272 01:00:21,520 --> 01:00:23,960 pound sign starts and dot is not 1273 01:00:23,960 --> 01:00:25,760 a legitimate variable character. 1274 01:00:26,640 --> 01:00:30,400 And spam, capital spam and all caps spam are different, 1275 01:00:30,400 --> 01:00:32,520 but this is not something that you want 1276 01:00:32,520 --> 01:00:36,400 to sort of depend on too much, so. 1277 01:00:36,400 --> 01:00:38,000 That's just the rule names. 1278 01:00:38,000 --> 01:00:39,760 We tend to start them with a letter 1279 01:00:39,760 --> 01:00:41,480 and then use letters, numbers, and underscores. 1280 01:00:41,480 --> 01:00:43,240 Underscores other than the first character 1281 01:00:43,240 --> 01:00:45,400 are generally pretty common, 1282 01:00:45,400 --> 01:00:47,940 and you'll see those used a lot. 1283 01:00:47,940 --> 01:00:49,800 Now, when we're choosing variable names, 1284 01:00:49,800 --> 01:00:50,800 one of the things about variables 1285 01:00:50,800 --> 01:00:51,920 is we get to choose the name. 1286 01:00:51,920 --> 01:00:54,960 We get to choose the name X, choose the name Y, 1287 01:00:54,960 --> 01:00:57,000 and so sometimes we like them short, 1288 01:00:57,000 --> 01:00:58,640 but sometimes we want them descriptive, 1289 01:00:58,640 --> 01:01:02,520 and the notion that of making variables descriptive 1290 01:01:02,520 --> 01:01:04,520 is often confusing to beginning students. 1291 01:01:04,520 --> 01:01:07,280 Sometimes it's really helpful to, 1292 01:01:07,280 --> 01:01:08,840 if you're gonna have a line of text 1293 01:01:08,840 --> 01:01:11,600 and you name the variable line, that's great 1294 01:01:11,600 --> 01:01:13,440 because the next person reading your program 1295 01:01:13,440 --> 01:01:15,680 says, oh, that must be the line of text. 
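The naming rules above can be checked mechanically. This sketch uses the built-in string method isidentifier together with the standard keyword module; the helper name is_usable_name is an assumption, and the spellings #sign and var.12 are guesses at the names the lecture describes as containing a pound sign and a dot.

```python
import keyword

def is_usable_name(name):
    """True if name follows the identifier rules and is not a reserved word."""
    return name.isidentifier() and not keyword.iskeyword(name)

candidates = ["spam", "eggs", "spam23", "_speed", "23spam", "#sign", "var.12", "if"]
results = {name: is_usable_name(name) for name in candidates}
for name, ok in results.items():
    print(name, ok)

# Case matters: these are three separate variables.
spam, Spam, SPAM = 1, 2, 3
print(spam, Spam, SPAM)
```

The word if passes the letter-and-digit rules but is still rejected because it is reserved, which is why the helper checks both conditions.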
1296 01:01:15,680 --> 01:01:18,300 Whereas it also can become misleading 1297 01:01:18,300 --> 01:01:21,920 that line, the name of a variable somehow has meaning, 1298 01:01:21,920 --> 01:01:24,200 and so sometimes we'll have even singular variables 1299 01:01:24,200 --> 01:01:26,880 and plural variables like friend and friends. 1300 01:01:26,880 --> 01:01:28,240 You know, like, is plural, 1301 01:01:28,240 --> 01:01:30,840 does Python know about singular and plural? 1302 01:01:30,840 --> 01:01:32,120 And the answer is no. 1303 01:01:32,120 --> 01:01:35,240 So sometimes we pick variables that make no sense. 1304 01:01:35,240 --> 01:01:37,760 Sometimes we pick variables that make a lot of sense. 1305 01:01:37,760 --> 01:01:40,280 This is just something that you as a beginning programmer 1306 01:01:40,280 --> 01:01:42,140 are going to have to understand 1307 01:01:42,140 --> 01:01:44,480 that we can pick anything we want, 1308 01:01:45,380 --> 01:01:48,060 and so you'll see, I'll try to call attention to this 1309 01:01:48,060 --> 01:01:50,260 in the first few lectures as we go through. 1310 01:01:50,260 --> 01:01:53,580 So here's a bit of code with an assignment statement, 1311 01:01:53,580 --> 01:01:54,520 two assignment statements, 1312 01:01:54,520 --> 01:01:57,280 a multiplication, and a print statement, 1313 01:01:57,280 --> 01:01:59,120 and you can say, what is this doing? 1314 01:01:59,120 --> 01:02:02,920 Now, Python is perfectly happy with this code 1315 01:02:02,920 --> 01:02:04,000 because it assigns it in there. 
1316 01:02:04,000 --> 01:02:06,760 You have said, please go give me this as a label, 1317 01:02:06,760 --> 01:02:08,200 and then we assign two variables, 1318 01:02:08,200 --> 01:02:10,840 and then we're carefully pulling these two variables 1319 01:02:10,840 --> 01:02:12,720 back out, multiplying them together 1320 01:02:12,720 --> 01:02:14,760 and sticking them into yet another variable, 1321 01:02:14,760 --> 01:02:16,000 and then printing that variable out. 1322 01:02:16,000 --> 01:02:18,680 That seems like, you know, we can figure out what it is. 1323 01:02:18,680 --> 01:02:20,280 You just have to look really careful, 1324 01:02:20,280 --> 01:02:22,120 and a single character mistake, 1325 01:02:22,120 --> 01:02:27,000 and Python is gonna be, you know, pretty unhappy, okay? 1326 01:02:27,000 --> 01:02:30,360 So that's one way to write this program. 1327 01:02:30,360 --> 01:02:32,700 It's hard, though, because any of those characters 1328 01:02:32,700 --> 01:02:35,520 are long variables and they're random stuff. 1329 01:02:35,520 --> 01:02:37,280 It's not very friendly to anyone 1330 01:02:37,280 --> 01:02:39,380 who might read your program. 1331 01:02:39,380 --> 01:02:40,700 Now, this looks a little friendlier. 1332 01:02:40,700 --> 01:02:41,840 It's the same program 1333 01:02:41,840 --> 01:02:44,440 because Python just wants a correspondence. 1334 01:02:44,440 --> 01:02:47,080 You pick A, you pick B, and you pick C, 1335 01:02:47,080 --> 01:02:50,760 and it's really much easier for us to see what's going on, 1336 01:02:50,760 --> 01:02:55,760 and so this is, in a way, going from here to here 1337 01:02:55,760 --> 01:02:59,400 is much friendlier, but we can be even friendlier 1338 01:02:59,400 --> 01:03:01,000 if we pick mnemonic variable names. 1339 01:03:01,000 --> 01:03:02,760 So this is not mnemonic. 1340 01:03:02,760 --> 01:03:04,600 This is short and convenient. 1341 01:03:04,600 --> 01:03:06,480 This is long and inconvenient. 
1342 01:03:06,480 --> 01:03:08,600 Python is happy with any of these. 1343 01:03:09,520 --> 01:03:10,360 Here, on the other hand, 1344 01:03:10,360 --> 01:03:12,920 is another version of the exact same program, 1345 01:03:12,920 --> 01:03:16,120 and now you think to yourself, oh, yeah, now I get it. 1346 01:03:16,120 --> 01:03:17,920 35 is the number of hours. 1347 01:03:17,920 --> 01:03:20,040 12 dollars and 50 cents is the rate, 1348 01:03:20,040 --> 01:03:22,280 and then we're gonna multiply the hours and the rate 1349 01:03:22,280 --> 01:03:24,660 and come up with a pay, and we're putting out the pay. 1350 01:03:24,660 --> 01:03:27,440 Now, whoever wrote this program is much, 1351 01:03:27,440 --> 01:03:30,240 is helping us greatly understand what's going on, 1352 01:03:30,240 --> 01:03:31,600 and that's good. 1353 01:03:31,600 --> 01:03:33,200 Choosing variable names. 1354 01:03:33,200 --> 01:03:36,640 Python, again, all three of these are the same to Python. 1355 01:03:36,640 --> 01:03:39,260 Choosing variable names in a way that help your reader 1356 01:03:39,260 --> 01:03:42,560 understand what's going on is a great thing. 1357 01:03:42,560 --> 01:03:45,040 The problem is, the danger is, 1358 01:03:46,420 --> 01:03:48,640 if you read this and you think that somehow 1359 01:03:48,640 --> 01:03:50,440 Python understands payroll, 1360 01:03:50,440 --> 01:03:52,020 that if you name a variable hours 1361 01:03:52,020 --> 01:03:54,160 that Python knows what hours means, 1362 01:03:54,160 --> 01:03:57,200 the answer is, Python really doesn't care 1363 01:03:57,200 --> 01:03:58,660 what you name the variable as long as 1364 01:03:58,660 --> 01:04:01,400 what you name it, you use it, right? 
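The three versions of the pay program described here might look something like the sketch below. The gibberish names in the first version are invented for illustration; only the values 35 and 12.50 and the names hours, rate, and pay come from the talk.

```python
# Version 1: long, random-looking names (made up here). Legal, but hard to read.
x1q3z9ocd = 35
x1q3z9afd = 12.50
x1q3p9afd = x1q3z9ocd * x1q3z9afd
print(x1q3p9afd)

# Version 2: short names. Easier to follow, but they carry no meaning.
a = 35
b = 12.50
c = a * b
print(c)

# Version 3: mnemonic names. Helpful to a human reader, identical to Python.
hours = 35
rate = 12.50
pay = hours * rate
print(pay)
```

All three print the same number, because Python only cares that each name is used consistently.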
1365 01:04:01,400 --> 01:04:02,680 And so, you gotta be careful, 1366 01:04:02,680 --> 01:04:04,920 and so you'll see, I will, 1367 01:04:04,920 --> 01:04:09,120 when I write my code in these first few weeks, 1368 01:04:09,120 --> 01:04:11,920 first few lectures, I will sometimes write it 1369 01:04:11,920 --> 01:04:13,800 with gibberish, I'll sometimes write it 1370 01:04:13,800 --> 01:04:16,000 with extremely short but meaningless variable names, 1371 01:04:16,000 --> 01:04:18,600 and sometimes I'll use meaningful variable names, 1372 01:04:18,600 --> 01:04:20,440 and I'll call your attention to it, 1373 01:04:20,440 --> 01:04:22,120 and it will get you. 1374 01:04:22,120 --> 01:04:24,440 You'll start, when you look at this third kind, 1375 01:04:24,440 --> 01:04:28,200 it has meaningful variables or mnemonic variable names, 1376 01:04:28,200 --> 01:04:31,160 you'll just instinctively want to give Python 1377 01:04:31,160 --> 01:04:34,120 more intelligence than it sort of deserves, 1378 01:04:34,120 --> 01:04:36,680 I guess that's probably the best way to say that. 1379 01:04:36,680 --> 01:04:38,620 So, we've talked about constants, 1380 01:04:38,620 --> 01:04:40,040 we've talked about reserved words, 1381 01:04:40,040 --> 01:04:41,480 we've talked about variables. 1382 01:04:43,440 --> 01:04:45,560 And so, here we have a sentence, 1383 01:04:45,560 --> 01:04:47,280 like we've already done some of these things, 1384 01:04:47,280 --> 01:04:49,560 where we set x equals two, 1385 01:04:49,560 --> 01:04:52,560 we retrieve the old value of x and add two to it, 1386 01:04:52,560 --> 01:04:55,200 so that becomes four, and then we print four out, 1387 01:04:55,200 --> 01:04:57,520 print is a function that's built in, 1388 01:04:57,520 --> 01:04:59,520 and we pass in whatever we want to print out. 1389 01:04:59,520 --> 01:05:03,480 So, this parentheses is part of a function call. 
1390 01:05:05,080 --> 01:05:07,320 Okay, so, an assignment statement, 1391 01:05:07,320 --> 01:05:10,720 you have to really get your head around the notion 1392 01:05:10,720 --> 01:05:14,000 that it has this arrow nature, 1393 01:05:14,000 --> 01:05:17,280 and that it evaluates this entire right-hand side 1394 01:05:17,280 --> 01:05:20,640 before we change the left-hand side. 1395 01:05:20,640 --> 01:05:22,440 And so, you can think of this sort of as, 1396 01:05:22,440 --> 01:05:24,320 at time step one, it does this, 1397 01:05:24,320 --> 01:05:26,480 and then at time step two, it does the copy. 1398 01:05:26,480 --> 01:05:29,120 And that's how you can have something like x 1399 01:05:29,120 --> 01:05:33,040 on both sides of an assignment statement. 1400 01:05:33,040 --> 01:05:34,980 And so, if, for example, we have x, 1401 01:05:34,980 --> 01:05:39,960 and x has 0.6 in it, x has 0.6 in it, 1402 01:05:39,960 --> 01:05:42,200 what happens is that it first, 1403 01:05:42,200 --> 01:05:44,400 it sort of ignores this part right here, 1404 01:05:44,400 --> 01:05:45,920 and evaluates the expression. 1405 01:05:45,920 --> 01:05:48,960 So, it pulls the 0.6, everywhere x appears, 1406 01:05:48,960 --> 01:05:52,680 it pulls 0.6 out, then it starts running these calculations, 1407 01:05:52,680 --> 01:05:54,720 and then it has the new value. 1408 01:05:54,720 --> 01:05:56,640 After all the calculations are done, 1409 01:05:56,640 --> 01:06:01,640 then and only then is it going to put that back into x. 1410 01:06:02,000 --> 01:06:05,320 And so, it sort of takes that and puts it back into x, 1411 01:06:05,320 --> 01:06:07,840 and then wipes out the old value. 1412 01:06:07,840 --> 01:06:09,920 At this point, this has all been taken care of, 1413 01:06:09,920 --> 01:06:13,200 and it's been reduced down to this 0.93, 1414 01:06:13,200 --> 01:06:16,860 and so that is what's put in as the new value. 
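The "evaluate the right-hand side first, then store" idea can be tried out as in the sketch below. The first part follows the x = 2 example from the talk. The expression in the second part is an assumption chosen so that 0.6 reduces to roughly the 0.93 mentioned; the transcript does not spell out the actual formula.

```python
x = 2
x = x + 2            # right side is evaluated with the old x (2 + 2), then stored in x
print(x)             # 4

y = 0.6
# Assumed expression: every y on the right uses the old value 0.6.
# Only after the whole calculation is done is the result stored back into y.
y = 3.9 * y * (1 - y)
print(round(y, 3))
```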
1415 01:06:18,360 --> 01:06:20,520 So, up next, we'll talk a little bit more 1416 01:06:20,520 --> 01:06:23,420 about making more complex expressions. 1417 01:06:27,400 --> 01:06:28,240 So, welcome back. 1418 01:06:28,240 --> 01:06:30,040 We're now going to talk about expressions. 1419 01:06:30,040 --> 01:06:33,120 Expressions are slightly more complex calculations 1420 01:06:33,120 --> 01:06:33,960 that we can do 1421 01:06:33,960 --> 01:06:37,800 on the right-hand side of an assignment statement. 1422 01:06:37,800 --> 01:06:41,480 So, one of the things about expressions is operators. 1423 01:06:41,480 --> 01:06:43,880 And operators in computer programming 1424 01:06:43,880 --> 01:06:46,640 are often very much the same as the mathematical operators, 1425 01:06:46,640 --> 01:06:49,420 but we don't have all the fancy characters 1426 01:06:49,420 --> 01:06:51,480 that we have in mathematics, 1427 01:06:51,480 --> 01:06:54,800 and so we have to choose what's on the keyboard, 1428 01:06:54,800 --> 01:06:58,240 and if we really go back to the 1960s and 1970s, 1429 01:06:58,240 --> 01:06:59,880 we used what was on the keyboard 1430 01:06:59,880 --> 01:07:03,200 in the 1960s and the 1970s to make these operators. 1431 01:07:03,200 --> 01:07:06,480 So, plus is addition, minus is subtraction, 1432 01:07:06,480 --> 01:07:09,160 we don't have a times sign or a dot in the middle, 1433 01:07:09,160 --> 01:07:12,120 so we use the asterisk as multiplication. 1434 01:07:12,120 --> 01:07:14,360 Division, we can't put two things over top of each other, 1435 01:07:14,360 --> 01:07:16,360 so we use slash for division. 1436 01:07:16,360 --> 01:07:17,400 Raising to the power, 1437 01:07:17,400 --> 01:07:19,060 because we didn't have little raised characters back then, 1438 01:07:19,060 --> 01:07:22,040 is star, star, which is raising to the power. 1439 01:07:22,040 --> 01:07:22,960 And then remainder.
1440 01:07:22,960 --> 01:07:26,680 Remainder is the, when you do integer division, 1441 01:07:26,680 --> 01:07:28,480 it's also called the modulo operator, 1442 01:07:28,480 --> 01:07:30,240 it's the remainder, not the quotient. 1443 01:07:30,240 --> 01:07:32,720 Now, I've got a picture of that coming up. 1444 01:07:32,720 --> 01:07:35,800 So, here's a whole series of little examples of this, right? 1445 01:07:35,800 --> 01:07:39,280 So, we've already seen, you know, the plus, x equals x plus one. 1446 01:07:39,280 --> 01:07:41,720 Keep remembering that these assignments are arrows, 1447 01:07:41,720 --> 01:07:44,480 basically, arrow, arrow, they have a direction. 1448 01:07:44,480 --> 01:07:47,200 Multiplication, 440 times 12. 1449 01:07:48,160 --> 01:07:53,160 Dividing this by, that's division over 1,000, 5.28. 1450 01:07:54,640 --> 01:07:56,400 Here, we're gonna put 23 into JJ, 1451 01:07:56,400 --> 01:07:57,340 and then we're gonna do modulo. 1452 01:07:57,340 --> 01:08:00,000 So, that says, take 23, divide it by five, 1453 01:08:00,000 --> 01:08:02,160 and give me back the remainder and put it in KK. 1454 01:08:02,160 --> 01:08:05,080 So, this is the expression that evaluates like this. 1455 01:08:05,080 --> 01:08:09,920 Take 23, divide five into 23, four, remainder, three. 1456 01:08:09,920 --> 01:08:12,600 The three is what comes back up here. 1457 01:08:12,600 --> 01:08:14,880 Okay, and so that is the remainder. 1458 01:08:14,880 --> 01:08:16,620 It's also called modulo operator. 1459 01:08:16,620 --> 01:08:19,880 It turns out that, for things like picking a random number 1460 01:08:19,880 --> 01:08:22,120 and then taking the modulo of 52 1461 01:08:22,120 --> 01:08:23,979 is a way to pick a card randomly. 1462 01:08:23,979 --> 01:08:26,239 So, this modulo operator is actually, 1463 01:08:26,240 --> 01:08:29,439 especially in games and other things, super useful. 1464 01:08:29,439 --> 01:08:32,459 So, that's the various operators. 
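The operator examples walked through here can be typed in roughly as follows. The variable names xx, yy, zz, and cube, the starting value of xx, and the range passed to random.randrange are assumptions for illustration; jj, kk, and the numbers 440, 12, 1000, 23, and 5 come from the talk.

```python
import random

xx = 1
xx = xx + 1          # addition: the old value plus one is stored back into xx
yy = 440 * 12        # multiplication, gives 5280
zz = yy / 1000       # division, gives 5.28
jj = 23
kk = jj % 5          # remainder: 5 goes into 23 four times, remainder 3
cube = 2 ** 3        # raising to a power, gives 8
print(xx, yy, zz, kk, cube)

# A random number modulo 52 always lands in the range 0 through 51,
# which is one way to pick a card position in a 52-card deck.
card = random.randrange(1000000) % 52
print(card)
```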
1465 01:08:32,460 --> 01:08:37,460 It's important to know which of these operators goes first. 1466 01:08:37,580 --> 01:08:39,540 It's called operator precedence. 1467 01:08:39,540 --> 01:08:42,020 Now, normally, we put parentheses in, 1468 01:08:42,020 --> 01:08:44,380 like, you know, so if I put the parentheses in here, 1469 01:08:44,380 --> 01:08:46,240 I'd say this goes first, 1470 01:08:46,240 --> 01:08:48,020 parentheses, then this goes first. 1471 01:08:48,020 --> 01:08:49,620 Oh, actually, not that one. 1472 01:08:49,620 --> 01:08:51,880 Oops, got that one wrong. 1473 01:08:51,880 --> 01:08:56,700 This happens first, this happens, then this happens. 1474 01:08:56,700 --> 01:09:00,340 Okay, but it's important for us to be able to know, 1475 01:09:00,340 --> 01:09:01,640 if there were no parentheses, 1476 01:09:01,640 --> 01:09:03,920 the order in which these things will happen. 1477 01:09:03,920 --> 01:09:07,520 So, the way things work in terms of operator precedence 1478 01:09:07,520 --> 01:09:10,220 is parentheses are the most important thing, 1479 01:09:10,220 --> 01:09:13,620 followed by raising to the power, all else being equal. 1480 01:09:13,620 --> 01:09:17,260 Multiplication, division, and remainder are all at the same level, 1481 01:09:17,260 --> 01:09:18,859 then addition and subtraction, and then within a level, 1482 01:09:18,859 --> 01:09:20,239 it's evaluated left to right. 1483 01:09:20,240 --> 01:09:23,260 So, let's see an example of how this works. 1484 01:09:23,260 --> 01:09:27,279 And so, if we take one plus two raised to the third power, 1485 01:09:27,279 --> 01:09:29,219 divided by four times five, 1486 01:09:29,220 --> 01:09:31,340 and we print out what comes out of this.
1487 01:09:31,340 --> 01:09:35,939 So, the way I did this when I was taking exams back 1488 01:09:35,939 --> 01:09:38,899 many, many years ago when I was first in computer science, 1489 01:09:38,899 --> 01:09:40,019 is I'd write it all down, 1490 01:09:40,020 --> 01:09:41,660 and I'd look for the highest precedence thing. 1491 01:09:41,660 --> 01:09:43,620 Now, parentheses would make this easy, 1492 01:09:43,620 --> 01:09:45,500 but exponentiation is the first one. 1493 01:09:45,500 --> 01:09:47,260 So, that means we're gonna take this, 1494 01:09:47,260 --> 01:09:50,100 and that's gonna be eight, two to the third power, 1495 01:09:50,100 --> 01:09:55,100 two times two times two, two cubed is eight. 1496 01:09:55,500 --> 01:09:57,100 Then what I would do is rewrite the whole thing 1497 01:09:57,100 --> 01:09:59,180 with the eight there, and now I look across, 1498 01:09:59,180 --> 01:10:00,940 and I'm looking for multiplications, 1499 01:10:00,940 --> 01:10:02,100 because the power's been done, 1500 01:10:02,100 --> 01:10:03,940 the multiplication's what I'm looking for next. 1501 01:10:03,940 --> 01:10:05,980 And then there are both multiplication and division, 1502 01:10:05,980 --> 01:10:08,140 they're equal, they're at the same level. 1503 01:10:08,140 --> 01:10:10,380 And so, what happens is they're done left to right. 1504 01:10:10,380 --> 01:10:14,500 Eight divided by four happens before four times five. 1505 01:10:14,500 --> 01:10:16,940 And so, the fact that it's not four times five that happens first, 1506 01:10:16,940 --> 01:10:18,420 but instead eight divided by four, 1507 01:10:18,420 --> 01:10:19,860 is because of the left to right rule. 1508 01:10:19,860 --> 01:10:22,420 So, then the eight divided by four gets rewritten to be two, 1509 01:10:22,420 --> 01:10:24,300 giving one plus two times five, 1510 01:10:24,300 --> 01:10:27,020 and in this one, multiplication is the top one.
1511 01:10:27,020 --> 01:10:29,500 So, that does this next, two times five becomes 10, 1512 01:10:29,500 --> 01:10:32,180 I rewrite it again, and then one plus 10 addition 1513 01:10:32,180 --> 01:10:36,260 is the lowest thing, and that's how we end up with 11. 1514 01:10:36,260 --> 01:10:38,700 And so, that's how I would do these problems 1515 01:10:38,700 --> 01:10:41,720 if I ever saw the problem on an exam. 1516 01:10:41,720 --> 01:10:43,900 And it's a fun problem to put on exams, 1517 01:10:43,900 --> 01:10:46,420 because there is one and only one answer, 1518 01:10:46,420 --> 01:10:48,940 and every programming class has usually 1519 01:10:48,940 --> 01:10:51,260 at least one slide about this stuff. 1520 01:10:51,260 --> 01:10:53,780 So, like I said, the rules go top to bottom, 1521 01:10:53,780 --> 01:10:56,980 parentheses, power, multiplication, addition, 1522 01:10:56,980 --> 01:10:59,980 and then left to right within it. 1523 01:10:59,980 --> 01:11:03,420 So, we've talked about variables and computing values 1524 01:11:03,420 --> 01:11:06,380 to put inside variables, but the one thing you've kind of 1525 01:11:06,380 --> 01:11:08,180 also, maybe you noticed it as we go by, 1526 01:11:08,180 --> 01:11:10,680 is we have different kinds of data. 1527 01:11:10,680 --> 01:11:12,060 We call it type. 1528 01:11:12,060 --> 01:11:13,300 Is this of type integer? 1529 01:11:13,300 --> 01:11:15,340 Is this of type floating point number? 1530 01:11:15,340 --> 01:11:16,740 Is it of type string? 1531 01:11:16,740 --> 01:11:18,380 What is going on here? 1532 01:11:18,380 --> 01:11:21,420 And Python is pretty smart about various kinds 1533 01:11:21,420 --> 01:11:23,340 of types of data. 
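The precedence walk-through above can be reproduced one step at a time, as in this sketch. The step1 through step4 names are only for illustration. Note that Python 3 prints 11.0 rather than 11, because the division in the middle produces a floating point value.

```python
x = 1 + 2 ** 3 / 4 * 5
print(x)                  # 11.0

# The same calculation, rewritten one step at a time as on paper.
step1 = 2 ** 3            # power first: 8
step2 = step1 / 4         # then leftmost of division and multiplication: 2.0
step3 = step2 * 5         # then the multiplication: 10.0
step4 = 1 + step3         # addition last: 11.0
print(step4)
```

When the order is not obvious, adding parentheses makes the intent clear to the next reader without changing the result.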
1534 01:11:23,340 --> 01:11:26,220 And so, you know, we're adding one plus four here, 1535 01:11:26,220 --> 01:11:28,180 and Python knows, as it looks at this, 1536 01:11:28,180 --> 01:11:30,100 that that's an integer and that's an integer, 1537 01:11:30,100 --> 01:11:32,020 and we'll add it together and make it an integer. 1538 01:11:32,020 --> 01:11:33,880 So, that thing is an integer. 1539 01:11:33,880 --> 01:11:37,180 We can also use this plus to concatenate two strings. 1540 01:11:37,180 --> 01:11:39,980 This is hello blank plus there, 1541 01:11:39,980 --> 01:11:42,020 and plus looks here, says, oh, that's a string, 1542 01:11:42,020 --> 01:11:43,000 and that's a string. 1543 01:11:43,000 --> 01:11:44,560 So, I know what to do with strings. 1544 01:11:44,560 --> 01:11:46,460 I will concatenate those two things together, 1545 01:11:46,460 --> 01:11:48,980 so it becomes another string that gets assigned 1546 01:11:48,980 --> 01:11:51,980 into EE, and it's hello space there. 1547 01:11:51,980 --> 01:11:53,380 The plus doesn't add the space. 1548 01:11:53,380 --> 01:11:56,140 I added the space by putting it right there. 1549 01:11:56,140 --> 01:11:58,100 And so, these operators are kind of smart 1550 01:11:58,100 --> 01:11:59,980 in that they kind of know what they're dealing with, 1551 01:11:59,980 --> 01:12:02,820 and sometimes they will do one thing or another 1552 01:12:02,820 --> 01:12:05,740 depending on the kinds of values, variables, 1553 01:12:05,740 --> 01:12:07,740 or constants that they're working with. 1554 01:12:09,380 --> 01:12:12,880 And so, sometimes type can get us in trouble. 1555 01:12:14,020 --> 01:12:16,740 So, here we have EE, which is hello there 1556 01:12:16,740 --> 01:12:18,900 because we've concatenated these two strings together, 1557 01:12:18,900 --> 01:12:20,140 and now we're adding one. 
1558 01:12:20,140 --> 01:12:22,300 And the problem now is that it looks on both sides 1559 01:12:22,300 --> 01:12:24,460 and says, that's a string, and that's a number, 1560 01:12:24,460 --> 01:12:26,340 and says, I don't know how to do that. 1561 01:12:26,340 --> 01:12:28,460 This is another one of those annoying errors 1562 01:12:28,460 --> 01:12:30,540 where you think that somehow 1563 01:12:30,540 --> 01:12:33,380 Python doesn't like you, but it just is confused. 1564 01:12:33,380 --> 01:12:35,460 If you look at these things, traceback, 1565 01:12:35,460 --> 01:12:37,500 traceback always means I quit. 1566 01:12:37,500 --> 01:12:40,100 It means I stopped, I ran, I'm quitting now 1567 01:12:40,100 --> 01:12:41,340 because I don't want to go any farther 1568 01:12:41,340 --> 01:12:43,040 because I've become confused. 1569 01:12:43,040 --> 01:12:46,100 So, your program stops running, and it says, 1570 01:12:46,100 --> 01:12:47,500 here's where I stopped running, 1571 01:12:47,500 --> 01:12:48,700 and because we're typing interactively, 1572 01:12:48,700 --> 01:12:50,220 it's always line one here. 1573 01:12:50,220 --> 01:12:52,300 But if you read carefully 1574 01:12:52,300 --> 01:12:54,440 and you don't get too stuck on too much stuff, 1575 01:12:54,440 --> 01:12:58,260 it says line one, in module, TypeError: 1576 01:12:58,260 --> 01:13:01,500 can't convert 'int' object to str implicitly. 1577 01:13:01,500 --> 01:13:03,940 So, that's an integer right there, and that's a string, 1578 01:13:03,940 --> 01:13:05,420 and that's what it's complaining about, 1579 01:13:05,420 --> 01:13:06,880 that little bit right there. 1580 01:13:06,880 --> 01:13:09,700 If Python is so grumpy about types, 1581 01:13:09,700 --> 01:13:11,620 then we should be able to ask it about type. 1582 01:13:11,620 --> 01:13:15,560 So, it turns out that there is, inside Python, 1583 01:13:15,560 --> 01:13:18,940 a built-in function called type, T-Y-P-E.
1584 01:13:18,940 --> 01:13:20,780 So, we can pass into type. 1585 01:13:20,780 --> 01:13:24,640 So, the syntax is calling a built-in function named type. 1586 01:13:24,640 --> 01:13:27,620 Parenthesis is the parameter that we're passing to it. 1587 01:13:27,620 --> 01:13:29,820 We're saying, hey, hello, tell me something 1588 01:13:29,820 --> 01:13:32,900 about the type of the variable E-E-E-E-E. 1589 01:13:32,900 --> 01:13:34,180 And so, this is a function, 1590 01:13:34,180 --> 01:13:36,380 the parentheses are part of the function call, 1591 01:13:36,380 --> 01:13:39,900 and it says, oh, that would be of class string. 1592 01:13:39,900 --> 01:13:41,920 And then we can pass in a constant and says, 1593 01:13:41,920 --> 01:13:43,540 hey, what about hello? 1594 01:13:43,540 --> 01:13:46,140 The string hello, it's like, oh, that's a string too. 1595 01:13:46,140 --> 01:13:47,100 What about a one? 1596 01:13:47,100 --> 01:13:47,980 Well, that's an integer. 1597 01:13:47,980 --> 01:13:52,340 And so, we are asking Python, through the type function, 1598 01:13:52,340 --> 01:13:56,460 what the type of either a variable or a constant is. 1599 01:13:56,460 --> 01:13:58,340 And there are even several types of numbers, 1600 01:13:58,340 --> 01:14:01,240 and we'll even see Booleans and others later, 1601 01:14:02,500 --> 01:14:05,420 like one with no decimal, that's an integer number. 1602 01:14:05,420 --> 01:14:08,860 98.6 with a decimal, that's a floating point number. 1603 01:14:08,860 --> 01:14:13,860 And so, constants can be both integer and floating point. 1604 01:14:14,700 --> 01:14:16,820 And I'm just asking over and over and over again, 1605 01:14:16,820 --> 01:14:19,100 what is the type of, what's in xxx? 1606 01:14:19,100 --> 01:14:20,820 What's the type of what's in temp? 1607 01:14:20,820 --> 01:14:24,380 And what's the type of the constant one? 1608 01:14:24,380 --> 01:14:26,780 And what's the type of 1.0? 
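The type questions asked in this part of the lecture can be tried as in the sketch below. The variable name eee is an assumed spelling of the variable discussed, and the caught flag exists only so the example can record that the error happened. The exact wording of the TypeError message differs between Python 3 versions, so it is printed rather than quoted.

```python
eee = 'hello ' + 'there'      # plus concatenates two strings
print(eee)
print(type(eee))              # a string
print(type('hello'))          # a string constant
print(type(1))                # an integer
print(type(98.6))             # a floating point number
print(type(1.0))              # also floating point, because of the decimal point

# Adding a number to a string is what triggers the traceback described earlier.
caught = False
try:
    eee = eee + 1
except TypeError as err:
    caught = True
    print('TypeError:', err)
```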
1609 01:14:28,060 --> 01:14:30,980 You can also use a set of built-in functions, 1610 01:14:30,980 --> 01:14:34,860 like float and int, to convert from one to another. 1611 01:14:34,860 --> 01:14:37,860 And so, this basically says, I wanna convert, 1612 01:14:37,860 --> 01:14:39,960 oops, let's go back. 1613 01:14:39,960 --> 01:14:43,780 I wanna convert 99 to a floating point number. 1614 01:14:43,780 --> 01:14:45,020 So, this is a function, 1615 01:14:45,020 --> 01:14:47,900 and it's participating in this plus, 1616 01:14:47,900 --> 01:14:49,500 but before it can finish the plus, 1617 01:14:49,500 --> 01:14:52,700 it turns this into a 99.0. 1618 01:14:52,700 --> 01:14:55,380 The difference between 99 as an integer and 99.0 1619 01:14:55,380 --> 01:14:56,860 is that it's a floating point number. 1620 01:14:56,860 --> 01:14:59,900 And that actually turns this computation, 1621 01:14:59,900 --> 01:15:01,940 as it looks to the left and looks to the right, 1622 01:15:01,940 --> 01:15:03,820 says, oh, I've got a floating point number 1623 01:15:03,820 --> 01:15:06,080 on one side, an integer on the other side, 1624 01:15:06,080 --> 01:15:08,260 and so I'm gonna make my calculation overall 1625 01:15:08,260 --> 01:15:10,540 be a floating point calculation. 1626 01:15:10,540 --> 01:15:13,100 I can also pass a variable into the float function. 1627 01:15:13,100 --> 01:15:15,420 I can say, take this variable i, 1628 01:15:15,420 --> 01:15:17,820 which has a 42, also an integer, 1629 01:15:17,820 --> 01:15:19,940 and then give me back a floating point. 1630 01:15:19,940 --> 01:15:23,340 So, that'll be 42.0, put that into f, 1631 01:15:23,340 --> 01:15:27,700 we print it out, and it is indeed 42.0, and it's a float. 1632 01:15:27,700 --> 01:15:31,860 And so, Python knows the type and value of any variable. 1633 01:15:31,860 --> 01:15:34,360 This is an integer of value 42. 1634 01:15:34,360 --> 01:15:37,500 This is a float of value 42.0.
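A short sketch of the conversions just described. The 100 added to float(99) is an assumed value, since the talk only says there is an integer on the other side of the plus; the name total is also just for illustration.

```python
total = float(99) + 100   # 99 becomes 99.0, so the whole sum is floating point
print(total)              # 199.0

i = 42
print(type(i))            # an integer of value 42
f = float(i)              # ask for a floating point copy of the value in i
print(f)                  # 42.0
print(type(f))            # a float of value 42.0
```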
1635 01:15:39,740 --> 01:15:42,240 Integer division in Python 2 was kinda weird, 1636 01:15:42,240 --> 01:15:43,860 and it was actually one of the big things 1637 01:15:43,860 --> 01:15:46,500 that they changed between Python 2 and Python 3. 1638 01:15:46,500 --> 01:15:47,820 This is a Python 3 course, 1639 01:15:47,820 --> 01:15:49,820 so we're not worried about that too much. 1640 01:15:49,820 --> 01:15:52,900 What's nice about integer division in Python 3 1641 01:15:52,900 --> 01:15:55,420 is it always produces a floating point result. 1642 01:15:55,420 --> 01:15:59,020 And that means that Python 3's division is more predictable, 1643 01:15:59,020 --> 01:16:02,120 and it works more like a calculator. 1644 01:16:02,120 --> 01:16:04,380 So, in this case, I mean, you can go back 1645 01:16:04,380 --> 01:16:05,780 and look at my Python 2 lectures 1646 01:16:05,780 --> 01:16:07,940 and see how crazy it was in Python 2. 1647 01:16:07,940 --> 01:16:10,940 10 divided by 2 is 5.0, and the weird thing here 1648 01:16:10,940 --> 01:16:13,420 is these are both integers, but the division 1649 01:16:13,420 --> 01:16:15,460 forces the result of the calculation 1650 01:16:15,460 --> 01:16:16,780 to be a floating point number. 1651 01:16:16,780 --> 01:16:19,780 And this, you know, 10 over 2 could be 5, 1652 01:16:19,780 --> 01:16:24,740 but 9 over 2 is 4.5, and so that is accurate. 1653 01:16:24,740 --> 01:16:27,180 In old Python 2, that would give us back 4, 1654 01:16:27,180 --> 01:16:30,940 which is completely unpredictable and weird. 1655 01:16:30,940 --> 01:16:33,140 The same with 99 over 100. 1656 01:16:33,140 --> 01:16:35,060 As you would expect if this were a calculator, 1657 01:16:35,060 --> 01:16:36,980 you get 0.99. 1658 01:16:36,980 --> 01:16:39,380 Actually, what you get in Python 2 is zero 1659 01:16:39,380 --> 01:16:40,660 because it would round it down. 1660 01:16:40,660 --> 01:16:43,180 It doesn't round at all, it truncates it. 
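The division behavior described here can be checked as in the sketch below. The last two lines use the // operator, which is not mentioned in the lecture; it is included only as a way to see, from Python 3, the truncated results that Python 2 gave for these positive integers.

```python
print(10 / 2)        # 5.0, a float even though both sides are integers
print(9 / 2)         # 4.5
print(99 / 100)      # 0.99
print(10.0 / 2.0)    # 5.0, floats on both sides still give a float
print(99.0 / 100.0)  # 0.99

# Not from the lecture: floor division reproduces the old truncated answers.
print(9 // 2)        # 4
print(99 // 100)     # 0
```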
1661 01:16:43,180 --> 01:16:47,180 So, 99 over 100 is 0.99, and then it truncates it to zero. 1662 01:16:47,180 --> 01:16:48,260 That's Python 2. 1663 01:16:48,260 --> 01:16:49,900 We're not talking about Python 2. 1664 01:16:49,900 --> 01:16:51,900 There's a good reason we're not talking about Python 2. 1665 01:16:51,900 --> 01:16:53,380 Welcome to Python 3. 1666 01:16:53,380 --> 01:16:55,700 Of course, if there are a floating point on either side, 1667 01:16:55,700 --> 01:16:57,580 the result is still a floating point, 1668 01:16:57,580 --> 01:17:00,060 floating point, and the result is still a floating point. 1669 01:17:00,060 --> 01:17:03,220 So, integer division produces a floating result 1670 01:17:03,220 --> 01:17:07,900 in Python 3.0, not in Python 2.0. 1671 01:17:07,900 --> 01:17:11,220 That is an improvement in Python 3.0. 1672 01:17:11,220 --> 01:17:13,180 And that's why we're recording these lectures. 1673 01:17:13,180 --> 01:17:15,740 I have a whole great set of lectures about Python 2, 1674 01:17:15,740 --> 01:17:17,300 and now I'm gonna have a great set of lectures 1675 01:17:17,300 --> 01:17:18,540 about Python 3. 1676 01:17:18,540 --> 01:17:19,980 Welcome to Python 3. 1677 01:17:21,140 --> 01:17:23,620 Okay, so, we've been talking about converting 1678 01:17:23,620 --> 01:17:24,780 from integer to floating point, 1679 01:17:24,780 --> 01:17:27,420 but you can also convert from string to integer 1680 01:17:27,420 --> 01:17:28,820 or string to floating point. 1681 01:17:29,860 --> 01:17:33,420 And so, here we start out with a little string value. 1682 01:17:33,420 --> 01:17:35,340 Now, it only works for strings that are made of digits. 1683 01:17:35,340 --> 01:17:38,860 So, quote one, two, three, quote is not an integer. 1684 01:17:38,860 --> 01:17:42,140 It is a three-character string that has one, two, three 1685 01:17:42,140 --> 01:17:43,660 as the characters in that string, 1686 01:17:43,660 --> 01:17:46,220 which is very different than 123. 
1687 01:17:47,140 --> 01:17:48,580 We say, what is the type of this? 1688 01:17:48,580 --> 01:17:49,860 It's a string. 1689 01:17:49,860 --> 01:17:51,540 We say, let's add one to it. 1690 01:17:51,540 --> 01:17:54,740 And it says, can't convert int to string, 1691 01:17:54,740 --> 01:17:56,060 so that blows up, right? 1692 01:17:56,060 --> 01:17:57,300 Because this is a string. 1693 01:17:57,300 --> 01:17:58,340 It looks to both sides. 1694 01:17:58,340 --> 01:18:01,980 String plus an integer, not good, okay? 1695 01:18:01,980 --> 01:18:03,820 But we can convert this. 1696 01:18:03,820 --> 01:18:05,420 We can call the int function, 1697 01:18:05,420 --> 01:18:07,180 which is like the float function, 1698 01:18:07,180 --> 01:18:08,300 and pass a string in. 1699 01:18:08,300 --> 01:18:11,620 So, it says, hey, take this and turn it into an integer. 1700 01:18:11,620 --> 01:18:13,660 So, take the input of sval, 1701 01:18:13,660 --> 01:18:15,780 which is the string one, two, three, 1702 01:18:15,780 --> 01:18:18,540 and give me back an integer representation of that, 1703 01:18:18,540 --> 01:18:21,100 which is going to be 123. 1704 01:18:21,100 --> 01:18:22,780 So, we say, what kind of thing do we get back? 1705 01:18:22,780 --> 01:18:24,280 Well, we got back an integer. 1706 01:18:24,280 --> 01:18:27,260 We can now add one to it and get 124. 1707 01:18:27,260 --> 01:18:30,140 And so, you have to manage the type of things 1708 01:18:30,140 --> 01:18:33,580 and you can convert from one type to another. 1709 01:18:33,580 --> 01:18:35,700 Now, int is not magic. 1710 01:18:35,700 --> 01:18:36,980 If you send something into it, 1711 01:18:36,980 --> 01:18:40,060 a string that doesn't consist of digits, 1712 01:18:40,060 --> 01:18:42,300 then you're gonna end up with another error. 1713 01:18:42,300 --> 01:18:44,620 Invalid literal for integer with base 10, 1714 01:18:44,620 --> 01:18:45,780 blah, blah, blah, blah, blah. 
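The string-to-integer steps described here might be typed in as follows. The name sval comes from the talk; ival, nsv, niv, and the text 'hello bob' are assumptions for illustration. Both error messages are printed rather than quoted, because their wording varies between Python versions.

```python
sval = '123'                  # a three-character string, not a number
print(type(sval))

try:
    print(sval + 1)           # string plus integer is not allowed
except TypeError as err:
    print('TypeError:', err)

ival = int(sval)              # convert the digits into an integer
print(type(ival))
print(ival + 1)               # 124

nsv = 'hello bob'             # assumed example of a string with no digits
niv = None
try:
    niv = int(nsv)            # int() cannot make sense of letters
except ValueError as err:
    print('ValueError:', err)
```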
1715 01:18:45,780 --> 01:18:46,780 So, it's really complaining. 1716 01:18:46,780 --> 01:18:48,500 It says, I want these to be numbers here 1717 01:18:48,500 --> 01:18:49,900 and you just gave me letters. 1718 01:18:49,900 --> 01:18:52,260 So, that's going to cause this to fail. 1719 01:18:54,140 --> 01:18:55,900 Another thing that we're gonna do with variables 1720 01:18:55,900 --> 01:18:59,120 is just like the print function takes something, 1721 01:18:59,120 --> 01:19:00,980 a list of things, in this case, 1722 01:19:00,980 --> 01:19:02,620 a string, comma, a variable, 1723 01:19:02,620 --> 01:19:05,060 and then print some output in the program. 1724 01:19:05,060 --> 01:19:06,500 The opposite of that is input. 1725 01:19:06,500 --> 01:19:09,060 Actually, input generally happens before output. 1726 01:19:09,060 --> 01:19:12,860 Input is a built-in function and we pass to it a prompt, 1727 01:19:12,860 --> 01:19:15,200 a string of text that's going to be printed out 1728 01:19:15,200 --> 01:19:17,980 for the user and then it stops and waits. 1729 01:19:17,980 --> 01:19:19,560 So, it says, who are you? 1730 01:19:19,560 --> 01:19:21,580 And then right here, it just sits, 1731 01:19:21,580 --> 01:19:23,440 waiting for us to type something. 1732 01:19:23,440 --> 01:19:24,820 So, we type, blah, blah, blah, blah, 1733 01:19:24,820 --> 01:19:26,780 and then hit the enter key, right? 1734 01:19:26,780 --> 01:19:29,860 We hit the enter key and then this text 1735 01:19:29,860 --> 01:19:31,660 ends up in this variable. 1736 01:19:31,660 --> 01:19:33,700 So, this is an assignment statement 1737 01:19:33,700 --> 01:19:36,540 that chuck is the result of the input call, 1738 01:19:36,540 --> 01:19:38,620 gets copied into the nam variable. 1739 01:19:41,680 --> 01:19:43,680 So, let's do that again. 1740 01:19:43,680 --> 01:19:45,500 It's evaluating assignment statement. 
1741 01:19:45,500 --> 01:19:46,740 Remember, it's kind of this way 1742 01:19:46,740 --> 01:19:49,780 or you can think of it as do this right side first. 1743 01:19:49,780 --> 01:19:53,220 It writes this out, writes that out, 1744 01:19:53,220 --> 01:19:55,820 then it waits, wait, wait, wait, wait, wait, 1745 01:19:55,820 --> 01:20:00,020 until we hit the enter and takes this chuck 1746 01:20:00,020 --> 01:20:02,620 and that becomes the result of this input 1747 01:20:02,620 --> 01:20:05,960 which is then assigned in to nam. 1748 01:20:05,960 --> 01:20:08,460 Now, then we go sequentially to the next line. 1749 01:20:08,460 --> 01:20:10,980 It prints out welcome, comma, 1750 01:20:10,980 --> 01:20:12,840 n-a, contents of the variable nam. 1751 01:20:12,840 --> 01:20:14,740 Now, this one, this comma here, 1752 01:20:14,740 --> 01:20:16,960 actually does put the space in here automatically. 1753 01:20:16,960 --> 01:20:18,860 So, it says welcome space chuck. 1754 01:20:18,860 --> 01:20:20,860 So, it pulls the, there's no space in chuck, 1755 01:20:20,860 --> 01:20:23,220 just the chu-c-k. 1756 01:20:23,220 --> 01:20:25,700 And so, print can take more than one thing 1757 01:20:25,700 --> 01:20:26,740 separated by commas. 1758 01:20:26,740 --> 01:20:28,420 Matter of fact, print can have, 1759 01:20:28,420 --> 01:20:31,960 you know, a whole bunch, oops, come back, come back, come back. 1760 01:20:34,220 --> 01:20:36,340 Print can have comma, comma, comma, parenthesis, 1761 01:20:36,340 --> 01:20:37,340 as many as you like. 1762 01:20:37,340 --> 01:20:38,260 Everything you've seen up to now 1763 01:20:38,260 --> 01:20:39,380 is kind of one thing in the print 1764 01:20:39,380 --> 01:20:42,340 but that doesn't mean that print only can do one thing. 
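The input-then-print sequence walked through above might look like the sketch below. Wrapping it in a function with a `read` parameter is an assumption made here so the example can be run without a keyboard; in the lecture it is simply two lines at the top level of the program.

```python
def welcome(read=input):
    # The right-hand side runs first: the prompt is printed, the program
    # waits, and the typed text (without the Enter key) comes back as a string.
    nam = read('Who are you? ')
    # print accepts several things separated by commas and puts
    # a single space between them in the output.
    print('Welcome', nam)
    return nam
```

Calling `welcome()` with no argument reads from the keyboard; passing something like `read=lambda prompt: 'Chuck'` stands in for a user typing Chuck.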
1765 01:20:43,180 --> 01:20:44,700 So, I've talked about variables, 1766 01:20:44,700 --> 01:20:45,860 we've talked about constants, 1767 01:20:45,860 --> 01:20:46,900 we've talked about input, 1768 01:20:46,900 --> 01:20:47,780 we've talked about output, 1769 01:20:47,780 --> 01:20:51,580 and now it is time to write our first meaningful program. 1770 01:20:52,660 --> 01:20:55,740 And so, this program has to do with those of you 1771 01:20:55,740 --> 01:20:58,060 who have traveled internationally. 1772 01:20:58,060 --> 01:20:59,620 If you traveled to United States 1773 01:20:59,620 --> 01:21:01,080 and you traveled outside the United States, 1774 01:21:01,080 --> 01:21:03,980 you notice that there is an elevator convention 1775 01:21:03,980 --> 01:21:05,700 that is different inside the United States. 1776 01:21:05,700 --> 01:21:08,700 The United States, the walk in the ground floor 1777 01:21:08,700 --> 01:21:10,460 in the elevator, that's one. 1778 01:21:10,460 --> 01:21:12,440 And if you walk in a ground floor in Europe 1779 01:21:12,440 --> 01:21:13,940 or many other places in the world, 1780 01:21:13,940 --> 01:21:15,960 then the elevator is zero. 1781 01:21:15,960 --> 01:21:17,580 So, we have written a small app 1782 01:21:17,580 --> 01:21:18,620 that we're gonna put on the app store 1783 01:21:18,620 --> 01:21:19,780 and get wealthy with, 1784 01:21:19,780 --> 01:21:23,780 with called Elevator Floor Conversion App. 1785 01:21:23,780 --> 01:21:27,020 And it's gonna ask us, we're in Europe and we're lost, 1786 01:21:27,020 --> 01:21:28,900 and you say, well, what floor would this be 1787 01:21:28,900 --> 01:21:31,380 if I was in the United States of America? 1788 01:21:31,380 --> 01:21:33,660 And so, here's, we have to read the floor 1789 01:21:33,660 --> 01:21:35,680 that we are at in Europe, 1790 01:21:35,680 --> 01:21:38,500 and then we're going to convert it to a US floor, 1791 01:21:38,500 --> 01:21:39,520 and then we're gonna print it out. 
1792 01:21:39,520 --> 01:21:41,580 This is very silly, 1793 01:21:41,580 --> 01:21:46,580 but it is a pure, essential program that has input, 1794 01:21:47,300 --> 01:21:49,620 does some kind of task on that input, 1795 01:21:49,620 --> 01:21:51,180 and then produces some output, 1796 01:21:51,180 --> 01:21:55,800 which is useful for some value of useful, okay? 1797 01:21:55,800 --> 01:21:57,700 So, let's take a look at how we combine 1798 01:21:57,700 --> 01:21:59,580 everything that we learned in this lecture, 1799 01:21:59,580 --> 01:22:01,340 input, processing, and output. 1800 01:22:01,340 --> 01:22:03,220 It's a three-line program, 1801 01:22:03,220 --> 01:22:05,100 but it's sort of the beginning 1802 01:22:05,100 --> 01:22:07,380 of something that programs do, okay? 1803 01:22:07,380 --> 01:22:09,100 You're gonna do lots of programs that do this. 1804 01:22:09,100 --> 01:22:11,560 So, here we go. 1805 01:22:11,560 --> 01:22:14,480 Program starts, we do the input side effect. 1806 01:22:14,480 --> 01:22:17,280 It prints out this and then waits. 1807 01:22:17,280 --> 01:22:20,220 We type in zero, that comes back here, 1808 01:22:20,220 --> 01:22:22,700 and the zero, which is a string. 1809 01:22:22,700 --> 01:22:24,500 Input gives you back a string. 1810 01:22:24,500 --> 01:22:26,620 It doesn't give you back a number. 1811 01:22:26,620 --> 01:22:27,780 It's a little different in Python 2, 1812 01:22:27,780 --> 01:22:29,860 but in Python 3, input gives you a string. 1813 01:22:29,860 --> 01:22:32,300 So, quote zero, quote, which is what we typed here. 1814 01:22:32,300 --> 01:22:34,300 We didn't type the quotes, it's a string. 1815 01:22:34,300 --> 01:22:35,980 It gets stored in the inp variable.
1816 01:22:37,020 --> 01:22:39,240 Then we move to the next statement, 1817 01:22:39,240 --> 01:22:40,500 and on this right-hand side, 1818 01:22:40,500 --> 01:22:43,060 we convert that string variable to an integer, 1819 01:22:43,060 --> 01:22:44,940 so that becomes the integer zero. 1820 01:22:44,940 --> 01:22:48,260 We add one to it, and then that becomes one, 1821 01:22:48,260 --> 01:22:50,260 and then we assign that into usf. 1822 01:22:50,260 --> 01:22:54,340 I've named this variable United States Floor, right? 1823 01:22:54,340 --> 01:22:57,220 So, inp is the input, and usf, that's mnemonic. 1824 01:22:57,220 --> 01:22:58,880 Python doesn't know anything about elevators, 1825 01:22:58,880 --> 01:23:01,840 it's just I picked a variable name that was quite friendly. 1826 01:23:03,220 --> 01:23:07,940 And so, at this point, usf has the United States Floor 1827 01:23:07,940 --> 01:23:09,740 that's equivalent to the European Floor, 1828 01:23:09,740 --> 01:23:12,460 and then I just fall down and I do a print statement. 1829 01:23:12,460 --> 01:23:15,520 Print out the string US floor, comma, 1830 01:23:15,520 --> 01:23:16,980 that's the space right here, 1831 01:23:16,980 --> 01:23:20,060 and then whatever the contents of the usf variable is. 1832 01:23:20,060 --> 01:23:22,560 And you could see that I could type in four, 1833 01:23:22,560 --> 01:23:23,740 and it would say three. 1834 01:23:23,740 --> 01:23:26,940 I could type in seven, and it would say six. 1835 01:23:26,940 --> 01:23:28,300 This is an amazing program. 1836 01:23:28,300 --> 01:23:32,460 It converts floors in a European numbering scheme. 1837 01:23:32,460 --> 01:23:35,240 Wait, actually, no, I got that wrong. 1838 01:23:35,240 --> 01:23:37,780 Hang on, let me clear this. 1839 01:23:37,780 --> 01:23:39,820 I wasn't thinking clearly. 1840 01:23:39,820 --> 01:23:42,840 I could type in four, and it would give me back five.
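A sketch of the floor conversion described here, under some assumptions: the helper name `europe_to_us` is hypothetical, and the typed values are supplied from a list rather than from `input()` so the example runs on its own. In the interactive three-line version, the string would come from something like `inp = input('Europe floor? ')`.

```python
def europe_to_us(inp):
    # inp is the text the user typed, for example '0'.
    # int() turns it into a number, and adding one gives the US floor.
    usf = int(inp) + 1
    return usf

# Stand-ins for what a user might type at the prompt.
for typed in ['0', '4', '6']:
    print('US floor', europe_to_us(typed))
```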
1841 01:23:42,840 --> 01:23:45,540 I could type in six, and it would give me back seven. 1842 01:23:45,540 --> 01:23:47,220 See, I'm confused, haven't been in Europe 1843 01:23:47,220 --> 01:23:50,140 in a couple of months, and so I forgot all about the floors, 1844 01:23:50,140 --> 01:23:51,700 but that's the idea. 1845 01:23:51,700 --> 01:23:55,900 Now, this is a super, super, super simple program. 1846 01:23:55,900 --> 01:23:58,520 Not super useful, but you get the idea 1847 01:23:58,520 --> 01:24:00,180 that we're gonna pull some data in, 1848 01:24:00,180 --> 01:24:02,900 we're gonna do some intelligent thing. 1849 01:24:02,900 --> 01:24:05,220 Soon this will be hundreds of lines of code 1850 01:24:05,220 --> 01:24:06,060 instead of one line of code, 1851 01:24:06,060 --> 01:24:08,660 and then we're gonna present the results to our user. 1852 01:24:12,140 --> 01:24:14,820 Now, another element of most any programming language 1853 01:24:14,820 --> 01:24:16,400 is what's called a comment. 1854 01:24:16,400 --> 01:24:20,700 A comment is a way for you to put in a program file 1855 01:24:20,700 --> 01:24:24,380 some text that's to be ignored by Python or C 1856 01:24:24,380 --> 01:24:26,340 or whatever language we happen to be using. 1857 01:24:26,340 --> 01:24:29,780 In Python, comments start with a pound sign. 1858 01:24:29,780 --> 01:24:32,660 So what you can do is put a pound sign anywhere in a line, 1859 01:24:32,660 --> 01:24:34,860 and then after the pound sign, 1860 01:24:34,860 --> 01:24:36,860 Python ignores everything after that pound sign. 1861 01:24:36,860 --> 01:24:38,360 It can be the first character. 1862 01:24:38,360 --> 01:24:43,180 So here's our recurring concept that we talk a lot about. 1863 01:24:43,180 --> 01:24:44,540 We're not gonna cover this. 1864 01:24:44,540 --> 01:24:45,460 Remember what this does. 1865 01:24:45,460 --> 01:24:47,860 This is counting how many letters, the, the, the. 
1866 01:24:47,860 --> 01:24:50,260 There's 16 thes, and there's, in that file, 1867 01:24:50,260 --> 01:24:52,380 there was six twos or whatever it was. 1868 01:24:52,380 --> 01:24:53,220 This is that code. 1869 01:24:53,220 --> 01:24:54,960 We'll get back to this code. 1870 01:24:54,960 --> 01:24:57,300 But what we've done here is I've added some comments 1871 01:24:57,300 --> 01:25:00,740 that are really for human consumption. 1872 01:25:00,740 --> 01:25:03,060 So this first paragraph is get the name of the file 1873 01:25:03,060 --> 01:25:03,900 and open it. 1874 01:25:03,900 --> 01:25:06,820 The second paragraph is count the word frequency. 1875 01:25:06,820 --> 01:25:09,220 You know, maybe I should have said histogram here. 1876 01:25:10,140 --> 01:25:12,540 Count the word frequency and assemble a histogram. 1877 01:25:12,540 --> 01:25:15,180 And then here I'm putting this pound sign in, 1878 01:25:15,180 --> 01:25:16,260 find the most common word, 1879 01:25:16,260 --> 01:25:18,980 and then I'm all done, I print this stuff out, right? 1880 01:25:18,980 --> 01:25:23,140 And so all I'm saying is comments are for people to read. 1881 01:25:23,140 --> 01:25:25,700 Your next programmer or the person who's gonna change 1882 01:25:25,700 --> 01:25:28,340 your program after you're done with it. 1883 01:25:28,340 --> 01:25:29,180 And they're nice. 1884 01:25:29,180 --> 01:25:32,020 And you don't have to use any particularly weird syntax 1885 01:25:32,020 --> 01:25:33,860 or variable naming conventions. 1886 01:25:33,860 --> 01:25:36,780 You put a pound sign in and you can write anything you want 1887 01:25:36,780 --> 01:25:38,040 from that point forward. 
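As a rough illustration of comments marking the "paragraphs" of a program, here is a word-counting sketch in the spirit of the one described above. It is an assumption-laden variant: the lecture's program asks for a file name and opens the file, whereas this version counts words in a short sample string made up for the example.

```python
# Assumed sample text; the lecture's version reads a file chosen by the user.
text = 'the clown ran after the car and the car ran into the tent'

# Count the word frequency (assemble a histogram)
counts = dict()
for word in text.split():
    counts[word] = counts.get(word, 0) + 1

# Find the most common word
bigword = None
bigcount = None
for word, count in counts.items():
    if bigcount is None or count > bigcount:
        bigword = word
        bigcount = count

# All done
print(bigword, bigcount)  # text after a pound sign on a code line is ignored too
```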
1888 01:25:39,720 --> 01:25:42,560 Okay, so we've talked a little bit about variables 1889 01:25:42,560 --> 01:25:45,260 and types and mnemonics and how we would choose 1890 01:25:45,260 --> 01:25:47,780 variable names and how expressions work 1891 01:25:47,780 --> 01:25:49,440 and the various operators converting 1892 01:25:49,440 --> 01:25:52,540 between different types, printing, 1893 01:25:52,540 --> 01:25:54,660 input, output, and comments. 1894 01:25:54,660 --> 01:25:58,280 So that just kinda gets us sentences. 1895 01:25:58,280 --> 01:26:01,640 Coming up next we'll talk about conditional execution 1896 01:26:01,640 --> 01:26:03,940 where we're really starting to move up to paragraphs. 1897 01:26:03,940 --> 01:26:05,100 So see you in a bit. 1898 01:26:09,600 --> 01:26:12,180 Hello and welcome to chapter three, conditional execution. 1899 01:26:12,180 --> 01:26:15,300 In conditional execution we meet the if statement. 1900 01:26:15,300 --> 01:26:17,960 The if statement is where Python can go one way 1901 01:26:17,960 --> 01:26:19,180 or another way. 1902 01:26:19,180 --> 01:26:21,620 And it's the beginning of sort of our way 1903 01:26:21,620 --> 01:26:25,140 of making Python make decisions for us. 1904 01:26:25,140 --> 01:26:27,380 Sequential code, we just do some things. 1905 01:26:27,380 --> 01:26:28,500 Sometimes that's useful. 1906 01:26:28,500 --> 01:26:32,420 But now we can have our code check something 1907 01:26:32,420 --> 01:26:35,740 and then make a decision based on that thing. 1908 01:26:35,740 --> 01:26:37,820 So the conditional steps in Python 1909 01:26:37,820 --> 01:26:39,700 are pretty straightforward. 1910 01:26:39,700 --> 01:26:42,540 The keyword that we're going to use is the if statement. 1911 01:26:42,540 --> 01:26:44,460 And so if is a reserved word. 1912 01:26:44,460 --> 01:26:47,900 And the if statement has as part of it 1913 01:26:47,900 --> 01:26:48,980 a question that it asks. 
1914 01:26:48,980 --> 01:26:51,780 And this is asking if x is less than 10. 1915 01:26:51,780 --> 01:26:53,740 And the colon is the end of the if statement. 1916 01:26:53,740 --> 01:26:56,540 And then we begin an indented block of text. 1917 01:26:56,540 --> 01:26:58,460 And the way this works in this particular thing 1918 01:26:58,460 --> 01:27:00,820 is this line is the conditional line. 1919 01:27:00,820 --> 01:27:03,820 If the question is true, the line executes. 1920 01:27:03,820 --> 01:27:06,580 And if the question is false, the line is skipped. 1921 01:27:06,580 --> 01:27:08,600 And you can think of it the way this is, right? 1922 01:27:08,600 --> 01:27:10,500 x is five, ask a question. 1923 01:27:10,500 --> 01:27:11,780 Is it less than 10 or not? 1924 01:27:11,780 --> 01:27:14,420 These questions do not harm the value of x. 1925 01:27:14,420 --> 01:27:16,700 If it is, then we run this code. 1926 01:27:16,700 --> 01:27:18,140 And then we sort of rejoin here. 1927 01:27:18,140 --> 01:27:20,420 And then we test this next if. 1928 01:27:20,420 --> 01:27:22,020 And if that's true, we do this code. 1929 01:27:22,020 --> 01:27:23,100 And then we do there. 1930 01:27:23,100 --> 01:27:24,580 But in this case, it's going to be false 1931 01:27:24,580 --> 01:27:26,060 because x is not greater than 20. 1932 01:27:26,060 --> 01:27:28,100 And so it just continues down here. 1933 01:27:28,100 --> 01:27:32,860 So if we look at how this works, it runs. 1934 01:27:32,860 --> 01:27:34,300 It runs this line. 1935 01:27:34,300 --> 01:27:35,740 Then it sees this question. 1936 01:27:35,740 --> 01:27:36,580 It skips that line. 1937 01:27:36,580 --> 01:27:38,220 So this line does not run. 1938 01:27:38,220 --> 01:27:39,940 And so Smaller prints out. 1939 01:27:39,940 --> 01:27:42,020 And Finis prints out. 1940 01:27:42,020 --> 01:27:42,700 OK? 1941 01:27:42,700 --> 01:27:45,460 And so that's the basic idea of an if statement.
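The one-branch if statements walked through above can be sketched like this. The printed words follow the walkthrough; the second test, `x > 20`, is an assumption based on the statement that the second question comes out false for an x of five.

```python
x = 5
if x < 10:
    print('Smaller')   # runs, because 5 is less than 10
if x > 20:
    print('Bigger')    # skipped, because 5 is not greater than 20
print('Finis')         # not indented, so it runs either way
```

Asking the questions leaves `x` unchanged; it is still 5 when the program ends.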
1942 01:27:45,460 --> 01:27:49,140 And the indentation, when we are done with an if statement, 1943 01:27:49,140 --> 01:27:50,220 we deindent back. 1944 01:27:50,220 --> 01:27:52,020 And there's this little block. 1945 01:27:52,020 --> 01:27:53,900 This is one sort of if statement. 1946 01:27:53,900 --> 01:27:56,480 And this is another if statement. 1947 01:27:56,480 --> 01:27:58,140 And these are the two conditional lines 1948 01:27:58,140 --> 01:27:59,840 that either run or they don't run, 1949 01:27:59,840 --> 01:28:04,180 depending on the answer to that question. 1950 01:28:04,180 --> 01:28:06,500 So we have a number of different comparison operators 1951 01:28:06,500 --> 01:28:09,980 that we can use to ask these true-false questions that 1952 01:28:09,980 --> 01:28:11,660 say, is this true? 1953 01:28:11,660 --> 01:28:14,420 So again, we're kind of limited to the keys 1954 01:28:14,420 --> 01:28:19,860 that were on computer keyboards in the 1940s and 1950s. 1955 01:28:19,860 --> 01:28:21,780 Less than, less than or equal to. 1956 01:28:21,780 --> 01:28:24,180 So we didn't have fancy math characters. 1957 01:28:24,180 --> 01:28:26,700 So we just concatenated less than and equal 1958 01:28:26,700 --> 01:28:28,380 to be less than or equal to. 1959 01:28:28,380 --> 01:28:32,460 This double equals is the asking, is this equal to? 1960 01:28:32,460 --> 01:28:34,380 And so that's a little tricky. 1961 01:28:34,380 --> 01:28:37,260 The equals sign is that assignment operator. 1962 01:28:37,260 --> 01:28:39,340 If I was building a language today from scratch, 1963 01:28:39,340 --> 01:28:41,340 I would probably make assignment be arrow. 1964 01:28:41,340 --> 01:28:44,660 And the equals question to have an equals. 1965 01:28:44,660 --> 01:28:49,060 Or I might say somewhere I would say question equals. 1966 01:28:49,060 --> 01:28:51,860 But I'm not building this language. 1967 01:28:51,860 --> 01:28:54,020 So that's not up to me. 
1968 01:28:54,020 --> 01:28:55,420 So this is the question. 1969 01:28:55,420 --> 01:28:59,660 Double equals is asking the question is equal to. 1970 01:28:59,660 --> 01:29:02,940 Greater than or equal, greater than, and not equal. 1971 01:29:02,940 --> 01:29:06,220 So the exclamation point followed by an equals sign is not equal. 1972 01:29:06,220 --> 01:29:07,940 So that's sort of not equal. 1973 01:29:07,940 --> 01:29:09,100 So that's how we do not equal. 1974 01:29:09,100 --> 01:29:12,580 So if we take a look at some of these in some examples, 1975 01:29:12,580 --> 01:29:17,500 all of these are going to be true because of the way x is set. 1976 01:29:17,500 --> 01:29:20,900 If x is equal to 5, that's the question version. 1977 01:29:20,900 --> 01:29:22,420 That's true or false. 1978 01:29:22,420 --> 01:29:23,660 It'll execute that. 1979 01:29:23,660 --> 01:29:26,580 If x is greater than 4, it's going to execute that. 1980 01:29:26,580 --> 01:29:29,180 If x is greater than or equal to 5, it's going to execute that. 1981 01:29:29,180 --> 01:29:31,300 Here's kind of a shorthand where if there's only 1982 01:29:31,300 --> 01:29:33,860 one line in this block, you can kind of pull it up right 1983 01:29:33,860 --> 01:29:35,780 on the same line after the colon. 1984 01:29:35,780 --> 01:29:38,540 If x is less than 6, which it is, true. 1985 01:29:38,540 --> 01:29:40,180 Execute that. 1986 01:29:40,180 --> 01:29:42,460 Then if x is less than or equal to 5, do that. 1987 01:29:42,460 --> 01:29:44,740 And if x is not equal to 6, do that. 1988 01:29:44,740 --> 01:29:47,020 Now like I said, all these questions 1989 01:29:47,020 --> 01:29:49,700 have been carefully constructed so that they're true. 1990 01:29:49,700 --> 01:29:52,900 Just to kind of show you the syntax of those comparison 1991 01:29:52,900 --> 01:29:53,980 operators. 1992 01:29:53,980 --> 01:29:56,340 Now you don't just have to have a single line of text 1993 01:29:56,340 --> 01:29:57,940 in the indented block.
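The comparison operators listed above, written out as a sketch with x set to 5 so that every question comes out true. The printed messages are illustrative wording, not necessarily what appears on the slide.

```python
x = 5
if x == 5:
    print('Equals 5')
if x > 4:
    print('Greater than 4')
if x >= 5:
    print('Greater than or equal to 5')
if x < 6: print('Less than 6')   # one-line shorthand: the statement follows the colon
if x <= 5:
    print('Less than or equal to 5')
if x != 6:
    print('Not equal 6')
```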
1994 01:29:57,940 --> 01:30:00,380 And this will be something you're going to get used to. 1995 01:30:00,380 --> 01:30:03,140 So if we indent more than one line, 1996 01:30:03,140 --> 01:30:08,500 then the conditional code is actually these three lines. 1997 01:30:08,500 --> 01:30:10,300 So the idea is you have an if statement. 1998 01:30:10,300 --> 01:30:12,060 You come in, you do an indent. 1999 01:30:12,060 --> 01:30:13,420 And as long as you stay indented, 2000 01:30:13,420 --> 01:30:15,100 you stay in that if block. 2001 01:30:15,100 --> 01:30:19,420 If it's false, it just skips all of those. 2002 01:30:19,420 --> 01:30:24,660 So the way this is going to execute, x is 5. 2003 01:30:24,660 --> 01:30:25,860 You could print before 5. 2004 01:30:25,860 --> 01:30:26,940 Is x equal 5? 2005 01:30:26,940 --> 01:30:29,020 That's the question mark, and that's true. 2006 01:30:29,020 --> 01:30:31,060 So it's going to run all these. 2007 01:30:31,060 --> 01:30:33,940 And then come back, and then continue on, and then de-indent. 2008 01:30:33,940 --> 01:30:36,700 So all this stuff is running. 2009 01:30:36,700 --> 01:30:38,340 And then it says if x equals 6. 2010 01:30:38,340 --> 01:30:39,700 So that was false. 2011 01:30:39,700 --> 01:30:41,140 So that skips all of them. 2012 01:30:41,140 --> 01:30:43,700 So none of these lines of code run. 2013 01:30:43,700 --> 01:30:49,460 So these actually don't run, and it says afterwards 6. 2014 01:30:49,460 --> 01:30:50,540 So that's a mistake. 2015 01:30:50,540 --> 01:30:56,060 Those don't run right there, because x is not equal 6. 2016 01:30:56,060 --> 01:30:57,940 OK? 2017 01:30:57,940 --> 01:31:03,740 So indentation is an essential part of Python. 2018 01:31:03,740 --> 01:31:06,180 We use indentation in lots of programming languages, 2019 01:31:06,180 --> 01:31:10,060 often to demarcate blocks to show 2020 01:31:10,060 --> 01:31:11,820 where blocks start and stop. 
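A sketch of the multi-line if blocks described above: three indented lines belong to each if, and they either all run or are all skipped. The message text is assumed for illustration.

```python
x = 5
print('Before 5')
if x == 5:
    print('Is 5')          # these three indented lines
    print('Is Still 5')    # all run, because the
    print('Third 5')       # question is true
print('Afterwards 5')

print('Before 6')
if x == 6:
    print('Is 6')          # none of these three run,
    print('Is Still 6')    # because x is not equal
    print('Third 6')       # to 6
print('Afterwards 6')      # de-indented, so it always runs
```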
2021 01:31:11,820 --> 01:31:15,220 But in Python, it's syntactically significant. 2022 01:31:15,220 --> 01:31:17,780 You can get an error if your indentation is wrong. 2023 01:31:17,780 --> 01:31:19,580 After an if, you must indent. 2024 01:31:19,580 --> 01:31:21,140 And you maintain the indent as long 2025 01:31:21,140 --> 01:31:23,740 as you want to be in that same if block. 2026 01:31:23,740 --> 01:31:25,540 And then when you're done with the if block, 2027 01:31:25,540 --> 01:31:27,060 you reduce the indent. 2028 01:31:27,060 --> 01:31:31,260 In this rule of indenting, comment lines and blank lines 2029 01:31:31,260 --> 01:31:34,380 are completely ignored. 2030 01:31:34,380 --> 01:31:36,780 So we're going to tend to put four spaces. 2031 01:31:36,780 --> 01:31:44,220 Four spaces ends up being the normal thing that we do. 2032 01:31:44,220 --> 01:31:46,340 And you'll see all the code that I write 2033 01:31:46,340 --> 01:31:48,180 has four spaces for each indent. 2034 01:31:48,180 --> 01:31:51,020 If I go in twice, I use eight spaces. 2035 01:31:51,020 --> 01:31:52,540 And we have this instinct of wanting 2036 01:31:52,540 --> 01:31:55,300 to hit the Tab key to move in four spaces. 2037 01:31:55,300 --> 01:31:57,660 Now, the problem is that it might 2038 01:31:57,660 --> 01:31:59,180 look the same on your screen. 2039 01:31:59,180 --> 01:32:02,580 A Tab and four spaces might line up the same place, 2040 01:32:02,580 --> 01:32:04,740 depending on how tabs are set. 2041 01:32:04,740 --> 01:32:06,460 But Python can get confused by that. 2042 01:32:06,460 --> 01:32:11,260 So we tend to avoid using actual tabs in files. 2043 01:32:11,260 --> 01:32:12,920 And so most programming text editors, 2044 01:32:12,920 --> 01:32:15,220 like if you're using Notepad or Text Wrangler, 2045 01:32:15,220 --> 01:32:17,980 there is a place to set the tabs, 2046 01:32:17,980 --> 01:32:19,740 to say don't put tabs in this document.
2047 01:32:19,740 --> 01:32:22,140 But every time you hit Tab, move over four spaces. 2048 01:32:22,140 --> 01:32:24,460 And so you hit Tab, but it's like space, space, 2049 01:32:24,460 --> 01:32:25,460 space, space. 2050 01:32:25,460 --> 01:32:28,820 Now, the nice thing about Atom, and this is the text editor 2051 01:32:28,820 --> 01:32:30,300 we tend to recommend in this class, 2052 01:32:30,300 --> 01:32:33,700 A, because it works on Windows, Linux, and Mac, 2053 01:32:33,700 --> 01:32:36,060 but also because it automatically sets this up. 2054 01:32:36,060 --> 01:32:38,900 As soon as you save your file with a .py extension, 2055 01:32:38,900 --> 01:32:41,260 you can sort of hit the Tab key with impunity. 2056 01:32:41,260 --> 01:32:44,060 And everything works perfectly. 2057 01:32:44,060 --> 01:32:47,140 But the key thing here is that Python insists 2058 01:32:47,140 --> 01:32:48,380 that you get this right. 2059 01:32:48,380 --> 01:32:50,420 And if you don't get this right, you're 2060 01:32:50,420 --> 01:32:51,820 going to get indentation errors. 2061 01:32:51,820 --> 01:32:56,220 And they're just another syntax error. 2062 01:32:56,220 --> 01:33:01,540 So if you're using something like Text Wrangler or Notepad, 2063 01:33:01,540 --> 01:33:03,040 run around in the Preferences, and you'll 2064 01:33:03,040 --> 01:33:05,220 find something about expanding tabs, 2065 01:33:05,220 --> 01:33:09,260 or maybe how many spaces each tab stop is supposed to be. 2066 01:33:09,260 --> 01:33:10,380 And so you check these. 2067 01:33:10,380 --> 01:33:12,700 And what this really is doing is telling your text editor, 2068 01:33:12,700 --> 01:33:15,100 never put an actual tab in the document, 2069 01:33:15,100 --> 01:33:19,140 but somehow simulate tab stops using spaces. 2070 01:33:19,140 --> 01:33:21,780 And so here is a bit of code. 2071 01:33:21,780 --> 01:33:23,740 It's got some nested block.
2072 01:33:23,740 --> 01:33:27,060 But it gives you the sense that you have to be very explicit 2073 01:33:27,060 --> 01:33:30,580 when you're reading Python code of whether the indent is 2074 01:33:30,580 --> 01:33:36,300 the same between two lines, the same, increased or decreased. 2075 01:33:36,300 --> 01:33:38,700 And every time you increase it, you mean something. 2076 01:33:38,700 --> 01:33:40,260 And every time you decrease it, you mean something. 2077 01:33:40,260 --> 01:33:42,020 And literally, if it stays the same, 2078 01:33:42,020 --> 01:33:43,740 you mean something as well. 2079 01:33:43,740 --> 01:33:46,460 And so if we take a look at this, here we have a line. 2080 01:33:46,460 --> 01:33:48,460 And the next line has the same indent. 2081 01:33:48,460 --> 01:33:50,380 This is an if with a colon at the end. 2082 01:33:50,380 --> 01:33:52,460 So we have to increase the indent. 2083 01:33:52,460 --> 01:33:54,900 And now we're maintaining it. 2084 01:33:54,900 --> 01:33:56,980 So these two lines are part of that if. 2085 01:33:56,980 --> 01:33:58,540 But now we have deindent it. 2086 01:33:58,540 --> 01:34:02,260 So whether you choose to deindent this word, or this word, 2087 01:34:02,260 --> 01:34:05,100 or whatever, the where you do this deindent 2088 01:34:05,100 --> 01:34:09,380 affects the scope of how far this if statement lasts. 2089 01:34:09,380 --> 01:34:13,460 It lasts up to, but not including, the line that's 2090 01:34:13,460 --> 01:34:16,140 deindented to the same level as the if. 2091 01:34:16,140 --> 01:34:17,700 So this is a deindent. 2092 01:34:17,700 --> 01:34:19,840 Now we have a blank line, which doesn't matter. 2093 01:34:19,840 --> 01:34:20,980 And we maintain it. 2094 01:34:20,980 --> 01:34:23,340 And we have a for, which we'll learn about in the next chapter, 2095 01:34:23,340 --> 01:34:24,660 which is a looping structure. 2096 01:34:24,660 --> 01:34:25,460 Let's do a for. 
2097 01:34:25,460 --> 01:34:27,380 For runs this five times. 2098 01:34:27,380 --> 01:34:28,180 It has a colon. 2099 01:34:28,180 --> 01:34:30,860 And it also expects an indented block. 2100 01:34:30,860 --> 01:34:32,740 Now we have what's called a nested block, 2101 01:34:32,740 --> 01:34:34,500 where we have an if and a colon. 2102 01:34:34,500 --> 01:34:35,820 We go into some more. 2103 01:34:35,820 --> 01:34:37,660 So this is like two indents. 2104 01:34:37,660 --> 01:34:39,260 So these are one indent. 2105 01:34:39,260 --> 01:34:40,300 And these are two indents. 2106 01:34:40,300 --> 01:34:43,380 And so this is a block within a block. 2107 01:34:43,380 --> 01:34:45,020 And then we deindent. 2108 01:34:45,020 --> 01:34:47,300 So that means this print is not part of the if statement, 2109 01:34:47,300 --> 01:34:49,700 but it's still part of the for statement. 2110 01:34:49,700 --> 01:34:51,580 And then we deindent again. 2111 01:34:51,580 --> 01:34:55,200 And then that means this print is on the same level 2112 01:34:55,200 --> 01:34:56,620 as that for statement. 2113 01:34:56,620 --> 01:34:59,460 So if you start thinking about this, 2114 01:34:59,460 --> 01:35:01,020 you want to be able to start thinking 2115 01:35:01,020 --> 01:35:04,340 that these blocks are the start of the block with the colon 2116 01:35:04,340 --> 01:35:08,220 line up to, but not including this line that's 2117 01:35:08,220 --> 01:35:09,380 been deindented. 2118 01:35:09,380 --> 01:35:12,340 So the for goes this far. 2119 01:35:12,340 --> 01:35:14,180 The for goes up to, but not including 2120 01:35:14,180 --> 01:35:15,660 the line that's deindented. 2121 01:35:15,660 --> 01:35:18,340 The if goes up to, but not including 2122 01:35:18,340 --> 01:35:20,260 the line that's deindented. 2123 01:35:20,260 --> 01:35:22,780 So as you do this, you'll sort of mentally 2124 01:35:22,780 --> 01:35:24,180 start drawing these blocks. 
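The indent and de-indent walkthrough above can be followed against a sketch like this one. The specific values and messages are assumptions; what matters is which lines sit at zero, one, and two levels of indent.

```python
x = 5
if x > 2:                        # colon: the next line must be indented
    print('Bigger than 2')
    print('Still bigger')        # same indent: still inside the if
print('Done with 2')             # de-indented: the if block has ended

for i in range(5):               # runs its block five times, i = 0 through 4
    print(i)
    if i > 2:                    # a block within a block
        print('Bigger than 2')   # two indents, eight spaces
    print('Done with i', i)      # one indent: outside the if, inside the for
print('All Done')                # no indent: same level as the for
```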
2125 01:35:24,180 --> 01:35:26,900 And pretty soon, you will start constructing them as blocks. 2126 01:35:26,900 --> 01:35:30,340 And it takes a while, but doesn't take forever. 2127 01:35:30,340 --> 01:35:43,460 But in Python, unlike other languages, 2128 01:35:43,460 --> 01:35:47,700 you have this is very important, and it matters. 2129 01:35:47,700 --> 01:35:49,780 And you can have syntax errors if you get it wrong. 2130 01:35:49,780 --> 01:35:51,580 Because you're really communicating 2131 01:35:51,580 --> 01:35:53,580 the shape and structure of your code 2132 01:35:53,580 --> 01:35:56,840 using these indents and deindents. 2133 01:35:56,840 --> 01:35:58,380 We already saw a nested indent. 2134 01:35:58,380 --> 01:36:00,020 This is a nested if. 2135 01:36:00,020 --> 01:36:01,900 So you can put an if within an if. 2136 01:36:01,900 --> 01:36:03,820 And you can go as far deep as you want to go, 2137 01:36:03,820 --> 01:36:05,500 like Russian dolls. 2138 01:36:05,500 --> 01:36:08,140 And so here we have x equals 42. 2139 01:36:08,140 --> 01:36:10,060 If it's one, we indent one. 2140 01:36:10,060 --> 01:36:11,620 And then with this next thing we do, 2141 01:36:11,620 --> 01:36:13,460 these are not the same level of indent. 2142 01:36:13,460 --> 01:36:16,060 But now we see an if, and it has to indent further. 2143 01:36:16,060 --> 01:36:19,380 So this is like two in, eight spaces. 2144 01:36:19,380 --> 01:36:21,340 And then we deindent back. 2145 01:36:21,340 --> 01:36:22,820 Actually, we deindent back too. 2146 01:36:22,820 --> 01:36:24,500 And so if you'll watch this, and you 2147 01:36:24,500 --> 01:36:27,700 take a look at how this works, it runs to here. 2148 01:36:27,700 --> 01:36:30,140 Oops, back up. 2149 01:36:30,140 --> 01:36:30,820 Comes in here. 2150 01:36:30,820 --> 01:36:33,180 The answer is yes, x is greater than one. 2151 01:36:33,180 --> 01:36:33,740 Prints this. 2152 01:36:33,740 --> 01:36:34,860 Is x less than 100? 
2153 01:36:34,860 --> 01:36:36,820 Well, it's 42, so the answer is yes. 2154 01:36:36,820 --> 01:36:39,900 So it runs this, and then it kind of continues back to there. 2155 01:36:39,900 --> 01:36:42,140 And you can also think of drawing boxes around this. 2156 01:36:42,140 --> 01:36:44,340 This is one if box. 2157 01:36:44,340 --> 01:36:47,860 And then within that if box, there is another if box. 2158 01:36:47,860 --> 01:36:51,260 And again, it's the indent block up to, 2159 01:36:51,260 --> 01:36:53,540 but not including where the deindent happens. 2160 01:36:53,540 --> 01:36:57,860 And this here is like two backwards deindents. 2161 01:36:57,860 --> 01:36:59,020 So it ends two blocks. 2162 01:36:59,020 --> 01:37:01,740 So two blocks are ended by where we place this. 2163 01:37:01,740 --> 01:37:03,860 We could move this in, or we could move this out. 2164 01:37:03,860 --> 01:37:05,460 We could have it all the way into here. 2165 01:37:05,460 --> 01:37:07,300 We could have it to here or here. 2166 01:37:07,300 --> 01:37:09,300 And where we put that line depends 2167 01:37:09,300 --> 01:37:13,500 on how the ends of these blocks are going to work out. 2168 01:37:13,500 --> 01:37:18,660 So one form that's a one branch if that we just saw, 2169 01:37:18,660 --> 01:37:21,140 but then you can also have what's called a two branch if. 2170 01:37:21,140 --> 01:37:23,140 And the basic idea of a two branch if 2171 01:37:23,140 --> 01:37:25,660 is that you're going to come in, you're going to ask a question, 2172 01:37:25,660 --> 01:37:27,900 and you're going to go one direction if it's yes, 2173 01:37:27,900 --> 01:37:29,460 and another direction if it's no. 2174 01:37:29,460 --> 01:37:30,820 We call this an if then else. 2175 01:37:30,820 --> 01:37:32,660 It's kind of like a fork in the road. 
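The nested if traced earlier in this segment, before the two-branch if is introduced, can be sketched as below. The message text is assumed; the values 42, 1, and 100 come from the walkthrough.

```python
x = 42
if x > 1:
    print('More than one')       # inside the outer if
    if x < 100:
        print('Less than 100')   # inside both ifs, eight spaces in
print('All done')                # de-indented twice, so it ends both blocks
```

Moving the last `print` in by four or eight spaces would place it inside the outer or the inner block instead.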
2176 01:37:32,660 --> 01:37:34,700 And the way to think about it is depending 2177 01:37:34,700 --> 01:37:35,980 on the output of this question, we're 2178 01:37:35,980 --> 01:37:37,540 going to pick one of two of these. 2179 01:37:37,540 --> 01:37:40,020 But if we pick one, the other one's never going to happen. 2180 01:37:40,020 --> 01:37:41,420 So it's like an either or. 2181 01:37:41,420 --> 01:37:42,980 We're either going to go one way, 2182 01:37:42,980 --> 01:37:44,340 or we're going to go the other way. 2183 01:37:44,340 --> 01:37:45,940 But there is no path where we somehow 2184 01:37:45,940 --> 01:37:47,620 go through both on that. 2185 01:37:47,620 --> 01:37:49,820 That doesn't happen. 2186 01:37:49,820 --> 01:37:52,380 And the syntax that we use for this 2187 01:37:52,380 --> 01:37:55,060 is what we call the if then else. 2188 01:37:55,060 --> 01:37:59,580 And so the first part is a normal if with an indent. 2189 01:37:59,580 --> 01:38:00,420 And then we deindent. 2190 01:38:00,420 --> 01:38:03,340 And then this is another reserved word else with a colon. 2191 01:38:03,340 --> 01:38:04,740 And then we reindent. 2192 01:38:04,740 --> 01:38:07,540 And so this really ends up being part of a whole block 2193 01:38:07,540 --> 01:38:08,220 here. 2194 01:38:08,220 --> 01:38:10,540 And the else is the part. 2195 01:38:10,540 --> 01:38:12,780 This is the part that runs if it's false. 2196 01:38:12,780 --> 01:38:14,580 And this is the part that runs if it's true. 2197 01:38:14,580 --> 01:38:17,900 The first branch of the if, the first indented block 2198 01:38:17,900 --> 01:38:19,380 is what runs if it's true. 2199 01:38:19,380 --> 01:38:23,660 And the second indented block is the one that runs if it's false. 2200 01:38:23,660 --> 01:38:24,500 And so here we go. 2201 01:38:24,500 --> 01:38:27,180 It's just if x is greater than 2, in this case it's yes. 2202 01:38:27,180 --> 01:38:28,380 We're going to print bigger.
2203 01:38:28,380 --> 01:38:29,820 And then we're going to be all done. 2204 01:38:29,820 --> 01:38:31,060 And so we do one. 2205 01:38:31,060 --> 01:38:32,220 And so this one did run. 2206 01:38:32,220 --> 01:38:33,740 And this one did not run. 2207 01:38:33,740 --> 01:38:35,460 So basically with an if then else, 2208 01:38:35,460 --> 01:38:37,660 one of the two branches is going to run. 2209 01:38:37,660 --> 01:38:41,180 But there's no case in which both branches run. 2210 01:38:41,180 --> 01:38:43,700 And again, you sort of draw these blocks 2211 01:38:43,700 --> 01:38:45,660 around these things mentally. 2212 01:38:45,660 --> 01:38:47,980 And in this one, you sort of take from the if, 2213 01:38:47,980 --> 01:38:50,580 not the else is really part of the block up to, 2214 01:38:50,580 --> 01:38:52,060 but not including that print, which 2215 01:38:52,060 --> 01:38:57,900 is deindented back to the same level as the if statement. 2216 01:38:57,900 --> 01:39:00,340 OK? 2217 01:39:00,340 --> 01:39:03,260 Python is actually one of the more elegant languages, 2218 01:39:03,260 --> 01:39:05,380 even though after a while this indenting, 2219 01:39:05,380 --> 01:39:09,660 and when you get too far in, it gets a little bit complex. 2220 01:39:09,660 --> 01:39:11,780 But this is a good way to visualize this with these 2221 01:39:11,780 --> 01:39:13,020 indents. 2222 01:39:13,020 --> 01:39:16,820 Coming up next, we're going to talk about some more complex 2223 01:39:16,820 --> 01:39:17,940 conditional structures. 2224 01:39:22,180 --> 01:39:23,140 So welcome back. 2225 01:39:23,140 --> 01:39:25,460 Let's talk a little bit more about some more complex 2226 01:39:25,460 --> 01:39:27,700 conditional statements that sort of build 2227 01:39:27,700 --> 01:39:30,140 on this concept of if and if then else. 2228 01:39:30,140 --> 01:39:31,660 The first thing we're going to look at 2229 01:39:31,660 --> 01:39:35,220 is the multi-way branch. 
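As a rough sketch of the two-branch if then else described above: exactly one of the two indented blocks runs, never both. The comparison with 2 and the word "Bigger" follow the lecture; the function name compare_to_two, the value 4, and the "Smaller" message are assumptions made for illustration.

```python
# A minimal sketch of an if / else; only one branch ever runs.
def compare_to_two(x):
    if x > 2:
        result = 'Bigger'     # runs when the question is true
    else:
        result = 'Smaller'    # runs when the question is false
    return result


print(compare_to_two(4))
print('All done')             # deindented: runs after either branch
```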
2230 01:39:35,220 --> 01:39:37,740 And so the idea is it's kind of like the if then else 2231 01:39:37,740 --> 01:39:39,380 where you're going to pick one of two, 2232 01:39:39,380 --> 01:39:41,500 but now we can pick one of three, or one of four, 2233 01:39:41,500 --> 01:39:43,380 or one of five. 2234 01:39:43,380 --> 01:39:46,060 And it introduces a new concept called the elif. 2235 01:39:46,060 --> 01:39:49,980 The elif is another reserved word inside Python. 2236 01:39:49,980 --> 01:39:51,900 And the way it works is it's probably 2237 01:39:51,900 --> 01:39:54,820 best to look at this here, where it checks the first one, 2238 01:39:54,820 --> 01:39:58,100 and if it's true, then it runs that, and then it's done. 2239 01:39:58,100 --> 01:39:59,260 It doesn't check them all. 2240 01:39:59,260 --> 01:40:02,900 It's not like it sees that there are two logical conditions. 2241 01:40:02,900 --> 01:40:04,620 It actually checks them, the first one, 2242 01:40:04,620 --> 01:40:08,940 and how you order these matters, as we'll see in a bit. 2243 01:40:08,940 --> 01:40:10,540 And so if the first one is true, it 2244 01:40:10,540 --> 01:40:15,900 runs. If the first one is false, and the second one is true, 2245 01:40:15,900 --> 01:40:17,740 it runs this one, and it's done. 2246 01:40:17,740 --> 01:40:21,700 And if neither of them are true, it falls through, 2247 01:40:21,700 --> 01:40:24,260 and there's an else clause that is otherwise, 2248 01:40:24,260 --> 01:40:25,420 and it runs that. 2249 01:40:25,420 --> 01:40:29,820 So basically, it's going to run one and then skip the other two, 2250 01:40:29,820 --> 01:40:35,320 or it is going to skip one, skip two, and then run this one. 2251 01:40:35,320 --> 01:40:37,940 But it only runs, in this case, one of them. 2252 01:40:37,940 --> 01:40:41,340 But the important thing is it checks these questions in order.
2253 01:40:41,340 --> 01:40:43,260 And it doesn't check the second question 2254 01:40:43,260 --> 01:40:45,100 until it finds that the first. 2255 01:40:45,100 --> 01:40:47,740 It doesn't check the second question 2256 01:40:47,740 --> 01:40:50,500 until it knows the first question is false. 2257 01:40:50,500 --> 01:40:52,540 So if the first question is true, you're done. 2258 01:40:52,540 --> 01:40:54,140 You're done, and you're done with this. 2259 01:40:54,140 --> 01:40:56,620 You're done with the whole block at that point. 2260 01:40:56,620 --> 01:41:00,340 So only one of these three is going to execute in that block. 2261 01:41:04,700 --> 01:41:07,380 So here's sort of some examples of this. 2262 01:41:07,380 --> 01:41:10,460 If we, for example, have x equals 0, 2263 01:41:10,460 --> 01:41:11,620 it's going to come down here. 2264 01:41:11,620 --> 01:41:12,580 x is less than 2. 2265 01:41:12,580 --> 01:41:13,580 That's true. 2266 01:41:13,580 --> 01:41:16,020 So it runs this code, and then it skips, skips, skips, 2267 01:41:16,020 --> 01:41:17,020 down to that. 2268 01:41:17,020 --> 01:41:19,180 And so it's like this, runs that code, 2269 01:41:19,180 --> 01:41:21,980 and then skips to the end. 2270 01:41:21,980 --> 01:41:27,460 On the other hand, if it's 5, then this is false, 2271 01:41:27,460 --> 01:41:29,380 and it skips that, and it checks this. 2272 01:41:29,380 --> 01:41:30,660 This is true. 2273 01:41:30,660 --> 01:41:33,580 It runs this code, and then it's done, skips to the end. 2274 01:41:33,580 --> 01:41:39,100 It was like false, true, run, end. 2275 01:41:39,100 --> 01:41:44,100 And then if x is like 20, for example, it runs, it runs, 2276 01:41:44,100 --> 01:41:48,180 false, false, run the else clause, and you're done. 2277 01:41:48,180 --> 01:41:52,620 So skip, skip, else, run that code, and you're done.
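The three walkthroughs above (x of 0, 5, and 20) can be sketched like this. Python spells the keyword elif. The thresholds 2 and 10, the function name size_label, and the label strings are assumptions chosen to match the walkthrough, not a copy of the slide.

```python
# A minimal sketch of a multi-way branch; questions are asked in order
# and the first true one wins.
def size_label(x):
    if x < 2:
        return 'small'      # x = 0 stops here
    elif x < 10:
        return 'Medium'     # x = 5: first question false, second true
    else:
        return 'LARGE'      # x = 20: both questions false


for x in (0, 5, 20):
    print(x, size_label(x))
print('All done')
```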
2278 01:41:52,620 --> 01:41:54,940 So in this case, we ran that, and we didn't run that, 2279 01:41:54,940 --> 01:41:56,300 and we didn't run that. 2280 01:41:56,300 --> 01:41:58,700 Again, one of them is going to run. 2281 01:41:58,700 --> 01:41:59,940 They're checked in order. 2282 01:41:59,940 --> 01:42:03,420 These questions are checked in order, not out of order. 2283 01:42:03,420 --> 01:42:04,500 It doesn't look ahead. 2284 01:42:04,500 --> 01:42:07,220 It just checks in the order that you wrote it. 2285 01:42:07,220 --> 01:42:09,420 You're the one that wrote that order. 2286 01:42:09,420 --> 01:42:11,980 And so there's a couple of variations on this multi-way. 2287 01:42:14,700 --> 01:42:18,060 You can have no else. 2288 01:42:18,060 --> 01:42:20,260 You can have no else, as in this case. 2289 01:42:20,260 --> 01:42:23,860 And this just means that it might not run any of them. 2290 01:42:23,860 --> 01:42:27,260 In this case, x is 5, so it's not less than 2, 2291 01:42:27,260 --> 01:42:28,820 but then it runs this one. 2292 01:42:28,820 --> 01:42:34,340 But if x was like 50, for example, if x was 50, 2293 01:42:34,340 --> 01:42:36,540 then this would be false, then it would skip, 2294 01:42:36,540 --> 01:42:38,540 and this would still be false, and it would skip, 2295 01:42:38,540 --> 01:42:40,020 and neither of these two would run. 2296 01:42:40,020 --> 01:42:41,020 So if you don't have an else, you're 2297 01:42:41,020 --> 01:42:43,020 not guaranteed that one of them is going to run, 2298 01:42:43,020 --> 01:42:44,700 because else is like the catch-all. 2299 01:42:44,700 --> 01:42:46,980 If the other ones are all false, then the else 2300 01:42:46,980 --> 01:42:48,980 is the one that runs. 2301 01:42:48,980 --> 01:42:52,820 Similarly, you can have many elifs, 2302 01:42:52,820 --> 01:42:55,260 but this is where it's really important for you 2303 01:42:55,260 --> 01:42:57,740 to make sure you know what order they're being taken in. 
2304 01:42:57,740 --> 01:43:02,340 So if this is true, it runs. 2305 01:43:02,340 --> 01:43:04,540 It goes all the way to the bottom. 2306 01:43:04,540 --> 01:43:11,260 If it's false, false, false, true, it runs this one, 2307 01:43:11,260 --> 01:43:12,660 and it's done. 2308 01:43:12,660 --> 01:43:17,180 If, on the other hand, it looks at it as false, 2309 01:43:17,180 --> 01:43:19,140 go back, go back. 2310 01:43:19,140 --> 01:43:23,020 If it runs false, false, false, false, they're all false, 2311 01:43:23,020 --> 01:43:24,260 then it runs the else. 2312 01:43:24,260 --> 01:43:25,380 This one has an else. 2313 01:43:25,380 --> 01:43:26,580 This one didn't have an else. 2314 01:43:26,580 --> 01:43:27,660 They don't have to have them. 2315 01:43:27,660 --> 01:43:30,940 The key is you can have more than one of these elifs. 2316 01:43:30,940 --> 01:43:32,620 So I got a couple little things. 2317 01:43:32,620 --> 01:43:37,500 I'll let you pause right now and look at the question is, 2318 01:43:37,500 --> 01:43:43,660 are there looking at the three lines or four lines of code, 2319 01:43:43,660 --> 01:43:45,660 x equals something. 2320 01:43:45,660 --> 01:43:48,460 Are there lines of code that will never execute, 2321 01:43:48,460 --> 01:43:50,340 regardless of the value for x? 2322 01:43:50,340 --> 01:43:52,460 And I'll let you pause and think about it, 2323 01:43:52,460 --> 01:43:54,020 and then I'll explain it to you. 2324 01:43:54,020 --> 01:43:56,340 OK, hopefully you paused and thought about it 2325 01:43:56,340 --> 01:44:00,140 as long as you liked, but so let me now explain it to you. 2326 01:44:00,140 --> 01:44:03,780 So we come in here, and if x is less than or equal to 2, 2327 01:44:03,780 --> 01:44:05,020 it's going to run this first thing. 2328 01:44:05,020 --> 01:44:07,900 And if x is greater than or equal to 2, it's going to run this. 2329 01:44:07,900 --> 01:44:10,300 And if neither of those are true, then it's going to run this. 
2330 01:44:10,300 --> 01:44:13,260 Well, the weird thing is, all numbers 2331 01:44:13,260 --> 01:44:15,900 are either less than 2 or greater than or equal to 2. 2332 01:44:15,900 --> 01:44:17,940 I carefully constructed this to the point 2333 01:44:17,940 --> 01:44:20,700 where it would never run this line of code. 2334 01:44:20,700 --> 01:44:24,140 It is either going to run this one or run that one, 2335 01:44:24,140 --> 01:44:25,700 but it's not going to ever run this one. 2336 01:44:25,700 --> 01:44:27,900 So that was kind of like a weird dysfunctional one 2337 01:44:27,900 --> 01:44:29,140 that I constructed. 2338 01:44:29,140 --> 01:44:31,700 This other one is a little different. 2339 01:44:31,700 --> 01:44:33,700 If x is less than 2, we do this. 2340 01:44:33,700 --> 01:44:35,660 If x is less than 20, we do that. 2341 01:44:35,660 --> 01:44:37,340 If x is less than 10, we do that. 2342 01:44:37,340 --> 01:44:39,060 And if none of those are true, we do that. 2343 01:44:39,060 --> 01:44:41,700 Well, the problem here is between these two lines. 2344 01:44:41,700 --> 01:44:44,180 The problem is, if something's less than 10, like 6, 2345 01:44:44,180 --> 01:44:47,380 for example, it's also less than 20. 2346 01:44:47,380 --> 01:44:49,380 So even though x is less than 20, 2347 01:44:49,380 --> 01:44:52,020 so even though there might be values 2348 01:44:52,020 --> 01:44:54,300 for which this is true, those also 2349 01:44:54,300 --> 01:44:55,260 are going to have this true. 2350 01:44:55,260 --> 01:44:58,380 So for something like 6, it's going to run here. 2351 01:44:58,380 --> 01:45:00,420 And it's not even going to look at this. 2352 01:45:00,420 --> 01:45:01,140 That's the point. 2353 01:45:01,140 --> 01:45:02,620 It doesn't even look at this. 2354 01:45:02,620 --> 01:45:05,540 And so that's, I mean, I could have made this more sensible 2355 01:45:05,540 --> 01:45:08,060 if I'd have moved this little block of code up to there. 
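The ordering puzzle above can be checked directly. In this sketch the comparisons (less than 2, less than 20, less than 10) follow the lecture; the function names first_match and reordered and the returned labels are hypothetical.

```python
# A minimal sketch: the third question can never be reached, because any
# x below 10 was already caught by the "below 20" question before it.
def first_match(x):
    if x < 2:
        return 'Below 2'
    elif x < 20:
        return 'Below 20'
    elif x < 10:
        return 'Below 10'        # never runs, whatever x is
    else:
        return 'Something else'


# Asking the narrower question first makes every branch reachable.
def reordered(x):
    if x < 2:
        return 'Below 2'
    elif x < 10:
        return 'Below 10'
    elif x < 20:
        return 'Below 20'
    else:
        return 'Something else'


print(first_match(6))
print(reordered(6))
```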
2356 01:45:08,060 --> 01:45:12,620 So this is where the order in which you choose your questions, 2357 01:45:12,620 --> 01:45:14,780 the way you put these elifs together, 2358 01:45:14,780 --> 01:45:17,180 matters because it doesn't look at all of them. 2359 01:45:17,180 --> 01:45:19,100 It only looks as long as it can. 2360 01:45:19,100 --> 01:45:21,260 As long as it sees falses, then it 2361 01:45:21,260 --> 01:45:22,500 keeps on going to the next one. 2362 01:45:22,500 --> 01:45:26,580 But as soon as it doesn't see a false, it doesn't continue. 2363 01:45:26,580 --> 01:45:28,900 So the last conditional structure 2364 01:45:28,900 --> 01:45:31,140 we'll talk about is the try and except structure. 2365 01:45:31,140 --> 01:45:35,700 If you know any other languages like C++ or Java or JavaScript, 2366 01:45:35,700 --> 01:45:38,620 you're like, whoa, that's kind of an advanced concept. 2367 01:45:38,620 --> 01:45:42,500 But it turns out in Python, because of Python's propensity 2368 01:45:42,500 --> 01:45:47,380 to throw tracebacks in situations 2369 01:45:47,380 --> 01:45:49,300 where you kind of would like to recover, 2370 01:45:49,300 --> 01:45:51,860 it turns out you kind of have to use it a little more 2371 01:45:51,860 --> 01:45:55,140 and a little earlier in your programming skill. 2372 01:45:55,140 --> 01:45:58,140 So the problem is, what if there is a line of code 2373 01:45:58,140 --> 01:46:00,540 and you absolutely know it's going to make a traceback. 2374 01:46:00,540 --> 01:46:02,700 It's going to blow up. 2375 01:46:02,700 --> 01:46:04,660 But you don't want to blow up. 2376 01:46:04,660 --> 01:46:06,660 I mean, I don't want to have code blow up. 2377 01:46:06,660 --> 01:46:08,120 If you're using my autograder and you 2378 01:46:08,120 --> 01:46:09,620 see a traceback in my autograder, 2379 01:46:09,620 --> 01:46:11,580 that's kind of like I consider that a failure.
2380 01:46:11,580 --> 01:46:14,500 I could put an error like, hey, you entered blank data 2381 01:46:14,500 --> 01:46:16,220 or you didn't enter a number. 2382 01:46:16,220 --> 01:46:18,140 But a traceback, that just seems 2383 01:46:18,140 --> 01:46:19,580 like I'm too lazy as a programmer. 2384 01:46:19,580 --> 01:46:21,380 So we as programmers are supposed 2385 01:46:21,380 --> 01:46:24,080 to anticipate parts of our code that 2386 01:46:24,080 --> 01:46:26,580 are going to blow up potentially based on perhaps the user's 2387 01:46:26,580 --> 01:46:28,780 input and then do something about it. 2388 01:46:28,780 --> 01:46:31,740 And that's what the try and except are for. 2389 01:46:31,740 --> 01:46:33,580 You take this little dangerous piece of code 2390 01:46:33,580 --> 01:46:35,780 that might break and might blow up, 2391 01:46:35,780 --> 01:46:39,560 and you surround it with a try that says, this might blow up. 2392 01:46:39,560 --> 01:46:43,100 And if it fails, run this code down here. 2393 01:46:43,100 --> 01:46:44,300 So that's the try. 2394 01:46:44,300 --> 01:46:46,140 And if you get an exception, the except 2395 01:46:46,140 --> 01:46:48,380 is kind of like if you get an exception. 2396 01:46:48,380 --> 01:46:52,100 And the problem is, if you are running code, 2397 01:46:52,100 --> 01:46:54,580 here's a little bit of code, we put hello bob in 2398 01:46:54,580 --> 01:46:56,260 and we convert it to an integer, and we 2399 01:46:56,260 --> 01:47:01,100 know from past experience that this blows up. 2400 01:47:01,100 --> 01:47:03,220 You can't take hello bob and convert it to an integer. 2401 01:47:03,220 --> 01:47:04,500 It's just going to blow up. 2402 01:47:04,500 --> 01:47:06,860 The problem is, and here we are. 2403 01:47:06,860 --> 01:47:08,340 It says, oh, you blew up on line two. 2404 01:47:08,340 --> 01:47:09,260 That's great. 2405 01:47:09,260 --> 01:47:12,380 And I'm not very happy with hello bob and whatever.
2406 01:47:12,380 --> 01:47:17,660 But the important thing is your program stops. 2407 01:47:17,660 --> 01:47:22,540 These other lines, they don't exist. 2408 01:47:22,540 --> 01:47:24,780 It doesn't go any further. 2409 01:47:24,780 --> 01:47:27,740 Remember, the traceback is Python is really confused, 2410 01:47:27,740 --> 01:47:29,620 and I don't know what to do next. 2411 01:47:29,620 --> 01:47:32,460 So Python is just going to be conservative and stop. 2412 01:47:32,460 --> 01:47:34,900 So Python stops, and your program stops. 2413 01:47:34,900 --> 01:47:37,180 No matter how much error checking you put down here, 2414 01:47:37,180 --> 01:47:38,860 it doesn't matter because it's gone. 2415 01:47:38,860 --> 01:47:40,060 It's all gone. 2416 01:47:40,060 --> 01:47:42,380 And like I said, we take this kind of personally 2417 01:47:42,380 --> 01:47:44,580 because the code that you write is 2418 01:47:44,580 --> 01:47:49,180 like you being put into the computer giving it instructions. 2419 01:47:49,180 --> 01:47:52,620 And if the code blows up, well, that sort of wipes you out. 2420 01:47:52,620 --> 01:47:54,100 You're not in the game anymore. 2421 01:47:54,100 --> 01:47:56,140 You're not able to do anything. 2422 01:47:56,140 --> 01:47:58,460 So we want to be able to, especially 2423 01:47:58,460 --> 01:48:00,500 in these situations where we can anticipate 2424 01:48:00,500 --> 01:48:03,100 that an error that might happen in the normal course 2425 01:48:03,100 --> 01:48:05,980 of your program's execution might be something 2426 01:48:05,980 --> 01:48:07,940 that you want to compensate for. 2427 01:48:07,940 --> 01:48:09,980 And that's what the try and except does. 2428 01:48:09,980 --> 01:48:14,140 So here's a bit of code for the try and except. 2429 01:48:14,140 --> 01:48:16,940 And we just have two little bits of straight line code.
2430 01:48:16,940 --> 01:48:19,020 And so we put a string in here that's hello bob, 2431 01:48:19,020 --> 01:48:21,020 and then we're going to convert it to an integer. 2432 01:48:21,020 --> 01:48:22,060 This is the dangerous code. 2433 01:48:22,060 --> 01:48:23,860 This code, in this case, with hello bob, 2434 01:48:23,860 --> 01:48:25,780 is going to do a traceback. 2435 01:48:25,780 --> 01:48:28,820 And so we say try, and then we indent the dangerous code. 2436 01:48:28,820 --> 01:48:32,100 And then we add this little except bit. 2437 01:48:32,100 --> 01:48:33,940 If it works, the except is ignored. 2438 01:48:33,940 --> 01:48:35,900 If this blows up, it runs the except. 2439 01:48:35,900 --> 01:48:37,480 So in this code, it's going to come in. 2440 01:48:37,480 --> 01:48:39,820 It's going to try this. 2441 01:48:39,820 --> 01:48:41,260 This is going to blow up. 2442 01:48:41,260 --> 01:48:42,620 But instead of giving a traceback, 2443 01:48:42,620 --> 01:48:44,700 it's going to say, oh, I've got an available except. 2444 01:48:44,700 --> 01:48:46,540 I'm going to run this except code, 2445 01:48:46,540 --> 01:48:47,980 and then I'm going to continue on. 2446 01:48:47,980 --> 01:48:49,940 And so that prints out first negative 1. 2447 01:48:49,940 --> 01:48:52,480 So because we set this variable istr to negative 1, 2448 01:48:52,480 --> 01:48:55,860 like a little flag telling us that something went wrong. 2449 01:48:55,860 --> 01:48:57,180 And then we keep on going. 2450 01:48:57,180 --> 01:49:01,020 And now we have put in 1, 2, 3, the digits 1, 2, 3. 2451 01:49:01,020 --> 01:49:02,260 The digits 1, 2, 3. 2452 01:49:02,260 --> 01:49:03,900 And now it's going to work, but we still 2453 01:49:03,900 --> 01:49:05,020 have it in a try block. 2454 01:49:05,020 --> 01:49:06,380 And then this one works. 2455 01:49:06,380 --> 01:49:09,420 It does not blow up, and then ignores the except block.
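The two conversions just described can be sketched as follows. Python spells the keyword except. The helper name to_int_or_flag is hypothetical, and catching ValueError specifically is a choice made for this sketch (the lecture only says "except"); the strings 'Hello Bob' and '123' and the -1 flag come from the walkthrough.

```python
# A minimal sketch: int() raises ValueError on text that is not a number,
# and the except block turns that into a -1 flag instead of a traceback.
def to_int_or_flag(text):
    try:
        value = int(text)     # the one dangerous line
    except ValueError:
        value = -1            # runs only if the conversion failed
    return value


print('First', to_int_or_flag('Hello Bob'))
print('Second', to_int_or_flag('123'))
```

Because the flag value -1 is itself a valid integer, this pattern only makes sense when -1 cannot be a legitimate input.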
2456 01:49:09,420 --> 01:49:11,740 So the except block is only triggered 2457 01:49:11,740 --> 01:49:13,820 when something goes wrong in the code. 2458 01:49:13,820 --> 01:49:16,140 It is ignored if something doesn't go wrong. 2459 01:49:16,140 --> 01:49:18,020 So it's like you bought an insurance policy 2460 01:49:18,020 --> 01:49:19,420 on this line of code. 2461 01:49:19,420 --> 01:49:22,180 And when things go wrong, your except block 2462 01:49:22,180 --> 01:49:24,620 springs into action and does whatever 2463 01:49:24,620 --> 01:49:28,060 it is that you want it to do in the case of an error. 2464 01:49:28,060 --> 01:49:30,180 So that's a pretty useful thing. 2465 01:49:30,180 --> 01:49:31,580 You got to be a little bit careful 2466 01:49:31,580 --> 01:49:34,300 that you don't overuse it, because if you put more 2467 01:49:34,300 --> 01:49:36,820 than one line inside the try part, 2468 01:49:36,820 --> 01:49:39,340 and one of the lines blows up, it 2469 01:49:39,340 --> 01:49:41,260 doesn't come back to the try block. 2470 01:49:41,260 --> 01:49:44,860 And so in this one here, we have kind of a simple, silly one 2471 01:49:44,860 --> 01:49:47,540 where we set the string, we're worried about some stuff. 2472 01:49:47,540 --> 01:49:49,060 Well, the print statement's never going to blow up, 2473 01:49:49,060 --> 01:49:51,820 so it's a bad idea to put it in try except anyways. 2474 01:49:51,820 --> 01:49:55,060 Then we do this conversion, and that's the dangerous part. 2475 01:49:55,060 --> 01:49:58,140 And in this one, it's going to blow up. 2476 01:49:58,140 --> 01:50:00,700 And so then it's going to go to the except block, 2477 01:50:00,700 --> 01:50:02,980 and then run the except block, and then continue. 2478 01:50:02,980 --> 01:50:05,380 What it does not do, what it doesn't do, 2479 01:50:05,380 --> 01:50:07,740 is somehow go back and finish this. 2480 01:50:07,740 --> 01:50:09,740 So these lines are gone.
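A small sketch of that pitfall: once a line inside the try block fails, the rest of the block is skipped for good. The function name run_block, the log list, and the 'Hello', 'There', 'Done' markers are assumptions used to make the skipped line visible.

```python
# A minimal sketch: whichever line raises is the last line of the try
# block that executes; control never returns to finish the block.
def run_block(text):
    log = []
    try:
        log.append('Hello')    # safe, did not need to be inside the try
        value = int(text)      # the only line that can actually fail here
        log.append('There')    # skipped whenever the line above fails
    except ValueError:
        value = -1
    log.append('Done')
    return log, value


print(run_block('Bob'))
print(run_block('123'))
```

Keeping only the conversion inside the try block, and moving the harmless lines outside, avoids losing them when the conversion fails.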
2481 01:50:09,740 --> 01:50:13,220 So if you look at it like this, this works, the try starts. 2482 01:50:13,220 --> 01:50:16,940 Hello, this blows up, it goes to the except, 2483 01:50:16,940 --> 01:50:19,060 it runs the except, and it continues on. 2484 01:50:19,060 --> 01:50:21,780 Never runs that code. 2485 01:50:21,780 --> 01:50:26,140 So it's not like you took out an insurance on the whole block. 2486 01:50:26,140 --> 01:50:28,140 Any of those lines can blow up in the block, 2487 01:50:28,140 --> 01:50:29,660 but whichever line blows up, that 2488 01:50:29,660 --> 01:50:34,500 is the last line that's executing in that block. 2489 01:50:34,500 --> 01:50:37,740 So you tend to want, in this particular example, 2490 01:50:37,740 --> 01:50:40,300 you would probably, the print statement would go out there, 2491 01:50:40,300 --> 01:50:42,140 and this print statement would come down here, 2492 01:50:42,140 --> 01:50:44,540 and you would only put in your try block 2493 01:50:44,540 --> 01:50:47,060 the single line of code that you think might blow up, 2494 01:50:47,060 --> 01:50:48,780 because you kind of know print statements 2495 01:50:48,780 --> 01:50:50,180 aren't going to blow up. 2496 01:50:50,180 --> 01:50:53,660 So this is an example that's a more common real world 2497 01:50:53,660 --> 01:50:57,660 example, where the user is going to type some data, 2498 01:50:57,660 --> 01:50:59,740 and it's users that get us in trouble. 2499 01:50:59,740 --> 01:51:03,020 So our program starts by asking the user to enter a number, 2500 01:51:03,020 --> 01:51:05,100 and we know that this could be dangerous. 2501 01:51:05,100 --> 01:51:09,540 So we're going to put the conversion from string 2502 01:51:09,540 --> 01:51:11,580 to integer in a try block, and we're 2503 01:51:11,580 --> 01:51:14,220 going to set negative 1 if that's a failure.
2504 01:51:14,220 --> 01:51:16,700 And then if it's greater than 0, we'll say nice work, 2505 01:51:16,700 --> 01:51:18,780 and if it's less than 0, well, not a number. 2506 01:51:18,780 --> 01:51:20,820 So first time we run this program, 2507 01:51:20,820 --> 01:51:22,860 out comes enter a number. 2508 01:51:22,860 --> 01:51:25,500 We type in 42, which is a string. 2509 01:51:25,500 --> 01:51:29,100 That 42 goes back into rawstr, runs in here. 2510 01:51:29,100 --> 01:51:29,940 This runs. 2511 01:51:29,940 --> 01:51:30,740 It's fine. 2512 01:51:30,740 --> 01:51:33,940 That becomes a 42 number, so we skip the except block, 2513 01:51:33,940 --> 01:51:35,180 and iVal is greater than 0. 2514 01:51:35,180 --> 01:51:38,820 We print out nice work, and we skip the else. 2515 01:51:38,820 --> 01:51:41,420 So it says nice work. 2516 01:51:41,420 --> 01:51:46,180 On the other hand, if we run it again this time, 2517 01:51:46,180 --> 01:51:48,900 the input says enter a number, and we're silly. 2518 01:51:48,900 --> 01:51:53,620 We enter the word 42, but in words, forty, F-O-U-R-T-Y. 2519 01:51:53,620 --> 01:51:55,860 So that's a string, and that goes into rawstr. 2520 01:51:55,860 --> 01:51:57,220 And then the execution continues. 2521 01:51:57,220 --> 01:52:01,260 We run in here, and now this is going to blow up. 2522 01:52:01,260 --> 01:52:02,220 That's going to blow up. 2523 01:52:02,220 --> 01:52:04,980 Normally, we would see a traceback right there. 2524 01:52:04,980 --> 01:52:06,140 There would be a traceback. 2525 01:52:06,140 --> 01:52:08,420 But we're not going to, because we put this calculation 2526 01:52:08,420 --> 01:52:09,860 in a try and except block. 2527 01:52:09,860 --> 01:52:11,700 It's going to immediately run the except block, 2528 01:52:11,700 --> 01:52:14,580 set iVal to negative 1, continue on with the program, 2529 01:52:14,580 --> 01:52:16,980 see you are not blown up at this point.
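The number-checking program being walked through here can be sketched as a function, so it runs without waiting for keyboard input. In the lecture the string comes from the user typing at an "enter a number" prompt; here it is passed in as an argument instead. The function name check_number and the local names rawstr and ival are assumptions, as is catching ValueError specifically.

```python
# A minimal sketch of guarding a user-supplied string before using it.
def check_number(rawstr):
    try:
        ival = int(rawstr)    # fails for text such as 'forty-two'
    except ValueError:
        ival = -1             # flag value meaning "conversion failed"

    if ival > 0:
        return 'Nice work'
    else:
        return 'Not a number'


print(check_number('42'))
print(check_number('forty-two'))
```

Note that with this flag approach, zero and negative inputs also land in the else branch, so the sketch only suits programs that expect positive numbers.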
2530 01:52:16,980 --> 01:52:19,220 And if iVal is greater than 0, well, it's negative 1, 2531 01:52:19,220 --> 01:52:21,620 so we're going to hit the else clause and print out not 2532 01:52:21,620 --> 01:52:22,300 a number. 2533 01:52:22,300 --> 01:52:24,580 So we've done error detection. 2534 01:52:24,580 --> 01:52:27,580 The user set something that caused a line of our code 2535 01:52:27,580 --> 01:52:29,580 to kind of blow up, but we put that line 2536 01:52:29,580 --> 01:52:31,980 in a try and except block, and so we caught it. 2537 01:52:31,980 --> 01:52:36,100 And so we dealt with that fact. 2538 01:52:36,100 --> 01:52:39,140 So in summary in this, we talked about if statements. 2539 01:52:39,140 --> 01:52:40,340 We talked about else. 2540 01:52:40,340 --> 01:52:44,180 We talked about try and except, how important indentation 2541 01:52:44,180 --> 01:52:48,100 is to mark blocks where they begin and end, 2542 01:52:48,100 --> 01:52:50,380 and an elif, and try except. 2543 01:52:50,380 --> 01:52:54,420 So up next, we're going to talk about loops and iteration. 2544 01:52:58,940 --> 01:53:01,780 Hello, and welcome to chapter four, functions. 2545 01:53:01,780 --> 01:53:04,340 This is the fourth of our basic patterns. 2546 01:53:04,340 --> 01:53:05,700 We'll get to iterations next. 2547 01:53:05,700 --> 01:53:07,740 Functions is the store and reuse. 2548 01:53:07,740 --> 01:53:10,980 One of the things in programming is that we never 2549 01:53:10,980 --> 01:53:12,300 like to repeat ourselves. 2550 01:53:12,300 --> 01:53:14,820 We don't like to, if we have four or five lines of code, 2551 01:53:14,820 --> 01:53:16,780 and we're going to do the same thing later, 2552 01:53:16,780 --> 01:53:19,740 we don't like to put the same four lines of code in, 2553 01:53:19,740 --> 01:53:24,380 even if it has to do with reliability.
2554 01:53:24,380 --> 01:53:26,580 If you find something wrong with those four lines of code 2555 01:53:26,580 --> 01:53:31,420 and you got them 12 different places in your program, 2556 01:53:31,420 --> 01:53:33,260 then you got to find all 12 places and fix them. 2557 01:53:33,260 --> 01:53:34,740 So we're like, collect those to one place 2558 01:53:34,740 --> 01:53:36,700 and then call them and reuse them, 2559 01:53:36,700 --> 01:53:38,700 and that's the idea of store and reuse. 2560 01:53:39,580 --> 01:53:44,140 So this is how functions work inside of Python. 2561 01:53:44,140 --> 01:53:47,100 The first thing we notice is there is a new keyword def 2562 01:53:47,100 --> 01:53:49,500 that stands for define function, 2563 01:53:49,500 --> 01:53:51,980 and the def is like an if statement 2564 01:53:51,980 --> 01:53:55,940 or we'll see fors and whiles that they end in a colon, 2565 01:53:55,940 --> 01:53:57,500 and then they have an indented block 2566 01:53:57,500 --> 01:53:59,300 and then the indented block deindents, 2567 01:53:59,300 --> 01:54:01,500 and that's the end of the function. 2568 01:54:01,500 --> 01:54:05,500 And so there's two statements make up this function. 2569 01:54:06,500 --> 01:54:09,740 The key thing that you have to understand and get used to 2570 01:54:09,740 --> 01:54:13,620 is this def part is actually not running any code whatsoever. 2571 01:54:13,620 --> 01:54:15,380 It's actually remembering the code, 2572 01:54:15,380 --> 01:54:17,220 and that's what I call the store phase. 2573 01:54:17,220 --> 01:54:22,220 The def creates a bit of code and records it like a macro, 2574 01:54:22,580 --> 01:54:24,980 although it's much more complex than a macro, 2575 01:54:24,980 --> 01:54:26,620 and it names it whatever you chose. 2576 01:54:26,620 --> 01:54:27,580 You gave it a name. 2577 01:54:27,580 --> 01:54:29,340 We named this one thing. 
2578 01:54:29,340 --> 01:54:33,680 And so it has a side effect of Python reading 2579 01:54:33,680 --> 01:54:35,980 or parsing these three lines. 2580 01:54:35,980 --> 01:54:38,820 It doesn't do anything, but it remembers. 2581 01:54:38,820 --> 01:54:42,340 These two lines are what you would like to run 2582 01:54:42,340 --> 01:54:44,300 when you invoke thing. 2583 01:54:44,300 --> 01:54:46,300 So this is the definition of a function, 2584 01:54:46,300 --> 01:54:48,700 and this is the invoking of the function. 2585 01:54:48,700 --> 01:54:52,580 But so this doesn't do anything. 2586 01:54:52,580 --> 01:54:55,380 So there's no output here from that stuff right there. 2587 01:54:55,380 --> 01:54:57,780 But then what happens is you invoke it. 2588 01:54:57,780 --> 01:55:00,280 And this thing looks like it's part of Python, 2589 01:55:00,280 --> 01:55:02,340 but you have in effect extended Python 2590 01:55:02,340 --> 01:55:04,140 with your def statement. 2591 01:55:04,140 --> 01:55:08,020 And so when it sees thing, it goes up and runs your code. 2592 01:55:08,020 --> 01:55:10,180 And so out comes hello fun, 2593 01:55:10,180 --> 01:55:14,500 and then it comes back and goes to the next line. 2594 01:55:14,500 --> 01:55:16,900 Does print, so print comes out. 2595 01:55:16,900 --> 01:55:18,060 And then it goes back and like, oh, 2596 01:55:18,060 --> 01:55:19,900 this is the reuse part, but we get to reuse it. 2597 01:55:19,900 --> 01:55:21,680 We define it once and we use it twice. 2598 01:55:21,680 --> 01:55:23,220 Then it runs this code again, 2599 01:55:23,220 --> 01:55:24,920 and it goes to the next line and it's all done. 2600 01:55:24,920 --> 01:55:27,420 So this little bit came out twice. 2601 01:55:27,420 --> 01:55:28,780 And of course this is really simple 2602 01:55:28,780 --> 01:55:30,860 so that I can fit it on a page. 2603 01:55:30,860 --> 01:55:33,780 But you get the idea that I don't want to repeat.
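The store and reuse steps just described look roughly like this. The function name thing and the hello and fun output follow the lecture; the 'Zip' text printed between the two calls is a placeholder assumption, since the transcript only says that a print happens there.

```python
# Store phase: def remembers these two lines under the name thing.
# Nothing is printed while the definition is being read.
def thing():
    print('Hello')
    print('Fun')


# Reuse phase: each call runs the stored lines again.
thing()
print('Zip')    # placeholder text, an assumption
thing()
```

Running this prints Hello and Fun, then Zip, then Hello and Fun a second time: defined once, used twice.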
2604 01:55:33,780 --> 01:55:37,140 This might be 15 to 100 lines of code, 2605 01:55:37,140 --> 01:55:39,420 and I don't want to type those over and over again. 2606 01:55:39,420 --> 01:55:44,420 So I say, hey, store these in a name that I choose, 2607 01:55:44,520 --> 01:55:46,660 and then when I invoke them, 2608 01:55:46,660 --> 01:55:50,060 bring them back and then run them again, okay? 2609 01:55:50,060 --> 01:55:52,460 So that's the basic idea. 2610 01:55:52,460 --> 01:55:54,140 We actually have already been using functions 2611 01:55:54,140 --> 01:55:54,980 from the beginning. 2612 01:55:54,980 --> 01:55:56,660 The print is a function, right? 2613 01:55:56,660 --> 01:55:57,860 Print is a function. 2614 01:55:57,860 --> 01:56:01,140 Every time we see print, P-R-I-N-T, 2615 01:56:01,140 --> 01:56:03,540 parentheses, and then we have some stuff in here, 2616 01:56:03,540 --> 01:56:05,300 we are calling the print function. 2617 01:56:05,300 --> 01:56:08,780 This is the syntax with two little parentheses, 2618 01:56:08,780 --> 01:56:10,500 is the syntax for functions. 2619 01:56:11,820 --> 01:56:14,940 And so input's a function, type is a function, 2620 01:56:14,940 --> 01:56:17,340 float's a function, int's a function. 2621 01:56:17,340 --> 01:56:19,640 All these things are built-in functions 2622 01:56:19,640 --> 01:56:24,640 that come with Python at the moment that we started. 2623 01:56:24,660 --> 01:56:27,360 I mean, we installed Python and these came along. 2624 01:56:27,360 --> 01:56:31,580 And then there's other functions that we define and use, 2625 01:56:31,580 --> 01:56:33,420 and that's what the def is for. 2626 01:56:33,420 --> 01:56:37,380 And in effect we can create new reserved words 2627 01:56:37,380 --> 01:56:40,260 of our own making that extend the Python language 2628 01:56:40,260 --> 01:56:42,600 after we define the function. 
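The store-and-reuse walkthrough above can be sketched roughly as follows. The function name thing and the Hello / Fun output come from the lecture; the string 'Zip' printed between the two calls is an assumption standing in for whatever the slide prints there.

```python
# Store phase: def only records the two indented lines under the
# name `thing`. Nothing is printed while Python reads the def.
def thing():
    print('Hello')
    print('Fun')

# Reuse phase: each call runs the stored body, then execution
# continues on the line after the call.
thing()
print('Zip')   # assumed placeholder for the print between the calls
thing()
```

Defining once and calling twice is the whole point: if the body needs fixing, there is only one place to fix it.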
2629 01:56:43,620 --> 01:56:45,740 So it's just this bit of reusable code 2630 01:56:45,740 --> 01:56:46,940 that takes some arguments. 2631 01:56:46,940 --> 01:56:48,140 We haven't seen any with arguments. 2632 01:56:48,140 --> 01:56:49,100 There's a little parentheses 2633 01:56:49,100 --> 01:56:50,820 and we'll see how that works in a bit. 2634 01:56:50,820 --> 01:56:54,420 We define using the def keyword and then we invoke it. 2635 01:56:54,420 --> 01:56:55,700 There's the defining phase, 2636 01:56:55,700 --> 01:56:56,780 which actually doesn't run the code, 2637 01:56:56,780 --> 01:56:57,780 it just remembers the code. 2638 01:56:57,780 --> 01:56:59,620 And then there's the invoking phase. 2639 01:56:59,620 --> 01:57:02,780 You define it once and then invoke it one or more times. 2640 01:57:02,780 --> 01:57:05,340 Calling the function or invoking the function, 2641 01:57:05,340 --> 01:57:07,940 we think of those two things as the same thing, 2642 01:57:07,940 --> 01:57:11,080 call, invoke, are just the terms we use. 2643 01:57:11,080 --> 01:57:12,900 Most people just say call the function, 2644 01:57:12,900 --> 01:57:15,480 but invoking is a perhaps more descriptive way 2645 01:57:15,480 --> 01:57:16,720 to think about it. 2646 01:57:16,720 --> 01:57:19,500 So here's an example of a function. 2647 01:57:19,500 --> 01:57:20,620 It is built into Python. 2648 01:57:20,620 --> 01:57:22,260 It's called the max function. 2649 01:57:22,260 --> 01:57:25,420 And we can pass some parameters into the max function. 2650 01:57:25,420 --> 01:57:27,620 So we pass the hello world string. 2651 01:57:27,620 --> 01:57:29,140 Now, like much of Python, 2652 01:57:29,140 --> 01:57:32,660 max knows what kind of thing is being passed into it. 2653 01:57:32,660 --> 01:57:34,660 And it knows that it's looking for 2654 01:57:34,660 --> 01:57:36,060 the largest character, 2655 01:57:36,060 --> 01:57:40,900 the lexicographically largest character.
2656 01:57:40,900 --> 01:57:43,220 And in this case, it scans this little, 2657 01:57:43,220 --> 01:57:44,620 that's inside the max code, 2658 01:57:44,620 --> 01:57:46,900 it scans through and finds the largest character. 2659 01:57:46,900 --> 01:57:48,920 So apparently lowercase letters 2660 01:57:48,920 --> 01:57:50,820 are higher than uppercase letters 2661 01:57:50,820 --> 01:57:53,820 because in English we get back a lowercase w. 2662 01:57:53,820 --> 01:57:56,680 And so this is what's called the return value. 2663 01:57:56,680 --> 01:57:59,020 So this is an assignment statement. 2664 01:57:59,020 --> 01:58:00,960 Let me clear this and start over. 2665 01:58:00,960 --> 01:58:02,660 So this is an assignment statement. 2666 01:58:02,660 --> 01:58:05,180 So it has to evaluate this right-hand side. 2667 01:58:05,180 --> 01:58:08,460 And a function call is nothing more than like x plus one. 2668 01:58:08,460 --> 01:58:10,180 It's something to evaluate. 2669 01:58:10,180 --> 01:58:11,700 It runs the function code, 2670 01:58:11,700 --> 01:58:13,100 passes in this argument, 2671 01:58:13,100 --> 01:58:14,820 and then this residual value, 2672 01:58:14,820 --> 01:58:16,060 this so-called return value, 2673 01:58:16,060 --> 01:58:17,820 we'll look at this in more detail, 2674 01:58:17,820 --> 01:58:22,020 becomes the result of this little bit in the expression 2675 01:58:22,020 --> 01:58:23,180 and there's nothing else. 2676 01:58:23,180 --> 01:58:25,660 We could have w plus one or something. 2677 01:58:25,660 --> 01:58:29,100 And then the w is what's stored into big. 2678 01:58:29,100 --> 01:58:31,940 Okay, so we print big and big is a variable 2679 01:58:31,940 --> 01:58:34,660 that has the letter w inside of it. 2680 01:58:34,660 --> 01:58:36,700 And then we ask what is the smallest 2681 01:58:36,700 --> 01:58:37,940 and that finds the blank. 2682 01:58:37,940 --> 01:58:39,620 And so we get a blank to see this.
2683 01:58:39,620 --> 01:58:41,740 There's a min function and a max function. 2684 01:58:41,740 --> 01:58:43,420 Both of these are built-in. 2685 01:58:45,540 --> 01:58:46,700 These are built-in functions. 2686 01:58:46,700 --> 01:58:48,140 They're always there for us. 2687 01:58:50,180 --> 01:58:55,180 Okay, so here is another example of the max function. 2688 01:58:55,300 --> 01:58:58,220 And so we can think of this as invoking 2689 01:58:58,220 --> 01:58:59,420 or calling this function 2690 01:58:59,420 --> 01:59:01,900 as this right-hand side is being evaluated. 2691 01:59:02,740 --> 01:59:04,500 We are passing this variable in 2692 01:59:04,500 --> 01:59:06,100 and there's some code in here 2693 01:59:06,100 --> 01:59:08,420 and it's gonna do some stuff, yada, yada, yada, 2694 01:59:08,420 --> 01:59:12,340 and then it's gonna give us back a bit of stuff. 2695 01:59:12,340 --> 01:59:14,260 And that's its return value 2696 01:59:14,260 --> 01:59:17,300 and then that goes up into the big, right? 2697 01:59:17,300 --> 01:59:19,380 And so that's how this works. 2698 01:59:19,380 --> 01:59:21,740 And so this is actually built-in. 2699 01:59:23,700 --> 01:59:26,300 Built-in or burnt-in, I guess I can't draw. 2700 01:59:26,300 --> 01:59:30,340 And so you can think of this as some time a long time ago 2701 01:59:30,340 --> 01:59:32,740 when Python was being first formed, 2702 01:59:32,740 --> 01:59:34,460 somebody wrote some code. 2703 01:59:34,460 --> 01:59:35,900 And it's got some stuff in it. 2704 01:59:35,900 --> 01:59:38,860 It's got a little loop that reads through all the letters. 2705 01:59:38,860 --> 01:59:41,140 It has to figure out if it's a string or a list, 2706 01:59:41,140 --> 01:59:42,460 et cetera, et cetera, et cetera. 2707 01:59:42,460 --> 01:59:47,500 But this is store, except you didn't do the storing 2708 01:59:47,500 --> 01:59:48,500 because it's already built-in. 2709 01:59:48,500 --> 01:59:51,220 And then this is the reuse, store and reuse. 
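As a small sketch of the max and min calls described above: the variable name big is from the lecture, while tiny and the exact capitalization of the 'Hello world' string are assumptions made for illustration.

```python
# The right-hand side is evaluated first: max is called with the
# string as its argument, and its return value is stored in big.
big = max('Hello world')
print(big)    # w  (lowercase letters sort after uppercase ones)

# min scans the same characters and returns the smallest one,
# which here is the space between the two words.
tiny = min('Hello world')
print(repr(tiny))   # ' '
```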
2710 01:59:51,220 --> 01:59:52,960 So we build these things into Python. 2711 01:59:52,960 --> 01:59:54,620 They're already pre-built 2712 01:59:54,620 --> 01:59:57,180 as if before the first line of your code executes 2713 01:59:57,180 --> 02:00:00,940 way up here, someone put all this code in for you 2714 02:00:00,940 --> 02:00:04,380 into Python and created a thing called max for you. 2715 02:00:05,740 --> 02:00:08,520 Now we've been using this already, built-in functions. 2716 02:00:08,520 --> 02:00:10,080 We've got type conversions. 2717 02:00:10,080 --> 02:00:13,580 We've got like the float that takes an integer 2718 02:00:13,580 --> 02:00:17,120 and returns a floating point version of that. 2719 02:00:17,120 --> 02:00:19,660 And again, this is kind of like an expression. 2720 02:00:19,660 --> 02:00:22,140 So it's like, I wanna divide this by 100, 2721 02:00:22,140 --> 02:00:24,820 but before I do that, I've gotta convert it to a float. 2722 02:00:24,820 --> 02:00:27,940 So it has to sort of do these function calls 2723 02:00:27,940 --> 02:00:32,460 as it's evaluating the expression, okay? 2724 02:00:32,460 --> 02:00:35,740 Sometimes like here, we just have, 2725 02:00:35,740 --> 02:00:38,560 we just have a print that prints out the return value. 2726 02:00:38,560 --> 02:00:39,400 That's what this is. 2727 02:00:39,400 --> 02:00:40,460 This is the return value. 2728 02:00:40,460 --> 02:00:42,940 If you just type a function and a parameter, 2729 02:00:42,940 --> 02:00:45,260 it can be a constant or it can be a variable. 2730 02:00:45,260 --> 02:00:46,300 And as we'll see in a second, 2731 02:00:46,300 --> 02:00:48,380 we'll give you many of these if you like. 2732 02:00:48,380 --> 02:00:50,100 So you can either just run it 2733 02:00:50,100 --> 02:00:53,660 or take the result of this, this passes an integer in, 2734 02:00:53,660 --> 02:00:57,780 converts it to a float and then puts the float into that.
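A minimal sketch of the float conversions just described. The specific numbers (99 and 42) and the variable names i and f are assumptions chosen for illustration, not values confirmed by the transcript.

```python
# float() is called while the expression is being evaluated:
# the integer 99 becomes 99.0 before the division happens.
print(float(99) / 100)    # 0.99

# The return value can also be stored in a variable.
i = 42
f = float(i)              # pass an integer in, get a float back
print(f)                  # 42.0
print(type(f))            # <class 'float'>
```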
2735 02:00:57,780 --> 02:00:59,700 Type tells us what kind of thing that is 2736 02:00:59,700 --> 02:01:02,140 and you can use this inside of an expression. 2737 02:01:02,140 --> 02:01:03,820 And so it's like, what am I gonna do first? 2738 02:01:03,820 --> 02:01:05,820 Oh, I've gotta do two times this thing. 2739 02:01:05,820 --> 02:01:09,020 Oh, wait a sec, pause just briefly for a moment, 2740 02:01:09,020 --> 02:01:13,580 call out to some float code, pass a three into it 2741 02:01:13,580 --> 02:01:17,020 and then something comes back, the return value, 2742 02:01:17,020 --> 02:01:18,740 the residual value comes back 2743 02:01:18,740 --> 02:01:20,420 and then that participates, 2744 02:01:20,420 --> 02:01:22,600 in this case it's gonna be 3.0, 2745 02:01:22,600 --> 02:01:26,420 participates in this two times 3.0, okay? 2746 02:01:26,420 --> 02:01:29,660 And so two times 3.0 ends up being 6.0, et cetera, et cetera. 2747 02:01:29,660 --> 02:01:31,340 But you can see as it, it's like, 2748 02:01:31,340 --> 02:01:33,140 oh, wait a sec, I gotta figure out what this is, 2749 02:01:33,140 --> 02:01:34,740 call the function, get the return value 2750 02:01:34,740 --> 02:01:37,660 and then continue processing this expression. 2751 02:01:39,780 --> 02:01:42,020 We've also done this with string conversions, 2752 02:01:42,020 --> 02:01:44,780 partly because just as an example, 2753 02:01:44,780 --> 02:01:46,420 the input always returns a string, 2754 02:01:46,420 --> 02:01:48,780 the input function returns a string. 2755 02:01:48,780 --> 02:01:51,100 And so, you know, here's this string, 2756 02:01:51,100 --> 02:01:54,860 could be coming from input, but we'll just take one, two, three. 2757 02:01:54,860 --> 02:01:58,060 We know that that's a string, it's not the number 123. 
2758 02:01:58,060 --> 02:02:01,160 And if we try to add one to it, we get a traceback, 2759 02:02:01,160 --> 02:02:06,160 cannot concatenate string and integer, traceback, 2760 02:02:06,200 --> 02:02:08,020 but we can convert that string to an integer. 2761 02:02:08,020 --> 02:02:10,980 And so int can take like a floating point number 2762 02:02:10,980 --> 02:02:13,940 or an integer or even a string and it says, 2763 02:02:13,940 --> 02:02:15,180 oh, I know what I'm supposed to do with string, 2764 02:02:15,180 --> 02:02:18,180 I'm supposed to look at this, interpret these as numbers 2765 02:02:18,180 --> 02:02:20,700 and, you know, multiply by 10 2766 02:02:20,700 --> 02:02:22,260 and figure out what the hundreds place is 2767 02:02:22,260 --> 02:02:23,740 and all that stuff, there's a little bit of work to that 2768 02:02:23,740 --> 02:02:26,060 and it does it, but then it gives us back an integer 2769 02:02:26,060 --> 02:02:27,300 and we say, oh, what is that? 2770 02:02:27,300 --> 02:02:31,100 That's now the 123, but it is of type int. 2771 02:02:31,100 --> 02:02:34,260 And now we can add one to it and get 124. 2772 02:02:34,260 --> 02:02:37,340 And as before from this example that we're kind of reusing 2773 02:02:37,340 --> 02:02:42,340 from a previous chapter, you don't want to try to convert, 2774 02:02:42,340 --> 02:02:46,660 oops, sad face, sad face, sad face. 2775 02:02:46,660 --> 02:02:47,900 Don't want to try to convert something 2776 02:02:47,900 --> 02:02:49,420 that doesn't have digits using int 2777 02:02:49,420 --> 02:02:51,700 because it'll say, I don't know what to do 2778 02:02:51,700 --> 02:02:54,500 and then your program quits, right? 2779 02:02:54,500 --> 02:02:57,340 You don't want your program to stop, tracebacks 2780 02:02:57,340 --> 02:03:00,260 and you can of course deal with that with try and except, 2781 02:03:00,260 --> 02:03:02,820 but that's like a previous lecture.
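The conversions walked through above can be sketched like this. The names sval, ival, and result and the 'hello bob' string are assumptions for illustration; the try/except wrapping is only there so the sketch runs to the end instead of stopping at the first traceback.

```python
# A function call inside an expression: Python pauses, calls
# float(3), gets 3.0 back, and continues with 2 * 3.0.
result = 2 * float(3)
print(result)             # 6.0

sval = '123'              # a string, as input() would return
try:
    print(sval + 1)       # mixing str and int is not allowed
except TypeError as err:
    print('TypeError:', err)

ival = int(sval)          # int() interprets the digits
print(type(ival))         # <class 'int'>
print(ival + 1)           # 124

try:
    int('hello bob')      # no digits to interpret
except ValueError as err:
    print('ValueError:', err)
```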
2782 02:03:02,820 --> 02:03:05,500 Okay, so up next, we're gonna talk about building 2783 02:03:05,500 --> 02:03:12,500 our own functions, not just using the predefined ones. 2784 02:03:12,940 --> 02:03:14,380 So welcome back, we're gonna continue 2785 02:03:14,380 --> 02:03:18,280 and start talking about building our own functions. 2786 02:03:18,280 --> 02:03:22,620 So again, we use the def keyword to define a function 2787 02:03:22,620 --> 02:03:25,220 and then later we're gonna invoke this 2788 02:03:25,220 --> 02:03:26,660 and there's a bit to it. 2789 02:03:26,660 --> 02:03:28,620 We are defining the name of the function 2790 02:03:28,620 --> 02:03:30,340 and in effect we're extending Python 2791 02:03:30,340 --> 02:03:33,020 and creating new predefined things that we can use 2792 02:03:33,020 --> 02:03:34,460 except it's our code. 2793 02:03:34,460 --> 02:03:37,500 It starts with a def keyword, has some optional arguments 2794 02:03:37,500 --> 02:03:39,860 which we'll see in a bit, that's what the parenthesis is 2795 02:03:39,860 --> 02:03:41,800 and then the name, and the function names follow 2796 02:03:41,800 --> 02:03:44,420 the same rules as variable names 2797 02:03:44,420 --> 02:03:46,620 and then you have an indented block, 2798 02:03:46,620 --> 02:03:47,820 whatever code you want to do 2799 02:03:47,820 --> 02:03:49,260 and then you have a deindented block 2800 02:03:49,260 --> 02:03:52,220 and that sort of defines the essence. 2801 02:03:52,220 --> 02:03:56,260 The key thing here is this is not calling, 2802 02:03:57,720 --> 02:03:59,860 it's not invoking, it's not executing, 2803 02:03:59,860 --> 02:04:03,540 it's remembering, it's storing, it's figuring things out. 2804 02:04:03,540 --> 02:04:06,860 So here is the output of a program that defines a function 2805 02:04:06,860 --> 02:04:08,100 but then doesn't use it. 2806 02:04:08,100 --> 02:04:10,620 So this is a sort of broken function.
2807 02:04:10,620 --> 02:04:12,860 So here we go, we start x equals five print. 2808 02:04:12,860 --> 02:04:15,580 You don't have to have all the defs at the beginning. 2809 02:04:15,580 --> 02:04:16,980 The def runs whenever. 2810 02:04:16,980 --> 02:04:19,340 So you know, out comes hello 2811 02:04:19,340 --> 02:04:21,740 and then we define a function and this says, 2812 02:04:21,740 --> 02:04:24,020 oh, oh, you wanna make a new thing here. 2813 02:04:24,020 --> 02:04:24,940 So I'll make a new thing. 2814 02:04:24,940 --> 02:04:26,580 It's kinda like a variable in a sense 2815 02:04:26,580 --> 02:04:28,420 and then it copies this stuff, 2816 02:04:28,420 --> 02:04:30,740 copies it up there and says later you probably 2817 02:04:30,740 --> 02:04:33,100 are gonna wanna use this so I'm gonna remember it 2818 02:04:33,100 --> 02:04:35,100 so it doesn't do anything there. 2819 02:04:35,100 --> 02:04:38,740 No output comes out, then it says print yo 2820 02:04:38,740 --> 02:04:41,580 and out comes yo and then it adds two to x 2821 02:04:41,580 --> 02:04:43,740 so x is now seven and then it prints x 2822 02:04:43,740 --> 02:04:45,640 and out comes seven. 2823 02:04:45,640 --> 02:04:48,700 These print statements never ran. 2824 02:04:48,700 --> 02:04:49,880 They never ran, why? 2825 02:04:49,880 --> 02:04:52,180 Because we did not invoke them down here. 2826 02:04:52,180 --> 02:04:55,340 We defined them but didn't invoke them. 2827 02:04:55,340 --> 02:04:58,820 So let's take a look at how you invoke a function, right? 2828 02:04:58,820 --> 02:05:00,560 You define it and then you use it. 2829 02:05:00,560 --> 02:05:02,420 Sometimes you define it once and use it once 2830 02:05:02,420 --> 02:05:04,660 but more commonly you define it once 2831 02:05:04,660 --> 02:05:06,220 and use more than one time. 2832 02:05:06,220 --> 02:05:07,840 Again, the store and reuse pattern.
2833 02:05:07,840 --> 02:05:11,460 The def is the store and the invoking is the reuse. 2834 02:05:12,540 --> 02:05:14,300 So here's just a slightly different version 2835 02:05:14,300 --> 02:05:16,220 of that last program and so now 2836 02:05:16,220 --> 02:05:18,340 it's gonna actually invoke it. 2837 02:05:19,420 --> 02:05:22,260 So x equals five, print hello, def, 2838 02:05:22,260 --> 02:05:23,820 so out comes hello. 2839 02:05:23,820 --> 02:05:27,460 This produces, the def produces no output, right? 2840 02:05:27,460 --> 02:05:29,700 But because there's a deindent here, 2841 02:05:29,700 --> 02:05:32,860 that is the entire blob of the code 2842 02:05:32,860 --> 02:05:34,620 that is part of print lyrics. 2843 02:05:34,620 --> 02:05:38,020 So it prints out yo and now we're gonna invoke. 2844 02:05:38,020 --> 02:05:39,540 This is the call. 2845 02:05:39,540 --> 02:05:40,980 We're gonna call the function. 2846 02:05:40,980 --> 02:05:44,520 Now the function goes up, let's clear this. 2847 02:05:45,680 --> 02:05:47,760 Somewhere down to here. 2848 02:05:47,760 --> 02:05:51,020 Now this like suspends at this place. 2849 02:05:51,020 --> 02:05:53,940 It's like remember to come back to here when we're done. 2850 02:05:53,940 --> 02:05:57,800 Go up, run this code and then come back 2851 02:05:57,800 --> 02:05:59,020 and then continue on. 2852 02:05:59,020 --> 02:06:01,180 So it like leaves like a breadcrumb 2853 02:06:01,180 --> 02:06:02,940 of where it's supposed to come back to. 2854 02:06:02,940 --> 02:06:05,180 And then it runs and then the print lyrics 2855 02:06:05,180 --> 02:06:08,900 of course produces the two lines of output. 2856 02:06:08,900 --> 02:06:11,880 And yeah, that should probably not have, 2857 02:06:11,880 --> 02:06:13,620 that day should be up there. 2858 02:06:13,620 --> 02:06:16,100 And then x equals x plus two which makes it seven 2859 02:06:16,100 --> 02:06:17,500 and then prints out seven. 
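The program being traced above might look roughly like the sketch below. The name print_lyrics, the x = 5 start, and the Hello / Yo / 7 output follow the lecture; the two lyric strings inside the function are assumptions, since the transcript does not spell them out.

```python
x = 5
print('Hello')

# The def produces no output; it only stores the indented body.
def print_lyrics():
    print("I'm a lumberjack, and I'm okay.")        # assumed text
    print('I sleep all night and I work all day.')  # assumed text

print('Yo')
print_lyrics()   # the call: run the stored body, then come back here
x = x + 2
print(x)         # 7
```

Deleting the print_lyrics() line gives the earlier version of the program, where the function is defined but its two print statements never run.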
2860 02:06:17,500 --> 02:06:22,500 Okay, so this is the invoke or call the function. 2861 02:06:24,140 --> 02:06:26,340 You defined it and then later you called it. 2862 02:06:26,340 --> 02:06:31,340 Now, in addition to just call and return and invoking, 2863 02:06:32,860 --> 02:06:34,680 we can pass parameters in. 2864 02:06:34,680 --> 02:06:37,900 And the example of the parameter is in the max function 2865 02:06:37,900 --> 02:06:39,620 we have to say, this is the thing I want you 2866 02:06:39,620 --> 02:06:42,420 to find the maximum of, the largest thing. 2867 02:06:42,420 --> 02:06:46,300 And part of it is in the whole store and reuse pattern, 2868 02:06:46,300 --> 02:06:48,860 we have a few lines of code but sometimes we wanna do 2869 02:06:48,860 --> 02:06:51,000 ever so slightly different things 2870 02:06:51,000 --> 02:06:52,300 in a different invocations. 2871 02:06:52,300 --> 02:06:55,900 And so we use the arguments to subtly adjust 2872 02:06:55,900 --> 02:07:00,220 like finding the maximum is a general thing 2873 02:07:00,220 --> 02:07:03,540 but what thing to find the maximum of that makes a function 2874 02:07:03,540 --> 02:07:06,300 that's much more useful and reusable 2875 02:07:06,300 --> 02:07:08,280 in a lot more situations. 2876 02:07:08,280 --> 02:07:10,740 So arguments are the thing we passed in 2877 02:07:10,740 --> 02:07:13,980 and we defined for our functions that we're going to build, 2878 02:07:13,980 --> 02:07:18,540 we on the def statement, so we say def, greet, 2879 02:07:18,540 --> 02:07:21,140 name a function and then this is the arguments, 2880 02:07:21,140 --> 02:07:22,700 the things that are coming in. 
2881 02:07:22,700 --> 02:07:27,420 Now, this lang variable in a sense only exists 2882 02:07:27,420 --> 02:07:29,340 during the life of the function 2883 02:07:29,340 --> 02:07:31,260 and it represents sort of a placeholder, 2884 02:07:31,260 --> 02:07:34,200 it's not a real variable in the same sense, 2885 02:07:34,200 --> 02:07:37,780 it's a placeholder that refers to how you touch 2886 02:07:37,780 --> 02:07:40,660 that first parameter that's sitting in there. 2887 02:07:40,660 --> 02:07:45,360 Okay, and so lang, so lang is our first parameter, 2888 02:07:45,360 --> 02:07:48,500 whatever it is, we don't need to see this part down here 2889 02:07:48,500 --> 02:07:51,180 right now, all we know is we're gonna make a function 2890 02:07:51,180 --> 02:07:53,820 and we're gonna take a first, we're gonna take a parameter 2891 02:07:53,820 --> 02:07:56,240 and this lang is the placeholder that tells us 2892 02:07:56,240 --> 02:07:59,220 what that parameter is, okay? 2893 02:07:59,220 --> 02:08:01,380 So within the function, we're gonna check to see 2894 02:08:01,380 --> 02:08:04,540 if the language is Spanish, if it is, print hola, 2895 02:08:04,540 --> 02:08:07,880 else if the language is French, print bonjour, 2896 02:08:07,880 --> 02:08:09,060 otherwise print hello. 2897 02:08:09,060 --> 02:08:12,140 We have a very highly simplified 2898 02:08:12,140 --> 02:08:14,580 language translation system here. 2899 02:08:14,580 --> 02:08:17,580 So the def, of course, does nothing, 2900 02:08:17,580 --> 02:08:21,740 except it remembers that and defines the concept greet. 2901 02:08:24,660 --> 02:08:26,940 So that comes down and now we're gonna call it 2902 02:08:26,940 --> 02:08:28,300 and that says go look up the thing 2903 02:08:28,300 --> 02:08:29,660 that I defined called greet.
2904 02:08:29,660 --> 02:08:31,420 If you don't put this in, greet is gonna give you 2905 02:08:31,420 --> 02:08:34,660 a traceback, but because you extended and named it greet, 2906 02:08:34,660 --> 02:08:38,340 so it runs in, it starts, suspends the code here, 2907 02:08:38,340 --> 02:08:43,340 starts up here, but then lang is now an alias to en. 2908 02:08:43,340 --> 02:08:48,340 So now we can run if that is es, else if, 2909 02:08:49,900 --> 02:08:52,220 oop, I'm getting it all wrong now. 2910 02:08:55,500 --> 02:08:58,760 Right, so en comes in as lang, we're coming in the code. 2911 02:08:59,820 --> 02:09:03,980 If it's not es, it's not fr, else, it prints hello, 2912 02:09:03,980 --> 02:09:06,060 and then it comes back to the next line. 2913 02:09:07,260 --> 02:09:10,340 And then we call it again and this time es is lang 2914 02:09:10,340 --> 02:09:14,580 and so it runs this code and prints hola, 2915 02:09:14,580 --> 02:09:17,020 and then next time it calls with this, 2916 02:09:17,020 --> 02:09:21,380 and then prints bonjour, you get the idea. 2917 02:09:21,380 --> 02:09:26,380 So this is a placeholder so that on the successive calls 2918 02:09:26,380 --> 02:09:30,340 or invocations of the function, 2919 02:09:30,340 --> 02:09:33,340 we can get at whatever the programmer put in 2920 02:09:33,340 --> 02:09:34,620 as that first parameter. 2921 02:09:34,620 --> 02:09:37,300 And so we are saying in this definition, 2922 02:09:37,300 --> 02:09:39,740 we are ready to receive a first parameter. 2923 02:09:39,740 --> 02:09:42,220 Please call us with a parameter 2924 02:09:42,220 --> 02:09:44,700 and then we will be able to do something slightly different 2925 02:09:44,700 --> 02:09:45,580 for the different values. 
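Here is a sketch of the greet function as described above, with lang as the placeholder for the first argument. The 'es', 'fr', and 'en' codes and the three greetings follow the lecture; the exact capitalization of the printed strings is an assumption.

```python
# lang is the parameter: inside the function it refers to whatever
# value the caller passed in on this particular invocation.
def greet(lang):
    if lang == 'es':
        print('Hola')
    elif lang == 'fr':
        print('Bonjour')
    else:
        print('Hello')

greet('en')   # lang is 'en' -> falls through to the else -> Hello
greet('es')   # lang is 'es' -> Hola
greet('fr')   # lang is 'fr' -> Bonjour
```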
2926 02:09:45,580 --> 02:09:48,820 So this is a reusable bit of function that prints hello 2927 02:09:48,820 --> 02:09:51,380 in three different languages and then we tell it 2928 02:09:51,380 --> 02:09:54,740 what language at the moment that we're actually invoking it. 2929 02:09:57,180 --> 02:09:59,300 So that's putting stuff into the function. 2930 02:09:59,300 --> 02:10:03,740 Now getting stuff back out is the concept of returning. 2931 02:10:03,740 --> 02:10:07,380 In the return statement, the return statement 2932 02:10:07,380 --> 02:10:12,380 is an executable statement that does two basic things. 2933 02:10:12,860 --> 02:10:16,420 The first thing that it does is it finishes. 2934 02:10:16,420 --> 02:10:19,460 Now this is a one line function so that's kind of redundant, 2935 02:10:19,460 --> 02:10:23,660 but when Python goes into the return statement, 2936 02:10:23,660 --> 02:10:26,180 it doesn't continue on to the next line. 2937 02:10:26,180 --> 02:10:27,340 It just returns. 2938 02:10:27,340 --> 02:10:29,300 That is the end of the invocation 2939 02:10:29,300 --> 02:10:30,980 of that particular function. 2940 02:10:30,980 --> 02:10:33,660 But even more importantly, it takes as its parameter. 2941 02:10:33,660 --> 02:10:36,220 You can say return without a parameter 2942 02:10:36,220 --> 02:10:38,460 and it will stop the execution of the function 2943 02:10:38,460 --> 02:10:41,100 kind of like a break does for a loop. 2944 02:10:41,100 --> 02:10:42,300 It's kind of a break for a loop. 2945 02:10:42,300 --> 02:10:43,260 Get out, we're done. 2946 02:10:43,260 --> 02:10:44,580 Don't run that next line. 2947 02:10:44,580 --> 02:10:45,500 Get out. 2948 02:10:45,500 --> 02:10:49,300 But it also allows the specification of what you want 2949 02:10:49,300 --> 02:10:51,460 as the residual value in an expression. 2950 02:10:51,460 --> 02:10:54,460 So we're doing a print and then we're saying greet. 
2951 02:10:54,460 --> 02:10:59,060 And what's gonna show up here is whatever this function does 2952 02:10:59,060 --> 02:11:00,460 in its return statement. 2953 02:11:00,460 --> 02:11:02,900 And so that prints hello. 2954 02:11:02,900 --> 02:11:05,700 We call it again and it prints hello again. 2955 02:11:05,700 --> 02:11:06,540 Okay? 2956 02:11:10,180 --> 02:11:12,340 And so basically the return statement, 2957 02:11:13,740 --> 02:11:15,220 I call this the residual value. 2958 02:11:15,220 --> 02:11:18,740 It's like what shows up here when the function is all done 2959 02:11:18,740 --> 02:11:20,560 and it's the string hello. 2960 02:11:22,020 --> 02:11:24,620 We call the functions that return values fruitful 2961 02:11:24,620 --> 02:11:27,620 because they produce something but you don't have to. 2962 02:11:27,620 --> 02:11:29,100 You can just say return. 2963 02:11:29,100 --> 02:11:30,860 Or you don't even have to have a return statement. 2964 02:11:30,860 --> 02:11:32,140 It goes to the last line of the function 2965 02:11:32,140 --> 02:11:33,660 and it does a return automatically 2966 02:11:33,660 --> 02:11:35,140 at the last line of the function. 2967 02:11:35,140 --> 02:11:36,560 So here's a little bit of a rewrite 2968 02:11:36,560 --> 02:11:39,660 of our little language program. 2969 02:11:39,660 --> 02:11:41,420 We are going to create a greeting program. 2970 02:11:41,420 --> 02:11:44,060 We're gonna take the language as the first parameter. 2971 02:11:44,060 --> 02:11:45,820 And instead of just doing a print statement, 2972 02:11:45,820 --> 02:11:46,960 which is what we did before, 2973 02:11:46,960 --> 02:11:49,380 this is now more like a function 2974 02:11:49,380 --> 02:11:52,860 because it takes some input and produces some output 2975 02:11:52,860 --> 02:11:54,860 as a return rather than just printing. 2976 02:11:54,860 --> 02:11:57,060 It's a little tacky for a function to print.
2977 02:11:58,060 --> 02:12:01,620 And so here we return hola bonjour and hello 2978 02:12:01,620 --> 02:12:03,540 based on the right thing. 2979 02:12:03,540 --> 02:12:05,900 So now we say print greet en. 2980 02:12:05,900 --> 02:12:08,460 So it runs the code once, lang is en. 2981 02:12:08,460 --> 02:12:12,460 And then it runs this code and the residual value is hello. 2982 02:12:12,460 --> 02:12:14,620 So it says hello Glenn. 2983 02:12:14,620 --> 02:12:18,340 And similarly, when it runs this code, 2984 02:12:18,340 --> 02:12:20,940 it passes es in as lang, it runs through 2985 02:12:20,940 --> 02:12:23,020 and it runs this statement. 2986 02:12:23,020 --> 02:12:25,120 If there were more statements, it still wouldn't run them. 2987 02:12:25,120 --> 02:12:26,640 As soon as this return runs, 2988 02:12:26,640 --> 02:12:31,640 that says that this bit right here is now hola. 2989 02:12:31,640 --> 02:12:34,720 And the same with French, goes in, runs again, 2990 02:12:34,720 --> 02:12:38,400 out comes the return statement, and then bonjour, Michael. 2991 02:12:38,400 --> 02:12:42,400 So you see how we can control as we're writing the application, 2992 02:12:43,400 --> 02:12:45,200 we can control as we're writing the function 2993 02:12:45,200 --> 02:12:48,080 what the residual value that we want to see 2994 02:12:48,080 --> 02:12:50,240 in whatever expression is calling us. 2995 02:12:50,240 --> 02:12:51,480 Sometimes we have returns 2996 02:12:51,480 --> 02:12:53,280 and sometimes we don't have returns. 2997 02:12:53,280 --> 02:12:57,280 So, if you think of the method as a function, 2998 02:12:57,280 --> 02:13:02,280 well, so if you think of the max code 2999 02:13:02,360 --> 02:13:03,600 that we talked about before, 3000 02:13:03,600 --> 02:13:06,180 we can kind of see that somewhere inside that max code, 3001 02:13:06,180 --> 02:13:07,020 there's a return. 3002 02:13:07,020 --> 02:13:09,840 And that's how it communicates the w back to us.
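The rewrite described above, where greet returns a value instead of printing, can be sketched as follows. The names Glenn and Michael come from the lecture; Sally for the Spanish call is an assumption, as the transcript does not give that name.

```python
# Each return ends the call immediately and hands its value back
# to the expression that made the call.
def greet(lang):
    if lang == 'es':
        return 'Hola'
    elif lang == 'fr':
        return 'Bonjour'
    else:
        return 'Hello'

# The return value takes the place of the call inside print(...).
print(greet('en'), 'Glenn')     # Hello Glenn
print(greet('es'), 'Sally')     # Hola Sally   (name assumed)
print(greet('fr'), 'Michael')   # Bonjour Michael
```

Returning the string leaves the caller free to decide what to do with it: print it, store it, or combine it with something else.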
3003 02:13:09,840 --> 02:13:12,520 So we pass in as an argument, hello world. 3004 02:13:12,520 --> 02:13:13,880 It comes in as a parameter 3005 02:13:13,880 --> 02:13:17,040 and it's gonna loop through this inp somewhere. 3006 02:13:17,040 --> 02:13:19,040 It's gonna loop over and over into inp. 3007 02:13:19,040 --> 02:13:21,120 And then at some point it's gonna figure something out 3008 02:13:21,120 --> 02:13:23,880 and tell us what it wants to send back to us 3009 02:13:23,880 --> 02:13:24,920 in a return statement. 3010 02:13:24,920 --> 02:13:29,800 And so the w comes back and gets assigned into big. 3011 02:13:31,400 --> 02:13:33,320 You can have more than one parameter 3012 02:13:33,320 --> 02:13:34,680 and there's just an order. 3013 02:13:34,680 --> 02:13:36,960 The first one and the second one, three and five. 3014 02:13:36,960 --> 02:13:41,080 So three becomes A and five becomes B and away we go. 3015 02:13:41,080 --> 02:13:43,040 So we just use this to add two numbers 3016 02:13:43,040 --> 02:13:45,120 and so three plus five is eight. 3017 02:13:47,920 --> 02:13:50,600 So you get as many as you like and the order matters. 3018 02:13:50,600 --> 02:13:53,940 And if you do things like you tell it you want parameters 3019 02:13:53,940 --> 02:13:54,960 and you don't give them to it, 3020 02:13:54,960 --> 02:13:58,000 then that'll become a traceback and it will blow up. 3021 02:13:58,000 --> 02:14:01,280 We'll also talk about optional parameters later. 3022 02:14:01,280 --> 02:14:03,440 So you don't have to have return values 3023 02:14:03,440 --> 02:14:05,920 and that means that you simply don't call 3024 02:14:05,920 --> 02:14:07,400 the return with a value. 3025 02:14:07,400 --> 02:14:11,420 And return is always implicitly happening 3026 02:14:11,420 --> 02:14:13,900 as the last line of the function. 3027 02:14:14,800 --> 02:14:19,800 So that's kind of the basics of how functions operate.
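A small sketch of the last few points: multiple parameters matched by position, a missing argument causing a traceback, and the implicit return at the end of a function. The parameters a and b and the values three and five follow the lecture; the function names addtwo and say_hi and the variable names are hypothetical, chosen for illustration.

```python
# Arguments are matched to parameters by position:
# 3 becomes a, 5 becomes b.
def addtwo(a, b):
    added = a + b
    return added

total = addtwo(3, 5)
print(total)               # 8

# Leaving out a required argument raises TypeError; it is caught
# here only so that the sketch keeps running.
try:
    addtwo(3)
except TypeError as err:
    print('TypeError:', err)

# A function with no return statement still returns when it
# reaches its last line; the value handed back is None.
def say_hi():
    print('Hi')

nothing = say_hi()
print(nothing)             # None
```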
3028 02:14:19,800 --> 02:14:23,340 But I don't want you to get too excited about writing 3029 02:14:23,340 --> 02:14:26,760 functions. Some programming classes are like, 3030 02:14:26,760 --> 02:14:28,560 gotta write a function, gotta write a function. 3031 02:14:28,560 --> 02:14:32,620 Functions, to be clear, are a very powerful mechanism. 3032 02:14:32,620 --> 02:14:37,620 And as we write programs of 150, 200 lines of code, 3033 02:14:37,800 --> 02:14:40,360 1,000 lines of code, 10,000 lines of code, 3034 02:14:40,360 --> 02:14:43,080 the concept of a function is really important. 3035 02:14:43,080 --> 02:14:45,200 We would go crazy if we didn't have functions. 3036 02:14:45,200 --> 02:14:48,160 But if you're only writing 20 lines of code, 3037 02:14:48,160 --> 02:14:51,560 forcing yourself to write a function is kind of pointless. 3038 02:14:51,560 --> 02:14:56,560 So don't worry about maybe the lack of urge to use this. 3039 02:14:57,360 --> 02:14:59,480 We are calling lots of predefined functions 3040 02:14:59,480 --> 02:15:01,980 and we will for the next couple of lectures. 3041 02:15:01,980 --> 02:15:03,520 There will be a time when you go like, 3042 02:15:03,520 --> 02:15:05,300 oh I'm sick and tired of repeating myself. 3043 02:15:05,300 --> 02:15:07,000 Oh yeah, time to write a function. 3044 02:15:08,140 --> 02:15:11,260 So that's why we don't push functions prematurely. 3045 02:15:11,260 --> 02:15:14,200 We just want you to know what they are, 3046 02:15:14,200 --> 02:15:16,080 use them and at some moment you'll be like, 3047 02:15:16,080 --> 02:15:17,220 oh I wanna define one. 3048 02:15:17,220 --> 02:15:19,160 But don't worry about it, it might take a while 3049 02:15:19,160 --> 02:15:21,700 before you really wanna define a function. 3050 02:15:21,700 --> 02:15:25,740 So that kind of summarizes our lecture on functions 3051 02:15:25,740 --> 02:15:28,460 and up next we're gonna do iterations. 
3052 02:15:32,120 --> 02:15:35,180 Hello and welcome to chapter five, loops and iteration. 3053 02:15:35,180 --> 02:15:39,840 Now we're going to work on our fourth basic pattern 3054 02:15:39,840 --> 02:15:43,340 on sequential, conditional, store and reuse 3055 02:15:43,340 --> 02:15:44,540 and loops and iteration. 3056 02:15:44,540 --> 02:15:47,560 And this is the one where we teach the computer 3057 02:15:47,560 --> 02:15:49,440 how to do things a lot. 3058 02:15:49,440 --> 02:15:51,480 We can tell it to do something a million times. 3059 02:15:51,480 --> 02:15:56,240 And so that's where we get the doggedness of computers 3060 02:15:56,240 --> 02:15:58,960 or the fact that they're so good at doing work for us 3061 02:15:58,960 --> 02:16:01,240 because we can set them off to a task 3062 02:16:01,240 --> 02:16:03,340 and they'll do it until it's done. 3063 02:16:04,340 --> 02:16:07,960 So here's a very simple loop, a very simple loop. 3064 02:16:09,520 --> 02:16:11,280 Let's put the coffee over here. 3065 02:16:11,280 --> 02:16:16,280 The key word that we're gonna start using is the while loop. 3066 02:16:16,920 --> 02:16:18,920 We're also gonna use the for later on. 3067 02:16:20,000 --> 02:16:23,480 And the while loop functions very much like an if statement. 3068 02:16:23,480 --> 02:16:27,440 The while starts it and this is just like an if statement. 3069 02:16:27,440 --> 02:16:30,560 It's a question that leads to a true or a false answer. 3070 02:16:30,560 --> 02:16:33,200 And then there's a colon and then there's an indented block 3071 02:16:33,200 --> 02:16:36,200 and then we use the deindent to determine how long 3072 02:16:36,200 --> 02:16:38,940 the loop is and so this print is deindented 3073 02:16:38,940 --> 02:16:41,520 so that indicates the end of the loop. 
3074 02:16:41,520 --> 02:16:45,920 And so at some level, what's gonna happen here 3075 02:16:45,920 --> 02:16:48,799 is it's just gonna run and if this is true, 3076 02:16:48,799 --> 02:16:51,319 it's gonna run this code and if it's false, 3077 02:16:51,320 --> 02:16:53,080 it's gonna skip the code and that way 3078 02:16:53,080 --> 02:16:54,280 it functions like an if. 3079 02:16:54,280 --> 02:16:56,559 The place that it doesn't function like an if 3080 02:16:56,559 --> 02:16:58,799 is after it's run the code once, 3081 02:16:58,799 --> 02:17:01,559 it goes up and then asks the question again 3082 02:17:01,559 --> 02:17:03,699 and so you can think of it going back up 3083 02:17:03,700 --> 02:17:05,520 kind of to the top of the while loop 3084 02:17:05,520 --> 02:17:07,719 and then re-asking the question like, 3085 02:17:07,719 --> 02:17:10,799 okay, is this going to run again? 3086 02:17:10,799 --> 02:17:13,679 And then it's gonna do that some number of times 3087 02:17:13,680 --> 02:17:15,000 and then it's gonna finish. 3088 02:17:15,000 --> 02:17:18,040 And so that's the loop, that's the iteration. 3089 02:17:18,040 --> 02:17:19,879 And we're going to make a variable, 3090 02:17:19,879 --> 02:17:21,919 we're gonna construct very carefully 3091 02:17:21,920 --> 02:17:24,840 a variable that we call the iteration variable 3092 02:17:24,840 --> 02:17:28,680 and that's n and it's a variable that's gonna change 3093 02:17:28,680 --> 02:17:30,320 and it's our way of running the loop 3094 02:17:30,320 --> 02:17:32,040 but not running the loop forever. 3095 02:17:33,240 --> 02:17:35,059 So let's just run this. 3096 02:17:35,059 --> 02:17:37,719 We come in, n is five, is n greater than zero? 3097 02:17:37,719 --> 02:17:40,119 Yes it is, so we're gonna run this code. 
3098 02:17:40,120 --> 02:17:42,120 So we're gonna run this code, we're gonna print out five, 3099 02:17:42,120 --> 02:17:43,360 then we're gonna subtract one 3100 02:17:43,360 --> 02:17:45,040 and then we're gonna go back up, 3101 02:17:45,040 --> 02:17:47,200 go back up and ask the question, 3102 02:17:47,200 --> 02:17:48,639 is n greater than zero? 3103 02:17:48,639 --> 02:17:51,399 And the answer is, since it's four, the answer is yes. 3104 02:17:51,400 --> 02:17:53,000 So then it runs again. 3105 02:17:53,000 --> 02:17:55,240 Then it prints out four, subtracts it again, 3106 02:17:55,240 --> 02:17:58,320 checks, prints three, subtracts it again, 3107 02:17:58,320 --> 02:18:00,480 prints two, subtracts it again, 3108 02:18:00,480 --> 02:18:02,620 prints one, subtracts it again. 3109 02:18:02,620 --> 02:18:05,760 Now n is zero and so it comes back up, 3110 02:18:05,760 --> 02:18:10,080 comes back up, and this question has now become false. 3111 02:18:10,080 --> 02:18:11,559 So it's gonna take the exit, 3112 02:18:11,559 --> 02:18:13,979 so it's gonna come down and run this line right here, 3113 02:18:13,980 --> 02:18:15,040 then it prints blastoff 3114 02:18:15,040 --> 02:18:17,080 and we can kind of print out the residual value 3115 02:18:17,080 --> 02:18:19,360 of n just to sort of prove to ourselves 3116 02:18:19,360 --> 02:18:22,879 that it ran until n was no longer greater than zero 3117 02:18:22,879 --> 02:18:25,239 and then zero was the final value for n 3118 02:18:25,240 --> 02:18:30,240 and we carefully constructed this n, n equals, oops, go back. 3119 02:18:30,240 --> 02:18:32,860 We carefully constructed n, we set it to five, 3120 02:18:32,860 --> 02:18:36,139 then we carefully subtracted one each time through the loop 3121 02:18:36,139 --> 02:18:39,419 and then we're using that to control when to exit the loop. 
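The countdown loop traced above can be sketched as follows. The exact text of the final message is an assumption based on the spoken "blast off".

```python
n = 5                 # set up the iteration variable
while n > 0:          # the question, asked again before every pass
    print(n)
    n = n - 1         # change the iteration variable so the loop can end
print('Blastoff!')
print(n)              # residual value of n after the loop
```

The question is true for 5, 4, 3, 2 and 1 and false once n reaches 0, so the body runs five times.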
3122 02:18:39,420 --> 02:18:41,379 And so you could think of this loop as, 3123 02:18:41,379 --> 02:18:43,499 for now, running five times, 3124 02:18:43,500 --> 02:18:47,620 true, true, true, true, true, and then false, finally. 3125 02:18:47,620 --> 02:18:50,260 So this question was true for a while 3126 02:18:50,260 --> 02:18:52,459 and as long as it was true, the loop ran 3127 02:18:52,459 --> 02:18:55,899 and then when it turned false, the loop stopped. 3128 02:18:57,180 --> 02:18:59,379 And so this variable that we construct 3129 02:18:59,379 --> 02:19:01,999 to control the loop was called the iteration variable 3130 02:19:02,000 --> 02:19:04,260 because it tells how many times this loop 3131 02:19:04,260 --> 02:19:06,299 is going to run over and over 3132 02:19:06,299 --> 02:19:08,519 or otherwise known as iterate. 3133 02:19:09,920 --> 02:19:12,260 So this is a badly constructed loop 3134 02:19:12,260 --> 02:19:15,500 with an iteration variable that we didn't do very well. 3135 02:19:15,500 --> 02:19:18,500 And so if we take a look at this, 3136 02:19:18,500 --> 02:19:21,540 we start it with n five and then this is greater than zero 3137 02:19:21,540 --> 02:19:23,500 so it's true so it runs it and then it runs it again 3138 02:19:23,500 --> 02:19:25,540 and then it's still greater than zero. 3139 02:19:25,540 --> 02:19:28,000 So you can pretty much see because we're not changing n, 3140 02:19:28,000 --> 02:19:30,139 this is gonna be true, true, true, true, 3141 02:19:30,139 --> 02:19:33,059 dot, dot, dot, dot, forever, true, forever. 3142 02:19:33,059 --> 02:19:35,579 And so this is an infinite loop 3143 02:19:35,580 --> 02:19:37,700 and it's just gonna run until your computer 3144 02:19:37,700 --> 02:19:40,379 runs out of battery or you hit the button. 
3145 02:19:40,379 --> 02:19:42,099 This is the kind of thing where you often see 3146 02:19:42,100 --> 02:19:46,280 your computer spinning like a spinning beach ball 3147 02:19:46,280 --> 02:19:49,260 or some other indication that your computer's super busy. 3148 02:19:49,260 --> 02:19:51,180 It's in some kind of a loop, really tight 3149 02:19:51,180 --> 02:19:53,260 and it's running something and it's using up 3150 02:19:53,260 --> 02:19:55,920 all of the processing resources of your computer. 3151 02:19:55,920 --> 02:19:57,740 That's an infinite loop. 3152 02:19:57,740 --> 02:19:59,980 And so the problem is we did nothing 3153 02:19:59,980 --> 02:20:02,380 with the iteration variable. 3154 02:20:04,060 --> 02:20:05,460 Now here's a different loop. 3155 02:20:05,460 --> 02:20:08,180 And so this one demonstrates a different idea. 3156 02:20:08,180 --> 02:20:11,100 So in this case, we start out with n is zero 3157 02:20:11,100 --> 02:20:13,640 and it comes in here and is n greater than zero? 3158 02:20:13,640 --> 02:20:15,820 Question mark and the answer is false. 3159 02:20:15,820 --> 02:20:19,780 So it skips it, it doesn't run these lines of code at all. 3160 02:20:19,780 --> 02:20:21,700 And so this loop doesn't run at all 3161 02:20:21,700 --> 02:20:23,380 because it comes in, asks the question, 3162 02:20:23,380 --> 02:20:26,060 it says no and then it skips right around it. 3163 02:20:26,060 --> 02:20:27,980 So never run, never run. 3164 02:20:27,980 --> 02:20:31,420 And so this actually is, sometimes you write a while loop 3165 02:20:31,420 --> 02:20:34,700 on purpose like this, not quite as simple as this one. 3166 02:20:34,700 --> 02:20:38,320 But the idea is this emphasizes that these loops 3167 02:20:38,320 --> 02:20:39,960 are what we call zero trip. 3168 02:20:41,940 --> 02:20:45,220 They are not even guaranteed to run once. 3169 02:20:45,220 --> 02:20:46,780 They're gonna run maybe zero times. 
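A minimal sketch of the zero-trip case: because the question is false the very first time it is asked, the body never runs. The printed strings and the trips counter are assumptions added here to make that visible.

```python
n = 0
trips = 0             # hypothetical counter, only here to show the body never ran
while n > 0:          # false on the first check, so the block is skipped
    print('Lather')
    print('Rinse')
    trips = trips + 1
print('Dry off!')
print('Times through the loop:', trips)
```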
3170 02:20:46,780 --> 02:20:49,260 And in this respect, it functions exactly 3171 02:20:49,260 --> 02:20:51,620 like an if statement, right? 3172 02:20:51,620 --> 02:20:53,860 Meaning the first time through the loop, if it's not true, 3173 02:20:53,860 --> 02:20:55,740 it's just gonna skip right by it. 3174 02:20:58,500 --> 02:21:01,340 So there's a couple of ways of getting out of loops. 3175 02:21:01,340 --> 02:21:03,580 In this case, I'm constructing an infinite loop 3176 02:21:03,580 --> 02:21:07,340 because remember the kind of definition of an infinite loop 3177 02:21:07,340 --> 02:21:10,220 is if this is gonna stay true. 3178 02:21:10,220 --> 02:21:13,060 Well, true is the constant true. 3179 02:21:13,060 --> 02:21:14,740 So this is gonna run forever. 3180 02:21:14,740 --> 02:21:16,780 And what it's gonna do is it's gonna prompt 3181 02:21:16,780 --> 02:21:21,080 with a little arrow and then let us type 3182 02:21:21,080 --> 02:21:24,460 and read whatever we type into the variable line. 3183 02:21:24,460 --> 02:21:26,660 And then if the line is done, we're gonna break. 3184 02:21:26,660 --> 02:21:29,220 Now break is an executable statement. 3185 02:21:29,220 --> 02:21:34,000 And if you hit the break, it exits the innermost loop 3186 02:21:34,000 --> 02:21:36,900 out to the place beyond the end of the loop. 3187 02:21:38,060 --> 02:21:43,060 So when this runs the first time and we say hello there, 3188 02:21:43,900 --> 02:21:45,560 line is not done, so it prints it. 3189 02:21:45,560 --> 02:21:47,840 So it prints out hello there and then goes up. 3190 02:21:47,840 --> 02:21:49,900 And then we type in again, we type finished. 3191 02:21:49,900 --> 02:21:52,980 And so it doesn't, it's not done, so it prints it. 3192 02:21:52,980 --> 02:21:54,460 So now comes that print statement. 3193 02:21:54,460 --> 02:21:57,660 Then we type in done and now this becomes true. 
3194 02:21:57,660 --> 02:22:00,140 And it comes out and runs the code 3195 02:22:00,140 --> 02:22:02,180 beyond the end of the loop. 3196 02:22:02,180 --> 02:22:04,020 The key is it doesn't go back. 3197 02:22:04,020 --> 02:22:07,660 It's like once you've done a break, that loop is done. 3198 02:22:07,660 --> 02:22:11,740 And so you look at basically the block that is the loop. 3199 02:22:11,740 --> 02:22:14,180 So here's kind of the loop block. 3200 02:22:14,180 --> 02:22:15,940 And then the break goes to the line 3201 02:22:15,940 --> 02:22:20,940 after the end of the loop block. 3202 02:22:23,180 --> 02:22:24,660 And you can think of this as sort of like 3203 02:22:24,660 --> 02:22:26,060 just a hyperspace jump. 3204 02:22:26,060 --> 02:22:28,820 There is nothing really, this could be literally 3205 02:22:28,820 --> 02:22:31,580 hundreds of lines with if statements. 3206 02:22:31,580 --> 02:22:33,720 And you could be running and doing all kinds of stuff 3207 02:22:33,720 --> 02:22:35,900 and running and doing all these things. 3208 02:22:35,900 --> 02:22:38,460 And these things could run all kinds of ways, right? 3209 02:22:38,460 --> 02:22:40,940 The point is as soon as you hit a break statement, 3210 02:22:40,940 --> 02:22:42,460 however much stuff is down here, 3211 02:22:42,460 --> 02:22:44,100 however much stuff is up here, 3212 02:22:44,100 --> 02:22:47,760 it exits to whatever the next line is 3213 02:22:47,760 --> 02:22:50,060 beyond the end of the loop. 3214 02:22:51,500 --> 02:22:54,280 Continue is another loop control statement, 3215 02:22:54,280 --> 02:22:56,560 but it works differently than break. 3216 02:22:56,560 --> 02:22:59,480 So break says get out of this loop. 3217 02:22:59,480 --> 02:23:02,700 Continue effectively says stop this iteration. 3218 02:23:02,700 --> 02:23:04,400 We're done with this iteration. 3219 02:23:04,400 --> 02:23:08,180 And so continue says go up back to the top of the loop. 3220 02:23:08,180 --> 02:23:09,560 Oops, yeah. 
3221 02:23:09,560 --> 02:23:11,200 Go up back to the top of the loop. 3222 02:23:11,200 --> 02:23:14,760 And so here we read a line. 3223 02:23:14,760 --> 02:23:17,400 If the first character is a pound sign, 3224 02:23:17,400 --> 02:23:20,480 line sub zero, if that first character is a pound sign, 3225 02:23:20,480 --> 02:23:21,920 we're gonna skip it. 3226 02:23:21,920 --> 02:23:24,380 And this is a way for us to make like little comments 3227 02:23:24,380 --> 02:23:25,560 in our typing. 3228 02:23:25,560 --> 02:23:28,440 And then if the line is done, we get out 3229 02:23:28,440 --> 02:23:29,640 and otherwise we print it. 3230 02:23:29,640 --> 02:23:31,400 And so that's why there is no print out here 3231 02:23:31,400 --> 02:23:35,560 because it comes in, runs, oops. 3232 02:23:35,560 --> 02:23:40,560 It comes in, this is true and that goes back up, 3233 02:23:42,160 --> 02:23:45,360 but it comes back and prints out the next one 3234 02:23:45,360 --> 02:23:46,300 and does another thing. 3235 02:23:46,300 --> 02:23:47,880 And so the loop continues, 3236 02:23:47,880 --> 02:23:50,440 whereas the break ends the loop. 3237 02:23:50,440 --> 02:23:52,520 And so again, the same kind of notion 3238 02:23:52,520 --> 02:23:54,880 that you're sort of doing all kinds of complexity. 3239 02:23:54,880 --> 02:23:56,800 Wherever you're at in this loop, 3240 02:23:56,800 --> 02:24:00,400 you hit continue and it doesn't go any further. 3241 02:24:00,400 --> 02:24:02,880 It goes back up and runs the question mark. 3242 02:24:02,880 --> 02:24:04,600 It asks the question mark. 3243 02:24:04,600 --> 02:24:07,600 And so, I mean, ask the question 3244 02:24:07,600 --> 02:24:09,620 and it might exit the loop in that particular case. 3245 02:24:09,620 --> 02:24:11,000 But this one here is a true, 3246 02:24:11,000 --> 02:24:13,300 this is an infinite loop that I've constructed. 
3247 02:24:13,300 --> 02:24:15,600 This is not an infinite loop because at some point 3248 02:24:15,600 --> 02:24:17,000 the break gets us out of the loop. 3249 02:24:17,000 --> 02:24:20,560 And so it's an infinite loop with break to escape it. 3250 02:24:20,560 --> 02:24:23,340 And that's another common way to construct a loop. 3251 02:24:26,100 --> 02:24:29,100 So these loops that we've been drawing so far, 3252 02:24:29,100 --> 02:24:31,160 the ones that use while as their keyword, 3253 02:24:32,600 --> 02:24:34,280 are what are called indefinite loops. 3254 02:24:34,280 --> 02:24:36,640 And that's because they kind of go for a while 3255 02:24:36,640 --> 02:24:41,120 till a break hits or until some value becomes true. 3256 02:24:41,120 --> 02:24:44,200 I mean, as long as that value remains true. 3257 02:24:44,200 --> 02:24:48,120 So all the ones we've done so far are easy to look at 3258 02:24:48,120 --> 02:24:50,720 and know that they look pretty good 3259 02:24:50,720 --> 02:24:52,500 and they're probably gonna finish. 3260 02:24:52,500 --> 02:24:55,440 But there are some times if they're long and complex 3261 02:24:55,440 --> 02:24:58,480 and their exit or termination conditions 3262 02:24:58,480 --> 02:24:59,760 are a little more complex, 3263 02:24:59,760 --> 02:25:02,240 it's not clear that they're really gonna terminate. 3264 02:25:02,240 --> 02:25:05,280 And so we can use while loops for a lot of things, 3265 02:25:05,280 --> 02:25:08,360 but for most of our looping, 3266 02:25:08,360 --> 02:25:10,320 we're gonna use what are called definite loops. 3267 02:25:10,320 --> 02:25:12,480 And that's what we're gonna talk about next. 3268 02:25:16,320 --> 02:25:19,120 So definite loops use the for keyword. 3269 02:25:19,120 --> 02:25:20,800 And the idea of a definite loop 3270 02:25:20,800 --> 02:25:23,200 is it's going to loop through some set of things. 
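The break and continue behavior described above can be sketched like this. In the lecture the lines are typed at a prompt; here a canned list stands in for the keyboard so the example runs on its own, and the names typed and printed are assumptions.

```python
# Canned input standing in for what a user would type at the '> ' prompt.
typed = iter(['# a comment', 'hello there', 'finished', 'done', 'never read'])
printed = []

while True:                  # "infinite" loop; break is the only way out
    line = next(typed)       # interactively this would be: line = input('> ')
    if line[0] == '#':
        continue             # skip the rest of this iteration, go back to the top
    if line == 'done':
        break                # leave the loop entirely
    print(line)
    printed.append(line)
print('Done!')
```

The comment line is skipped, the two ordinary lines are printed, and nothing after done is ever read.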
3271 02:25:23,200 --> 02:25:25,280 It might be a set of lines in a file, 3272 02:25:25,280 --> 02:25:28,320 it might be a set of characters in a string, 3273 02:25:28,320 --> 02:25:31,380 it might be a set of strings in a list of strings. 3274 02:25:31,380 --> 02:25:34,760 But whatever it is, it's sort of gonna run 3275 02:25:34,760 --> 02:25:37,360 a finite number of times depending on the thing 3276 02:25:37,360 --> 02:25:38,920 that it's looping through. 3277 02:25:38,920 --> 02:25:40,800 And we like this. 3278 02:25:40,800 --> 02:25:43,760 And it's an easier way to construct it 3279 02:25:43,760 --> 02:25:45,960 and we actually don't have to deal with the iteration 3280 02:25:45,960 --> 02:25:48,620 variable, the for loop includes a mechanism 3281 02:25:48,620 --> 02:25:50,520 to construct the iteration variable for us. 3282 02:25:50,520 --> 02:25:54,440 So definite loops iterate through the members of a set. 3283 02:25:54,440 --> 02:25:57,160 So here's a very simple for loop. 3284 02:25:57,160 --> 02:26:02,160 And so you see the for keyword and in is also a keyword. 3285 02:26:03,600 --> 02:26:06,640 And the iteration variable is something we put right here. 3286 02:26:06,640 --> 02:26:11,040 This i is declared, this i is like an assignment statement. 3287 02:26:11,040 --> 02:26:13,720 And i is going to take on successive values. 3288 02:26:13,720 --> 02:26:17,240 So i is going to be five the first time through the loop. 3289 02:26:17,240 --> 02:26:19,800 Then i is gonna be four the second time through the loop. 3290 02:26:19,800 --> 02:26:22,320 Three, two, one. 3291 02:26:22,320 --> 02:26:25,360 So i is gonna be assigned five different times 3292 02:26:25,360 --> 02:26:26,880 to five different values. 3293 02:26:26,880 --> 02:26:29,880 And then the loop is going to run. 3294 02:26:29,880 --> 02:26:32,440 It's gonna run once with five, once with four, 3295 02:26:32,440 --> 02:26:35,320 once with three, once with two, and once with one. 
3296 02:26:35,320 --> 02:26:38,240 And so this block of code we have contracted 3297 02:26:38,240 --> 02:26:42,300 to say, execute it five times with these values of i. 3298 02:26:42,300 --> 02:26:43,900 i is that iteration variable. 3299 02:26:43,900 --> 02:26:47,560 i is the thing changing through each iteration of the loop. 3300 02:26:47,560 --> 02:26:48,400 Okay? 3301 02:26:48,400 --> 02:26:52,480 And so that's why this prints out five, four, three, two, 3302 02:26:52,480 --> 02:26:55,240 one, and then when it's done it finishes it. 3303 02:26:55,240 --> 02:26:58,360 So this is a much more direct syntax 3304 02:26:58,360 --> 02:27:01,440 for looping five times and setting the iteration variable. 3305 02:27:01,440 --> 02:27:05,600 You kind of all combine it into this one thing, right? 3306 02:27:05,600 --> 02:27:06,740 All into one thing. 3307 02:27:06,740 --> 02:27:08,860 So it's quite nice. 3308 02:27:08,860 --> 02:27:11,440 So you don't have to be going through a list of numbers. 3309 02:27:11,440 --> 02:27:12,560 There's all kinds of things 3310 02:27:12,560 --> 02:27:14,700 that we can iterate through with for. 3311 02:27:14,700 --> 02:27:16,640 And by the way, while I'm sitting here, 3312 02:27:16,640 --> 02:27:19,940 I named my variable friends, 3313 02:27:19,940 --> 02:27:21,280 because that's a list of strings, 3314 02:27:21,280 --> 02:27:24,460 and friend, which is the iteration variable. 3315 02:27:24,460 --> 02:27:27,680 I'm using singular and plural because it helps you read it. 3316 02:27:27,680 --> 02:27:29,740 Python doesn't understand singular and plural. 3317 02:27:29,740 --> 02:27:31,920 So just because you say friends 3318 02:27:31,920 --> 02:27:33,560 doesn't mean Python knows it's a list. 3319 02:27:33,560 --> 02:27:35,340 Python does know it's a list, 3320 02:27:35,340 --> 02:27:37,560 but it doesn't know by the name of the variable I've chosen. 3321 02:27:37,560 --> 02:27:40,360 That's your basic mnemonic variable warning. 
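The simple definite loop described above, as a sketch. The seen list and the closing message are assumptions added so the sequence of values is easy to check.

```python
seen = []                      # hypothetical helper to record each value of i
for i in [5, 4, 3, 2, 1]:      # for assigns i to each member in turn
    print(i)
    seen.append(i)
print('Blastoff!')
```

There is no separate line to initialize i or to change it; the for statement handles both.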
3322 02:27:40,360 --> 02:27:41,720 These are cool variable names, 3323 02:27:41,720 --> 02:27:44,360 but I don't want you to get confused by them. 3324 02:27:44,360 --> 02:27:46,400 So you can loop through a variable. 3325 02:27:46,400 --> 02:27:48,260 So we're gonna take this list of three strings 3326 02:27:48,260 --> 02:27:49,400 and stick it in friends. 3327 02:27:49,400 --> 02:27:51,500 And so friend is gonna iterate through that. 3328 02:27:51,500 --> 02:27:54,260 So the first time through, friend is gonna be Joseph. 3329 02:27:54,260 --> 02:27:56,140 Second time through, it's gonna be Glen. 3330 02:27:56,140 --> 02:27:58,520 Third time through, it's going to be Sally. 3331 02:27:58,520 --> 02:28:00,400 And so that just says run this loop, 3332 02:28:00,400 --> 02:28:02,540 run this code, the indented code, 3333 02:28:02,540 --> 02:28:05,660 three times each time the variable friend 3334 02:28:05,660 --> 02:28:08,840 takes on a successive version of, 3335 02:28:08,840 --> 02:28:12,820 a successive value that's in the friends array. 3336 02:28:13,740 --> 02:28:16,980 So it says happy birthday Joseph, Glen, Sally, 3337 02:28:16,980 --> 02:28:19,600 and then we come out of the loop and we print done. 3338 02:28:19,600 --> 02:28:24,600 So if we try to draw a picture of what this is really doing, 3339 02:28:29,040 --> 02:28:31,960 the for loop is actually doing a whole bunch of stuff 3340 02:28:31,960 --> 02:28:34,760 that we would have to do with maybe 3341 02:28:34,760 --> 02:28:37,480 separate statements in the while loop. 3342 02:28:37,480 --> 02:28:40,000 First it decides how many times to run the loop. 3343 02:28:40,000 --> 02:28:43,100 So it's answering the done question, which way do we go? 3344 02:28:43,100 --> 02:28:45,720 And it is also then moving I ahead. 3345 02:28:45,720 --> 02:28:47,880 It's managing the iteration variable. 3346 02:28:47,880 --> 02:28:51,640 If you go back to the, it's initializing it too. 
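The friends loop can be sketched like this. The exact message strings are assumptions based on what is read aloud.

```python
friends = ['Joseph', 'Glenn', 'Sally']
for friend in friends:         # friend takes on each string in friends, in order
    print('Happy birthday', friend)
print('Done!')
```

The singular and plural names are only for human readers; renaming them to x and y would run exactly the same way.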
3347 02:28:51,640 --> 02:28:53,200 If you go back to the while loop, 3348 02:28:53,200 --> 02:28:55,640 we had n equals five, while n greater than zero, 3349 02:28:55,640 --> 02:28:57,200 n equals n minus one. 3350 02:28:57,200 --> 02:29:00,720 So we had like three lines to control the loop 3351 02:29:00,720 --> 02:29:02,520 to manage the iteration variable. 3352 02:29:02,520 --> 02:29:04,760 But with a for loop, we don't have to do that. 3353 02:29:04,760 --> 02:29:07,120 And so that's all taken care of. 3354 02:29:07,120 --> 02:29:10,720 And so that basically says the for loop, 3355 02:29:10,720 --> 02:29:13,320 by you using a for loop, are we done? 3356 02:29:13,320 --> 02:29:14,720 No, we have five things to work. 3357 02:29:14,720 --> 02:29:16,980 Well, set i to the first one, run it. 3358 02:29:16,980 --> 02:29:18,520 We're not done, because we've got one more. 3359 02:29:18,520 --> 02:29:20,780 Set it to the second one, third one, fourth one, 3360 02:29:20,780 --> 02:29:22,560 fifth one, and now we're done. 3361 02:29:22,560 --> 02:29:26,240 And that is all handled in a single line of code 3362 02:29:26,240 --> 02:29:28,040 and that includes the iteration variable 3363 02:29:28,040 --> 02:29:30,200 and the set of things through which 3364 02:29:30,200 --> 02:29:32,040 we are going to iterate. 3365 02:29:34,080 --> 02:29:37,360 I really like the word in. 3366 02:29:37,360 --> 02:29:42,040 It is mathematically, I mean, it reminds me of 3367 02:29:42,040 --> 02:29:45,800 the set theory where you say this is a member of this set 3368 02:29:45,800 --> 02:29:49,040 or the for each. 
3369 02:29:49,040 --> 02:29:51,120 Math isn't important here, but if you do know math, 3370 02:29:51,120 --> 02:29:53,760 the vertical bar means such that, right, 3371 02:29:53,760 --> 02:29:56,600 is a member of this set and that kind of stuff, 3372 02:29:56,600 --> 02:29:59,600 member of the set, I'll erase the math stuff 3373 02:29:59,600 --> 02:30:00,980 so we don't over math. 3374 02:30:00,980 --> 02:30:04,520 But it's like for each of the values in the set, 3375 02:30:04,520 --> 02:30:07,600 five, four, three, two, one, run this loop, 3376 02:30:07,600 --> 02:30:10,980 setting the iteration variable i to the members of that set. 3377 02:30:10,980 --> 02:30:14,960 So in reminds me, for those of us who are math oriented, 3378 02:30:14,960 --> 02:30:19,120 in reminds me of a really nice concept in mathematics. 3379 02:30:23,360 --> 02:30:26,640 Now, you could think of this as sort of this 3380 02:30:26,640 --> 02:30:29,220 looping structure where the for loop, 3381 02:30:29,220 --> 02:30:30,920 and this is pretty much how it actually runs 3382 02:30:30,920 --> 02:30:34,320 inside the computer, right, where it initializes it, 3383 02:30:34,320 --> 02:30:37,320 i, it runs this, runs this thing five times, 3384 02:30:37,320 --> 02:30:38,500 and then executes. 3385 02:30:38,500 --> 02:30:40,080 That's one way to think about it. 3386 02:30:40,080 --> 02:30:43,040 You could also think about it in a somewhat 3387 02:30:43,040 --> 02:30:47,680 more abstract way, and think of it as all we're really doing 3388 02:30:47,680 --> 02:30:51,440 is we have a contract with Python that says i, 3389 02:30:51,440 --> 02:30:53,000 we're supposed to run this code five times, 3390 02:30:53,000 --> 02:30:56,800 and i's supposed to be five, four, three, two, and one. 3391 02:30:56,800 --> 02:30:59,200 So you could imagine this might be what's going on. 3392 02:30:59,200 --> 02:31:01,640 The for loop sets i to five, runs our code. 
3393 02:31:01,640 --> 02:31:04,280 The for loop sets i to four, runs our code. 3394 02:31:04,280 --> 02:31:06,560 The for loop sets i to three, runs our code. 3395 02:31:06,560 --> 02:31:09,120 The for loop sets i to two, runs our code. 3396 02:31:09,120 --> 02:31:11,480 For loop sets i to one, and runs our code. 3397 02:31:11,480 --> 02:31:15,720 All we know is our code was run five, ran five times, 3398 02:31:15,720 --> 02:31:18,720 and by contract, each successive time, 3399 02:31:20,440 --> 02:31:22,040 we're getting a different value for i, 3400 02:31:22,040 --> 02:31:24,040 and the value for i is taken from this set. 3401 02:31:24,040 --> 02:31:27,040 And so this is just one way to think about it, 3402 02:31:27,040 --> 02:31:31,400 to say to yourself, oh yeah, this is one way to think about it 3403 02:31:31,400 --> 02:31:34,000 as it's actually, and this is how it really works, 3404 02:31:34,000 --> 02:31:36,400 but this is also kind of logically the contract 3405 02:31:36,400 --> 02:31:39,120 that Python is making for us. 3406 02:31:39,120 --> 02:31:42,200 So up next, we're gonna talk about taking this notion 3407 02:31:42,200 --> 02:31:44,680 of doing something to a lot of items, 3408 02:31:44,680 --> 02:31:46,240 but accomplishing something with that, 3409 02:31:46,240 --> 02:31:48,740 and I call these loop idioms. 3410 02:31:52,040 --> 02:31:54,040 So now we're gonna talk about loop idioms, 3411 02:31:54,040 --> 02:31:57,040 and loop idioms are patterns 3412 02:31:57,040 --> 02:31:59,520 that have to do with how we construct loops. 3413 02:31:59,520 --> 02:32:03,160 We have the mechanics of fors and whiles, 3414 02:32:03,160 --> 02:32:05,680 but ultimately we wanna get something done. 
3415 02:32:05,680 --> 02:32:08,000 We wanna solve a problem with a loop, 3416 02:32:08,000 --> 02:32:11,560 and often what we have to do is if we have a set of things, 3417 02:32:11,560 --> 02:32:14,780 whether it's lines, or strings, or characters, or numbers, 3418 02:32:14,780 --> 02:32:16,480 we're looking for something like the largest, 3419 02:32:16,480 --> 02:32:18,280 or the smallest, or we wanna add them up, 3420 02:32:18,280 --> 02:32:20,040 or something like that. 3421 02:32:20,040 --> 02:32:22,800 And so we can't just say add them up, 3422 02:32:22,800 --> 02:32:25,200 we have to say go through each one 3423 02:32:25,200 --> 02:32:26,640 and do something to each one, 3424 02:32:26,640 --> 02:32:28,800 and somehow achieve adding them up. 3425 02:32:28,800 --> 02:32:31,020 And the pattern that we're gonna follow is 3426 02:32:31,020 --> 02:32:33,320 we're gonna have this loop that's gonna do all, 3427 02:32:33,320 --> 02:32:38,320 run once for each thing in some chunk of data, 3428 02:32:38,480 --> 02:32:40,680 and then, but we're gonna set something at the beginning, 3429 02:32:40,680 --> 02:32:42,520 and then we're gonna do something to each one, 3430 02:32:42,520 --> 02:32:44,660 and then at the end we're gonna kinda get the payoff, 3431 02:32:44,660 --> 02:32:46,220 we're gonna get the result. 3432 02:32:46,220 --> 02:32:49,680 So if we're doing sort of summing things, 3433 02:32:49,680 --> 02:32:51,160 we're gonna have a running total, 3434 02:32:51,160 --> 02:32:54,040 and so this'll be like t equals zero, 3435 02:32:54,040 --> 02:32:58,280 and then this'll be t equals t plus the thing value. 3436 02:32:58,280 --> 02:33:00,400 And then, but this is not the real total, 3437 02:33:00,400 --> 02:33:02,480 it's the running total during the loop, 3438 02:33:02,480 --> 02:33:04,500 but at the end it is the real total. 
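The running-total pattern just described (set something before the loop, update it for each item, read the payoff afterwards) might be sketched like this. The three numbers are made up for illustration, and the lecture's t is spelled out as total.

```python
total = 0                      # before the loop: nothing added yet
for thing in [3, 41, 12]:      # made-up data for illustration
    total = total + thing      # during the loop: only a running total
    print(total, thing)
print('Total:', total)         # after the loop: now it is the real total
```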
3439 02:33:05,440 --> 02:33:07,960 And so we're gonna look at what you do 3440 02:33:07,960 --> 02:33:09,680 before the loop starts, during the loop, 3441 02:33:09,680 --> 02:33:12,120 and then what you get after the loop, 3442 02:33:12,120 --> 02:33:13,520 and how you can use that. 3443 02:33:14,600 --> 02:33:16,240 So we're gonna use this loop, 3444 02:33:16,240 --> 02:33:18,440 it's just gonna loop through a set of six numbers 3445 02:33:18,440 --> 02:33:20,840 over and over and over again, right? 3446 02:33:20,840 --> 02:33:22,480 So we're gonna do something before the loop, 3447 02:33:22,480 --> 02:33:23,680 we're gonna do something after the loop, 3448 02:33:23,680 --> 02:33:25,960 and then we're gonna run the loop some number of times, 3449 02:33:25,960 --> 02:33:28,600 and in this case thing is our iteration variable, 3450 02:33:28,600 --> 02:33:32,000 because I'm using mnemonic variables now. 3451 02:33:32,000 --> 02:33:36,960 So it's gonna run 9, 41, 12, 3, 74, and 15, 3452 02:33:36,960 --> 02:33:38,840 so it's gonna run and print these things out. 3453 02:33:38,840 --> 02:33:41,480 So it runs this loop six times, and away we go. 3454 02:33:41,480 --> 02:33:44,200 Now this loop does nothing except print stuff out. 3455 02:33:44,200 --> 02:33:45,800 Of course I like to do that first, 3456 02:33:45,800 --> 02:33:47,660 is always print things out, 3457 02:33:47,660 --> 02:33:50,780 to make sure that sort of my brain is functioning. 
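The skeleton loop over the six numbers can be sketched as follows. The Before and After messages are assumptions standing in for "something before the loop" and "something after the loop".

```python
print('Before')
for thing in [9, 41, 12, 3, 74, 15]:
    print(thing)               # for now the body only prints each value
print('After')
```

Starting from a loop that only prints is a cheap way to confirm that every value is being visited before adding any real logic to the body.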
3458 02:33:52,080 --> 02:33:57,080 So, to kind of understand how these loops work, 3459 02:33:57,540 --> 02:34:00,200 I'm gonna ask you to function as a program, 3460 02:34:00,200 --> 02:34:03,000 and I'm gonna show you some numbers in succession, 3461 02:34:03,000 --> 02:34:06,520 and I want you to mentally figure out 3462 02:34:06,520 --> 02:34:08,320 what the largest number is, but more importantly, 3463 02:34:08,320 --> 02:34:11,320 think about how your brain is solving this problem 3464 02:34:11,320 --> 02:34:12,700 of what is the largest number, 3465 02:34:12,700 --> 02:34:14,100 given that I'm only gonna show them to you 3466 02:34:14,100 --> 02:34:15,920 one at a time for a little while, 3467 02:34:15,920 --> 02:34:17,200 and your brain has to do something, 3468 02:34:17,200 --> 02:34:19,880 and imagine I was gonna show you thousands of numbers, 3469 02:34:19,880 --> 02:34:21,680 I'm not, but imagine I was. 3470 02:34:21,680 --> 02:34:24,440 How would you organize yourself in a way, 3471 02:34:24,440 --> 02:34:26,160 so that for like an hour and a half, 3472 02:34:26,160 --> 02:34:28,200 you could sit here as I showed you numbers, 3473 02:34:28,200 --> 02:34:30,920 and you keep track of the largest number 3474 02:34:30,920 --> 02:34:33,760 that you've seen of all the numbers, okay? 3475 02:34:33,760 --> 02:34:35,920 So here we go, here's your first number, 3476 02:34:39,760 --> 02:34:44,760 second number, third number, fourth number, 3477 02:34:47,860 --> 02:34:52,860 fifth number, sixth and last number. 3478 02:34:52,860 --> 02:34:57,860 What was the largest number, hmm? What was it? 3479 02:34:59,940 --> 02:35:04,780 Well, it wasn't too hard, it was 74, 3480 02:35:04,780 --> 02:35:06,640 but that's not the question. 3481 02:35:06,640 --> 02:35:11,320 How did your brain arrive at 74? 
3482 02:35:11,320 --> 02:35:12,480 So here's all the numbers, 3483 02:35:12,480 --> 02:35:13,920 if I was showing you all the numbers, 3484 02:35:13,920 --> 02:35:17,360 and asked you what's the largest number, 3485 02:35:17,360 --> 02:35:19,280 your eyes would have sort of gone, 3486 02:35:19,280 --> 02:35:22,580 zer, zer, zer, zer, zer, and then you got to 74, 3487 02:35:22,580 --> 02:35:25,720 and you wouldn't do it in any particular order, 3488 02:35:25,720 --> 02:35:27,960 your eyes would just like see the 74, 3489 02:35:27,960 --> 02:35:30,280 and it would just throw smaller numbers away, 3490 02:35:30,280 --> 02:35:32,800 and it would move really quickly to what the answer is. 3491 02:35:32,800 --> 02:35:36,160 Even if there was several hundred numbers on the screen, 3492 02:35:36,160 --> 02:35:39,160 your mind would sort of move fluidly 3493 02:35:39,160 --> 02:35:42,040 wherever it felt like moving, and then arrive at it. 3494 02:35:42,040 --> 02:35:43,840 And probably what it would do is, 3495 02:35:43,840 --> 02:35:45,680 it would do something like, you know, 3496 02:35:45,680 --> 02:35:47,600 kind of move like this, find this, 3497 02:35:47,600 --> 02:35:49,800 and then sort of check to make sure that it's okay, 3498 02:35:49,800 --> 02:35:52,880 and then say like, okay I got 74, I'm done. 3499 02:35:53,900 --> 02:35:55,320 That's not how computers do it, 3500 02:35:55,320 --> 02:35:57,280 that is not how computers do it. 3501 02:35:57,280 --> 02:35:59,240 They do not move fluidly, 3502 02:35:59,240 --> 02:36:01,640 but they are highly dedicated, 3503 02:36:01,640 --> 02:36:02,960 they're gonna do something, 3504 02:36:02,960 --> 02:36:06,600 gee, gee, gee, gee, gee, gee, gee. 3505 02:36:06,600 --> 02:36:11,300 74, but how would you construct a loop to achieve this? 3506 02:36:11,300 --> 02:36:12,400 So let's take a look. 
3507 02:36:13,440 --> 02:36:15,960 You could create a variable called largest so far, 3508 02:36:15,960 --> 02:36:17,780 and this is the largest variable, 3509 02:36:17,780 --> 02:36:19,740 the value that you've seen in the list so far, 3510 02:36:19,740 --> 02:36:21,540 I don't know, I haven't shown you any numbers yet, 3511 02:36:21,540 --> 02:36:24,420 so we'll just set this to negative one to get us started. 3512 02:36:24,420 --> 02:36:27,460 So now, we see three, and we're like, 3513 02:36:27,460 --> 02:36:29,160 oh, that's better than negative one, 3514 02:36:29,160 --> 02:36:30,800 it's our first number, so it's probably the largest 3515 02:36:30,800 --> 02:36:32,620 we've seen so far, right? 3516 02:36:32,620 --> 02:36:35,260 Great, 41, oh, that's bigger than the largest 3517 02:36:35,260 --> 02:36:37,200 we've seen so far, so we'll keep it. 3518 02:36:37,200 --> 02:36:40,320 12 is not bigger than 41, so we're not gonna keep it. 3519 02:36:40,320 --> 02:36:42,300 Notice this keeping thing. 3520 02:36:42,300 --> 02:36:44,360 Nine is not bigger than 41, so there's no point 3521 02:36:44,360 --> 02:36:48,000 to keeping it, 74 is bigger than 41, so we'll keep it. 3522 02:36:48,000 --> 02:36:49,040 Is this the largest number? 3523 02:36:49,040 --> 02:36:51,460 We don't know, we don't know until we're done. 3524 02:36:51,460 --> 02:36:55,340 15, not better than 74, so now, we're all done, 3525 02:36:55,340 --> 02:37:00,100 and hooray, hooray, hooray, we have the largest number. 3526 02:37:00,100 --> 02:37:03,680 And we had this variable that we kept the largest number 3527 02:37:03,680 --> 02:37:06,160 that we'd seen up to this point, 3528 02:37:06,160 --> 02:37:08,960 and then when we know that we're done at the end, 3529 02:37:08,960 --> 02:37:11,460 then that becomes the largest. 
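The keep-the-bigger-one procedure just traced by hand can be sketched as a loop. This is a sketch, assuming the variable names `largest_so_far` and `the_num` and the six numbers in the order they are listed elsewhere in this lecture; the starting value of negative one only works because every number here is positive.

```python
# Before the loop: a starting guess that is smaller than any value we expect.
largest_so_far = -1
print('Before', largest_so_far)
for the_num in [9, 41, 12, 3, 74, 15]:
    # Keep the current number only if it beats the best one seen so far.
    if the_num > largest_so_far:
        largest_so_far = the_num
    print(largest_so_far, the_num)
# After the loop: largest so far is now simply the largest.
print('After', largest_so_far)
```

The loop never looks at all the numbers at once; it only compares the current number with the one value it has kept.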
3530 02:37:12,800 --> 02:37:13,900 So if you look at all the numbers, 3531 02:37:13,900 --> 02:37:15,320 keeping track of the largest so far, 3532 02:37:15,320 --> 02:37:17,060 at the end of all the numbers, the largest so far 3533 02:37:17,060 --> 02:37:19,820 and the largest are the same thing. 3534 02:37:19,820 --> 02:37:22,460 And so that's how you get this idea 3535 02:37:22,460 --> 02:37:24,760 of something you're doing during the loop 3536 02:37:24,760 --> 02:37:28,160 is not really the answer, but by the time the loop is done, 3537 02:37:28,160 --> 02:37:30,360 you will have the answer. 3538 02:37:30,360 --> 02:37:32,440 And so here's a bit of code that does this. 3539 02:37:32,440 --> 02:37:35,040 Use it with our numbers, right? 3540 02:37:35,040 --> 02:37:36,780 So let's take a look. 3541 02:37:36,780 --> 02:37:38,440 So I have this variable called largest so far, 3542 02:37:38,440 --> 02:37:41,040 I set it to negative one, before the loop. 3543 02:37:41,040 --> 02:37:42,880 Remember, there's a loop before and a loop after 3544 02:37:42,880 --> 02:37:45,320 and loop in the middle, before it's negative one. 3545 02:37:45,320 --> 02:37:49,060 So now the_num, remember underscores are okay, 3546 02:37:49,060 --> 02:37:51,060 that's my iteration variable. 3547 02:37:51,060 --> 02:37:53,700 If nine is greater than largest so far, 3548 02:37:53,700 --> 02:37:55,100 well largest so far is negative one, 3549 02:37:55,100 --> 02:37:56,800 so that's true, so this code's gonna run. 3550 02:37:56,800 --> 02:37:59,300 So we're gonna remember the new number. 3551 02:37:59,300 --> 02:38:02,860 So this is nine, and so nine ends up in largest so far, 3552 02:38:02,860 --> 02:38:05,900 and then we print it out, and so largest so far is nine 3553 02:38:05,900 --> 02:38:08,280 after we saw the number nine. 3554 02:38:08,280 --> 02:38:09,780 Then we do it again. 3555 02:38:09,780 --> 02:38:13,960 Do it again, so now 41 comes in, and is 41 greater than nine?
3556 02:38:15,860 --> 02:38:18,600 The answer is yes it is, so we're gonna run this code, 3557 02:38:18,600 --> 02:38:23,340 copy 41 into largest so far, and then print it out, 3558 02:38:23,340 --> 02:38:27,500 and largest so far is 41 after we saw the number 41. 3559 02:38:29,000 --> 02:38:32,180 Now we're gonna run the loop again with 12, okay? 3560 02:38:32,180 --> 02:38:33,740 And you get the idea, I hope. 3561 02:38:33,740 --> 02:38:36,340 Is 12 greater than 41, which is the largest we've seen 3562 02:38:36,340 --> 02:38:39,620 so far, and the answer is no it is not, so we skip. 3563 02:38:39,620 --> 02:38:43,580 So the largest so far stays 41 even though we saw 12, 3564 02:38:43,580 --> 02:38:45,780 meaning we're sort of like ratcheting up, 3565 02:38:45,780 --> 02:38:48,060 but we never ratchet back down. 3566 02:38:48,060 --> 02:38:51,060 So we run it again with three and 41, 3567 02:38:52,420 --> 02:38:56,560 and we skip this, and then the largest so far is 41 3568 02:38:56,560 --> 02:38:59,240 even though we just saw three, 3569 02:38:59,240 --> 02:39:03,200 and now we see 74 is 74 greater than 41. 3570 02:39:03,200 --> 02:39:05,300 See, we never are looking at all the numbers. 3571 02:39:05,300 --> 02:39:07,140 We're only looking at the window on the numbers 3572 02:39:07,140 --> 02:39:10,480 of the current number that we're looking at. 3573 02:39:10,480 --> 02:39:13,020 So is 74 greater than 41? 3574 02:39:13,020 --> 02:39:15,220 The answer is yes, so we run this code, 3575 02:39:15,220 --> 02:39:17,580 and then we capture the 74. 3576 02:39:17,580 --> 02:39:22,060 So we've seen, we just saw 74, and it is the largest so far. 3577 02:39:22,060 --> 02:39:24,800 And then we run it again with 15, 3578 02:39:24,800 --> 02:39:28,600 but 74 is our largest so far, and so it skips. 
3579 02:39:28,600 --> 02:39:32,440 So 74 remains largest so far after 15, 3580 02:39:32,440 --> 02:39:33,600 and now we're finished, 3581 02:39:33,600 --> 02:39:35,140 because we just ran the last thing, 3582 02:39:35,140 --> 02:39:37,200 the for loop takes care of everything, 3583 02:39:37,200 --> 02:39:38,880 and jumps to this print statement, and says, 3584 02:39:38,880 --> 02:39:41,340 afterwards, largest so far is 74, 3585 02:39:41,340 --> 02:39:45,420 but at this point, it's also the largest, right? 3586 02:39:45,420 --> 02:39:49,580 So largest so far became largest when our loop finished. 3587 02:39:50,880 --> 02:39:52,680 So that sort of gives you this notion 3588 02:39:52,680 --> 02:39:56,760 of how we construct something at the beginning, 3589 02:39:57,620 --> 02:39:58,920 some kind of thing that we're gonna do 3590 02:39:58,920 --> 02:40:00,860 over and over and over again, 3591 02:40:00,860 --> 02:40:03,060 and then something at the end. 3592 02:40:03,060 --> 02:40:04,360 And we put some print statements in 3593 02:40:04,360 --> 02:40:08,840 just so we can watch it and see what's going on. 3594 02:40:08,840 --> 02:40:10,980 So coming up next, we're gonna talk about 3595 02:40:10,980 --> 02:40:13,280 some more loop patterns, some counting, 3596 02:40:13,280 --> 02:40:17,440 totaling, averaging, and finding the smallest number. 3597 02:40:20,580 --> 02:40:22,280 So now we're gonna look at some more patterns 3598 02:40:22,280 --> 02:40:24,660 of the different things we can do at the top of the loop, 3599 02:40:24,660 --> 02:40:26,360 in the middle of the loop, and at the bottom of the loop. 3600 02:40:26,360 --> 02:40:28,460 And the first one we're going to do is counting. 3601 02:40:28,460 --> 02:40:31,960 Now we're gonna take a look at the number of something, 3602 02:40:31,960 --> 02:40:33,300 the number of things in our list.
3603 02:40:33,300 --> 02:40:35,500 Now we could just inspect it and see six, 3604 02:40:35,500 --> 02:40:37,960 but you'll have for loops like you're reading through 3605 02:40:37,960 --> 02:40:41,600 a file or scanning through some data. 3606 02:40:41,600 --> 02:40:43,900 And so the notion of counting, 3607 02:40:43,900 --> 02:40:46,680 but you have to assume that you don't really know 3608 02:40:46,680 --> 02:40:47,840 dot, dot, dot, dot, dot, 3609 02:40:47,840 --> 02:40:50,240 that there's gonna be a lot more than just six. 3610 02:40:50,240 --> 02:40:51,940 But for now, we're just gonna do six, 3611 02:40:51,940 --> 02:40:53,380 and we're gonna count how many things 3612 02:40:53,380 --> 02:40:55,720 that we see in this loop. 3613 02:40:55,720 --> 02:40:57,660 And the pattern is simple. 3614 02:40:57,660 --> 02:41:00,860 You set a variable, zork, to zero at the beginning. 3615 02:41:00,860 --> 02:41:03,920 We often call this variable count in mnemonic. 3616 02:41:03,920 --> 02:41:06,220 And now we're gonna run this loop six times. 3617 02:41:06,220 --> 02:41:08,600 One, two, three, four, five, six. 3618 02:41:08,600 --> 02:41:10,600 And each time through, we're just gonna add one to zork. 3619 02:41:10,600 --> 02:41:13,440 So zork starts at zero, then it goes one, two, three, 3620 02:41:13,440 --> 02:41:14,760 four, five, six. 3621 02:41:14,760 --> 02:41:16,000 And we're gonna print it out. 3622 02:41:16,000 --> 02:41:18,700 So we see the nine and zork is one. 3623 02:41:18,700 --> 02:41:20,380 See 41, zork is two. 3624 02:41:20,380 --> 02:41:21,880 And in the end, zork is six 3625 02:41:21,880 --> 02:41:24,380 when we see the 15, and the for stops. 3626 02:41:24,380 --> 02:41:25,520 And we print out afterwards. 3627 02:41:25,520 --> 02:41:30,220 And this then is six is then the ultimate count that we got. 3628 02:41:30,220 --> 02:41:32,280 So that's very, very simple.
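The counting pattern just described can be sketched like this. The variable name `zork` and the iteration variable `thing` follow the narration; the list literal is assumed to be the same six numbers used throughout.

```python
# Before the loop: nothing has been counted yet.
zork = 0
print('Before', zork)
for thing in [9, 41, 12, 3, 74, 15]:
    # During the loop: add one each time the body runs.
    zork = zork + 1
    print(zork, thing)
# After the loop: zork holds how many times the body ran.
print('After', zork)
```

A more mnemonic name such as `count` would behave identically; the name is for the reader, not for Python.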
3629 02:41:32,280 --> 02:41:36,460 The pattern is that set it to zero at the beginning, 3630 02:41:36,460 --> 02:41:39,260 add one to it, and if you run that enough times, 3631 02:41:39,260 --> 02:41:42,600 then this is how many times that happened. 3632 02:41:42,600 --> 02:41:46,160 And in a sense, it's how many times this line ran, right? 3633 02:41:46,160 --> 02:41:48,100 Sometimes you put this in an if statement, 3634 02:41:48,100 --> 02:41:50,640 et cetera, et cetera, et cetera, okay? 3635 02:41:52,700 --> 02:41:53,540 Oops. 3636 02:41:54,640 --> 02:41:57,420 Now, we can do the same thing to get a total. 3637 02:41:57,420 --> 02:42:00,820 And the way the total works is you compute a running total 3638 02:42:00,820 --> 02:42:03,420 of the number of the items that you've seen so far. 3639 02:42:03,420 --> 02:42:04,780 And at the end, the running total 3640 02:42:04,780 --> 02:42:06,820 in effect becomes the total. 3641 02:42:08,420 --> 02:42:09,660 A better variable name for this 3642 02:42:09,660 --> 02:42:11,380 would be like sum or total or something. 3643 02:42:11,380 --> 02:42:13,220 But zork, I'll use zork again. 3644 02:42:13,220 --> 02:42:16,460 So you set zork to zero, and it starts up. 3645 02:42:16,460 --> 02:42:19,300 The total we've seen so far is indeed zero. 3646 02:42:19,300 --> 02:42:22,020 And then we're gonna run this one, two, three, four, 3647 02:42:22,020 --> 02:42:23,140 five, six times. 3648 02:42:23,140 --> 02:42:25,700 And thing is gonna be the iteration variable. 3649 02:42:25,700 --> 02:42:27,700 It's gonna take on the successive values. 3650 02:42:27,700 --> 02:42:29,300 And each time through, we're just gonna take 3651 02:42:29,300 --> 02:42:32,740 our running total and add to it the thing we've seen. 3652 02:42:32,740 --> 02:42:35,340 So we see nine, and the running total is nine. 3653 02:42:35,340 --> 02:42:37,980 We see 41, and the running total becomes 50. 
3654 02:42:37,980 --> 02:42:40,580 We see 12, the running total becomes 62. 3655 02:42:40,580 --> 02:42:43,780 We get a three, it becomes 65, we get 74, 3656 02:42:43,780 --> 02:42:45,500 running total is 139. 3657 02:42:45,500 --> 02:42:48,060 How many more, how many more are we gonna see? 3658 02:42:48,060 --> 02:42:49,920 We don't know, it could be a million, could be one. 3659 02:42:49,920 --> 02:42:50,940 Oh, it's only one. 3660 02:42:50,940 --> 02:42:53,700 We get a 15, our running total is 154. 3661 02:42:53,700 --> 02:42:55,820 And what's true at any moment here 3662 02:42:55,820 --> 02:42:59,680 is the running total is right, up of what we've seen so far. 3663 02:42:59,680 --> 02:43:02,940 Now, when we're done, the for loop quits for us, 3664 02:43:02,940 --> 02:43:06,880 and afterwards 154 is indeed the total. 3665 02:43:06,880 --> 02:43:09,500 So the running total while we're in the loop, 3666 02:43:09,500 --> 02:43:11,620 at the end of the loop, after the end of the loop, 3667 02:43:11,620 --> 02:43:13,940 we have the actual total. 3668 02:43:13,940 --> 02:43:17,340 So it's not very difficult to convert this to the average, 3669 02:43:17,340 --> 02:43:18,740 because we've calculated the count, 3670 02:43:18,740 --> 02:43:20,580 and we've calculated the running total, 3671 02:43:20,580 --> 02:43:22,300 and now we're gonna have the average 3672 02:43:22,300 --> 02:43:24,780 by simply dividing those, okay? 3673 02:43:24,780 --> 02:43:29,400 So, now this time I've used mnemonic variables. 3674 02:43:29,400 --> 02:43:30,900 Don't get confused by this, 3675 02:43:30,900 --> 02:43:33,360 mnemonic variables are just friendly names I chose 3676 02:43:33,360 --> 02:43:35,280 for you to read the code easier. 3677 02:43:35,280 --> 02:43:37,660 I am not communicating to Python in any way 3678 02:43:37,660 --> 02:43:42,060 by naming this count and sum, but count and sum is nice. 
3679 02:43:42,060 --> 02:43:44,680 Okay, so I set count to zero and sum to zero, 3680 02:43:44,680 --> 02:43:46,220 oh, go back up. 3681 02:43:46,220 --> 02:43:48,980 I set count to zero and sum to zero at the beginning, 3682 02:43:48,980 --> 02:43:51,340 and the count is zero and the sum is zero, 3683 02:43:51,340 --> 02:43:53,140 and then I'm gonna run this loop six times, 3684 02:43:53,140 --> 02:43:54,900 one, two, three, four, five, six, 3685 02:43:54,900 --> 02:43:59,260 and each time value is the iteration variable. 3686 02:43:59,260 --> 02:44:01,140 I count, every time I run the loop, 3687 02:44:01,140 --> 02:44:02,980 I count equals count plus one, 3688 02:44:02,980 --> 02:44:04,300 sum equals sum plus value, 3689 02:44:04,300 --> 02:44:06,980 so I have a running count and a running total, 3690 02:44:06,980 --> 02:44:09,620 and they show up here, one, two, three, four, five, six, 3691 02:44:09,620 --> 02:44:10,940 and then the running total, 3692 02:44:10,940 --> 02:44:12,620 and then at some point the for loop, 3693 02:44:12,620 --> 02:44:15,220 we do the last one and the for loop jumps out, 3694 02:44:15,220 --> 02:44:19,420 and it divides, 6 and 154 are the count and running total, 3695 02:44:19,420 --> 02:44:22,820 and then it divides the average, sum over count, okay? 3696 02:44:22,820 --> 02:44:25,380 So that's just, again, a pattern of something 3697 02:44:25,380 --> 02:44:27,460 in the beginning, something in the middle, 3698 02:44:27,460 --> 02:44:28,720 something in the end. 3699 02:44:31,260 --> 02:44:33,040 Another kind of thing we tend to do in loops 3700 02:44:33,040 --> 02:44:36,500 is we look for things, we hunt for things, 3701 02:44:36,500 --> 02:44:38,860 and so this is where we have an if statement 3702 02:44:38,860 --> 02:44:40,220 inside of a loop, and of course, 3703 02:44:40,220 --> 02:44:42,220 I've created a silly, simple thing.
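The averaging loop walked through above combines the counting and totaling patterns. A minimal sketch, assuming the same six numbers; the narration calls the total `sum`, and `total` is swapped in here only to avoid shadowing Python's built-in `sum` function.

```python
# Before the loop: no items counted, nothing added up.
count = 0
total = 0
print('Before', count, total)
for value in [9, 41, 12, 3, 74, 15]:
    # During the loop: keep a running count and a running total.
    count = count + 1
    total = total + value
    print(count, total, value)
# After the loop: the average is the total divided by the count.
average = total / count
print('After', count, total, average)
```

The division happens once, after the loop, because only then are the count and the total complete.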
3704 02:44:42,220 --> 02:44:46,920 In this code, I am looking for large values 3705 02:44:46,920 --> 02:44:48,560 that are values that are greater than 20, 3706 02:44:48,560 --> 02:44:51,220 and again, don't think of this as just six numbers, 3707 02:44:51,220 --> 02:44:52,660 but I'm looking for all the values, 3708 02:44:52,660 --> 02:44:54,020 and I'm gonna print them out. 3709 02:44:54,020 --> 02:44:56,940 So, you know, it says before, it's gonna run this, 3710 02:44:56,940 --> 02:44:59,940 nine, well, if nine's greater than 20, it's false, 3711 02:44:59,940 --> 02:45:03,500 so it goes back up, 41, true, 3712 02:45:03,500 --> 02:45:05,540 so it prints out 41, then goes back up, 3713 02:45:05,540 --> 02:45:08,980 12, false, goes back up, 3714 02:45:08,980 --> 02:45:12,820 three, false, goes back up, 74, true, 3715 02:45:12,820 --> 02:45:15,740 so it runs this, so out comes that little print statement, 3716 02:45:15,740 --> 02:45:18,500 goes back up, and then 15 is the last one, 3717 02:45:18,500 --> 02:45:20,420 and that's false, it goes back up, 3718 02:45:20,420 --> 02:45:23,500 and the for says we're done, and then we do afterwards, 3719 02:45:23,500 --> 02:45:26,540 and so this is just the notion of having 3720 02:45:26,540 --> 02:45:30,980 an if statement inside of a for loop, 3721 02:45:30,980 --> 02:45:34,220 where we're sort of picking, or choosing, 3722 02:45:34,220 --> 02:45:37,120 or selecting, or looking for something 3723 02:45:37,120 --> 02:45:40,100 in a large set of things that we're looping through.
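The filtering loop traced above can be sketched as an if statement inside a for loop. This assumes the same six numbers; the message text and the `large_values` list are hypothetical additions, the list being there only so the result can be inspected after the loop.

```python
large_values = []  # hypothetical helper, not part of the narrated loop
print('Before')
for value in [9, 41, 12, 3, 74, 15]:
    # Only values greater than 20 get past this test.
    if value > 20:
        print('Large number', value)
        large_values.append(value)
print('After')
```

With these six numbers, only 41 and 74 pass the test; the other four iterations go straight back to the top of the loop.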
3724 02:45:41,780 --> 02:45:45,920 We can also say I wanna know if a particular value is there, 3725 02:45:45,920 --> 02:45:48,420 and so we're gonna use a Boolean variable, 3726 02:45:48,420 --> 02:45:52,420 and we've talked about integer variables like one, 42, 3727 02:45:52,420 --> 02:45:55,020 and then floating point variables like 98.6, 3728 02:45:55,020 --> 02:45:57,740 and then string variables like hello world, 3729 02:45:57,740 --> 02:45:58,580 that have quotes in them. 3730 02:45:58,580 --> 02:46:02,460 This is a fourth type, type, a kind of variable. 3731 02:46:03,460 --> 02:46:06,620 It's called a Boolean variable, and it only has two values. 3732 02:46:06,620 --> 02:46:10,220 It has true and false. 3733 02:46:10,220 --> 02:46:12,180 Matter of fact, these if statements, 3734 02:46:12,180 --> 02:46:15,300 they return Boolean values, value equal equal three, 3735 02:46:15,300 --> 02:46:17,940 that is returning a true or a false 3736 02:46:17,940 --> 02:46:21,220 based on the value of value. 3737 02:46:21,220 --> 02:46:23,700 There's a mnemonic confusion there, but I'm using, 3738 02:46:23,700 --> 02:46:26,220 so I'm gonna make a variable called found, 3739 02:46:26,220 --> 02:46:27,840 and that's a decent name for a variable, 3740 02:46:27,840 --> 02:46:29,660 so don't get hung up on that, 3741 02:46:29,660 --> 02:46:32,920 and I'm gonna initially say found is gonna indicate to me 3742 02:46:32,920 --> 02:46:35,780 whether or not I found a three in my list, 3743 02:46:35,780 --> 02:46:37,700 and I'm gonna start before the loop starts, 3744 02:46:37,700 --> 02:46:40,620 let's say false, because we haven't found anything yet, 3745 02:46:40,620 --> 02:46:44,040 so found equals false, and so at the beginning of the loop, 3746 02:46:44,040 --> 02:46:47,020 found is false, before the loop starts, found is false, 3747 02:46:47,020 --> 02:46:49,580 and now we're gonna run this loop a bunch of times. 3748 02:46:49,580 --> 02:46:51,320 Nine, is that true?
3749 02:46:51,320 --> 02:46:52,700 No, skip. 3750 02:46:52,700 --> 02:46:54,580 41, is that true? 3751 02:46:54,580 --> 02:46:56,140 Skip. 3752 02:46:56,140 --> 02:46:59,900 12, skip, right, so nine, 41, 12, 3753 02:46:59,900 --> 02:47:01,640 and found has remained false, 3754 02:47:01,640 --> 02:47:03,360 because we haven't done anything to it, 3755 02:47:03,360 --> 02:47:06,140 but now in comes a three, and this becomes true, 3756 02:47:06,140 --> 02:47:09,280 so it runs this code, so found becomes true, 3757 02:47:09,280 --> 02:47:10,740 and then we print it, and you'll notice 3758 02:47:10,740 --> 02:47:13,340 that when we see a three, we get true, 3759 02:47:13,340 --> 02:47:16,420 and then it runs again, we get 74, it's still false, 3760 02:47:16,420 --> 02:47:19,760 15, it's still false, run, run, run, quit, 3761 02:47:19,760 --> 02:47:22,800 and the residual afterwards is true, 3762 02:47:22,800 --> 02:47:25,040 and in fact, if you didn't know any of this, 3763 02:47:25,040 --> 02:47:26,760 and you don't print that out, 3764 02:47:26,760 --> 02:47:28,660 all you know is that afterwards, 3765 02:47:28,660 --> 02:47:30,260 we loop through all those things, 3766 02:47:30,260 --> 02:47:32,840 and we know that there was a three in there. 
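The search loop just traced can be sketched with a Boolean flag. The names `found` and `value` follow the narration; the list is assumed to be the same six numbers.

```python
# Before the loop: we have not found a 3 yet.
found = False
print('Before', found)
for value in [9, 41, 12, 3, 74, 15]:
    # The comparison itself produces True or False.
    if value == 3:
        found = True
    print(found, value)
# After the loop: found tells us whether a 3 was anywhere in the data.
print('After', found)
```

Nothing in the loop ever sets `found` back to `False`, so once it flips to `True` it stays that way for every later value.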
3767 02:47:32,840 --> 02:47:36,500 That's what we're doing, so we searched all of them, 3768 02:47:36,500 --> 02:47:39,460 we checked for threes when we found a three, 3769 02:47:39,460 --> 02:47:43,940 and you can see basically that the found remains false 3770 02:47:43,940 --> 02:47:45,460 until it flips to true, 3771 02:47:45,460 --> 02:47:47,340 but then there's nothing to set it back to false, 3772 02:47:47,340 --> 02:47:48,220 there's nothing in this loop 3773 02:47:48,220 --> 02:47:49,700 that's gonna set it back to false, 3774 02:47:49,700 --> 02:47:52,480 so once it sort of catches the three, 3775 02:47:52,480 --> 02:47:54,860 then it remains true for the rest of the loop, 3776 02:47:54,860 --> 02:47:57,500 and then it just finds its way out. 3777 02:47:57,500 --> 02:47:59,640 Now if you wanna think about it for a moment, 3778 02:47:59,640 --> 02:48:03,180 ask yourself, how might we make this loop more efficient 3779 02:48:03,180 --> 02:48:06,380 by putting a statement right in here? 3780 02:48:06,380 --> 02:48:10,540 Think about a way to, once you've found it, 3781 02:48:10,540 --> 02:48:14,500 and it's true, there is sort of no reason to keep on going, 3782 02:48:14,500 --> 02:48:18,540 so what would you put there to perhaps make this loop, 3783 02:48:18,540 --> 02:48:21,160 to look for threes, just to tell you whether or not 3784 02:48:21,160 --> 02:48:23,980 there was at least one three in there, 3785 02:48:23,980 --> 02:48:25,220 how to make that more efficient? 3786 02:48:25,220 --> 02:48:26,320 Just think about that. 3787 02:48:28,260 --> 02:48:32,820 Okay, so now let's look back at the largest value 3788 02:48:32,820 --> 02:48:34,820 that we started out with, right? 3789 02:48:34,820 --> 02:48:36,720 And so if you think about this, 3790 02:48:36,720 --> 02:48:41,140 let's kind of give it a sort of a rough look here. 
3791 02:48:41,140 --> 02:48:44,180 Largest so far is our kind of, like a running total, 3792 02:48:44,180 --> 02:48:47,660 but it's our hypothesis is the best large number. 3793 02:48:47,660 --> 02:48:49,560 And we have this if statement that says, 3794 02:48:49,560 --> 02:48:51,500 if the number we just see right now 3795 02:48:51,500 --> 02:48:54,820 is greater than the largest so far, then capture it, right? 3796 02:48:54,820 --> 02:48:56,700 Take whatever number we saw and capture it. 3797 02:48:56,700 --> 02:49:00,040 So when we see a nine, it's better, we capture it. 3798 02:49:00,040 --> 02:49:02,600 We see a 41, it's better, we capture it. 3799 02:49:02,600 --> 02:49:04,180 We don't capture this, we don't capture this, 3800 02:49:04,180 --> 02:49:07,340 we capture the 74, and we don't capture the 15, 3801 02:49:07,340 --> 02:49:08,220 and that's how we do it. 3802 02:49:08,220 --> 02:49:10,740 So you could think of this as better. 3803 02:49:10,740 --> 02:49:15,500 When the number we're looking at is greater 3804 02:49:15,500 --> 02:49:18,780 than our working hypothesis of the largest, 3805 02:49:18,780 --> 02:49:20,180 we grab it because it's better. 3806 02:49:20,180 --> 02:49:25,180 So this line right here is the grab line, grab it, okay? 3807 02:49:28,100 --> 02:49:31,540 So then the question is how would you modify this code 3808 02:49:31,540 --> 02:49:33,820 to teach it to find the smallest value 3809 02:49:33,820 --> 02:49:35,420 in this list of numbers? 3810 02:49:36,960 --> 02:49:39,260 Think of it as you have a starting number, 3811 02:49:39,260 --> 02:49:43,940 you have a sort of what's better in this grabbing notion. 3812 02:49:43,940 --> 02:49:45,420 How could you do that? 3813 02:49:45,420 --> 02:49:46,260 Take a look. 3814 02:49:51,940 --> 02:49:53,940 Okay, so let's take a look. 3815 02:49:53,940 --> 02:49:55,700 So let's do a couple things. 
3816 02:49:55,700 --> 02:49:59,700 Like if you look at this if statement that's better, 3817 02:49:59,700 --> 02:50:03,020 well, it's better now if the number is less than. 3818 02:50:03,020 --> 02:50:05,700 So if the, but then we should probably change this 3819 02:50:05,700 --> 02:50:08,660 to be smallest so far, smallest so far, 3820 02:50:08,660 --> 02:50:11,080 smallest so far, smallest so far, 3821 02:50:11,080 --> 02:50:13,900 smallest so far, smallest so far, right? 3822 02:50:14,820 --> 02:50:16,520 Matter of fact, that's what this is. 3823 02:50:16,520 --> 02:50:20,060 We've changed the word largest so far to smallest so far, 3824 02:50:20,060 --> 02:50:24,940 and we've changed the greater than to a less than. 3825 02:50:24,940 --> 02:50:26,140 Is that gonna fix it? 3826 02:50:27,780 --> 02:50:30,820 Give you a second to look at it, pause if you need. 3827 02:50:30,820 --> 02:50:31,740 It's not gonna fix it. 3828 02:50:31,740 --> 02:50:33,700 It's not gonna find our smallest number. 3829 02:50:33,700 --> 02:50:38,700 The answer is, of course, no, it's not. 3830 02:50:41,040 --> 02:50:43,040 So if we run this code, 3831 02:50:43,040 --> 02:50:45,000 so we set the smallest so far to negative one 3832 02:50:45,000 --> 02:50:46,400 and it starts out negative one. 3833 02:50:46,400 --> 02:50:49,160 We run it, and it's nine. 3834 02:50:49,160 --> 02:50:51,680 Is nine less than negative one? 3835 02:50:51,680 --> 02:50:53,100 No, it's not. 3836 02:50:53,100 --> 02:50:57,140 So after we see a nine, the smallest so far is negative one. 3837 02:50:57,140 --> 02:50:58,760 Now we're gonna run 41. 3838 02:50:58,760 --> 02:51:01,320 Is 41 less than negative one? 3839 02:51:01,320 --> 02:51:04,320 No, it is not. 3840 02:51:04,320 --> 02:51:06,160 So the smallest so far is still negative one. 3841 02:51:06,160 --> 02:51:08,160 As a matter of fact, it isn't the smallest so far anymore. 
3842 02:51:08,160 --> 02:51:09,840 Just because we named it smallest so far 3843 02:51:09,840 --> 02:51:11,920 doesn't mean it is the smallest so far. 3844 02:51:11,920 --> 02:51:13,720 It didn't work out so well. 3845 02:51:13,720 --> 02:51:16,240 And so you see that none of these, 3846 02:51:16,240 --> 02:51:18,400 because they're never less than negative one, 3847 02:51:18,400 --> 02:51:20,400 do anything, and we claim that afterwards, 3848 02:51:20,400 --> 02:51:23,120 the smallest we've seen so far is negative one. 3849 02:51:23,120 --> 02:51:25,620 And that is because, of course, 3850 02:51:25,620 --> 02:51:28,600 negative one is smaller than any of the numbers that we saw. 3851 02:51:28,600 --> 02:51:31,620 So how could we fix this? 3852 02:51:31,620 --> 02:51:34,120 Well, if we started the smallest so far 3853 02:51:34,120 --> 02:51:37,260 with some like arbitrary big number, then it'd be better. 3854 02:51:37,260 --> 02:51:39,720 So if we made this 100, whoops, come back. 3855 02:51:41,540 --> 02:51:44,420 If we made this be like 100, that'd be good, 3856 02:51:44,420 --> 02:51:46,380 because the first time through the nine 3857 02:51:46,380 --> 02:51:48,780 would be less than 100, so we would capture the nine 3858 02:51:48,780 --> 02:51:51,500 and then the rest of the loop would work just fine. 3859 02:51:51,500 --> 02:51:53,600 But then what if we didn't know 3860 02:51:53,600 --> 02:51:54,660 how big these numbers were? 3861 02:51:54,660 --> 02:51:57,340 As a matter of fact, the largest so far wouldn't have worked 3862 02:51:57,340 --> 02:51:59,700 if all the numbers were negative. 3863 02:51:59,700 --> 02:52:01,500 Think about that. 3864 02:52:01,500 --> 02:52:02,940 We just assumed they were positive, 3865 02:52:02,940 --> 02:52:04,620 and so we kind of wrote lazy code 3866 02:52:04,620 --> 02:52:06,020 that assumed all numbers were positive. 
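A small sketch of why the negative-one starting value fails in both directions. The first loop is the renamed smallest-so-far version described above; the second list of all-negative numbers is a hypothetical input chosen only to show the matching flaw in the largest-so-far loop.

```python
# Flipping > to < but keeping the -1 starting value: nothing is ever smaller.
smallest_so_far = -1
for the_num in [9, 41, 12, 3, 74, 15]:
    if the_num < smallest_so_far:
        smallest_so_far = the_num
print('After', smallest_so_far)  # reports the starting value, which is not in the list

# The same starting value breaks the largest loop when every number is negative.
largest_so_far = -1
for the_num in [-9, -41, -12]:  # hypothetical all-negative data
    if the_num > largest_so_far:
        largest_so_far = the_num
print('After', largest_so_far)  # again the starting value, not a real data point
```

In both cases the answer printed is the made-up starting value rather than anything that appeared in the data.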
3867 02:52:06,020 --> 02:52:07,260 That might not be a good assumption 3868 02:52:07,260 --> 02:52:10,520 depending on the numbers that you're dealing with, right? 3869 02:52:10,520 --> 02:52:12,420 So maybe 100's a good number to start with, 3870 02:52:12,420 --> 02:52:16,500 or maybe like 1,000, or 10,000, 3871 02:52:16,500 --> 02:52:19,980 or like some number with lots of zeros in it. 3872 02:52:19,980 --> 02:52:21,940 How big should we make this? 3873 02:52:21,940 --> 02:52:24,300 And the answer is we're kind of solving 3874 02:52:24,300 --> 02:52:26,500 this problem the wrong way. 3875 02:52:26,500 --> 02:52:29,940 And the thing we really want to do to solve the problem 3876 02:52:29,940 --> 02:52:34,940 is to just accept the fact that if we're looking 3877 02:52:35,100 --> 02:52:38,380 for the smallest number so far, 3878 02:52:38,380 --> 02:52:42,740 that the right hypothesis is the first number. 3879 02:52:42,740 --> 02:52:46,080 And if we just knew what that first number was, the nine, 3880 02:52:46,080 --> 02:52:49,500 that would either, because it's the first number, 3881 02:52:49,500 --> 02:52:52,380 we know that it's both the largest so far 3882 02:52:52,380 --> 02:52:54,560 and the smallest so far, as soon as you see the first number. 3883 02:52:54,560 --> 02:52:57,440 But we don't know here before the loop starts 3884 02:52:57,440 --> 02:52:58,640 what that first number is. 3885 02:52:58,640 --> 02:53:01,220 I mean, you can look at it, but assume this is just data 3886 02:53:01,220 --> 02:53:02,640 that's coming from somewhere else, 3887 02:53:02,640 --> 02:53:04,940 and we don't know it until we start reading it. 3888 02:53:04,940 --> 02:53:07,680 So we have to construct a loop that deals with the fact 3889 02:53:07,680 --> 02:53:10,400 that we want to capture the first value 3890 02:53:10,400 --> 02:53:13,400 as our hypothesis for smallest so far. 3891 02:53:14,580 --> 02:53:15,860 So how do we do that? 
3892 02:53:15,860 --> 02:53:16,700 Let's take a look. 3893 02:53:17,700 --> 02:53:22,700 So what we do is we use yet another type. 3894 02:53:22,700 --> 02:53:25,900 So we have integer, floating point, string, Boolean, 3895 02:53:25,900 --> 02:53:27,940 and now we have a thing called the none type. 3896 02:53:27,940 --> 02:53:32,380 None type is a special marker in that it only has one value. 3897 02:53:32,380 --> 02:53:34,260 Boolean has true and false. 3898 02:53:34,260 --> 02:53:37,100 You know, floating point has an infinite number of values 3899 02:53:37,100 --> 02:53:38,700 and integer has an infinite number of values, 3900 02:53:38,700 --> 02:53:41,820 but none type has one value, none. 3901 02:53:41,820 --> 02:53:42,860 None is a constant. 3902 02:53:42,860 --> 02:53:45,540 Capital none is a constant. 3903 02:53:47,260 --> 02:53:49,340 The difference is, is we can check to see 3904 02:53:49,340 --> 02:53:51,060 if we have stored none. 3905 02:53:51,060 --> 02:53:54,460 None is often used to indicate emptiness. 3906 02:53:54,460 --> 02:53:58,380 Not non-existence, because smallest doesn't exist 3907 02:53:58,380 --> 02:54:00,060 until we assign it, but we're gonna assign it 3908 02:54:00,060 --> 02:54:03,500 to like a mark, a flag, a marker. 3909 02:54:03,500 --> 02:54:06,180 Some way to say, oh, this is not even a number. 3910 02:54:06,180 --> 02:54:08,060 It's nothing. 3911 02:54:08,060 --> 02:54:09,580 And so we're gonna, and you can do this. 3912 02:54:09,580 --> 02:54:12,340 So that's like, makes a variable called smallest, 3913 02:54:12,340 --> 02:54:14,460 and then it puts none. 3914 02:54:14,460 --> 02:54:15,300 It sticks it right in. 3915 02:54:15,300 --> 02:54:16,140 It's not a string none. 3916 02:54:16,140 --> 02:54:19,740 It's like a special type, okay? 
3917 02:54:19,740 --> 02:54:22,540 So that actually captures the notion 3918 02:54:22,540 --> 02:54:24,340 that before the loop starts, 3919 02:54:24,340 --> 02:54:27,860 the smallest number that we've seen so far is none. 3920 02:54:27,860 --> 02:54:29,980 We haven't seen any numbers, okay? 3921 02:54:31,980 --> 02:54:35,780 So, then we come in and we have an if statement. 3922 02:54:35,780 --> 02:54:38,340 And we have a new operator called is. 3923 02:54:38,340 --> 02:54:40,940 Is is stronger than equal sign. 3924 02:54:40,940 --> 02:54:44,660 And so if smallest is none, that becomes true. 3925 02:54:44,660 --> 02:54:46,300 It runs this case. 3926 02:54:46,300 --> 02:54:48,820 And so then what it does is it copies this first value, 3927 02:54:48,820 --> 02:54:50,900 which is nine, into smallest. 3928 02:54:50,900 --> 02:54:53,780 And so we see a nine and a smallest so far is nine, 3929 02:54:53,780 --> 02:54:55,380 which is the first value. 3930 02:54:55,380 --> 02:54:57,420 And again, we're assuming we don't know 3931 02:54:57,420 --> 02:54:59,940 what the first value is before the loop starts. 3932 02:54:59,940 --> 02:55:02,400 So we use the first iteration through the loop 3933 02:55:02,400 --> 02:55:05,780 as the moment where we capture that, okay? 3934 02:55:05,780 --> 02:55:10,340 So smallest is the value, 3935 02:55:10,340 --> 02:55:12,340 and then we print it and we go back up. 3936 02:55:12,340 --> 02:55:15,380 And now it runs again with 41. 3937 02:55:15,380 --> 02:55:17,300 41 is not none. 3938 02:55:17,300 --> 02:55:19,300 None is, there's only one thing that's none. 3939 02:55:19,300 --> 02:55:22,020 So it is not equal to none. 3940 02:55:22,020 --> 02:55:24,700 Smallest is not equal to none or is not none. 3941 02:55:24,700 --> 02:55:27,600 So this is false, so it skips over here. 3942 02:55:27,600 --> 02:55:29,180 Then it asks the question, 3943 02:55:29,180 --> 02:55:32,140 is the value we're looking at 41 less than smallest? 
3944 02:55:32,140 --> 02:55:35,860 Well, smallest is nine in this case, and this is 41. 3945 02:55:35,860 --> 02:55:38,540 So that's false, so it skips that and goes on. 3946 02:55:38,540 --> 02:55:41,500 So we see 41, we don't take it. 3947 02:55:41,500 --> 02:55:44,980 And then you can see that this will never become true again. 3948 02:55:44,980 --> 02:55:48,400 This is pretty much false for the rest of the iterations 3949 02:55:48,400 --> 02:55:49,260 of the loop. 3950 02:55:51,540 --> 02:55:53,740 It's false for the rest of the iterations for the loop. 3951 02:55:53,740 --> 02:55:56,460 So it just is gonna run down here and ask this question. 3952 02:55:56,460 --> 02:55:58,300 And at some point, we see a three, 3953 02:55:58,300 --> 02:56:00,420 and we run this code, we capture it. 3954 02:56:00,420 --> 02:56:02,120 We see 74, we don't capture it. 3955 02:56:02,120 --> 02:56:04,180 We see 15, we don't capture it. 3956 02:56:04,180 --> 02:56:07,100 So then the for loop skips out. 3957 02:56:07,100 --> 02:56:09,060 At the end, we have the smallest. 3958 02:56:09,060 --> 02:56:11,580 And actually, this would be a good technique 3959 02:56:11,580 --> 02:56:13,340 for the largest as well. 3960 02:56:13,340 --> 02:56:15,100 Because it really is just a technique 3961 02:56:15,100 --> 02:56:17,100 to put a marker in this variable 3962 02:56:17,100 --> 02:56:19,620 so that we snag that first number, 3963 02:56:19,620 --> 02:56:23,500 or first whatever as we read and parse through them. 3964 02:56:25,740 --> 02:56:30,080 So the is and is not operators are very useful in Python. 3965 02:56:30,080 --> 02:56:32,680 You can think of them as like the double equal sign, 3966 02:56:32,680 --> 02:56:34,180 they're asking a question. 3967 02:56:35,100 --> 02:56:37,700 And they're asking a question, 3968 02:56:37,700 --> 02:56:40,340 they return a true, you know, blank is blank, 3969 02:56:40,340 --> 02:56:42,000 returns a true or a false. 
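The loop walked through above can be sketched as follows. This is a reconstruction from the spoken description, not the slide itself; the function name `find_smallest` and the exact list of numbers are assumptions.

```python
def find_smallest(values):
    """Return the smallest value, using None as the 'nothing seen yet' marker."""
    smallest = None
    for value in values:
        if smallest is None:
            # First iteration: capture the first value as our hypothesis.
            smallest = value
        elif value < smallest:
            # Later iterations: keep the value only if it is smaller.
            smallest = value
    return smallest


# Values assumed for illustration; the lecture mentions 9, 41, 3, 74 and 15.
numbers = [9, 41, 12, 3, 74, 15]
result = find_smallest(numbers)
print(result)
```

Because the first value seeds the hypothesis, the same pattern works for all-negative data, and swapping `<` for `>` gives the largest value.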
3970 02:56:42,000 --> 02:56:43,100 It is stronger. 3971 02:56:44,380 --> 02:56:49,380 It asks, are these things equal in both type and value? 3972 02:56:50,380 --> 02:56:52,980 So just as an example, 3973 02:56:55,060 --> 02:56:56,260 if I were to say 3974 02:56:59,660 --> 02:57:03,720 is zero equal to 0.0, 3975 02:57:03,720 --> 02:57:05,780 it would say, yeah, that's true. 3976 02:57:05,780 --> 02:57:10,780 But then if I say zero is 0.0, that would be false. 3977 02:57:10,780 --> 02:57:15,740 So that's because these two are the same value-wise, 3978 02:57:15,740 --> 02:57:17,740 and these two are not the same type-wise. 3979 02:57:17,740 --> 02:57:20,060 So is is stronger than equals, 3980 02:57:20,060 --> 02:57:23,320 meaning that it demands equality 3981 02:57:23,320 --> 02:57:25,300 in both the type of the variable 3982 02:57:25,300 --> 02:57:27,340 and the value of the variable. 3983 02:57:27,340 --> 02:57:29,020 And no conversion is done. 3984 02:57:29,020 --> 02:57:31,000 And so that's just a very strong test. 3985 02:57:31,000 --> 02:57:32,900 Don't overuse is. 3986 02:57:32,900 --> 02:57:35,020 If you're dealing with numbers or even strings, 3987 02:57:35,020 --> 02:57:36,780 use double equals, don't use is, 3988 02:57:36,780 --> 02:57:40,620 because sometimes it gets a little confusing. 3989 02:57:40,620 --> 02:57:42,780 So use is sparingly. 3990 02:57:42,780 --> 02:57:47,020 I tend to only use is on Booleans and on None. 3991 02:57:47,020 --> 02:57:48,580 I don't use is on integers, 3992 02:57:48,580 --> 02:57:51,420 and I don't use is on floats, 3993 02:57:51,420 --> 02:57:53,260 and I don't use is on strings. 3994 02:57:53,260 --> 02:57:56,300 Just None, or True and False. 3995 02:57:58,100 --> 02:58:00,580 And is not is also an operator. 3996 02:58:00,580 --> 02:58:02,980 So you just say blah, blah, blah, is not None, 3997 02:58:02,980 --> 02:58:04,860 or blah, blah, blah, is not False.
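A short sketch of the difference between double equals and is. Strictly speaking, Python's is operator asks whether two names refer to the very same object, which is an even stronger test than "same type and value"; the list example below shows this. The variable names are made up for illustration.

```python
a = 0
b = 0.0

same_value = (a == b)    # compares values, converting between int and float
same_object = (a is b)   # an int object and a float object are never the same object

# Two separate lists with the same type and the same contents.
first = [1, 2, 3]
second = [1, 2, 3]
lists_equal = (first == second)       # equal in value
lists_identical = (first is second)   # but still two distinct objects

# The recommended use: checking for the None marker.
smallest = None
nothing_yet = (smallest is None)
has_value = (smallest is not None)

print(same_value, same_object, lists_equal, lists_identical, nothing_yet, has_value)
```

This is one reason to reserve is for None, True and False: for numbers and strings the result depends on whether two names happen to share one object, which is hard to predict.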
3998 02:58:06,580 --> 02:58:09,020 Okay, so we've been looping around 3999 02:58:09,020 --> 02:58:11,620 and doing loops and loops of loops. 4000 02:58:11,620 --> 02:58:14,540 We looked at the indefinite loops, 4001 02:58:14,540 --> 02:58:16,900 the while loops that kind of run for a while. 4002 02:58:16,900 --> 02:58:19,740 The definite loop, and we looked at break and continue 4003 02:58:19,740 --> 02:58:23,160 as a way to either escape completely from the loop 4004 02:58:23,160 --> 02:58:27,320 or go back up and discard the current iteration of the loop. 4005 02:58:28,260 --> 02:58:29,380 We looked at none. 4006 02:58:29,380 --> 02:58:32,220 We looked at Boolean variables with for loops, 4007 02:58:32,220 --> 02:58:34,580 definite loops, where you've got some kind of a set 4008 02:58:34,580 --> 02:58:36,900 or a list or some kind of sequence 4009 02:58:36,900 --> 02:58:37,980 that you're looping through. 4010 02:58:37,980 --> 02:58:39,580 And then the concept of loop idioms 4011 02:58:39,580 --> 02:58:41,100 where you do something at the top, 4012 02:58:41,100 --> 02:58:42,360 something to each item, 4013 02:58:42,360 --> 02:58:45,860 and then you sort of get a benefit at the bottom. 4014 02:58:45,860 --> 02:58:49,020 And so that gets us through iterations. 4015 02:58:53,020 --> 02:58:54,460 Hello and welcome to chapter six. 4016 02:58:54,460 --> 02:58:56,420 And this chapter we're gonna talk about strings. 4017 02:58:56,420 --> 02:58:58,780 And chapter seven is the payoff chapter. 4018 02:58:58,780 --> 02:59:01,940 So up to this point we're still learning 4019 02:59:01,940 --> 02:59:03,800 sort of basic building blocks, 4020 02:59:03,800 --> 02:59:06,900 and actually we're gonna write a real program in chapter seven. 4021 02:59:06,900 --> 02:59:11,060 So just learn this, and the payoff's in chapter seven. 
4022 02:59:11,060 --> 02:59:12,700 So we've actually been using strings 4023 02:59:12,700 --> 02:59:14,020 from the very first lecture, 4024 02:59:14,020 --> 02:59:17,380 because if you print Hello World, well, that's a string. 4025 02:59:17,380 --> 02:59:19,060 And so we've been doing things. 4026 02:59:19,060 --> 02:59:22,260 This slide here is all review. 4027 02:59:22,260 --> 02:59:24,100 We use plus to concatenate strings. 4028 02:59:24,100 --> 02:59:25,820 We use print to print them out. 4029 02:59:25,820 --> 02:59:28,180 Print's just a function that takes as a parameter 4030 02:59:28,180 --> 02:59:30,700 something, strings, integers, et cetera. 4031 02:59:31,700 --> 02:59:35,860 We can put digits in strings, but we can't add numbers to them. 4032 02:59:35,860 --> 02:59:37,460 By now you've figured this out, 4033 02:59:37,460 --> 02:59:40,660 but you can use things like int to convert the strings 4034 02:59:40,660 --> 02:59:42,980 to integers and then print things out. 4035 02:59:42,980 --> 02:59:45,540 So we've been doing this for a while, 4036 02:59:45,540 --> 02:59:47,540 and we've been talking about strings all along. 4037 02:59:47,540 --> 02:59:51,780 Now today what we're gonna do is just get 4038 02:59:51,780 --> 02:59:54,860 into strings in more detail. 4039 02:59:54,860 --> 02:59:59,860 We're reading the input data with the input function. 4040 02:59:59,860 --> 03:00:01,780 Input returns us a string. 4041 03:00:03,180 --> 03:00:04,920 And if we want to input a number, 4042 03:00:04,920 --> 03:00:06,740 we have to run some kind of conversion, 4043 03:00:06,740 --> 03:00:10,580 like we have to do an int before we take this data 4044 03:00:10,580 --> 03:00:13,140 that we read from input, you know? 4045 03:00:13,140 --> 03:00:15,540 And so there's things that we've gotta do, 4046 03:00:15,540 --> 03:00:19,060 and we've been doing all these things in programs so far.
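The review points above can be sketched in a few lines. The variable names are assumptions, and a plain string stands in for what input() would return so that the example runs without typing anything.

```python
# Plus concatenates strings.
greeting = "Hello" + " " + "World"

# A string of digits is still a string, so adding a number to it fails.
typed = "123"   # stands in for the string that input() would return
add_failed = False
try:
    typed + 1
except TypeError:
    add_failed = True

# Converting with int() first makes the arithmetic work.
total = int(typed) + 1

print(greeting, add_failed, total)
```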
4047 03:00:19,060 --> 03:00:21,140 But if we look a little in a little more detail 4048 03:00:21,140 --> 03:00:25,220 inside strings, we can index within strings each character. 4049 03:00:25,220 --> 03:00:27,940 So each character has a separate position 4050 03:00:27,940 --> 03:00:29,940 and a separate index. 4051 03:00:29,940 --> 03:00:34,940 And basically the letters have positions, 4052 03:00:35,300 --> 03:00:37,180 and the positions start at zero, 4053 03:00:37,180 --> 03:00:39,860 and the best way I explain this to remember this 4054 03:00:39,860 --> 03:00:42,060 is it's the elevators. 4055 03:00:42,060 --> 03:00:44,660 As we used in one of our examples a long time ago, 4056 03:00:44,660 --> 03:00:46,300 elevators in Europe start at zero, 4057 03:00:46,300 --> 03:00:48,960 and so strings start at zero as well. 4058 03:00:49,860 --> 03:00:52,180 Turns out in the old days there's some efficiency 4059 03:00:52,180 --> 03:00:55,380 with the notion of lists of things starting with zero. 4060 03:00:55,380 --> 03:00:57,620 These days the efficiency isn't the issue, 4061 03:00:57,620 --> 03:01:00,300 but there's a certain elegance starting at zero, 4062 03:01:00,300 --> 03:01:02,100 even though intellectually you might think 4063 03:01:02,100 --> 03:01:06,420 one would be the first character in the string 4064 03:01:06,420 --> 03:01:08,400 might make most sense to be sub one, 4065 03:01:08,400 --> 03:01:09,780 but it's not, it's sub zero. 4066 03:01:09,780 --> 03:01:11,580 But just remember that. 4067 03:01:12,580 --> 03:01:15,280 And so we have this operator called the index operator, 4068 03:01:15,280 --> 03:01:16,620 and it's square brackets. 4069 03:01:16,620 --> 03:01:21,620 So fruit is a variable that contains the string banana, 4070 03:01:21,620 --> 03:01:26,260 and then fruit sub one is the character 4071 03:01:26,260 --> 03:01:27,580 that's in position one. 4072 03:01:27,580 --> 03:01:29,820 Now that actually is the second character. 
4073 03:01:29,820 --> 03:01:31,940 I'll keep reminding you until I get tired of reminding you. 4074 03:01:31,940 --> 03:01:35,780 So that assigns A, the letter A, 4075 03:01:35,780 --> 03:01:40,780 into, I mean A, the letter A into the variable letter. 4076 03:01:41,380 --> 03:01:42,820 Of course that's a badly chosen, 4077 03:01:42,820 --> 03:01:44,460 it's either a well chosen variable name 4078 03:01:44,460 --> 03:01:46,940 or a badly chosen variable name. 4079 03:01:46,940 --> 03:01:48,740 And the thing that goes inside this 4080 03:01:48,740 --> 03:01:50,900 can either be a constant or it can be an expression, 4081 03:01:50,900 --> 03:01:52,740 so this is x equals three, 4082 03:01:52,740 --> 03:01:54,420 and then fruit sub x minus one, 4083 03:01:54,420 --> 03:01:56,940 well that means two, which is position two, 4084 03:01:56,940 --> 03:02:00,220 which is an n, and so that gives us an n. 4085 03:02:00,220 --> 03:02:02,260 So the index is an operator, 4086 03:02:02,260 --> 03:02:04,380 and you can add this bracket syntax 4087 03:02:04,380 --> 03:02:06,500 to the end of a string variable. 4088 03:02:08,140 --> 03:02:11,020 You can't index beyond the length of the string, 4089 03:02:11,020 --> 03:02:13,140 so if I say zot sub five, 4090 03:02:13,140 --> 03:02:14,420 well there's only three characters, 4091 03:02:14,420 --> 03:02:16,420 which means zero, one, two, 4092 03:02:16,420 --> 03:02:18,500 but sub five doesn't work, 4093 03:02:18,500 --> 03:02:22,180 and of course we get a happy little trace back. 4094 03:02:23,540 --> 03:02:24,660 So you have to be careful 4095 03:02:24,660 --> 03:02:26,340 when you're starting to pull stuff out of strings, 4096 03:02:26,340 --> 03:02:27,980 although some of the things allow it, 4097 03:02:27,980 --> 03:02:31,060 some of them don't, and you'll kind of get used to that. 
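A sketch of the indexing examples described above, using the names from the lecture (fruit, letter, x, zot). The contents of zot and the name `w` are assumptions; the lecture only says zot has three characters.

```python
fruit = "banana"

# Position 1 is the second character, because positions start at zero.
letter = fruit[1]

# The index can be an expression: x - 1 is 2, and position 2 holds 'n'.
x = 3
w = fruit[x - 1]

# Indexing past the end of a string raises IndexError (the traceback).
zot = "abc"   # assumed three-character string
out_of_range = False
try:
    zot[5]
except IndexError:
    out_of_range = True

print(letter, w, out_of_range)
```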
4098 03:02:31,060 --> 03:02:33,900 We can ask how long a string is, 4099 03:02:33,900 --> 03:02:35,780 and so we use the len function, 4100 03:02:35,780 --> 03:02:37,860 we pass the string variable, 4101 03:02:37,860 --> 03:02:40,140 and we pass it into len as parameter, 4102 03:02:40,140 --> 03:02:42,540 and len gives us back the length of the string, 4103 03:02:42,540 --> 03:02:47,540 not the position, so it's zero through len minus one. 4104 03:02:47,540 --> 03:02:51,080 So it's zero through len minus one. 4105 03:02:51,080 --> 03:02:53,520 So len is just another function 4106 03:02:53,520 --> 03:02:55,880 that we've been doing functions now for a while, 4107 03:02:55,880 --> 03:02:57,820 you pass in a parameter, 4108 03:02:57,820 --> 03:02:59,200 and then len does some work, 4109 03:02:59,200 --> 03:03:02,360 and out comes six, and that goes back into x, 4110 03:03:02,360 --> 03:03:04,360 because the function has a residual value, 4111 03:03:04,360 --> 03:03:07,460 it just happens to be a built-in function. 4112 03:03:08,360 --> 03:03:11,720 And so, you know, somewhere deep inside Python, 4113 03:03:11,720 --> 03:03:14,520 there is code that takes this, 4114 03:03:14,520 --> 03:03:16,960 and somebody wrote a loop, or looked something up, 4115 03:03:16,960 --> 03:03:18,640 and then returned a return value, 4116 03:03:18,640 --> 03:03:23,640 and sent back a six to go into our x variable. 4117 03:03:24,080 --> 03:03:25,880 And so a function is there, 4118 03:03:25,880 --> 03:03:29,040 and like I said, we've been using this for a while. 4119 03:03:29,040 --> 03:03:31,280 Another thing we tend to do is to look through strings, 4120 03:03:31,280 --> 03:03:34,480 and look at strings, and dig data out of strings. 4121 03:03:34,480 --> 03:03:36,760 Python is excellent for doing 4122 03:03:36,760 --> 03:03:39,240 sort of these kinds of lookups. 
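A small sketch of len as described above. The name `last` is an assumption added to show why the highest valid position is len minus one.

```python
fruit = "banana"

# len returns the number of characters, not the last position.
x = len(fruit)

# Valid positions run from 0 through len(fruit) - 1.
last = fruit[len(fruit) - 1]

print(x, last)
```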
4123 03:03:39,240 --> 03:03:42,360 And so we can write a simple loop, 4124 03:03:42,360 --> 03:03:44,120 we can write a loop that 4125 03:03:44,120 --> 03:03:49,120 creates some kind of iteration variable, like index, 4126 03:03:49,680 --> 03:03:52,120 and given that we know that these positions are zero 4127 03:03:52,120 --> 03:03:54,520 through five, we can set this to be zero, 4128 03:03:54,520 --> 03:03:56,080 and then write a while loop, 4129 03:03:56,080 --> 03:03:58,160 while the iteration variable is less than 4130 03:03:58,160 --> 03:04:00,680 the length of fruit, and remember, this is six, 4131 03:04:00,680 --> 03:04:03,380 so it's gonna be zero through five. 4132 03:04:03,380 --> 03:04:07,800 Zero through five are the values we wanna generate, 4133 03:04:07,800 --> 03:04:09,680 and then we can look up one at a time, 4134 03:04:09,680 --> 03:04:12,220 pull out fruit sub index, so fruit sub zero, 4135 03:04:12,220 --> 03:04:14,360 fruit sub one, two, three, four, five, 4136 03:04:14,360 --> 03:04:17,920 and then print out the position and the letter, 4137 03:04:17,920 --> 03:04:20,040 and then add one to index, and it runs, 4138 03:04:20,040 --> 03:04:22,600 this'll run six times, zero through five, 4139 03:04:22,600 --> 03:04:26,040 and out we go to produce this output right here. 4140 03:04:26,040 --> 03:04:28,660 And so that's one way of looping through strings. 4141 03:04:28,660 --> 03:04:32,980 That is a basic indefinite loop, 4142 03:04:32,980 --> 03:04:35,760 but we construct carefully an iteration variable, 4143 03:04:36,960 --> 03:04:38,520 construct an iteration variable, 4144 03:04:38,520 --> 03:04:43,520 and work our way through that loop data.
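The while loop described above can be sketched like this. The list `pairs` is an assumption added only so the result can be inspected afterwards; the lecture version just prints.

```python
fruit = "banana"
index = 0
pairs = []   # assumed helper: records what each iteration saw

# Runs while index is 0, 1, 2, 3, 4, 5 and stops when index reaches len(fruit).
while index < len(fruit):
    letter = fruit[index]
    print(index, letter)
    pairs.append((index, letter))
    index = index + 1
```

Forgetting the last line would leave index at zero forever, which is the kind of mistake the hand-built iteration variable invites.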
4145 03:04:43,680 --> 03:04:46,760 The other way is to use a definite loop, a for loop, 4146 03:04:46,760 --> 03:04:50,320 and generally when we are able to use a while loop 4147 03:04:50,320 --> 03:04:52,880 or a for loop, all else being equal, 4148 03:04:52,880 --> 03:04:55,320 we generally prefer a for loop. 4149 03:04:55,320 --> 03:04:58,400 And so here we have the for keyword and fruit, 4150 03:04:58,400 --> 03:05:01,920 and it's an in, and so for letter in fruit, 4151 03:05:01,920 --> 03:05:04,380 well that just says letter is our iteration variable 4152 03:05:04,380 --> 03:05:06,440 and it's gonna take on the successive values 4153 03:05:06,440 --> 03:05:08,200 of each of the characters. 4154 03:05:08,200 --> 03:05:10,320 So this loop is gonna run six times, 4155 03:05:10,320 --> 03:05:14,840 and letter's gonna be B-A-N-A-N-A, banana. 4156 03:05:14,840 --> 03:05:16,680 I'm always terrified when I make these slides 4157 03:05:16,680 --> 03:05:18,200 that I'm gonna misspell banana, 4158 03:05:18,200 --> 03:05:22,020 because somehow I always think that there are two Ns, 4159 03:05:22,020 --> 03:05:24,240 somewhere, I don't know. 4160 03:05:24,240 --> 03:05:25,980 It's not one of my favorite words to spell. 4161 03:05:25,980 --> 03:05:28,960 I actually didn't choose banana as the constant. 4162 03:05:28,960 --> 03:05:32,000 The authors who I borrowed the textbook from, 4163 03:05:32,000 --> 03:05:34,580 Allen Downey and Jeff Elkner, they used banana, 4164 03:05:34,580 --> 03:05:36,080 and so I'm still using banana. 4165 03:05:36,080 --> 03:05:38,560 So some of the jokes in the book aren't my book, 4166 03:05:38,560 --> 03:05:42,680 aren't my jokes, they are the jokes of Jeff and Allen.
4167 03:05:42,680 --> 03:05:45,760 So here are just two equivalent loops, 4168 03:05:45,760 --> 03:05:47,040 so you can have the while loop, 4169 03:05:47,040 --> 03:05:48,520 they sort of both do the same thing, 4170 03:05:48,520 --> 03:05:51,640 they both just print the letters out one time through, 4171 03:05:51,640 --> 03:05:53,580 each of these loops runs six times, 4172 03:05:53,580 --> 03:05:56,360 but you can see how the definite loop, 4173 03:05:56,360 --> 03:05:58,640 the for loop is a prettier loop, 4174 03:05:58,640 --> 03:06:01,040 unless you truly somehow need to know this number 4175 03:06:01,040 --> 03:06:02,040 as you're going through the loop. 4176 03:06:02,040 --> 03:06:03,320 But if all you're doing is going through 4177 03:06:03,320 --> 03:06:05,880 and you wanna touch in order each of the characters 4178 03:06:05,880 --> 03:06:09,360 of the string, you then simply write a for loop 4179 03:06:09,360 --> 03:06:10,480 because it's more elegant. 4180 03:06:10,480 --> 03:06:13,240 The less code you write, 4181 03:06:13,240 --> 03:06:15,600 the less chance there is for you to make a mistake. 4182 03:06:15,600 --> 03:06:17,800 And so the fact that these are equivalent, 4183 03:06:17,800 --> 03:06:20,120 this is three lines, well, two lines of a loop 4184 03:06:20,120 --> 03:06:21,440 and this is four lines of a loop, 4185 03:06:21,440 --> 03:06:24,320 that's twice as many places as you could make a mistake 4186 03:06:24,320 --> 03:06:27,100 because you might misspell index or something. 4187 03:06:27,100 --> 03:06:29,680 I mean, why even make an iteration variable 4188 03:06:29,680 --> 03:06:33,120 if you don't need to make an iteration variable?
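The two equivalent loops compared above, side by side. The lists `for_letters` and `while_letters` are assumptions added so the equivalence can be checked; the lecture versions only print.

```python
fruit = "banana"

# Definite loop: for handles the counting and the advancing.
for_letters = []
for letter in fruit:
    print(letter)
    for_letters.append(letter)

# Indefinite loop: we manage the index ourselves.
while_letters = []
index = 0
while index < len(fruit):
    letter = fruit[index]
    print(letter)
    while_letters.append(letter)
    index = index + 1

print(for_letters == while_letters)
```

Both loops run six times and visit the same characters in the same order; the for version simply has fewer places to go wrong.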
4189 03:06:33,120 --> 03:06:36,680 And so we can do things that harken back 4190 03:06:36,680 --> 03:06:39,200 to our iterations and loops chapter 4191 03:06:39,200 --> 03:06:41,520 where anything that you can do in those things 4192 03:06:41,520 --> 03:06:43,000 like look for the largest letter, 4193 03:06:43,000 --> 03:06:44,520 look for the smallest letter, 4194 03:06:44,520 --> 03:06:46,280 search to see if a letter exists 4195 03:06:46,280 --> 03:06:50,320 or say count the number of A's in the word banana. 4196 03:06:50,320 --> 03:06:52,700 And so that's what this is doing. 4197 03:06:52,700 --> 03:06:55,880 And so we have a counter. 4198 03:06:55,880 --> 03:06:58,320 So again, we do something at the top of the loop, 4199 03:06:58,320 --> 03:06:59,680 we're gonna do something in the middle loop 4200 03:06:59,680 --> 03:07:00,840 and then we're gonna print it out at the bottom. 4201 03:07:00,840 --> 03:07:02,440 So we start our counter at zero, 4202 03:07:02,440 --> 03:07:05,360 we're gonna loop through all the letters 4203 03:07:05,360 --> 03:07:06,960 and then if the letter is A, 4204 03:07:06,960 --> 03:07:08,360 then count equals count plus one. 4205 03:07:08,360 --> 03:07:10,680 This is kind of a pattern in a loop 4206 03:07:10,680 --> 03:07:12,400 where we're noticing something 4207 03:07:12,400 --> 03:07:14,080 and instead of like we did it earlier 4208 03:07:14,080 --> 03:07:15,800 where we said found equals true, 4209 03:07:15,800 --> 03:07:17,100 well, we're gonna count them this time. 4210 03:07:17,100 --> 03:07:18,680 So if we have one, we'll get one, 4211 03:07:18,680 --> 03:07:20,080 if we have zero, we get zero 4212 03:07:20,080 --> 03:07:24,040 and how many there are but there should be three 4213 03:07:24,040 --> 03:07:25,940 because it's gonna run three times 4214 03:07:25,940 --> 03:07:28,240 and there's three A's in banana. 4215 03:07:28,240 --> 03:07:32,480 And so this is a conditional within count. 
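The counting pattern described above, as a short sketch. The variable name `word` is an assumption.

```python
word = "banana"

# Top of the loop: start the counter at zero.
count = 0

# Middle: look at every letter and add one each time we see an 'a'.
for letter in word:
    if letter == "a":
        count = count + 1

# Bottom: the payoff.
print(count)
```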
4216 03:07:32,480 --> 03:07:35,600 We've seen counts, we've seen conditionals in loops 4217 03:07:35,600 --> 03:07:37,080 in prior chapters. 4218 03:07:37,080 --> 03:07:42,080 And so again, I love the in keyword in Python. 4219 03:07:42,760 --> 03:07:46,000 It again reminds me of a set notation in algebra. 4220 03:07:46,000 --> 03:07:48,960 If you're a math whiz, if you're not, don't worry about it 4221 03:07:48,960 --> 03:07:50,600 or maybe you will be a math whiz 4222 03:07:50,600 --> 03:07:52,080 and you'll say, whoa, this set notation 4223 03:07:52,080 --> 03:07:57,080 reminds me a lot of the in keyword in Python. 4224 03:07:59,240 --> 03:08:04,240 So again, it's for iteration variable letter. 4225 03:08:04,960 --> 03:08:06,800 Again, don't get stuck with letter. 4226 03:08:06,800 --> 03:08:09,520 I just happen to be using it here in banana. 4227 03:08:09,520 --> 03:08:14,520 And that is for each character in the string banana, 4228 03:08:14,760 --> 03:08:18,400 run this loop once, changing the variable letter 4229 03:08:18,400 --> 03:08:21,000 to be the particular character that we're pointing at. 4230 03:08:21,000 --> 03:08:23,520 And so, it's taking care of, for is taking care 4231 03:08:23,520 --> 03:08:25,200 of a lot for us, right? 4232 03:08:25,200 --> 03:08:27,840 And so this is sort of this really smart for loop. 4233 03:08:27,840 --> 03:08:31,160 The for loop is both deciding how many times 4234 03:08:31,160 --> 03:08:33,680 to run the loop, in this case six, 4235 03:08:33,680 --> 03:08:35,160 and it's advancing the letter. 4236 03:08:35,160 --> 03:08:38,400 So advance, print. 4237 03:08:38,400 --> 03:08:40,800 Decide whether you're done, advance, print. 4238 03:08:40,800 --> 03:08:43,200 Decide whether you're done, advance, print. 4239 03:08:43,200 --> 03:08:45,400 Decide whether you're done, advance, print. 4240 03:08:45,400 --> 03:08:47,080 Decide whether you're done, advance, print.
4241 03:08:47,080 --> 03:08:47,980 Decide whether you're done, 4242 03:08:47,980 --> 03:08:48,820 advance, print. 4243 03:08:48,820 --> 03:08:50,720 Decide whether you're done: I am now done 4244 03:08:50,720 --> 03:08:53,520 because I, whoop, you know, 4245 03:08:53,520 --> 03:08:55,600 we're done with that particular string. 4246 03:08:55,600 --> 03:08:59,920 And so, you can think of the for as, you know, 4247 03:08:59,920 --> 03:09:02,800 magically doing all of this for you, 4248 03:09:02,800 --> 03:09:05,400 of both deciding how long to run the loop, 4249 03:09:05,400 --> 03:09:06,680 when you're done or not, 4250 03:09:06,680 --> 03:09:09,680 and moving down through all the successive 4251 03:09:09,680 --> 03:09:11,800 letters in the loop. 4252 03:09:11,800 --> 03:09:13,680 So up next, we'll talk a little bit 4253 03:09:13,680 --> 03:09:20,680 about additional things that we can do with strings. 4254 03:09:20,800 --> 03:09:22,800 So now we're gonna dig into strings a bit, 4255 03:09:22,800 --> 03:09:24,800 and we've already looked at how you can pull out 4256 03:09:24,800 --> 03:09:26,160 a single character in a string, 4257 03:09:26,160 --> 03:09:28,120 and now we're going to look at what we call slicing, 4258 03:09:28,120 --> 03:09:30,800 and that is pulling chunks of a string out. 4259 03:09:30,800 --> 03:09:33,960 And again, we're gonna use the square bracket operator, 4260 03:09:33,960 --> 03:09:38,960 and so S, and the way I say it is sub, S sub zero through 4261 03:09:39,740 --> 03:09:41,320 four, that's how I read this. 4262 03:09:41,320 --> 03:09:44,120 S sub zero through four. 4263 03:09:44,120 --> 03:09:46,800 So I look at the colon as through, 4264 03:09:46,800 --> 03:09:49,400 I look at the brackets as sub. 4265 03:09:50,600 --> 03:09:53,680 And so, S sub zero through four says, 4266 03:09:53,680 --> 03:09:55,680 start at position zero, 4267 03:09:57,360 --> 03:10:00,360 and then go up through, but not including four, right?
4268 03:10:00,360 --> 03:10:01,760 So we don't include four. 4269 03:10:01,760 --> 03:10:04,800 So that's probably the hardest part of this, 4270 03:10:04,800 --> 03:10:07,640 up to but not including, up to but not including. 4271 03:10:08,640 --> 03:10:09,840 This seems counterintuitive, 4272 03:10:09,840 --> 03:10:11,760 kind of like starting at zero seems counterintuitive, 4273 03:10:11,760 --> 03:10:15,160 but after a while, you'll kind of get used to it, 4274 03:10:15,160 --> 03:10:17,200 and there'll be situations where you're writing code like, 4275 03:10:17,200 --> 03:10:18,920 oh, that's why that works better. 4276 03:10:18,920 --> 03:10:22,100 But just for now, remember it, up to but not including. 4277 03:10:22,100 --> 03:10:23,880 It's just kind of a little thing. 4278 03:10:25,520 --> 03:10:29,880 We'll come back to when that is useful for us. 4279 03:10:30,920 --> 03:10:34,720 Six through seven, well that ends up being starting at six, 4280 03:10:34,720 --> 03:10:36,400 up to but not including seven. 4281 03:10:36,400 --> 03:10:38,220 So that's why we only get the P out. 4282 03:10:38,220 --> 03:10:41,600 Now one thing that Python is pretty nice about, 4283 03:10:41,600 --> 03:10:43,600 is it's not gonna give you a trace back. 4284 03:10:44,720 --> 03:10:47,040 We might expect that six through 20, 4285 03:10:47,040 --> 03:10:48,720 well there's no 20 characters, but it's like, 4286 03:10:48,720 --> 03:10:51,860 ah, that's okay, we'll just let you stop at the end, 4287 03:10:51,860 --> 03:10:54,080 and we'll start at six and go all the way to the end. 4288 03:10:54,080 --> 03:10:56,040 Oh, no trace back. 4289 03:10:56,040 --> 03:10:57,400 It's almost disappointing sometimes 4290 03:10:57,400 --> 03:11:00,320 when Python doesn't trace back when you think, 4291 03:11:00,320 --> 03:11:02,320 ah, you know, if you're so obsessed about everything, 4292 03:11:02,320 --> 03:11:04,080 now I would have traced back in that situation. 
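A sketch of the slices discussed so far. The string "Monty Python" is an assumption inferred from the pieces read out in the lecture (the P at position six), and the names `first_four`, `one_letter` and `past_end` are made up for illustration.

```python
s = "Monty Python"   # assumed example string

# Positions 0, 1, 2, 3: up to but not including 4.
first_four = s[0:4]

# Starts at 6, up to but not including 7, so just one character.
one_letter = s[6:7]

# The end is past the last character; slicing quietly stops at the end
# instead of raising an error.
past_end = s[6:20]

print(first_four, one_letter, past_end)
```

Note the contrast with single-character indexing, where a position past the end does produce a traceback.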
4293 03:11:04,080 --> 03:11:07,600 But hey, I guess if you're allowed, you're allowed. 4294 03:11:07,600 --> 03:11:08,640 And so there we go. 4295 03:11:09,800 --> 03:11:14,320 Now you can eliminate or omit the first or last. 4296 03:11:14,320 --> 03:11:15,720 If you eliminate the first, 4297 03:11:15,720 --> 03:11:17,180 it assumes the beginning of string. 4298 03:11:17,180 --> 03:11:19,640 If you eliminate the second, 4299 03:11:19,640 --> 03:11:21,200 it assumes the end of the string. 4300 03:11:21,200 --> 03:11:23,640 And why you would do this, I don't know, 4301 03:11:23,640 --> 03:11:25,680 but that's from beginning to end, 4302 03:11:25,680 --> 03:11:26,680 so it's the whole string. 4303 03:11:26,680 --> 03:11:31,360 So whole string, eight through the end is thon, 4304 03:11:31,360 --> 03:11:35,920 and up to but not including two is mo, all right? 4305 03:11:35,920 --> 03:11:37,400 So you get that. 4306 03:11:37,400 --> 03:11:40,080 So just, that's pretty simple. 4307 03:11:40,080 --> 03:11:41,720 Once you've got the rest of slicing 4308 03:11:41,720 --> 03:11:43,440 and the rest of string indexing, 4309 03:11:43,440 --> 03:11:46,080 the notion of eliminating the first or the last 4310 03:11:46,080 --> 03:11:47,880 of the colon expression, 4311 03:11:47,880 --> 03:11:50,200 the first or second of the colon expression, 4312 03:11:50,200 --> 03:11:53,080 I think is actually pretty intuitive, pretty nice. 4313 03:11:54,080 --> 03:11:56,320 We've already been concatenating strings together. 4314 03:11:56,320 --> 03:11:59,040 We overload the plus operator, 4315 03:11:59,040 --> 03:12:01,180 and there is no space added. 4316 03:12:01,180 --> 03:12:05,160 Remember when you're doing print, x comma y, 4317 03:12:05,160 --> 03:12:07,540 this comma does turn into a space, 4318 03:12:07,540 --> 03:12:09,040 but that's not what's happening here. 
4319 03:12:09,040 --> 03:12:11,280 There is no automatic space being added, 4320 03:12:11,280 --> 03:12:12,920 and so we see hello and there, 4321 03:12:12,920 --> 03:12:14,800 and it just comes out as hellothere with no space. 4322 03:12:14,800 --> 03:12:15,760 And so if we want, 4323 03:12:15,760 --> 03:12:18,660 we just have to concatenate the space explicitly 4324 03:12:18,660 --> 03:12:21,240 if we wanna put spaces into strings. 4325 03:12:21,240 --> 03:12:23,360 The problem is, 4326 03:12:23,360 --> 03:12:24,680 you might think it's more convenient 4327 03:12:24,680 --> 03:12:26,600 to add a space with a concatenation, 4328 03:12:26,600 --> 03:12:27,440 but then you have to think, 4329 03:12:27,440 --> 03:12:29,800 well, what about if I wanna concatenate things 4330 03:12:29,800 --> 03:12:31,200 and not put the space in, 4331 03:12:31,200 --> 03:12:32,840 then I'd need a different operator. 4332 03:12:32,840 --> 03:12:35,800 So that's kind of why it works that way. 4333 03:12:37,600 --> 03:12:41,080 We can use in differently as a logical operator, 4334 03:12:41,080 --> 03:12:44,540 so we're using it as an iteration structure in for loops, 4335 03:12:44,540 --> 03:12:48,600 but we can also use it as a logical operator in if statements. 4336 03:12:48,600 --> 03:12:51,680 So it's kind of like the double equals, 4337 03:12:51,680 --> 03:12:54,520 or not equals, or less than or equals, 4338 03:12:54,520 --> 03:12:55,360 or something like that. 4339 03:12:55,360 --> 03:12:57,000 It's like those guys. 4340 03:12:57,000 --> 03:13:01,080 And so, and it returns a true or a false, 4341 03:13:01,080 --> 03:13:02,440 is n in fruit. 4342 03:13:02,440 --> 03:13:03,760 So that's a question, 4343 03:13:03,760 --> 03:13:04,880 and the answer is true. 4344 03:13:04,880 --> 03:13:06,720 Is m in fruit? 4345 03:13:06,720 --> 03:13:08,760 No, that's the answer to a question. 4346 03:13:08,760 --> 03:13:09,840 Is nan in fruit?
4347 03:13:09,840 --> 03:13:11,000 Doesn't have to be a single character, 4348 03:13:11,000 --> 03:13:12,320 can be more than one character, 4349 03:13:12,320 --> 03:13:13,820 and the answer is true. 4350 03:13:13,820 --> 03:13:15,520 And then you say something like, 4351 03:13:15,520 --> 03:13:16,760 if a in fruit. 4352 03:13:16,760 --> 03:13:18,440 And so this is the logical value 4353 03:13:18,440 --> 03:13:20,320 that returns a true or a false, 4354 03:13:20,320 --> 03:13:21,680 and yes, we found it. 4355 03:13:21,680 --> 03:13:24,040 So that becomes true in this particular case, 4356 03:13:24,040 --> 03:13:26,720 so it runs the little indented bit. 4357 03:13:26,720 --> 03:13:30,600 So in is an operator in this particular situation. 4358 03:13:30,600 --> 03:13:32,680 In a for loop, in means something different. 4359 03:13:32,680 --> 03:13:35,600 And we'll use in for other things as operators, 4360 03:13:35,600 --> 03:13:38,740 as logical operators, coming up in a bit. 4361 03:13:40,240 --> 03:13:41,500 You can compare strings, 4362 03:13:41,500 --> 03:13:45,200 and this has to do with the character set of your computer, 4363 03:13:45,200 --> 03:13:47,320 the character set that Python is using. 4364 03:13:47,320 --> 03:13:48,780 But in general, 4365 03:13:50,160 --> 03:13:52,240 it is lexicographically less than 4366 03:13:52,240 --> 03:13:54,400 and lexicographically greater than. 4367 03:13:54,400 --> 03:13:57,040 Uppercase and lowercase are a little weird. 4368 03:13:57,040 --> 03:14:00,620 I think when we used the max function earlier, 4369 03:14:00,620 --> 03:14:02,400 the way my computer was set up, 4370 03:14:03,660 --> 03:14:07,580 uppercase was less than lowercase. 4371 03:14:07,580 --> 03:14:12,580 But in general, uppercase is less than lowercase. 4372 03:14:12,900 --> 03:14:16,140 But in general, it's bad to assume case, 4373 03:14:16,140 --> 03:14:19,100 but there is a deterministic way to sort strings.
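Here is a short sketch of in used as a logical operator, next to in used in a for loop. It assumes fruit holds 'banana', which fits the answers given in the lecture; the other names are illustrative.

```python
fruit = 'banana'  # assumed value of the variable on the slide

has_n = 'n' in fruit      # a question that comes back True or False
has_m = 'm' in fruit
has_nan = 'nan' in fruit  # the left side can be more than one character

# As a logical operator inside an if statement
if 'a' in fruit:
    message = 'Found it!'
else:
    message = 'Not found'

# The same keyword in a for loop means "step through each character"
letters = []
for letter in fruit:
    letters.append(letter)

print(has_n, has_m, has_nan, message, letters)
```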
4374 03:14:20,060 --> 03:14:23,260 You can have something equal to or less than 4375 03:14:23,260 --> 03:14:24,900 or greater than, 4376 03:14:24,900 --> 03:14:28,340 and all those operations work naturally, 4377 03:14:28,340 --> 03:14:29,420 the less than and greater than. 4378 03:14:29,420 --> 03:14:32,140 You have to kind of be aware of uppercase, lowercase, 4379 03:14:32,140 --> 03:14:36,380 things like where punctuation 4380 03:14:36,380 --> 03:14:39,300 sorts less than or greater than letters. 4381 03:14:39,300 --> 03:14:40,980 That's kind of unpredictable 4382 03:14:40,980 --> 03:14:44,100 and depends on the character set of your computer 4383 03:14:44,100 --> 03:14:45,340 and something you just play with 4384 03:14:45,340 --> 03:14:47,800 and figure out. If you're sorting stuff 4385 03:14:47,800 --> 03:14:49,100 by first name and last name, 4386 03:14:49,100 --> 03:14:54,100 as long as the case is kind of the same, you know, if, 4387 03:14:55,980 --> 03:14:59,980 if you were sorting Chuck with uppercase and Glen, 4388 03:15:01,300 --> 03:15:02,700 with these both uppercase, 4389 03:15:02,700 --> 03:15:05,240 they'd sort right, and all lowercase would sort right, 4390 03:15:05,240 --> 03:15:07,260 but if you were to do instead, 4391 03:15:08,320 --> 03:15:12,000 lowercase chuck and uppercase Glen, 4392 03:15:12,940 --> 03:15:15,180 then that would sort weird. As a matter of fact, 4393 03:15:15,180 --> 03:15:16,860 the G would come before the c. 4394 03:15:16,860 --> 03:15:18,880 And so case can mess this up, 4395 03:15:18,880 --> 03:15:21,620 but in general, other than case 4396 03:15:21,620 --> 03:15:24,260 and special characters and other things, 4397 03:15:24,260 --> 03:15:26,220 it technically works. 4398 03:15:26,220 --> 03:15:28,060 It's just hard to kind of predict it. 4399 03:15:28,940 --> 03:15:32,080 A lot of what we do is use the string library.
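A small sketch of the Chuck and Glen case. In Python 3, strings compare by Unicode code point, so every uppercase ASCII letter sorts before every lowercase one. Lowercasing before comparing, as done at the end, is one common workaround and not something the lecture prescribes.

```python
same_case = 'Chuck' < 'Glen'    # both capitalized: C comes before G
mixed_case = 'chuck' < 'Glen'   # 'G' (code point 71) sorts before 'c' (99)

names = ['chuck', 'Glen']
plain_order = sorted(names)                   # the uppercase G wins
folded_order = sorted(names, key=str.lower)   # compare lowercase copies instead

print(same_case, mixed_case)
print(plain_order)
print(folded_order)
```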
4400 03:15:32,080 --> 03:15:35,120 And so the strings are objects, 4401 03:15:35,120 --> 03:15:37,740 and we'll talk later about what that really means. 4402 03:15:37,740 --> 03:15:40,940 And objects have these things we call methods. 4403 03:15:43,580 --> 03:15:47,100 So a string object has some built-in capabilities. 4404 03:15:47,100 --> 03:15:49,660 And one of the built-in capabilities 4405 03:15:49,660 --> 03:15:53,380 that the string object has is here is a string object. 4406 03:15:53,380 --> 03:15:55,340 And because greet is a string object, 4407 03:15:55,340 --> 03:15:58,080 if we said type, we'd see that it was an str. 4408 03:15:58,080 --> 03:16:01,260 Dot lower says, hey, dear string, 4409 03:16:01,260 --> 03:16:03,680 make a lowercase version of yourself. 4410 03:16:03,680 --> 03:16:05,040 It's like calling this function lower 4411 03:16:05,040 --> 03:16:07,380 and passing greet into it. 4412 03:16:07,380 --> 03:16:08,860 And then give that back to me. 4413 03:16:08,860 --> 03:16:10,620 Now it doesn't actually change greet. 4414 03:16:10,620 --> 03:16:12,420 It gives me a lowercase copy. 4415 03:16:12,420 --> 03:16:15,220 So here I have hello Bob with an H and a B uppercase. 4416 03:16:15,220 --> 03:16:18,140 And when I get back in zap is hello Bob all lowercase. 4417 03:16:18,140 --> 03:16:21,100 And note that greet is unchanged. 4418 03:16:21,100 --> 03:16:23,020 So hello Bob is still there. 4419 03:16:23,020 --> 03:16:25,580 And you can even call these methods on constants. 4420 03:16:25,580 --> 03:16:28,620 So this is a string object, quote, hi there, quote. 4421 03:16:28,620 --> 03:16:33,620 Dot lower, that says call lower on this bit of string 4422 03:16:33,840 --> 03:16:35,580 and give me back a lowercase version of it. 4423 03:16:35,580 --> 03:16:39,260 And so it prints out as the residual return value. 4424 03:16:39,260 --> 03:16:40,460 This is like a function call. 
4425 03:16:40,460 --> 03:16:44,060 A method call is a kind of special form of a function call. 4426 03:16:44,060 --> 03:16:46,700 It's a function call where you say the thing dot 4427 03:16:46,700 --> 03:16:49,220 the function name rather than the function name 4428 03:16:49,220 --> 03:16:50,920 with the thing passed in as a parameter. 4429 03:16:50,920 --> 03:16:54,380 Like len, for example, is non-object oriented. 4430 03:16:54,380 --> 03:16:57,260 You know, len of x, that's non-object oriented. 4431 03:16:57,260 --> 03:17:00,680 Object oriented would be x dot something, parenthesis. 4432 03:17:03,420 --> 03:17:06,400 But, so constants are objects as well. 4433 03:17:06,400 --> 03:17:09,420 And taking the lower gives us back lowercase, hi there. 4434 03:17:09,420 --> 03:17:11,520 And so that's just one of the things 4435 03:17:11,520 --> 03:17:13,180 that you can do in the string library. 4436 03:17:13,180 --> 03:17:17,220 These are built into string variables and constants. 4437 03:17:17,220 --> 03:17:18,420 They're just always there. 4438 03:17:18,420 --> 03:17:21,540 As soon as you make a string, they're part of it. 4439 03:17:21,540 --> 03:17:24,540 And when you do type and it says it's class str, 4440 03:17:26,180 --> 03:17:27,500 we'll get to object oriented, don't worry. 4441 03:17:27,500 --> 03:17:29,420 We'll get to object oriented. 4442 03:17:29,420 --> 03:17:32,240 Okay, and so you can do things like use the type. 4443 03:17:33,380 --> 03:17:36,580 This used to say type str 4444 03:17:36,580 --> 03:17:39,340 but now it's class str, which is more of an OO term. 4445 03:17:39,340 --> 03:17:41,740 The word class is an object oriented concept. 4446 03:17:41,740 --> 03:17:42,900 But it is a string. 4447 03:17:42,900 --> 03:17:44,260 And you can use the dir and of course 4448 03:17:44,260 --> 03:17:45,540 there's extra stuff up here.
4449 03:17:45,540 --> 03:17:48,540 And this is showing all the different methods 4450 03:17:50,940 --> 03:17:53,700 or capabilities, things we can do to strings. 4451 03:17:53,700 --> 03:17:58,060 So, you know, x dot something, parenthesis. 4452 03:17:58,060 --> 03:17:59,500 Well, what can we do there? 4453 03:17:59,500 --> 03:18:02,780 This is all of those things that we can do to x's 4454 03:18:02,780 --> 03:18:05,660 that are built in and come with x's, 4455 03:18:05,660 --> 03:18:09,100 I mean come with strings when we build them. 4456 03:18:09,100 --> 03:18:12,580 And Python of course has great documentation online 4457 03:18:12,580 --> 03:18:14,780 for all of these string methods and what they do 4458 03:18:14,780 --> 03:18:17,620 and how they work and why they work the way they do. 4459 03:18:17,620 --> 03:18:20,020 And so here's some of that Python documentation. 4460 03:18:20,020 --> 03:18:22,540 We'll look at a few of these. 4461 03:18:22,540 --> 03:18:25,860 But, you know, don't hesitate to say Python string upper case 4462 03:18:25,860 --> 03:18:29,460 and then we're like oh yeah, yeah, that is upper, right? 4463 03:18:29,460 --> 03:18:31,820 And so here's a few things that we can 4464 03:18:34,540 --> 03:18:37,060 do and use, some of the ones I use a lot. 4465 03:18:37,060 --> 03:18:39,060 And we'll look at each one of these things. 4466 03:18:39,060 --> 03:18:44,060 So, the find operation says find me a substring 4467 03:18:45,820 --> 03:18:48,060 within a string, right? 4468 03:18:48,060 --> 03:18:49,580 Find me a substring within a string, 4469 03:18:49,580 --> 03:18:53,340 so find me the first na and give me back the position. 4470 03:18:53,340 --> 03:18:55,220 So that gives me back two. 4471 03:18:56,100 --> 03:18:59,140 And then I can say go find a z in there. 4472 03:18:59,140 --> 03:19:02,060 Well, there's no z and so it returns me negative one. 4473 03:19:02,060 --> 03:19:03,220 So that's what the find does. 
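The find behavior just described can be sketched like this, again assuming fruit is 'banana' (the position two and negative one results above agree with that). The names pos and missing are illustrative.

```python
fruit = 'banana'  # assumed example string

pos = fruit.find('na')      # index where the first 'na' starts
missing = fruit.find('z')   # no 'z' anywhere, so find reports -1

print(pos)
print(missing)
```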
4474 03:19:03,220 --> 03:19:07,180 So we're gonna use this kind of stuff a lot 4475 03:19:07,180 --> 03:19:09,580 and we do a lot of looking in strings. 4476 03:19:09,580 --> 03:19:11,340 Converting things to upper or lower case, 4477 03:19:11,340 --> 03:19:14,380 there is an upper method and a lower method. 4478 03:19:14,380 --> 03:19:17,100 So greet, greet dot upper and that means 4479 03:19:17,100 --> 03:19:21,580 the uppercase nnn is HELLO BOB, greet dot lower, 4480 03:19:21,580 --> 03:19:24,420 that means that www is the lowercase hello bob 4481 03:19:24,420 --> 03:19:26,260 and greet is unchanged. 4482 03:19:26,260 --> 03:19:29,020 Greet is still Hello Bob with upper and lower 4483 03:19:29,020 --> 03:19:31,660 because each of these methods basically say 4484 03:19:31,660 --> 03:19:34,940 I'm going to give you back an uppercase copy 4485 03:19:34,940 --> 03:19:37,220 or a lowercase copy of the original thing 4486 03:19:37,220 --> 03:19:39,380 without changing the original thing. 4487 03:19:42,740 --> 03:19:46,860 Search and replace is super useful, super duper useful. 4488 03:19:46,860 --> 03:19:48,860 And it's pretty clean. 4489 03:19:48,860 --> 03:19:51,420 Here we have a string and we use the replace method. 4490 03:19:51,420 --> 03:19:55,340 In this case, we're passing in the old and the new, 4491 03:19:55,340 --> 03:19:57,060 replace all Bobs with Janes. 4492 03:19:57,060 --> 03:19:59,300 And so that takes this Hello Bob 4493 03:19:59,300 --> 03:20:01,380 and turns it to Hello Jane. 4494 03:20:01,380 --> 03:20:06,380 Again, greet is unchanged, greet is unchanged 4495 03:20:07,500 --> 03:20:09,220 and it replaces more than one occurrence. 4496 03:20:09,220 --> 03:20:12,740 So this says go find, well, let's clear that. 4497 03:20:12,740 --> 03:20:14,860 This says go find all the o's 4498 03:20:14,860 --> 03:20:16,620 and replace all the o's with x's.
4499 03:20:16,620 --> 03:20:18,620 And so it goes and finds two of them 4500 03:20:18,620 --> 03:20:21,100 and then out come two x's. 4501 03:20:21,100 --> 03:20:23,300 And so that really is a replace, 4502 03:20:23,300 --> 03:20:25,220 it's not just replace the first one 4503 03:20:25,220 --> 03:20:26,880 but replace all of them. 4504 03:20:26,880 --> 03:20:30,680 White space, as we'll see, is a big deal. 4505 03:20:32,920 --> 03:20:34,420 And white space is not just blanks 4506 03:20:34,420 --> 03:20:35,660 although the most common thing 4507 03:20:35,660 --> 03:20:37,560 but it's also sort of non-printing characters 4508 03:20:37,560 --> 03:20:40,240 like tabs and new lines and other kinds of things. 4509 03:20:40,240 --> 03:20:42,520 And so we have a number of different ways 4510 03:20:42,520 --> 03:20:43,960 to strip white space. 4511 03:20:45,080 --> 03:20:46,800 So here we've got some spaces at the beginning 4512 03:20:46,800 --> 03:20:48,280 and spaces at the end. 4513 03:20:48,280 --> 03:20:50,400 And we print out, we do an L strip 4514 03:20:50,400 --> 03:20:52,200 and that throws away the spaces at the beginning. 4515 03:20:52,200 --> 03:20:54,280 That's the left, so that's the left strip. 4516 03:20:54,280 --> 03:20:56,560 It all takes any, if there's nothing there 4517 03:20:56,560 --> 03:20:58,000 it doesn't harm it. 4518 03:20:58,000 --> 03:21:01,080 R strip means throw away all the blanks on the far end. 4519 03:21:01,080 --> 03:21:04,940 And then strip says go take both sides, 4520 03:21:04,940 --> 03:21:07,260 both sides for strip and so that pulls out 4521 03:21:07,260 --> 03:21:09,220 all the spaces on both sides. 4522 03:21:09,220 --> 03:21:10,240 This will be useful 4523 03:21:10,240 --> 03:21:12,080 because sometimes when you're tearing stuff apart 4524 03:21:12,080 --> 03:21:15,120 you'll find yourself getting extra spaces. 4525 03:21:15,120 --> 03:21:17,220 Sometimes at the beginning, sometimes at the end. 
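To pull the last few methods together, here is a rough sketch of upper, lower, replace, and the three strip variants. The greeting text follows the lecture; the padded strings and most variable names are made up for illustration.

```python
greet = 'Hello Bob'

nnn = greet.upper()                   # uppercase copy
www = greet.lower()                   # lowercase copy
nstr = greet.replace('Bob', 'Jane')   # old text first, new text second
xstr = greet.replace('o', 'X')        # every 'o' is replaced, not just the first

padded = '   Hello Bob  '             # hypothetical string with blanks on both ends
left = padded.lstrip()                # drops whitespace at the beginning only
right = padded.rstrip()               # drops whitespace at the end only
both = padded.strip()                 # drops whitespace on both sides

# Tabs and newlines count as whitespace too
cleaned = '\tHello Bob\n'.strip()

print(nnn, www, nstr, xstr)
print(repr(left), repr(right), repr(both), repr(cleaned))
print(greet)                          # the original string is never modified
```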
4526 03:21:17,220 --> 03:21:21,500 And it can be tab or new line. 4527 03:21:23,400 --> 03:21:26,000 It's sort of white space. 4528 03:21:26,000 --> 03:21:30,400 Space that is kind of not visible, clear. 4529 03:21:30,400 --> 03:21:31,720 That's what white space is. 4530 03:21:31,720 --> 03:21:33,720 It's like if you were on a piece of paper 4531 03:21:33,720 --> 03:21:35,200 it's the white space. 4532 03:21:35,200 --> 03:21:37,000 It's like X, well that's not white space 4533 03:21:37,000 --> 03:21:39,040 but right here, oh that's white space. 4534 03:21:39,040 --> 03:21:43,920 It's any character that doesn't cause printing to happen. 4535 03:21:43,920 --> 03:21:46,000 If that makes any sense. 4536 03:21:46,000 --> 03:21:48,140 It's any character where nothing would be printed. 4537 03:21:48,140 --> 03:21:49,600 And there are characters like that. 4538 03:21:49,600 --> 03:21:51,640 There's like even bell characters 4539 03:21:51,640 --> 03:21:53,240 but we don't use them very much. 4540 03:21:53,240 --> 03:21:56,040 We can ask very conveniently we can say 4541 03:21:56,040 --> 03:21:59,760 hey, does this line start with a particular string? 4542 03:21:59,760 --> 03:22:03,600 And so line, this is a question, 4543 03:22:03,600 --> 03:22:05,240 gonna return a true or false. 4544 03:22:06,160 --> 03:22:07,760 Does this line start with please? 4545 03:22:07,760 --> 03:22:10,100 And the answer is true, it does start with please. 4546 03:22:10,100 --> 03:22:12,280 Does this line start with a lowercase p? 4547 03:22:12,280 --> 03:22:14,400 No, it does not. 4548 03:22:14,400 --> 03:22:16,240 And so again you'll use this in the context 4549 03:22:16,240 --> 03:22:19,360 of if something colon some block of text. 4550 03:22:19,360 --> 03:22:20,560 It's a block of code. 4551 03:22:20,560 --> 03:22:25,560 So we can combine these things to tear stuff out. 
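A quick sketch of the startswith question used inside an if. The exact text of line is an assumption; the lecture only tells us that it begins with a capitalized Please.

```python
line = 'Please have a nice day'  # assumed text; only the leading 'Please' comes from the lecture

starts_with_please = line.startswith('Please')
starts_with_lower_p = line.startswith('p')   # case matters, so lowercase p does not match

if line.startswith('Please'):
    reply = 'polite line'
else:
    reply = 'ordinary line'

print(starts_with_please, starts_with_lower_p, reply)
```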
4552 03:22:26,280 --> 03:22:29,740 And so let's assume that what we wanna do in this case 4553 03:22:29,740 --> 03:22:32,240 is we wanna take a from line. 4554 03:22:32,240 --> 03:22:36,060 This is from an email format from a mailbox. 4555 03:22:37,480 --> 03:22:40,200 And this has got the from with a space 4556 03:22:40,200 --> 03:22:42,680 and the person's email and then at sign 4557 03:22:42,680 --> 03:22:45,240 in the school they're from and a space 4558 03:22:45,240 --> 03:22:47,720 and then the rest of the stuff like when this mail was sent. 4559 03:22:47,720 --> 03:22:50,600 And this is a real mail message from this guy Steven 4560 03:22:50,600 --> 03:22:52,880 from the University of Cape Town in South Africa. 4561 03:22:52,880 --> 03:22:55,600 It's really Steven and this really is the first line 4562 03:22:55,600 --> 03:22:57,380 of a file that you'll get to know pretty well 4563 03:22:57,380 --> 03:22:58,560 by the rest of this course. 4564 03:22:58,560 --> 03:23:01,480 Hi Steven, you, we like you. 4565 03:23:01,480 --> 03:23:03,040 You are the example in my class 4566 03:23:03,040 --> 03:23:05,360 and have been for a long time. 4567 03:23:05,360 --> 03:23:07,440 People actually who know Steven have taken this class 4568 03:23:07,440 --> 03:23:09,780 and they're like Steven, I saw your picture in the class. 4569 03:23:09,780 --> 03:23:12,200 So if you're ever in Cape Town at the University of Cape Town 4570 03:23:12,200 --> 03:23:14,340 say hi to Steven and tell him that you saw him in the class. 4571 03:23:14,340 --> 03:23:16,640 But okay, that's neither here nor there. 4572 03:23:16,640 --> 03:23:20,480 What I really want to do is I want to extract his school 4573 03:23:20,480 --> 03:23:23,320 from this email line. 4574 03:23:23,320 --> 03:23:26,720 Okay, so now eventually we will do things 4575 03:23:26,720 --> 03:23:28,440 like the data will come from files 4576 03:23:28,440 --> 03:23:29,760 but this is still chapter six. 
4577 03:23:29,760 --> 03:23:31,720 So this is the data we're going to search through. 4578 03:23:31,720 --> 03:23:36,440 And so we can say, hey, let's go find the at sign. 4579 03:23:36,440 --> 03:23:38,680 Search up to this position and find the at sign. 4580 03:23:38,680 --> 03:23:42,280 So data.find at sign and give me back where that's at. 4581 03:23:42,280 --> 03:23:46,120 That's in position 21, counting from position zero. 4582 03:23:46,120 --> 03:23:48,840 Then what we're going to do is we're going to look 4583 03:23:48,840 --> 03:23:51,440 for the next space after the at sign. 4584 03:23:51,440 --> 03:23:53,560 So we're going to start at the at sign 4585 03:23:53,560 --> 03:23:56,060 and tell find to start here and look forward 4586 03:23:56,060 --> 03:23:57,920 until it finds a space. 4587 03:23:57,920 --> 03:24:00,640 So data.find, look for a space starting 4588 03:24:00,640 --> 03:24:02,640 at the position of the at sign 4589 03:24:02,640 --> 03:24:05,440 and then that'll be in position 31. 4590 03:24:05,440 --> 03:24:07,920 So 31 is what we get in the space position. 4591 03:24:07,920 --> 03:24:11,480 So now what we have is we have in two variables, 4592 03:24:11,480 --> 03:24:15,000 we have the position of the at sign 4593 03:24:15,000 --> 03:24:16,840 and the position of the space after the at sign. 4594 03:24:16,840 --> 03:24:19,600 Now what we really want is this bit right here. 4595 03:24:19,600 --> 03:24:22,360 So we have to go one beyond the at sign 4596 03:24:22,360 --> 03:24:24,660 and we don't want the space. 4597 03:24:24,660 --> 03:24:27,000 So we say we're going to use slicing here, 4598 03:24:27,000 --> 03:24:30,680 data sub at position plus one up to 4599 03:24:30,680 --> 03:24:32,400 but not including the space. 4600 03:24:32,400 --> 03:24:36,080 Oh, smiley face, because we didn't have to say space minus one 4601 03:24:36,080 --> 03:24:41,080 because that is up to but not including.
4602 03:24:41,080 --> 03:24:45,000 And so we get that little bit right there. 4603 03:24:45,000 --> 03:24:47,920 So we don't have to say minus one there 4604 03:24:47,920 --> 03:24:49,840 because this is not actually included. 4605 03:24:49,840 --> 03:24:51,400 The thing that's at the position of the space 4606 03:24:51,400 --> 03:24:52,320 is not included. 4607 03:24:52,320 --> 03:24:54,040 So that's already a little benefit 4608 03:24:54,040 --> 03:24:56,040 for the up to but not including. 4609 03:24:56,040 --> 03:24:58,300 And so when we print this variable out host, 4610 03:24:58,300 --> 03:25:02,640 we get exactly just the school that Steven works at 4611 03:25:02,640 --> 03:25:04,900 and probably went to as a matter of fact. 4612 03:25:06,080 --> 03:25:07,760 I don't know if you went there or not. 4613 03:25:07,760 --> 03:25:12,760 So this is just kind of a note for non-Latin character sets. 4614 03:25:15,360 --> 03:25:17,600 All programming languages from the 60s on 4615 03:25:17,600 --> 03:25:21,960 tended to work in what we call the Latin character set 4616 03:25:21,960 --> 03:25:25,520 which is United States and England and Europe 4617 03:25:25,520 --> 03:25:28,800 and lots of places use this ABC character set 4618 03:25:28,800 --> 03:25:30,680 and the special characters. 4619 03:25:30,680 --> 03:25:35,320 But it's really common to want to use different characters. 4620 03:25:35,320 --> 03:25:38,720 And so if you're going from Python two to Python three 4621 03:25:38,720 --> 03:25:40,240 and we'll talk about this a little later 4622 03:25:40,240 --> 03:25:44,160 when it matters more, luckily we're in Python three 4623 03:25:44,160 --> 03:25:48,060 and so one of the big things about Python three 4624 03:25:48,060 --> 03:25:51,120 is that all the internal strings are Unicode. 
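The find-then-slice steps for pulling out the host can be sketched as follows. The exact From line is an assumption, chosen so that the at sign lands at position 21 and the following space at position 31, as stated in the lecture. The names atpos and sppos are illustrative; host is the name used in the lecture.

```python
# Assumed sample line; only the positions (21 and 31) and the school come from the lecture.
data = 'From stephen.marquard@uct.ac.za Sat Jan  5 09:14:16 2008'

atpos = data.find('@')          # position of the at sign
sppos = data.find(' ', atpos)   # first space at or after the at sign

# One beyond the at sign, up to but not including the space,
# so there is no need to write sppos - 1.
host = data[atpos + 1:sppos]

print(atpos, sppos)
print(host)
```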
4625 03:25:51,120 --> 03:25:55,440 In Python two, there was sort of some confusion 4626 03:25:55,440 --> 03:25:56,960 as you went between strings 4627 03:25:56,960 --> 03:25:58,560 and this is just a little bit of code 4628 03:25:58,560 --> 03:26:01,340 and so I'm putting a in here, 4629 03:26:01,340 --> 03:26:06,100 some Asian characters, this is Korean actually, 4630 03:26:06,100 --> 03:26:08,860 Asian characters into X and I say 4631 03:26:08,860 --> 03:26:13,020 what kind of a thing this is and that is a string 4632 03:26:13,020 --> 03:26:15,020 and then there's this Unicode 4633 03:26:15,020 --> 03:26:17,300 and this comes from Python two. 4634 03:26:17,300 --> 03:26:20,940 If it's a Unicode operation, it's still a string 4635 03:26:20,940 --> 03:26:23,180 whereas in Python two, if you put 4636 03:26:23,180 --> 03:26:26,580 a international characters into X, then it was a string 4637 03:26:26,580 --> 03:26:29,000 and then there was a separate kind of a constant 4638 03:26:29,000 --> 03:26:30,620 called a Unicode constant 4639 03:26:30,620 --> 03:26:33,280 and it was a different type and there was ways 4640 03:26:33,280 --> 03:26:36,900 that you had to mess with these Unicode variables 4641 03:26:36,900 --> 03:26:39,340 as you did things like read them from files 4642 03:26:39,340 --> 03:26:41,860 and put them back into files and did other things. 
4643 03:26:41,860 --> 03:26:43,740 So it was much more difficult 4644 03:26:46,740 --> 03:26:49,620 in Python two but we're doing in Python three 4645 03:26:49,620 --> 03:26:53,700 and in Python three, it natively understands 4646 03:26:53,700 --> 03:26:57,460 non-Latin character sets, international Asian character sets, 4647 03:26:57,460 --> 03:26:59,060 Spanish, French character sets 4648 03:26:59,060 --> 03:27:01,540 and so this is a good thing for Python three 4649 03:27:01,540 --> 03:27:04,620 and this is one of the real benefits of using Python three 4650 03:27:04,620 --> 03:27:07,220 and as we start doing stuff where we're exchanging data 4651 03:27:07,220 --> 03:27:10,460 with the outside world, this will come into play 4652 03:27:10,460 --> 03:27:13,180 and I'll have to show you how to use it. 4653 03:27:13,180 --> 03:27:14,580 There was weird things that you had to do, 4654 03:27:14,580 --> 03:27:18,180 it just makes a lot more sense in Python three, okay? 4655 03:27:18,180 --> 03:27:20,540 So we've talked about strings, 4656 03:27:20,540 --> 03:27:23,500 we learned about the string, we're converting it, 4657 03:27:23,500 --> 03:27:24,620 we've done a whole bunch of stuff 4658 03:27:24,620 --> 03:27:28,740 and this is again, we're not yet doing anything 4659 03:27:28,740 --> 03:27:31,180 super useful, we're learning sort of how to like slice 4660 03:27:31,180 --> 03:27:33,980 and dice even though we're sort of not making the meal yet. 4661 03:27:33,980 --> 03:27:36,500 Up next, we're gonna talk about files, 4662 03:27:36,500 --> 03:27:38,180 we're gonna read some data and we're gonna slice 4663 03:27:38,180 --> 03:27:41,740 and dice and use all the things in the next chapter 4664 03:27:41,740 --> 03:27:43,420 that we've learned up to this point. 4665 03:27:43,420 --> 03:27:45,020 So see you in a bit. 4666 03:27:49,100 --> 03:27:51,260 Hello and welcome to chapter seven. 
4667 03:27:51,260 --> 03:27:53,940 This is the chapter where it all really starts to pay off. 4668 03:27:53,940 --> 03:27:56,980 We have been learning bits and pieces 4669 03:27:56,980 --> 03:28:01,820 and doing little two lines, three lines, four lines of code 4670 03:28:01,820 --> 03:28:04,420 to learn the basic building blocks of Python 4671 03:28:04,420 --> 03:28:07,660 and learn some of the syntax and define lots of terms 4672 03:28:07,660 --> 03:28:11,140 but now we're actually going to start doing something. 4673 03:28:11,140 --> 03:28:14,060 So if you look at what we've been doing so far, 4674 03:28:14,060 --> 03:28:17,580 you know, we have been, we're inside this little computer 4675 03:28:17,580 --> 03:28:20,660 and you type, you know, Python says what next 4676 03:28:20,660 --> 03:28:22,620 and you give it its command and it does something 4677 03:28:22,620 --> 03:28:24,900 and you do something else and it does something 4678 03:28:24,900 --> 03:28:26,500 and you do this three or four times 4679 03:28:26,500 --> 03:28:28,260 unless you write a loop and then it goes like, 4680 03:28:28,260 --> 03:28:30,500 you know, 10, 20 times and that's it. 4681 03:28:30,500 --> 03:28:33,380 And then maybe we write a thing that reads something 4682 03:28:33,380 --> 03:28:35,580 from our keyboard, gives us something back 4683 03:28:35,580 --> 03:28:37,740 and then we write something and print something out, 4684 03:28:37,740 --> 03:28:39,300 print a few things out 4685 03:28:39,300 --> 03:28:42,260 and so we've been pretty much using the keyboard, 4686 03:28:42,260 --> 03:28:45,460 the screen, the CPU and the memory. 4687 03:28:45,460 --> 03:28:47,740 That's kind of where we've been living.
4688 03:28:47,740 --> 03:28:49,620 And while it's important to talk to the keyboard 4689 03:28:49,620 --> 03:28:53,580 and the screen, the real world is things like databases 4690 03:28:53,580 --> 03:28:56,380 that live out here, files that live on our systems 4691 03:28:56,380 --> 03:28:58,580 and, you know, connecting to the network 4692 03:28:58,580 --> 03:29:01,380 and reading data from the network. 4693 03:29:01,380 --> 03:29:03,420 And so that's what we're starting to do right now 4694 03:29:03,420 --> 03:29:07,160 is we're starting to be able to work outside 4695 03:29:07,160 --> 03:29:09,860 kind of our code and create things that are permanent. 4696 03:29:10,860 --> 03:29:12,220 And so we're gonna be talking, 4697 03:29:12,220 --> 03:29:14,300 initially we're gonna work on files. 4698 03:29:14,300 --> 03:29:16,500 We'll later talk to databases and the network 4699 03:29:16,500 --> 03:29:20,020 and other stuff, but for now we are talking about files. 4700 03:29:20,020 --> 03:29:22,820 And so really kind of, we're stepping out a little bit 4701 03:29:22,820 --> 03:29:25,340 and reading things that are permanent 4702 03:29:25,340 --> 03:29:28,020 and creating things that are permanent. 4703 03:29:28,020 --> 03:29:30,620 The kinds of files that we're going to talk about mostly 4704 03:29:30,620 --> 03:29:33,360 are text files and you can think of these 4705 03:29:33,360 --> 03:29:36,300 as a sequence of lines in a file 4706 03:29:36,300 --> 03:29:38,060 that are easily read by Python. 4707 03:29:39,060 --> 03:29:40,920 You've been making text files all along. 4708 03:29:40,920 --> 03:29:42,900 Your, you know, hello.py. 4709 03:29:44,860 --> 03:29:46,100 That file's a text file too. 4710 03:29:46,100 --> 03:29:48,620 You're using a text editor to create that file. 4711 03:29:48,620 --> 03:29:50,140 You put your Python commands in a file, 4712 03:29:50,140 --> 03:29:52,540 you run those files and that's what it is.
4713 03:29:52,540 --> 03:29:55,980 And so a file can be thought of as a bunch of lines, 4714 03:29:55,980 --> 03:29:58,020 you know, one, two, three, four, five, six, seven, 4715 03:29:58,020 --> 03:29:59,300 a blank line here. 4716 03:29:59,300 --> 03:30:03,500 That's possible and, but the reality is, 4717 03:30:03,500 --> 03:30:05,500 is that these are actually just lines 4718 03:30:05,500 --> 03:30:07,540 and we have a special character called the new line 4719 03:30:07,540 --> 03:30:09,300 that we'll talk about in a second. 4720 03:30:10,660 --> 03:30:14,180 So to read a file, you have to call the open function. 4721 03:30:14,180 --> 03:30:16,960 And open returns what we call a file handle. 4722 03:30:16,960 --> 03:30:19,320 Open doesn't actually read the file. 4723 03:30:19,320 --> 03:30:23,540 Open makes it possible so that you can read the file. 4724 03:30:25,100 --> 03:30:27,380 So the parameters to open are, 4725 03:30:27,380 --> 03:30:29,320 it takes one parameter that's required, 4726 03:30:29,320 --> 03:30:30,780 which is the name of the file, 4727 03:30:30,780 --> 03:30:31,960 another parameter that's optional, 4728 03:30:31,960 --> 03:30:33,900 whether or not to read it or write it. 4729 03:30:33,900 --> 03:30:36,100 If we're reading the file, it doesn't harm it. 4730 03:30:36,100 --> 03:30:37,100 You can read it over and over. 4731 03:30:37,100 --> 03:30:38,500 If you write it, it actually, 4732 03:30:38,500 --> 03:30:39,660 if there's already data in that file, 4733 03:30:39,660 --> 03:30:41,140 it truncates it and writes something. 4734 03:30:41,140 --> 03:30:42,580 And we're not gonna really write files, 4735 03:30:42,580 --> 03:30:43,980 we're mostly gonna read them. 4736 03:30:43,980 --> 03:30:46,680 And so open, sort of, you pass it in a file, 4737 03:30:46,680 --> 03:30:48,260 it gives you back this file handle 4738 03:30:48,260 --> 03:30:50,820 and then you have a variable in which you store it. 
4739 03:30:50,820 --> 03:30:54,700 I often call it fhand to be mnemonic. 4740 03:30:54,700 --> 03:30:57,420 You'll see my code, I use fhand all the time 4741 03:30:57,420 --> 03:31:00,140 to indicate that that is a file handle. 4742 03:31:00,140 --> 03:31:04,780 And so if we were to run this in an interactive mode, 4743 03:31:04,780 --> 03:31:08,940 we'll open mbox.txt and that is a function 4744 03:31:08,940 --> 03:31:11,300 built into Python and then it gives us back a handle. 4745 03:31:11,300 --> 03:31:12,820 It does not give the data. 4746 03:31:12,820 --> 03:31:15,940 You can kinda see this when we print out the file handle 4747 03:31:15,940 --> 03:31:17,460 using the print statement. 4748 03:31:17,460 --> 03:31:20,240 It doesn't print the lines that are in the file. 4749 03:31:20,240 --> 03:31:21,860 The lines that are in the file are sort of out there. 4750 03:31:21,860 --> 03:31:24,380 There could be like, you know, 10 million lines 4751 03:31:24,380 --> 03:31:26,760 for all we know, lines in the file. 4752 03:31:27,700 --> 03:31:30,700 The handle's like a little opening outside of your program 4753 03:31:30,700 --> 03:31:32,900 and you can talk to the file by opening it, 4754 03:31:32,900 --> 03:31:34,460 then you can read stuff, you could, 4755 03:31:34,460 --> 03:31:36,500 if you're writing the file, you can write stuff 4756 03:31:36,500 --> 03:31:38,660 and then you close the file to shut the handle down. 4757 03:31:38,660 --> 03:31:42,460 But handle is a thing that allows you to get to the file. 4758 03:31:42,460 --> 03:31:45,540 It is not the file itself and it's not the data in the file, 4759 03:31:45,540 --> 03:31:48,460 it's just a wrapper that kind of allows you. 
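To make the handle idea concrete, here is a rough sketch. So that it runs anywhere, it first writes a tiny stand-in file to a temporary folder; that setup, the file contents, and the names tmpdir, path, and content are assumptions for illustration, not part of the lecture. Only open, the fhand name, and mbox.txt come from the talk.

```python
import os
import tempfile

# Setup (hypothetical): create a small stand-in for mbox.txt in a temp folder.
tmpdir = tempfile.mkdtemp()
path = os.path.join(tmpdir, 'mbox.txt')
with open(path, 'w') as out:
    out.write('line one\nline two\n')

fhand = open(path)       # the second parameter is optional; reading is the default
print(fhand)             # prints a description of the handle, not the lines in the file

content = fhand.read()   # the data only arrives when we actually read through the handle
fhand.close()            # shut the handle down when we are done

print(len(content))
```

Printing fhand shows the file name, the mode, and the encoding; the exact text varies from system to system.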
4760 03:31:48,460 --> 03:31:49,740 So this, if you print it out, it's like, 4761 03:31:49,740 --> 03:31:51,780 that's the file we opened, we're reading it 4762 03:31:51,780 --> 03:31:53,700 and encoding has to do with the different kinds 4763 03:31:53,700 --> 03:31:55,620 of character sets, which we talked about 4764 03:31:55,620 --> 03:31:57,940 at the end of last lecture, the Unicode character set, 4765 03:31:57,940 --> 03:32:01,540 et cetera, UTF-8 is a great character set. 4766 03:32:01,540 --> 03:32:04,820 It's probably the most typical character set 4767 03:32:04,820 --> 03:32:05,900 that you will run into, 4768 03:32:05,900 --> 03:32:08,740 although files can be in different character sets, 4769 03:32:08,740 --> 03:32:10,220 but most of them are UTF-8. 4770 03:32:11,940 --> 03:32:14,220 So, of course, this is Python. 4771 03:32:14,220 --> 03:32:16,980 If you make a mistake and name a file that doesn't exist, 4772 03:32:16,980 --> 03:32:19,800 we get a traceback and it blows up. 4773 03:32:23,340 --> 03:32:25,620 We'll show you in a second how to deal with that. 4774 03:32:25,620 --> 03:32:28,420 Now, the newline character is an important part 4775 03:32:28,420 --> 03:32:32,580 of file reading and in strings, 4776 03:32:32,580 --> 03:32:34,480 we can put the newline character in 4777 03:32:34,480 --> 03:32:36,740 by this backslash n character. 4778 03:32:36,740 --> 03:32:39,500 And the backslash n is the character that indicates 4779 03:32:39,500 --> 03:32:42,100 that we're supposed to go to another line. 4780 03:32:42,100 --> 03:32:44,940 Go to a newline, go to a newline. 4781 03:32:44,940 --> 03:32:46,140 And so we have, what is this? 4782 03:32:46,140 --> 03:32:48,540 Well, that's a backslash n, that's a backslash n. 4783 03:32:50,580 --> 03:32:53,140 And so, if we print it out, we print it this way, 4784 03:32:53,140 --> 03:32:54,940 we see that the backslash n is in there. 4785 03:32:54,940 --> 03:32:55,940 This is how we type it.
4786 03:32:55,940 --> 03:32:58,740 We actually type backslash n to Python 4787 03:32:58,740 --> 03:33:01,960 to indicate that we're supposed to put that there. 4788 03:33:03,140 --> 03:33:04,900 But if we do a print statement, 4789 03:33:04,900 --> 03:33:06,620 it actually interprets the backslash n, 4790 03:33:06,620 --> 03:33:09,460 so the backslash n causes this movement to the beginning. 4791 03:33:09,460 --> 03:33:11,860 Now, the print actually, at the end of this, 4792 03:33:11,860 --> 03:33:13,180 adds another backslash n. 4793 03:33:13,180 --> 03:33:15,900 So, the backslash n that we put in 4794 03:33:15,900 --> 03:33:17,940 by putting it into the string is that one. 4795 03:33:17,940 --> 03:33:20,740 And then print always puts a backslash n at the end. 4796 03:33:21,700 --> 03:33:25,100 There's actually a way to override that backslash n behavior 4797 03:33:25,100 --> 03:33:26,900 by putting something on the print statement, 4798 03:33:26,900 --> 03:33:28,580 which we'll talk about later. 4799 03:33:28,580 --> 03:33:30,540 Now, it's important to note 4800 03:33:30,540 --> 03:33:33,460 that the backslash n is one character, right? 4801 03:33:33,460 --> 03:33:37,900 And so, even though this x backslash ny prints this, 4802 03:33:37,900 --> 03:33:40,500 and then print adds another new line to go down to here, 4803 03:33:40,500 --> 03:33:41,940 if you ask how many characters, 4804 03:33:41,940 --> 03:33:44,860 what is the length of this, well, it's only three. 4805 03:33:44,860 --> 03:33:46,740 That's because that's a character, 4806 03:33:46,740 --> 03:33:49,260 the backslash n is a character, and the y is a character. 4807 03:33:49,260 --> 03:33:50,820 So, it's a three character string. 4808 03:33:50,820 --> 03:33:52,620 So, the backslash n is a character 4809 03:33:52,620 --> 03:33:54,380 like all the rest of the characters, 4810 03:33:54,380 --> 03:33:59,100 but it's only, we encode it by typing backslash n. 
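
A small sketch of the newline point just made. The variable name `stuff` is an assumption (the transcript does not show the slide's variable), and the `end=""` argument on the last line is presumably the override the lecture defers until later.

```python
stuff = "X\nY"
print(repr(stuff))    # shows the string as typed, escape included
print(stuff)          # print interprets the newline, then adds its own at the end
print(len(stuff))     # 3: X, the newline, and Y
print(stuff, end="")  # end="" suppresses the newline print normally adds
```

The length is three because the two typed characters `\n` stand for a single newline character in the string.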
4811 03:33:59,100 --> 03:34:01,620 It's called an escape, where the backslash is the escape. 4812 03:34:01,620 --> 03:34:04,220 Backslash n is a way to say new line, 4813 03:34:04,220 --> 03:34:05,500 because we can't see it. 4814 03:34:05,500 --> 03:34:08,060 It's a way for us to encode in a string 4815 03:34:08,060 --> 03:34:11,660 this non-printable character, this invisible character. 4816 03:34:11,660 --> 03:34:13,780 The white space, it's part of white space. 4817 03:34:14,900 --> 03:34:16,500 So, as we're reading through the file, 4818 03:34:16,500 --> 03:34:18,260 we can think of it as a sequence of lines, 4819 03:34:18,260 --> 03:34:20,180 and we can read these a line at a time. 4820 03:34:20,180 --> 03:34:22,700 We can also read them a character at a time if we want. 4821 03:34:22,700 --> 03:34:25,040 And so, but it's more common to say read this line, 4822 03:34:25,040 --> 03:34:27,180 read the next line, read the line after that, 4823 03:34:27,180 --> 03:34:28,980 et cetera, et cetera, et cetera. 4824 03:34:28,980 --> 03:34:31,060 But the way to best think about this, 4825 03:34:31,940 --> 03:34:33,140 it doesn't really matter. 4826 03:34:33,140 --> 03:34:34,540 You can think about it as lines, 4827 03:34:34,540 --> 03:34:36,940 and we will in most of the programs that we write. 4828 03:34:36,940 --> 03:34:39,820 But realize that the way when we see this, 4829 03:34:41,380 --> 03:34:44,980 we see it like this, it comes back to the beginning, 4830 03:34:44,980 --> 03:34:45,820 it comes back to the beginning. 4831 03:34:45,820 --> 03:34:47,980 There's a character in the file. 4832 03:34:47,980 --> 03:34:50,180 At each of these points to say go back to the beginning. 4833 03:34:50,180 --> 03:34:53,380 It's like hitting the enter key on your computer. 4834 03:34:53,380 --> 03:34:54,500 And that is a new line. 
4835 03:34:54,500 --> 03:34:56,940 So you have to think that in the file, 4836 03:34:56,940 --> 03:35:00,300 in order for your text editor and Python and everybody 4837 03:35:00,300 --> 03:35:03,660 to know where the lines end, you put new lines in the file. 4838 03:35:03,660 --> 03:35:05,420 And that's another character. 4839 03:35:05,420 --> 03:35:08,740 So, you know, this looks like an empty line. 4840 03:35:08,740 --> 03:35:10,420 This line here looks like an empty line, 4841 03:35:10,420 --> 03:35:11,780 but really it has a single character, 4842 03:35:11,780 --> 03:35:13,300 and the character is a new line. 4843 03:35:13,300 --> 03:35:14,980 And it turns out that in a bit, 4844 03:35:14,980 --> 03:35:17,620 we're gonna need to keep track of the fact that 4845 03:35:17,620 --> 03:35:20,440 every line is ended by a new line. 4846 03:35:20,440 --> 03:35:22,200 So up next, I'm gonna talk a little bit 4847 03:35:22,200 --> 03:35:24,660 about how to read files in Python. 4848 03:35:28,480 --> 03:35:30,560 So we're gonna find that there's a number of different ways 4849 03:35:30,560 --> 03:35:31,740 that we can read through the file. 4850 03:35:31,740 --> 03:35:33,300 But the most common way that we're gonna read 4851 03:35:33,300 --> 03:35:35,980 through the file is to treat it as a sequence of lines. 4852 03:35:35,980 --> 03:35:38,100 And we're gonna use the determinate loop, 4853 03:35:38,100 --> 03:35:40,660 the for loop, to do this. 4854 03:35:40,660 --> 03:35:43,740 And so what happens here is we get back this handle, 4855 03:35:43,740 --> 03:35:45,500 that opens the file and gives us back the handle. 4856 03:35:45,500 --> 03:35:49,100 That handle xfile is the variable I named, 4857 03:35:49,100 --> 03:35:51,020 I just named it xfile. 4858 03:35:51,020 --> 03:35:52,260 That's not the data. 4859 03:35:52,260 --> 03:35:54,860 But it is a sequence. 
4860 03:35:54,860 --> 03:35:59,100 It is that file handle represents to Python a sequence 4861 03:35:59,100 --> 03:36:00,820 that we can potentially walk through 4862 03:36:00,820 --> 03:36:02,140 and then get all the lines. 4863 03:36:02,140 --> 03:36:04,940 And it's the simplest, most beautiful, elegant way 4864 03:36:04,940 --> 03:36:07,060 to read all the lines in a file. 4865 03:36:07,060 --> 03:36:09,700 We use the for loop and we have an iteration variable. 4866 03:36:09,700 --> 03:36:12,540 This is going to take, when we talk about the file, 4867 03:36:12,540 --> 03:36:14,540 cheese is gonna be the first line, then the second line, 4868 03:36:14,540 --> 03:36:15,780 then the third line, then the fourth line. 4869 03:36:15,780 --> 03:36:17,300 So it's like going through a string, 4870 03:36:17,300 --> 03:36:18,500 but you're going through a file now 4871 03:36:18,500 --> 03:36:19,780 and you're getting it line by line. 4872 03:36:19,780 --> 03:36:21,300 So that's each line. 4873 03:36:21,300 --> 03:36:22,840 I just picked a variable named cheese 4874 03:36:22,840 --> 03:36:23,900 so you didn't get confused. 4875 03:36:23,900 --> 03:36:25,500 Later I'll call this line. 4876 03:36:25,500 --> 03:36:28,300 But Python doesn't know anything special 4877 03:36:28,300 --> 03:36:30,340 by naming that variable line. 4878 03:36:30,340 --> 03:36:32,980 Okay, and so this is, it's the for and the in. 4879 03:36:32,980 --> 03:36:36,820 And so I read this as for each line 4880 03:36:36,820 --> 03:36:40,260 in the file handle xfile. 4881 03:36:40,260 --> 03:36:43,460 So run this loop one time for every line 4882 03:36:43,460 --> 03:36:44,500 and then print it out. 4883 03:36:44,500 --> 03:36:47,920 So it's actually really quite simple, okay? 
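
The for-loop reading pattern described above might look like the sketch below. The names `xfile` and `cheese` come from the lecture; the sample file and its three lines are invented so the example is self-contained.

```python
import os
import tempfile

# Hypothetical three-line file standing in for the lecture's data.
path = os.path.join(tempfile.mkdtemp(), "sample.txt")
with open(path, "w", encoding="utf-8") as out:
    out.write("first line\nsecond line\nthird line\n")

xfile = open(path, encoding="utf-8")
seen = []
for cheese in xfile:      # cheese is each line in turn, newline included
    print(cheese)
    seen.append(cheese)
xfile.close()
```

Python attaches no meaning to the name `cheese`; any variable name works as the iteration variable.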
4884 03:36:49,060 --> 03:36:53,340 Other languages like C or C++ or other languages, 4885 03:36:53,340 --> 03:36:55,860 they have to write while loops with end of file conditions 4886 03:36:55,860 --> 03:36:58,740 and all kinds of things that make this very difficult. 4887 03:36:58,740 --> 03:37:02,300 But this is one of the prettiest things that Python has. 4888 03:37:02,300 --> 03:37:05,020 It's a very, very pretty thing. 4889 03:37:07,540 --> 03:37:09,960 Okay, so let's talk about what we might do. 4890 03:37:09,960 --> 03:37:12,060 And we're going kind of back to iterations now. 4891 03:37:12,060 --> 03:37:14,180 What if we wanted to count the number of lines in a file? 4892 03:37:14,180 --> 03:37:16,540 Well, this is a basic loop counting pattern. 4893 03:37:17,500 --> 03:37:20,500 So we open the file and then like in all these loops, 4894 03:37:20,500 --> 03:37:23,260 we do something to sort of prime the loop to get it started, 4895 03:37:23,260 --> 03:37:24,940 set a variable count to zero. 4896 03:37:24,940 --> 03:37:26,860 And I'm gonna use the variable line 4897 03:37:26,860 --> 03:37:29,260 that's gonna go through each of the lines in the file 4898 03:37:29,260 --> 03:37:32,460 for line in fhand, down the file. 4899 03:37:32,460 --> 03:37:33,980 And it's gonna run this loop once 4900 03:37:33,980 --> 03:37:35,060 for each line in the file 4901 03:37:35,060 --> 03:37:37,260 and the variable line is gonna change. 4902 03:37:37,260 --> 03:37:39,620 But all I'm gonna do is add count equals count plus one. 4903 03:37:39,620 --> 03:37:41,540 And so that's just like from counters, 4904 03:37:41,540 --> 03:37:43,420 that's just how you detect. 4905 03:37:43,420 --> 03:37:44,620 So every time we see a line, 4906 03:37:44,620 --> 03:37:45,700 we're just gonna add one to the counter. 4907 03:37:45,700 --> 03:37:46,580 We're not printing the line, 4908 03:37:46,580 --> 03:37:48,820 we're not even looking at its data at this point. 
4909 03:37:48,820 --> 03:37:49,940 And then when the loop is done, 4910 03:37:49,940 --> 03:37:51,500 however many times it has to go, 4911 03:37:51,500 --> 03:37:54,340 out it comes and we print out line count equals count. 4912 03:37:54,340 --> 03:37:56,700 And so if we open mbox.txt, 4913 03:37:56,700 --> 03:37:58,180 this is gonna do all this work 4914 03:37:58,180 --> 03:37:59,620 and then print this line out 4915 03:37:59,620 --> 03:38:03,140 and say line count is 132,045. 4916 03:38:03,140 --> 03:38:05,940 So this is a little five line program 4917 03:38:05,940 --> 03:38:08,180 that shows you how to count the lines 4918 03:38:08,180 --> 03:38:10,660 in a text file using Python. 4919 03:38:10,660 --> 03:38:12,900 Again, simple and elegant 4920 03:38:12,900 --> 03:38:15,620 and not too much syntax for you to have to learn. 4921 03:38:16,800 --> 03:38:18,920 Now it's also possible to read the file 4922 03:38:18,920 --> 03:38:22,120 as a series of characters all in one go. 4923 03:38:22,120 --> 03:38:23,180 Read the whole file in. 4924 03:38:23,180 --> 03:38:25,940 Now you gotta be careful: depending on the size of the file, 4925 03:38:25,940 --> 03:38:28,340 this is gonna lead to a string variable 4926 03:38:28,340 --> 03:38:29,740 with a lot of data in it. 4927 03:38:29,740 --> 03:38:32,180 Now if it's 100,000 characters, 4928 03:38:32,180 --> 03:38:33,980 that's actually kind of a small thing. 4929 03:38:33,980 --> 03:38:36,840 But if it was 10 million lines, 4930 03:38:36,840 --> 03:38:38,100 that would probably not be good. 4931 03:38:38,100 --> 03:38:40,020 You'd wanna read it one line at a time 4932 03:38:40,020 --> 03:38:42,620 and process each line and then do something. 4933 03:38:42,620 --> 03:38:46,140 But mbox-short.txt is a small little file. 4934 03:38:46,140 --> 03:38:50,260 So we open it and we get back a file object, 4935 03:38:50,260 --> 03:38:53,060 file handle object, and we call the read method. 
4936 03:38:53,060 --> 03:38:55,620 And that says go through and read all the text 4937 03:38:55,620 --> 03:38:59,220 and give it back in one big blob, one big string, 4938 03:38:59,220 --> 03:39:00,660 and I'll put it in inp. 4939 03:39:00,660 --> 03:39:03,140 And so that's where you have a line, a new line, 4940 03:39:03,140 --> 03:39:05,500 a line, a new line, a line, a new line. 4941 03:39:05,500 --> 03:39:08,240 So not really lines, it's just a sequence of characters 4942 03:39:08,240 --> 03:39:10,620 with new lines in there to punctuate them. 4943 03:39:10,620 --> 03:39:11,900 And now you can split that, 4944 03:39:11,900 --> 03:39:14,060 later we'll see how to split that 4945 03:39:14,060 --> 03:39:16,500 into separate lines if you want. 4946 03:39:16,500 --> 03:39:19,060 Now I picked a file that was short, 4947 03:39:19,060 --> 03:39:22,660 and so this inp variable now has a string in it 4948 03:39:22,660 --> 03:39:24,480 and I can use the len function, 4949 03:39:24,480 --> 03:39:26,100 pass a string into the len function, 4950 03:39:26,100 --> 03:39:29,040 it says oh 94,626 characters. 4951 03:39:29,040 --> 03:39:32,820 That's kind of a small little file. 4952 03:39:32,820 --> 03:39:35,280 And perfectly okay to read it all in one go. 4953 03:39:36,380 --> 03:39:38,900 And so now I say just print the first 20 characters, 4954 03:39:38,900 --> 03:39:41,300 that's beginning to up to but not including 20, 4955 03:39:41,300 --> 03:39:44,520 and so it shows the first 20 characters 4956 03:39:44,520 --> 03:39:46,500 of that little file is a from line, 4957 03:39:46,500 --> 03:39:48,140 because this is a mailbox file. 4958 03:39:51,080 --> 03:39:53,560 Now let's say we're going to do a search, 4959 03:39:53,560 --> 03:39:56,140 and we did this loop where you're looking for something. 4960 03:39:56,140 --> 03:39:58,100 And so we're going to search for lines 4961 03:39:58,100 --> 03:40:01,540 that have a prefix of from, okay? 
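
The line-counting loop described earlier in this passage can be sketched as follows. The four-line sample file is an assumption made so the example runs on its own; with the course's mbox.txt the lecture reports a count of 132,045.

```python
import os
import tempfile

# Hypothetical four-line stand-in for mbox.txt.
path = os.path.join(tempfile.mkdtemp(), "sample-mbox.txt")
with open(path, "w", encoding="utf-8") as out:
    out.write("From someone@example.com Sat Jan  5 09:14:16 2008\n")
    out.write("Subject: hello\n")
    out.write("\n")
    out.write("Body text\n")

fhand = open(path, encoding="utf-8")
count = 0                 # prime the loop
for line in fhand:        # runs once per line; line itself is never inspected
    count = count + 1
fhand.close()
print("Line Count:", count)
```

Because the loop reads one line at a time, this works the same way on a file far too large to hold in memory at once.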
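
Reading the whole file in one go, as described above, could look like this sketch. The variable name `inp` follows the lecture; the two-line sample file is made up, so the length printed here will not be the 94,626 quoted for mbox-short.txt.

```python
import os
import tempfile

# Hypothetical small mailbox-style file.
path = os.path.join(tempfile.mkdtemp(), "sample-mbox.txt")
with open(path, "w", encoding="utf-8") as out:
    out.write("From someone@example.com Sat Jan  5 09:14:16 2008\n")
    out.write("Subject: test\n")

fhand = open(path, encoding="utf-8")
inp = fhand.read()        # one big string, newlines included
fhand.close()
print(len(inp))           # total number of characters
print(inp[:20])           # positions 0 up to but not including 20
```

This is fine for small files; for very large ones the line-by-line loop avoids building one huge string.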
4962 03:40:01,540 --> 03:40:02,380 That's what we're going to do, 4963 03:40:02,380 --> 03:40:03,500 and we're going to print those lines out. 4964 03:40:03,500 --> 03:40:05,460 So there's lots of lines in this file, 4965 03:40:07,140 --> 03:40:09,200 line, line, line, line, from, 4966 03:40:09,200 --> 03:40:11,180 line, line, line, line, from, right? 4967 03:40:11,180 --> 03:40:12,060 On and on and on. 4968 03:40:12,060 --> 03:40:13,640 And we only want to show these lines, 4969 03:40:13,640 --> 03:40:14,640 the ones that match, right? 4970 03:40:14,640 --> 03:40:16,140 That's what we want to do. 4971 03:40:16,140 --> 03:40:20,900 And so we are going to write an open statement 4972 03:40:20,900 --> 03:40:22,720 and then we're going to loop through, 4973 03:40:22,720 --> 03:40:23,980 and we're going to ask the question, 4974 03:40:23,980 --> 03:40:26,980 if the line starts with from, print it. 4975 03:40:26,980 --> 03:40:29,340 So sometimes it's going to skip, skip, skip, skip, 4976 03:40:29,340 --> 03:40:30,180 and then it's going to run it, 4977 03:40:30,180 --> 03:40:32,260 and skip, skip, skip, skip, skip, 4978 03:40:32,260 --> 03:40:34,060 and it's going to run it, skip, skip, skip, 4979 03:40:34,060 --> 03:40:35,940 and then it's going to run it, okay? 4980 03:40:35,940 --> 03:40:38,580 So that's the basic idea, 4981 03:40:38,580 --> 03:40:40,620 and then it'll finish when it's all said and done. 4982 03:40:40,620 --> 03:40:44,220 And so this is like a criteria, this is like a search. 4983 03:40:44,220 --> 03:40:47,040 We're looking for lines that match the string, 4984 03:40:47,040 --> 03:40:50,200 that have the string from as their prefix. 4985 03:40:50,200 --> 03:40:52,960 Now, when we look at the output of this, 4986 03:40:52,960 --> 03:40:54,420 it's kind of weird. 4987 03:40:54,420 --> 03:40:58,120 We see kind of these little blank lines that show up. 4988 03:40:58,120 --> 03:41:01,380 Blank, blank, blank, blank, blank, blank, blank. 
4989 03:41:01,380 --> 03:41:03,120 What's going on here? 4990 03:41:04,080 --> 03:41:04,920 What's going on? 4991 03:41:04,920 --> 03:41:06,160 So let's take a quick look. 4992 03:41:07,360 --> 03:41:09,240 The problem is, is new lines. 4993 03:41:09,240 --> 03:41:13,480 Well, I mentioned that the file has new lines in them. 4994 03:41:13,480 --> 03:41:15,640 And so when you do the for loop, 4995 03:41:15,640 --> 03:41:17,200 it doesn't throw the new lines away. 4996 03:41:17,200 --> 03:41:19,000 As you might expect, 4997 03:41:19,000 --> 03:41:20,840 it would be kind of nice if it did, but it doesn't. 4998 03:41:20,840 --> 03:41:23,560 It actually shows you when you read, 4999 03:41:23,560 --> 03:41:26,480 it reads that first line up to and including the new line 5000 03:41:26,480 --> 03:41:28,320 and gives you that back as the variable. 5001 03:41:28,320 --> 03:41:29,960 So that is the first new line. 5002 03:41:29,960 --> 03:41:31,600 So that means it's going to go down. 5003 03:41:31,600 --> 03:41:34,740 And then the print statement actually adds another new line. 5004 03:41:34,740 --> 03:41:36,600 So that's the second line of the file 5005 03:41:36,600 --> 03:41:38,200 has a new line at the end of it, 5006 03:41:38,200 --> 03:41:40,000 and the print statement adds another new line. 5007 03:41:40,000 --> 03:41:41,600 So if we take a look at the code, 5008 03:41:43,040 --> 03:41:46,440 there is a new line, oops, come back. 5009 03:41:46,440 --> 03:41:50,560 If we take a look at the code, 5010 03:41:50,560 --> 03:41:53,600 this variable line has a new line in it, oops, 5011 03:41:53,600 --> 03:41:54,440 where am I at? 5012 03:41:54,440 --> 03:41:56,160 I'm in the wrong slide, there we go. 5013 03:41:58,500 --> 03:42:01,040 Yeah, this is what I want to do. 5014 03:42:01,040 --> 03:42:03,440 If we look at the code, there's a new line in here, 5015 03:42:03,440 --> 03:42:05,200 and then the print adds another new line. 
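
To make the double-spacing visible, here is a small sketch. The two strings are invented examples of what the for loop hands back for each line of a file.

```python
# Each line read from a file still ends with the file's own newline.
lines = ["From a@example.com\n", "From b@example.com\n"]

for line in lines:
    print(repr(line))  # repr shows the trailing \n that came from the file
    print(line)        # file's newline plus print's newline gives a blank line
```

The `repr` output shows the newline is part of the data, which is why each printed line is followed by a blank one.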
5016 03:42:05,200 --> 03:42:07,600 So the print adds a separate new line. 5017 03:42:07,600 --> 03:42:09,600 And that's how we get two new lines. 5018 03:42:09,600 --> 03:42:10,800 The print statement's new line 5019 03:42:10,800 --> 03:42:12,560 and the new line from the file. 5020 03:42:13,840 --> 03:42:14,760 Here's how we fix it. 5021 03:42:14,760 --> 03:42:16,040 And you're going to write this code a lot 5022 03:42:16,040 --> 03:42:18,080 because when you're reading text files, 5023 03:42:18,080 --> 03:42:18,960 you end up with a new line. 5024 03:42:18,960 --> 03:42:20,440 And often you don't want the new line. 5025 03:42:20,440 --> 03:42:23,880 But thankfully, as we saw in the previous chapter, 5026 03:42:23,880 --> 03:42:28,340 there is a nice little function in Python for strings 5027 03:42:28,340 --> 03:42:31,040 called strip that allows you to throw away white space. 5028 03:42:33,080 --> 03:42:37,520 And to review, remember white space 5029 03:42:37,520 --> 03:42:39,160 is anything that doesn't print. 5030 03:42:39,160 --> 03:42:41,900 And this new line is a non-printing character. 5031 03:42:41,900 --> 03:42:43,440 So rstrip gets rid of it. 5032 03:42:43,440 --> 03:42:45,360 So it's a way to get rid of white space. 5033 03:42:45,360 --> 03:42:47,720 And rstrip does it from the right end. 5034 03:42:47,720 --> 03:42:50,320 So it's the right end of the string. 5035 03:42:51,880 --> 03:42:54,520 And so if we just are going to loop 5036 03:42:54,520 --> 03:42:55,800 through all the lines in the file, 5037 03:42:55,800 --> 03:42:57,520 we say line equals line dot rstrip. 5038 03:42:57,520 --> 03:43:00,040 And then this variable no longer has the new line 5039 03:43:00,040 --> 03:43:01,080 at the end of it. 5040 03:43:01,080 --> 03:43:02,120 We have our little if statement. 5041 03:43:02,120 --> 03:43:04,680 And if we print it, then this line, 5042 03:43:04,680 --> 03:43:05,880 the data has no thing. 
5043 03:43:05,880 --> 03:43:08,200 And then the data has no new line in it. 5044 03:43:08,200 --> 03:43:09,520 So the print only goes down one. 5045 03:43:09,520 --> 03:43:12,160 And so now we have single spaced output. 5046 03:43:12,160 --> 03:43:13,460 And so you're going to be doing that a lot. 5047 03:43:13,460 --> 03:43:15,960 It's really common to read through a file 5048 03:43:15,960 --> 03:43:18,360 and then just strip the new line 5049 03:43:18,360 --> 03:43:21,740 or any trailing space off the end of that. 5050 03:43:22,640 --> 03:43:25,520 Now, there's a couple of ways to do a loop like this. 5051 03:43:25,520 --> 03:43:28,160 And let's just think of this as 5052 03:43:29,200 --> 03:43:31,940 we're looking for a line, a file 5053 03:43:31,940 --> 03:43:33,640 with lots of different lines in it. 5054 03:43:33,640 --> 03:43:35,000 And we want to ignore all the lines 5055 03:43:35,000 --> 03:43:36,480 except some, say, good lines. 5056 03:43:36,480 --> 03:43:38,320 And we want to do something with those good lines 5057 03:43:38,320 --> 03:43:39,840 or the lines we're looking for. 5058 03:43:39,840 --> 03:43:40,800 Needle in a haystack. 5059 03:43:40,800 --> 03:43:44,040 This is like searching for a needle in a haystack. 5060 03:43:44,040 --> 03:43:45,640 So if you look at this code at high level, 5061 03:43:45,640 --> 03:43:47,080 we're going to loop through everything. 5062 03:43:47,080 --> 03:43:49,800 And then we're sort of picking which lines are. 5063 03:43:49,800 --> 03:43:52,460 And these are the good lines down here. 5064 03:43:52,460 --> 03:43:54,680 Now, often we have a bunch more code that we want to do. 5065 03:43:54,680 --> 03:43:55,840 And we're not just printing them, 5066 03:43:55,840 --> 03:43:57,160 but we're going to do a lot of code. 5067 03:43:57,160 --> 03:43:59,240 So sometimes you actually structure the loop 5068 03:43:59,240 --> 03:44:01,400 a little bit differently. 
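
The strip-then-test loop described above can be sketched like this. The exact prefix string is not visible in the transcript, so `"From:"` is an assumption, and the sample file is invented for the example.

```python
import os
import tempfile

# Hypothetical mailbox-style file with one matching line.
path = os.path.join(tempfile.mkdtemp(), "sample-mbox.txt")
with open(path, "w", encoding="utf-8") as out:
    out.write("Subject: one\n")
    out.write("From: a@example.com\n")
    out.write("To: b@example.com\n")

fhand = open(path, encoding="utf-8")
matches = []
for line in fhand:
    line = line.rstrip()            # drop the newline (and any trailing space)
    if line.startswith("From:"):    # assumed prefix; adjust to the one you need
        print(line)                 # print adds the only newline now
        matches.append(line)
fhand.close()
```

With the newline stripped, the output is single spaced because print supplies the only newline.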
5069 03:44:01,400 --> 03:44:02,520 And so the way to do it, 5070 03:44:02,520 --> 03:44:04,280 and this is going to do the exact same thing, 5071 03:44:04,280 --> 03:44:06,400 it's just a little different way 5072 03:44:06,400 --> 03:44:07,680 of thinking about this loop. 5073 03:44:07,680 --> 03:44:09,360 So the top part is the same. 5074 03:44:09,360 --> 03:44:10,400 We're stripping it. 5075 03:44:10,400 --> 03:44:12,840 And what we're doing here is everything's the same here 5076 03:44:12,840 --> 03:44:13,960 except we add this and not. 5077 03:44:13,960 --> 03:44:16,160 If the line does not start with from, 5078 03:44:16,160 --> 03:44:18,080 that's the translation of that. 5079 03:44:18,080 --> 03:44:21,320 If the line does not start with from, continue. 5080 03:44:21,320 --> 03:44:24,380 So basically we have a skipping pattern. 5081 03:44:24,380 --> 03:44:27,880 So the lines we're not interested in, we skip. 5082 03:44:27,880 --> 03:44:30,760 So we come down, we skip a lot of lines. 5083 03:44:30,760 --> 03:44:32,400 Choo, choo, choo, choo, choo. 5084 03:44:32,400 --> 03:44:34,040 And then we find a line that's good, 5085 03:44:34,040 --> 03:44:35,480 and then we fall through. 5086 03:44:35,480 --> 03:44:37,040 So this is the good code. 5087 03:44:37,040 --> 03:44:38,680 And then we have all the other good code 5088 03:44:38,680 --> 03:44:40,080 that we want to do to that line. 5089 03:44:40,080 --> 03:44:41,940 We have that showing up down here. 5090 03:44:43,320 --> 03:44:44,760 And so there's just two patterns 5091 03:44:44,760 --> 03:44:47,920 that are two ways to do the exact same thing. 5092 03:44:47,920 --> 03:44:50,040 So another way to select the lines 5093 03:44:50,040 --> 03:44:52,120 that we're interested in is to use the in operator. 5094 03:44:52,120 --> 03:44:54,280 So we talked before about the in operator 5095 03:44:54,280 --> 03:44:55,600 and how that works. 
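
The skipping pattern with `continue` might be sketched as below. As before, the `"From:"` prefix and the sample file are assumptions for illustration, not the lecture's exact slide.

```python
import os
import tempfile

# Hypothetical mailbox-style file with two matching lines.
path = os.path.join(tempfile.mkdtemp(), "sample-mbox.txt")
with open(path, "w", encoding="utf-8") as out:
    out.write("From: a@example.com\n")
    out.write("Subject: one\n")
    out.write("From: b@example.com\n")

fhand = open(path, encoding="utf-8")
kept = []
for line in fhand:
    line = line.rstrip()
    if not line.startswith("From:"):
        continue                    # skip the lines we are not interested in
    # Everything from here down only runs for the "good" lines.
    print(line)
    kept.append(line)
fhand.close()
```

This selects the same lines as testing `startswith` directly; it simply keeps the code for the good lines from being indented under an `if`.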
5096 03:44:55,600 --> 03:44:59,240 So we're basically gonna use the continue skipping method. 5097 03:44:59,240 --> 03:45:00,440 So we're gonna read all the lines, 5098 03:45:00,440 --> 03:45:01,360 these first few lines. 5099 03:45:01,360 --> 03:45:06,360 If uct.ac.za is not in the line, skip it. 5100 03:45:06,960 --> 03:45:09,280 And so this is gonna print out all the lines 5101 03:45:09,280 --> 03:45:14,000 that have the string uct.ac.za in them. 5102 03:45:14,000 --> 03:45:16,080 And so you see this is the output of the program, 5103 03:45:16,080 --> 03:45:17,240 dot, dot, dot, dot, dot. 5104 03:45:20,040 --> 03:45:22,080 Sometimes you'll have programs 5105 03:45:22,080 --> 03:45:24,400 that want to read different files. 5106 03:45:24,400 --> 03:45:26,280 Often I give assignments where I say, 5107 03:45:26,280 --> 03:45:28,280 show me how this program runs on the short file, 5108 03:45:28,280 --> 03:45:30,280 and then show me again how it runs on the long file, 5109 03:45:30,280 --> 03:45:31,320 just like this. 5110 03:45:31,320 --> 03:45:33,640 And so the way we do that to input the file name, 5111 03:45:33,640 --> 03:45:34,720 instead of making the file name 5112 03:45:34,720 --> 03:45:36,700 be a constant to the open call, 5113 03:45:36,700 --> 03:45:39,760 we make the file name be a input. 5114 03:45:39,760 --> 03:45:41,840 So we just run an input statement, 5115 03:45:41,840 --> 03:45:43,360 which gives us a prompt. 5116 03:45:43,360 --> 03:45:45,240 And then we type mbox.txt, 5117 03:45:45,240 --> 03:45:47,200 and then that shows up in this variable fname. 5118 03:45:47,200 --> 03:45:48,880 It's of course a string all the time. 5119 03:45:48,880 --> 03:45:51,160 And we pass that into open, and then we open it, 5120 03:45:51,160 --> 03:45:53,880 and then we do the count operation. 5121 03:45:53,880 --> 03:45:56,880 So if we enter mbox.txt, it counts 1797 5122 03:46:00,600 --> 03:46:02,000 subject lines in mbox. 
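
Both ideas above, selecting with the `in` operator and taking the file name from a variable, can be sketched together. The prompt text in the comment and the sample file are assumptions; here `fname` is set directly so the example runs without typing anything.

```python
import os
import tempfile

# Hypothetical four-line stand-in for the mailbox file.
path = os.path.join(tempfile.mkdtemp(), "sample-mbox.txt")
with open(path, "w", encoding="utf-8") as out:
    out.write("From someone@uct.ac.za Sat Jan  5 09:14:16 2008\n")
    out.write("Subject: first message\n")
    out.write("From other@example.com Sat Jan  5 09:20:00 2008\n")
    out.write("Subject: second message\n")

# In the lecture the name comes from the user, for example:
#   fname = input("Enter the file name: ")
fname = path

# Select lines with the in operator, skipping the rest with continue.
fhand = open(fname, encoding="utf-8")
selected = []
for line in fhand:
    line = line.rstrip()
    if "uct.ac.za" not in line:
        continue
    print(line)
    selected.append(line)
fhand.close()

# Count only the lines that match a prefix.
fhand = open(fname, encoding="utf-8")
count = 0
for line in fhand:
    if line.startswith("Subject:"):
        count = count + 1
fhand.close()
print("There were", count, "subject lines in", fname)
```

Because `open` just receives a string, the same program can be pointed at the short file or the long file without editing the code.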
5123 03:46:02,000 --> 03:46:03,240 And if we give it mbox-short, 5124 03:46:03,240 --> 03:46:05,540 it says there are 27 subject lines in mbox. 5125 03:46:05,540 --> 03:46:07,760 And again, this is another one of those ifs, 5126 03:46:07,760 --> 03:46:09,120 and it's just counting, 5127 03:46:09,120 --> 03:46:13,280 but only counting lines that match a particular pattern. 5128 03:46:16,400 --> 03:46:19,040 Okay, so now the user can also type bad file names, 5129 03:46:19,040 --> 03:46:21,200 and we need to be able to deal with that as well. 5130 03:46:21,200 --> 03:46:25,200 And so we're making a small change to the code. 5131 03:46:26,600 --> 03:46:29,160 The dangerous code is this line right here. 5132 03:46:29,160 --> 03:46:31,440 This line right here is gonna give a traceback 5133 03:46:31,440 --> 03:46:32,800 if that file doesn't exist. 5134 03:46:32,800 --> 03:46:34,200 So what do we do? 5135 03:46:34,200 --> 03:46:35,840 Well, we're gonna just expand that. 5136 03:46:35,840 --> 03:46:37,640 The rest of this program is exactly the same. 5137 03:46:37,640 --> 03:46:40,380 The only thing different is we've got this line. 5138 03:46:40,380 --> 03:46:42,840 We took out insurance on it, 5139 03:46:42,840 --> 03:46:44,660 and we know that it might blow up, 5140 03:46:44,660 --> 03:46:48,160 and so we have it in a try and except block. 5141 03:46:50,160 --> 03:46:52,500 So here's how the code runs. 5142 03:46:54,160 --> 03:46:56,080 So, you know, the input runs. 5143 03:46:56,080 --> 03:46:57,400 We type in a good file name. 5144 03:46:57,400 --> 03:46:58,880 It comes in here. 5145 03:46:58,880 --> 03:47:01,200 This works, and so it skips the except, 5146 03:47:01,200 --> 03:47:03,260 so it runs the code and prints out the count. 5147 03:47:03,260 --> 03:47:05,020 So that's the good pattern. 5148 03:47:05,020 --> 03:47:06,940 The bad pattern is here, 5149 03:47:08,300 --> 03:47:09,840 we type in a bad file name. 
5150 03:47:09,840 --> 03:47:11,440 It comes in the try/except. 5151 03:47:11,440 --> 03:47:14,120 This file name is na na boo boo, 5152 03:47:14,120 --> 03:47:17,160 and it's gonna blow up, so this line blows up. 5153 03:47:17,160 --> 03:47:19,040 So it jumps down into the except code, 5154 03:47:19,040 --> 03:47:21,120 prints out, file cannot be opened. 5155 03:47:21,120 --> 03:47:22,360 So it prints this out. 5156 03:47:22,360 --> 03:47:24,300 Now this quit is really important, 5157 03:47:24,300 --> 03:47:25,860 because if we don't put this quit in here, 5158 03:47:25,860 --> 03:47:27,340 it's gonna continue down here, 5159 03:47:27,340 --> 03:47:28,340 and that's gonna blow up here, 5160 03:47:28,340 --> 03:47:31,880 because file handle is not defined properly at this point. 5161 03:47:31,880 --> 03:47:33,840 And so what we have is, 5162 03:47:33,840 --> 03:47:36,760 we have this quit is a special function 5163 03:47:36,760 --> 03:47:39,480 where it comes in and never returns. 5164 03:47:39,480 --> 03:47:43,040 So this is a way to terminate the entire Python program 5165 03:47:43,040 --> 03:47:45,440 silently with no traceback, right? 5166 03:47:45,440 --> 03:47:47,600 So we put in our own error message, 5167 03:47:47,600 --> 03:47:48,960 so we look like we're professionals, 5168 03:47:48,960 --> 03:47:51,080 say we could not open this file, 5169 03:47:51,080 --> 03:47:52,600 and then we stop. 5170 03:47:52,600 --> 03:47:54,200 If you don't, it's gonna come down here, 5171 03:47:54,200 --> 03:47:56,120 and it's gonna traceback, 5172 03:47:56,120 --> 03:47:58,160 traceback right there, it's gonna blow up. 5173 03:47:58,160 --> 03:48:03,160 So the quit is useful when you want to stop executing, 5174 03:48:03,260 --> 03:48:06,060 because you've detected some kind of an error. 5175 03:48:07,220 --> 03:48:09,320 So that's a quick zoom through opening 5176 03:48:09,320 --> 03:48:11,860 and reading through files and doing some patterns. 
5177 03:48:13,880 --> 03:48:15,840 Most of the rest of the programs in this course 5178 03:48:15,840 --> 03:48:20,840 are going to say open, for, rstrip, 5179 03:48:21,000 --> 03:48:23,880 look for something, and then do something interesting. 5180 03:48:23,880 --> 03:48:25,920 That's going to be our loop that we're gonna do 5181 03:48:25,920 --> 03:48:28,420 over and over and over again. 5182 03:48:28,420 --> 03:48:32,800 And now we see how this looping and if and iteration 5183 03:48:32,800 --> 03:48:36,600 and variables are starting to come together, 5184 03:48:36,600 --> 03:48:38,560 and you can actually sort of do a program 5185 03:48:38,560 --> 03:48:40,540 that does something useful. 5186 03:48:40,540 --> 03:48:43,480 But before we get to too many more programs, 5187 03:48:43,480 --> 03:48:45,500 we gotta switch a little bit, switch gears 5188 03:48:45,500 --> 03:48:48,000 and talk up next about data structures, 5189 03:48:48,000 --> 03:48:50,080 and that is the shape of data, 5190 03:48:50,080 --> 03:48:54,160 and how we can use more intricate and complex variables 5191 03:48:54,160 --> 03:48:55,720 to help solve our problems. 5192 03:48:59,240 --> 03:49:01,240 Hello and welcome to chapter eight. 5193 03:49:01,240 --> 03:49:03,840 We're gonna talk about lists in this chapter. 5194 03:49:03,840 --> 03:49:07,000 Up to now, we've been talking about algorithms. 5195 03:49:07,000 --> 03:49:09,640 Algorithms are the concept in computer science 5196 03:49:09,640 --> 03:49:13,080 of using the programming language to express the steps 5197 03:49:13,080 --> 03:49:14,320 that you want the computer to go through 5198 03:49:14,320 --> 03:49:16,080 to solve the problem. 
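
A sketch of the guarded open described above, under a few stated assumptions: the logic is wrapped in a hypothetical function `count_subject_lines` so it can be called with different names, it uses `sys.exit()` where the lecture calls `quit()` (both stop the program without a traceback), and it catches `OSError` specifically, which may be narrower than the lecture's except clause.

```python
import os
import sys
import tempfile


def count_subject_lines(fname):
    """Count lines starting with 'Subject:'; stop the program if open fails."""
    try:
        fhand = open(fname, encoding="utf-8")   # the line that might blow up
    except OSError:
        print("File cannot be opened:", fname)
        sys.exit()          # never returns, so the loop below is not reached
    count = 0
    for line in fhand:
        if line.startswith("Subject:"):
            count = count + 1
    fhand.close()
    return count


# Hypothetical sample file for the good path.
path = os.path.join(tempfile.mkdtemp(), "sample-mbox.txt")
with open(path, "w", encoding="utf-8") as out:
    out.write("Subject: one\nFrom: a@example.com\nSubject: two\n")

good = count_subject_lines(path)
print("There were", good, "subject lines in", path)
```

Calling `count_subject_lines("na na boo boo")` would print the error message and end the program, because without the exit the loop would run with `fhand` undefined.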
5199 03:49:16,080 --> 03:49:19,040 Read some data, convert it to a floating point number, 5200 03:49:19,040 --> 03:49:20,520 check to see if it's greater than 40, 5201 03:49:20,520 --> 03:49:21,820 do one thing if it's greater than 40, 5202 03:49:21,820 --> 03:49:23,120 do another thing if it's not, 5203 03:49:23,120 --> 03:49:24,440 then print out the result. 5204 03:49:24,440 --> 03:49:28,080 Or open a file, read everything. 5205 03:49:28,080 --> 03:49:30,180 If the first line starts with something, do something. 5206 03:49:30,180 --> 03:49:33,320 If not, skip it and then add all the things up. 5207 03:49:33,320 --> 03:49:35,940 Those are steps, those are a series of steps, 5208 03:49:35,940 --> 03:49:37,900 and hopefully by now you're getting to the point 5209 03:49:37,900 --> 03:49:39,960 where you have a good understanding of steps. 5210 03:49:39,960 --> 03:49:43,000 But there's a whole other side of computer programming 5211 03:49:43,000 --> 03:49:44,200 and we call it data structures. 5212 03:49:44,200 --> 03:49:46,520 And data structures is not the steps, 5213 03:49:46,520 --> 03:49:50,520 but instead clever ways that you lay out the data 5214 03:49:50,520 --> 03:49:52,320 and clever ways that you make sure 5215 03:49:52,320 --> 03:49:54,920 that the data does what you want it to do. 5216 03:49:54,920 --> 03:49:57,320 And so that's what we're gonna start talking about now. 5217 03:49:57,320 --> 03:50:00,320 Lists are the first and most simplest data structure. 5218 03:50:00,320 --> 03:50:02,320 Strings are kind of like data structures, 5219 03:50:02,320 --> 03:50:05,440 but lists are probably our first real data structure 5220 03:50:05,440 --> 03:50:06,960 that we're gonna think about and design 5221 03:50:06,960 --> 03:50:08,840 and make use of effectively. 5222 03:50:08,840 --> 03:50:11,860 But before we talk about what is a collection, 5223 03:50:11,860 --> 03:50:13,720 we should talk about what is not a collection. 
5224 03:50:13,720 --> 03:50:15,360 So we're familiar with what a variable is. 5225 03:50:15,360 --> 03:50:17,500 We know that a variable is a little piece of memory 5226 03:50:17,500 --> 03:50:19,160 that's got a label on it. 5227 03:50:19,160 --> 03:50:21,080 And then an assignment statement, you know, 5228 03:50:21,080 --> 03:50:23,720 sticks a two into x and then x is, 5229 03:50:23,720 --> 03:50:25,840 and then two is in this little cupboard. 5230 03:50:25,840 --> 03:50:27,560 And then it goes to the next line 5231 03:50:27,560 --> 03:50:29,480 and then four goes into x 5232 03:50:29,480 --> 03:50:31,800 and so the two goes away and the four is there. 5233 03:50:31,800 --> 03:50:35,060 A key thing is you can't have more than one variable 5234 03:50:35,060 --> 03:50:36,920 at any given moment, right? 5235 03:50:36,920 --> 03:50:39,320 And more than one value in a variable. 5236 03:50:39,320 --> 03:50:41,140 So when we move to collections, 5237 03:50:41,140 --> 03:50:43,300 collections are more like suitcases. 5238 03:50:43,300 --> 03:50:44,880 We can put lots of things in them. 5239 03:50:44,880 --> 03:50:46,880 We have ways of organizing them. 5240 03:50:46,880 --> 03:50:49,360 And as we go through lists and dictionaries and tuples, 5241 03:50:49,360 --> 03:50:51,840 we'll see how there are different ways to organize them. 5242 03:50:51,840 --> 03:50:52,680 And as a matter of fact, 5243 03:50:52,680 --> 03:50:55,400 we've been talking about lists for a while. 5244 03:50:55,400 --> 03:50:58,320 Every time we use one of these square bracket syntaxes 5245 03:50:58,320 --> 03:51:00,760 in earlier programs, we've been working with lists. 5246 03:51:00,760 --> 03:51:03,720 And so this is technically a three item list 5247 03:51:03,720 --> 03:51:05,800 with three strings, got commas here, 5248 03:51:05,800 --> 03:51:08,120 Joseph is one string, Glen and Sally are another string. 5249 03:51:08,120 --> 03:51:10,880 And here's another one that is another thing. 
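The difference between a plain variable and a collection can be sketched in a few lines. This is only an illustration; the exact strings in `friends` are assumed from the names mentioned in the lecture.

```python
# A plain variable holds one value at a time.
x = 2
x = 4            # the 2 goes away; x now holds only the 4
print(x)

# A list is a collection: one variable with several values in it.
# The three names are assumed from the lecture's example.
friends = ['Joseph', 'Glenn', 'Sally']
print(friends)
print(len(friends))   # number of items in the list
```

The square brackets with commas between the values are what make `friends` a list rather than a single value.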
5250 03:51:10,880 --> 03:51:14,000 And the list is basically, it's a list constant 5251 03:51:14,000 --> 03:51:15,440 and it's being assigned into a variable. 5252 03:51:15,440 --> 03:51:18,200 So this friends variable has three things in it. 5253 03:51:18,200 --> 03:51:21,780 So that's different than what we've been talking about before. 5254 03:51:22,640 --> 03:51:24,920 So these brackets and bracket structures 5255 03:51:24,920 --> 03:51:27,560 with square brackets are those lists. 5256 03:51:27,560 --> 03:51:29,120 And so the print is just a print 5257 03:51:29,120 --> 03:51:31,040 with parentheses to get the print to work. 5258 03:51:31,040 --> 03:51:35,520 But 1, 24, 76 is a three item integer list. 5259 03:51:35,520 --> 03:51:38,940 Red, yellow and blue is a three item string list. 5260 03:51:38,940 --> 03:51:41,160 But it doesn't all have to be integers or strings. 5261 03:51:41,160 --> 03:51:43,860 Python can handle different things 5262 03:51:43,860 --> 03:51:45,080 and different kinds of data 5263 03:51:45,080 --> 03:51:46,360 in different positions in the list. 5264 03:51:46,360 --> 03:51:50,760 So red, 24, 98.6 is a three item list with a string, 5265 03:51:50,760 --> 03:51:53,440 an integer and a floating point number. 5266 03:51:53,440 --> 03:51:56,280 And while we're not gonna use this too much for now, 5267 03:51:56,280 --> 03:51:59,000 this outer list is a three item list 5268 03:51:59,000 --> 03:52:02,000 and the second item is another list. 5269 03:52:02,000 --> 03:52:04,480 So this is kind of alluding toward what we'll do 5270 03:52:04,480 --> 03:52:06,640 when we start talking about data structures. 5271 03:52:06,640 --> 03:52:08,200 And that is we have a structure 5272 03:52:08,200 --> 03:52:09,560 and then we have another structure inside of it. 5273 03:52:09,560 --> 03:52:11,640 And sometimes this can get quite complex. 5274 03:52:11,640 --> 03:52:13,520 And we're doing this for a reason.
5275 03:52:13,520 --> 03:52:16,000 This here has no reason other than to show you 5276 03:52:16,000 --> 03:52:18,880 that it's possible that lists can be made up 5277 03:52:18,880 --> 03:52:21,360 of lots of things, including other lists. 5278 03:52:21,360 --> 03:52:25,080 And of course, there is also the notion of the empty list. 5279 03:52:25,080 --> 03:52:28,040 And like I said, I have had to be able to tell you 5280 03:52:28,040 --> 03:52:29,760 about lists all along. 5281 03:52:29,760 --> 03:52:31,480 We use them in for loops. 5282 03:52:31,480 --> 03:52:32,760 We can put lots of things here. 5283 03:52:32,760 --> 03:52:34,180 We can put a file handle here. 5284 03:52:34,180 --> 03:52:35,560 We can go through the file. 5285 03:52:35,560 --> 03:52:36,460 We can put a string there. 5286 03:52:36,460 --> 03:52:38,120 We can go through the characters in the string 5287 03:52:38,120 --> 03:52:38,960 and then the list. 5288 03:52:38,960 --> 03:52:40,520 And the iteration variable then goes through 5289 03:52:40,520 --> 03:52:42,680 the successive elements of the list. 5290 03:52:42,680 --> 03:52:45,480 And that's why this prints out five, four, three, two, one. 5291 03:52:45,480 --> 03:52:48,040 And then the loop is done and it prints out blast off. 5292 03:52:48,040 --> 03:52:50,440 So we've been using them and we've been actually iterating 5293 03:52:50,440 --> 03:52:53,640 through lists with for statements all along. 5294 03:52:54,660 --> 03:52:59,660 So the for statement has been something we use with lists. 5295 03:53:00,620 --> 03:53:03,020 And when you just need to go iterate through the list 5296 03:53:03,020 --> 03:53:05,240 and go through every item in order, 5297 03:53:05,240 --> 03:53:07,240 the for is a great way to do that. 5298 03:53:07,240 --> 03:53:09,340 So friend is our iteration variable. 5299 03:53:09,340 --> 03:53:11,400 Friends is our list variable.
5300 03:53:11,400 --> 03:53:14,040 And so that says friend is gonna successively take 5301 03:53:14,040 --> 03:53:17,600 on the value Joseph, Glenn, and Sally and print out, 5302 03:53:17,600 --> 03:53:20,040 you know, Happy New Year, Joseph, Glenn, and Sally. 5303 03:53:20,040 --> 03:53:22,400 It runs three times, once for each of the values, 5304 03:53:22,400 --> 03:53:24,600 and the iteration variable advances. 5305 03:53:24,600 --> 03:53:28,000 Now, I do wanna make it really clear 5306 03:53:28,000 --> 03:53:32,840 that the choice of friends and friend, 5307 03:53:32,840 --> 03:53:35,440 singular and plural, is arbitrary and capricious. 5308 03:53:35,440 --> 03:53:38,760 It happens to be convenient and intuitive 5309 03:53:38,760 --> 03:53:41,240 that the iteration variable is one 5310 03:53:41,240 --> 03:53:43,520 and the list variable is more than one. 5311 03:53:43,520 --> 03:53:46,160 But Python has no idea about singular and plural. 5312 03:53:46,160 --> 03:53:47,960 Matter of fact, Python wouldn't care. 5313 03:53:47,960 --> 03:53:50,280 It would be totally equivalent for Python 5314 03:53:50,280 --> 03:53:52,800 to do the same thing, to have the list variable be z 5315 03:53:52,800 --> 03:53:54,860 and the iteration variable be x. 5316 03:53:54,860 --> 03:53:58,080 X will take on the successive values of these three things. 5317 03:53:58,080 --> 03:54:01,200 Now, am I being nice to you by calling this list friends 5318 03:54:01,200 --> 03:54:03,320 and this iteration variable friend? 5319 03:54:03,320 --> 03:54:05,480 I am, but I also don't want it to confuse you 5320 03:54:05,480 --> 03:54:07,320 if you're just a beginning developer. 5321 03:54:08,880 --> 03:54:11,780 So just like strings, we can sort of look within lists. 5322 03:54:11,780 --> 03:54:14,200 Part of the thing is when you put more than one thing 5323 03:54:14,200 --> 03:54:17,160 in a data structure, you need to get them out.
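The point about names can be seen by writing the loop both ways. This is a small sketch; the greeting text and the exact spelling of the names are assumed from what is read out in the lecture.

```python
# Readable names: a plural list variable and a singular iteration variable.
friends = ['Joseph', 'Glenn', 'Sally']
for friend in friends:
    print('Happy New Year:', friend)

# Exactly the same loop as far as Python is concerned,
# just with names that tell a human reader nothing.
z = ['Joseph', 'Glenn', 'Sally']
for x in z:
    print('Happy New Year:', x)

print('Done!')
```

Both loops run three times and print the same three lines; only the readability differs.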
5324 03:54:17,160 --> 03:54:21,200 And so lists have positions, they maintain order, 5325 03:54:21,200 --> 03:54:22,720 and so the first thing in the list 5326 03:54:22,720 --> 03:54:25,520 is the sub-zero position, sub-one, sub-two. 5327 03:54:25,520 --> 03:54:27,460 Just like strings, they're zero-based. 5328 03:54:27,460 --> 03:54:30,280 Just like European elevators, they're zero-based. 5329 03:54:30,280 --> 03:54:33,360 So if we take a look and we say, oh, friends sub-one, 5330 03:54:33,360 --> 03:54:35,760 that's how I read that, the little square brackets, 5331 03:54:35,760 --> 03:54:38,240 when you take a variable here and you say friends sub-one. 5332 03:54:38,240 --> 03:54:40,700 Remember, singular and plural don't matter. 5333 03:54:40,700 --> 03:54:43,840 Friends sub-one means glen, because this is the zero 5334 03:54:43,840 --> 03:54:46,440 and that's the one, and then Sally's the sub-two, 5335 03:54:46,440 --> 03:54:51,040 and so that's what prints glen out in this particular thing. 5336 03:54:51,040 --> 03:54:52,840 Now, lists are mutable. 5337 03:54:52,840 --> 03:54:54,480 Mutable is another word for changeable. 5338 03:54:54,480 --> 03:54:57,440 They can be changed, meaning that a list has three things. 5339 03:54:57,440 --> 03:55:00,040 You can change this thing right in the middle if you want. 5340 03:55:00,040 --> 03:55:01,840 To take a look at what's not mutable, 5341 03:55:01,840 --> 03:55:03,040 strings are not mutable. 5342 03:55:03,040 --> 03:55:06,140 So if I take a look at assigning banana into fruit, 5343 03:55:06,140 --> 03:55:08,640 well, fruit sub-zero is a capital letter B. 5344 03:55:08,640 --> 03:55:10,240 Could we imagine for the moment 5345 03:55:10,240 --> 03:55:14,400 that we could change fruit sub-zero to lowercase b? 
5346 03:55:14,400 --> 03:55:16,320 Well, the syntax would be how you would do it 5347 03:55:16,320 --> 03:55:17,760 if you could do it, but it turns out 5348 03:55:17,760 --> 03:55:20,960 that strings are not mutable, 5349 03:55:20,960 --> 03:55:23,320 meaning they're not changeable once you create them. 5350 03:55:23,320 --> 03:55:25,720 And that's why when we do things like lowercase 5351 03:55:25,720 --> 03:55:28,760 or uppercase, we take a look at the fruit 5352 03:55:28,760 --> 03:55:30,680 and we say, give me a lowercase copy of that, 5353 03:55:30,680 --> 03:55:32,600 and then we take the return value from this 5354 03:55:32,600 --> 03:55:33,960 and we store that in x, 5355 03:55:33,960 --> 03:55:36,680 and that's how x becomes a lowercase banana. 5356 03:55:36,680 --> 03:55:39,320 But fruit is still the original one. 5357 03:55:39,320 --> 03:55:41,400 So fruit has not changed. 5358 03:55:41,400 --> 03:55:45,760 Compare and contrast that with a list, though. 5359 03:55:45,760 --> 03:55:49,600 Here we have a five-item list, two, 14, 26, 41. 5360 03:55:49,600 --> 03:55:52,640 And we're gonna do the sub-two position. 5361 03:55:52,640 --> 03:55:55,000 And the sub-two is zero, one, two. 5362 03:55:55,000 --> 03:55:56,280 So that's that one right there. 5363 03:55:56,280 --> 03:55:58,880 And we're going to assign a 28 into it. 5364 03:55:58,880 --> 03:56:00,680 So that 28 is going in here. 5365 03:56:00,680 --> 03:56:02,820 Gonna wipe that out and put 28 in. 5366 03:56:02,820 --> 03:56:05,760 So we can do item assignment in lists 5367 03:56:05,760 --> 03:56:09,280 by putting a bracket syntax on the left-hand side 5368 03:56:09,280 --> 03:56:11,080 to say, don't just put it in a variable, 5369 03:56:11,080 --> 03:56:13,040 put it in this position within the variable. 5370 03:56:13,040 --> 03:56:14,600 So that's what that's doing. 5371 03:56:14,600 --> 03:56:16,080 And when you print that out, the 28, 5372 03:56:16,080 --> 03:56:17,600 everything else is unchanged. 
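Here is a rough sketch of indexing and mutability as just described. The variable name `lotto` and the fifth value, 63, are assumptions for illustration, since only four of the five values are read out.

```python
friends = ['Joseph', 'Glenn', 'Sally']
print(friends[1])          # positions start at 0, so position 1 is Glenn

fruit = 'Banana'
try:
    fruit[0] = 'b'         # strings are not mutable, so Python raises TypeError
except TypeError as err:
    print('Cannot change a string in place:', err)

x = fruit.lower()          # lower() hands back a new lowercase copy
print(x)                   # the lowercase copy
print(fruit)               # the original string, unchanged

# Lists are mutable: item assignment replaces one position in place.
lotto = [2, 14, 26, 41, 63]   # name and last value are assumed
lotto[2] = 28                 # position 2 is the third item
print(lotto)
```

Only the item at position 2 changes; every other position in the list stays as it was.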
5373 03:56:17,600 --> 03:56:18,840 Meaning the whole list is there. 5374 03:56:18,840 --> 03:56:20,760 There could be 1,000 items in the list. 5375 03:56:20,760 --> 03:56:22,900 And then you're changing the second one. 5376 03:56:25,320 --> 03:56:26,760 We have a function called len. 5377 03:56:26,760 --> 03:56:29,500 We've been using this len function all along 5378 03:56:29,500 --> 03:56:31,360 to take a look at how long strings are. 5379 03:56:31,360 --> 03:56:32,900 It counts the number of characters in the string. 5380 03:56:32,900 --> 03:56:34,800 So that's a nine-character string. 5381 03:56:34,800 --> 03:56:36,880 If we have items in a list, 5382 03:56:36,880 --> 03:56:38,800 len tells us how many items there are. 5383 03:56:38,800 --> 03:56:40,500 It's not like how many characters there are. 5384 03:56:40,500 --> 03:56:42,400 It's the number of things. 5385 03:56:42,400 --> 03:56:44,480 And each thing doesn't have to be a number. 5386 03:56:44,480 --> 03:56:46,600 It could be a number, a string, or even another list. 5387 03:56:46,600 --> 03:56:47,840 And len is the way to say, 5388 03:56:47,840 --> 03:56:49,540 hey, how many things are in there? 5389 03:56:52,640 --> 03:56:55,600 There's a function that returns a list of numbers. 5390 03:56:55,600 --> 03:56:57,680 And we use it, as we'll see in a second, 5391 03:56:57,680 --> 03:56:59,920 to construct specialized loops to go through lists. 5392 03:56:59,920 --> 03:57:01,720 So let's take a look at this range function 5393 03:57:01,720 --> 03:57:03,000 just for a minute. 5394 03:57:03,000 --> 03:57:04,720 So range takes as its parameter 5395 03:57:04,720 --> 03:57:07,560 the number of numbers that you want returned. 5396 03:57:07,560 --> 03:57:10,440 So I'd like a four-item list 5397 03:57:10,440 --> 03:57:13,840 with the numbers zero, up to, but not including four. 
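A quick sketch of `len` and `range` follows. One thing to be aware of: in Python 3, `range` gives back a range object rather than an actual list, so the sketch wraps it in `list()` to show the numbers. The sample string and the mixed list are assumed examples.

```python
greet = 'Hello Bob'            # assumed nine-character string
print(len(greet))              # counts characters: 9

x = [1, 2, 'joe', 99]          # assumed example list
print(len(x))                  # counts items, not characters: 4

# range(4) produces 0 up to, but not including, 4.
print(list(range(4)))          # [0, 1, 2, 3]

friends = ['Joseph', 'Glenn', 'Sally']
print(list(range(len(friends))))   # [0, 1, 2], the valid positions in friends
```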
5398 03:57:13,840 --> 03:57:16,680 And so it just turns out that that is really useful 5399 03:57:16,680 --> 03:57:19,560 for constructing for loops 5400 03:57:19,560 --> 03:57:20,720 that are counted for loops 5401 03:57:20,720 --> 03:57:22,880 that go to zero, to the one, to the two, 5402 03:57:22,880 --> 03:57:25,340 as compared to the definite loops 5403 03:57:25,340 --> 03:57:28,700 that go through each one. 5404 03:57:28,700 --> 03:57:30,440 And so it's a common thing to say, 5405 03:57:30,440 --> 03:57:32,680 okay, we know how many things are in this list. 5406 03:57:32,680 --> 03:57:34,080 There are three friends. 5407 03:57:34,080 --> 03:57:37,040 And if I combine range and len, 5408 03:57:37,040 --> 03:57:39,200 so I take len of friends, which is three, 5409 03:57:39,200 --> 03:57:41,280 and then I take range of three, 5410 03:57:41,280 --> 03:57:42,960 I get zero, one, and two. 5411 03:57:42,960 --> 03:57:44,160 And so the interesting thing is 5412 03:57:44,160 --> 03:57:46,360 this zero corresponds to the first one, 5413 03:57:46,360 --> 03:57:47,960 one corresponds to the second one, 5414 03:57:47,960 --> 03:57:51,260 and two corresponds to the third one, okay? 5415 03:57:51,260 --> 03:57:55,320 And so we'll use this to construct loops, 5416 03:57:55,320 --> 03:57:58,320 especially when we need to go through an array 5417 03:58:00,920 --> 03:58:02,960 and remember what position we're at. 5418 03:58:02,960 --> 03:58:05,800 And so here's just an example of two different loops. 5419 03:58:05,800 --> 03:58:09,560 This is a for loop that's just gonna go through 5420 03:58:09,560 --> 03:58:10,680 whatever's in this list. 5421 03:58:10,680 --> 03:58:13,660 So friend is just gonna take on the successive values, 5422 03:58:13,660 --> 03:58:15,840 and so it's gonna print out these three things 5423 03:58:15,840 --> 03:58:17,480 just as you would expect.
5424 03:58:17,480 --> 03:58:18,920 And if you don't need to, 5425 03:58:18,920 --> 03:58:19,800 while you're going through the loop, 5426 03:58:19,800 --> 03:58:21,600 know the position, your relative position 5427 03:58:21,600 --> 03:58:24,720 from the top in the loop, that's okay. 5428 03:58:24,720 --> 03:58:27,520 But sometimes you want a little more sophisticated loop. 5429 03:58:27,520 --> 03:58:30,720 And instead, you want to be able to 5430 03:58:31,920 --> 03:58:34,520 loop through where you know the position. 5431 03:58:34,520 --> 03:58:35,880 And so what we do instead is, 5432 03:58:35,880 --> 03:58:38,040 instead of looping through that list itself, 5433 03:58:38,040 --> 03:58:42,800 we do range of len of friends, which gives us zero, one, two. 5434 03:58:42,800 --> 03:58:46,360 And then i takes on the successive values zero, one, 5435 03:58:46,360 --> 03:58:47,440 and then two. 5436 03:58:47,440 --> 03:58:49,160 So this loop is gonna run three times, 5437 03:58:49,160 --> 03:58:50,720 and i is zero the first time. 5438 03:58:50,720 --> 03:58:53,640 And we might even just look up the value inside 5439 03:58:53,640 --> 03:58:57,480 that sub-zero value so we get Joseph the first time. 5440 03:58:57,480 --> 03:58:59,200 So it prints out Happy New Year Joseph, 5441 03:58:59,200 --> 03:59:01,940 goes and i becomes one now, 5442 03:59:01,940 --> 03:59:03,960 and so it gives us Glenn, and that prints out. 5443 03:59:03,960 --> 03:59:04,840 And away you go. 5444 03:59:04,840 --> 03:59:06,600 So if you look at these two loops, 5445 03:59:07,920 --> 03:59:09,040 if you look at these two loops, 5446 03:59:09,040 --> 03:59:11,080 they really do the exact same thing. 5447 03:59:11,080 --> 03:59:12,320 The only difference is this, 5448 03:59:12,320 --> 03:59:14,140 we allowed the for to find its way 5449 03:59:14,140 --> 03:59:15,680 with the iteration variable through.
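The two loops being compared might look roughly like this. The greeting text is assumed; the second loop uses `i` as the position variable, as in the lecture.

```python
friends = ['Joseph', 'Glenn', 'Sally']

# Loop 1: let the for statement walk through the items themselves.
for friend in friends:
    print('Happy New Year:', friend)

# Loop 2: walk through the positions 0, 1, 2 and look each item up.
for i in range(len(friends)):
    friend = friends[i]
    print('Happy New Year:', friend)
```

Both loops print the same three lines. The second form is mainly useful when the body needs to know the position `i` as well as the value.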
5450 03:59:15,680 --> 03:59:18,540 And here we created our own i variable 5451 03:59:18,540 --> 03:59:20,600 that went through the positions. 5452 03:59:20,600 --> 03:59:22,320 And they're dense, there's no gaps in here, 5453 03:59:22,320 --> 03:59:26,280 so it's zero through two that it goes through. 5454 03:59:26,280 --> 03:59:28,600 So these two are equivalent. 5455 03:59:28,600 --> 03:59:30,880 There'll be times when you'll want to use one or the other. 5456 03:59:30,880 --> 03:59:33,200 I tend to prefer the first one 5457 03:59:33,200 --> 03:59:37,420 because it's prettier as long as it works for me. 5458 03:59:38,320 --> 03:59:39,960 So that gets us started with loops. 5459 03:59:39,960 --> 03:59:41,400 We'll be back in just a bit. 5460 03:59:44,720 --> 03:59:46,800 Okay, so we've taken a look at loops, 5461 03:59:46,800 --> 03:59:49,560 and now we're gonna just take a little bit of a look 5462 03:59:49,560 --> 03:59:52,800 at some of the operations that you can do with lists. 5463 03:59:52,800 --> 03:59:54,740 Python has this, as we'll soon learn, 5464 03:59:54,740 --> 03:59:57,320 object-oriented approach to its operators. 5465 03:59:57,320 --> 04:00:00,720 And the plus can add strings, and it can add numbers. 5466 04:00:00,720 --> 04:00:03,080 Floating point numbers, integer numbers, strings. 5467 04:00:03,080 --> 04:00:03,960 Et cetera. 5468 04:00:03,960 --> 04:00:08,360 And so the plus similarly works this way with lists. 5469 04:00:08,360 --> 04:00:10,800 The plus looks to its left and looks to its right 5470 04:00:10,800 --> 04:00:12,560 and says, what am I adding? 5471 04:00:12,560 --> 04:00:14,600 And in the case that I'm adding the list one, two, three, 5472 04:00:14,600 --> 04:00:17,600 and the list four, five, six, it concatenates them together. 5473 04:00:17,600 --> 04:00:19,880 And this way it sort of functions like a string, 5474 04:00:19,880 --> 04:00:21,760 and so we get one, two, three, four, five, six.
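As a small sketch of list concatenation with the plus operator, using the values read out in the lecture (the variable names `a`, `b`, and `c` follow the way they are referred to there):

```python
a = [1, 2, 3]
b = [4, 5, 6]

# The + operator sees a list on each side and concatenates them
# into a brand-new list.
c = a + b
print(c)

# The originals are untouched.
print(a)
print(b)
```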
5475 04:00:21,760 --> 04:00:25,260 It just concatenates this list to another list. 5476 04:00:25,260 --> 04:00:26,840 And it doesn't change A or B 5477 04:00:26,840 --> 04:00:29,560 just like in any kind of assignment statement. 5478 04:00:29,560 --> 04:00:32,560 Calculations on the right side don't change the variables 5479 04:00:32,560 --> 04:00:33,720 and then produce a new value 5480 04:00:33,720 --> 04:00:35,940 and then assign that into C. 5481 04:00:37,440 --> 04:00:41,720 You can also use list slicing, and it's easy to remember. 5482 04:00:41,720 --> 04:00:43,080 If you remember how strings work, 5483 04:00:43,080 --> 04:00:45,400 lists work exactly the same way. 5484 04:00:45,400 --> 04:00:48,080 So of course it's a little tricky. 5485 04:00:48,080 --> 04:00:49,740 The first number's the starting position. 5486 04:00:49,740 --> 04:00:50,840 They start at zero. 5487 04:00:50,840 --> 04:00:52,880 So one is right there. 5488 04:00:52,880 --> 04:00:54,900 So it's the zero position, the one position. 5489 04:00:54,900 --> 04:00:56,880 Start at one, right? 5490 04:00:56,880 --> 04:00:58,920 But go up to, but not including three. 5491 04:00:58,920 --> 04:01:01,980 There's one, two, three. 5492 04:01:01,980 --> 04:01:04,480 So this goes up to, but not including three. 5493 04:01:04,480 --> 04:01:07,120 And that's why we get 41, 12 out of that. 5494 04:01:07,120 --> 04:01:08,720 So up to, but not including. 5495 04:01:08,720 --> 04:01:11,560 I'll just say that over and over and over again. 5496 04:01:11,560 --> 04:01:14,700 If we do, you can leave the first part out. 5497 04:01:14,700 --> 04:01:16,240 You can leave the first part out here, 5498 04:01:16,240 --> 04:01:18,760 and you can say, oh, up to, but not including four. 5499 04:01:18,760 --> 04:01:19,980 So that starts at the beginning, 5500 04:01:19,980 --> 04:01:22,240 goes up to, but not including four. 5501 04:01:22,240 --> 04:01:24,920 And so that's how we get that piece right there.
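Slicing is easier to follow with concrete numbers. The list below is an assumed example, chosen so that positions 1 and 2 hold 41 and 12 as described; the last two lines show the other two forms of the same notation.

```python
t = [9, 41, 12, 3, 74, 15]   # assumed example list

print(t[1:3])   # start at position 1, up to but not including 3: [41, 12]
print(t[:4])    # from the beginning, up to but not including 4
print(t[3:])    # from position 3 through the end
print(t[:])     # the whole list, as a copy
```

In every case the second number is "up to, but not including", which is the same rule string slicing follows.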
5502 04:01:24,920 --> 04:01:29,640 We can say start at the position three, 5503 04:01:29,640 --> 04:01:32,680 zero, one, two, three, start at position three, 5504 04:01:32,680 --> 04:01:34,360 and go to the end. 5505 04:01:34,360 --> 04:01:35,840 Now the fact that the number three is in here 5506 04:01:35,840 --> 04:01:37,860 is sort of irrelevant. 5507 04:01:37,860 --> 04:01:41,040 Three to the end is those three numbers. 5508 04:01:41,040 --> 04:01:43,420 And then you can do the whole list with slicing as well. 5509 04:01:43,420 --> 04:01:45,460 Again, these pretty much are the exact same examples 5510 04:01:45,460 --> 04:01:47,580 I used when I was doing strings. 5511 04:01:47,580 --> 04:01:49,080 They're pretty much the same. 5512 04:01:52,360 --> 04:01:54,040 There's a number of different methods, 5513 04:01:54,040 --> 04:01:56,360 and you can look up all the documentation on lists. 5514 04:01:56,360 --> 04:01:58,680 I often just use the dir command 5515 04:01:58,680 --> 04:02:00,320 to remind myself of them. 5516 04:02:00,320 --> 04:02:01,680 Append we'll look at. 5517 04:02:01,680 --> 04:02:04,520 Count looks for certain values in the list. 5518 04:02:04,520 --> 04:02:06,680 Extend adds things to the end of the list. 5519 04:02:06,680 --> 04:02:08,800 Index looks things up in the list. 5520 04:02:08,800 --> 04:02:12,620 Insert allows the list to sort of be expanded in the middle. 5521 04:02:12,620 --> 04:02:14,620 Pop pulls things off the top. 5522 04:02:14,620 --> 04:02:16,940 Remove removes an item in the middle. 5523 04:02:16,940 --> 04:02:19,520 Reverse flips the order of them, and sort 5524 04:02:19,520 --> 04:02:23,880 puts them in sorted order based on the values. 5525 04:02:23,880 --> 04:02:28,880 So let's look at a couple of these. 5526 04:02:31,200 --> 04:02:33,180 So if we build a list from scratch, 5527 04:02:33,180 --> 04:02:34,880 we have a way to ask for an empty list.
5528 04:02:34,880 --> 04:02:37,560 There are a couple different ways to ask for an empty list. 5529 04:02:37,560 --> 04:02:40,300 We could use just two square brackets next to each other. 5530 04:02:40,300 --> 04:02:42,380 But this is a form we call the constructor form 5531 04:02:42,380 --> 04:02:44,320 where we say, hey Python, make a list. 5532 04:02:44,320 --> 04:02:48,080 In this case, the word list is like a reserved word to Python. 5533 04:02:48,080 --> 04:02:52,000 It's really a reserved class, but say, 5534 04:02:52,000 --> 04:02:54,600 list parentheses says make me an empty list 5535 04:02:54,600 --> 04:02:57,000 and then assign that list into stuff. 5536 04:02:57,000 --> 04:02:59,920 So stuff is now, it's a list object, 5537 04:02:59,920 --> 04:03:03,040 it's of type list, but it has nothing in it. 5538 04:03:03,040 --> 04:03:04,640 And then we can call the append method, 5539 04:03:04,640 --> 04:03:07,120 stuff.append and stick book in. 5540 04:03:07,120 --> 04:03:09,320 And then we say, oh, and that knows how long, 5541 04:03:09,320 --> 04:03:11,000 the stuff knows how long it is, 5542 04:03:11,000 --> 04:03:12,920 where the end is and how to add something to it 5543 04:03:12,920 --> 04:03:15,320 and then add a 99 to it, and we print it out. 5544 04:03:15,320 --> 04:03:18,960 We got book and 99, reminding ourselves that lists, 5545 04:03:18,960 --> 04:03:21,560 while they're often the same types of variables, 5546 04:03:21,560 --> 04:03:23,920 the same types of values in the various positions 5547 04:03:23,920 --> 04:03:26,840 in the list, it doesn't always have to be that way. 5548 04:03:26,840 --> 04:03:28,920 Then we say, oh, well, stuff.append cookie, 5549 04:03:28,920 --> 04:03:30,880 you can keep on going, and then we end up 5550 04:03:30,880 --> 04:03:33,640 with three things, including the cookie.
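Building a list from scratch, as just walked through, could be sketched like this. The values are the ones named in the lecture; the `type` check is added only for illustration.

```python
stuff = list()            # constructor form; [] would work the same way
print(type(stuff))        # <class 'list'>
print(stuff)              # an empty list

stuff.append('book')      # add to the end
stuff.append(99)          # values do not have to be the same type
print(stuff)

stuff.append('cookie')    # keep on going
print(stuff)
```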
5551 04:03:35,040 --> 04:03:38,720 We have an in operator, works pretty much like 5552 04:03:38,720 --> 04:03:42,440 the in operator in a string, is nine in my list? 5553 04:03:42,440 --> 04:03:44,760 And that's pretty simple, and the answer of course is yes, 5554 04:03:44,760 --> 04:03:46,080 nine is in my list. 5555 04:03:46,080 --> 04:03:47,560 Is 15 in my list? 5556 04:03:47,560 --> 04:03:51,160 Looking through, no it's not, 15 is not in my list. 5557 04:03:51,160 --> 04:03:52,800 And then there's the not in operator, 5558 04:03:52,800 --> 04:03:54,800 think of that as kind of like one operator. 5559 04:03:54,800 --> 04:03:56,440 Is 20 not in the list? 5560 04:03:56,440 --> 04:03:58,640 And the answer is, since it's not there, is true. 5561 04:03:58,640 --> 04:04:00,760 And so that's a way to just, you know, 5562 04:04:00,760 --> 04:04:03,560 it's kind of like starts with or in for strings, 5563 04:04:03,560 --> 04:04:05,380 same kind of stuff. 5564 04:04:05,380 --> 04:04:07,880 Lists are in order, and they're sortable, 5565 04:04:07,880 --> 04:04:11,280 and so this is something that we take good advantage of. 5566 04:04:11,280 --> 04:04:13,760 A lot of what computers want to do is sort stuff, 5567 04:04:13,760 --> 04:04:15,320 you know, look all these things up, 5568 04:04:15,320 --> 04:04:17,480 append them, and then get them sorted. 5569 04:04:17,480 --> 04:04:22,000 And so there is this method inside of list, 5570 04:04:22,000 --> 04:04:23,080 that's just the sort method. 5571 04:04:23,080 --> 04:04:25,400 So here we, you know, put three values 5572 04:04:25,400 --> 04:04:27,360 in zero, one, two positions, zero, one, and two, 5573 04:04:27,360 --> 04:04:30,080 Joseph, Glenn, and Sally, and then we tell the list 5574 04:04:30,080 --> 04:04:32,040 to sort itself, and then we print it out. 
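The `in` and `not in` checks described a moment ago can be tried directly. The contents of the list are an assumed example, chosen so that it contains 9 but neither 15 nor 20.

```python
some = [1, 9, 21, 10, 16]   # assumed example list

print(9 in some)        # True: 9 is in the list
print(15 in some)       # False: 15 is not in the list
print(20 not in some)   # True: 20 is not there, so "not in" is true
```

These expressions produce True or False and leave the list unchanged, so they fit naturally inside an `if` statement.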
5575 04:04:32,040 --> 04:04:34,280 Now this actually sorts the list in place, 5576 04:04:34,280 --> 04:04:36,320 which is different than upper and lower, 5577 04:04:36,320 --> 04:04:39,080 because if you remember, strings are not mutable, 5578 04:04:39,080 --> 04:04:41,200 but lists are mutable, and so you say, 5579 04:04:41,200 --> 04:04:43,760 hey, just sort yourself, okay? 5580 04:04:43,760 --> 04:04:45,960 And so just sort yourself, and then it sorts it, 5581 04:04:45,960 --> 04:04:48,000 and then it's in alphabetical order, 5582 04:04:48,000 --> 04:04:49,040 Glenn, Joseph, and Sally. 5583 04:04:49,040 --> 04:04:51,760 I happen to be clever, I only put strings in there, 5584 04:04:51,760 --> 04:04:53,280 and I put my upper case and lower case 5585 04:04:53,280 --> 04:04:56,320 in a very consistent pattern, but the list has changed, 5586 04:04:56,320 --> 04:05:00,080 and if I look at list sub one, that is the second item, 5587 04:05:00,080 --> 04:05:02,680 which is Joseph that prints out right down there. 5588 04:05:06,360 --> 04:05:07,840 There's a whole bunch of built-in functions 5589 04:05:07,840 --> 04:05:09,080 to help manipulate lists. 5590 04:05:09,080 --> 04:05:12,480 The other thing I was showing, sort, is a method 5591 04:05:12,480 --> 04:05:14,440 that's part of list, but there are other functions 5592 04:05:14,440 --> 04:05:17,400 that take lists as their arguments. 5593 04:05:17,400 --> 04:05:19,360 We already talked about the len function, 5594 04:05:19,360 --> 04:05:21,120 tells you how many items there are.
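Sorting in place, as described above, looks roughly like this. One detail worth knowing is that `sort()` changes the list itself and returns `None`, so its result should not be assigned back to the variable.

```python
friends = ['Joseph', 'Glenn', 'Sally']

friends.sort()        # sorts the list in place; returns None
print(friends)        # now in alphabetical order
print(friends[1])     # the second item after sorting

# Contrast with strings: lower() cannot change the string,
# so it returns a new one instead.
name = 'Glenn'
print(name.lower(), name)
```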
5595 04:05:21,120 --> 04:05:24,160 There is pretty obvious max, it says go through 5596 04:05:24,160 --> 04:05:28,600 and find the largest, min, go through and find the smallest, 5597 04:05:28,600 --> 04:05:31,920 sum goes through, adds them all up, 5598 04:05:31,920 --> 04:05:34,760 and we can say let's do average by taking the sum 5599 04:05:34,760 --> 04:05:38,120 of all of them and dividing it by the length, 5600 04:05:38,120 --> 04:05:40,340 and you might think to yourself, oh wow, 5601 04:05:40,340 --> 04:05:42,280 I wish we'd have known this such few chapters back 5602 04:05:42,280 --> 04:05:43,860 when we were having to write all those loops 5603 04:05:43,860 --> 04:05:47,620 to do max, min, sum, largest, smallest, et cetera. 5604 04:05:47,620 --> 04:05:48,760 You can kind of think in your mind 5605 04:05:48,760 --> 04:05:50,760 that inside each one of these functions is a loop 5606 04:05:50,760 --> 04:05:52,920 that does pretty much what you did in those chapters, 5607 04:05:52,920 --> 04:05:55,400 and part of the reason we did that back then, 5608 04:05:55,400 --> 04:05:56,560 even though these things were here, 5609 04:05:56,560 --> 04:05:58,920 was they're kind of easy loops to understand, 5610 04:05:59,800 --> 04:06:02,400 and so those are there, 5611 04:06:02,400 --> 04:06:07,400 and basically there allows two different ways 5612 04:06:08,240 --> 04:06:10,520 of building loops to do the maximum and minimum. 5613 04:06:10,520 --> 04:06:13,760 Now it's not necessarily all that much easier 5614 04:06:13,760 --> 04:06:17,840 to do something using these 5615 04:06:17,840 --> 04:06:20,940 because you either can do them the old way, 5616 04:06:20,940 --> 04:06:24,400 or you can make a list and then use these functions. 
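A short sketch of the built-in functions mentioned here. The numbers in `nums` are an assumed example.

```python
nums = [3, 41, 12, 9, 74, 15]   # assumed example values

print(len(nums))    # how many items: 6
print(max(nums))    # largest: 74
print(min(nums))    # smallest: 3
print(sum(nums))    # total: 154

# Average: the sum of all the items divided by how many there are.
average = sum(nums) / len(nums)
print(average)
```

Each of these functions does internally what the hand-written loops from the earlier chapters did.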
5617 04:06:24,400 --> 04:06:26,720 So let's take a look, and I'll just say 5618 04:06:26,720 --> 04:06:30,280 that these two bits of code are doing the exact same thing, 5619 04:06:30,280 --> 04:06:32,400 and what they are is they're implementing a program 5620 04:06:32,400 --> 04:06:34,360 that's gonna repeatedly ask for numbers 5621 04:06:34,360 --> 04:06:36,160 until we type the word done, 5622 04:06:36,160 --> 04:06:37,680 and then it's gonna compute the average 5623 04:06:37,680 --> 04:06:39,000 and tell us what they are, 5624 04:06:39,000 --> 04:06:43,760 and so using sort of the stuff from the loop chapter, 5625 04:06:43,760 --> 04:06:45,840 we start with a total variable and a count variable, 5626 04:06:45,840 --> 04:06:49,160 set them to zero, and then we read a number, 5627 04:06:49,160 --> 04:06:51,280 we check for done to break out, 5628 04:06:51,280 --> 04:06:53,720 but then we convert it to a floating point value, 5629 04:06:53,720 --> 04:06:55,800 and then we say total equals total plus value, 5630 04:06:55,800 --> 04:06:56,960 and count equals count plus one, 5631 04:06:56,960 --> 04:06:59,600 and so this is gonna run over and over and over again, 5632 04:06:59,600 --> 04:07:01,400 however many times we're gonna do this, 5633 04:07:01,400 --> 04:07:04,060 and then it's gonna pop out, and when it's done, 5634 04:07:04,060 --> 04:07:05,680 it's gonna have this value of total, 5635 04:07:05,680 --> 04:07:08,280 the running total will become the overall total, 5636 04:07:08,280 --> 04:07:11,200 divided by count, and it'll print the average out, okay? 5637 04:07:11,200 --> 04:07:14,080 And so that's kinda how we would have done this 5638 04:07:14,080 --> 04:07:16,760 before we knew how to do this with lists. 5639 04:07:16,760 --> 04:07:19,080 Now, let's take a look at the other one. 
5640 04:07:20,160 --> 04:07:23,000 In the other one, we say let's make an empty list, 5641 04:07:23,000 --> 04:07:25,120 remember this is that constructor syntax 5642 04:07:25,120 --> 04:07:27,600 that says to Python, make me an empty list, 5643 04:07:27,600 --> 04:07:29,600 and assign the empty list. 5644 04:07:29,600 --> 04:07:32,120 It has nothing in it, right, but it is a list, 5645 04:07:32,120 --> 04:07:34,640 has nothing in it, into the variable num list. 5646 04:07:34,640 --> 04:07:36,520 Now we're gonna write another loop, 5647 04:07:36,520 --> 04:07:39,200 this part here is the same, these three lines, 5648 04:07:39,200 --> 04:07:42,400 read the number if it's done, quit, and convert it to value. 5649 04:07:42,400 --> 04:07:44,680 But instead of doing the actual calculation right now, 5650 04:07:44,680 --> 04:07:46,480 what we're gonna do is just append it to the list. 5651 04:07:46,480 --> 04:07:48,160 So the list will start out empty, 5652 04:07:48,160 --> 04:07:49,780 then the three will be in the list, 5653 04:07:49,780 --> 04:07:51,040 then the nine will be in the list, 5654 04:07:51,040 --> 04:07:52,600 then the five will be in the list. 5655 04:07:52,600 --> 04:07:54,840 So we're appending, each time through the loop, 5656 04:07:54,840 --> 04:07:56,480 we're appending into the list. 5657 04:07:56,480 --> 04:08:00,040 So we're just growing the list every time I read a value, 5658 04:08:00,040 --> 04:08:01,560 instead of actually computing something 5659 04:08:01,560 --> 04:08:03,000 with the value that we've got. 5660 04:08:03,000 --> 04:08:05,280 So in either case, we get value, 5661 04:08:05,280 --> 04:08:08,160 and in one case, we append it to the list. 
5662 04:08:08,160 --> 04:08:10,480 And then finally, it finishes, the break happens, 5663 04:08:10,480 --> 04:08:12,380 and then we just say, oh, hey, Python, 5664 04:08:12,380 --> 04:08:13,580 sum up everything in the list, 5665 04:08:13,580 --> 04:08:14,880 add these three numbers together, 5666 04:08:14,880 --> 04:08:17,720 and then take the divided by the length of all those things, 5667 04:08:17,720 --> 04:08:19,040 and you'll have the average. 5668 04:08:19,040 --> 04:08:24,040 And so these two things give us exactly the same output. 5669 04:08:24,360 --> 04:08:25,600 Now there is one difference, 5670 04:08:25,600 --> 04:08:29,920 if there was like one million or one billion numbers, 5671 04:08:29,920 --> 04:08:31,260 they actually have to all be stored 5672 04:08:31,260 --> 04:08:32,560 in the memory simultaneously. 5673 04:08:32,560 --> 04:08:35,160 Whereas here, it's actually doing the calculation, 5674 04:08:35,160 --> 04:08:38,160 of the billion numbers, and not using up so much memory. 5675 04:08:38,160 --> 04:08:40,280 For most of the things that you're gonna be doing, 5676 04:08:40,280 --> 04:08:42,720 the difference in memory, there is a difference in memory. 5677 04:08:42,720 --> 04:08:45,380 This uses, this one here uses more memory, 5678 04:08:46,560 --> 04:08:49,540 but I can't draw very well, more memory. 5679 04:08:51,200 --> 04:08:54,080 It uses more memory, but it doesn't really matter 5680 04:08:54,080 --> 04:08:55,440 by the time it's all said and done. 5681 04:08:55,440 --> 04:08:59,520 And so for you, the difference between these things 5682 04:08:59,520 --> 04:09:02,600 is not all that significant, but it's important to understand 5683 04:09:02,600 --> 04:09:03,800 that they're just two techniques 5684 04:09:03,800 --> 04:09:06,220 to accomplish the same thing with lists. 5685 04:09:09,760 --> 04:09:11,800 So now we're gonna wrap up and talk a little bit 5686 04:09:11,800 --> 04:09:13,680 about how strings and lists are related. 
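A sketch of the list-based version just described, for comparison with the running-total approach. As an assumption for this example, the typed entries are passed in as a list of strings and the function is called `average_with_list`; the variable name `numlist` follows the "num list" mentioned in the lecture.

```python
def average_with_list(entries):
    """Collect every number in a list first, then compute the average at the end."""
    numlist = list()               # constructor syntax: make an empty list
    for inp in entries:
        if inp == 'done':
            break
        value = float(inp)
        numlist.append(value)      # grow the list instead of computing anything yet
    # Assumes at least one number was entered; an empty list would divide by zero.
    return sum(numlist) / len(numlist)


print('Average:', average_with_list(['3', '9', '5', 'done']))
```

The result is the same as the running-total approach; the difference is that every value stays in memory until the end.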
5687 04:09:13,680 --> 04:09:16,480 They're sort of related in that they both have zero base 5688 04:09:16,480 --> 04:09:19,080 things and we use the square bracket operator 5689 04:09:19,080 --> 04:09:20,960 to do various things. 5690 04:09:20,960 --> 04:09:23,160 But there's a lot of situations where we're looking 5691 04:09:23,160 --> 04:09:26,520 at our data and we're combining the use of lists and strings. 5692 04:09:26,520 --> 04:09:28,080 So let me show you the first thing, 5693 04:09:28,080 --> 04:09:30,360 probably the coolest thing. 5694 04:09:30,360 --> 04:09:32,320 We're gonna use it a lot the rest of the class, 5695 04:09:32,320 --> 04:09:34,280 and that is the split function. 5696 04:09:34,280 --> 04:09:36,660 So let's take a string, we've got ABC here, 5697 04:09:36,660 --> 04:09:37,760 it's with three words. 5698 04:09:37,760 --> 04:09:40,080 What we're interested in the fact is that there's spaces 5699 04:09:40,080 --> 04:09:41,160 in this word. 5700 04:09:41,160 --> 04:09:42,800 And what split does is says, you know, 5701 04:09:42,800 --> 04:09:44,720 I'm gonna look through this thing, I'm gonna find this, 5702 04:09:44,720 --> 04:09:47,420 and I'm gonna break this into pieces, 5703 04:09:47,420 --> 04:09:49,320 and I'm gonna return you a list 5704 04:09:49,320 --> 04:09:51,000 of the separate individual pieces. 5705 04:09:51,000 --> 04:09:54,280 So look for blanks and break it in pieces 5706 04:09:54,280 --> 04:09:55,760 and give me back the pieces. 5707 04:09:55,760 --> 04:09:58,540 So I'll print these out and now you see that it's a list 5708 04:09:58,540 --> 04:10:00,520 with three items, with three words. 5709 04:10:00,520 --> 04:10:03,360 The spaces are gone, but it's given it to us. 
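As a small illustration of the split call described above: the variable names `abc` and `stuff` are taken from how the transcript reads and may not match the slide exactly.

```python
abc = 'With three words'
stuff = abc.split()    # look for blanks, break the string into pieces
print(stuff)           # ['With', 'three', 'words']
```

The result is a list of strings, and the spaces themselves are not part of any item.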
5710 04:10:03,360 --> 04:10:05,220 So it's like, split this into words, please, 5711 04:10:05,220 --> 04:10:06,820 and give me the individual words, 5712 04:10:06,820 --> 04:10:08,620 give me a list of individual words, 5713 04:10:08,620 --> 04:10:11,160 rather than a big long string with spaces in the middle of it. 5714 04:10:11,160 --> 04:10:13,800 And that is a quick way to go from a line, 5715 04:10:13,800 --> 04:10:17,680 and it's really common, a lot of things we're going like, 5716 04:10:17,680 --> 04:10:20,280 go get the second thing, or the third thing, or whatever. 5717 04:10:20,280 --> 04:10:21,320 So the split's really nice, 5718 04:10:21,320 --> 04:10:23,160 because then you can just grab stuff. 5719 04:10:23,160 --> 04:10:25,000 And so you say, oh, how many things did I get? 5720 04:10:25,000 --> 04:10:27,600 Well, I got three, the len function tells us that. 5721 04:10:27,600 --> 04:10:30,080 And I can print the first word I got, 5722 04:10:30,080 --> 04:10:32,240 which is the one at sub zero, 5723 04:10:32,240 --> 04:10:34,800 and that'll be With, the first word, 5724 04:10:34,800 --> 04:10:36,680 because that's the sub zero position. 5725 04:10:36,680 --> 04:10:39,520 So I read something, I split it, 5726 04:10:39,520 --> 04:10:41,060 I can say there's three things, 5727 04:10:41,060 --> 04:10:43,520 and I can look at the first word, basically, 5728 04:10:43,520 --> 04:10:44,960 without really knowing much. 5729 04:10:44,960 --> 04:10:47,840 Now, if you remember earlier, and we'll see this, 5730 04:10:47,840 --> 04:10:51,740 we used find and slicing to do a similar kind of thing, 5731 04:10:51,740 --> 04:10:55,440 but people tend to prefer the split. 5732 04:10:55,440 --> 04:11:00,280 And you can, you know, oops, go back.
5733 04:11:00,280 --> 04:11:04,320 You can also then loop through them, 5734 04:11:04,320 --> 04:11:07,160 so you can split these things into stuff as a word, 5735 04:11:07,160 --> 04:11:09,440 and then go through with w, 5736 04:11:09,440 --> 04:11:12,320 and then it's gonna go through, 5737 04:11:12,320 --> 04:11:15,200 w's gonna take the successive with three words. 5738 04:11:15,200 --> 04:11:18,280 And so you can make a loop by reading some data, 5739 04:11:18,280 --> 04:11:19,960 splitting it, then writing a for loop, 5740 04:11:19,960 --> 04:11:22,280 and then it's effectively going through the words 5741 04:11:22,280 --> 04:11:23,680 in that line of data. 5742 04:11:23,680 --> 04:11:26,840 And so that's a really powerful concept that we'll use 5743 04:11:26,840 --> 04:11:29,680 in a lot of the programs that we're going to write. 5744 04:11:29,680 --> 04:11:32,660 Just a couple of bits about this and how it works. 5745 04:11:33,560 --> 04:11:36,160 Split with no parameters here, it looks for spaces, 5746 04:11:36,160 --> 04:11:40,400 but it also treats a bunch of spaces as a single space. 5747 04:11:40,400 --> 04:11:42,400 And so it's pretty smart about that, 5748 04:11:42,400 --> 04:11:44,240 and so even though this has a lot of spaces 5749 04:11:44,240 --> 04:11:47,260 between lot and of, you only see lot of, 5750 04:11:47,260 --> 04:11:48,760 all the spaces are gone. 5751 04:11:48,760 --> 04:11:50,880 It does something special about spaces. 5752 04:11:50,880 --> 04:11:53,840 It's really white space, so tabs or new lines 5753 04:11:53,840 --> 04:11:58,840 or other characters would also qualify in split, basically. 5754 04:11:58,880 --> 04:12:01,780 Now, you don't always have to split based on spaces, 5755 04:12:01,780 --> 04:12:03,860 and a lot of data that you're gonna run into, 5756 04:12:03,860 --> 04:12:05,440 you're gonna wanna split on something else. 
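The points made so far about split (counting the pieces, indexing, looping, and how runs of whitespace are handled) can be sketched together. The sample strings are illustrative; the one with many spaces between "lot" and "of" mirrors the example described in the lecture.

```python
stuff = 'With three words'.split()
print(len(stuff))      # 3
print(stuff[0])        # With

# Loop through the words one at a time
for w in stuff:
    print(w)

# With no arguments, a run of spaces is treated as a single separator
line = 'A lot               of spaces'
etc = line.split()
print(etc)             # ['A', 'lot', 'of', 'spaces']

# Tabs and newlines count as whitespace too
mixed = 'one\ttwo\nthree'.split()
print(mixed)           # ['one', 'two', 'three']
```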
5757 04:12:05,440 --> 04:12:08,320 And so here's some data that looks like we're using semicolons 5758 04:12:08,320 --> 04:12:11,140 to separate the first, second, and third piece. 5759 04:12:11,140 --> 04:12:14,520 Now, if you just call split, split's looking for spaces. 5760 04:12:14,520 --> 04:12:16,800 And so split gives you back a list 5761 04:12:16,800 --> 04:12:18,660 of the things broken apart with spaces, 5762 04:12:18,660 --> 04:12:20,840 but there's not a single space in that line, 5763 04:12:20,840 --> 04:12:23,560 and so we get a list, see, it's a list, 5764 04:12:23,560 --> 04:12:24,560 but there's only one item, 5765 04:12:24,560 --> 04:12:25,960 and the semicolons are sitting there. 5766 04:12:25,960 --> 04:12:26,800 Split doesn't go like, 5767 04:12:26,800 --> 04:12:29,200 whoa, this looks like it should be semicolons. 5768 04:12:29,200 --> 04:12:31,360 Split's job is to use spaces 5769 04:12:31,360 --> 04:12:34,800 and split the string based on spaces, okay? 5770 04:12:34,800 --> 04:12:38,240 But given that this is something we like to do, 5771 04:12:38,240 --> 04:12:39,840 you can tell split what character 5772 04:12:39,840 --> 04:12:41,920 you'd actually like to split on. 5773 04:12:41,920 --> 04:12:43,460 Now, it's not quite as clever 5774 04:12:43,460 --> 04:12:45,540 when splitting on something other than spaces. 5775 04:12:45,540 --> 04:12:47,160 It doesn't understand that, you know, 5776 04:12:47,160 --> 04:12:48,980 if there's a bunch of semicolons in a row, 5777 04:12:48,980 --> 04:12:52,560 it still thinks of those as splitting points to split, 5778 04:12:52,560 --> 04:12:55,480 but in this particular case where there's no spaces, 5779 04:12:55,480 --> 04:12:56,920 you know, and it's gonna split that. 5780 04:12:56,920 --> 04:12:59,920 So it says split this based on the semicolon 5781 04:12:59,920 --> 04:13:03,840 instead of being based on the space.
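A short sketch of splitting on an explicit delimiter, assuming the data is the string `first;second;third` as the narration suggests. The last two lines show the "not quite as clever" behavior mentioned above: with an explicit delimiter, every occurrence is a split point, so consecutive delimiters yield empty strings.

```python
line = 'first;second;third'

thing = line.split()       # no spaces in the line, so nothing gets split
print(thing)               # ['first;second;third']
print(len(thing))          # 1

thing = line.split(';')    # tell split which character to split on
print(thing)               # ['first', 'second', 'third']
print(len(thing))          # 3

doubled = 'a;;b'.split(';')
print(doubled)             # ['a', '', 'b']
```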
5782 04:13:03,840 --> 04:13:06,400 And so if you take a look at what comes out of this, 5783 04:13:06,400 --> 04:13:09,520 we split on semicolon, now we have a three-item list, 5784 04:13:09,520 --> 04:13:11,320 and we get first, second, and third. 5785 04:13:11,320 --> 04:13:14,220 And a lot of your data comes out of some logging system 5786 04:13:14,220 --> 04:13:17,400 or some router status updates, 5787 04:13:17,400 --> 04:13:18,960 who knows what you're looking at, 5788 04:13:18,960 --> 04:13:21,840 but the delimiter is often something other than space, 5789 04:13:21,840 --> 04:13:24,320 and you can do that with split. 5790 04:13:26,840 --> 04:13:28,840 So this is a useful thing 5791 04:13:28,840 --> 04:13:31,400 when parsing things like our email address, right? 5792 04:13:31,400 --> 04:13:33,860 We wanted to get things like the email address, 5793 04:13:33,860 --> 04:13:36,960 this second piece, off of the line. 5794 04:13:38,340 --> 04:13:43,080 And so we can use split to take advantage of this. 5795 04:13:43,080 --> 04:13:44,440 And so here's a little loop 5796 04:13:44,440 --> 04:13:47,080 that's just gonna print out not the email addresses, 5797 04:13:47,080 --> 04:13:49,440 but instead the day of the week. 5798 04:13:49,440 --> 04:13:51,000 We're gonna print the day of the week out 5799 04:13:51,000 --> 04:13:51,960 for all these things. 5800 04:13:51,960 --> 04:13:52,880 How do we do that? 5801 04:13:52,880 --> 04:13:55,680 Well, we can observe really quickly 5802 04:13:55,680 --> 04:13:58,680 that if we split based on spaces, 5803 04:14:02,440 --> 04:14:06,320 it's the zero, one, two, it's the two position. 5804 04:14:06,320 --> 04:14:09,160 So we can quickly write a bit of code 5805 04:14:09,160 --> 04:14:13,360 that opens the file, then loops through the lines, 5806 04:14:13,360 --> 04:14:15,160 we do this all the time now. 5807 04:14:15,160 --> 04:14:18,080 The strip takes off the end of the new lines. 
5808 04:14:18,080 --> 04:14:20,800 We can check to see if it starts with from space, right? 5809 04:14:20,800 --> 04:14:23,520 From space is our key, so we're ignoring, 5810 04:14:23,520 --> 04:14:24,960 we're ignoring all of the lines 5811 04:14:24,960 --> 04:14:26,200 that don't start with from space, 5812 04:14:26,200 --> 04:14:28,480 but then we find a line that starts with from space, 5813 04:14:28,480 --> 04:14:31,920 and we split it, and then we just print out the third word, the one at position two. 5814 04:14:31,920 --> 04:14:33,800 And so we get the third word of the lines 5815 04:14:33,800 --> 04:14:37,960 that start with from, and that's how this thing works. 5816 04:14:37,960 --> 04:14:42,960 Now, sometimes we want to dig into it deeper, 5817 04:14:45,120 --> 04:14:47,080 and we will take something, split it, 5818 04:14:47,080 --> 04:14:49,080 and then split another piece of it again 5819 04:14:49,080 --> 04:14:50,560 with a different delimiter. 5820 04:14:50,560 --> 04:14:53,040 So let's just say that the thing that we want to achieve 5821 04:14:53,040 --> 04:14:56,160 is getting the part after the at sign for email addresses. 5822 04:14:56,160 --> 04:14:59,160 And we did this with, again, find and pos 5823 04:14:59,160 --> 04:15:01,040 and stuff like that, but you can use split 5824 04:15:01,040 --> 04:15:02,480 to do this as well. 5825 04:15:02,480 --> 04:15:03,560 So the first thing we're gonna do 5826 04:15:03,560 --> 04:15:04,480 is we're gonna take this line, 5827 04:15:04,480 --> 04:15:06,520 we're gonna split it based on spaces, right? 5828 04:15:06,520 --> 04:15:09,480 Chop, chop, chop, chop, chop, chop, 5829 04:15:09,480 --> 04:15:11,240 and the fact that there's an extra space there, 5830 04:15:11,240 --> 04:15:14,040 doesn't matter, split happily just like zooms through that.
5831 04:15:14,040 --> 04:15:18,320 And then words sub one, zero, one, two, 5832 04:15:18,320 --> 04:15:20,200 words sub one is this email address, 5833 04:15:20,200 --> 04:15:22,320 so we'll put that in a variable called email, 5834 04:15:22,320 --> 04:15:25,160 and so email will be a string that's just this. 5835 04:15:25,160 --> 04:15:27,600 So in two lines, we've pulled out 5836 04:15:27,600 --> 04:15:29,840 the second address into a variable. 5837 04:15:29,840 --> 04:15:34,000 Then what we're going to do is we're going to re-split that. 5838 04:15:34,000 --> 04:15:36,120 We're gonna take this string we've got 5839 04:15:36,120 --> 04:15:37,720 and split it based on at sign, 5840 04:15:37,720 --> 04:15:39,400 because we know it's an email address. 5841 04:15:39,400 --> 04:15:41,160 So we get a new set of pieces, 5842 04:15:41,160 --> 04:15:43,040 the first part is the person's name, 5843 04:15:43,040 --> 04:15:46,720 and the second part is the host name 5844 04:15:46,720 --> 04:15:48,920 that their email is hosted on. 5845 04:15:48,920 --> 04:15:52,080 And then what we can do then is we just happen to know that, 5846 04:15:54,240 --> 04:15:57,000 we just happen to know that this is the zero item 5847 04:15:57,000 --> 04:15:59,360 and this is the one item, so we can get at that. 5848 04:15:59,360 --> 04:16:01,400 So the interesting thing of going here, 5849 04:16:01,400 --> 04:16:03,840 if you think back to how we did this before 5850 04:16:03,840 --> 04:16:06,640 with find and pos and all that stuff, 5851 04:16:06,640 --> 04:16:08,920 it's really a lot cleaner and we don't, 5852 04:16:08,920 --> 04:16:12,400 for me, I can look at this after you understand it 5853 04:16:12,400 --> 04:16:14,560 and it's easy for me to understand that it's correct, 5854 04:16:14,560 --> 04:16:17,400 whereas that pos stuff, you gotta add one 5855 04:16:17,400 --> 04:16:20,720 and start the second find after, just remember that.
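The double split described above can be sketched as follows. The sample line is a made-up stand-in in the general shape of the mailbox data (the address `someone@example.com` is hypothetical); the variable names `words`, `email`, and `pieces` follow the narration where it names them.

```python
# Hypothetical line shaped like the mailbox data; note the extra space before 5
line = 'From someone@example.com Sat Jan  5 09:14:16 2008'

words = line.split()         # first split: on whitespace
email = words[1]             # position one holds the address
pieces = email.split('@')    # second split: on the at sign

print(pieces)                # ['someone', 'example.com']
print(pieces[1])             # example.com
```

The extra space between the month and the day does not produce an empty item, because the first split uses the whitespace behavior.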
5856 04:16:20,720 --> 04:16:22,200 And this is a lot cleaner way, 5857 04:16:22,200 --> 04:16:23,720 and this is a more typical way 5858 04:16:23,720 --> 04:16:27,240 of pulling this kind of information out of a line. 5859 04:16:28,320 --> 04:16:30,080 So in this chapter, we've talked about lists, 5860 04:16:30,080 --> 04:16:31,800 we've talked about the concept of collections, 5861 04:16:31,800 --> 04:16:33,120 that's our first data structure, 5862 04:16:33,120 --> 04:16:34,600 we're not just doing algorithms, 5863 04:16:34,600 --> 04:16:36,800 we kinda know algorithms now, 5864 04:16:36,800 --> 04:16:38,080 but now we're gonna do data structures. 5865 04:16:38,080 --> 04:16:40,360 And in this chapter and the next two chapters 5866 04:16:40,360 --> 04:16:42,040 are our foundational data structures 5867 04:16:42,040 --> 04:16:43,520 and then we'll, like everything, 5868 04:16:43,520 --> 04:16:45,640 we'll make more complex data structures 5869 04:16:45,640 --> 04:16:48,280 by composing those data structures together. 5870 04:16:48,280 --> 04:16:51,280 We've looked at how strings and lists connect together 5871 04:16:51,280 --> 04:16:55,160 and how split works and these are all really powerful tools 5872 04:16:55,160 --> 04:16:57,000 that we're gonna use going forward. 5873 04:16:57,000 --> 04:17:02,000 Now we're gonna take a look at how we would write some code 5874 04:17:04,600 --> 04:17:08,240 to do some parsing, read some data. 5875 04:17:08,240 --> 04:17:09,280 As a matter of fact, we're gonna read 5876 04:17:09,280 --> 04:17:12,080 through our famous mailbox data, 5877 04:17:12,080 --> 04:17:15,800 look for lines that begin with from space 5878 04:17:15,800 --> 04:17:17,160 and extract the third word. 5879 04:17:17,160 --> 04:17:19,480 As a matter of fact, we already have some of this code 5880 04:17:19,480 --> 04:17:21,480 already written, we're gonna debug it. 5881 04:17:21,480 --> 04:17:23,960 We're gonna look at code and we're gonna debug it. 
5882 04:17:23,960 --> 04:17:25,720 So here we go, here we have it 5883 04:17:25,720 --> 04:17:28,240 and it's a pretty basic program. 5884 04:17:28,240 --> 04:17:31,520 It opens a file, loops through the file, 5885 04:17:31,520 --> 04:17:34,760 throws away the white space, splits it into words 5886 04:17:34,760 --> 04:17:36,760 and checks to see if the zeroth word, 5887 04:17:36,760 --> 04:17:39,120 the first word is from and if it's not, 5888 04:17:39,120 --> 04:17:41,200 we skip and read the next line. 5889 04:17:41,200 --> 04:17:43,960 And otherwise, if we find a line that starts 5890 04:17:43,960 --> 04:17:46,880 with from space, then we print the third word, 5891 04:17:46,880 --> 04:17:48,720 which is word sub two. 5892 04:17:48,720 --> 04:17:50,160 Okay, so this is what we've got 5893 04:17:50,160 --> 04:17:52,900 and we carefully saved this file 5894 04:17:52,900 --> 04:17:57,340 into the same folder that we've got, EX08. 5895 04:17:57,340 --> 04:18:02,340 And so let's go ahead, cd, desktop, Python for everybody, 5896 04:18:03,280 --> 04:18:05,260 EX underscore 08. 5897 04:18:06,160 --> 04:18:11,160 And so this is some files, we got our day of the week, 5898 04:18:11,320 --> 04:18:15,180 Python and our inbox short, so that's sitting there, okay? 5899 04:18:15,180 --> 04:18:16,700 And so let's run this program. 5900 04:18:16,700 --> 04:18:19,360 This is the program we've got right here, 5901 04:18:19,360 --> 04:18:24,360 Python three, dow.py and it doesn't work. 5902 04:18:26,960 --> 04:18:29,360 Now, by now you've seen a few trace backs 5903 04:18:29,360 --> 04:18:30,500 and there you go. 5904 04:18:32,120 --> 04:18:36,040 So, you know, when you look at a trace back, 5905 04:18:36,040 --> 04:18:39,080 you think to yourself, well, I made a mistake 5906 04:18:39,080 --> 04:18:42,160 and you've gotten pretty good at looking at that line. 
5907 04:18:42,160 --> 04:18:44,080 So there you are, you're like, this is the line, 5908 04:18:44,080 --> 04:18:46,520 there must be something wrong on this line 5909 04:18:46,520 --> 04:18:48,720 and you wanna change it. 5910 04:18:48,720 --> 04:18:50,880 But that line's not actually the problem 5911 04:18:50,880 --> 04:18:51,920 in this particular thing. 5912 04:18:51,920 --> 04:18:54,360 And so you gotta be careful sometimes. 5913 04:18:54,360 --> 04:18:56,480 And one of the things that you didn't notice 5914 04:18:56,480 --> 04:19:00,320 in this one right away is that it actually worked. 5915 04:19:00,320 --> 04:19:02,840 It printed the first line out. 5916 04:19:02,840 --> 04:19:05,620 So if we take a look at our data set, 5917 04:19:05,620 --> 04:19:08,120 it found the line started with from space, 5918 04:19:08,120 --> 04:19:11,160 it split it and printed out the third word 5919 04:19:11,160 --> 04:19:13,200 and it blew up later. 5920 04:19:13,200 --> 04:19:16,080 And so part of the problem is that we don't know 5921 04:19:16,080 --> 04:19:19,560 what it was doing when it blew up. 5922 04:19:19,560 --> 04:19:21,600 And so the first thing I'd like to do 5923 04:19:21,600 --> 04:19:26,000 in this kind of a situation is find the line 5924 04:19:26,000 --> 04:19:29,400 and make sure there's a print statement right before it. 5925 04:19:29,400 --> 04:19:34,400 And so I'm gonna print words colon and then comma WDS. 5926 04:19:35,360 --> 04:19:38,580 I wanna print right before the line that blows up 5927 04:19:38,580 --> 04:19:41,720 so that I know really when this finally does blow up, 5928 04:19:41,720 --> 04:19:44,000 what was going on in that line. 5929 04:19:44,000 --> 04:19:46,720 So I'm gonna run it again. 5930 04:19:48,600 --> 04:19:51,000 And oop, did I forget to save it? 5931 04:19:51,000 --> 04:19:52,760 No, I forgot to save it, look at that. 5932 04:19:52,760 --> 04:19:54,960 See the little blue dot, forgot to save it. 
5933 04:19:58,680 --> 04:20:00,480 So now we see a whole bunch of output. 5934 04:20:00,480 --> 04:20:02,360 And we see that it's actually doing a whole lot of work 5935 04:20:02,360 --> 04:20:03,960 before it's blowing up. 5936 04:20:03,960 --> 04:20:06,940 And so you see that it prints the words out 5937 04:20:06,940 --> 04:20:08,980 from that first line and prints out Saturday, 5938 04:20:08,980 --> 04:20:10,640 which is exactly what we expect. 5939 04:20:10,640 --> 04:20:12,280 It's the third word in the line. 5940 04:20:12,280 --> 04:20:13,680 And then reads a whole bunch of stuff 5941 04:20:13,680 --> 04:20:16,360 and it's actually, what it's doing now is ignoring. 5942 04:20:16,360 --> 04:20:18,520 Let me just put something here. 5943 04:20:18,520 --> 04:20:22,340 I'm gonna say print ignore. 5944 04:20:26,240 --> 04:20:29,120 So I can keep track of when these lines are being ignored. 5945 04:20:29,120 --> 04:20:32,360 So let's run it again and have the word ignore pop up. 5946 04:20:32,360 --> 04:20:35,000 Right, and so it's doing a lot of ignoring. 5947 04:20:35,000 --> 04:20:39,200 It finds these words, prints out Saturday, 5948 04:20:39,200 --> 04:20:40,760 reads this line and ignores it, 5949 04:20:40,760 --> 04:20:42,200 reads this line and ignores it, 5950 04:20:42,200 --> 04:20:43,360 reads this line and ignores it. 5951 04:20:43,360 --> 04:20:47,520 So a lot of stuff's going on here that you might not realize. 5952 04:20:47,520 --> 04:20:50,760 And so we have to take a look at what the problem is. 5953 04:20:50,760 --> 04:20:54,260 And so it is now blowing up word sub zero. 5954 04:20:54,260 --> 04:20:56,240 And now we can scroll down and we can look 5955 04:20:56,240 --> 04:20:59,420 at exactly what happened right before the trace back. 5956 04:20:59,420 --> 04:21:02,480 So we really now know exactly what happened 5957 04:21:02,480 --> 04:21:03,400 before the trace back. 
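To make the debugging steps concrete, here is a sketch of the program with the two debug prints added. It is an approximation under stated assumptions: the real program reads a file, while this version loops over a small hypothetical list called `sample_lines` so it can run anywhere, and the function name `buggy_day_of_week` is invented for the example. The `try`/`except` at the bottom exists only so the demonstration can report the error instead of stopping; it is not part of the fix.

```python
# Hypothetical stand-in for a few lines of the mailbox file
sample_lines = [
    'From someone@example.com Sat Jan  5 09:14:16 2008\n',
    'Return-Path: <postmaster@example.com>\n',
    '\n',
    'Subject: hello\n',
]


def buggy_day_of_week(lines):
    for line in lines:
        line = line.rstrip()
        wds = line.split()
        print('Words:', wds)       # debug print right before the line that can fail
        if wds[0] != 'From':       # this is the line named in the traceback
            print('Ignore')        # debug print to see which lines are skipped
            continue
        print(wds[2])


try:
    buggy_day_of_week(sample_lines)
except IndexError as err:
    print('Blew up with:', err)
```

The last `Words:` line printed before the failure shows what the program was holding when it failed.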
5958 04:21:03,400 --> 04:21:05,200 And the interesting thing is, 5959 04:21:05,200 --> 04:21:08,840 is that there is an empty, empty string. 5960 04:21:08,840 --> 04:21:10,660 I mean empty array. 5961 04:21:10,660 --> 04:21:12,560 There's an array with zero items. 5962 04:21:12,560 --> 04:21:14,700 So I'm gonna print the line out too. 5963 04:21:16,640 --> 04:21:19,920 Print line colon. 5964 04:21:19,920 --> 04:21:21,340 Now I haven't changed my program at all. 5965 04:21:21,340 --> 04:21:24,080 I'm just trying to figure out what's going on here. 5966 04:21:24,080 --> 04:21:26,280 So I'll save that and I'm gonna run it. 5967 04:21:27,280 --> 04:21:30,080 And we've got a lot of stuff and it's still working. 5968 04:21:30,080 --> 04:21:34,000 It reads a line, it reads a line, splits it into words, 5969 04:21:34,000 --> 04:21:35,240 and then prints out Saturday, 5970 04:21:35,240 --> 04:21:37,560 which is the third word on the line. 5971 04:21:37,560 --> 04:21:40,900 Now here it reads a line and this line is a blank line. 5972 04:21:40,900 --> 04:21:43,860 And it has, because it's a blank line, 5973 04:21:43,860 --> 04:21:47,900 the split returns no words and that's what blows up. 5974 04:21:47,900 --> 04:21:50,220 And the problem now is, oh, wait a sec, 5975 04:21:50,220 --> 04:21:52,140 list index out of range. 5976 04:21:52,140 --> 04:21:55,500 So words sub zero is not valid, which is the first word, 5977 04:21:55,500 --> 04:21:57,320 when there are no words. 5978 04:21:57,320 --> 04:22:01,020 So this is a statement that works most of the time. 5979 04:22:01,020 --> 04:22:02,860 Now you might think, oh, I wanna just put a try 5980 04:22:02,860 --> 04:22:04,300 and except in there. 5981 04:22:04,300 --> 04:22:08,140 Well, the right thing to do is to say to yourself, 5982 04:22:08,140 --> 04:22:09,920 oh, wait a second.
5983 04:22:09,920 --> 04:22:14,500 If the, I don't have enough words, 5984 04:22:14,500 --> 04:22:17,600 if the length of the words is less than one, 5985 04:22:20,660 --> 04:22:21,500 continue. 5986 04:22:23,500 --> 04:22:25,780 So basically it's gonna come through here, 5987 04:22:25,780 --> 04:22:28,240 it's gonna split it and if we don't have any words, 5988 04:22:28,240 --> 04:22:31,820 meaning it's a blank line, then we're gonna skip it. 5989 04:22:31,820 --> 04:22:32,860 So let's run that. 5990 04:22:35,160 --> 04:22:37,000 So now this ran all the way to the end. 5991 04:22:37,000 --> 04:22:39,220 It did a lot of stuff and it did not blow up 5992 04:22:39,220 --> 04:22:44,180 specifically, didn't have a trace back. 5993 04:22:44,180 --> 04:22:47,220 Another way to protect this would be to, 5994 04:22:48,100 --> 04:22:49,140 we'll take this part out. 5995 04:22:49,140 --> 04:22:50,860 This is called a guardian pattern. 5996 04:22:54,280 --> 04:22:55,420 Right, guardian pattern, 5997 04:22:55,420 --> 04:22:58,420 because this is dangerous. 5998 04:22:58,420 --> 04:23:02,100 This could blow up, but this, it won't blow up 5999 04:23:02,100 --> 04:23:06,020 if it makes it past here and it won't come through there 6000 04:23:06,020 --> 04:23:08,220 under the conditions that are causing it to blow up. 6001 04:23:08,220 --> 04:23:11,460 Another way to do this might be to protect it as follows. 6002 04:23:11,460 --> 04:23:12,940 To say, oh, wait a sec. 6003 04:23:14,260 --> 04:23:16,640 If the line is a blank line, 6004 04:23:18,780 --> 04:23:22,020 no, continue. 6005 04:23:22,020 --> 04:23:24,900 So now what we're gonna do is we're gonna skip blank lines. 6006 04:23:24,900 --> 04:23:29,900 I even say this, print skip blank. 6007 04:23:29,900 --> 04:23:34,900 So if it's a blank, we're gonna skip blank and keep going. 6008 04:23:38,340 --> 04:23:40,100 This will skip blank lines. 
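The two protections just described can be sketched side by side. The function names and the `sample_lines` list are hypothetical, and the functions return the matching words in a list (rather than printing them) only so the result is easy to check; the guardian lines themselves follow the lecture.

```python
# Hypothetical stand-in for a few lines of the mailbox file
sample_lines = [
    'From someone@example.com Sat Jan  5 09:14:16 2008\n',
    '\n',
    'Subject: hello\n',
]

print(''.split())    # []  (a blank line splits into a list with zero items)


def days_guarding_on_words(lines):
    days = []
    for line in lines:
        wds = line.rstrip().split()
        if len(wds) < 1:           # guardian: no words, so wds[0] would fail
            continue
        if wds[0] != 'From':       # safe now, there is at least one word
            continue
        days.append(wds[2])
    return days


def days_skipping_blank_lines(lines):
    days = []
    for line in lines:
        line = line.rstrip()
        if line == '':             # alternative guardian: skip blank lines up front
            continue
        wds = line.split()
        if wds[0] != 'From':
            continue
        days.append(wds[2])
    return days


print(days_guarding_on_words(sample_lines))
print(days_skipping_blank_lines(sample_lines))
```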
6009 04:23:40,100 --> 04:23:41,540 It'll come through here 6010 04:23:41,540 --> 04:23:43,860 and this will skip lines that don't have from, 6011 04:23:43,860 --> 04:23:46,300 but because we're not processing blank lines, 6012 04:23:46,300 --> 04:23:48,620 words sub zero always works. 6013 04:23:48,620 --> 04:23:51,480 So I can run this code and it works again. 6014 04:23:51,480 --> 04:23:54,320 So here we have a blank line, we skipped it. 6015 04:23:54,320 --> 04:23:56,760 Here we have a blank line, we skipped it. 6016 04:23:56,760 --> 04:23:59,040 Now here we had a non-blank line, we parsed it, 6017 04:23:59,040 --> 04:24:00,660 but then we ignored it. 6018 04:24:00,660 --> 04:24:03,440 And then up here, we'll find it from somewhere. 6019 04:24:03,440 --> 04:24:05,120 Doo doo doo doo doo. 6020 04:24:05,120 --> 04:24:07,940 Let's find it from, here it comes. 6021 04:24:15,420 --> 04:24:17,640 Oh, no, there's ignore, ignore. 6022 04:24:17,640 --> 04:24:22,640 I got too much debug print, I can't find it. 6023 04:24:29,120 --> 04:24:31,180 Here, I'll just hunt for from with find. 6024 04:24:37,840 --> 04:24:39,360 Okay, so there we go. 6025 04:24:39,360 --> 04:24:41,920 There it's from and we print the thing out. 6026 04:24:41,920 --> 04:24:45,640 So we're getting a lot of extra stuff. 6027 04:24:45,640 --> 04:24:48,840 So I'm gonna comment out some of these debugs. 6028 04:24:51,560 --> 04:24:52,760 And I'm actually just gonna get rid 6029 04:24:52,760 --> 04:24:54,520 of this whole skipping of the blank line. 6030 04:24:54,520 --> 04:24:55,940 I'm gonna do it with the words. 6031 04:24:55,940 --> 04:24:59,180 I'm gonna go back to the guardian we had before. 6032 04:25:05,880 --> 04:25:09,200 If the number of words that we got, 6033 04:25:09,200 --> 04:25:16,200 len of words, is less than one, continue. 6034 04:25:16,680 --> 04:25:19,380 Okay, so now this is gonna be a working program. 6035 04:25:22,400 --> 04:25:26,160 Oops, I gotta take another print statement out.
6036 04:25:27,580 --> 04:25:29,440 Gotta take another print statement out. 6037 04:25:29,440 --> 04:25:31,280 We sort of know what we're doing here. 6038 04:25:34,000 --> 04:25:36,960 Okay, so this looks like a pretty safe thing. 6039 04:25:36,960 --> 04:25:40,560 This guardian is protecting this dangerous. 6040 04:25:40,560 --> 04:25:42,200 I'll get rid of that one too. 6041 04:25:42,200 --> 04:25:45,380 This is the words that was traced back. 6042 04:25:45,380 --> 04:25:47,160 And nothing else in this thing changed 6043 04:25:47,160 --> 04:25:50,100 from when we started except we've added this little guardian. 6044 04:25:50,100 --> 04:25:52,400 Now the interesting thing is if it comes through here 6045 04:25:52,400 --> 04:25:55,280 and prints words of two, what happens if somehow 6046 04:25:55,280 --> 04:25:58,720 we find a line that has from is its first word 6047 04:25:59,960 --> 04:26:02,640 and there's only one word on, this is gonna blow up. 6048 04:26:02,640 --> 04:26:07,640 So we can make our guardian a little stronger. 6049 04:26:07,960 --> 04:26:11,120 And we can say, you know what, we're gonna skip this line 6050 04:26:11,120 --> 04:26:12,800 if it doesn't have the three words in it. 6051 04:26:12,800 --> 04:26:15,120 So it has to have at least three words. 6052 04:26:15,120 --> 04:26:17,360 And if we see less than three words, we're gonna skip it. 6053 04:26:17,360 --> 04:26:19,760 And that just makes the guardian a bit stronger. 6054 04:26:21,980 --> 04:26:25,480 And so the program works safely and you see these things 6055 04:26:25,480 --> 04:26:29,080 where sometimes you wanna check to see reasonable, 6056 04:26:29,080 --> 04:26:31,160 that your assumptions about the data are reasonable 6057 04:26:31,160 --> 04:26:34,240 and skip things where the data is not reasonable. 6058 04:26:35,120 --> 04:26:36,960 So that's one guardian pattern. 6059 04:26:36,960 --> 04:26:39,380 Let me show you a slightly different way to do this. 
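The stronger guardian described above could look like this sketch. The function name `days_of_week` and the sample lines are hypothetical; the point is that requiring at least three words protects both `wds[0]` and `wds[2]`.

```python
def days_of_week(lines):
    days = []
    for line in lines:
        wds = line.rstrip().split()
        if len(wds) < 3:           # stronger guardian: need at least three words
            continue
        if wds[0] != 'From':
            continue
        days.append(wds[2])        # safe: the guardian guarantees position two exists
    return days


# Hypothetical data, including a blank line and lines that start with From
# but are too short to have a third word
tricky_lines = [
    'From someone@example.com Sat Jan  5 09:14:16 2008\n',
    '\n',
    'From\n',
    'From someone@example.com\n',
]

print(days_of_week(tricky_lines))  # ['Sat']
```

With the weaker `len(wds) < 1` check, the line containing only `From` would pass the guardian and then fail on `wds[2]`.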
6060 04:26:39,380 --> 04:26:41,480 And this is with an or statement. 6061 04:26:41,480 --> 04:26:45,200 So I'm gonna take this code, copy that, 6062 04:26:45,200 --> 04:26:47,080 and put it here with or. 6063 04:26:47,960 --> 04:26:49,400 Get rid of all this stuff. 6064 04:26:50,600 --> 04:26:55,600 This is the guardian in a compound statement. 6065 04:26:55,600 --> 04:27:00,600 So what we're saying is if there are less than three words 6066 04:27:04,740 --> 04:27:09,740 on the line or if the first word is not from, continue. 6067 04:27:10,620 --> 04:27:13,860 Now we're doing this in order because the way it works 6068 04:27:13,860 --> 04:27:17,900 is or is true if either that's true or this is true. 6069 04:27:17,900 --> 04:27:21,180 But if it knows that this is true, 6070 04:27:21,180 --> 04:27:22,900 then it doesn't bother checking this. 6071 04:27:22,900 --> 04:27:25,120 And the checking of this is what blows up, 6072 04:27:25,120 --> 04:27:26,940 what causes the trace back. 6073 04:27:26,940 --> 04:27:29,220 So if we flip this order, it would fail. 6074 04:27:29,220 --> 04:27:31,860 If we do it in this order, it will work. 6075 04:27:31,860 --> 04:27:33,640 So let's do this one right. 6076 04:27:35,980 --> 04:27:37,140 It works. 6077 04:27:37,140 --> 04:27:39,560 But if I get this backwards, 6078 04:27:45,400 --> 04:27:48,760 it's gonna check this before it checks this. 6079 04:27:48,760 --> 04:27:53,060 And we're going to go back to failing again. 6080 04:27:53,060 --> 04:27:55,760 So you gotta get the order of these things right. 6081 04:27:55,760 --> 04:28:00,760 The guardian comes before in the or. 6082 04:28:02,900 --> 04:28:04,620 The guardian comes before. 6083 04:28:04,620 --> 04:28:07,240 And if this is true, then it doesn't check this. 
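A sketch of the compound guardian and why the order matters. The helper names `should_skip` and `should_skip_wrong_order` are invented for this example; in the lecture the condition sits directly in the `if` statement.

```python
def should_skip(wds):
    # Guardian first: if there are fewer than three words, `or` already knows
    # the answer is True and never evaluates wds[0].
    return len(wds) < 3 or wds[0] != 'From'


def should_skip_wrong_order(wds):
    # Same two tests, flipped: wds[0] is evaluated first and fails on an empty list.
    return wds[0] != 'From' or len(wds) < 3


print(should_skip([]))                                    # True
print(should_skip(['From', 'someone@example.com', 'Sat'])) # False

try:
    should_skip_wrong_order([])
except IndexError as err:
    print('Wrong order fails with:', err)
```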
6084 04:28:07,240 --> 04:28:09,800 This is called short circuit evaluation 6085 04:28:09,800 --> 04:28:11,920 where it knows that as long as this part's true, 6086 04:28:11,920 --> 04:28:15,020 it doesn't evaluate this second part. 6087 04:28:15,020 --> 04:28:18,820 And so now we have a guardian in a compound statement. 6088 04:28:18,820 --> 04:28:21,180 You'll see this a lot. 6089 04:28:21,180 --> 04:28:22,700 Sometimes if it's more complex, 6090 04:28:22,700 --> 04:28:24,200 you do it in multiple statements, 6091 04:28:24,200 --> 04:28:27,820 or you fall through, check for sanity, check for sanity, 6092 04:28:27,820 --> 04:28:30,140 and only run the code. 6093 04:28:31,400 --> 04:28:35,360 So I hope that that was useful to you, 6094 04:28:35,360 --> 04:28:37,340 looking a little bit about how to debug 6095 04:28:37,340 --> 04:28:40,540 where you don't just start chopping on the line 6096 04:28:40,540 --> 04:28:41,400 that had the problem. 6097 04:28:41,400 --> 04:28:42,680 It's not always that line 6098 04:28:42,680 --> 04:28:44,700 because we never did change that line. 6099 04:28:44,700 --> 04:28:46,320 Although we did change it a little bit at the end, 6100 04:28:46,320 --> 04:28:47,800 we added this guardian here. 6101 04:28:47,800 --> 04:28:49,360 But we also fixed it without it. 6102 04:28:50,320 --> 04:28:52,160 Sometimes you add some print statements 6103 04:28:52,160 --> 04:28:53,300 to figure out what's going on 6104 04:28:53,300 --> 04:28:56,280 before you just start chopping on that line. 6105 04:28:56,280 --> 04:28:58,940 So again, I hope this helps. 6106 04:28:58,940 --> 04:28:59,780 Thanks. 6107 04:29:03,360 --> 04:29:04,880 Hello and welcome to chapter nine. 6108 04:29:04,880 --> 04:29:07,280 Now we're gonna talk about Python dictionaries. 
6109 04:29:07,280 --> 04:29:11,260 Python dictionaries are probably the thing 6110 04:29:11,260 --> 04:29:15,120 that most programmers love the most about Python 6111 04:29:15,120 --> 04:29:16,260 because they're very powerful. 6112 04:29:16,260 --> 04:29:18,240 They're like a little in-memory database. 6113 04:29:18,240 --> 04:29:20,560 It's the second of our kinds of collections 6114 04:29:20,560 --> 04:29:22,840 and probably the best collection. 6115 04:29:24,000 --> 04:29:25,200 To review what a collection is, 6116 04:29:25,200 --> 04:29:27,840 it is a situation where we are going to have a variable, 6117 04:29:27,840 --> 04:29:29,640 like a list or a dictionary, 6118 04:29:29,640 --> 04:29:32,320 that we can put multiple pieces of information in 6119 04:29:32,320 --> 04:29:34,960 rather than a single piece of information. 6120 04:29:34,960 --> 04:29:36,680 And of course, prior to collections, 6121 04:29:36,680 --> 04:29:38,740 we would put something into X 6122 04:29:38,740 --> 04:29:40,360 and then we would put something else into X 6123 04:29:40,360 --> 04:29:42,160 and it would be overwritten. 6124 04:29:42,160 --> 04:29:46,320 And now with lists, we can append things on to the end. 6125 04:29:46,320 --> 04:29:50,280 And so if we compare lists and dictionaries, 6126 04:29:50,280 --> 04:29:54,520 the list is sort of the organized version of the collections. 6127 04:29:54,520 --> 04:29:56,000 Everything stays in order. 6128 04:29:56,000 --> 04:29:57,880 You add something, it always adds to the end. 6129 04:29:57,880 --> 04:30:00,240 You take something, it sort of compacts itself. 6130 04:30:00,240 --> 04:30:02,760 It's zero through the n minus one, 6131 04:30:02,760 --> 04:30:04,440 where n is the number of items. 6132 04:30:04,440 --> 04:30:07,280 And so it's very organized, kind of like a Pringles, 6133 04:30:07,280 --> 04:30:09,420 where the potato chips are nicely stacked. 6134 04:30:11,160 --> 04:30:13,440 Dictionaries are messier. 
6135 04:30:13,440 --> 04:30:16,240 You can put things into dictionaries. 6136 04:30:16,240 --> 04:30:19,640 There's no real sense of order in dictionaries. 6137 04:30:19,640 --> 04:30:20,760 Everything has a key. 6138 04:30:20,760 --> 04:30:22,220 So you sort of throw things in 6139 04:30:22,220 --> 04:30:24,520 and they kind of mix around in there somehow. 6140 04:30:24,520 --> 04:30:26,420 And you pull things out based on the key. 6141 04:30:26,420 --> 04:30:29,560 It's like you sort of stick a label on it, 6142 04:30:30,880 --> 04:30:34,000 where you say, okay, I'm gonna take this thing 6143 04:30:36,120 --> 04:30:37,600 and I'm gonna put Chuck on it. 6144 04:30:39,000 --> 04:30:44,000 And I'm gonna take these sunglasses with the Chuck label 6145 04:30:46,120 --> 04:30:48,080 and I'm gonna throw it into the dictionary 6146 04:30:48,080 --> 04:30:50,000 and I'm like, hey, give me back Chuck. 6147 04:30:50,000 --> 04:30:51,480 I'm like, oh, here's your sunglasses 6148 04:30:51,480 --> 04:30:53,280 because you mark everything. 6149 04:30:53,280 --> 04:30:56,180 This is like the key. 6150 04:30:56,180 --> 04:30:57,120 This is the value. 6151 04:30:57,120 --> 04:30:59,440 I took a pair of sunglasses and I threw it in. 6152 04:30:59,440 --> 04:31:03,680 So it's kind of like a purse or it's sort of like a mess. 6153 04:31:03,680 --> 04:31:05,880 And so the idea is you have these labels 6154 04:31:05,880 --> 04:31:08,800 that you put on everything that you're gonna throw in. 6155 04:31:08,800 --> 04:31:11,840 Like I'm gonna put, so it won't stick to my keys. 6156 04:31:13,240 --> 04:31:15,400 You know, what else do I got here? 6157 04:31:15,400 --> 04:31:19,000 I'm gonna stick a label on my pen, a Chuck label, 6158 04:31:19,000 --> 04:31:21,400 and I'm gonna store a pen in my dictionary 6159 04:31:21,400 --> 04:31:22,540 with a Chuck label. 
6160 04:31:23,660 --> 04:31:27,800 And so it's like having a purse or a bag or a backpack 6161 04:31:27,800 --> 04:31:31,820 where you have things labeled and you can throw things in 6162 04:31:31,820 --> 04:31:34,480 and label them and you can shout into your bag and say, 6163 04:31:34,480 --> 04:31:37,200 give me the calculator or give me the candy 6164 04:31:37,200 --> 04:31:39,560 or whatever it is that you have labeled them. 6165 04:31:39,560 --> 04:31:41,200 You have to come up with the labels 6166 04:31:41,200 --> 04:31:43,880 and then you can use the labels to get things back out. 6167 04:31:43,880 --> 04:31:46,800 And like I said, they're probably the most powerful thing. 6168 04:31:46,800 --> 04:31:48,960 And they're basically this concept 6169 04:31:48,960 --> 04:31:51,320 that's generally referred to as associative arrays, 6170 04:31:51,320 --> 04:31:54,520 which means they're like lists, but they have these keys. 6171 04:31:54,520 --> 04:31:57,280 And so the associative means the association 6172 04:31:57,280 --> 04:31:58,920 between a key and a value. 6173 04:31:58,920 --> 04:32:01,120 Whereas in a list, there's a position in a value 6174 04:32:01,120 --> 04:32:04,240 and the position is less powerful and less flexible. 6175 04:32:04,240 --> 04:32:06,840 Most modern programming languages have this notion 6176 04:32:06,840 --> 04:32:08,120 of associative arrays. 6177 04:32:08,120 --> 04:32:09,920 If they don't, they're sort of unpopular 6178 04:32:09,920 --> 04:32:12,400 because once you get using them, they're like, 6179 04:32:12,400 --> 04:32:13,560 whoa, they're so powerful. 6180 04:32:13,560 --> 04:32:15,000 If you ever find yourself in a language 6181 04:32:15,000 --> 04:32:17,840 that doesn't have them, you'll freak out. 
6182 04:32:17,840 --> 04:32:20,560 They have different names like property maps 6183 04:32:20,560 --> 04:32:22,920 or hash maps or property bags, 6184 04:32:22,920 --> 04:32:24,480 depending on the language you're using, 6185 04:32:24,480 --> 04:32:26,040 but they all are the same thing. 6186 04:32:26,040 --> 04:32:27,740 They're key value pairs. 6187 04:32:28,940 --> 04:32:31,880 So the idea of a dictionary is that, 6188 04:32:31,880 --> 04:32:32,840 or the idea of any collection 6189 04:32:32,840 --> 04:32:34,800 is putting more than one thing in. 6190 04:32:34,800 --> 04:32:36,000 And then the difference is, 6191 04:32:36,000 --> 04:32:39,360 is that you have ways of indexing it. 6192 04:32:39,360 --> 04:32:41,000 So this basically line says, 6193 04:32:41,000 --> 04:32:42,400 let's make ourselves a dictionary, 6194 04:32:42,400 --> 04:32:45,160 just like we constructed an empty list. 6195 04:32:45,160 --> 04:32:47,560 And I want to store 12 into this dictionary 6196 04:32:47,560 --> 04:32:49,520 and I want to label it money. 6197 04:32:49,520 --> 04:32:52,160 And so on the left-hand side, when we use this money, 6198 04:32:52,160 --> 04:32:54,420 that's the label that we're going to give it. 6199 04:32:54,420 --> 04:32:56,240 And so 12 is being placed in the dictionary. 6200 04:32:56,240 --> 04:32:58,000 That's like taking the 12, 6201 04:32:58,000 --> 04:33:00,080 throwing it in the dictionary with a label of money. 6202 04:33:00,080 --> 04:33:00,960 I can't, yeah. 6203 04:33:02,000 --> 04:33:03,640 Three's going in with a label of candy 6204 04:33:03,640 --> 04:33:05,480 and 75 is going in with tissues. 6205 04:33:05,480 --> 04:33:06,740 We say, what's in there? 6206 04:33:06,740 --> 04:33:08,000 And there's no order to it. 6207 04:33:08,000 --> 04:33:10,040 And sometimes the order can even change 6208 04:33:10,040 --> 04:33:11,540 inside of a dictionary. 
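The purse example above might look something like this. The variable name `purse` comes from the lecture's later wording ("purse of candy"); the rest is a sketch. One note on ordering: in Python 3.7 and later a plain dict does keep insertion order, so the printout follows the order in which the items went in. Not relying on the order is still a safe habit for the programs in this chapter.

```python
purse = dict()           # make an empty dictionary
purse['money'] = 12      # store 12 under the label 'money'
purse['candy'] = 3       # store 3 under the label 'candy'
purse['tissues'] = 75    # store 75 under the label 'tissues'
print(purse)
```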
6209 04:33:11,540 --> 04:33:13,560 Although there are more advanced versions of dictionaries 6210 04:33:13,560 --> 04:33:15,240 that maintain some kind of order, 6211 04:33:15,240 --> 04:33:18,480 but for now let's just not worry about the ordering of them. 6212 04:33:19,720 --> 04:33:20,640 If we say, what's in there? 6213 04:33:20,640 --> 04:33:22,080 You say, oh, there's three things in there. 6214 04:33:22,080 --> 04:33:24,520 There is 12, 75, and three, 6215 04:33:24,520 --> 04:33:26,800 and stored under the keys, money, 6216 04:33:26,800 --> 04:33:28,880 tissues, and candy, respectively. 6217 04:33:28,880 --> 04:33:32,439 We can ask, using the index operator, 6218 04:33:32,439 --> 04:33:33,519 what is purse of candy? 6219 04:33:33,520 --> 04:33:35,640 And that's like saying, hey, give me back candy. 6220 04:33:35,640 --> 04:33:39,840 And out comes the number three, which is that. 6221 04:33:39,840 --> 04:33:40,880 We can update stuff. 6222 04:33:40,880 --> 04:33:43,320 So we can say, go grab the candy version, 6223 04:33:43,320 --> 04:33:45,480 add two to it, make five, 6224 04:33:45,480 --> 04:33:46,919 and then store that back into candy. 6225 04:33:46,919 --> 04:33:51,919 And so now we see that candy has been set up to be five. 6226 04:33:55,279 --> 04:33:58,319 And so if you look at the difference 6227 04:33:58,320 --> 04:33:59,720 between lists and dictionaries, 6228 04:33:59,720 --> 04:34:02,599 they both can have new items added to them. 6229 04:34:02,599 --> 04:34:03,879 We haven't talked a lot about deleting, 6230 04:34:03,880 --> 04:34:05,919 but items can be deleted from them. 6231 04:34:05,919 --> 04:34:07,959 The difference is the indexing mechanism, 6232 04:34:07,960 --> 04:34:10,400 how we look things up, how we store things, 6233 04:34:10,400 --> 04:34:11,680 and how we look things up. 6234 04:34:11,680 --> 04:34:14,480 So we make an empty list, we make an empty dictionary. 
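Looking up and updating a value by its key, as just described, can be sketched like this. The dictionary is rebuilt here so the example stands on its own.

```python
purse = dict()
purse['money'] = 12
purse['candy'] = 3
purse['tissues'] = 75

print(purse['candy'])                  # "give me back candy"

# Grab the old value, add two, and store the result under the same key.
purse['candy'] = purse['candy'] + 2
print(purse['candy'])
```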
6235 04:34:14,480 --> 04:34:16,840 We add 21 to the end, and we add 183 to the end, 6236 04:34:16,840 --> 04:34:18,599 and we ask it, and it says, oh, 6237 04:34:18,599 --> 04:34:21,679 position zero is 21, and position one is 183. 6238 04:34:21,680 --> 04:34:23,320 We don't see the positions when we print it out, 6239 04:34:23,320 --> 04:34:24,720 because it's sort of implicit. 6240 04:34:24,720 --> 04:34:27,020 Here we're gonna, and mark 21 with age, 6241 04:34:27,020 --> 04:34:29,400 and stick it in, and mark 182 with course, 6242 04:34:29,400 --> 04:34:31,279 and stick it in, and then we're gonna print it out, 6243 04:34:31,279 --> 04:34:34,199 and there we got course and age mapped. 6244 04:34:34,200 --> 04:34:37,520 And we can add 23 and stick it back in age, 6245 04:34:37,520 --> 04:34:40,860 and that overwrites, so the 21 becomes the 23. 6246 04:34:40,860 --> 04:34:42,200 We can do the same thing in a list, 6247 04:34:42,200 --> 04:34:43,800 except we say lists of zero, 6248 04:34:43,800 --> 04:34:46,520 because in lists, the indexing is position, 6249 04:34:46,520 --> 04:34:50,100 and so this 21 becomes 23. 6250 04:34:51,919 --> 04:34:53,479 And again, you just look at them, 6251 04:34:53,480 --> 04:34:55,099 and you can think of each of these 6252 04:34:55,099 --> 04:34:57,999 as pretty much doing roughly the same thing, 6253 04:34:58,000 --> 04:35:00,119 except the indexing mechanism. 6254 04:35:00,119 --> 04:35:03,359 The values are the same, but the keys are different. 6255 04:35:03,360 --> 04:35:05,560 So in lists, the keys are always the position, 6256 04:35:05,560 --> 04:35:06,960 and you don't get to assign those 6257 04:35:06,960 --> 04:35:09,360 other than the fact that the order in which you put them in 6258 04:35:09,360 --> 04:35:11,279 implicitly assigns a position, 6259 04:35:11,279 --> 04:35:15,039 and in dictionaries, the key is a string. 6260 04:35:15,919 --> 04:35:17,479 You can actually use other things. 
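The side-by-side comparison above can be sketched as follows. The names `lst` and `ddd` are assumptions for illustration; the values 21, 183, 182 and 23 are the ones mentioned in the lecture.

```python
# List: values are looked up by position.
lst = list()
lst.append(21)
lst.append(183)
print(lst)
lst[0] = 23          # overwrite whatever is at position zero
print(lst)

# Dictionary: values are looked up by key.
ddd = dict()
ddd['age'] = 21
ddd['course'] = 182
print(ddd)
ddd['age'] = 23      # overwrite whatever is stored under 'age'
print(ddd)
```

In both cases the 21 becomes 23. The only difference is whether you name the slot by position or by key.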
6261 04:35:17,480 --> 04:35:19,759 I use strings a lot in this lecture, 6262 04:35:19,759 --> 04:35:21,839 but that just kinda keeps things simple 6263 04:35:21,840 --> 04:35:23,720 until you get good at it. 6264 04:35:23,720 --> 04:35:27,080 You can actually use numbers as the dictionary index, 6265 04:35:27,080 --> 04:35:28,500 the dictionary keys if you want, 6266 04:35:28,500 --> 04:35:30,720 but the values are things you put in 6267 04:35:30,720 --> 04:35:33,599 and manage in those dictionaries. 6268 04:35:33,599 --> 04:35:37,319 So, just like lists, we have dictionary literals, 6269 04:35:37,320 --> 04:35:39,480 and what's nice about dictionary literals 6270 04:35:39,480 --> 04:35:43,080 is that they use the exact same syntax as the printout, 6271 04:35:43,080 --> 04:35:44,720 and so it starts with a curly brace, 6272 04:35:44,720 --> 04:35:46,080 ends with a curly brace, 6273 04:35:46,080 --> 04:35:48,400 and then has a series of key colon value, 6274 04:35:48,400 --> 04:35:50,960 key colon value, key colon value, 6275 04:35:50,960 --> 04:35:53,259 and this is sort of the associative array bit. 6276 04:35:53,259 --> 04:35:56,039 We are associating one with the key chuck. 6277 04:35:56,040 --> 04:35:58,200 We are associating 42 with the key fred, 6278 04:35:58,200 --> 04:36:00,880 and we're associating jan with 100. 6279 04:36:00,880 --> 04:36:02,919 When we print it out, it kinda looks exactly the same, 6280 04:36:02,919 --> 04:36:05,679 so the print statements in Python are nice 6281 04:36:05,680 --> 04:36:07,599 in that you ask what's in a thing, 6282 04:36:07,599 --> 04:36:10,119 you show the stuff, and it shows you in the syntax 6283 04:36:10,119 --> 04:36:11,839 that if you type that into Python, 6284 04:36:11,840 --> 04:36:16,279 that would be how you do a constant. 6285 04:36:16,279 --> 04:36:18,319 And if you just want an empty dictionary, 6286 04:36:18,320 --> 04:36:21,640 you'll also see me use dict().
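A quick sketch of the literal syntax just described. The variable names `jjj` and `ooo` are assumptions, and the key spelled 'fred' is a guess at a name the subtitles garble. The last lines illustrate the point that keys do not have to be strings.

```python
# Curly braces with key: value pairs, the same syntax Python prints back.
jjj = {'chuck': 1, 'fred': 42, 'jan': 100}
print(jjj)

# Two ways to make an empty dictionary; they produce equal results.
ooo = {}
ppp = dict()
print(ooo == ppp)

# Numbers work as keys too.
by_number = {1: 'one', 2: 'two'}
print(by_number[2])
```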
6287 04:36:21,640 --> 04:36:23,040 This is constructor where you say 6288 04:36:23,040 --> 04:36:24,599 make a new empty dictionary. 6289 04:36:24,599 --> 04:36:26,659 This is an empty dictionary constant. 6290 04:36:26,660 --> 04:36:29,460 These two things are pretty much the exact same thing. 6291 04:36:29,460 --> 04:36:33,040 This is a shortcut to doing this. 6292 04:36:33,040 --> 04:36:37,480 The empty curly braces is a shortcut 6293 04:36:37,480 --> 04:36:41,919 to do the construction. 6294 04:36:41,919 --> 04:36:43,519 So up next, we're gonna talk about 6295 04:36:43,520 --> 04:36:46,279 sort of one of the really common applications 6296 04:36:46,279 --> 04:36:48,319 of dictionaries, and that is counting. 6297 04:36:52,279 --> 04:36:53,819 So now we're gonna talk to you about 6298 04:36:53,820 --> 04:36:56,400 one of the common applications of dictionaries, 6299 04:36:56,400 --> 04:36:58,400 and that is making histograms. 6300 04:36:58,400 --> 04:37:01,200 It's counting the frequency of things. 6301 04:37:01,200 --> 04:37:03,520 And so if you think of a histogram as, 6302 04:37:03,520 --> 04:37:07,160 it's a little graph, and there is A, 6303 04:37:07,160 --> 04:37:09,360 how many A's, how many B's, and how many C's, 6304 04:37:09,360 --> 04:37:10,400 and there's a histogram that says, 6305 04:37:10,400 --> 04:37:12,599 oh, there's this many of that, and this many of that, 6306 04:37:12,599 --> 04:37:14,959 and these are like buckets, these are frequencies, 6307 04:37:14,960 --> 04:37:17,480 and this is how many times it happens, so a histogram. 6308 04:37:17,480 --> 04:37:18,680 But we're gonna do this thing 6309 04:37:18,680 --> 04:37:20,840 where we're gonna count people's names, 6310 04:37:20,840 --> 04:37:23,200 and we're gonna kinda count how many that we see. 
6311 04:37:23,200 --> 04:37:25,240 But the interesting thing that we're gonna solve, 6312 04:37:25,240 --> 04:37:27,400 just like many of the things in the computer, 6313 04:37:27,400 --> 04:37:28,840 is we can't just sort of look at the data, 6314 04:37:28,840 --> 04:37:30,540 we gotta look at the data iteratively, 6315 04:37:30,540 --> 04:37:33,320 one piece of data at a time. 6316 04:37:33,320 --> 04:37:35,919 So I'm gonna give you a little problem, okay? 6317 04:37:35,919 --> 04:37:38,939 I'm gonna show you a series of names, one at a time, 6318 04:37:38,939 --> 04:37:42,479 and I want you to count for each name, 6319 04:37:42,480 --> 04:37:44,360 make a little bucket, and then keep counting 6320 04:37:44,360 --> 04:37:46,439 how many things for each of the different names, okay? 6321 04:37:46,439 --> 04:37:48,879 You'll notice that you have to start with one, 6322 04:37:48,880 --> 04:37:51,800 and then you move across, so just watch this, 6323 04:37:51,800 --> 04:37:55,080 and tell me how many, 6324 04:37:55,080 --> 04:37:57,599 how many, what's the most common name 6325 04:37:57,599 --> 04:37:59,519 of the set of names I'm about to show you, 6326 04:37:59,520 --> 04:38:01,520 and how many do we see? 6327 04:38:01,520 --> 04:38:23,680 One, two, three, four, five. 6328 04:38:31,520 --> 04:38:36,520 So how many, what was the most common name 6329 04:38:38,240 --> 04:38:40,500 and how many times did you see it? 6330 04:38:40,500 --> 04:38:42,279 That's the question. 6331 04:38:42,279 --> 04:38:44,239 Now, here comes the review. 6332 04:38:44,240 --> 04:38:46,119 So for humans, it's so much easier for you 6333 04:38:46,119 --> 04:38:47,359 to just look at this and you think, 6334 04:38:47,360 --> 04:38:49,119 how did my brain look at that? 6335 04:38:49,119 --> 04:38:51,559 And you're like, okay, what is pretty common? 6336 04:38:51,560 --> 04:38:55,560 Oh, maybe, maybe Chen is common. 6337 04:38:55,560 --> 04:38:59,360 Oh, Chen, Chen, Chen, no. 
6338 04:38:59,360 --> 04:39:04,360 Maybe Chen is common, one, two, three, four, yeah. 6339 04:39:04,480 --> 04:39:07,759 Anybody else? Markov's got three, csev. 6340 04:39:07,759 --> 04:39:11,699 And so you'll notice how our minds, without computers, 6341 04:39:11,700 --> 04:39:15,360 we just sort of like bounce, branch and bound. 6342 04:39:15,360 --> 04:39:17,720 We have hypotheses and then we decide, 6343 04:39:17,720 --> 04:39:21,560 yep, it's Chen, that's it, and there's four of them. 6344 04:39:21,560 --> 04:39:24,279 Now, how did your brain think about this 6345 04:39:24,279 --> 04:39:27,279 as we were going through them one at a time? 6346 04:39:27,279 --> 04:39:30,239 Well, my guess is if you really had to do this a lot, 6347 04:39:30,240 --> 04:39:32,320 you would make a little picture like this. 6348 04:39:32,320 --> 04:39:36,119 And then what you would do is if you saw a new name, 6349 04:39:36,119 --> 04:39:38,120 XYZ, you'd add it to the list 6350 04:39:38,120 --> 04:39:39,880 and give it a tick mark of one. 6351 04:39:39,880 --> 04:39:42,980 And then if you saw csev again, you'd give that a tick mark. 6352 04:39:42,980 --> 04:39:45,480 And if you saw XYZ again, you'd make a tick mark. 6353 04:39:45,480 --> 04:39:48,480 And then you'd keep adding to these tick marks, right? 6354 04:39:48,480 --> 04:39:49,720 And that's how you would do it. 6355 04:39:49,720 --> 04:39:52,440 And you wouldn't, like many of the things we do in a loop, 6356 04:39:52,440 --> 04:39:54,360 you wouldn't really know what the most common one 6357 04:39:54,360 --> 04:39:55,480 was until the end. 6358 04:39:55,480 --> 04:39:57,400 And then you'd sort of take a look at these numbers 6359 04:39:57,400 --> 04:40:00,840 and you'd say, okay, that's the most common name. 6360 04:40:00,840 --> 04:40:03,040 And then you'd be done. 6361 04:40:03,040 --> 04:40:05,200 But you have to watch them one at a time. 6362 04:40:05,200 --> 04:40:06,740 You can't just bounce around.
6363 04:40:08,200 --> 04:40:12,180 And so that's how we're gonna use dictionaries 6364 04:40:12,180 --> 04:40:13,680 to achieve that. 6365 04:40:13,680 --> 04:40:16,000 Again, instinctively as humans, we just look at the stuff. 6366 04:40:16,000 --> 04:40:17,340 But if you add a million things, 6367 04:40:17,340 --> 04:40:18,680 you probably wanna write a Python program 6368 04:40:18,680 --> 04:40:19,760 and use dictionaries. 6369 04:40:19,760 --> 04:40:21,120 And so this is the idea. 6370 04:40:21,120 --> 04:40:22,600 And there's two basic things that happen. 6371 04:40:22,600 --> 04:40:24,520 One is the first time you see a name. 6372 04:40:24,520 --> 04:40:27,160 Like I say, is this name there already? 6373 04:40:27,160 --> 04:40:28,160 If it's there already, 6374 04:40:28,160 --> 04:40:30,320 you really just wanna add one to it, right? 6375 04:40:30,320 --> 04:40:31,520 That's the adding of a tick. 6376 04:40:31,520 --> 04:40:34,520 And or you wanna see for the first time, 6377 04:40:34,520 --> 04:40:36,600 you know, blah, blah, blah, blah, blah, and give it a one. 6378 04:40:36,600 --> 04:40:41,040 And so you can use the name as the key. 6379 04:40:41,040 --> 04:40:42,260 And then one is the value. 6380 04:40:42,260 --> 04:40:44,840 And then first time you see Chen, you stick one in there. 6381 04:40:44,840 --> 04:40:47,240 And so at this point inside the dictionary, 6382 04:40:47,240 --> 04:40:50,280 sort of dynamically adding as soon as it sees a new name, 6383 04:40:50,280 --> 04:40:51,720 it adds another slot in here. 6384 04:40:52,600 --> 04:40:54,120 But then if you see the same name again, 6385 04:40:54,120 --> 04:40:56,520 like Chen again, then you end up with a one, 6386 04:40:56,520 --> 04:40:58,320 add one to it, and so it's two. 6387 04:40:58,320 --> 04:40:59,600 And so at that point, Chen is two. 
6388 04:40:59,600 --> 04:41:03,400 And so you can see how you can both extend the dictionary 6389 04:41:03,400 --> 04:41:08,080 by encountering a new name or adding when you see a name 6390 04:41:08,080 --> 04:41:09,860 that you've already seen before. 6391 04:41:11,080 --> 04:41:14,160 The problem with dictionaries is like everything in Python, 6392 04:41:14,160 --> 04:41:16,600 there are rules about what you can and can't do. 6393 04:41:16,600 --> 04:41:17,680 And one of the, I think, 6394 04:41:17,680 --> 04:41:19,820 kind of frustrating things about dictionaries 6395 04:41:19,820 --> 04:41:23,160 is that you can't just look for a key that doesn't exist. 6396 04:41:23,160 --> 04:41:24,720 So this is a fresh brand new dictionary, 6397 04:41:24,720 --> 04:41:27,720 we do a constructor there, and we print out sub csev, 6398 04:41:27,720 --> 04:41:30,680 and boom, it blows up, and that's bad. 6399 04:41:30,680 --> 04:41:33,040 But we can solve this by the in operator. 6400 04:41:33,040 --> 04:41:34,680 The in operator we've used in the for loops. 6401 04:41:34,680 --> 04:41:37,140 We've used it in lists, we've used it in strings. 6402 04:41:37,140 --> 04:41:41,080 So that is a question, it's saying, is csev in CCC? 6403 04:41:41,080 --> 04:41:44,680 Well, this is this empty one, and so it is no, it is not. 6404 04:41:44,680 --> 04:41:46,240 Csev is not in CCC. 6405 04:41:46,240 --> 04:41:49,780 And so using this in operator, we can avoid the traceback. 6406 04:41:49,780 --> 04:41:52,640 We can say, if it's not there, put it in. 6407 04:41:52,640 --> 04:41:54,400 If it is there, add one to it. 6408 04:41:54,400 --> 04:41:58,160 And that leads us to this bit of code. 6409 04:41:58,160 --> 04:42:00,600 Okay, and that is the kind of code 6410 04:42:00,600 --> 04:42:01,600 that we're gonna build a histogram, 6411 04:42:01,600 --> 04:42:04,400 this is gonna histogram code, okay? 6412 04:42:04,400 --> 04:42:07,600 And so this is gonna have name as our iterator names. 
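The missing-key problem and the in check described above can be sketched like this. The name `ccc` follows the lecture's wording; catching the error with try/except is added here only so the sketch runs to the end instead of stopping at the traceback.

```python
ccc = dict()

# Looking up a key that is not there raises KeyError.
try:
    print(ccc['csev'])
except KeyError as err:
    print('lookup failed with KeyError for key', err)

# The in operator asks the question without blowing up.
print('csev' in ccc)

# So we can guard the lookup: put it in if missing, add one if present.
if 'csev' not in ccc:
    ccc['csev'] = 1
else:
    ccc['csev'] = ccc['csev'] + 1
print(ccc)
```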
6413 04:42:07,600 --> 04:42:10,480 Sorry, I made them singular and plural, that's nice, 6414 04:42:10,480 --> 04:42:13,480 but so name is gonna be csev, chen, csev, gen, and so on. 6415 04:42:13,480 --> 04:42:15,480 Now normally, we'll be reading this from a file, 6416 04:42:15,480 --> 04:42:17,800 but for now, keep it easy. 6417 04:42:17,800 --> 04:42:19,020 We're gonna go through this. 6418 04:42:19,020 --> 04:42:20,860 And we're gonna have counts as our dictionary. 6419 04:42:20,860 --> 04:42:22,380 So that starts out empty. 6420 04:42:22,380 --> 04:42:24,180 And we're gonna do a simple if then else 6421 04:42:24,180 --> 04:42:25,600 every time through the loop. 6422 04:42:25,600 --> 04:42:28,160 If the name we're looking at is not in the dictionary 6423 04:42:28,160 --> 04:42:31,460 already as a key, then set it to be one. 6424 04:42:31,460 --> 04:42:36,260 Otherwise, go get the old value, counts[name], 6425 04:42:36,260 --> 04:42:38,560 and then add one to it and stick it back in. 6426 04:42:38,560 --> 04:42:42,920 So this line right here is new, adding a new thing. 6427 04:42:42,920 --> 04:42:45,120 And this line right here is adding 6428 04:42:45,120 --> 04:42:46,960 one to an existing thing. 6429 04:42:46,960 --> 04:42:49,280 And you do this long enough, you start with an empty one, 6430 04:42:49,280 --> 04:42:52,080 and you do this long enough, at the very end, 6431 04:42:52,080 --> 04:42:55,880 it will print out the histogram that you're looking for, 6432 04:42:55,880 --> 04:42:57,280 the histogram you're looking for. 6433 04:42:57,280 --> 04:42:59,360 And so you say, oh, we've seen csev twice, 6434 04:42:59,360 --> 04:43:01,080 gen once, and chen twice. 6435 04:43:01,080 --> 04:43:02,500 And so that's the idea. 6436 04:43:02,500 --> 04:43:05,380 And so this can run a million times if you want.
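The histogram loop walked through above might look like this. The five names are spelled the way they sound in the subtitles, so the exact list is an assumption; the if/else structure is the part that matters.

```python
counts = dict()
names = ['csev', 'chen', 'csev', 'gen', 'chen']   # assumed sample data

for name in names:
    if name not in counts:
        counts[name] = 1                  # first time we see this name
    else:
        counts[name] = counts[name] + 1   # seen before: add a tick mark

print(counts)
```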
6437 04:43:09,440 --> 04:43:14,120 Now, this notion of checking to see if a key exists 6438 04:43:14,120 --> 04:43:16,040 and doing one thing if it doesn't exist 6439 04:43:16,040 --> 04:43:18,320 and doing another thing if it does exist 6440 04:43:18,320 --> 04:43:24,560 is such a common practice that the dictionary object has 6441 04:43:24,560 --> 04:43:29,560 this method called get that collapses these four lines 6442 04:43:29,560 --> 04:43:30,800 into one line. 6443 04:43:30,800 --> 04:43:33,920 And so the idea is you're going to do one thing if it's in there 6444 04:43:33,920 --> 04:43:35,840 and you're going to retrieve the current thing. 6445 04:43:35,840 --> 04:43:37,920 Otherwise, you're going to pick a default value. 6446 04:43:37,920 --> 04:43:39,440 In this case, we'll pick one. 6447 04:43:39,440 --> 04:43:40,600 I mean, you pick zero. 6448 04:43:40,600 --> 04:43:44,740 This is like the default, meaning what is not there. 6449 04:43:44,740 --> 04:43:48,040 And if you say counts, now counts is a dictionary, dot get. 6450 04:43:48,040 --> 04:43:49,240 That's like string dot upper. 6451 04:43:49,240 --> 04:43:50,380 That's a method. 6452 04:43:50,380 --> 04:43:53,440 You give it a key and then a default. 6453 04:43:53,440 --> 04:43:55,680 And if the key exists, you get back what's in the key. 6454 04:43:55,680 --> 04:43:59,480 If the key doesn't exist, you get the default. 6455 04:43:59,480 --> 04:44:02,000 And with no trace back, this works. 6456 04:44:02,000 --> 04:44:03,480 So the best way to think about this 6457 04:44:03,480 --> 04:44:08,960 is those four lines are equal to that one line. 6458 04:44:08,960 --> 04:44:11,400 Because x is either going to be whatever was in there before 6459 04:44:11,400 --> 04:44:14,120 if it exists or it's going to be zero. 6460 04:44:14,120 --> 04:44:16,200 Now, the nice thing about zero is the next thing we're 6461 04:44:16,200 --> 04:44:17,360 going to do is we're going to add one to it. 
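To make the "four lines equal one line" claim concrete, here is a sketch with two hypothetical helper functions that should always return the same thing. `dict.get(key, default)` is a real built-in method; the function names and sample dictionary are made up.

```python
def lookup_long(counts, name):
    # The four-line version: check first, then read or fall back to zero.
    if name in counts:
        x = counts[name]
    else:
        x = 0
    return x


def lookup_short(counts, name):
    # The one-line version: get returns the default when the key is missing.
    return counts.get(name, 0)


sample = {'csev': 2}
print(lookup_long(sample, 'csev'), lookup_short(sample, 'csev'))
print(lookup_long(sample, 'nobody'), lookup_short(sample, 'nobody'))
```

Note that get only reads. It does not add the missing key to the dictionary.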
6462 04:44:17,360 --> 04:44:18,840 So that's going to get us to one. 6463 04:44:18,840 --> 04:44:25,720 So collapsing that loop that we saw before, 6464 04:44:25,720 --> 04:44:28,920 collapsing that loop, we can make it just a one-line loop. 6465 04:44:28,920 --> 04:44:31,280 And this will become an idiom. 6466 04:44:31,280 --> 04:44:33,720 This will become something that you will get used to. 6467 04:44:33,720 --> 04:44:36,200 And you will use over and over and over again. 6468 04:44:36,200 --> 04:44:38,440 And after a while, right now, you're looking at it, boy, 6469 04:44:38,440 --> 04:44:41,880 boy, that's a lot of syntax and semicolons and whatever. 6470 04:44:41,880 --> 04:44:44,240 After a while, you just type this and not even think about it. 6471 04:44:44,240 --> 04:44:45,840 It's an idiom. 6472 04:44:45,840 --> 04:44:48,120 It's basically included in this idiom 6473 04:44:48,120 --> 04:44:50,880 is how to both create new entries in dictionaries 6474 04:44:50,880 --> 04:44:54,400 and update existing entries by adding one to them. 6475 04:44:54,400 --> 04:44:56,960 So everything else in this is the same. 6476 04:44:56,960 --> 04:44:59,120 Name is going to go through these five values. 6477 04:44:59,120 --> 04:45:01,400 And we're going to say counts[name] equals 6478 04:45:01,400 --> 04:45:04,960 counts.get(name, 0) plus one. 6479 04:45:04,960 --> 04:45:07,960 And so if, for example, this already has a one in it, 6480 04:45:07,960 --> 04:45:11,000 then this is going to be one plus one becomes two. 6481 04:45:11,000 --> 04:45:14,000 If it's not there, it's going to be zero plus one, which equals one. 6482 04:45:14,000 --> 04:45:18,160 And so this is the idea: if it's new, set it to one, not zero, 6483 04:45:18,160 --> 04:45:20,800 set it to one because the first time you see something, 6484 04:45:20,800 --> 04:45:22,680 the count should be one, not zero. 6485 04:45:22,680 --> 04:45:24,400 So that's why we make zero the default.
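The counting idiom described above, written out as a sketch. The five names are an assumed sample list spelled the way they sound in the subtitles; the single line inside the loop is the idiom itself.

```python
counts = dict()
names = ['csev', 'chen', 'csev', 'gen', 'chen']   # assumed sample data

for name in names:
    # Missing key: get returns 0, so the entry is created with 0 + 1 = 1.
    # Existing key: get returns the old count, so it goes up by one.
    counts[name] = counts.get(name, 0) + 1

print(counts)
```

This produces the same dictionary as the longer if/else version of the loop.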
6486 04:45:24,400 --> 04:45:26,480 Now the get can be used for anything. 6487 04:45:26,480 --> 04:45:29,240 It just so happens that zero is a common default 6488 04:45:29,240 --> 04:45:31,480 because it's really common that we're using this 6489 04:45:31,480 --> 04:45:33,600 to basically make a histogram, right? 6490 04:45:33,600 --> 04:45:36,000 Little histogram of a, b, c, right? 6491 04:45:36,000 --> 04:45:38,160 And so we need to make a d, 6492 04:45:38,160 --> 04:45:39,960 but then the histogram has to start at one. 6493 04:45:39,960 --> 04:45:43,760 So that's basically the simplified counting 6494 04:45:43,760 --> 04:45:44,600 with get. 6495 04:45:44,600 --> 04:45:47,680 And there's a lot of things that we're going to do 6496 04:45:47,680 --> 04:45:52,400 inside of Python that do have to do with frequencies 6497 04:45:52,400 --> 04:45:54,680 and how many times certain things happened. 6498 04:45:54,680 --> 04:45:57,880 And this pattern is a really good pattern 6499 04:45:57,880 --> 04:45:59,200 to absolutely know. 6500 04:46:02,920 --> 04:46:05,640 So now what we're going to do is we're going to switch 6501 04:46:05,640 --> 04:46:07,440 from just looping through strings, 6502 04:46:07,440 --> 04:46:08,720 instead loop through files. 6503 04:46:08,720 --> 04:46:10,680 And it's going to take a little bit of work 6504 04:46:10,680 --> 04:46:11,860 because we have to open the file 6505 04:46:11,860 --> 04:46:14,360 and we'll bring a lot of things together at this point. 6506 04:46:14,360 --> 04:46:16,240 So here would be another task 6507 04:46:16,240 --> 04:46:19,260 and that is here's a bunch of text from the book 6508 04:46:19,260 --> 04:46:22,360 and you can just split this into words 6509 04:46:22,360 --> 04:46:25,720 and count and find out what the most common word is 6510 04:46:25,720 --> 04:46:28,640 and how many times it occurs. 6511 04:46:28,640 --> 04:46:30,360 So go ahead and try to do this for a second. 
6512 04:46:30,360 --> 04:46:31,960 Feel free to pause. 6513 04:46:31,960 --> 04:46:33,120 Actually don't bother pausing. 6514 04:46:33,120 --> 04:46:33,960 This is too hard. 6515 04:46:33,960 --> 04:46:35,160 We should write a program for this. 6516 04:46:35,160 --> 04:46:36,640 It's not easy. 6517 04:46:36,640 --> 04:46:37,720 Humans don't like this. 6518 04:46:37,720 --> 04:46:39,040 It makes you concentrate. 6519 04:46:39,040 --> 04:46:42,560 And so here is a counting pattern 6520 04:46:42,560 --> 04:46:44,120 where we're going to take a line 6521 04:46:44,120 --> 04:46:46,840 and then later we'll read this in a file. 6522 04:46:46,840 --> 04:46:51,120 And so this is just an adaptation improvement 6523 04:46:51,120 --> 04:46:52,020 of the previous thing. 6524 04:46:52,020 --> 04:46:53,940 So we're going to start with an empty dictionary. 6525 04:46:53,940 --> 04:46:56,280 We're going to ask for a line of text and read it in. 6526 04:46:56,280 --> 04:46:57,640 And then we're going to use split. 6527 04:46:57,640 --> 04:46:58,920 So remember the list of words? 6528 04:46:58,920 --> 04:47:02,040 Well, what we're going to get here is a list of words. 6529 04:47:02,040 --> 04:47:04,160 We'll print it out and we'll run this counting. 6530 04:47:04,160 --> 04:47:06,440 This is the little loop. 6531 04:47:06,440 --> 04:47:08,600 For every word in whatever this was, 6532 04:47:08,600 --> 04:47:13,040 we're going to do this idiom of either adding a new entry 6533 04:47:13,040 --> 04:47:15,000 or adding one to an existing entry 6534 04:47:15,000 --> 04:47:16,360 and then printing that out. 6535 04:47:16,360 --> 04:47:18,800 So let's take a look at what we get there. 6536 04:47:18,800 --> 04:47:22,520 So if we run this, we can give it some text 6537 04:47:22,520 --> 04:47:24,800 and I've got this, this will be all one line. 
6538 04:47:24,800 --> 04:47:26,340 And then it splits it into words 6539 04:47:26,340 --> 04:47:29,040 and you see that these words here are split, split, 6540 04:47:29,040 --> 04:47:29,880 split, split. 6541 04:47:29,880 --> 04:47:31,580 I mean that's strings and splits. 6542 04:47:31,580 --> 04:47:34,360 Remember strings and lists and split. 6543 04:47:34,360 --> 04:47:37,440 And so now the counting is gonna go through this list. 6544 04:47:37,440 --> 04:47:39,560 The clown ran after the, 6545 04:47:39,560 --> 04:47:41,200 and it's gonna build a histogram. 6546 04:47:41,200 --> 04:47:44,400 The clown, you know, one clown, 6547 04:47:44,400 --> 04:47:47,860 the up, up, up of these things are gonna go up, right? 6548 04:47:47,860 --> 04:47:49,360 That's this histogram. 6549 04:47:49,360 --> 04:47:50,880 And then when it's all said and done, 6550 04:47:50,880 --> 04:47:52,600 we end up with the histogram. 6551 04:47:52,600 --> 04:47:54,440 And so counts is the dictionary 6552 04:47:54,440 --> 04:47:55,600 that ends up with a histogram. 6553 04:47:55,600 --> 04:47:57,680 And we can start by inspection, see, 6554 04:47:57,680 --> 04:48:00,320 oh, the is the most common word. 6555 04:48:00,320 --> 04:48:02,360 And there are seven of those, right? 6556 04:48:02,360 --> 04:48:04,260 So if we sort of take a look at this, 6557 04:48:04,260 --> 04:48:05,920 we start out, we make a dictionary, 6558 04:48:05,920 --> 04:48:08,960 we read in a line of text, the text goes in. 6559 04:48:08,960 --> 04:48:13,320 We, and then we split that and we print the words out. 6560 04:48:13,320 --> 04:48:15,240 So these are the words, right? 6561 04:48:15,240 --> 04:48:16,080 Then we have a for loop 6562 04:48:16,080 --> 04:48:18,000 that's gonna loop through all those things 6563 04:48:18,000 --> 04:48:19,840 and then produce a dictionary. 6564 04:48:19,840 --> 04:48:21,760 And when we print the dictionary out, 6565 04:48:21,760 --> 04:48:23,180 that's what we're gonna get. 
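A minimal sketch of the single-line counting pattern walked through above. The lecture reads the line with a prompt; here the text is a string literal so the sketch runs unattended, and the sentence itself is an assumed sample rather than a quote from the slide.

```python
# Assumed sample line; in the lecture this comes from input().
line = ("the clown ran after the car and the car ran into the tent "
        "and the tent fell down on the clown and the car")

counts = dict()
words = line.split()          # a list of words, split on whitespace
print("Words:", words)

for word in words:
    # add a new entry or add one to an existing entry
    counts[word] = counts.get(word, 0) + 1

print("Counts:", counts)
```

With this sample text the entry for "the" ends up at seven, which matches the count mentioned in the lecture.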
6566 04:48:23,180 --> 04:48:25,440 And the seven, okay? 6567 04:48:25,440 --> 04:48:27,660 So that's one line of text. 6568 04:48:27,660 --> 04:48:31,300 That's how you walk across the words in a line of text 6569 04:48:31,300 --> 04:48:34,440 after you split the line into separate words. 6570 04:48:34,440 --> 04:48:35,880 So now we're gonna look at ways 6571 04:48:35,880 --> 04:48:37,360 that you can loop through dictionaries. 6572 04:48:37,360 --> 04:48:39,840 We just produced a loop that can build a dictionary, 6573 04:48:39,840 --> 04:48:42,200 but now we're gonna look at a dictionary. 6574 04:48:42,200 --> 04:48:44,320 And so we'll start with a very, very simple example 6575 04:48:44,320 --> 04:48:46,880 and then we'll work to a slightly more complex example. 6576 04:48:46,880 --> 04:48:48,920 So here's a dictionary, just the constant, 6577 04:48:48,920 --> 04:48:51,960 Chuck is one, Fred's 42, and Jan's 100. 6578 04:48:51,960 --> 04:48:55,520 And so we're gonna use a definite loop with a for, 6579 04:48:55,520 --> 04:48:56,760 for key in counts. 6580 04:48:56,760 --> 04:49:00,400 Now it doesn't have to be called key, but key is a good name 6581 04:49:00,400 --> 04:49:04,560 because these are keys and values, K, V, K, V, 6582 04:49:04,560 --> 04:49:05,400 keys and values. 6583 04:49:05,400 --> 04:49:06,760 I just mentally think of this 6584 04:49:06,760 --> 04:49:08,880 as keys and values and keys and values. 6585 04:49:08,880 --> 04:49:12,700 So this iteration variable is gonna walk the keys. 6586 04:49:12,700 --> 04:49:16,160 It's not gonna walk the values, it's gonna walk the keys. 6587 04:49:16,160 --> 04:49:19,180 Chuck, Fred, Jan, not necessarily in that particular order. 6588 04:49:19,180 --> 04:49:20,880 As you see, it goes Jan, Chuck, Fred, 6589 04:49:20,880 --> 04:49:23,060 because just because I typed it in in this order, 6590 04:49:23,060 --> 04:49:25,200 it's not like a list, it doesn't stay in that order.
6591 04:49:25,200 --> 04:49:27,820 It might move around a little bit as we add data to it 6592 04:49:27,820 --> 04:49:30,440 or as we set the data up. 6593 04:49:30,440 --> 04:49:32,760 And so you can, in the loop, you can get the key, 6594 04:49:32,760 --> 04:49:35,160 and so that's what prints out the Jan, Chuck, Fred, 6595 04:49:35,160 --> 04:49:37,520 but then you can also get the corresponding count 6596 04:49:37,520 --> 04:49:41,620 for each one of these by just pulling it out of the array. 6597 04:49:41,620 --> 04:49:43,320 I mean, pulling it out of the dictionary, right? 6598 04:49:43,320 --> 04:49:45,800 And so we can pull out the corresponding value, 6599 04:49:45,800 --> 04:49:48,160 and so we print out Jan 100, Chuck 1, Fred 42, 6600 04:49:48,160 --> 04:49:50,540 and that runs this loop three times. 6601 04:49:50,540 --> 04:49:53,660 So if you just use the in and you give a dictionary here, 6602 04:49:53,660 --> 04:49:56,000 remember all the different things we've been able to put there 6603 04:49:56,000 --> 04:50:00,160 on the end of a for loop, and a dictionary's another thing 6604 04:50:00,160 --> 04:50:03,780 we can put on and we get a list of keys. 6605 04:50:03,780 --> 04:50:05,540 Now there's a couple of methods that allow us 6606 04:50:05,540 --> 04:50:09,500 to get the keys and so we have, you know, 6607 04:50:09,500 --> 04:50:11,300 we can say turn this into a list 6608 04:50:11,300 --> 04:50:12,720 and we get a list of the keys. 6609 04:50:12,720 --> 04:50:15,240 So this is a dictionary, the same dictionary. 6610 04:50:15,240 --> 04:50:16,480 We get a list of the keys. 6611 04:50:16,480 --> 04:50:19,480 You can also get a list of the keys by using the keys method. 6612 04:50:19,480 --> 04:50:21,720 So that's take this dictionary, JJJ, 6613 04:50:21,720 --> 04:50:23,600 and give me all the keys, which gives me a list, 6614 04:50:23,600 --> 04:50:25,240 which is kind of the same thing.
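The key loop and the two ways of getting the keys described above, as a small sketch. The dictionary contents follow the lecture; the variable names `keys_via_list` and `keys_via_method` are made up for this illustration.

```python
counts = {"chuck": 1, "fred": 42, "jan": 100}

# A for loop over a dictionary walks the keys, not the values.
for key in counts:
    # Look up the value that goes with each key.
    print(key, counts[key])

# Two ways to collect the keys into a list.
keys_via_list = list(counts)
keys_via_method = list(counts.keys())
print(keys_via_list)
print(keys_via_method)
```

The lecture's output shows the keys in a shuffled order. Since Python 3.7, dictionaries keep insertion order, so a current interpreter prints chuck, fred, jan here; the lecture's reordered output reflects older versions.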
6615 04:50:25,240 --> 04:50:27,120 And then we can ask for the values 6616 04:50:27,120 --> 04:50:28,780 and they give me just then the values 6617 04:50:28,780 --> 04:50:30,440 extracted out of this dictionary. 6618 04:50:30,440 --> 04:50:31,500 So that's nice. 6619 04:50:32,440 --> 04:50:34,560 Now the one thing is that while I said 6620 04:50:34,560 --> 04:50:37,120 you can't predict the order, if in two statements 6621 04:50:37,120 --> 04:50:39,320 you ask for the keys and then the values, 6622 04:50:39,320 --> 04:50:41,320 they at least come out in the same order, 6623 04:50:41,320 --> 04:50:43,020 even though you can't necessarily predict the order 6624 04:50:43,020 --> 04:50:45,800 that they come out in the same order. 6625 04:50:45,800 --> 04:50:48,960 And then there is a third thing that we can do 6626 04:50:48,960 --> 04:50:52,720 and that is list, ask for the items. 6627 04:50:52,720 --> 04:50:54,960 We can say give me the items. 6628 04:50:54,960 --> 04:50:57,780 And that gives us a list. 6629 04:50:57,780 --> 04:51:02,080 This is our first really kind of composite 6630 04:51:02,080 --> 04:51:05,120 combined data structure where it is a list, 6631 04:51:05,120 --> 04:51:08,680 a three item list, zero, one, two. 6632 04:51:08,680 --> 04:51:11,840 And inside that there is what are called two tuples. 6633 04:51:11,840 --> 04:51:15,880 Jan maps to 100, Chuck maps to one, Fred maps to 42. 6634 04:51:15,880 --> 04:51:18,000 Coming up next we're gonna have a whole chapter on that 6635 04:51:18,000 --> 04:51:21,040 and so just take a look at that for the moment 6636 04:51:21,040 --> 04:51:25,560 and we will come back to that in some detail later. 
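A short sketch of the three views mentioned above: keys, values, and items. The dictionary name `jjj` and its contents follow the lecture; wrapping each call in `list()` is how Python 3 turns these views into plain lists.

```python
jjj = {"chuck": 1, "fred": 42, "jan": 100}

keys = list(jjj.keys())      # just the keys
values = list(jjj.values())  # just the values, in the matching order
items = list(jjj.items())    # a list of (key, value) tuples

print(keys)
print(values)
print(items)
```

Because keys and values come out in matching order, pairing them up position by position gives the same pairs that `items()` returns.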
6637 04:51:25,560 --> 04:51:28,200 This whole items idea that gives us back 6638 04:51:28,200 --> 04:51:30,160 a list of key value pairs, 6639 04:51:30,160 --> 04:51:32,200 because it's not just a list of keys or a list of values, 6640 04:51:32,200 --> 04:51:33,840 it's actually a list of key value pairs, 6641 04:51:33,840 --> 04:51:37,680 allows us to write in Python a very clever and elegant loop. 6642 04:51:38,920 --> 04:51:41,960 What we can do is actually this items gives us back 6643 04:51:41,960 --> 04:51:45,840 each item in the list has a key and a value 6644 04:51:45,840 --> 04:51:47,800 and we can actually take two iteration variables. 6645 04:51:47,800 --> 04:51:51,000 For a, a, comma, b, b, b, this is two iteration variables 6646 04:51:51,000 --> 04:51:53,000 and if you're coming from another programming language, 6647 04:51:53,000 --> 04:51:55,360 this is super cool and it's a Python only feature. 6648 04:51:55,360 --> 04:51:57,480 I never have seen another language that's capable 6649 04:51:57,480 --> 04:52:00,760 of doing something this simple and that elegantly. 6650 04:52:00,760 --> 04:52:02,800 So what this basically does is says 6651 04:52:02,800 --> 04:52:04,360 we're gonna simultaneously advance 6652 04:52:04,360 --> 04:52:05,600 these two iteration variables. 6653 04:52:05,600 --> 04:52:07,720 So this is gonna be the key and the value, 6654 04:52:07,720 --> 04:52:09,160 the K and the V. 6655 04:52:09,160 --> 04:52:11,160 Key and the value is gonna be Chuck one, 6656 04:52:11,160 --> 04:52:16,000 then they're both gonna advance, Fred 42, Jan 100. 6657 04:52:16,000 --> 04:52:17,720 And so that means in this simple loop 6658 04:52:17,720 --> 04:52:18,720 if we just print them out 6659 04:52:18,720 --> 04:52:20,100 we're gonna get the key value pairs. 6660 04:52:20,100 --> 04:52:21,400 Of course in the order. 
6661 04:52:21,400 --> 04:52:23,680 And so it's sort of a, a, a, and b, b, b, 6662 04:52:23,680 --> 04:52:27,320 simultaneously walk down these key value pairs. 6663 04:52:27,320 --> 04:52:29,160 And so that's really pretty 6664 04:52:29,160 --> 04:52:31,080 and it makes for a very succinct loop. 6665 04:52:31,080 --> 04:52:33,640 It's the syntax is a little sort of disquieting 6666 04:52:33,640 --> 04:52:36,440 when you first see it, but it's a super elegant thing 6667 04:52:36,440 --> 04:52:38,600 and you just have to say items. 6668 04:52:38,600 --> 04:52:41,800 If you don't say items, you just get the keys. 6669 04:52:41,800 --> 04:52:43,640 If you say items, you get the key value pairs 6670 04:52:43,640 --> 04:52:46,160 and you have to have two iteration variables. 6671 04:52:46,160 --> 04:52:47,520 If you don't have two iteration variables 6672 04:52:47,520 --> 04:52:48,920 and use items, it'll complain and say, 6673 04:52:48,920 --> 04:52:50,080 what are you doing? 6674 04:52:50,080 --> 04:52:51,200 I'm giving you two things 6675 04:52:51,200 --> 04:52:53,000 and you don't have two variables to receive them. 6676 04:52:53,000 --> 04:52:57,380 So two iteration variables and items are basically related. 6677 04:52:58,560 --> 04:53:01,720 Now we're going to take a look 6678 04:53:01,720 --> 04:53:05,900 and this is code that I showed you perhaps many weeks ago 6679 04:53:05,900 --> 04:53:08,080 about, I said this is a little story 6680 04:53:08,080 --> 04:53:11,560 about how to read a file and count all the words in the file. 6681 04:53:11,560 --> 04:53:12,780 And now we're back to it. 6682 04:53:12,780 --> 04:53:14,560 And at this point you should understand 6683 04:53:14,560 --> 04:53:16,680 every single character of this program, 6684 04:53:16,680 --> 04:53:19,000 every single concept of the program. 
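The two-iteration-variable loop over `items()` discussed above can be sketched like this. The names `aaa` and `bbb` follow the lecture; the `pairs` list is a hypothetical addition that only exists so the result can be inspected afterwards.

```python
jjj = {"chuck": 1, "fred": 42, "jan": 100}

pairs = []  # hypothetical helper list, not part of the lecture's loop
for aaa, bbb in jjj.items():
    # aaa takes each key and bbb takes the matching value
    print(aaa, bbb)
    pairs.append((aaa, bbb))
```

One observation about current Python that the lecture does not make: a single loop variable used with `items()` would simply receive each pair as one tuple, so the two-variable form is a convenience for unpacking the pair.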
6685 04:53:19,000 --> 04:53:20,480 You should literally stare at this 6686 04:53:20,480 --> 04:53:22,480 and look at it, code it, play with it 6687 04:53:22,480 --> 04:53:24,880 until you absolutely understand it. 6688 04:53:24,880 --> 04:53:26,560 So let's take a look. 6689 04:53:27,840 --> 04:53:29,680 Again, I showed you this weeks ago. 6690 04:53:30,640 --> 04:53:33,240 So we're going to ask for a file name. 6691 04:53:33,240 --> 04:53:35,360 Then we're going to open the file name. 6692 04:53:35,360 --> 04:53:37,200 Then we're going to make an empty dictionary. 6693 04:53:37,200 --> 04:53:39,360 Again, this is all stuff you've done before. 6694 04:53:39,360 --> 04:53:40,960 And then we're going to have an iteration variable 6695 04:53:40,960 --> 04:53:44,340 that's going to go through the lines in the file, right? 6696 04:53:44,340 --> 04:53:47,260 So line is going to go line, line, line. 6697 04:53:47,260 --> 04:53:49,320 Then we are going to split that line, 6698 04:53:49,320 --> 04:53:52,320 each line into words, chop, chop, chop, chop. 6699 04:53:53,160 --> 04:53:57,000 So that's words is the list of the words in one line. 6700 04:53:57,000 --> 04:53:57,840 We're inside of a loop 6701 04:53:57,840 --> 04:53:59,560 that's going to go through all the lines. 6702 04:53:59,560 --> 04:54:01,080 And then what we're going to do 6703 04:54:01,080 --> 04:54:04,840 is we're going to have the word iteration 6704 04:54:04,840 --> 04:54:07,160 iterate through each word in the line. 6705 04:54:07,160 --> 04:54:08,000 And then what we're going to do 6706 04:54:08,000 --> 04:54:10,400 is take each word in the line. 6707 04:54:10,400 --> 04:54:12,800 I'm going to do this histogram, right? 6708 04:54:12,800 --> 04:54:16,000 So this is going to run not only just for every line, 6709 04:54:16,000 --> 04:54:17,400 but for every word in every line. 6710 04:54:17,400 --> 04:54:19,840 So we have a nested loop for every line. 
6711 04:54:19,840 --> 04:54:21,940 Then we split it and then we go across the line. 6712 04:54:21,940 --> 04:54:22,960 So it's almost like a typewriter. 6713 04:54:22,960 --> 04:54:27,000 We go, tch, tch, tch, tch, tch, tch, tch, tch, tch, tch. 6714 04:54:27,000 --> 04:54:27,960 And that's what we're doing. 6715 04:54:27,960 --> 04:54:32,840 Tch, tch, tch, tch, tch, tch, tch, tch, tch, tch, tch, tch, tch, tch, tch, tch. 6716 04:54:32,840 --> 04:54:36,520 So it's like the outer loop is going down, down, down the lines. 6717 04:54:36,520 --> 04:54:39,320 And the inner loop is going across, across, across the words. 6718 04:54:39,320 --> 04:54:41,080 And eventually we are going to see 6719 04:54:41,080 --> 04:54:43,040 in this middle, in this last line, 6720 04:54:43,040 --> 04:54:44,960 every single word in the file. 6721 04:54:44,960 --> 04:54:47,760 And we're going to do the counts.get word comma zero plus one, 6722 04:54:47,760 --> 04:54:50,800 which is our magic histogram making line. 6723 04:54:50,800 --> 04:54:52,400 If you don't remember what that is, 6724 04:54:52,400 --> 04:54:54,800 go back a couple of slides, I just talked about it. 6725 04:54:54,800 --> 04:54:56,200 At this point in the code, 6726 04:54:56,200 --> 04:54:57,840 and it's important to be able to draw these lines, 6727 04:54:57,840 --> 04:55:00,640 at this point in the code, you have the histogram 6728 04:55:00,640 --> 04:55:02,840 and it's in the variable counts. 6729 04:55:02,840 --> 04:55:06,320 Now, we want to find the largest one. 6730 04:55:06,320 --> 04:55:08,520 Now we have written loops 6731 04:55:08,520 --> 04:55:10,640 that can find the largest in a list, 6732 04:55:10,640 --> 04:55:13,080 but now we want to find the largest value 6733 04:55:13,080 --> 04:55:15,360 in the key value pairs of a dictionary.
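A sketch of the nested loop described above: the outer loop goes down the lines and the inner loop goes across the words. To keep it self-contained, an in-memory `io.StringIO` object stands in for the opened file; that substitution and the two sample lines are assumptions, not the lecture's code, which opens a file named at a prompt.

```python
import io

# Stand-in for open(fname): iterating it yields one line at a time.
handle = io.StringIO("the clown ran\nafter the car\n")

counts = dict()
for line in handle:               # outer loop: down the lines
    words = line.split()          # the words on this one line
    for word in words:            # inner loop: across the words
        counts[word] = counts.get(word, 0) + 1

print(counts)
```

After both loops finish, `counts` holds the histogram for every word in the whole input.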
6734 04:55:16,440 --> 04:55:18,200 So we're going to start with, 6735 04:55:18,200 --> 04:55:20,600 we're going to know what the largest count is 6736 04:55:20,600 --> 04:55:22,720 and the word that has that count. 6737 04:55:22,720 --> 04:55:24,200 And we're going to set them both to None 6738 04:55:24,200 --> 04:55:25,280 because we're going to prime our loop. 6739 04:55:25,280 --> 04:55:27,360 We have to prime our loop and we're going to set them to None. 6740 04:55:27,360 --> 04:55:29,560 And so then we're going to write one of these cool things 6741 04:55:29,560 --> 04:55:31,480 that says for word, comma, count. 6742 04:55:31,480 --> 04:55:32,920 So word and count are going to go through 6743 04:55:32,920 --> 04:55:35,880 the key value pairs because we've got items here. 6744 04:55:35,880 --> 04:55:37,320 So it's going to go through the key value pairs, 6745 04:55:37,320 --> 04:55:40,200 loop through each key, whatever it was. 6746 04:55:40,200 --> 04:55:42,120 There could be a million words in here. 6747 04:55:42,120 --> 04:55:43,440 We're going to go through every one. 6748 04:55:43,440 --> 04:55:45,960 And what we're going to do is we're going to make sure 6749 04:55:45,960 --> 04:55:49,280 that big count is the current largest count 6750 04:55:49,280 --> 04:55:50,560 we've seen so far. 6751 04:55:50,560 --> 04:55:54,160 And if it's None, well, then we haven't seen anything, 6752 04:55:54,160 --> 04:55:56,360 or if the count we just read 6753 04:55:56,360 --> 04:55:58,760 is greater than the big count so far, 6754 04:55:58,760 --> 04:56:01,000 we are going to jump in and this is sort of like, 6755 04:56:01,000 --> 04:56:03,800 oh, this is a new personal best count 6756 04:56:03,800 --> 04:56:05,800 for this particular dataset. 6757 04:56:05,800 --> 04:56:08,160 And so we're going to remember the word in big word 6758 04:56:08,160 --> 04:56:10,200 and we're going to remember the count in big count.
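The maximum loop over key/value pairs described above, as a minimal sketch. The names `bigword` and `bigcount` follow the lecture's "big word" and "big count"; the small histogram is assumed sample data.

```python
# Assumed sample histogram; in the lecture this is built from a file.
counts = {"clown": 2, "the": 7, "car": 3}

bigcount = None   # largest count seen so far (None = nothing seen yet)
bigword = None    # the word that goes with that count

for word, count in counts.items():
    # Take the first pair unconditionally, then any larger count after that.
    if bigcount is None or count > bigcount:
        bigword = word
        bigcount = count

print(bigword, bigcount)
```

Checking `bigcount is None` first matters: comparing an integer with `None` using `>` raises a `TypeError` in Python 3, so the `or` has to short-circuit on the first pass.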
6759 04:56:10,200 --> 04:56:11,720 So this is just a max loop. 6760 04:56:12,720 --> 04:56:14,760 It's a maximum loop with the extra thing 6761 04:56:14,760 --> 04:56:18,360 that we're recording, in addition to what count 6762 04:56:18,360 --> 04:56:21,120 is the largest, the word that was associated 6763 04:56:21,120 --> 04:56:22,560 with that count, we're recording it. 6764 04:56:22,560 --> 04:56:24,200 So again, this is a starting part of the loop. 6765 04:56:24,200 --> 04:56:25,120 We're going to do some work. 6766 04:56:25,120 --> 04:56:27,400 And then when we exit the bottom of this, 6767 04:56:27,400 --> 04:56:30,200 big word is going to be the word that is the most common 6768 04:56:30,200 --> 04:56:32,400 and big count is the number of times. 6769 04:56:32,400 --> 04:56:35,920 And so if we run a file, we say, oh, in that file, 6770 04:56:35,920 --> 04:56:37,520 "to" is the most common word. 6771 04:56:37,520 --> 04:56:38,720 And it's 16 times. 6772 04:56:38,720 --> 04:56:41,040 If we run the clown file, well, 6773 04:56:41,040 --> 04:56:42,960 "the" is the most common word, at seven. 6774 04:56:42,960 --> 04:56:46,600 And so this now is, and this could have a very large file 6775 04:56:46,600 --> 04:56:49,800 and give you the most common word. 6776 04:56:49,800 --> 04:56:54,800 And so that is sort of a really good application 6777 04:56:55,140 --> 04:56:56,480 of dictionaries. 6778 04:56:56,480 --> 04:56:58,520 So dictionaries are the most powerful, 6779 04:56:58,520 --> 04:57:02,040 well, they're the most powerful collection 6780 04:57:02,040 --> 04:57:03,240 we've seen so far. 6781 04:57:04,320 --> 04:57:06,300 It is good to see both lists and dictionaries 6782 04:57:06,300 --> 04:57:08,840 to understand what collections are. 6783 04:57:08,840 --> 04:57:12,240 They are things inside of Python that can handle 6784 04:57:12,240 --> 04:57:13,940 more than one item inside of them.
6785 04:57:13,940 --> 04:57:15,560 And we'll learn about another collection 6786 04:57:15,560 --> 04:57:16,900 about tuples in a second. 6787 04:57:18,240 --> 04:57:20,200 Just understand the get method 6788 04:57:20,200 --> 04:57:23,280 because that leads to very compact code, 6789 04:57:23,280 --> 04:57:24,720 understanding their various ways 6790 04:57:24,720 --> 04:57:25,920 to iterate through dictionaries. 6791 04:57:25,920 --> 04:57:29,520 And so we've learned a lot, but in the next section, 6792 04:57:29,520 --> 04:57:32,160 we will learn even more and put these together 6793 04:57:32,160 --> 04:57:34,320 and do some sorting and do some other stuff 6794 04:57:34,320 --> 04:57:38,140 and really start to see the real power of dictionaries. 6795 04:57:42,320 --> 04:57:44,480 This is, I'm gonna do some coding. 6796 04:57:44,480 --> 04:57:48,660 It's related to the dictionaries chapter, chapter nine. 6797 04:57:48,660 --> 04:57:51,040 And we're gonna do some word counting. 6798 04:57:51,040 --> 04:57:55,720 That's basically right out of the slides for, 6799 04:57:55,720 --> 04:57:58,600 but I'm gonna just write the code in front of you 6800 04:57:58,600 --> 04:58:00,980 rather than have you look at it in the book. 6801 04:58:00,980 --> 04:58:04,180 So what we're gonna do is I've got my text editor 6802 04:58:04,180 --> 04:58:09,180 up here and let me start by making a new folder. 6803 04:58:09,240 --> 04:58:14,240 New folder for my chapter nine exercise. 6804 04:58:14,760 --> 04:58:17,440 And then I'm gonna go and make an untitled file. 6805 04:58:17,440 --> 04:58:19,280 That was from the previous one. 6806 04:58:19,280 --> 04:58:24,280 And I'll do what I always do, print hello and save it. 6807 04:58:28,040 --> 04:58:31,480 And save it here into exercise 09 6808 04:58:31,480 --> 04:58:35,880 and ex09.py. 6809 04:58:35,880 --> 04:58:40,040 So now I have a folder that's in my py4e folder 6810 04:58:40,040 --> 04:58:44,080 and that happens to be in my desktop. 
6811 04:58:44,080 --> 04:58:47,900 Py4e is my folder on my desktop. 6812 04:58:47,900 --> 04:58:50,900 And now I have all of these subfolders, cd ex08. 6813 04:58:53,520 --> 04:58:55,520 ls is dir on windows. 6814 04:58:57,040 --> 04:59:01,440 ls, oops, I gotta go up one. 6815 04:59:01,440 --> 04:59:05,760 cd ex09 ls, so I've got that file right there. 6816 04:59:05,760 --> 04:59:07,800 Now I'm gonna wanna read some files 6817 04:59:07,800 --> 04:59:10,880 and so I'm gonna bring some files down, 6818 04:59:10,880 --> 04:59:14,960 a couple of files, Python for everybody, 6819 04:59:14,960 --> 04:59:18,720 code3, intro.txt, so I've got this URL 6820 04:59:18,720 --> 04:59:21,440 and I'm gonna save it, save page as. 6821 04:59:21,440 --> 04:59:23,360 And it's really important that I save it 6822 04:59:23,360 --> 04:59:26,160 in the same folder as I'm gonna write my code 6823 04:59:26,160 --> 04:59:28,800 just so that when I open this file it knows where it's at. 6824 04:59:28,800 --> 04:59:30,400 So I've saved that one. 6825 04:59:30,400 --> 04:59:33,840 And I'm gonna also take this clown text. 6826 04:59:33,840 --> 04:59:37,040 I'll use this to make my life simple 6827 04:59:37,040 --> 04:59:39,040 so I have a real short thing that I can show you 6828 04:59:39,040 --> 04:59:42,480 how it works and so now if I go back to my terminal, 6829 04:59:44,320 --> 04:59:48,400 I see I've got exercise 09 Python, intro.txt 6830 04:59:48,400 --> 04:59:51,160 and clown.txt, okay? 6831 04:59:51,160 --> 04:59:55,020 So let's go back to my text editor and get started. 6832 04:59:55,020 --> 05:00:00,020 I will prompt for the file name input enter file colon space. 6833 05:00:08,420 --> 05:00:09,540 Now I'm gonna do something. 6834 05:00:09,540 --> 05:00:14,540 If the length of the F name that I just read 6835 05:00:16,760 --> 05:00:21,760 is less than one, I'm gonna say F name equals clown.txt. 
6836 05:00:21,760 --> 05:00:26,760 I do this so that I can just hit enter 6837 05:00:28,640 --> 05:00:30,400 and it defaults to clown.txt. 6838 05:00:30,400 --> 05:00:32,720 If I want to give it a different name, I can. 6839 05:00:32,720 --> 05:00:35,660 So if I just hit enter at this prompt, 6840 05:00:35,660 --> 05:00:38,260 then this will give me a string that's zero length. 6841 05:00:38,260 --> 05:00:40,840 So if it's less than one, I'll just assume that. 6842 05:00:40,840 --> 05:00:42,300 So let me open that. 6843 05:00:43,480 --> 05:00:47,640 Handle equals open F name. 6844 05:00:47,640 --> 05:00:52,640 And let's read through it for line in handle. 6845 05:00:58,540 --> 05:01:02,300 We'll strip it, line equals line.rstrip 6846 05:01:02,300 --> 05:01:05,180 to take the white space off the right hand side 6847 05:01:05,180 --> 05:01:07,500 and then we're gonna say print line. 6848 05:01:07,500 --> 05:01:09,580 Again, I'm not just doing this. 6849 05:01:09,580 --> 05:01:12,660 I really, when I write code, I just saved it. 6850 05:01:12,660 --> 05:01:16,340 When I write code, I do these kind of stuff all the time 6851 05:01:16,340 --> 05:01:18,900 just for my own sanity checking. 6852 05:01:18,900 --> 05:01:22,900 And so now I'm gonna run python3 ex09.py 6853 05:01:25,700 --> 05:01:27,420 just to test that. 6854 05:01:27,420 --> 05:01:29,260 I'm gonna hit enter now and it's gonna assume, 6855 05:01:29,260 --> 05:01:31,420 hopefully, clown.txt if it all goes well. 6856 05:01:31,420 --> 05:01:35,140 And yep, it read one line, okay? 6857 05:01:35,140 --> 05:01:36,460 So that part's working. 6858 05:01:36,460 --> 05:01:38,580 I'll just leave that print statement in. 6859 05:01:38,580 --> 05:01:41,980 The next thing I wanna do is kind of a classic thing 6860 05:01:41,980 --> 05:01:44,060 where we're gonna go read a bunch of lines 6861 05:01:44,060 --> 05:01:46,780 and then go horizontally across those lines in words. 
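The "just hit enter to get clown.txt" trick described above, sketched as a small function so it can be tried without typing at a prompt. The function name `pick_filename` and its `default` parameter are hypothetical; the lecture writes the same check inline right after `input('Enter file: ')`.

```python
def pick_filename(typed, default="clown.txt"):
    """Return the typed name, or the default when nothing was typed."""
    # Pressing Enter at the prompt yields an empty string of length zero.
    if len(typed) < 1:
        return default
    return typed


# Inline use would look like: fname = pick_filename(input("Enter file: "))
print(pick_filename(""))            # falls back to the default
print(pick_filename("intro.txt"))   # keeps what was typed
```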
6862 05:01:46,780 --> 05:01:47,940 So I'm gonna split that. 6863 05:01:47,940 --> 05:01:52,380 WDS equals line.split 6864 05:01:54,500 --> 05:01:57,540 and print WDS. 6865 05:01:57,540 --> 05:02:01,500 So I'll print that and I'm gonna save it and test it. 6866 05:02:01,500 --> 05:02:04,700 I really love to test things over and over. 6867 05:02:04,700 --> 05:02:05,740 There's the actual line. 6868 05:02:05,740 --> 05:02:08,860 This file clown.txt only has one line 6869 05:02:08,860 --> 05:02:11,940 and it breaks it into words and so I have those words. 6870 05:02:11,940 --> 05:02:16,940 Let's just run it again with intro.txt. 6871 05:02:17,540 --> 05:02:18,820 So this will have a lot of lines. 6872 05:02:18,820 --> 05:02:21,180 Line, line, line, line, line, lots of lines. 6873 05:02:21,180 --> 05:02:23,020 Every line has a prints out the line 6874 05:02:23,020 --> 05:02:26,500 and then prints out the words that we split it into, okay? 6875 05:02:26,500 --> 05:02:29,580 So now I kinda, one of the things that I do here 6876 05:02:29,580 --> 05:02:32,400 is I wanna believe, now I sort of can believe 6877 05:02:32,400 --> 05:02:34,180 everything from here up. 6878 05:02:34,180 --> 05:02:35,740 Like, oh, it's gonna open the file. 6879 05:02:35,740 --> 05:02:36,700 It's gonna read through the lines 6880 05:02:36,700 --> 05:02:38,260 and I'm gonna split them into words. 6881 05:02:38,260 --> 05:02:40,180 And so then I'll just kind of behind it, 6882 05:02:40,180 --> 05:02:44,140 I'll just say, okay, I'll just comment that out. 6883 05:02:44,140 --> 05:02:49,140 Now I need another for loop for W in WDS. 6884 05:02:51,060 --> 05:02:54,740 Now words is a Python list and has some number of words 6885 05:02:54,740 --> 05:02:57,980 in it, zero or 12 or whatever was on the line. 6886 05:02:57,980 --> 05:03:00,980 And now I'm gonna print out the word, okay? 6887 05:03:06,380 --> 05:03:09,500 And so now it will go through that horizontally. 
6888 05:03:09,500 --> 05:03:11,740 Now I'll just do clown.txt. 6889 05:03:11,740 --> 05:03:15,300 So that you see, I'm not printing the line out. 6890 05:03:15,300 --> 05:03:17,340 That's the words that have been parsed from the, 6891 05:03:17,340 --> 05:03:18,700 split from the line. 6892 05:03:18,700 --> 05:03:20,340 And now we got this loop. 6893 05:03:20,340 --> 05:03:22,160 Now, one of the thing that's interesting 6894 05:03:22,160 --> 05:03:25,600 is just to make sure that you're going through all the words. 6895 05:03:25,600 --> 05:03:28,260 And I like a print statement here 6896 05:03:28,260 --> 05:03:32,900 to know that W is going to successfully take on literally 6897 05:03:32,900 --> 05:03:34,220 all the words of this file. 6898 05:03:34,220 --> 05:03:37,060 So if I comment this print statement out 6899 05:03:37,060 --> 05:03:42,060 and I run it again, clown.txt, that for loop starting 6900 05:03:42,780 --> 05:03:44,860 from here is every word in that file 6901 05:03:44,860 --> 05:03:46,700 which happens to only be one line. 6902 05:03:46,700 --> 05:03:49,160 But now if I do the same thing for intro.txt, 6903 05:03:53,980 --> 05:03:55,380 it's just gonna go through the words. 6904 05:03:55,380 --> 05:03:57,660 And in a sense, by nesting these two loops, 6905 05:03:57,660 --> 05:03:59,000 we're gonna hit all the lines. 6906 05:03:59,000 --> 05:04:02,600 And that's a lot of stuff, but it hit all of the lines, 6907 05:04:02,600 --> 05:04:04,340 all the words, and away we go. 6908 05:04:04,340 --> 05:04:07,540 Okay, so here's where a dictionary comes in. 6909 05:04:10,020 --> 05:04:14,620 I'm gonna make a variable called DI for dictionary. 6910 05:04:14,620 --> 05:04:16,720 And I'm gonna say, give me a dictionary. 6911 05:04:16,720 --> 05:04:19,180 Now, D-I-C-T is not something you can choose. 6912 05:04:19,180 --> 05:04:22,780 That's saying make, that's defining the type of dictionary. 6913 05:04:22,780 --> 05:04:25,100 DI is a variable that I chose. 
6914 05:04:25,100 --> 05:04:28,220 Okay, so the key thing to this dictionary 6915 05:04:28,220 --> 05:04:29,940 is we're gonna make a counter. 6916 05:04:29,940 --> 05:04:33,740 And we're gonna use W, the word absorb, 6917 05:04:33,740 --> 05:04:37,460 elegant, whatever, and we're gonna use that as the index. 6918 05:04:37,460 --> 05:04:42,460 So the simple thing to do is to say if W is in DI, 6919 05:04:45,820 --> 05:04:50,820 then we can say W's, I mean, the dictionary sub the word, 6920 05:04:51,380 --> 05:04:55,580 which is our key in the key value store of the dictionary, 6921 05:04:55,580 --> 05:04:58,940 is equal to the value that we had before in that area, 6922 05:04:58,940 --> 05:05:01,260 DI sub W plus one. 6923 05:05:01,260 --> 05:05:06,260 And if it's not in there, else DI sub W equals one. 6924 05:05:14,920 --> 05:05:17,920 And I'm gonna print, print new. 6925 05:05:24,760 --> 05:05:28,720 So every time we see a new word, it's gonna say new. 6926 05:05:28,720 --> 05:05:33,480 And I'm going to also then print W and the current value 6927 05:05:33,480 --> 05:05:36,380 of the counter for W as it's going through. 6928 05:05:36,380 --> 05:05:38,240 Now notice how far in I'm indented. 6929 05:05:38,240 --> 05:05:40,600 This is all part of this inner loop. 6930 05:05:40,600 --> 05:05:44,240 So this is the loop that's gonna run every single word. 6931 05:05:44,240 --> 05:05:47,760 Okay, and I'm gonna run this first with clown. 6932 05:05:49,120 --> 05:05:54,120 So it runs slowly, okay, so we saw the was new 6933 05:05:54,120 --> 05:05:59,120 and the count is one, clown is new, count is one, 6934 05:06:00,040 --> 05:06:02,120 ran is new, the count is one. 6935 05:06:02,120 --> 05:06:04,080 After is new, the count is one. 6936 05:06:04,080 --> 05:06:07,460 Now we saw the again, but now we made the count be two. 6937 05:06:09,980 --> 05:06:10,940 Let's print here. 6938 05:06:19,360 --> 05:06:20,520 I'll say existing.
6939 05:06:20,520 --> 05:06:24,160 So you can kind of see it. 6940 05:06:24,160 --> 05:06:25,880 Now in the print, I'm printing this, 6941 05:06:25,880 --> 05:06:27,640 let's make it even a little more verbose. 6942 05:06:27,640 --> 05:06:32,640 Print W and then I will make it so it prints the, 6943 05:06:34,880 --> 05:06:38,200 it prints the word before and the count after 6944 05:06:38,200 --> 05:06:40,200 and then whether it's existing or new. 6945 05:06:40,200 --> 05:06:42,040 So we'll put a lot of print statements in. 6946 05:06:42,040 --> 05:06:45,760 Print statements are cheap, okay? 6947 05:06:45,760 --> 05:06:49,400 So now we see the word the, it's the first time we see it 6948 05:06:49,400 --> 05:06:50,640 and we set it to one. 6949 05:06:50,640 --> 05:06:52,720 We see the clown, it's the first time we see it, 6950 05:06:52,720 --> 05:06:53,680 we set it to one. 6951 05:06:53,680 --> 05:06:55,680 We see ran, new, one. 6952 05:06:56,680 --> 05:07:00,640 Later on, we see the, it's already in. 6953 05:07:00,640 --> 05:07:03,920 So existing means it was already in the dictionary. 6954 05:07:03,920 --> 05:07:08,200 W as a key was already in the dictionary, okay? 6955 05:07:08,200 --> 05:07:10,000 And so that's why we added one to it. 6956 05:07:10,000 --> 05:07:14,280 So the old value was one and then we added di sub the 6957 05:07:14,280 --> 05:07:18,780 equals di sub, di sub the equals di sub the plus one. 6958 05:07:18,780 --> 05:07:21,360 W is the string, the, T-H-E. 6959 05:07:21,360 --> 05:07:24,960 That's what that string is, okay? 6960 05:07:24,960 --> 05:07:27,280 And so we've made it all the way through 6961 05:07:27,280 --> 05:07:29,760 and you see the in this one line occurred 6962 05:07:29,760 --> 05:07:31,420 ultimately seven times. 6963 05:07:31,420 --> 05:07:34,880 So now I want to print out the contents of this dictionary 6964 05:07:35,880 --> 05:07:37,440 at the very end of both loops. 
6965 05:07:37,440 --> 05:07:39,480 So I've got to de-indent twice 6966 05:07:41,520 --> 05:07:43,520 and so that will give us the counts. 6967 05:07:43,520 --> 05:07:44,360 Okay? 6968 05:07:48,240 --> 05:07:51,000 And so this is what we get when it's all said and done. 6969 05:07:51,000 --> 05:07:54,120 You know, the happened seven times, 6970 05:07:54,120 --> 05:07:57,480 but it just worked its way through, okay? 6971 05:07:58,560 --> 05:08:00,080 So you got that. 6972 05:08:00,080 --> 05:08:03,120 Now, this is a pretty verbose way of doing this 6973 05:08:03,120 --> 05:08:05,000 but I did it sort of the slow way to show 6974 05:08:05,000 --> 05:08:06,680 that there are two situations. 6975 05:08:06,680 --> 05:08:08,640 If it's already there, you increment it 6976 05:08:08,640 --> 05:08:10,400 and if it's not there, you set it to one, 6977 05:08:10,400 --> 05:08:12,040 effectively inserting it, right? 6978 05:08:12,040 --> 05:08:14,160 So you insert it and set it to one 6979 05:08:14,160 --> 05:08:18,220 with this di sub the equals one, okay? 6980 05:08:18,220 --> 05:08:20,520 But let's get a little less verbose here, 6981 05:08:20,520 --> 05:08:21,920 get rid of some of these print statements 6982 05:08:21,920 --> 05:08:23,720 because we kind of covered all that. 6983 05:08:29,240 --> 05:08:33,160 Get rid of this line and go back to printing w and di sub w 6984 05:08:33,160 --> 05:08:34,960 at the end, we'll leave that one in. 6985 05:08:34,960 --> 05:08:36,960 So what I want to do is I want to look 6986 05:08:36,960 --> 05:08:41,960 at this bit of code right here, this if w in di, else. 6987 05:08:45,880 --> 05:08:47,560 We do this so much with dictionaries 6988 05:08:47,560 --> 05:08:52,120 that there is an easy mechanism to do this 6989 05:08:52,120 --> 05:08:53,400 that combines these four lines 6990 05:08:53,400 --> 05:08:55,680 into a single kind of contraction.
6991 05:08:55,680 --> 05:08:58,340 And so I'm gonna do this, I'm gonna print, 6992 05:09:00,100 --> 05:09:03,000 let's put two stars out, then the word 6993 05:09:03,000 --> 05:09:08,000 and di dot get of the word comma negative 99. 6994 05:09:13,360 --> 05:09:17,680 Okay, and so this di dot get of the word 6995 05:09:17,680 --> 05:09:19,080 is the important part. 6996 05:09:19,080 --> 05:09:21,600 The way it works is, this is a dictionary, 6997 05:09:21,600 --> 05:09:24,520 dot get takes as its first parameter 6998 05:09:24,520 --> 05:09:26,720 the key to look up, which is a word like the 6999 05:09:26,720 --> 05:09:28,880 or fell or clown or whatever, 7000 05:09:28,880 --> 05:09:31,800 and negative 99 is the default value that we get 7001 05:09:31,800 --> 05:09:33,480 if the key doesn't exist. 7002 05:09:33,480 --> 05:09:38,040 So this is, in effect, an if then else, right? 7003 05:09:38,040 --> 05:09:42,680 This little di dot get of w comma negative 99 is, 7004 05:09:44,540 --> 05:09:45,960 if it's in there, do one thing, 7005 05:09:45,960 --> 05:09:48,080 if it's not in there, do something else, okay? 7006 05:09:48,080 --> 05:09:50,420 So let me show you how this works 7007 05:09:50,420 --> 05:09:53,900 and you'll see that the negative 99 will happen when, 7008 05:09:53,900 --> 05:09:58,900 okay, so the first time we see the, get returns negative 99, 7009 05:10:05,880 --> 05:10:07,560 right, so let's move it over here. 7010 05:10:07,560 --> 05:10:11,520 The first time we see the, the is not in the dictionary. 7011 05:10:11,520 --> 05:10:15,960 So this di dot get of the word the in the dictionary 7012 05:10:15,960 --> 05:10:19,940 gives us back the negative 99, okay? 7013 05:10:19,940 --> 05:10:22,880 And this still is working and so the is one, 7014 05:10:22,880 --> 05:10:27,160 clown is whatever, but away we go, okay? 7015 05:10:27,160 --> 05:10:29,040 Let's do it this way, let me comment this out.
7016 05:10:29,040 --> 05:10:31,240 Let me comment this one out and run it again 7017 05:10:31,240 --> 05:10:34,120 so it's a little clearer what's going on. 7018 05:10:34,120 --> 05:10:38,400 Okay, so the first time we see the, 7019 05:10:38,400 --> 05:10:39,680 the is not in the dictionary. 7020 05:10:39,680 --> 05:10:40,800 The first time we see clown, 7021 05:10:40,800 --> 05:10:43,600 we get negative 99 as well, 7022 05:10:43,600 --> 05:10:47,060 but here we asked for it and the is one 7023 05:10:47,060 --> 05:10:50,120 because we've seen it before. 7024 05:10:50,120 --> 05:10:53,480 And so this get mechanism 7025 05:10:53,480 --> 05:10:58,480 allows us to 7026 05:10:58,840 --> 05:11:00,880 get a value out if the key exists 7027 05:11:00,880 --> 05:11:05,880 and specify a default if it's not there. 7028 05:11:06,120 --> 05:11:11,120 So I'm gonna go old count equals di dot get 7029 05:11:13,320 --> 05:11:16,400 w comma zero. 7030 05:11:16,400 --> 05:11:18,040 So instead of using negative 99 here, 7031 05:11:18,040 --> 05:11:22,760 I'm gonna just get rid of all this. What I'm saying 7032 05:11:22,760 --> 05:11:24,800 is look up in this dictionary. 7033 05:11:24,800 --> 05:11:28,760 Get is a method that's part of all dictionaries. 7034 05:11:28,760 --> 05:11:31,160 Look up using the key w, which is the, 7035 05:11:31,160 --> 05:11:36,160 and if I don't get it, give me back zero. 7036 05:11:36,160 --> 05:11:49,160 And so I'm gonna say print word comma old comma old count. 7037 05:11:49,660 --> 05:11:53,300 And now what I can say, whatever the old count is, 7038 05:11:53,300 --> 05:11:55,500 it's either the value that was in there or zero. 7039 05:11:55,500 --> 05:12:07,140 And now I can say new count equals old count plus one. 7040 05:12:07,140 --> 05:12:08,820 And now, let's see new count, 7041 05:12:08,820 --> 05:12:15,820 and I can say dictionary sub word is equal to new count.
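Here is a small sketch of what `get` does with a default. The starting dictionary and the variable names `oldcount` and `newcount` are assumptions chosen to mirror what is being said; only `dict.get(key, default)` itself is the real built-in method.

```python
di = {'the': 1}                     # assumed starting state: we have seen "the" once

# get returns the stored value when the key exists ...
print('the', di.get('the', -99))
# ... and the default we supplied when it does not.
print('clown', di.get('clown', -99))

# The same lookup with a default of zero, spelled out step by step.
w = 'clown'
oldcount = di.get(w, 0)     # zero, because 'clown' is not a key yet
newcount = oldcount + 1     # one more than whatever we had before
di[w] = newcount            # store it back under the same key
print(w, 'old', oldcount, 'new', newcount)
```

Note that `get` on its own never changes the dictionary; the value only lands in `di` on the assignment line.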
7042 05:12:18,940 --> 05:12:21,900 So instead I'm gonna get rid of this if then else then. 7043 05:12:21,900 --> 05:12:26,380 This is basically saying, look up the old count that we have. 7044 05:12:26,380 --> 05:12:27,960 If you don't find one, use a zero. 7045 05:12:27,960 --> 05:12:29,300 We'll print that out. 7046 05:12:29,300 --> 05:12:32,340 And then I'm gonna say afterwards, 7047 05:12:41,340 --> 05:12:42,900 I'm gonna print the new count. 7048 05:12:44,780 --> 05:12:46,380 Now, and so, 7049 05:12:46,380 --> 05:12:51,380 we'll print the old count. 7050 05:12:53,700 --> 05:12:55,580 Here are some of these blanks. 7051 05:12:55,580 --> 05:12:56,660 Print the old count. 7052 05:13:02,780 --> 05:13:04,700 And you can see the old count with the, 7053 05:13:04,700 --> 05:13:07,740 because the doesn't exist, was zero, the new one's one. 7054 05:13:07,740 --> 05:13:09,940 Clowns old is zero, new is one. 7055 05:13:09,940 --> 05:13:12,700 Clowns old ran old zero. 7056 05:13:12,700 --> 05:13:15,280 But now we get to the, its old count was one, 7057 05:13:15,280 --> 05:13:18,640 and now its new count is two, okay? 7058 05:13:18,640 --> 05:13:22,620 So by using this get and saying if we don't find it, 7059 05:13:22,620 --> 05:13:23,820 we'll assume the count is zero. 7060 05:13:23,820 --> 05:13:25,840 That makes a lot of sense, right? 7061 05:13:30,420 --> 05:13:35,420 If not there, the count is zero. 7062 05:13:38,100 --> 05:13:43,100 If the key is not there, the count is zero, okay? 7063 05:13:43,100 --> 05:13:46,660 So that's what this line does. 7064 05:13:46,660 --> 05:13:51,300 If get the value under the key, associated with the key, 7065 05:13:51,300 --> 05:13:53,620 or give me zero back. 7066 05:13:53,620 --> 05:13:55,300 And then I can take that old number 7067 05:13:55,300 --> 05:13:57,900 and just add one to it and then stick it back in. 7068 05:13:57,900 --> 05:14:01,940 Now this is ultimately not how we tend to do it, okay? 
7069 05:14:01,940 --> 05:14:05,620 We tend to blend this all into one big long statement. 7070 05:14:05,620 --> 05:14:10,380 Di sub w equals this part 7071 05:14:10,380 --> 05:14:15,020 plus one, okay? 7072 05:14:15,020 --> 05:14:19,500 So that says get the old value from this key or zero 7073 05:14:19,500 --> 05:14:20,820 and then add one to it, 7074 05:14:20,820 --> 05:14:24,140 because that really combines all of these lines 7075 05:14:24,140 --> 05:14:26,480 into a single line, okay? 7076 05:14:27,500 --> 05:14:28,940 So I'm gonna delete them now. 7077 05:14:34,460 --> 05:14:36,640 And now we've combined this all into one, 7078 05:14:36,640 --> 05:14:41,640 what effectively is an idiom. 7079 05:14:42,560 --> 05:14:47,560 Retrieve, create, update, counter, all in one line. 7080 05:14:54,260 --> 05:14:57,140 I'll still print out, in this case I'll just say 7081 05:14:57,140 --> 05:15:02,140 di sub w and then we'll see the counter, okay? 7082 05:15:02,140 --> 05:15:07,140 And so now I'll run this, we don't, we have a new, 7083 05:15:11,020 --> 05:15:13,020 but now we see at the second time it's two 7084 05:15:13,020 --> 05:15:15,540 and so we see car the first time, 7085 05:15:15,540 --> 05:15:17,500 we see that the second time, we see car, 7086 05:15:17,500 --> 05:15:20,160 I mean the third time, we see car the second time 7087 05:15:20,160 --> 05:15:22,220 and away we go, okay? 7088 05:15:22,220 --> 05:15:24,460 And so that's pretty straightforward 7089 05:15:24,460 --> 05:15:27,460 and so it really kind of typo there. 7090 05:15:27,460 --> 05:15:32,460 So let's just get rid of that and run it with a clown stuff 7091 05:15:36,100 --> 05:15:38,460 and we get the right data there 7092 05:15:38,460 --> 05:15:43,460 and let's run it with intro dot txt and there we go, okay? 7093 05:15:48,500 --> 05:15:51,500 And so it's tearing out a bunch of words 7094 05:15:51,500 --> 05:15:53,220 and giving us a dictionary. 
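The one-line idiom described above, sketched on an assumed stand-in for the single line in clown.txt:

```python
# Assumed sample text standing in for clown.txt.
line = ("the clown ran after the car and the car ran into the tent "
        "and the tent fell down on the clown and the car")

di = dict()
for w in line.split():
    # Retrieve the old count (or zero if the key is missing), add one, store it back.
    di[w] = di.get(w, 0) + 1

print(di)
```

The right-hand side is evaluated first, so a word that is not yet a key is looked up safely with `get` and then created by the assignment.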
7095 05:15:53,220 --> 05:15:58,060 And giving us a dictionary, so that was a lot of work 7096 05:15:58,060 --> 05:16:01,300 to get to this line 16 that has the dictionary in it. 7097 05:16:01,300 --> 05:16:04,300 Now we wanna find the most common word. 7098 05:16:15,660 --> 05:16:17,740 And so we're gonna loop through this dictionary 7099 05:16:17,740 --> 05:16:20,580 and part of it is like once we printed this dictionary out 7100 05:16:20,580 --> 05:16:22,580 and we verified that it's right, 7101 05:16:22,580 --> 05:16:25,220 don't worry too much about the code up here, right? 7102 05:16:25,220 --> 05:16:28,620 Matter of fact, I can take out some of these print statements 7103 05:16:28,620 --> 05:16:31,180 and we can kinda trust all this 7104 05:16:31,180 --> 05:16:33,780 and so now we're gonna work on this, okay? 7105 05:16:33,780 --> 05:16:35,900 Now we wanna find the most common word. 7106 05:16:35,900 --> 05:16:37,360 Now this is like a maximum loop. 7107 05:16:37,360 --> 05:16:42,360 So if you recall, we have a whole set of key value pairs, 7108 05:16:42,540 --> 05:16:46,240 communicate goes to two, is to two, skills is three. 7109 05:16:46,240 --> 05:16:48,340 So we have these key value pairs 7110 05:16:48,340 --> 05:16:50,340 and we're gonna loop through and look for the maximum. 7111 05:16:50,340 --> 05:16:54,860 Now in a dictionary, we can loop through the key value pairs 7112 05:16:54,860 --> 05:16:59,860 with the following syntax for, you know, 7113 05:17:00,260 --> 05:17:04,060 I would call these variables K and V for key and value, 7114 05:17:04,060 --> 05:17:09,060 but yeah, in the dictionaries name.items 7115 05:17:10,140 --> 05:17:13,420 and items is a method inside of all dictionaries 7116 05:17:13,420 --> 05:17:16,660 that says give me the key value pairs 7117 05:17:16,660 --> 05:17:18,380 and we need two iteration variables. 7118 05:17:18,380 --> 05:17:20,220 So this is like an assignment statement 7119 05:17:20,220 --> 05:17:21,060 for K and V. 
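A short sketch of looping over key-value pairs with `items()`. The three entries are the ones read out above; the `pairs` list is a hypothetical helper so the result can be inspected afterwards.

```python
di = {'communicate': 2, 'is': 2, 'skills': 3}

pairs = []                      # hypothetical helper to record what the loop saw
for k, v in di.items():         # k gets the key, v gets the value, one pair per pass
    print(k, v)
    pairs.append((k, v))
```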
7120 05:17:21,060 --> 05:17:23,660 K and V take on the successive values 7121 05:17:23,660 --> 05:17:27,140 for the key and the value, okay? 7122 05:17:27,140 --> 05:17:30,740 So if I just now print K comma V 7123 05:17:32,820 --> 05:17:34,740 and I'll take this print statement out 7124 05:17:35,820 --> 05:17:40,220 and then run the code on, oops, what did I forget, 7125 05:17:40,220 --> 05:17:43,260 oh, I fell back into my Python 2 days. 7126 05:17:44,140 --> 05:17:45,680 Need parentheses for my print. 7127 05:17:46,800 --> 05:17:49,300 So there's clown and it just prints it out 7128 05:17:49,300 --> 05:17:51,220 and it's kind of the same thing except it's prettier 7129 05:17:51,220 --> 05:17:55,220 because we're putting each one on a line, okay? 7130 05:17:55,220 --> 05:17:57,500 So the K is the key, the V is the value. 7131 05:17:57,500 --> 05:17:59,860 So we're looking for the largest value, oops. 7132 05:18:02,200 --> 05:18:05,040 So the thing is we know that the values 7133 05:18:05,040 --> 05:18:08,220 are always numbers that are at least one. 7134 05:18:08,220 --> 05:18:13,220 So I'm gonna do kind of a quickie maximum loop. 7135 05:18:14,300 --> 05:18:17,820 Largest equals negative one. 7136 05:18:17,820 --> 05:18:19,620 Now in previous times, we've seen that this 7137 05:18:19,620 --> 05:18:21,380 is a bad assumption, but because we know 7138 05:18:21,380 --> 05:18:23,320 these are counters that are always positive, 7139 05:18:23,320 --> 05:18:26,300 it turns out this is not a bad idea. 7140 05:18:26,300 --> 05:18:30,620 And so I can say if the value is greater 7141 05:18:30,620 --> 05:18:32,700 than the largest we've seen so far, 7142 05:18:38,400 --> 05:18:41,540 largest equals the value. 7143 05:18:44,620 --> 05:18:46,820 Okay, and when that loop is all done, 7144 05:18:46,820 --> 05:18:48,980 we can print the largest. 7145 05:18:57,860 --> 05:19:00,180 Okay, and so this is just a max loop 7146 05:19:00,180 --> 05:19:01,780 and we're using this value.
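The quick maximum loop, sketched on a small assumed dictionary of counts. Starting `largest` at -1 is only safe here because every count is at least one.

```python
di = {'the': 7, 'clown': 2, 'car': 3}   # assumed sample counts

largest = -1                 # safe starting point only because counts are positive
for k, v in di.items():
    if v > largest:          # found a value bigger than anything seen so far
        largest = v

print('Done', largest)
```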
7147 05:19:01,780 --> 05:19:04,180 That's the number, the value is the second thing. 7148 05:19:05,340 --> 05:19:06,180 Oops. 7149 05:19:06,180 --> 05:19:11,180 Ah, can't type Python. 7150 05:19:20,540 --> 05:19:21,960 Oh, it's a typo. 7151 05:19:22,940 --> 05:19:25,700 Yeah, I'm not using value, I'm using V. 7152 05:19:25,700 --> 05:19:28,240 So largest equals V, let's try it again. 7153 05:19:30,860 --> 05:19:32,300 Okay, so we're all done with seven. 7154 05:19:32,300 --> 05:19:34,700 So these were the things that we were looking for 7155 05:19:34,700 --> 05:19:37,220 and it was looking for the maximum 7156 05:19:37,220 --> 05:19:40,280 and it just dutifully found seven was the largest. 7157 05:19:40,280 --> 05:19:42,740 But we also wanna know what the word is. 7158 05:19:42,740 --> 05:19:47,740 And so what we can say here is we can say the word is none, 7159 05:19:48,920 --> 05:19:52,260 meaning it's just like we don't know what the word is. 7160 05:19:52,260 --> 05:19:56,060 And then whenever we catch this new largest number, 7161 05:19:56,060 --> 05:19:59,340 we say the word equals W. 7162 05:19:59,340 --> 05:20:04,340 So I like to think of this as capture, 7163 05:20:04,340 --> 05:20:09,340 remember the word that was largest. 7164 05:20:13,580 --> 05:20:15,140 Right, that's what I'm doing. 7165 05:20:15,140 --> 05:20:19,540 R, E, M, E, M, R, M, E, M, 7166 05:20:22,140 --> 05:20:24,860 remember, right, that's tough. 7167 05:20:24,860 --> 05:20:26,940 R, E, M, E, M, B, E, R, there we go. 7168 05:20:28,420 --> 05:20:31,700 So we're gonna, this trick here is, 7169 05:20:31,700 --> 05:20:33,660 not only knowing what the largest number was, 7170 05:20:33,660 --> 05:20:36,620 but the word that was associated with the largest number. 7171 05:20:36,620 --> 05:20:40,820 So now I can print out at the end the word and the largest 7172 05:20:40,820 --> 05:20:41,860 and that's the count. 
7173 05:20:46,540 --> 05:20:48,180 Okay, and so now we know that, 7174 05:20:50,740 --> 05:20:52,540 oops, did we make a mistake here? 7175 05:20:58,740 --> 05:21:00,700 Okay, that does not look good 7176 05:21:00,700 --> 05:21:04,060 because it says car and seven. 7177 05:21:05,820 --> 05:21:08,040 If V is greater than the largest, 7178 05:21:09,060 --> 05:21:11,380 oh, it's not W. 7179 05:21:12,900 --> 05:21:14,780 I used a really bad variable. 7180 05:21:14,780 --> 05:21:17,020 See, that's the whole value there. 7181 05:21:17,020 --> 05:21:17,920 There we go. 7182 05:21:17,920 --> 05:21:19,720 It's K, which is the key. 7183 05:21:21,500 --> 05:21:22,340 Key. 7184 05:21:23,820 --> 05:21:26,060 I was gonna say that was quite the bug. 7185 05:21:26,060 --> 05:21:26,900 See what happened there? 7186 05:21:26,900 --> 05:21:29,620 I had this as W and it just happened to be, 7187 05:21:29,620 --> 05:21:32,200 it was the last word on the file. 7188 05:21:35,700 --> 05:21:38,140 Car, the last word in the file 7189 05:21:38,140 --> 05:21:40,240 because I used a wrong variable. 7190 05:21:44,140 --> 05:21:46,140 No, little mistakes, little mistakes. 7191 05:21:48,940 --> 05:21:51,040 The and seven. 7192 05:21:51,040 --> 05:21:53,420 Okay, so let's get rid of this print statement 7193 05:21:53,420 --> 05:21:56,100 because we kind of know what's going on here 7194 05:21:56,100 --> 05:21:59,260 and away we go and this should now work. 7195 05:21:59,260 --> 05:22:00,100 If we run it. 7196 05:22:05,180 --> 05:22:07,180 I can even get rid of the word done here. 7197 05:22:11,780 --> 05:22:12,820 There we go, the seven. 
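The fixed version of the loop, with the bug called out in a comment. The sample text and the name `theword` are assumptions; the point is that the key to remember is `k` from the `items()` loop, not `w`, which is left over from the earlier word loop.

```python
# Assumed sample text standing in for clown.txt.
line = ("the clown ran after the car and the car ran into the tent "
        "and the tent fell down on the clown and the car")

di = dict()
for w in line.split():
    di[w] = di.get(w, 0) + 1

largest = -1
theword = None               # None means "we do not know the word yet"
for k, v in di.items():
    if v > largest:
        largest = v
        theword = k          # capture the key; using w here would be the bug

print(theword, largest)
print('leftover w:', w)      # w still holds the last word that was read
```

Python does not complain about using `w` inside the second loop because `w` is still a perfectly valid variable; the result is simply wrong.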
7198 05:22:12,820 --> 05:22:15,260 Now, the cool thing about this is this code runs 7199 05:22:15,260 --> 05:22:17,540 just as easily with one line of text 7200 05:22:17,540 --> 05:22:20,860 or the intro of the book, intro.txt, 7201 05:22:20,860 --> 05:22:24,260 and not surprisingly, the is still the most common word 7202 05:22:24,260 --> 05:22:27,540 in intro.txt, I seem to like that word, 7203 05:22:27,540 --> 05:22:29,900 and it's 226 times. 7204 05:22:29,900 --> 05:22:34,900 Okay, and so that is the basic pattern of reading some, 7205 05:22:35,420 --> 05:22:37,800 this is just a word loop. Now sometimes 7206 05:22:37,800 --> 05:22:39,540 there would be some, you know, 7207 05:22:39,540 --> 05:22:41,900 checking to see if the line is the one you're interested in, 7208 05:22:41,900 --> 05:22:43,180 maybe tearing apart the line, 7209 05:22:43,180 --> 05:22:44,780 but at the end of the day, it's 7210 05:22:44,780 --> 05:22:48,060 this idiom of starting a dictionary. 7211 05:22:48,060 --> 05:22:50,060 Now, it's a common problem to know 7212 05:22:50,060 --> 05:22:51,140 where to start the dictionary. 7213 05:22:51,140 --> 05:22:54,260 You want to accumulate the numbers for the whole file, 7214 05:22:54,260 --> 05:22:58,420 so you don't want to put it in between line six and line seven. 7215 05:22:58,420 --> 05:23:03,260 Okay, so I hope that particular thing helps a little bit, 7216 05:23:03,260 --> 05:23:05,320 helps you understand dictionaries. 7217 05:23:09,500 --> 05:23:11,020 Hello and welcome to chapter 10. 7218 05:23:11,020 --> 05:23:12,980 Now, we're gonna talk about our third kind of collection 7219 05:23:12,980 --> 05:23:15,880 called tuples, but tuples are really a lot like lists, 7220 05:23:15,880 --> 05:23:18,200 there's not too much to them, 7221 05:23:18,200 --> 05:23:21,820 they're really kind of a reductionist version of lists.
7222 05:23:21,820 --> 05:23:24,100 So they function very much like lists, 7223 05:23:24,100 --> 05:23:28,260 in that, you know, they have things, 7224 05:23:28,260 --> 05:23:30,300 and the difference is there are no square brackets, 7225 05:23:30,300 --> 05:23:33,860 there are parentheses, round braces or whatever, 7226 05:23:33,860 --> 05:23:36,300 and they have positions zero, one, and two, 7227 05:23:36,300 --> 05:23:39,900 just like a list, and you can look things up, X sub two, 7228 05:23:39,900 --> 05:23:43,020 so X sub two is actually the third element here, 7229 05:23:43,020 --> 05:23:45,220 and so that prints out Joseph. 7230 05:23:45,220 --> 05:23:47,820 You can assign, you know, make a tuple here, 7231 05:23:47,820 --> 05:23:50,260 this is the constant syntax for a tuple, 7232 05:23:50,260 --> 05:23:52,460 and print that out, and the print statement shows you 7233 05:23:52,460 --> 05:23:54,180 that this is a tuple, not a list, 7234 05:23:54,180 --> 05:23:55,740 by showing you round parentheses, 7235 05:23:55,740 --> 05:23:58,220 and a whole bunch of functions that work with lists 7236 05:23:58,220 --> 05:24:00,500 work the same way with tuples. 7237 05:24:00,500 --> 05:24:03,240 You can put a tuple at the end of an in clause 7238 05:24:03,240 --> 05:24:04,780 in a for, as you might expect, 7239 05:24:04,780 --> 05:24:06,420 and then it iterates through the tuple, 7240 05:24:06,420 --> 05:24:09,600 tuples maintain order, so it prints out one, nine, and two. 7241 05:24:09,600 --> 05:24:14,080 So, literally this bit of code here could be identical, 7242 05:24:14,080 --> 05:24:16,300 whether it was a list or a tuple, 7243 05:24:16,300 --> 05:24:19,040 it really would do the exact same thing.
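The tuple basics above can be sketched like this. Only 'Joseph' in position two and the values one, nine, two come from what is said; the other two names are assumed placeholders, and the `printed` list is a hypothetical helper.

```python
x = ('Glenn', 'Sally', 'Joseph')   # first two names are assumed placeholders
print(x[2])                        # position two is the third element

y = (1, 9, 2)
print(y)                           # shown with round parentheses, so it is a tuple
print(max(y))                      # many list functions work the same on tuples

printed = []                       # hypothetical helper to record the loop order
for item in y:                     # a tuple can follow "in" in a for loop
    print(item)
    printed.append(item)
```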
7244 05:24:19,040 --> 05:24:21,940 The difference with tuples is that they are immutable, 7245 05:24:21,940 --> 05:24:23,460 once you create the tuple, 7246 05:24:23,460 --> 05:24:25,820 you can only sort of assign a tuple, 7247 05:24:25,820 --> 05:24:28,380 but you can't modify it, you can modify a list. 7248 05:24:28,380 --> 05:24:30,320 So if we take a look at a list here, 7249 05:24:30,320 --> 05:24:32,400 we make a list that's nine, eight, seven, 7250 05:24:32,400 --> 05:24:34,180 and we say X sub two equals six, 7251 05:24:34,180 --> 05:24:36,980 well, that just means this seven becomes a six, 7252 05:24:36,980 --> 05:24:41,180 and that's just natural, meaning we can reassign slots, 7253 05:24:41,180 --> 05:24:43,420 we can delete things, we can insert things, 7254 05:24:43,420 --> 05:24:46,260 we can mutate them, we can change them, 7255 05:24:46,260 --> 05:24:49,700 so they're changeable, right, they're changeable. 7256 05:24:49,700 --> 05:24:53,460 But, if we try to do that same thing with a string, 7257 05:24:53,460 --> 05:24:54,820 so we say Y equals ABC, 7258 05:24:54,820 --> 05:24:56,860 and we know that these are positions zero, one, and two, 7259 05:24:56,860 --> 05:25:00,060 but if we try to say, let's change the C to a D, 7260 05:25:00,060 --> 05:25:03,620 by saying Y sub two equals D, that is not allowed. 7261 05:25:03,620 --> 05:25:05,940 And it says it doesn't support item assignment, 7262 05:25:05,940 --> 05:25:09,580 and this little bracket, you know, X sub two, 7263 05:25:09,580 --> 05:25:13,060 is what they call item assignment inside of Python. 7264 05:25:13,060 --> 05:25:14,840 And so if we do the same thing then 7265 05:25:14,840 --> 05:25:19,060 with a three element tuple, put that in Z, 7266 05:25:19,060 --> 05:25:22,140 and we try to change this slot to be a zero, 7267 05:25:22,140 --> 05:25:24,580 it's gonna blow up, because it's the exact same thing.
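A sketch of the three cases just compared. The tuple contents `(5, 4, 3)` are an assumption (only "a three element tuple" is stated), and the `errors` list is a hypothetical helper; the `try`/`except` blocks are there so the script keeps running after each traceback.

```python
x = [9, 8, 7]
x[2] = 6                      # lists are mutable: the seven becomes a six
print(x)

errors = []                   # hypothetical helper collecting which types refused

y = 'ABC'
try:
    y[2] = 'D'                # strings do not support item assignment
except TypeError as err:
    errors.append('str')
    print('string:', err)

z = (5, 4, 3)                 # assumed values for the three element tuple
try:
    z[2] = 0                  # tuples do not support item assignment either
except TypeError as err:
    errors.append('tuple')
    print('tuple:', err)
```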
7268 05:25:24,580 --> 05:25:26,140 And that has to do with the fact that, 7269 05:25:26,140 --> 05:25:30,000 once this assignment is made, this is not modifiable. 7270 05:25:30,000 --> 05:25:32,660 Now, it turns out that the reason it's not modifiable 7271 05:25:32,660 --> 05:25:34,380 is for efficiency. 7272 05:25:35,780 --> 05:25:40,780 They take up less storage, they are quicker to access, 7273 05:25:40,840 --> 05:25:43,740 and they're really designed internally behind the scenes 7274 05:25:43,740 --> 05:25:46,280 in ways we don't really need to understand. 7275 05:25:46,280 --> 05:25:49,360 They're just more efficient than lists. 7276 05:25:49,360 --> 05:25:50,840 If all you wanna do is store a list, 7277 05:25:50,840 --> 05:25:52,360 and look at it, and then throw it away, 7278 05:25:52,360 --> 05:25:54,420 you probably should use a tuple instead. 7279 05:25:54,420 --> 05:25:56,600 So there's a lot of things that you can do with lists 7280 05:25:56,600 --> 05:25:57,960 that you can't do with tuples, 7281 05:25:57,960 --> 05:25:59,960 but they're really just a corollary 7282 05:25:59,960 --> 05:26:02,480 of this notion of non-mutability. 7283 05:26:02,480 --> 05:26:04,120 And so, like, you can sort a list, 7284 05:26:04,120 --> 05:26:05,520 but you can't sort a tuple in place. 7285 05:26:05,520 --> 05:26:08,400 You can add a five to the end of three, two, one. 7286 05:26:08,400 --> 05:26:10,880 Can't do that in a tuple, but you can in a list. 7287 05:26:10,880 --> 05:26:13,560 And flip the order, dot, dot, dot, dot, dot, dot. 7288 05:26:13,560 --> 05:26:18,480 So anything that you can do to a list that modifies the list, 7289 05:26:18,480 --> 05:26:20,120 not allowed for tuples.
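The modifying operations mentioned above, sketched side by side. The `missing` list is a hypothetical helper; `sorted()` is added as an aside (it is not mentioned in the lecture) because it builds a new list and so works on a tuple without changing it.

```python
x = [3, 2, 1]
x.sort()                 # sorts the list in place
x.append(5)              # adds a five to the end
x.reverse()              # flips the order in place
print(x)

t = (3, 2, 1)
# Tuples simply do not have the modifying methods.
missing = [name for name in ('sort', 'append', 'reverse') if not hasattr(t, name)]
print('not available on tuples:', missing)

# sorted() returns a new list and leaves the tuple untouched.
print(sorted(t), t)
```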
7290 05:26:20,120 --> 05:26:23,040 And so you can take a look at the kinds of things 7291 05:26:23,040 --> 05:26:27,600 that are inside the methods that are part of each list, 7292 05:26:27,600 --> 05:26:30,200 append, count, extend, index, insert, pop, 7293 05:26:30,200 --> 05:26:32,480 all of these, many of these are modifying, 7294 05:26:32,480 --> 05:26:34,120 and then count and index are the only ones 7295 05:26:34,120 --> 05:26:36,520 that work for tuples. 7296 05:26:36,520 --> 05:26:40,000 And so tuples are limited lists. 7297 05:26:40,000 --> 05:26:41,800 Now, at some point, there's gonna be a but here 7298 05:26:41,800 --> 05:26:44,040 to say, why do we like them? 7299 05:26:44,040 --> 05:26:46,760 And the reason that we like them is that 7300 05:26:46,760 --> 05:26:47,960 they're just more efficient. 7301 05:26:47,960 --> 05:26:50,480 Python doesn't have to build as much 7302 05:26:50,480 --> 05:26:53,960 into its own internal organization of these objects. 7303 05:26:53,960 --> 05:26:56,440 It knows that they'll never be modified, 7304 05:26:56,440 --> 05:26:58,360 because when you make a tuple, you as the programmer 7305 05:26:58,360 --> 05:27:00,200 are saying, I'm never gonna modify this, 7306 05:27:00,200 --> 05:27:02,600 and Python won't let you do it. 7307 05:27:02,600 --> 05:27:05,280 So it's higher performance, better memory use, 7308 05:27:05,280 --> 05:27:06,960 and you know, to a beginning programmer, 7309 05:27:06,960 --> 05:27:09,600 that doesn't really matter, but that's the reason. 7310 05:27:09,600 --> 05:27:13,280 And so we tend to use tuples in situations 7311 05:27:13,280 --> 05:27:14,600 where we're gonna make a temporary variable 7312 05:27:14,600 --> 05:27:17,160 and then temporarily use it just a little bit 7313 05:27:17,160 --> 05:27:19,080 and then throw it away without really messing with it.
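The method comparison mentioned above can be checked directly with `dir()`. This sketch filters out the double-underscore names so only the ordinary methods remain; the variable names are assumptions.

```python
# Ordinary (non-underscore) methods on each type.
list_methods = [name for name in dir(list) if not name.startswith('_')]
tuple_methods = [name for name in dir(tuple) if not name.startswith('_')]

print(list_methods)
print(tuple_methods)

# Methods that lists have and tuples lack; most of these modify the list.
list_only = [name for name in list_methods if name not in tuple_methods]
print(list_only)
```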
7314 05:27:19,080 --> 05:27:21,280 And we tend to use lists to build things up, 7315 05:27:21,280 --> 05:27:22,720 et cetera, et cetera, et cetera. 7316 05:27:24,640 --> 05:27:27,760 So the other thing that's interesting about tuples, 7317 05:27:27,760 --> 05:27:29,440 and we've actually sort of seen this, 7318 05:27:29,440 --> 05:27:33,200 is that you can put a tuple that includes variables 7319 05:27:33,200 --> 05:27:35,120 on the left side of the assignment. 7320 05:27:35,120 --> 05:27:37,880 And this takes a little getting used to, 7321 05:27:37,880 --> 05:27:40,400 but it's really cool, and no other language 7322 05:27:40,400 --> 05:27:41,560 that I know of does this. 7323 05:27:41,560 --> 05:27:44,520 So if we say x comma y, that's a two tuple. 7324 05:27:44,520 --> 05:27:45,560 Both have two variables. 7325 05:27:45,560 --> 05:27:47,520 You can't put constants on this side. 7326 05:27:47,520 --> 05:27:49,960 You know, it's like saying x equals four, 7327 05:27:49,960 --> 05:27:53,160 y equals Fred, right? 7328 05:27:53,160 --> 05:27:56,000 So what happens is, is you can put a tuple 7329 05:27:56,000 --> 05:27:57,760 on the far side of an assignment statement, 7330 05:27:57,760 --> 05:28:00,960 and the four goes to x, and the Fred goes to y. 7331 05:28:00,960 --> 05:28:01,920 And you say, what's in y? 7332 05:28:01,920 --> 05:28:03,120 Well, y is indeed Fred. 7333 05:28:03,120 --> 05:28:05,460 And so this is like two assignment statements. 7334 05:28:05,460 --> 05:28:07,240 Now, the way I've got this syntax, 7335 05:28:07,240 --> 05:28:09,440 I would probably do two separate statements, 7336 05:28:09,440 --> 05:28:12,040 just not to show off that I know how to do tuples. 7337 05:28:14,200 --> 05:28:15,840 And so you can, here's another one, 7338 05:28:15,840 --> 05:28:17,760 and they just move correspondingly. 
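Tuple assignment as just described, in a short sketch. The values 4, 'fred', and 99 are the ones mentioned; 98 is an assumed second value so the second tuple has two items.

```python
(x, y) = (4, 'fred')     # four goes to x, 'fred' goes to y
print(y)

a, b = (99, 98)          # the parentheses on the left are optional
print(a)
```

This behaves like two separate assignment statements, one per position.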
7339 05:28:17,760 --> 05:28:20,240 If you don't have two here, and you do have two here, 7340 05:28:21,240 --> 05:28:24,440 well, if you have three here, or two here, and three here, 7341 05:28:24,440 --> 05:28:25,840 and you don't match the number there, 7342 05:28:25,840 --> 05:28:26,780 you get in some trouble. 7343 05:28:26,780 --> 05:28:29,480 Now, if you just say x equals tuple, 7344 05:28:29,480 --> 05:28:31,180 then that is the tuple in the list. 7345 05:28:31,180 --> 05:28:35,440 But this is just a simple straight 99 value going into a. 7346 05:28:35,440 --> 05:28:38,760 So you can put tuples as the left-hand side. 7347 05:28:38,760 --> 05:28:41,680 And you can even do things like return a tuple from functions. 7348 05:28:41,680 --> 05:28:45,140 That's a real nice Python feature that I like a lot. 7349 05:28:45,140 --> 05:28:47,160 Tuples are also related to dictionaries, 7350 05:28:47,160 --> 05:28:49,200 as we've seen in the previous chapter. 7351 05:28:49,200 --> 05:28:50,620 So here we make a little dictionary. 7352 05:28:50,620 --> 05:28:52,720 We make an empty dictionary by constructing 7353 05:28:52,720 --> 05:28:54,240 an empty dictionary, stick it in d. 7354 05:28:54,240 --> 05:28:56,160 So d is sort of like this place 7355 05:28:56,160 --> 05:28:58,080 that can hold key value pairs. 7356 05:28:58,080 --> 05:29:01,080 And we put csev, and there's a two in there, 7357 05:29:01,080 --> 05:29:03,180 and chen1, and there's a four in there. 7358 05:29:03,180 --> 05:29:05,200 So we have this associative mapping 7359 05:29:05,200 --> 05:29:08,960 between csev and two, and chen1 and four, all stuff we know. 7360 05:29:08,960 --> 05:29:11,440 And now we say, hey, we're gonna loop 7361 05:29:11,440 --> 05:29:13,160 through the key value pairs here, 7362 05:29:13,160 --> 05:29:16,600 and we've seen this syntax before, k,v. 7363 05:29:16,600 --> 05:29:17,960 So this is a tuple. 
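Two of the points above, sketched in code: what happens when the counts on the two sides do not match, and returning a tuple from a function. The function name `smallest_and_largest` and its sample input are hypothetical, made up for illustration.

```python
def smallest_and_largest(values):
    """Hypothetical helper: hand back two results at once as a tuple."""
    return min(values), max(values)

# The returned tuple can be unpacked straight into two variables.
low, high = smallest_and_largest([3, 41, 12, 9])
print(low, high)

# Two variables on the left, three values on the right: Python refuses.
try:
    x, y = (1, 2, 3)
except ValueError as err:
    mismatch = str(err)
    print('mismatch:', mismatch)
```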
7364 05:29:17,960 --> 05:29:20,120 So you can think of this as each one of these things 7365 05:29:20,120 --> 05:29:22,080 is going to get assigned into this tuple, 7366 05:29:22,080 --> 05:29:23,680 which means the key ends up in, 7367 05:29:23,680 --> 05:29:25,760 and the first one's the key, and the second one's the value. 7368 05:29:25,760 --> 05:29:29,520 I use the variables k and v all the time in code that I write, 7369 05:29:29,520 --> 05:29:30,860 just for my own sanity. 7370 05:29:30,860 --> 05:29:33,280 So k and v are gonna iterate successively 7371 05:29:33,280 --> 05:29:36,920 through the successive keys and values in them. 7372 05:29:36,920 --> 05:29:38,720 So this is gonna run twice, 7373 05:29:38,720 --> 05:29:41,680 and k, v is gonna be csev, two, and then chen1, four. 7374 05:29:41,680 --> 05:29:44,380 The order just happened to stay the same. 7375 05:29:45,280 --> 05:29:49,680 And so if you say, what is in one of these things, 7376 05:29:49,680 --> 05:29:51,280 you can actually take d.items(), 7377 05:29:51,280 --> 05:29:53,560 the items method within that dictionary, 7378 05:29:53,560 --> 05:29:56,640 and say, hey, give me back, give that to me back, 7379 05:29:56,640 --> 05:29:57,680 and then print tups. 7380 05:29:57,680 --> 05:30:00,280 And this is, it's a special kind of a class, 7381 05:30:00,280 --> 05:30:03,840 but really ultimately it is a list of tuples. 7382 05:30:03,840 --> 05:30:07,040 This is two, this is the zero, and this is the two, 7383 05:30:07,040 --> 05:30:08,960 the one, the first and the second, 7384 05:30:08,960 --> 05:30:12,000 and then within each thing you get, you have a two tuple. 7385 05:30:12,000 --> 05:30:16,600 And so in a sense, this k and v are iterating 7386 05:30:16,600 --> 05:30:20,040 through those things when we're putting d.items() here 7387 05:30:20,040 --> 05:30:21,440 and d.items() there. 7388 05:30:23,200 --> 05:30:25,560 One nice thing about tuples is that they're comparable.
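The dictionary loop described above can be sketched as follows. The key names follow the transcript as written ('csev' and 'chen1'); the name `pairs` is an assumption added here to show the plain-list form.

```python
d = dict()
d['csev'] = 2
d['chen1'] = 4

# Each (key, value) pair from items() is unpacked into k and v.
for k, v in d.items():
    print(k, v)

# items() returns a dict_items view, the "special kind of class"
# mentioned in the lecture.
tups = d.items()
print(tups)  # dict_items([('csev', 2), ('chen1', 4)])

# Converting the view gives an ordinary list of two-tuples.
pairs = list(tups)
print(pairs)  # [('csev', 2), ('chen1', 4)]
```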
7389 05:30:25,560 --> 05:30:26,960 They're comparable in the same way 7390 05:30:26,960 --> 05:30:28,080 that strings are comparable, 7391 05:30:28,080 --> 05:30:30,240 meaning that they're compared from left to right 7392 05:30:30,240 --> 05:30:33,880 with the leftmost or zero tuple being the most significant. 7393 05:30:33,880 --> 05:30:36,440 And it doesn't compare any further than it has to 7394 05:30:36,440 --> 05:30:39,300 if it's asking less than. 7395 05:30:39,300 --> 05:30:41,560 So if it's looking at, say, this first tuple, 7396 05:30:41,560 --> 05:30:43,880 it starts at the left and says, okay, 7397 05:30:43,880 --> 05:30:46,160 ask the question, tell me true or false. 7398 05:30:46,160 --> 05:30:47,720 Is zero less than five? 7399 05:30:47,720 --> 05:30:48,920 The answer is true. 7400 05:30:48,920 --> 05:30:52,000 And so the answer to this overall expression is true, 7401 05:30:52,000 --> 05:30:54,640 and it doesn't even compare those two numbers, 7402 05:30:54,640 --> 05:30:57,920 those second and third number, they don't compare them. 7403 05:30:57,920 --> 05:31:01,780 If, on the other hand, we're asking is this less than that, 7404 05:31:01,780 --> 05:31:03,360 it only looks at the first one 7405 05:31:03,360 --> 05:31:05,240 and asks if it can answer the question. 7406 05:31:05,240 --> 05:31:07,280 The answer is, well, they're both zero, 7407 05:31:07,280 --> 05:31:08,820 and so I can't answer the question, 7408 05:31:08,820 --> 05:31:11,160 so I have to go to the second one, second pair, 7409 05:31:11,160 --> 05:31:14,520 and one is less than three, and so that means this is true. 7410 05:31:14,520 --> 05:31:16,720 And it does not check this. 7411 05:31:16,720 --> 05:31:19,720 Even though 20 million is bigger than four, 7412 05:31:19,720 --> 05:31:23,600 it doesn't matter because these are the numbers 7413 05:31:23,600 --> 05:31:26,440 that cause the true to happen. 
7414 05:31:26,440 --> 05:31:31,440 And the same is true if you do this with strings. 7415 05:31:31,640 --> 05:31:33,360 Again, we start the first one. 7416 05:31:33,360 --> 05:31:36,160 So Jones, Sally, well, that's the same, 7417 05:31:36,160 --> 05:31:37,320 so we don't know the answer yet, 7418 05:31:37,320 --> 05:31:40,960 and so Sally, Sam, well, okay, S, S, 7419 05:31:40,960 --> 05:31:43,960 well, they're the same, A, A, they're the same, 7420 05:31:43,960 --> 05:31:46,820 oh, L and M. 7421 05:31:46,820 --> 05:31:50,600 L is less than M, so the actual letter 7422 05:31:50,600 --> 05:31:52,920 that makes the difference here is the L and the M 7423 05:31:52,920 --> 05:31:55,000 and leads to this being true. 7424 05:31:55,000 --> 05:31:56,840 And so it goes left to right, 7425 05:31:56,840 --> 05:31:58,280 but then even when it's doing strings, 7426 05:31:58,280 --> 05:31:59,120 it's going left to right. 7427 05:31:59,120 --> 05:32:02,360 That's just how string comparison works. 7428 05:32:02,360 --> 05:32:07,360 And if we say, is Jones, Sally greater than Adam, Sam? 7429 05:32:09,560 --> 05:32:10,800 Well, we checked the first one, 7430 05:32:10,800 --> 05:32:12,480 and we checked the J and the A. 7431 05:32:12,480 --> 05:32:14,520 Well, J is greater than A, 7432 05:32:14,520 --> 05:32:16,000 and so we don't have to look at anything else. 7433 05:32:16,000 --> 05:32:18,360 We don't have to look at any more of these characters. 7434 05:32:18,360 --> 05:32:20,440 We don't have to look at the second thing in the tuple. 7435 05:32:20,440 --> 05:32:22,920 What we have looked at is enough to be true. 7436 05:32:22,920 --> 05:32:26,360 So it only scans until it has a definitive answer. 7437 05:32:26,360 --> 05:32:28,200 It doesn't scan any further. 
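The comparisons walked through above, written out as a small sketch. The exact tuples on the slide are not in the transcript, so the literals below are assumptions reconstructed from the spoken values.

```python
# Decided at position 0: 0 < 5, so the rest is never examined.
r1 = (0, 1, 2) < (5, 1, 2)

# Position 0 ties (0 == 0), so position 1 decides: 1 < 3.
# The large third element is never compared.
r2 = (0, 1, 20000000) < (0, 3, 4)

# 'Jones' ties, then 'Sally' < 'Sam' because 'l' < 'm' at the third letter.
r3 = ('Jones', 'Sally') < ('Jones', 'Sam')

# 'J' > 'A' settles it on the first character of the first element.
r4 = ('Jones', 'Sally') > ('Adam', 'Sam')

print(r1, r2, r3, r4)  # True True True True
```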
7438 05:32:29,640 --> 05:32:30,760 So now what we're going to do 7439 05:32:30,760 --> 05:32:32,920 is use this comparable capability 7440 05:32:32,920 --> 05:32:34,320 to sort these lists of tuples 7441 05:32:34,320 --> 05:32:36,200 and then bring this all back 7442 05:32:36,200 --> 05:32:38,000 and connect it more to dictionaries. 7443 05:32:41,920 --> 05:32:43,680 So now we can take advantage of the notion 7444 05:32:43,680 --> 05:32:46,280 of comparing tuples and use sorting. 7445 05:32:46,280 --> 05:32:49,480 And so what we're going to produce is a list of tuples, 7446 05:32:49,480 --> 05:32:51,800 and then we're going to sort them, right? 7447 05:32:51,800 --> 05:32:54,280 And so we can get a list of tuples from a dictionary 7448 05:32:54,280 --> 05:32:56,200 and then we can sort that list of tuples, 7449 05:32:56,200 --> 05:32:58,560 and then we can end up sorting dictionary items 7450 05:32:58,560 --> 05:33:00,160 by taking this two-step process. 7451 05:33:00,160 --> 05:33:02,600 Convert dictionary to a list, sort the list, 7452 05:33:02,600 --> 05:33:06,360 and then we can have a sorted dictionary values, okay? 7453 05:33:06,360 --> 05:33:08,360 And so we'll do this a couple of different times. 7454 05:33:08,360 --> 05:33:10,440 So if we take a look at this code right here, 7455 05:33:10,440 --> 05:33:12,000 we have our happy little dictionary, 7456 05:33:12,000 --> 05:33:15,960 A, B, C, A maps to 10, B maps to one, C maps to 20. 7457 05:33:15,960 --> 05:33:17,120 Like what are we going to get here? 7458 05:33:17,120 --> 05:33:19,980 Well, it comes out, the mapping is the right way, 7459 05:33:19,980 --> 05:33:21,720 but the order is whatever. 7460 05:33:21,720 --> 05:33:24,200 And now we say this function called sorted, 7461 05:33:24,200 --> 05:33:26,440 which takes inside a sequence 7462 05:33:26,440 --> 05:33:29,320 and then returns us a sorted version of that, 7463 05:33:29,320 --> 05:33:30,760 a list that's sorted. 
7464 05:33:30,760 --> 05:33:32,640 And so it says sorted of d.items(). 7465 05:33:32,640 --> 05:33:35,120 So it's basically going to take this list 7466 05:33:35,120 --> 05:33:37,320 and compare the A's and the C's and the B's, 7467 05:33:37,320 --> 05:33:40,000 and because it's a dictionary and all the keys are unique, 7468 05:33:40,000 --> 05:33:41,120 there's never going to be equality. 7469 05:33:41,120 --> 05:33:43,320 So it really is going to just sort this by keys 7470 05:33:43,320 --> 05:33:45,360 and never get to looking at the values. 7471 05:33:45,360 --> 05:33:48,720 You could construct a list that had duplicate, 7472 05:33:48,720 --> 05:33:50,160 you could make a list of tuples 7473 05:33:50,160 --> 05:33:53,280 that had duplicates in the first position like we did before, 7474 05:33:53,280 --> 05:33:56,120 but given that this is coming from a dictionary, 7475 05:33:56,120 --> 05:33:58,840 the first thing is going to always be unique and distinct. 7476 05:33:58,840 --> 05:34:00,760 And so if we say sorted of d.items(), 7477 05:34:00,760 --> 05:34:04,080 that we're passing this stuff into sorted, 7478 05:34:04,080 --> 05:34:07,240 sorted is going to go around, move stuff around, 7479 05:34:07,240 --> 05:34:10,600 and then give us back a sorted version, 7480 05:34:10,600 --> 05:34:13,240 sorted in ascending order based on key 7481 05:34:13,240 --> 05:34:14,920 without looking at the value. 7482 05:34:14,920 --> 05:34:19,600 And so that's a way to see dictionaries 7483 05:34:19,600 --> 05:34:22,720 sorted by key is just say sorted of d.items(). 7484 05:34:22,720 --> 05:34:25,560 And sorted is a function, and so it just picks stuff. 7485 05:34:25,560 --> 05:34:27,360 And so this is the kind of loop 7486 05:34:27,360 --> 05:34:29,960 that you're going to write to do that. 7487 05:34:29,960 --> 05:34:32,560 You know, we did this before, we took sorted, 7488 05:34:32,560 --> 05:34:34,640 and we got these sorted by keys. 
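A short sketch of sorting dictionary items by key, as described above. The values follow the spoken mapping (a to 10, b to 1, c to 20); the insertion order and the name `by_key` are assumptions, picked so that the effect of sorting is visible.

```python
# Keys inserted out of order on purpose.
d = {'c': 20, 'a': 10, 'b': 1}

print(list(d.items()))  # [('c', 20), ('a', 10), ('b', 1)]

# sorted() compares the tuples left to right. Keys are unique, so the
# comparison is always settled by the key and never reaches the value.
by_key = sorted(d.items())
print(by_key)  # [('a', 10), ('b', 1), ('c', 20)]

# The same thing as a loop, unpacking each pair into k and v.
for k, v in sorted(d.items()):
    print(k, v)
```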
7489 05:34:34,640 --> 05:34:38,040 And so you can just make this nice and simple for key value. 7490 05:34:38,040 --> 05:34:40,280 By the way, you can eliminate the parentheses here, 7491 05:34:40,280 --> 05:34:42,280 and I think it's prettier if you eliminate the parentheses, 7492 05:34:42,280 --> 05:34:43,720 but you could put parentheses. 7493 05:34:43,720 --> 05:34:46,320 This is still a tuple without the parentheses 7494 05:34:46,320 --> 05:34:49,420 for key and value in sorted, 7495 05:34:49,420 --> 05:34:50,880 so that says go through D items, 7496 05:34:50,880 --> 05:34:53,240 but before I go through them, please sort them. 7497 05:34:53,240 --> 05:34:55,800 So that means K is going to go through A, B, and C 7498 05:34:55,800 --> 05:34:58,800 deterministically every single time it's going to go. 7499 05:34:58,800 --> 05:35:00,360 And of course, value is going to go 7500 05:35:00,360 --> 05:35:01,600 through the corresponding value, 7501 05:35:01,600 --> 05:35:06,600 so now we can print this out nicely sorted by key. 7502 05:35:06,640 --> 05:35:11,640 And that's a real nice succinct little way to say that. 7503 05:35:12,000 --> 05:35:15,000 I mean, again, these are one of the kind of things 7504 05:35:15,000 --> 05:35:17,000 that people really like about Python 7505 05:35:17,000 --> 05:35:19,040 is that you can do pretty powerful things 7506 05:35:19,040 --> 05:35:21,280 with easy to under, I mean, you know, 7507 05:35:21,280 --> 05:35:22,480 you might have seen this for the first time, 7508 05:35:22,480 --> 05:35:24,360 but ultimately you look at that, eventually you'll be like, 7509 05:35:24,360 --> 05:35:26,560 oh yeah, I see exactly what that's doing. 7510 05:35:26,560 --> 05:35:28,420 Easy, not hard at all. 7511 05:35:29,560 --> 05:35:32,760 So, but let's say we're looking for the most common word, 7512 05:35:32,760 --> 05:35:35,400 which we have been for weeks and weeks and weeks now. 
7513 05:35:36,640 --> 05:35:39,760 And so we want to sort by values, not key. 7514 05:35:39,760 --> 05:35:42,880 So this is an example of where we're going to construct 7515 05:35:42,880 --> 05:35:45,920 a data structure, we're going to imagine a data structure, 7516 05:35:45,920 --> 05:35:47,240 and then we're going to write code 7517 05:35:47,240 --> 05:35:48,320 to construct the data structure, 7518 05:35:48,320 --> 05:35:50,080 and then that's going to make our problem easy. 7519 05:35:50,080 --> 05:35:52,000 So this is an example of using 7520 05:35:52,000 --> 05:35:54,920 cleverly constructed data structures to do this. 7521 05:35:54,920 --> 05:35:57,500 And the data structure that we're going to create 7522 05:35:57,500 --> 05:36:00,880 is a list of tuples where the value is first 7523 05:36:00,880 --> 05:36:02,080 and the key is second. 7524 05:36:02,080 --> 05:36:05,240 So you can just with items get key value. 7525 05:36:05,240 --> 05:36:06,640 I want value key. 7526 05:36:06,640 --> 05:36:09,060 So let's take a look at this code. 7527 05:36:09,060 --> 05:36:10,920 Take your time and get it right. 7528 05:36:10,920 --> 05:36:13,180 So KV goes in C items. 7529 05:36:13,180 --> 05:36:15,800 Well, that is unsorted and going to have, 7530 05:36:15,800 --> 05:36:18,640 go through whatever A, B, and C, in whatever order. 7531 05:36:18,640 --> 05:36:20,280 And we're going to make a new list. 7532 05:36:20,280 --> 05:36:23,480 So this is a data structure that we're creating temporarily. 7533 05:36:23,480 --> 05:36:26,040 And what we're going to do is this is a list. 7534 05:36:28,920 --> 05:36:32,920 And we are going to append to that list a tuple. 7535 05:36:32,920 --> 05:36:35,980 So this is going to be a list of tuples. 7536 05:36:37,440 --> 05:36:41,920 Except we're not going to append them in key value order. 
7537 05:36:41,920 --> 05:36:44,000 We're going to flip them and append the first part 7538 05:36:44,000 --> 05:36:45,400 of the tuple is going to be the value 7539 05:36:45,400 --> 05:36:47,680 and the second part is going to be the key. 7540 05:36:47,680 --> 05:36:49,160 So we end up with this. 7541 05:36:49,160 --> 05:36:50,800 This is sort of our temporary data structure 7542 05:36:50,800 --> 05:36:54,760 that we have constructed to make our job really easy. 7543 05:36:54,760 --> 05:36:59,720 So this ends up being 10, A; 22, C; 1, B. 7544 05:36:59,720 --> 05:37:00,880 Now we just kind of flipped them. 7545 05:37:00,880 --> 05:37:03,960 We took this order and then we flipped them around. 7546 05:37:03,960 --> 05:37:05,480 And so now we have this nice little list 7547 05:37:05,480 --> 05:37:07,160 sitting in memory in a variable. 7548 05:37:07,160 --> 05:37:08,720 And that's really simple. 7549 05:37:08,720 --> 05:37:12,200 We can say, oh, look, we can use sorted. 7550 05:37:12,200 --> 05:37:14,900 And we can sort by now the values 7551 05:37:14,900 --> 05:37:16,520 because they're the first thing. 7552 05:37:16,520 --> 05:37:19,380 Sorted doesn't know how we produced this list. 7553 05:37:19,380 --> 05:37:21,780 It just looks at that and says, oh, that's a list of tuples. 7554 05:37:21,780 --> 05:37:24,520 I'm going to always sort by looking at the first item 7555 05:37:24,520 --> 05:37:25,660 in any tuple. 7556 05:37:25,660 --> 05:37:27,440 And I'm going to add reverse equals true 7557 05:37:27,440 --> 05:37:28,760 so I get a descending sort. 7558 05:37:28,760 --> 05:37:33,760 So I see that the value that is highest ends up being first. 7559 05:37:34,440 --> 05:37:37,280 And so that changes this, and I just sort it 7560 05:37:37,280 --> 05:37:39,720 and then reassign it back into temp. 7561 05:37:39,720 --> 05:37:40,980 And I'll print this out. 7562 05:37:40,980 --> 05:37:44,040 And so now you see it's sorted in descending order of value. 
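The flip-then-sort technique above, as a minimal sketch. The dictionary name `c` and the values 10, 22 and 1 are taken from the spoken walkthrough; the variable name `tmp` is an assumption.

```python
c = {'a': 10, 'c': 22, 'b': 1}

# Build a temporary list of (value, key) tuples.
tmp = list()
for k, v in c.items():
    tmp.append((v, k))
print(tmp)  # [(10, 'a'), (22, 'c'), (1, 'b')]

# sorted() compares the first element of each tuple first, which is now
# the value. reverse=True gives a descending sort.
tmp = sorted(tmp, reverse=True)
print(tmp)  # [(22, 'c'), (10, 'a'), (1, 'b')]
```

The design choice here is to reshape the data so that the built-in tuple ordering does the work, rather than writing a custom comparison.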
7563 05:37:44,040 --> 05:37:48,960 So it's value key, value key, value key, 7564 05:37:48,960 --> 05:37:51,800 but it's sorted in descending order, okay? 7565 05:37:51,800 --> 05:37:55,200 And so that's an example sort of of just like, 7566 05:37:55,200 --> 05:37:56,920 you know, if I just made a data structure, 7567 05:37:56,920 --> 05:37:58,080 I flipped those things around, 7568 05:37:58,080 --> 05:38:00,800 I could use sorted to sort these things. 7569 05:38:00,800 --> 05:38:02,120 There's many other ways you could do it, 7570 05:38:02,120 --> 05:38:04,520 but this is sort of like the more elegant way of doing it. 7571 05:38:04,520 --> 05:38:08,160 And the clever bit here is like make a new list 7572 05:38:08,160 --> 05:38:10,360 and make it be a little bit different, okay? 7573 05:38:10,360 --> 05:38:13,960 So here we're going to print out the top 10 7574 05:38:13,960 --> 05:38:16,280 most common words in a file. 7575 05:38:16,280 --> 05:38:18,120 And most of this code is review. 7576 05:38:18,120 --> 05:38:22,720 So if we take a look at it, we're gonna open a file. 7577 05:38:22,720 --> 05:38:25,360 We're gonna start a dictionary for our counting. 7578 05:38:25,360 --> 05:38:27,560 We're going to, you know, 7579 05:38:27,560 --> 05:38:32,560 there's gonna be words and lines, right? 7580 05:38:32,680 --> 05:38:34,240 And so we're gonna have a for loop. 7581 05:38:34,240 --> 05:38:36,920 This for loop is gonna go through each line. 7582 05:38:36,920 --> 05:38:38,240 And then of course we're gonna split them 7583 05:38:38,240 --> 05:38:40,040 which is busting them into pieces. 7584 05:38:40,040 --> 05:38:42,120 And then we have a for loop within that. 7585 05:38:42,120 --> 05:38:45,000 And this for loop is gonna go through each word. 
7586 05:38:45,000 --> 05:38:48,160 And so that means that by nesting these loops, 7587 05:38:48,160 --> 05:38:49,920 we're going through each line 7588 05:38:49,920 --> 05:38:51,440 and then within the line we're going through a word. 7589 05:38:51,440 --> 05:38:53,440 Then we go to the next line and go through the words. 7590 05:38:53,440 --> 05:38:55,280 And eventually this line of code, 7591 05:38:55,280 --> 05:38:58,560 counts[word] = counts.get(word, 0) + 1, 7592 05:38:58,560 --> 05:39:03,080 is our idiom for making a histogram, right? 7593 05:39:03,080 --> 05:39:05,040 This line right here is an idiom. 7594 05:39:05,040 --> 05:39:06,920 If you don't know already what that is, 7595 05:39:06,920 --> 05:39:09,360 go back to the previous dictionary lecture 7596 05:39:09,360 --> 05:39:11,040 and understand it, understand it, 7597 05:39:11,040 --> 05:39:14,000 because you're just gonna use it over and over again. 7598 05:39:14,000 --> 05:39:15,040 So now at this point, 7599 05:39:15,040 --> 05:39:17,360 and I always like drawing horizontal lines in code 7600 05:39:17,360 --> 05:39:18,200 when we write it. 7601 05:39:18,200 --> 05:39:20,280 At this point, coming through at this point, 7602 05:39:20,280 --> 05:39:21,480 counts is right. 7603 05:39:21,480 --> 05:39:22,960 Counts is the histogram. 7604 05:39:22,960 --> 05:39:24,000 It's not sorted. 7605 05:39:24,000 --> 05:39:25,440 So now we wanna sort it. 7606 05:39:25,440 --> 05:39:27,840 So we're going to make a new list. 7607 05:39:27,840 --> 05:39:29,840 We're gonna loop through key value. 7608 05:39:29,840 --> 05:39:31,280 And then we're gonna make a tuple. 7609 05:39:31,280 --> 05:39:32,920 I'm making this be two lines 7610 05:39:32,920 --> 05:39:34,440 to make it a little easier, value key. 7611 05:39:34,440 --> 05:39:35,440 So I'm flipping it, right? 7612 05:39:35,440 --> 05:39:37,800 So I'm flipping the order of these things. 7613 05:39:37,800 --> 05:39:38,800 That's making a tuple. 
7614 05:39:38,800 --> 05:39:42,280 And then I'm appending that tuple to the list, okay? 7615 05:39:42,280 --> 05:39:44,600 So at the end of this, 7616 05:39:44,600 --> 05:39:46,640 we have a list of tuples 7617 05:39:48,720 --> 05:39:53,720 in value key order, vk, vk, right? 7618 05:39:54,400 --> 05:39:56,120 So at this point, coming through here, 7619 05:39:56,120 --> 05:39:58,480 I've got in my LST variable, 7620 05:39:58,480 --> 05:40:01,080 I've got this really useful bit of code, 7621 05:40:01,080 --> 05:40:03,080 or useful bit of data that I produced. 7622 05:40:03,080 --> 05:40:05,600 And then I'm like, oh, now it's ready to be sorted. 7623 05:40:05,600 --> 05:40:07,000 Poof, sort. 7624 05:40:07,000 --> 05:40:10,160 So take list, sort it back and sort it in descending order, 7625 05:40:10,160 --> 05:40:11,840 and then stick that back in list. 7626 05:40:11,840 --> 05:40:13,440 Now we wanna print it out, 7627 05:40:13,440 --> 05:40:15,040 but we don't wanna print it out. 7628 05:40:16,960 --> 05:40:19,320 So we got a nice sorted list coming down here. 7629 05:40:19,320 --> 05:40:21,880 We don't wanna print it out in value key, 7630 05:40:21,880 --> 05:40:22,760 because that's what it is. 7631 05:40:22,760 --> 05:40:26,800 It's in parenthesis v, k order, but it's in sorted. 7632 05:40:26,800 --> 05:40:31,800 And we know that the highest value is here on down. 7633 05:40:31,800 --> 05:40:33,840 And so we're gonna say, we're gonna run through, 7634 05:40:33,840 --> 05:40:36,100 and now we're gonna go through this new list, 7635 05:40:36,100 --> 05:40:38,800 only the first 10, start at the beginning up to, 7636 05:40:38,800 --> 05:40:42,260 but not including number 10, which is the first 10, 7637 05:40:42,260 --> 05:40:44,760 for value key in, and so value is good. 
7638 05:40:44,760 --> 05:40:46,240 So this is the iteration variable 7639 05:40:46,240 --> 05:40:49,280 that's gonna go through each of these things, on and down, 7640 05:40:49,280 --> 05:40:51,100 and then we're just gonna print it out flipping it. 7641 05:40:51,100 --> 05:40:54,980 So we reflip it, flip, flip, we print it out key value, 7642 05:40:54,980 --> 05:40:56,820 and it's going to work. 7643 05:40:58,540 --> 05:41:03,280 Okay, so that is one way of doing this. 7644 05:41:03,280 --> 05:41:04,420 And this slide right here, 7645 05:41:04,420 --> 05:41:06,920 you absolutely do not need to figure out, 7646 05:41:06,920 --> 05:41:10,040 but some of you will look at this slide and you're like, 7647 05:41:10,040 --> 05:41:11,560 why didn't you show us that in the beginning? 7648 05:41:11,560 --> 05:41:14,000 And others of you will be like, no, no, no, no, no, 7649 05:41:14,000 --> 05:41:16,000 keep telling me this stuff here. 7650 05:41:16,000 --> 05:41:19,300 So I don't know exactly the term for this, 7651 05:41:19,300 --> 05:41:22,080 but this is a very procedural. 7652 05:41:22,080 --> 05:41:24,820 This is a classic algorithms and data structures approach 7653 05:41:24,820 --> 05:41:26,160 to solving this problem. 7654 05:41:27,600 --> 05:41:31,120 This next thing uses what are called lambdas, 7655 05:41:31,120 --> 05:41:33,320 and they kind of create what's called, 7656 05:41:33,320 --> 05:41:34,720 what I call a closed form, 7657 05:41:34,720 --> 05:41:36,280 where you kind of do it in all one statement, 7658 05:41:36,280 --> 05:41:38,360 and there's all this implicit stuff going on. 7659 05:41:38,360 --> 05:41:39,760 So if you don't get this right away, 7660 05:41:39,760 --> 05:41:41,820 don't worry too much about that. 7661 05:41:41,820 --> 05:41:46,820 But roughly, this single line does everything 7662 05:41:47,600 --> 05:41:50,000 that bottom half of that program does. 
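For reference, the top-ten program walked through above might look roughly like this. It is a sketch under assumptions: the lecture opens a real file, while this version reads a made-up in-memory text through `io.StringIO` so it runs on its own, and the names `fhand` and `newtup` are guesses at the slide's naming.

```python
import io

# Stand-in for an opened file; the text is invented for illustration.
fhand = io.StringIO(
    "the quick brown fox\n"
    "jumps over the lazy dog\n"
    "the dog barks\n"
)

# Top half: build the histogram of word counts.
counts = dict()
for line in fhand:
    words = line.split()
    for word in words:
        counts[word] = counts.get(word, 0) + 1

# Bottom half: flip each (key, value) pair into (value, key).
lst = list()
for key, val in counts.items():
    newtup = (val, key)
    lst.append(newtup)

# Sort descending, so the highest count comes first.
lst = sorted(lst, reverse=True)

# Print at most the first ten, flipped back to key, value order.
for val, key in lst[:10]:
    print(key, val)
```

Words with equal counts come out in reverse alphabetical order here, because the tie is broken by comparing the second element of each tuple.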
7663 05:41:50,000 --> 05:41:52,620 I mean, if you go back, if we go back to here, 7664 05:41:52,620 --> 05:41:55,760 it's pretty much this line does everything, 7665 05:41:55,760 --> 05:41:58,240 does that in one line, okay? 7666 05:41:58,240 --> 05:41:59,800 It doesn't create the counts, 7667 05:41:59,800 --> 05:42:01,680 and it doesn't print out the top 10, 7668 05:42:01,680 --> 05:42:04,680 but it does everything in that middle bit. 7669 05:42:04,680 --> 05:42:06,520 So let's take a look at this. 7670 05:42:06,520 --> 05:42:08,600 So we are gonna collapse this all down. 7671 05:42:08,600 --> 05:42:12,080 So we have a print, and that paren is the end of the print. 7672 05:42:12,080 --> 05:42:13,440 And then we have sorted, 7673 05:42:13,440 --> 05:42:17,960 and remember that sorted takes as input a list. 7674 05:42:17,960 --> 05:42:20,080 And so that's not too bad, and returns us a list. 7675 05:42:20,080 --> 05:42:22,280 And so we'll print the return from sorted. 7676 05:42:24,120 --> 05:42:26,160 And then this is the funny part, 7677 05:42:26,160 --> 05:42:27,720 the fun part, funny part. 7678 05:42:27,720 --> 05:42:32,100 This is called list comprehension. 7679 05:42:32,100 --> 05:42:33,720 And we have square brackets, 7680 05:42:33,720 --> 05:42:36,640 and we say to Python, this is a list. 7681 05:42:36,640 --> 05:42:39,120 But instead of listing the things, 7682 05:42:39,120 --> 05:42:41,920 or having a constant one comma two comma three, 7683 05:42:41,920 --> 05:42:43,640 or append, append, append, 7684 05:42:43,640 --> 05:42:45,620 we are going to create an expression 7685 05:42:45,620 --> 05:42:48,760 that will act as a generator for all the elements. 7686 05:42:48,760 --> 05:42:50,640 And so this basically says, 7687 05:42:50,640 --> 05:42:53,960 this is a list of two tuples, V and K, 7688 05:42:53,960 --> 05:42:56,400 and then this is sort of implied. 7689 05:42:56,400 --> 05:42:59,480 For all k, v in c.items(). 
7690 05:42:59,480 --> 05:43:01,960 And so this is like a for loop, 7691 05:43:01,960 --> 05:43:03,760 that is sort of driving this, 7692 05:43:03,760 --> 05:43:07,000 think of this as like stamp, stamp, stamp, stamp, stamp. 7693 05:43:07,000 --> 05:43:09,320 However many times it has to make a stamp. 7694 05:43:09,320 --> 05:43:11,480 And so that's producing a list. 7695 05:43:11,480 --> 05:43:13,380 Ch-ch-ch-ch-ch, right? 7696 05:43:13,380 --> 05:43:15,500 It just manufactures this list. 7697 05:43:15,500 --> 05:43:18,460 And then that list is sort of manufactured in the moment. 7698 05:43:18,460 --> 05:43:20,520 It's not stored, it's not put in a variable. 7699 05:43:20,520 --> 05:43:24,720 Python makes that list according to the stamping pattern 7700 05:43:24,720 --> 05:43:26,720 that you've told it to stamp out this list. 7701 05:43:26,720 --> 05:43:28,880 And then it passes that stamped out list 7702 05:43:28,880 --> 05:43:30,520 without even storing it in a variable, 7703 05:43:30,520 --> 05:43:33,280 into sorted, sorted moves the list around, 7704 05:43:33,280 --> 05:43:35,120 because it is just a list of tuples, 7705 05:43:35,120 --> 05:43:37,040 and then gives us back the sorted list. 7706 05:43:37,040 --> 05:43:40,560 And so I didn't put reverse equals true on here, 7707 05:43:40,560 --> 05:43:43,920 but you see that this is sorted in ascending order now 7708 05:43:43,920 --> 05:43:44,960 by value. 7709 05:43:44,960 --> 05:43:48,360 And I did that all in one little statement. 7710 05:43:49,360 --> 05:43:51,360 So look at this, 7711 05:43:51,360 --> 05:43:54,280 this is also one of the beautiful things about Python 7712 05:43:54,280 --> 05:43:55,640 that you can build these things, 7713 05:43:55,640 --> 05:43:57,680 and you can build more complex versions of this, 7714 05:43:57,680 --> 05:44:00,000 and there's a lot of real elegant things 7715 05:44:00,000 --> 05:44:02,920 that you can do in Python that are really succinct. 
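The one-statement version described above, sketched with the same small dictionary. The names `result` and `result_desc` are assumptions added so the value can be inspected; the lecture passes the list straight to `print`.

```python
c = {'a': 10, 'b': 1, 'c': 22}

# The list comprehension stamps out one (value, key) tuple per item,
# and the resulting list goes straight into sorted().
result = sorted([(v, k) for k, v in c.items()])
print(result)  # [(1, 'b'), (10, 'a'), (22, 'c')]

# Without reverse=True the order is ascending by value. Adding it
# gives the descending order used in the longer version.
result_desc = sorted([(v, k) for k, v in c.items()], reverse=True)
print(result_desc)  # [(22, 'c'), (10, 'a'), (1, 'b')]
```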
7716 05:44:02,920 --> 05:44:06,000 You should be careful, because in the beginning 7717 05:44:06,000 --> 05:44:07,800 I think this is easier to understand, 7718 05:44:07,800 --> 05:44:09,200 even though after a while you're like, 7719 05:44:09,200 --> 05:44:12,160 wait a sec, why am I putting all these extra lines in? 7720 05:44:12,160 --> 05:44:14,360 Because this is not so hard to understand, 7721 05:44:14,360 --> 05:44:17,680 but at some point you will want to master 7722 05:44:17,680 --> 05:44:21,200 this more powerful and more succinct version of Python 7723 05:44:21,200 --> 05:44:24,160 that expresses it in terms of the data you wanna see 7724 05:44:24,160 --> 05:44:26,000 rather than the steps you wanna take. 7725 05:44:26,960 --> 05:44:28,960 So this sort of finishes up tuples. 7726 05:44:28,960 --> 05:44:30,840 We've done a bunch of stuff. 7727 05:44:30,840 --> 05:44:33,440 I mean, really, they're simple and elegant. 7728 05:44:33,440 --> 05:44:36,280 Tuples, lists, and dictionaries are all related. 7729 05:44:36,280 --> 05:44:38,040 They're really three different, 7730 05:44:38,040 --> 05:44:40,360 kind of three foundational data structures, 7731 05:44:40,360 --> 05:44:42,880 three foundational collections of Python. 7732 05:44:42,880 --> 05:44:45,960 And we combine those in a lot of different ways. 7733 05:44:49,680 --> 05:44:51,200 And now in this little bit of lesson, 7734 05:44:51,200 --> 05:44:53,600 we are going to talk about some tuples, 7735 05:44:53,600 --> 05:44:57,040 and we're going to create a list of the most common words 7736 05:44:57,040 --> 05:45:00,600 and find out how to sort a dictionary by the values 7737 05:45:00,600 --> 05:45:02,400 instead of by the key. 7738 05:45:02,400 --> 05:45:04,920 We're gonna use the clown.txt file 7739 05:45:04,920 --> 05:45:07,080 and the intro.txt file. 
7740 05:45:07,080 --> 05:45:10,600 And I'm gonna start with the code from exercise nine 7741 05:45:10,600 --> 05:45:13,480 that I just did from chapter nine. 7742 05:45:13,480 --> 05:45:15,040 It's not exactly one of the exercises, 7743 05:45:15,040 --> 05:45:16,600 but it's very similar to them. 7744 05:45:16,600 --> 05:45:18,400 And I'm going to make a copy, 7745 05:45:18,400 --> 05:45:19,840 and I'm gonna keep it in the same folder. 7746 05:45:19,840 --> 05:45:21,920 I'm gonna keep it in the ex09 folder 7747 05:45:21,920 --> 05:45:26,520 and just call it ex10 because this code 7748 05:45:26,520 --> 05:45:30,240 is going to do much of the same stuff, 7749 05:45:30,240 --> 05:45:31,520 and it's gonna read these same files. 7750 05:45:31,520 --> 05:45:33,720 And so I've got myself exercise 10. 7751 05:45:33,720 --> 05:45:34,920 Exercise nine is still here. 7752 05:45:34,920 --> 05:45:37,960 Exercise 10 is now what I'm editing, exercise 10. 7753 05:45:37,960 --> 05:45:39,760 But I'm in the exercise nine folder. 7754 05:45:40,880 --> 05:45:45,800 So in exercise nine, we look for the most common word, 7755 05:45:45,800 --> 05:45:47,800 but we wanna find the five most common words, 7756 05:45:47,800 --> 05:45:49,840 which is gonna require us to sort. 7757 05:45:49,840 --> 05:45:51,360 So I'm gonna get rid of that code 7758 05:45:51,360 --> 05:45:52,920 because it's not really how we're gonna do it. 7759 05:45:52,920 --> 05:45:56,400 There we manually loop through it and found the maximum. 7760 05:45:56,400 --> 05:45:58,880 And so I'm gonna just run this. 7761 05:45:58,880 --> 05:46:03,880 CD, desktop, Python for everybody, ex09. 7762 05:46:05,320 --> 05:46:06,960 Now if I do an ls, you see that I've got 7763 05:46:06,960 --> 05:46:09,760 ex09.py intro.txt. 7764 05:46:09,760 --> 05:46:14,760 So I'll run python3 ex10.py and run the clown data. 
7765 05:46:17,160 --> 05:46:19,400 And we see that we see the dictionary 7766 05:46:19,400 --> 05:46:22,120 is properly making it in this code right here. 7767 05:46:22,120 --> 05:46:23,040 That doesn't change. 7768 05:46:23,040 --> 05:46:25,480 It reads the file, reads all the lines, 7769 05:46:25,480 --> 05:46:27,040 goes through and splits it into words, 7770 05:46:27,040 --> 05:46:28,160 and then goes through the words 7771 05:46:28,160 --> 05:46:31,720 and does the idiom of using dictionary get 7772 05:46:31,720 --> 05:46:33,360 to maintain the counters, 7773 05:46:33,360 --> 05:46:34,680 and we print it out at the very end. 7774 05:46:34,680 --> 05:46:38,240 So the new code we're going to write is down here. 7775 05:46:40,240 --> 05:46:42,440 So let's first do a few things. 7776 05:46:43,920 --> 05:46:48,920 If I can say x is equal to the dictionary, 7777 05:46:48,920 --> 05:46:53,920 dot items, and this gives us basically a list, print x. 7778 05:46:55,440 --> 05:46:58,480 This gives us a list of the key value pairs. 7779 05:46:58,480 --> 05:46:59,640 This prints out the dictionary, 7780 05:46:59,640 --> 05:47:02,120 but if we do it this way and use items, 7781 05:47:02,120 --> 05:47:04,840 it gives us the key value pairs. 7782 05:47:04,840 --> 05:47:07,120 Okay, and so that's what we got. 7783 05:47:07,120 --> 05:47:07,960 Key value pairs. 7784 05:47:07,960 --> 05:47:11,680 Now we can sort this based on the value 7785 05:47:11,680 --> 05:47:13,480 because tuples can be compared. 7786 05:47:13,480 --> 05:47:15,680 This can be compared with this. 7787 05:47:15,680 --> 05:47:20,120 And because d is lower than r, then this one is lower. 7788 05:47:20,120 --> 05:47:23,920 This whole, this ran tuple comes after the down tuple. 7789 05:47:23,920 --> 05:47:25,840 So we can sort this whole thing. 7790 05:47:25,840 --> 05:47:29,840 And I'll do this by just putting the word sorted here 7791 05:47:29,840 --> 05:47:31,760 and say, give me a sorted version of that. 
7792 05:47:31,760 --> 05:47:35,000 Now it's going to do it based on the order of the tuples. 7793 05:47:35,000 --> 05:47:37,960 This is going to be more, higher precedence than this. 7794 05:47:37,960 --> 05:47:39,800 So if I print it this way, 7795 05:47:42,080 --> 05:47:45,600 run it again, you'll see that it's sorted. 7796 05:47:45,600 --> 05:47:48,800 And now it's after, and, car, 7797 05:47:48,800 --> 05:47:51,160 it's in alphabetical order by key. 7798 05:47:51,160 --> 05:47:55,040 And so we could actually print the first five 7799 05:47:55,040 --> 05:47:56,880 up to, but not including five 7800 05:47:56,880 --> 05:48:00,960 by adding a list slice here. 7801 05:48:00,960 --> 05:48:05,680 And so that will show you only the first five, right? 7802 05:48:05,680 --> 05:48:07,160 Except that that's not what we're trying to do. 7803 05:48:07,160 --> 05:48:10,880 We really want to sort by this, okay? 7804 05:48:10,880 --> 05:48:15,880 So we have this mechanism that can take a list 7805 05:48:16,800 --> 05:48:19,320 and sort it based on the tuple values. 7806 05:48:19,320 --> 05:48:23,000 If we could create a list where it was one comma after 7807 05:48:23,000 --> 05:48:26,640 instead of after comma one and make it the exact same thing, 7808 05:48:26,640 --> 05:48:29,760 then we could actually then sort it and it would be fine. 7809 05:48:29,760 --> 05:48:30,760 Okay? 7810 05:48:30,760 --> 05:48:32,920 So let me show you a couple of ways, 7811 05:48:32,920 --> 05:48:35,160 at least one way to do that, okay? 7812 05:48:37,560 --> 05:48:38,720 Get rid of this. 7813 05:48:38,720 --> 05:48:43,000 We're gonna hand construct a list 7814 05:48:43,000 --> 05:48:46,760 and just call it temp equals, give me a new list. 7815 05:48:47,840 --> 05:48:49,640 Temp equals new list. 7816 05:48:49,640 --> 05:48:54,640 And then for K comma V in the dictionary.items. 7817 05:48:58,880 --> 05:49:02,480 And I'll just start by printing K comma V.
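A minimal sketch of the step just described: count the words, sort the key value pairs, and slice off the first five. The sample sentence stands in for the clown file and is an assumption, since the transcript does not show the file's contents; the names text, counts, by_key and first_five are illustrative too.

```python
# Sample text standing in for the clown file (assumed wording, not from the transcript).
text = ("the clown ran after the car and the car ran into the tent "
        "and the tent fell down on the clown and the car")

# Count words with the dictionary get idiom.
counts = {}
for word in text.split():
    counts[word] = counts.get(word, 0) + 1

# items() yields (key, value) tuples; sorted() compares them element by
# element, so the result is in alphabetical order by key.
by_key = sorted(counts.items())

# Slice: positions 0 up to, but not including, 5.
first_five = by_key[:5]
print(first_five)
```

This gives the first five words alphabetically, which is not the same as the five most common words. That is why the lecture goes on to flip each pair.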
7818 05:49:02,480 --> 05:49:05,880 So we see, and this is where it's really nice 7819 05:49:05,880 --> 05:49:08,280 to do these with the clown code first 7820 05:49:08,280 --> 05:49:11,560 and then only do your test on the bigger file later. 7821 05:49:11,560 --> 05:49:13,680 And so it's pretty much the same thing 7822 05:49:13,680 --> 05:49:16,520 we are going through in key value order, 7823 05:49:16,520 --> 05:49:19,520 which is dictionary order, which is not sorted at all. 7824 05:49:19,520 --> 05:49:20,440 Okay? 7825 05:49:20,440 --> 05:49:22,720 Now, instead of printing this out, 7826 05:49:22,720 --> 05:49:26,560 we are going to, let me do this in a couple of steps. 7827 05:49:26,560 --> 05:49:31,560 Make a new tuple and I'll just call it newt 7828 05:49:32,760 --> 05:49:36,600 equals parenthesis V comma K. 7829 05:49:36,600 --> 05:49:40,120 Okay, so this is, I'm saying make a new tuple. 7830 05:49:40,120 --> 05:49:42,960 This is like a new tuple with two items in it 7831 05:49:42,960 --> 05:49:47,120 and I'm gonna make the value and the key. 7832 05:49:47,120 --> 05:49:52,120 Okay, so then I'm going to say temp.append newt, newtuple. 7833 05:49:56,080 --> 05:49:59,760 So I'm gonna end up with a list of tuples. 7834 05:49:59,760 --> 05:50:01,400 Let me comment this one out 7835 05:50:01,400 --> 05:50:03,160 and I'm gonna then, when I'm done here, 7836 05:50:03,160 --> 05:50:07,000 I'm gonna print temp. 7837 05:50:11,200 --> 05:50:15,440 So if I run clown.txt, you see what happens in temp. 7838 05:50:15,440 --> 05:50:18,440 It's still, well, let's print temp twice. 7839 05:50:23,640 --> 05:50:26,360 I mean, it's not sorted, it's flipped. 7840 05:50:28,280 --> 05:50:32,280 Let's print it, that's okay. 7841 05:50:32,280 --> 05:50:34,040 We'll just, that's the flipped one. 7842 05:50:36,280 --> 05:50:38,440 Okay, so it's flipped and all we did is we made it, 7843 05:50:38,440 --> 05:50:41,800 instead of car comma three, it's three comma car. 
7844 05:50:41,800 --> 05:50:43,520 But now we have a list. 7845 05:50:44,800 --> 05:50:45,960 Okay? 7846 05:50:45,960 --> 05:50:49,160 So now it's flipped and now we can sort that. 7847 05:50:49,160 --> 05:50:54,160 We can say temp equals sorted temp. 7848 05:50:55,440 --> 05:50:59,000 So it says, take temp and sort it and give it back to me 7849 05:50:59,000 --> 05:51:04,000 and now I'm gonna say print sorted comma temp. 7850 05:51:10,680 --> 05:51:12,680 Okay, so here's the first print. 7851 05:51:13,680 --> 05:51:15,720 When we flipped it, we've got two, tent, 7852 05:51:17,040 --> 05:51:18,320 but it's not sorted at all. 7853 05:51:18,320 --> 05:51:20,680 But after we sorted it, it's sorted by tuple 7854 05:51:20,680 --> 05:51:22,880 and the lowest is one after. 7855 05:51:22,880 --> 05:51:26,160 So you'll notice that one is the same as one, 7856 05:51:26,160 --> 05:51:28,080 so it checked the second item in the tuple. 7857 05:51:28,080 --> 05:51:32,920 So after comes before down, fell comes after down. 7858 05:51:32,920 --> 05:51:36,000 Into, on, alphabetical order, but now we get the twos. 7859 05:51:36,000 --> 05:51:41,000 So all the ones sort there and then the twos come here, 7860 05:51:42,200 --> 05:51:44,360 but then within the twos, it's sorted in alphabetical order 7861 05:51:44,360 --> 05:51:48,980 because like a string, if the first character matches, 7862 05:51:48,980 --> 05:51:50,760 then it looks to the second character. 7863 05:51:50,760 --> 05:51:54,120 And then we see, oh, here we go, the threes 7864 05:51:54,120 --> 05:51:55,840 and then the one we actually wanted, 7865 05:51:55,840 --> 05:51:58,880 the highest one is the seven. 7866 05:51:58,880 --> 05:52:01,600 And so one of the things we can do is we can say, 7867 05:52:01,600 --> 05:52:03,360 you'll notice that we want the highest one, 7868 05:52:03,360 --> 05:52:04,640 not the lowest one.
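The flip and sort step might be sketched like this. The counts dictionary is typed in by hand to match the numbers mentioned in the lecture (seven for the, three for car, and so on); its exact contents are an assumption, and temp_sorted is an illustrative name.

```python
# Word counts assumed to match the clown example in the lecture.
counts = {"the": 7, "clown": 2, "ran": 2, "after": 1, "car": 3, "and": 3,
          "into": 1, "tent": 2, "fell": 1, "down": 1, "on": 1}

# Build a list of flipped tuples: (value, key) instead of (key, value).
temp = []
for k, v in counts.items():
    newt = (v, k)
    temp.append(newt)

# sorted() compares the first element of each tuple (the count) and only
# looks at the second element (the word) when the counts are equal.
temp_sorted = sorted(temp)
print(temp_sorted)
```

All the ones come first, in alphabetical order among themselves, and the tuple with the count of seven ends up last.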
7869 05:52:04,640 --> 05:52:07,840 So we can just tell this with this parameter, 7870 05:52:07,840 --> 05:52:09,940 reverse equals true. 7871 05:52:13,320 --> 05:52:15,560 And we just say, hey, sorted, do this backwards, 7872 05:52:15,560 --> 05:52:19,080 do it from highest to lowest rather than lowest to highest. 7873 05:52:19,080 --> 05:52:24,080 And now our sorted one says seven, the, et cetera. 7874 05:52:24,080 --> 05:52:25,600 Okay. 7875 05:52:25,600 --> 05:52:27,720 And so we want the first five, 7876 05:52:30,400 --> 05:52:35,400 we can say up to, but not including five. 7877 05:52:37,920 --> 05:52:40,780 So this is now the top five. 7878 05:52:42,820 --> 05:52:46,640 So the sorted one is, that's the top five. 7879 05:52:46,640 --> 05:52:48,560 If there's a tie, we're gonna go 7880 05:52:48,560 --> 05:52:49,960 in reverse alphabetical order, 7881 05:52:49,960 --> 05:52:52,720 but let's not worry about that too much for now. 7882 05:52:52,720 --> 05:52:56,720 So it makes a flipped list, then it sorts the flipped list. 7883 05:52:59,060 --> 05:53:02,880 Now, if I just wanted to print it out nicer, 7884 05:53:02,880 --> 05:53:04,840 I could loop through this new list. 7885 05:53:04,840 --> 05:53:09,480 I could say for V comma K, remember this is a flipped list. 7886 05:53:09,480 --> 05:53:11,560 So the sensible thing is what's coming out, 7887 05:53:11,560 --> 05:53:13,200 I mean, coming out of this list, 7888 05:53:13,200 --> 05:53:17,200 each tuple is value comma key in temp. 7889 05:53:17,200 --> 05:53:19,080 And I'm only gonna go up to, 7890 05:53:19,080 --> 05:53:21,840 but not including five, so the first five. 7891 05:53:21,840 --> 05:53:24,520 And so I'm pulling them back out as value key 7892 05:53:24,520 --> 05:53:26,360 because that's what they are. 7893 05:53:26,360 --> 05:53:31,360 They're value key, see value key, value key, value key.
7894 05:53:31,700 --> 05:53:33,340 So V is gonna go through these 7895 05:53:33,340 --> 05:53:34,920 and K is gonna go through these. 7896 05:53:34,920 --> 05:53:38,520 And then I'm just gonna print K comma V. 7897 05:53:38,520 --> 05:53:40,400 So this is kind of my flipping backwards 7898 05:53:40,400 --> 05:53:42,880 because I wanna see them this way. 7899 05:53:44,800 --> 05:53:47,160 And the is the most common one, then car, three. 7900 05:53:47,160 --> 05:53:50,120 And so it's just going through this up through the fifth one 7901 05:53:50,120 --> 05:53:51,720 and then printing them out. 7902 05:53:51,720 --> 05:53:55,320 Okay, so let me comment this out. 7903 05:53:55,320 --> 05:53:56,980 Let me comment that out. 7904 05:54:01,280 --> 05:54:03,460 Let me just delete this. 7905 05:54:03,460 --> 05:54:05,240 So we have a dictionary. 7906 05:54:05,240 --> 05:54:07,640 Let me comment out the dictionary. 7907 05:54:07,640 --> 05:54:09,720 We have a dictionary, we make a list 7908 05:54:09,720 --> 05:54:12,600 and we make these reversed tuples 7909 05:54:12,600 --> 05:54:14,440 where we have the value first and the key second. 7910 05:54:14,440 --> 05:54:17,680 We're setting it up so the sort's gonna work. 7911 05:54:17,680 --> 05:54:19,760 And then once it's sorted, we have to flip them back. 7912 05:54:19,760 --> 05:54:22,880 So we flip them for sorting from key value 7913 05:54:22,880 --> 05:54:25,160 to value key for sorting. 7914 05:54:25,160 --> 05:54:28,280 We do the sort, then we flip them back with key value 7915 05:54:28,280 --> 05:54:29,220 and print them out. 7916 05:54:34,640 --> 05:54:35,800 And it works fine. 7917 05:54:35,800 --> 05:54:40,800 So let's try our big file, intro.txt and there you go. 7918 05:54:41,180 --> 05:54:46,180 Those are the five most common words in intro.txt. 7919 05:54:46,580 --> 05:54:49,240 So you might ask yourself, why did we use tuples?
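Putting the whole recipe together (flip, sort highest first, take five, flip back for printing) could look roughly like this. The function name top_words and the hand-typed counts are assumptions made for illustration; the lecture's actual program reads its counts from a file.

```python
def top_words(counts, n=5):
    """Return the n most common (word, count) pairs, highest count first."""
    # Flip each pair so the count is compared first.
    temp = []
    for k, v in counts.items():
        temp.append((v, k))
    # reverse=True sorts from highest to lowest; ties on the count
    # therefore come out in reverse alphabetical order by word.
    temp = sorted(temp, reverse=True)
    # Flip the first n pairs back to (word, count) for display.
    result = []
    for v, k in temp[:n]:
        result.append((k, v))
    return result

# Counts assumed to match the clown example from the lecture.
counts = {"the": 7, "clown": 2, "ran": 2, "after": 1, "car": 3, "and": 3,
          "into": 1, "tent": 2, "fell": 1, "down": 1, "on": 1}

for word, count in top_words(counts):
    print(word, count)
```

With these counts the first two lines printed are the 7 and car 3, and car lands ahead of and because ties are broken in reverse alphabetical order.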
7920 05:54:49,240 --> 05:54:52,040 We probably, we could have really used lists for this 7921 05:54:52,040 --> 05:54:53,860 but tuples are more efficient than lists 7922 05:54:53,860 --> 05:54:56,380 and you notice that we weren't gonna modify. 7923 05:54:56,380 --> 05:54:59,000 We did modify the temp list, it's a list of tuples 7924 05:54:59,000 --> 05:55:02,760 but the tuples within the list, we weren't gonna modify. 7925 05:55:02,760 --> 05:55:05,720 And so we tend not to make lists 7926 05:55:05,720 --> 05:55:07,760 if we can get away with using tuples. 7927 05:55:07,760 --> 05:55:11,880 And so that's why we made this flipped tuple thing. 7928 05:55:11,880 --> 05:55:15,680 Okay, so I hope that was useful to you. 7929 05:55:15,680 --> 05:55:20,680 Hope to see you on the net. 7930 05:55:20,840 --> 05:55:24,080 Hello and welcome to chapter 11, regular expressions. 7931 05:55:24,080 --> 05:55:25,800 The fun thing about this chapter is 7932 05:55:25,800 --> 05:55:27,360 unlike all the rest of the chapters, 7933 05:55:27,360 --> 05:55:30,640 you sort of had to really understand every single thing 7934 05:55:30,640 --> 05:55:33,240 in chapters one through 11 built on one another, 7935 05:55:33,240 --> 05:55:35,500 one through 10 built on one another. 7936 05:55:35,500 --> 05:55:38,520 But you can really get along without using chapter 11. 7937 05:55:38,520 --> 05:55:41,000 It's not a really required topic 7938 05:55:41,000 --> 05:55:43,200 but it's a fun topic and an interesting topic. 7939 05:55:43,200 --> 05:55:46,880 So you can relax a little bit and realize 7940 05:55:46,880 --> 05:55:48,680 that you may or may not like regular expressions 7941 05:55:48,680 --> 05:55:49,960 and if you don't like them, that's okay. 7942 05:55:49,960 --> 05:55:50,880 You don't have to use them. 7943 05:55:50,880 --> 05:55:52,440 You can go for your whole life 7944 05:55:52,440 --> 05:55:54,500 without using regular expressions. 
7945 05:55:54,500 --> 05:55:56,800 The idea of a regular expression is that 7946 05:55:56,800 --> 05:55:59,960 you come up with a language. 7947 05:55:59,960 --> 05:56:02,040 It's a little character based programming language 7948 05:56:02,040 --> 05:56:06,080 where you can do smart searching basically. 7949 05:56:06,080 --> 05:56:07,800 Smart searching and, as you'll see in a bit, 7950 05:56:07,800 --> 05:56:11,020 smart extraction. 7951 05:56:11,020 --> 05:56:14,560 And it's really almost programmable wild card expressions. 7952 05:56:14,560 --> 05:56:16,280 There's no looping but there is looping 7953 05:56:16,280 --> 05:56:17,720 and there's all this implicit thing 7954 05:56:17,720 --> 05:56:19,640 and you say look for patterns that look like this 7955 05:56:19,640 --> 05:56:22,920 and then you give back things that match those patterns. 7956 05:56:22,920 --> 05:56:24,760 We do searching for everything. 7957 05:56:24,760 --> 05:56:26,920 We're looking through large blocks of text. 7958 05:56:26,920 --> 05:56:29,400 Say go find me everything that has the word Python in it 7959 05:56:29,400 --> 05:56:30,320 or something like that. 7960 05:56:30,320 --> 05:56:32,120 So that's just such a common thing to do 7961 05:56:32,120 --> 05:56:35,000 and regular expressions are a very structured way 7962 05:56:35,000 --> 05:56:37,220 to go about searching for information. 7963 05:56:37,220 --> 05:56:39,200 They're very powerful but they're also very cryptic 7964 05:56:39,200 --> 05:56:40,360 and you may not like them 7965 05:56:40,360 --> 05:56:42,600 but they're a lot of fun actually once you understand them. 7966 05:56:42,600 --> 05:56:45,360 Learning how to program them takes a while. 7967 05:56:45,360 --> 05:56:47,240 Writing good regular expression programs 7968 05:56:47,240 --> 05:56:50,240 requires some try it, play with it, check it, 7969 05:56:50,240 --> 05:56:51,520 try it, check it, try it, check it.
7970 05:56:51,520 --> 05:56:54,240 But once you get them they're really quite cool. 7971 05:56:54,240 --> 05:56:56,500 It's a very old programming language. 7972 05:56:57,800 --> 05:57:00,000 It comes almost from the 1960s. 7973 05:57:00,000 --> 05:57:02,420 The concept of it's a theory of computing 7974 05:57:02,420 --> 05:57:03,260 where they were trying to come up 7975 05:57:03,260 --> 05:57:05,200 with theory of languages and regular expressions 7976 05:57:05,200 --> 05:57:09,400 was one form of languages that computers could understand. 7977 05:57:09,400 --> 05:57:11,960 And so it has some fun old words. 7978 05:57:11,960 --> 05:57:16,320 And one of the advantages of knowing regular expressions 7979 05:57:16,320 --> 05:57:18,720 is that you're kind of a cool person. 7980 05:57:18,720 --> 05:57:22,040 You can take a quick look at this XKCD 7981 05:57:22,040 --> 05:57:25,360 that sort of captures the devil may care, 7982 05:57:25,360 --> 05:57:28,880 awesome power that regular expressions do. 7983 05:57:28,880 --> 05:57:32,600 And while we're at it, you know, 7984 05:57:32,600 --> 05:57:33,720 while we're talking about awesome, 7985 05:57:33,720 --> 05:57:35,120 I do want to take this moment 7986 05:57:35,120 --> 05:57:36,800 and show you my awesome tattoos. 7987 05:57:36,800 --> 05:57:39,080 And so you may not know this 7988 05:57:39,080 --> 05:57:40,480 but I got a couple tattoos here. 7989 05:57:40,480 --> 05:57:42,000 Here's the first tattoo. 7990 05:57:42,000 --> 05:57:44,000 This is where I went to, got my PhD 7991 05:57:44,000 --> 05:57:46,520 and this is my University of Michigan faculty 7992 05:57:46,520 --> 05:57:47,360 member position. 7993 05:57:47,360 --> 05:57:50,560 I got PhD in engineering and I teach in a school 7994 05:57:50,560 --> 05:57:52,160 of information and library science. 7995 05:57:52,160 --> 05:57:53,880 And then I have this other tattoo 7996 05:57:53,880 --> 05:57:57,840 and this tattoo is what I call the ring of compliance. 
7997 05:57:57,840 --> 05:57:59,400 I work on learning management systems 7998 05:57:59,400 --> 05:58:01,600 and educational technology and standards. 7999 05:58:01,600 --> 05:58:02,640 And there's this standard called 8000 05:58:02,640 --> 05:58:03,840 learning tools interoperability, 8001 05:58:03,840 --> 05:58:06,320 which if you're using this course 8002 05:58:06,320 --> 05:58:07,280 and doing the auto grader, 8003 05:58:07,280 --> 05:58:09,520 it uses learning tools interoperability to integrate 8004 05:58:09,520 --> 05:58:11,000 into whatever learning management system 8005 05:58:11,000 --> 05:58:12,520 you happen to be using. 8006 05:58:12,520 --> 05:58:14,280 And one of those learning management systems 8007 05:58:14,280 --> 05:58:15,920 is the open source learning management system 8008 05:58:15,920 --> 05:58:17,600 that I helped write called Sakai. 8009 05:58:17,600 --> 05:58:20,120 And these are the rest of the major vendors. 8010 05:58:20,120 --> 05:58:22,800 And the idea of that tattoo was 8011 05:58:22,800 --> 05:58:26,160 that I would put the tattoo of every vendor 8012 05:58:26,160 --> 05:58:28,120 that would comply with learning tools interoperability. 8013 05:58:28,120 --> 05:58:29,080 So you'll notice Coursera, 8014 05:58:29,080 --> 05:58:31,680 I help Coursera put learning tools interoperability in. 8015 05:58:31,680 --> 05:58:33,920 And so the auto graders integrate into Coursera, 8016 05:58:33,920 --> 05:58:36,360 Blackboard or Canvas or Sakai or Moodle 8017 05:58:36,360 --> 05:58:37,400 or often those other things. 8018 05:58:37,400 --> 05:58:41,200 So it's just like a cool techno thing, 8019 05:58:41,200 --> 05:58:43,520 just like regular expressions. 8020 05:58:43,520 --> 05:58:46,640 So I've got a URL here for regular expression quick guide. 
8021 05:58:46,640 --> 05:58:48,440 You might wanna print this out 8022 05:58:48,440 --> 05:58:50,880 so that you can look at it 8023 05:58:50,880 --> 05:58:53,280 even while you're watching this lecture 8024 05:58:53,280 --> 05:58:55,240 because it's a little programming language 8025 05:58:55,240 --> 05:58:56,480 except that it's character based, 8026 05:58:56,480 --> 05:58:58,240 not line based and not keyword based. 8027 05:58:58,240 --> 05:59:00,480 It has certain active characters 8028 05:59:00,480 --> 05:59:02,960 that the character means something 8029 05:59:02,960 --> 05:59:06,840 versus the character represents the character itself. 8030 05:59:06,840 --> 05:59:08,200 And so the regular expressions 8031 05:59:08,200 --> 05:59:09,760 is not part of the base Python 8032 05:59:09,760 --> 05:59:11,080 but it's distributed with Python. 8033 05:59:11,080 --> 05:59:13,680 So you have to put an import re at the top 8034 05:59:13,680 --> 05:59:15,000 to say that's really saying 8035 05:59:15,000 --> 05:59:17,040 pull in the regular expression library. 8036 05:59:17,040 --> 05:59:20,680 And there is a couple of functions inside that re.search 8037 05:59:20,680 --> 05:59:22,620 which is kind of like a really smart version 8038 05:59:22,620 --> 05:59:25,120 of the find method inside of strings. 8039 05:59:25,120 --> 05:59:28,800 And re.findall which is kind of like 8040 05:59:28,800 --> 05:59:31,160 taking and stamping your way through a loop 8041 05:59:31,160 --> 05:59:33,960 through a string and finding all of the things 8042 05:59:33,960 --> 05:59:38,040 that match a particular pattern and then extracting those. 8043 05:59:38,040 --> 05:59:41,220 And we'll talk about both of these in this lecture. 8044 05:59:42,220 --> 05:59:44,700 So here's a really simple piece of code 8045 05:59:44,700 --> 05:59:46,160 where I'm just gonna sort of show you 8046 05:59:46,160 --> 05:59:47,760 sort of before and after. 
8047 05:59:47,760 --> 05:59:50,620 So here's a thing where we're looking for lines 8048 05:59:50,620 --> 05:59:52,200 that begin with from colon. 8049 05:59:52,200 --> 05:59:55,040 And so we open a file, we loop through the whole file, 8050 05:59:55,040 --> 05:59:57,020 we strip off the lines text 8051 05:59:57,020 --> 05:59:59,720 and then we say if line.find from 8052 05:59:59,720 --> 06:00:02,280 is greater than equal to zero, then we print it. 8053 06:00:02,280 --> 06:00:03,840 It gives you negative one if it's not found. 8054 06:00:03,840 --> 06:00:05,920 And so reads all the lines and once in a while 8055 06:00:05,920 --> 06:00:07,000 it'll print it out, reads all the lines 8056 06:00:07,000 --> 06:00:08,120 once in a while print it out. 8057 06:00:08,120 --> 06:00:10,920 So that's kind of like a needle in a haystack. 8058 06:00:10,920 --> 06:00:12,360 Use regular expressions to do that. 8059 06:00:12,360 --> 06:00:14,700 We have to import the regular expression library. 8060 06:00:14,700 --> 06:00:16,360 These lines are the same, we're gonna loop through, 8061 06:00:16,360 --> 06:00:17,480 we're gonna strip. 8062 06:00:17,480 --> 06:00:20,120 And now we're gonna say if re.search, 8063 06:00:20,120 --> 06:00:22,120 the way to say this is within the library 8064 06:00:22,120 --> 06:00:25,360 regular expressions, go find the search function 8065 06:00:25,360 --> 06:00:30,320 and search for the string from in the string line. 8066 06:00:30,320 --> 06:00:32,560 So this is the line to search whereas here 8067 06:00:32,560 --> 06:00:34,920 it was more object-oriented where we say line.find 8068 06:00:34,920 --> 06:00:37,800 and we say re.search and we pass in line as parameter. 
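Here is a small sketch of the before and after just described, using line.find and then re.search on the same lines. The sample lines and the names found_with_find and found_with_search are made up for illustration; the lecture loops over a file instead of a list.

```python
import re

# Sample lines standing in for the mail file (assumed for illustration).
lines = [
    "From: stephen@example.com",
    "Subject: meeting notes",
    "X-Note: forwarded From: a gateway",
    "Date: Sat, 5 Jan 2008",
]

# Before: the string method find() returns -1 when the text is not found.
found_with_find = []
for line in lines:
    line = line.rstrip()
    if line.find("From:") >= 0:
        found_with_find.append(line)

# After: re.search() returns a match object if the pattern occurs anywhere
# in the line, or None if it does not.
found_with_search = []
for line in lines:
    line = line.rstrip()
    if re.search("From:", line):
        found_with_search.append(line)

print(found_with_find)
print(found_with_search)
```

Both loops select the same lines, including the one where From: appears in the middle of the line.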
8069 06:00:37,800 --> 06:00:38,880 These two things are equivalent 8070 06:00:38,880 --> 06:00:40,180 which means most of the time it's gonna run 8071 06:00:40,180 --> 06:00:42,500 and once in a while hit a line and it'll print that out 8072 06:00:42,500 --> 06:00:44,080 and then it'll finish the whole thing. 8073 06:00:44,080 --> 06:00:49,080 So that is doing what we would do with the find operation 8074 06:00:49,960 --> 06:00:50,980 with regular expressions. 8075 06:00:50,980 --> 06:00:55,000 Now, searching with regular expressions 8076 06:00:55,000 --> 06:00:57,000 has these special characters and so here we have 8077 06:00:57,000 --> 06:00:59,440 the same basic code except now we're saying 8078 06:00:59,440 --> 06:01:02,440 if line starts with from, so we're not using find anymore 8079 06:01:03,680 --> 06:01:06,080 and that way we're only gonna get that thing 8080 06:01:06,080 --> 06:01:08,480 in the first position not like blah blah blah blah 8081 06:01:08,480 --> 06:01:11,600 from colon, we don't want that to match, 8082 06:01:11,600 --> 06:01:13,760 we only want it to match here at the beginning of the line. 8083 06:01:13,760 --> 06:01:15,400 And so that's what we use line starts with. 8084 06:01:15,400 --> 06:01:17,280 So it's gonna do the same thing and find lines 8085 06:01:17,280 --> 06:01:20,560 that have the prefix and print those out and then be done. 8086 06:01:20,560 --> 06:01:22,900 Now in regular expression search we don't in a sense 8087 06:01:22,900 --> 06:01:25,560 change the method, we have a certain number of things 8088 06:01:25,560 --> 06:01:28,040 we can do with strings based on what they built in. 8089 06:01:28,040 --> 06:01:30,260 But in regular expression we actually can turn 8090 06:01:30,260 --> 06:01:33,120 this first parameter into code. 
8091 06:01:33,120 --> 06:01:36,760 And so what's happening here is the caret, 8092 06:01:36,760 --> 06:01:38,140 if you go back to the little cheat sheet, 8093 06:01:38,140 --> 06:01:40,800 caret means this is the beginning of line. 8094 06:01:40,800 --> 06:01:42,880 It's a virtual character that matches the beginning line. 8095 06:01:42,880 --> 06:01:44,840 It's like from that starts at the beginning. 8096 06:01:44,840 --> 06:01:47,780 So from at the beginning does match 8097 06:01:47,780 --> 06:01:49,480 and from in the middle does not match 8098 06:01:49,480 --> 06:01:51,240 by putting that little caret there. 8099 06:01:51,240 --> 06:01:53,160 Same thing, line is what we're searching 8100 06:01:53,160 --> 06:01:55,160 and then from is what we caret from. 8101 06:01:55,160 --> 06:01:58,160 Line from at the beginning is what we're looking for. 8102 06:01:58,160 --> 06:02:00,160 And so again it does the exact same thing. 8103 06:02:00,160 --> 06:02:01,720 Only prints lines that have from colon 8104 06:02:01,720 --> 06:02:03,980 is the first character in the line. 8105 06:02:03,980 --> 06:02:06,480 So the difference is we look for a method 8106 06:02:06,480 --> 06:02:09,360 and the other one is we program the regular expression. 8107 06:02:09,360 --> 06:02:13,480 So we're gonna run out of methods in the string class 8108 06:02:13,480 --> 06:02:15,680 long before we run out of things that we can do 8109 06:02:15,680 --> 06:02:17,400 with regular expressions. 8110 06:02:18,520 --> 06:02:20,720 And so a couple other special characters 8111 06:02:20,720 --> 06:02:23,640 that caret matches the beginning of the line. 8112 06:02:23,640 --> 06:02:25,600 So caret matches the beginning of the line. 8113 06:02:25,600 --> 06:02:28,120 This capital X matches itself. 
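The startswith version and the caret version can be compared side by side. This is only a sketch with made-up sample lines; the variable names are illustrative.

```python
import re

# Sample lines (assumed for illustration); one has From: in the middle.
lines = [
    "From: stephen@example.com",
    "Subject: meeting notes",
    "X-Note: forwarded From: a gateway",
]

# String method: only true when the line begins with the prefix.
with_startswith = [line for line in lines if line.startswith("From:")]

# Regular expression: the caret anchors the match to the beginning of the
# line, so From: in the middle of a line no longer matches.
with_caret = [line for line in lines if re.search("^From:", line)]

print(with_startswith)
print(with_caret)
```

The difference is that the first version needed a different method, while the second only needed a different pattern string.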
8114 06:02:28,120 --> 06:02:31,080 Dot is a wildcard that matches any character 8115 06:02:31,080 --> 06:02:33,440 and then some of the characters in regular expressions 8116 06:02:33,440 --> 06:02:35,760 modify the immediately preceding character. 8117 06:02:35,760 --> 06:02:39,400 And so that says look for a line that starts with X 8118 06:02:39,400 --> 06:02:42,840 and then has many characters, that's these two things. 8119 06:02:42,840 --> 06:02:45,840 Zero or more characters followed by a colon. 8120 06:02:45,840 --> 06:02:47,640 And so you can see that it's sort of, 8121 06:02:47,640 --> 06:02:49,660 it's this sort of like expanding stamp. 8122 06:02:49,660 --> 06:02:51,160 It's like oh there's an X at the beginning of the line, 8123 06:02:51,160 --> 06:02:52,520 that line, it looks good. 8124 06:02:52,520 --> 06:02:54,160 I got some characters here and then I got a colon, 8125 06:02:54,160 --> 06:02:55,000 that's good. 8126 06:02:55,000 --> 06:02:59,920 So this is an X, some characters, and a colon, check. 8127 06:02:59,920 --> 06:03:02,400 X, some characters, and a colon, check. 8128 06:03:02,400 --> 06:03:04,680 X, and these things, away we go. 8129 06:03:04,680 --> 06:03:06,820 And so you can, that's what's gonna match. 8130 06:03:06,820 --> 06:03:09,440 And so you can see how some of these characters are special. 8131 06:03:09,440 --> 06:03:11,000 Again, go back to your cheat sheet. 8132 06:03:11,000 --> 06:03:12,000 Some of them are special 8133 06:03:12,000 --> 06:03:13,640 and some of them are actual characters. 8134 06:03:13,640 --> 06:03:16,300 And this colon and X are just, they're not special, 8135 06:03:16,300 --> 06:03:19,160 they're just the characters, okay? 8136 06:03:19,160 --> 06:03:22,920 Now, sometimes you wanna be a little more clear 8137 06:03:22,920 --> 06:03:23,760 on your match. 8138 06:03:23,760 --> 06:03:25,680 So, let's take a look at these lines 8139 06:03:25,680 --> 06:03:28,720 that match that particular thing that we just did. 
8140 06:03:28,720 --> 06:03:31,720 So we have these two, X dash Sieve colon, 8141 06:03:31,720 --> 06:03:33,280 X dash DSPAM dash Result, 8142 06:03:33,280 --> 06:03:34,840 like these are from mail messages. 8143 06:03:34,840 --> 06:03:36,800 And then one of the mail messages has a line in it 8144 06:03:36,800 --> 06:03:39,740 that says X dash Plane is behind schedule. 8145 06:03:39,740 --> 06:03:41,880 And this matches. 8146 06:03:41,880 --> 06:03:43,480 Is that what you really wanted? 8147 06:03:43,480 --> 06:03:46,080 And so what we can basically say is, 8148 06:03:46,080 --> 06:03:48,780 because this is an X, this is some number of characters, 8149 06:03:48,780 --> 06:03:50,740 and that's a colon, it matches. 8150 06:03:50,740 --> 06:03:51,580 It has to match. 8151 06:03:51,580 --> 06:03:55,600 That's this rule applied to this line results in a yes. 8152 06:03:55,600 --> 06:03:56,600 It does. 8153 06:03:56,600 --> 06:04:00,040 And so how can you be a little more clear 8154 06:04:00,040 --> 06:04:01,440 as to what you want to match 8155 06:04:01,440 --> 06:04:03,440 and what you don't want to match? 8156 06:04:03,440 --> 06:04:06,360 So we can write code. 8157 06:04:06,360 --> 06:04:09,200 So now what we're going to say is, 8158 06:04:10,760 --> 06:04:12,340 we wanna match the beginning of the line 8159 06:04:12,340 --> 06:04:15,080 and we wanna capital X and we wanna dash. 8160 06:04:15,080 --> 06:04:16,940 So now we're gonna match those first two characters, 8161 06:04:16,940 --> 06:04:19,360 X dash at the beginning of the line. 8162 06:04:19,360 --> 06:04:21,400 Caret X dash says first two characters of the line 8163 06:04:21,400 --> 06:04:22,640 must be X dash. 8164 06:04:22,640 --> 06:04:24,160 Now we have another special character. 8165 06:04:24,160 --> 06:04:25,980 Again, refer to your cheat sheet. 8166 06:04:25,980 --> 06:04:30,980 Backslash capital S means a non-whitespace character, right?
8167 06:04:31,240 --> 06:04:33,200 Any character other than whitespace. 8168 06:04:33,200 --> 06:04:35,640 And then plus means one or more times, 8169 06:04:35,640 --> 06:04:38,040 one or more non-whitespace characters. 8170 06:04:38,040 --> 06:04:39,860 That's what this whole thing says. 8171 06:04:39,860 --> 06:04:41,720 One or more non-whitespace characters 8172 06:04:41,720 --> 06:04:43,800 and followed by a colon, which is just a character. 8173 06:04:43,800 --> 06:04:46,200 So now we have X dash followed by one or more 8174 06:04:46,200 --> 06:04:48,480 non-whitespace characters followed by a colon. 8175 06:04:48,480 --> 06:04:50,920 X dash followed by one or more non-whitespace characters 8176 06:04:50,920 --> 06:04:52,040 followed by a colon. 8177 06:04:52,040 --> 06:04:54,160 Here we have X dash followed by one or more, 8178 06:04:54,160 --> 06:04:55,960 whoops, there's a space there. 8179 06:04:55,960 --> 06:04:57,260 And so this doesn't match. 8180 06:04:57,260 --> 06:04:58,760 Even though there's a colon there, 8181 06:04:58,760 --> 06:05:01,160 it means that between the dash and the colon, 8182 06:05:01,160 --> 06:05:02,600 you can only have some number 8183 06:05:02,600 --> 06:05:04,140 of non-whitespace characters. 8184 06:05:04,140 --> 06:05:07,560 So this is a no, it does not match. 8185 06:05:07,560 --> 06:05:11,040 And so you just can, if you didn't wanna match this, 8186 06:05:11,040 --> 06:05:15,520 you then sort of create a more precise, 8187 06:05:15,520 --> 06:05:17,320 you know, we could even have a thing that said, 8188 06:05:17,320 --> 06:05:20,080 I want X dash with an uppercase character, 8189 06:05:20,080 --> 06:05:21,840 uppercase letter, if you wanted to. 8190 06:05:21,840 --> 06:05:23,760 And so there's all kind of fine tuning 8191 06:05:23,760 --> 06:05:28,120 if you sort of learn the structure that you've got to do. 
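To see the loose pattern and the tighter pattern side by side, here is a short sketch. The three sample lines follow the ones read out in the lecture, but the text after each colon is an assumption, as are the names loose and strict.

```python
import re

# Header-like lines as described in the lecture; the text after each
# colon is assumed for illustration.
lines = [
    "X-Sieve: CMU Sieve 2.3",
    "X-DSPAM-Result: Innocent",
    "X-Plane is behind schedule: two weeks",
]

# Loose: X at the start, then any characters (zero or more), then a colon.
loose = [line for line in lines if re.search(r"^X.*:", line)]

# Tighter: X- at the start, then one or more non-whitespace characters,
# then a colon. A space before the colon makes the match fail.
strict = [line for line in lines if re.search(r"^X-\S+:", line)]

print(loose)   # all three lines
print(strict)  # only the two real header lines
```

The patterns are written as raw strings (the r prefix) so that Python passes the backslash in \S through to the regular expression library unchanged.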
8192 06:05:28,120 --> 06:05:29,960 And so that's kind of the matching 8193 06:05:29,960 --> 06:05:32,720 where you're taking a whole line and taking this template 8194 06:05:32,720 --> 06:05:36,040 and deciding if the template anywhere in that line matches. 8195 06:05:36,040 --> 06:05:37,640 And now what we're gonna do is use this 8196 06:05:37,640 --> 06:05:40,840 to actually pull data out of strings 8197 06:05:40,840 --> 06:05:45,840 using the regular expression library. 8198 06:05:46,640 --> 06:05:48,600 So now we're going to move from merely matching 8199 06:05:48,600 --> 06:05:49,840 to matching and extracting. 8200 06:05:49,840 --> 06:05:51,220 So we're going to say, hey, 8201 06:05:51,220 --> 06:05:53,560 I would like to not only have you take this template, 8202 06:05:53,560 --> 06:05:55,240 this little pattern, the string pattern, 8203 06:05:55,240 --> 06:05:58,160 regular expression pattern, run it across the line, 8204 06:05:58,160 --> 06:06:00,680 I want you to give me all the ones that match 8205 06:06:00,680 --> 06:06:01,960 and I want a list of those. 8206 06:06:01,960 --> 06:06:04,020 And that's what we're going to use the find all. 8207 06:06:04,020 --> 06:06:06,080 So search gives a true false, 8208 06:06:06,080 --> 06:06:09,080 find all gives a list of all the strings that match. 8209 06:06:09,080 --> 06:06:10,040 So if there's four of them, 8210 06:06:10,040 --> 06:06:11,400 you'll get four things in the list. 8211 06:06:11,400 --> 06:06:14,000 If there's nothing that matches, you'll get an empty list. 8212 06:06:14,000 --> 06:06:17,140 So let's take a look at what we got going here. 8213 06:06:17,140 --> 06:06:20,320 So instead of calling search, we call find all. 8214 06:06:20,320 --> 06:06:22,680 We still pass in the string that we're looking through. 8215 06:06:22,680 --> 06:06:25,840 And then we have our little template pattern. 8216 06:06:25,840 --> 06:06:29,480 And this is a new bit of regular expression. 
8217 06:06:29,480 --> 06:06:32,600 Any little bracket operation, square bracket 8218 06:06:32,600 --> 06:06:35,400 is one character, that's just a character, 8219 06:06:35,400 --> 06:06:39,840 but then there in between here is a set of allowed characters. 8220 06:06:39,840 --> 06:06:43,120 So zero dash nine means a single digit. 8221 06:06:43,120 --> 06:06:45,440 Zero, one, two, three, four, five, six, seven, eight, 8222 06:06:45,440 --> 06:06:48,560 or nine, but that's really one character. 8223 06:06:48,560 --> 06:06:51,040 And then we have, so that's one character. 8224 06:06:52,000 --> 06:06:53,560 And then when the plus applies to that, 8225 06:06:53,560 --> 06:06:56,800 which means if we look at this whole thing, 8226 06:06:56,800 --> 06:06:59,620 this whole thing says one or more digits. 8227 06:06:59,620 --> 06:07:02,280 That's the code we write in a regular expression 8228 06:07:02,280 --> 06:07:03,720 that says one or more digits. 8229 06:07:03,720 --> 06:07:04,660 And we're just gonna use that 8230 06:07:04,660 --> 06:07:06,540 in our regular expression by itself. 8231 06:07:06,540 --> 06:07:09,400 So we're going to look for any string 8232 06:07:09,400 --> 06:07:11,960 that's one or more digits and pull it out 8233 06:07:11,960 --> 06:07:12,800 and give it back to me. 8234 06:07:12,800 --> 06:07:14,920 So it's gonna look, so that's my little template, 8235 06:07:14,920 --> 06:07:17,120 stamp, stamp, stamp, stamp, oop, got it. 8236 06:07:17,120 --> 06:07:19,160 Stamp, stamp, stamp, stamp, stamp, stamp, stamp, stamp, 8237 06:07:19,160 --> 06:07:22,260 oop, got it, stamp, stamp, stamp, stamp, got it. 8238 06:07:22,260 --> 06:07:25,000 So what we get back after we ask find all, 8239 06:07:25,000 --> 06:07:27,960 to find all of the one or more digit strings 8240 06:07:27,960 --> 06:07:30,840 is two, 19, and 42.
8241 06:07:30,840 --> 06:07:32,480 So it actually parsed it, it split it, 8242 06:07:32,480 --> 06:07:33,760 it found all these things and said, 8243 06:07:33,760 --> 06:07:38,080 I found them all for you and here they are, two, 19, and 42. 8244 06:07:38,080 --> 06:07:40,800 So it's a list of three strings 8245 06:07:40,800 --> 06:07:41,720 because that's how many it found. 8246 06:07:41,720 --> 06:07:43,000 Now it might have found none 8247 06:07:43,000 --> 06:07:45,680 and we would have got an empty list at that point. 8248 06:07:45,680 --> 06:07:48,640 But it found some, okay? 8249 06:07:48,640 --> 06:07:51,600 So just as an example, we did this thing, 8250 06:07:51,600 --> 06:07:54,760 we get two, 19, and 42, but if I said this, 8251 06:07:54,760 --> 06:07:59,760 that basically is a uppercase vowel, A, E, I, O, or U. 8252 06:08:00,240 --> 06:08:04,440 So that's one letter and that's one or more. 8253 06:08:04,440 --> 06:08:08,040 So it's saying something like A, A would match, 8254 06:08:08,040 --> 06:08:12,060 E, I would match, O, O would match. 8255 06:08:12,060 --> 06:08:14,000 But if you look now, it's saying, okay, 8256 06:08:14,000 --> 06:08:18,400 I'm looking for one or more, minimum one or more uppercase, 8257 06:08:18,400 --> 06:08:20,360 A, E, I, O, U is a set of characters, 8258 06:08:20,360 --> 06:08:22,680 one or more uppercase letters. 8259 06:08:22,680 --> 06:08:25,360 And so it says, look, do you find, oh, there's an uppercase, 8260 06:08:25,360 --> 06:08:27,760 but it's an M, no, no, no, no uppercase, no uppercase, 8261 06:08:27,760 --> 06:08:30,880 no uppercase, no uppercase, found nothing, 8262 06:08:30,880 --> 06:08:35,140 did not find anything and so it gives us back an empty list. 8263 06:08:35,140 --> 06:08:37,940 And so it's like, find all the things that match this 8264 06:08:37,940 --> 06:08:41,400 and the answer is, none match, here's your list of nothing. 
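Here is a small sketch of the two findall calls just walked through. The sample sentence is an assumption chosen so that the digits come out as two, 19, and 42 and the only uppercase letter is an M.

```python
import re

x = "My 2 favorite numbers are 19 and 42"

# [0-9]+ means one or more digits; findall returns every match as a string.
digits = re.findall("[0-9]+", x)
print(digits)

# [AEIOU]+ means one or more uppercase vowels; the only capital here is "M",
# so nothing matches and we get an empty list back, not False.
vowels = re.findall("[AEIOU]+", x)
print(vowels)

# Checking for "no matches" is a length (or truthiness) check on the list.
if len(vowels) == 0:
    print("no uppercase vowels found")
```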
8265 06:08:41,400 --> 06:08:43,400 Okay, and so you have to check, 8266 06:08:43,400 --> 06:08:45,240 that's how you have to check whether you even got something 8267 06:08:45,240 --> 06:08:47,080 because it's not gonna return you false, 8268 06:08:47,080 --> 06:08:49,540 it returns you a list with no items in it. 8269 06:08:50,480 --> 06:08:53,800 Now, the way it works, like I said, 8270 06:08:53,800 --> 06:08:55,480 it sort of is taking this template 8271 06:08:55,480 --> 06:08:58,040 and stamping it across the line, 8272 06:08:58,040 --> 06:08:59,400 stamping across the characters. 8273 06:08:59,400 --> 06:09:03,280 Now, there's a behavior that might not be intuitive, 8274 06:09:03,280 --> 06:09:05,360 intuitive to you at the very beginning, 8275 06:09:05,360 --> 06:09:08,160 and that's the notion of what we call greedy matching. 8276 06:09:08,160 --> 06:09:10,320 And that is, when it can match 8277 06:09:10,320 --> 06:09:13,720 more than one possible string, overlapping string, 8278 06:09:13,720 --> 06:09:16,640 it chooses the longest of the overlapping strings. 8279 06:09:16,640 --> 06:09:19,360 And so the easiest way to show this is with an example, 8280 06:09:19,360 --> 06:09:21,880 and we're saying, I want something that starts with an F 8281 06:09:21,880 --> 06:09:25,440 with one or more characters and ends with a colon. 8282 06:09:25,440 --> 06:09:28,880 So that's my little stamp, that's my template. 8283 06:09:28,880 --> 06:09:31,680 So starts with an F, good, that's good. 8284 06:09:31,680 --> 06:09:34,440 One or more characters, da da da da da, have a colon. 8285 06:09:34,440 --> 06:09:38,000 So that could be from colon, that would match. 8286 06:09:38,000 --> 06:09:39,400 But look, I've got another colon here, 8287 06:09:39,400 --> 06:09:40,800 and this is just continuing on 8288 06:09:40,800 --> 06:09:42,960 with one or more characters and this.
8289 06:09:42,960 --> 06:09:45,300 So the question is, do we get this 8290 06:09:45,300 --> 06:09:47,380 or do we get this part, right? 8291 06:09:47,380 --> 06:09:49,740 And the answer is, with greedy matching, 8292 06:09:49,740 --> 06:09:52,680 is we get the larger of the two, okay? 8293 06:09:52,680 --> 06:09:56,160 And so what you get back is somewhat counterintuitive. 8294 06:09:56,160 --> 06:09:58,640 You get the whole thing as the match, from colon, 8295 06:09:58,640 --> 06:10:01,640 using the, we could have got from colon, 8296 06:10:01,640 --> 06:10:03,800 but the reason it picks this is this one's longer. 8297 06:10:03,800 --> 06:10:06,720 So any time it has a choice, it picks the longer one, 8298 06:10:06,720 --> 06:10:08,360 and that's what greedy is, meaning, 8299 06:10:08,360 --> 06:10:10,760 it probably better described as larger 8300 06:10:10,760 --> 06:10:15,760 or tending toward the longest string or something like that. 8301 06:10:16,200 --> 06:10:18,260 So you can, of course, suppress this behavior, 8302 06:10:18,260 --> 06:10:20,820 like everything, in programming regular expressions, 8303 06:10:20,820 --> 06:10:23,520 you simply add another character. 8304 06:10:23,520 --> 06:10:26,560 And so now, it's going to say, 8305 06:10:26,560 --> 06:10:29,120 I would like to start with letter F, 8306 06:10:29,120 --> 06:10:31,080 any character, one or more times, 8307 06:10:31,080 --> 06:10:33,360 and then this question mark, this is still one, 8308 06:10:33,360 --> 06:10:37,320 you know, one little thing, non-greedy, okay? 8309 06:10:38,440 --> 06:10:41,160 And so that just says, do it not greedy, 8310 06:10:41,160 --> 06:10:43,480 which just means that it prefers 8311 06:10:43,480 --> 06:10:44,760 the shorter of the strings. 
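A short sketch of greedy versus non-greedy matching as described here. The sample string is an assumption: it starts with "From:" and contains a second colon later on, so the pattern has two possible places to stop.

```python
import re

x = "From: Using the : character"

# Greedy: .+ grabs as many characters as it can, so the match runs
# all the way to the LAST colon on the line.
greedy = re.findall("^F.+:", x)
print(greedy)

# Non-greedy: adding ? after the + makes it stop at the FIRST colon.
non_greedy = re.findall("^F.+?:", x)
print(non_greedy)
```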
8312 06:10:44,760 --> 06:10:48,480 And so now, it could still match this string or this string, 8313 06:10:48,480 --> 06:10:50,680 but because it's been told to not be greedy, 8314 06:10:50,680 --> 06:10:52,120 it chooses this string instead, 8315 06:10:52,120 --> 06:10:53,360 and that's the string that we get. 8316 06:10:53,360 --> 06:10:54,880 And so that's the not greedy, 8317 06:10:54,880 --> 06:10:57,400 and you just add the question mark after the asterisk. 8318 06:10:57,400 --> 06:10:59,760 So it's usually an asterisk question mark 8319 06:10:59,760 --> 06:11:02,600 or a plus question mark, so that's a two-character thing, 8320 06:11:02,600 --> 06:11:04,820 that's zero or more characters, non-greedy, 8321 06:11:04,820 --> 06:11:07,480 and that's one or more characters, non-greedy. 8322 06:11:07,480 --> 06:11:08,800 Actually, most of the time, 8323 06:11:10,160 --> 06:11:12,360 it seems to me that the non-greedy 8324 06:11:12,360 --> 06:11:13,600 would be the more reasonable default, 8325 06:11:13,600 --> 06:11:14,800 but that's not how it is. 8326 06:11:14,800 --> 06:11:17,760 Greedy is the default, and non-greedy is optional. 8327 06:11:17,760 --> 06:11:21,920 Now, we can play some more with this stuff, okay? 8328 06:11:21,920 --> 06:11:25,640 And so let's take a look at this little example 8329 06:11:25,640 --> 06:11:30,280 where we have non-blank characters, backslash capital S, 8330 06:11:30,280 --> 06:11:32,840 one or more of those non-blank characters, 8331 06:11:32,840 --> 06:11:34,160 followed by an at sign, 8332 06:11:34,160 --> 06:11:36,480 and then again, one or more non-blank characters. 8333 06:11:36,480 --> 06:11:38,560 So this is looking for strings that have an at sign 8334 06:11:38,560 --> 06:11:40,780 with non-blank characters on both sides. 8335 06:11:40,780 --> 06:11:43,960 This is an example of where it sort of comes to this at, 8336 06:11:43,960 --> 06:11:46,440 and it goes this way, and it does it in a greedy manner.
8337 06:11:46,440 --> 06:11:48,320 If you told it to not be greedy, 8338 06:11:48,320 --> 06:11:50,920 it would give you this, these three characters, 8339 06:11:50,920 --> 06:11:52,280 but we're telling it to go greedy, 8340 06:11:52,280 --> 06:11:53,760 so it goes all the way to here, 8341 06:11:53,760 --> 06:11:57,280 and stops at this blank, and then stops at this blank. 8342 06:11:57,280 --> 06:11:58,680 And so that's a nice little thing. 8343 06:11:58,680 --> 06:12:03,360 Find the at signs, go to the first blank, blank, 8344 06:12:03,360 --> 06:12:04,440 and pull that stuff out. 8345 06:12:04,440 --> 06:12:06,520 And so that, with one little match, 8346 06:12:06,520 --> 06:12:07,620 you've pulled this thing out. 8347 06:12:07,620 --> 06:12:10,780 Now, of course, we've done that before with other techniques. 8348 06:12:11,880 --> 06:12:15,560 So that's just another way to pull stuff out. 8349 06:12:16,680 --> 06:12:21,160 Now, if we, we get this whole thing, 8350 06:12:21,160 --> 06:12:23,680 but what if that's not exactly what we wanted? 8351 06:12:23,680 --> 06:12:28,680 We can tell, we can give it a matching string 8352 06:12:30,200 --> 06:12:32,000 that's different than the extracting string 8353 06:12:32,000 --> 06:12:35,000 by adding parentheses, and so here's another example 8354 06:12:35,000 --> 06:12:38,640 where we basically say, this is our string, 8355 06:12:38,640 --> 06:12:43,640 we wanna match from at the beginning, followed by a space, 8356 06:12:43,960 --> 06:12:47,160 followed by, ignore the parentheses for the minute, 8357 06:12:47,160 --> 06:12:48,460 one or more non-blank characters, 8358 06:12:48,460 --> 06:12:49,640 followed by an at sign, 8359 06:12:49,640 --> 06:12:51,560 followed by one or more non-blank characters. 8360 06:12:51,560 --> 06:12:54,040 So this is also going to, if there's no from, 8361 06:12:54,040 --> 06:12:56,040 it's not going to be looking for that, right? 
8362 06:12:56,040 --> 06:12:58,800 So it demands the from is here, so it matches that, 8363 06:12:58,800 --> 06:13:00,840 and the space is demanded as well. 8364 06:13:00,840 --> 06:13:03,640 And then it says, oh, non-blank characters, great. 8365 06:13:03,640 --> 06:13:05,040 I got an at sign, great. 8366 06:13:05,040 --> 06:13:06,960 Non-blank characters, oops, stop there. 8367 06:13:06,960 --> 06:13:09,560 And so this is what's going to match. 8368 06:13:09,560 --> 06:13:11,900 Now, the key is that we don't actually want that back 8369 06:13:11,900 --> 06:13:13,120 in our extraction. 8370 06:13:13,120 --> 06:13:15,220 What we really want back in our extraction 8371 06:13:15,220 --> 06:13:17,000 is this part right here. 8372 06:13:17,000 --> 06:13:19,480 So what we do is we put parentheses in. 8373 06:13:19,480 --> 06:13:22,080 Parentheses don't, are a code, 8374 06:13:22,080 --> 06:13:24,720 they're code in the regular expression world. 8375 06:13:24,720 --> 06:13:26,620 Parentheses say, start your extraction 8376 06:13:26,620 --> 06:13:27,880 and end your extraction. 8377 06:13:27,880 --> 06:13:31,120 And so when you do this with a parenthesis, 8378 06:13:31,120 --> 06:13:33,840 when you do it without a parenthesis, 8379 06:13:33,840 --> 06:13:36,360 you get the whole from, right? 8380 06:13:36,360 --> 06:13:37,620 Without a parenthesis. 8381 06:13:39,440 --> 06:13:42,440 Oh wait, no, okay, that doesn't have the from in it, so. 8382 06:13:43,360 --> 06:13:46,360 But if you do that with the parenthesis, 8383 06:13:46,360 --> 06:13:50,640 you match the from but you only get the this bit 8384 06:13:50,640 --> 06:13:51,840 to come out as well. 8385 06:13:51,840 --> 06:13:55,960 So you can add this to make the matching part more precise 8386 06:13:55,960 --> 06:13:58,160 but without changing what you get returned 8387 06:13:58,160 --> 06:14:00,380 and you specify what you want to get returned 8388 06:14:00,380 --> 06:14:01,900 with the parentheses. 
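To make the difference between what is matched and what is extracted concrete, here is a sketch using two sample lines. Both lines and the second address are assumptions made up for illustration.

```python
import re

from_line = "From stephen.marquard@uct.ac.za Sat Jan  5 09:14:16 2008"
other_line = "Cc: someone@example.com"

# No parentheses: the whole match is returned, on any line with an address.
plain_from = re.findall(r"\S+@\S+", from_line)
plain_other = re.findall(r"\S+@\S+", other_line)

# With parentheses: the line must start with "From " to match at all,
# but only the part inside the parentheses is returned.
strict_from = re.findall(r"^From (\S+@\S+)", from_line)
strict_other = re.findall(r"^From (\S+@\S+)", other_line)

print(plain_from, plain_other)
print(strict_from, strict_other)
```

The stricter pattern returns the same address for the From line, and an empty list for the line that does not start with From.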
8389 06:14:03,020 --> 06:14:06,520 So next I want to show you just a couple of different ways 8390 06:14:06,520 --> 06:14:08,240 to use these newfound skills. 8391 06:14:12,440 --> 06:14:14,040 So now what we want to do is use some of these 8392 06:14:14,040 --> 06:14:16,940 newfound skills in some more practical applications 8393 06:14:16,940 --> 06:14:18,780 of regular expressions. 8394 06:14:18,780 --> 06:14:22,580 So let's go back to the way we first tore apart strings 8395 06:14:22,580 --> 06:14:26,080 and look at the situation where if you recall, 8396 06:14:26,080 --> 06:14:28,040 we just wanted the host name, right? 8397 06:14:28,040 --> 06:14:29,880 This is an email address and we're interested 8398 06:14:29,880 --> 06:14:30,800 in the host name. 8399 06:14:30,800 --> 06:14:34,320 So we have this string and we go find the at, right? 8400 06:14:34,320 --> 06:14:38,220 The find looks up and tells us the at is at position 21. 8401 06:14:38,220 --> 06:14:39,600 And then what we do is we say, okay, 8402 06:14:39,600 --> 06:14:42,040 let's look beyond there to the space 8403 06:14:42,040 --> 06:14:45,680 and that tells us the space is in position 31. 8404 06:14:45,680 --> 06:14:48,840 And then we're saying we can extract starting 8405 06:14:48,840 --> 06:14:51,680 beyond the at sign up to but not including the space 8406 06:14:51,680 --> 06:14:55,540 by saying at post plus one colon space position. 8407 06:14:55,540 --> 06:14:58,280 And when we get that, now we have to have a thing 8408 06:14:58,280 --> 06:15:00,860 that decides to only look at this on from lines 8409 06:15:00,860 --> 06:15:02,540 but then it can print out the host 8410 06:15:02,540 --> 06:15:04,980 that is extracting of this information. 8411 06:15:04,980 --> 06:15:07,600 So that was one way that we did that, right? 8412 06:15:07,600 --> 06:15:08,600 One way we did it. 
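The find-and-slice approach just recapped looks roughly like this. The sample line is an assumption chosen to agree with the positions mentioned (the at sign at 21 and the following space at 31).

```python
data = "From stephen.marquard@uct.ac.za Sat Jan  5 09:14:16 2008"

# Position of the at sign.
atpos = data.find("@")
print(atpos)

# Position of the first space AFTER the at sign.
sppos = data.find(" ", atpos)
print(sppos)

# Slice from one past the at sign, up to but not including the space.
host = data[atpos + 1 : sppos]
print(host)
```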
8413 06:15:08,600 --> 06:15:12,440 The next way we did this was the double split pattern, 8414 06:15:13,320 --> 06:15:14,160 right? 8415 06:15:14,160 --> 06:15:15,760 So we said, okay, let's take this line, 8416 06:15:15,760 --> 06:15:19,400 let's break it into words based on spaces. 8417 06:15:19,400 --> 06:15:20,320 That's what words is. 8418 06:15:20,320 --> 06:15:25,040 So that's zero, one, two, three, four, five, six. 8419 06:15:25,040 --> 06:15:26,680 And then we know that the email address 8420 06:15:26,680 --> 06:15:30,040 on lines that start with from space is the second one. 8421 06:15:30,040 --> 06:15:31,480 So we pull out email address, 8422 06:15:31,480 --> 06:15:34,880 which pulls this bit out into email. 8423 06:15:34,880 --> 06:15:37,520 And then we're gonna split that again 8424 06:15:37,520 --> 06:15:38,800 based on the at sign. 8425 06:15:40,160 --> 06:15:42,440 So we're gonna split this part again based on the at sign. 8426 06:15:42,440 --> 06:15:43,520 So it splits right there 8427 06:15:43,520 --> 06:15:46,320 and then this becomes the zero and one in pieces. 8428 06:15:46,320 --> 06:15:49,200 And then pieces sub one is that host. 8429 06:15:49,200 --> 06:15:51,820 And if we print that out, we get the host. 8430 06:15:51,820 --> 06:15:53,320 So that's the double split pattern. 8431 06:15:53,320 --> 06:15:55,280 Nice thing about that is you don't have to keep track. 8432 06:15:55,280 --> 06:15:57,640 The little plus ones kind of annoying 8433 06:15:57,640 --> 06:15:59,840 to use the space position. 8434 06:15:59,840 --> 06:16:02,360 That previous one, that's just hard to remember. 8435 06:16:02,360 --> 06:16:06,240 It's just, I've written this code way too many times 8436 06:16:06,240 --> 06:16:08,440 in my career and I've made mistakes 8437 06:16:08,440 --> 06:16:10,040 and I have to debug it every single time. 8438 06:16:10,040 --> 06:16:11,240 And I print all these numbers out. 8439 06:16:11,240 --> 06:16:13,200 I'm like, did I get it right? 
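A sketch of the double split pattern, using the same kind of From line as before (the sample line is an assumption for illustration).

```python
line = "From stephen.marquard@uct.ac.za Sat Jan  5 09:14:16 2008"

# First split: break the line into words on whitespace.
words = line.split()

# On lines that start with "From ", the address is the second word.
email = words[1]

# Second split: break the address at the at sign.
pieces = email.split("@")

# pieces[0] is the user name, pieces[1] is the host.
print(pieces[1])
```

There are no positions or plus-ones to keep track of, which is why this version is harder to get wrong.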
8440 06:16:13,200 --> 06:16:14,240 Oh, I did it in Python. 8441 06:16:14,240 --> 06:16:15,080 I did it in Java. 8442 06:16:15,080 --> 06:16:15,960 I did it in C. 8443 06:16:15,960 --> 06:16:17,360 Wait a second, did I do it differently? 8444 06:16:17,360 --> 06:16:20,760 And so it's, so this is a lot cleaner. 8445 06:16:20,760 --> 06:16:22,400 I mean, I can write this every time 8446 06:16:22,400 --> 06:16:23,680 and I know it's gonna work every time. 8447 06:16:23,680 --> 06:16:25,200 I barely even need to test this code 8448 06:16:25,200 --> 06:16:27,200 because it's so obvious. 8449 06:16:27,200 --> 06:16:29,400 So double split is another way of extracting stuff. 8450 06:16:29,400 --> 06:16:31,960 But if we look at this thing with the regular expression, 8451 06:16:31,960 --> 06:16:33,680 we can say, oh, okay, 8452 06:16:33,680 --> 06:16:37,480 let's use a regular expression to do this. 8453 06:16:37,480 --> 06:16:39,560 So we'll start looking through the string. 8454 06:16:39,560 --> 06:16:40,880 We'll start by saying, hey, 8455 06:16:40,880 --> 06:16:43,120 let's look until we find an at sign. 8456 06:16:44,320 --> 06:16:47,960 Then let's start extracting with the parentheses. 8457 06:16:47,960 --> 06:16:51,520 And then once we have found the at sign, 8458 06:16:51,520 --> 06:16:54,000 let's look for non-blank characters. 8459 06:16:54,000 --> 06:16:56,680 This is a set of characters. 8460 06:16:56,680 --> 06:17:01,680 This caret as the first character means not, so not a blank. 8461 06:17:01,680 --> 06:17:03,840 So that's another way to do non-blank, 8462 06:17:03,840 --> 06:17:07,200 a set of characters which is everything but blank. 8463 06:17:07,200 --> 06:17:09,280 That's what this little bit is saying. 8464 06:17:09,280 --> 06:17:12,080 Star means zero or more times, 8465 06:17:12,080 --> 06:17:13,720 which means it's gonna run, run, run, run, run 8466 06:17:13,720 --> 06:17:15,980 until it finds a blank which is gonna stop it.
8467 06:17:15,980 --> 06:17:18,200 The greediness is what keeps pushing it, right? 8468 06:17:18,200 --> 06:17:19,720 It's, this is a greedy match. 8469 06:17:19,720 --> 06:17:20,840 That asterisk is greedy 8470 06:17:20,840 --> 06:17:22,520 because there's no question mark after it. 8471 06:17:22,520 --> 06:17:26,920 And so that does go and starts at the at sign 8472 06:17:28,040 --> 06:17:31,200 with the parentheses, goes to the space, 8473 06:17:31,200 --> 06:17:33,640 and that's the end parentheses and that's what prints out. 8474 06:17:33,640 --> 06:17:37,160 Now, Y is gonna be a list that's a one item list 8475 06:17:37,160 --> 06:17:39,080 that has the string in it that we're looking for, 8476 06:17:39,080 --> 06:17:40,840 but you can just go sub-zero 8477 06:17:40,840 --> 06:17:43,180 to get that guy right out of there, okay? 8478 06:17:43,180 --> 06:17:45,680 So that's sort of the regular expression version of it. 8479 06:17:45,680 --> 06:17:49,720 But we can make this a more fine-tuned thing. 8480 06:17:49,720 --> 06:17:53,320 So we can say, look, we also wanna pick the line 8481 06:17:53,320 --> 06:17:55,560 and we wanna know if there are, 8482 06:17:55,560 --> 06:17:58,440 if we don't get that line, we wanna skip it. 8483 06:17:58,440 --> 06:18:00,320 If we do get the line, we wanna extract the data. 8484 06:18:00,320 --> 06:18:03,000 And we can do this all in a single regular expression. 8485 06:18:03,000 --> 06:18:06,380 So again, we say start from the beginning of the line. 8486 06:18:06,380 --> 06:18:08,800 And if it's gotta be a from, followed by a space, 8487 06:18:08,800 --> 06:18:11,800 and then followed by any number of characters, 8488 06:18:11,800 --> 06:18:14,200 dot star, followed by an at sign. 8489 06:18:14,200 --> 06:18:17,440 So this has to match, we see a space, 8490 06:18:17,440 --> 06:18:19,040 then we're gonna have any number of characters, 8491 06:18:19,040 --> 06:18:20,720 and then we're gonna see an at sign. 
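The regular expression version of the host extraction can be sketched like this. The sample line is an assumption; the variable name y follows the narration.

```python
import re

line = "From stephen.marquard@uct.ac.za Sat Jan  5 09:14:16 2008"

# @       find an at sign
# (       start extracting
# [^ ]*   zero or more characters that are NOT a blank (greedy)
# )       stop extracting
y = re.findall("@([^ ]*)", line)
print(y)

# y is a one-item list, so index 0 gives the host string itself.
host = y[0]
print(host)
```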
8492 06:18:20,720 --> 06:18:23,200 And then we're going to start extracting, 8493 06:18:23,200 --> 06:18:24,600 and then we're gonna go non-blank, non-blank, 8494 06:18:24,600 --> 06:18:27,120 non-blank, non-blank, non-blank, oop, blank, 8495 06:18:27,120 --> 06:18:29,720 stop extracting, and out that comes. 8496 06:18:29,720 --> 06:18:31,960 And this has the advantage over the previous one 8497 06:18:31,960 --> 06:18:34,680 in that it's much more precise. 8498 06:18:34,680 --> 06:18:38,120 If we look at the previous one, while it works on good lines, 8499 06:18:38,120 --> 06:18:39,760 it might actually trigger on lines 8500 06:18:39,760 --> 06:18:41,400 that we actually don't wanna see. 8501 06:18:41,400 --> 06:18:43,720 So this allows us to refine it 8502 06:18:43,720 --> 06:18:46,280 so it only actually does this to lines that we care about. 8503 06:18:46,280 --> 06:18:49,560 So it's sort of both an if statement 8504 06:18:49,560 --> 06:18:53,080 and a splitting, extracting, going on all at the same time 8505 06:18:53,080 --> 06:18:55,500 by having a bigger string that we're matching 8506 06:18:55,500 --> 06:18:56,560 than we're extracting. 8507 06:18:56,560 --> 06:19:00,400 So it's a way to kind of clean up your data. 8508 06:19:00,400 --> 06:19:02,600 So here is a simple program 8509 06:19:02,600 --> 06:19:05,080 where we're going to just put all this together 8510 06:19:05,080 --> 06:19:07,040 and actually accomplish something. 8511 06:19:07,040 --> 06:19:09,080 And so we're gonna read through 8512 06:19:09,080 --> 06:19:12,720 and look for lines in a file that have this form. 8513 06:19:12,720 --> 06:19:14,780 And we're gonna extract this number, 8514 06:19:14,780 --> 06:19:19,780 and then we're going to compute the maximum of this. 8515 06:19:20,520 --> 06:19:21,640 So we're gonna extract this number 8516 06:19:21,640 --> 06:19:23,720 and then convert it to a float and compute the maximum.
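Here is a sketch of why the more precise pattern (From at the start of the line, then any characters, then the at sign) is safer than the bare at-sign pattern. The second sample line is made up to show a false trigger.

```python
import re

good = "From stephen.marquard@uct.ac.za Sat Jan  5 09:14:16 2008"
noise = "Meet me @noon tomorrow"

loose = "@([^ ]*)"            # triggers on any line with an at sign
strict = "^From .*@([^ ]*)"   # also demands "From " at the start of the line

loose_good = re.findall(loose, good)
loose_noise = re.findall(loose, noise)
strict_good = re.findall(strict, good)
strict_noise = re.findall(strict, noise)

print(loose_good, loose_noise)
print(strict_good, strict_noise)
```

The strict pattern acts as the if statement and the extraction at the same time: the noise line simply gives back an empty list.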
8517 06:19:23,720 --> 06:19:26,520 So we're gonna open a file, 8518 06:19:26,520 --> 06:19:29,360 we're gonna write a for loop, we're gonna strip. 8519 06:19:29,360 --> 06:19:30,840 So we're gonna do this for every line in the file, 8520 06:19:30,840 --> 06:19:33,480 but the first thing we wanna do is not get line, 8521 06:19:33,480 --> 06:19:36,640 we wanna discard all the lines except ones that have this. 8522 06:19:36,640 --> 06:19:39,360 So our regular expression is look for lines 8523 06:19:39,360 --> 06:19:42,460 that start with X-DSPAM-Confidence colon. 8524 06:19:42,460 --> 06:19:43,940 So that's a pretty strong match. 8525 06:19:43,940 --> 06:19:46,600 If that's not there, we're not gonna get anything. 8526 06:19:46,600 --> 06:19:49,320 And then there's a space, there's a space, 8527 06:19:49,320 --> 06:19:51,540 and then start extracting, 8528 06:19:51,540 --> 06:19:56,440 and then go as long, one or more digits and dots, 8529 06:19:56,440 --> 06:19:58,840 that's a single character, and that's one or more, 8530 06:19:58,840 --> 06:19:59,960 and then stop extracting. 8531 06:19:59,960 --> 06:20:01,840 So that says start extracting, 8532 06:20:01,840 --> 06:20:04,800 da da da da, greedy, greedy, greedy, greedy, stop extracting. 8533 06:20:04,800 --> 06:20:06,800 And so that's what we're going to get. 8534 06:20:06,800 --> 06:20:09,520 Now, if the line doesn't have this, 8535 06:20:09,520 --> 06:20:12,540 meaning it's missing something in some way, 8536 06:20:12,540 --> 06:20:14,000 whether it's this prefix or this number. 8537 06:20:14,000 --> 06:20:16,200 If the number's missing, it's gonna fail too. 8538 06:20:16,200 --> 06:20:19,640 We're going to get back a list, an empty list. 8539 06:20:19,640 --> 06:20:21,640 So the first thing you have to do is check to see 8540 06:20:21,640 --> 06:20:23,440 if you actually got a match.
8541 06:20:23,440 --> 06:20:25,720 So you say if the number of items in the list, 8542 06:20:25,720 --> 06:20:28,640 len of stuff, is not equal to one, continue. 8543 06:20:28,640 --> 06:20:33,160 And so this is the skip all the lines that don't match. 8544 06:20:33,160 --> 06:20:35,040 Skip, skip, skip, skip, skip, skip. 8545 06:20:35,040 --> 06:20:37,240 So there could be thousands of lines that don't match. 8546 06:20:37,240 --> 06:20:39,640 But then, when this match hits, 8547 06:20:39,640 --> 06:20:42,520 it's gonna come down and fall through, right? 8548 06:20:42,520 --> 06:20:46,000 So that, most of the lines will skip up, 8549 06:20:46,000 --> 06:20:47,880 but then when we actually get one, 8550 06:20:47,880 --> 06:20:52,160 and we know instantly that we've got one and stuff sub zero 8551 06:20:52,160 --> 06:20:55,460 because that's what we extracted, is this number. 8552 06:20:55,460 --> 06:20:57,220 And we can take the floating point of it. 8553 06:20:57,220 --> 06:20:58,520 We append it to our list. 8554 06:20:58,520 --> 06:21:00,240 We made a list to store them. 8555 06:21:00,240 --> 06:21:01,640 That runs. 8556 06:21:01,640 --> 06:21:02,580 The list grows. 8557 06:21:03,800 --> 06:21:06,080 And then we just say, what was the largest one? 8558 06:21:06,080 --> 06:21:09,520 And so you can run this and see that. 8559 06:21:09,520 --> 06:21:11,000 We have an escape character. 8560 06:21:11,000 --> 06:21:12,900 And the whole idea is that sometimes 8561 06:21:12,900 --> 06:21:14,080 all these little special characters 8562 06:21:14,080 --> 06:21:15,720 that make a lot of sense to us, 8563 06:21:15,720 --> 06:21:16,880 we actually want to search for it. 8564 06:21:16,880 --> 06:21:19,220 So what if we want to search for a dollar sign? 8565 06:21:19,220 --> 06:21:23,120 Well, we just prefix it with the backslash. 8566 06:21:23,120 --> 06:21:25,160 And that just means this is a real dollar sign. 
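The maximum-finding program walked through above can be sketched as follows. To keep it self-contained, it loops over a few made-up lines held in a string instead of opening a file; the sample values and the name numlist are assumptions, while stuff follows the narration.

```python
import re

# Stand-in for the lines of a mail file (values invented for illustration).
sample = """From stephen.marquard@uct.ac.za Sat Jan  5 09:14:16 2008
X-DSPAM-Confidence: 0.8475
X-DSPAM-Probability: 0.0000
X-DSPAM-Confidence: 0.9907
X-DSPAM-Confidence: 0.6178
"""

numlist = []
for line in sample.splitlines():
    line = line.rstrip()
    # Match the prefix and a space, then extract one or more digits or dots.
    stuff = re.findall(r"^X-DSPAM-Confidence: ([0-9.]+)", line)
    # Lines that do not match give an empty list, so skip them.
    if len(stuff) != 1:
        continue
    # stuff[0] is the extracted number as a string; convert and store it.
    numlist.append(float(stuff[0]))

print("Maximum:", max(numlist))
```

With a real file, the loop would run over the open file handle instead of sample.splitlines().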
8567 06:21:25,160 --> 06:21:27,440 So backslash dollar is a real dollar sign. 8568 06:21:27,440 --> 06:21:30,960 So this says, I would like a dollar sign 8569 06:21:30,960 --> 06:21:34,440 followed by one or more digits or dots. 8570 06:21:34,440 --> 06:21:36,880 And so that's going to match a dollar sign 8571 06:21:36,880 --> 06:21:38,160 followed by one or more digits. 8572 06:21:38,160 --> 06:21:39,080 Dots are okay. 8573 06:21:39,080 --> 06:21:40,640 This is a set, remember. 8574 06:21:40,640 --> 06:21:42,640 Zero dash nine or dot. 8575 06:21:42,640 --> 06:21:44,780 That's the set, the list of legit characters. 8576 06:21:44,780 --> 06:21:47,480 This is a range of characters that's a shortcut 8577 06:21:47,480 --> 06:21:48,440 for how to make the set. 8578 06:21:48,440 --> 06:21:50,560 You could make it be zero, one, two, three, four, five, six, 8579 06:21:50,560 --> 06:21:51,760 seven, eight, nine, dot. 8580 06:21:51,760 --> 06:21:53,920 Or zero dash nine and it assumes that. 8581 06:21:53,920 --> 06:21:55,000 And that's one or more. 8582 06:21:55,000 --> 06:21:57,440 So then it stops because this is a space. 8583 06:21:57,440 --> 06:21:59,160 It's greedy matching. 8584 06:21:59,160 --> 06:22:00,480 Then it pulls this out. 8585 06:22:00,480 --> 06:22:03,000 So that's kind of why greedy has to be the default. 8586 06:22:03,000 --> 06:22:05,960 Because otherwise, if it wasn't doing greedy matching, 8587 06:22:05,960 --> 06:22:07,540 oops, come back, come back. 8588 06:22:08,900 --> 06:22:11,200 If it wasn't doing greedy matching, 8589 06:22:11,200 --> 06:22:14,520 it would, if it wasn't doing greedy matching, 8590 06:22:14,520 --> 06:22:16,780 it would stop here because it would find a dollar sign. 8591 06:22:16,780 --> 06:22:19,660 Non greedy would find a dollar sign and one character 8592 06:22:19,660 --> 06:22:23,360 and then it would give us dollar one rather than dollar 10.
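A sketch of the escape character and of what greedy matching buys us here. The sample sentence is an assumption with a single dollar amount in it.

```python
import re

x = "We just received $10.00 for cookies."

# \$      a real dollar sign (the backslash removes its special meaning)
# [0-9.]+ one or more characters from the set: digits or a dot (greedy)
greedy = re.findall(r"\$[0-9.]+", x)
print(greedy)

# Non-greedy version: +? is satisfied with a single character,
# so it stops right after the first digit.
non_greedy = re.findall(r"\$[0-9.]+?", x)
print(non_greedy)
```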
8593 06:22:23,360 --> 06:22:26,560 So, in summary, regular expressions 8594 06:22:26,560 --> 06:22:28,700 are a cryptic but powerful language 8595 06:22:28,700 --> 06:22:31,560 and they're an acquired taste. 8596 06:22:31,560 --> 06:22:35,320 I think that, I bet eventually you'll find them fun 8597 06:22:35,320 --> 06:22:37,540 even though on your first impression 8598 06:22:37,540 --> 06:22:39,540 you might not think that they're so fun. 8599 06:22:42,960 --> 06:22:44,920 Welcome to network programs. 8600 06:22:44,920 --> 06:22:45,760 This is chapter 12. 8601 06:22:45,760 --> 06:22:47,960 Now we're going to learn a little bit 8602 06:22:47,960 --> 06:22:52,960 about how we talk to resources on the network using Python. 8603 06:22:52,960 --> 06:22:56,160 Now, this is a really quick introduction 8604 06:22:56,160 --> 06:22:57,400 to how the network really works. 8605 06:22:57,400 --> 06:22:59,680 I have a whole book that I wrote. 8606 06:22:59,680 --> 06:23:01,800 It's also translated into Spanish 8607 06:23:01,800 --> 06:23:04,800 on how the network works starting at the very lowest 8608 06:23:04,800 --> 06:23:07,680 layer packets and everything right on up. 8609 06:23:07,680 --> 06:23:09,160 And it's actually really easy to read. 8610 06:23:09,160 --> 06:23:11,680 I wrote it for a high school audience. 8611 06:23:11,680 --> 06:23:14,040 It's a short book and pretty easy to read. 8612 06:23:15,080 --> 06:23:16,920 So if you read that book, you will understand 8613 06:23:16,920 --> 06:23:20,040 that there is this layered architecture, 8614 06:23:20,040 --> 06:23:23,080 the TCP architecture that sort of runs our network 8615 06:23:23,080 --> 06:23:24,200 at the lowest layer. 8616 06:23:24,200 --> 06:23:26,360 On one side here, this is your computer 8617 06:23:26,360 --> 06:23:28,080 and this is a server computer. 
8618 06:23:28,080 --> 06:23:30,200 And if you sort of want a webpage, 8619 06:23:30,200 --> 06:23:31,520 it goes across the network, 8620 06:23:31,520 --> 06:23:33,760 does this like 15 or 20 times, 8621 06:23:33,760 --> 06:23:36,760 then it goes up into the server, reads the data 8622 06:23:36,760 --> 06:23:41,520 and then the data comes back 15, 20 hops for the packets 8623 06:23:41,520 --> 06:23:44,800 and then it's shown to you as what you see. 8624 06:23:46,500 --> 06:23:48,720 And so, that's how it works. 8625 06:23:48,720 --> 06:23:51,200 And there's these layers that we're not gonna talk about 8626 06:23:51,200 --> 06:23:53,280 in this section but I talk about in that book. 8627 06:23:54,400 --> 06:23:56,360 The layers of the link layer which talk about 8628 06:23:56,360 --> 06:23:58,640 how to get over one hop, the internet layer 8629 06:23:58,640 --> 06:24:03,640 which talks about how to construct say 15 or so hops 8630 06:24:03,640 --> 06:24:06,000 to get packets back and forth, 8631 06:24:06,000 --> 06:24:08,800 that's the sort of lower level bits. 8632 06:24:08,800 --> 06:24:10,880 We're gonna start at what we call the transport layer 8633 06:24:10,880 --> 06:24:14,320 and that's the layer where your computer sort of assumes 8634 06:24:14,320 --> 06:24:17,120 that it can make a phone call to another computer, 8635 06:24:17,120 --> 06:24:20,120 another process running on a program on this computer, 8636 06:24:20,120 --> 06:24:22,000 talks to a program on this computer 8637 06:24:22,000 --> 06:24:24,120 and then it kind of comes back, okay? 8638 06:24:24,120 --> 06:24:26,880 And so, we're gonna leave this alone, 8639 06:24:26,880 --> 06:24:28,800 we're gonna ignore it, we're gonna assume 8640 06:24:28,800 --> 06:24:30,400 that there's this nice reliable pipe 8641 06:24:30,400 --> 06:24:32,840 that's going from point A to point B 8642 06:24:32,840 --> 06:24:34,160 and what are we gonna do with the pipe? 
8643 06:24:34,160 --> 06:24:36,960 But if you're interested, take a look at the book. 8644 06:24:36,960 --> 06:24:39,240 So, we're just gonna start with a pipe, 8645 06:24:39,240 --> 06:24:42,880 some kind of a connection, we have two processes, 8646 06:24:42,880 --> 06:24:46,760 process, process and we have some kind of a connection 8647 06:24:46,760 --> 06:24:48,880 between them and it is a connection 8648 06:24:48,880 --> 06:24:53,440 that we can both use to talk and to listen. 8649 06:24:53,440 --> 06:24:56,160 In nerd terms, we call these things sockets 8650 06:24:56,160 --> 06:24:58,680 and that is one process running on one computer, 8651 06:24:58,680 --> 06:25:01,760 another process running on another second computer 8652 06:25:01,760 --> 06:25:03,880 connected through the internet somehow 8653 06:25:03,880 --> 06:25:07,960 and one computer speaks into that socket and it comes out 8654 06:25:07,960 --> 06:25:10,960 and the other computer returns something and it comes. 8655 06:25:10,960 --> 06:25:14,400 And so, this is a bi-directional protocol of data 8656 06:25:14,400 --> 06:25:17,360 which is a series of, in effect, data phone calls 8657 06:25:17,360 --> 06:25:18,960 between applications. 8658 06:25:18,960 --> 06:25:21,480 So, the application might be, on your side, 8659 06:25:21,480 --> 06:25:22,920 this might be your browser. 8660 06:25:23,960 --> 06:25:27,320 Chrome, Firefox, Internet Explorer. 8661 06:25:27,320 --> 06:25:29,240 On the other side, this is a web server. 8662 06:25:30,120 --> 06:25:33,120 Might be internet IIS, internet something something 8663 06:25:33,120 --> 06:25:37,080 from Microsoft or Apache or Java Tomcat. 8664 06:25:37,080 --> 06:25:40,160 There's another program and you are making phone calls 8665 06:25:40,160 --> 06:25:41,360 between these programs. 
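A minimal sketch of the two-way "data phone call" described above. It assumes Python's standard socket.socketpair(), which hands back two already-connected endpoints on the same machine, so no network or server is involved. The function name talk_both_ways is made up for illustration and is not from the lecture.

```python
import socket

def talk_both_ways():
    # socketpair() returns two connected sockets: think of them as the
    # two ends of one phone call between two processes.
    left, right = socket.socketpair()
    try:
        left.sendall(b"hello")            # one side speaks into the socket
        heard_by_right = right.recv(512)  # and it comes out at the other end
        right.sendall(b"hi back")         # the other side answers on the same connection
        heard_by_left = left.recv(512)
    finally:
        left.close()
        right.close()
    return heard_by_right, heard_by_left

print(talk_both_ways())
```

Both ends can send and receive on the same connection, which is the "talk and listen" property the lecture relies on.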
8666 06:25:41,360 --> 06:25:45,480 Now, in general, these servers here stay up all the time 8667 06:25:45,480 --> 06:25:47,640 and you sort of just can make a request 8668 06:25:47,640 --> 06:25:50,000 when you feel like it in your program 8669 06:25:50,000 --> 06:25:51,280 but that's what we're going to do 8670 06:25:51,280 --> 06:25:53,680 and this is what we call a socket. 8671 06:25:53,680 --> 06:25:55,840 So, that little connection, that phone call, 8672 06:25:55,840 --> 06:25:58,660 that data phone call is what we call a socket. 8673 06:25:59,520 --> 06:26:03,160 Now, you have to decide which of the systems 8674 06:26:03,160 --> 06:26:05,520 you're gonna talk to and then which of the services 8675 06:26:05,520 --> 06:26:07,840 on those systems or which process. 8676 06:26:07,840 --> 06:26:10,640 And so, we have this concept called port numbers 8677 06:26:10,640 --> 06:26:13,080 and they're best thought of like extensions on phones. 8678 06:26:13,080 --> 06:26:15,520 So, one organization has one phone number 8679 06:26:15,520 --> 06:26:17,120 and it says, please enter the extension 8680 06:26:17,120 --> 06:26:19,040 of the party you'd like to talk to. 8681 06:26:19,040 --> 06:26:20,680 Well, that's kind of what ports are. 8682 06:26:20,680 --> 06:26:22,760 They're like, here is, I'm a server 8683 06:26:22,760 --> 06:26:23,920 and I'm connected to the internet. 8684 06:26:23,920 --> 06:26:26,220 Please enter the extension of the process 8685 06:26:26,220 --> 06:26:28,120 that you would like to talk to. 8686 06:26:28,120 --> 06:26:30,920 And so, for example, there might be processes 8687 06:26:30,920 --> 06:26:33,200 running on various computers 8688 06:26:33,200 --> 06:26:36,800 and so the email is known to hang out on port 25 8689 06:26:36,800 --> 06:26:38,240 or extension 25. 8690 06:26:38,240 --> 06:26:41,680 Login, insecure login lives on port 23. 
8691 06:26:41,680 --> 06:26:45,280 Insecure web lives on 80 and secure web lives on 443 8692 06:26:45,280 --> 06:26:47,120 and there's a couple of different protocols. 8693 06:26:47,120 --> 06:26:49,640 Say if you have your mail stored on Gmail 8694 06:26:49,640 --> 06:26:51,840 and you have a local mail client, 8695 06:26:51,840 --> 06:26:53,900 say like Thunderbird or Apple Mail, 8696 06:26:53,900 --> 06:26:56,660 that talks a protocol to pull that mail across 8697 06:26:56,660 --> 06:26:58,360 and those live on various ports. 8698 06:26:58,360 --> 06:27:01,320 So, these ports are those extensions 8699 06:27:01,320 --> 06:27:05,320 and by convention, we have standards 8700 06:27:05,320 --> 06:27:08,600 that tell us what to roughly expect at those ports. 8701 06:27:08,600 --> 06:27:10,820 So, when you're talking to port 80, 8702 06:27:10,820 --> 06:27:15,560 you expect to talk to a web server or an HTTP server. 8703 06:27:15,560 --> 06:27:17,040 If you're talking on port 23, 8704 06:27:17,040 --> 06:27:19,000 you expect to talk to a telnet server 8705 06:27:19,000 --> 06:27:20,440 and on and on and on and on. 8706 06:27:20,440 --> 06:27:22,100 And so, these are the extensions, 8707 06:27:22,100 --> 06:27:26,040 the typical commonly used default extensions 8708 06:27:26,040 --> 06:27:29,100 for various network application processes 8709 06:27:29,100 --> 06:27:31,140 that are serving us data. 8710 06:27:31,140 --> 06:27:32,840 Now, sometimes you'll go to a URL 8711 06:27:32,840 --> 06:27:34,560 and you'll see in that URL, 8712 06:27:34,560 --> 06:27:35,800 there's a colon and a number, 8713 06:27:35,800 --> 06:27:39,160 that means it's a web server that's running on a port 8714 06:27:39,160 --> 06:27:42,180 other than the official 80 or 443 port. 8715 06:27:43,400 --> 06:27:47,260 Now, in Python, we can talk to these sockets, right? 8716 06:27:47,260 --> 06:27:49,980 We can just talk to them and it's really easy, 8717 06:27:49,980 --> 06:27:51,560 surprisingly easy. 
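The "extension" idea can be sketched roughly like this. It assumes the standard urllib.parse module; the example.com URLs and the helper name effective_port are hypothetical, and only the two web ports named in the lecture are listed.

```python
from urllib.parse import urlsplit

# Default "extensions" for the two web protocols named in the lecture.
# Real systems define many more well-known ports.
DEFAULT_PORTS = {"http": 80, "https": 443}

def effective_port(url):
    """Return the port a client would connect to for this URL."""
    parts = urlsplit(url)
    if parts.port is not None:
        # A colon and a number in the URL overrides the default port.
        return parts.port
    return DEFAULT_PORTS[parts.scheme]

print(effective_port("http://www.example.com/page"))
print(effective_port("http://www.example.com:8080/page"))
```

The second URL is the "colon and a number" case: a web server listening somewhere other than 80 or 443.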
8718 06:27:52,480 --> 06:27:55,080 We have to import socket because that's a library. 8719 06:27:55,080 --> 06:27:58,800 It comes with Python, but until you can use it, 8720 06:27:58,800 --> 06:28:01,880 you can't use it in your program until you say it. 8721 06:28:01,880 --> 06:28:04,240 And then you, basically in the socket library, 8722 06:28:04,240 --> 06:28:07,360 call socket function, that's what that syntax is saying. 8723 06:28:08,800 --> 06:28:10,040 You're making a socket. 8724 06:28:10,040 --> 06:28:11,040 Now, the key to a socket, 8725 06:28:11,040 --> 06:28:14,560 it's sort of like an unopened file handle. 8726 06:28:14,560 --> 06:28:16,280 It's half of a file handle. 8727 06:28:16,280 --> 06:28:19,020 It's an outward looking thing that's not yet connected. 8728 06:28:19,020 --> 06:28:21,360 These parameters, you're just gonna type them in. 8729 06:28:21,360 --> 06:28:22,800 This says we're gonna make a socket 8730 06:28:22,800 --> 06:28:23,920 that goes across the internet 8731 06:28:23,920 --> 06:28:25,240 and it's a stream socket, 8732 06:28:25,240 --> 06:28:27,400 which means that it's a series of characters 8733 06:28:27,400 --> 06:28:28,840 that come one after another 8734 06:28:28,840 --> 06:28:30,680 rather than a series of blocks of text. 8735 06:28:30,680 --> 06:28:32,920 There's another kind that's harder to deal with, 8736 06:28:32,920 --> 06:28:34,000 but we're gonna do this. 8737 06:28:34,000 --> 06:28:36,200 So this, don't worry about this line. 8738 06:28:36,200 --> 06:28:38,040 Just know that this creates a socket, 8739 06:28:38,040 --> 06:28:40,160 but does not associate it. 8740 06:28:40,160 --> 06:28:43,520 The very next line, we get back a socket object 8741 06:28:43,520 --> 06:28:46,040 in this variable that I'm storing in the variable mySock. 
8742 06:28:46,040 --> 06:28:48,120 And then when you wanna make a connection 8743 06:28:48,120 --> 06:28:50,660 across the internet to the far end, 8744 06:28:50,660 --> 06:28:52,800 you say, oh, hey, dear socket, 8745 06:28:52,800 --> 06:28:54,560 extend yourself across the internet. 8746 06:28:54,560 --> 06:28:59,320 Make the phone call to this host, data.pr4e.org, 8747 06:28:59,320 --> 06:29:00,880 and on that port 80. 8748 06:29:00,880 --> 06:29:02,320 So that's making the phone call. 8749 06:29:02,320 --> 06:29:03,740 This is like the phone number 8750 06:29:03,740 --> 06:29:05,440 and this is like the phone extension. 8751 06:29:05,440 --> 06:29:07,500 So that's, we haven't sent any data yet. 8752 06:29:07,500 --> 06:29:11,300 We have simply rung the phone of a process, 8753 06:29:11,300 --> 06:29:12,940 hopefully living on port 80. 8754 06:29:12,940 --> 06:29:14,180 If it's there, great. 8755 06:29:14,180 --> 06:29:15,240 This might blow up. 8756 06:29:15,240 --> 06:29:16,360 This one here won't blow up, 8757 06:29:16,360 --> 06:29:18,360 but this line here could blow up. 8758 06:29:18,360 --> 06:29:19,960 If there's nothing sitting on that process, 8759 06:29:19,960 --> 06:29:20,920 it would come back and say, 8760 06:29:20,920 --> 06:29:23,120 oh, you try to call, you got no answer. 8761 06:29:23,120 --> 06:29:24,720 That's a legitimate thing to happen. 8762 06:29:24,720 --> 06:29:26,580 Maybe you don't have a network connection 8763 06:29:26,580 --> 06:29:29,160 or maybe that service is down on that server 8764 06:29:29,160 --> 06:29:30,760 or the whole server is down. 
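Here is a hedged sketch of those lines with the "this might blow up" case handled. The helper name try_connect and the timeout value are assumptions added for illustration; the socket.socket(...) and connect(...) calls are the ones the lecture describes.

```python
import socket

def try_connect(host, port, timeout=5.0):
    """Return a connected socket, or None if the call got no answer."""
    # Creating the socket does not touch the network yet.
    mysock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    mysock.settimeout(timeout)
    try:
        # This is the line that can fail: no network, service down,
        # server down, or an unknown host name.
        mysock.connect((host, port))
    except OSError as err:
        print("Could not connect:", err)
        mysock.close()
        return None
    return mysock
```

A call such as try_connect('data.pr4e.org', 80) would ring the phone on port 80 of that host. No data is sent until you call send on the returned socket.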
8765 06:29:30,760 --> 06:29:35,760 But, so I just, it's kind of amazing 8766 06:29:35,960 --> 06:29:37,740 that we're sitting here in Python 8767 06:29:37,740 --> 06:29:41,640 and in three lines we have probably 8768 06:29:41,640 --> 06:29:43,800 a half a million engineers who built this thing 8769 06:29:43,800 --> 06:29:45,420 called the internet, all these protocols 8770 06:29:45,420 --> 06:29:46,760 and all this software. 8771 06:29:46,760 --> 06:29:50,600 And we just made use of it in three lines of Python. 8772 06:29:50,600 --> 06:29:52,560 And in case, this is one of the reasons 8773 06:29:52,560 --> 06:29:55,040 that people absolutely love Python, 8774 06:29:55,040 --> 06:29:56,540 absolutely love Python. 8775 06:29:57,560 --> 06:29:59,260 So now that we have a socket, 8776 06:29:59,260 --> 06:30:01,760 we have to ask ourselves what kind of data 8777 06:30:01,760 --> 06:30:04,080 are we going to send and then what kind of data 8778 06:30:04,080 --> 06:30:07,420 are we going to expect to receive across that socket? 8779 06:30:11,680 --> 06:30:13,000 So now we have a socket. 8780 06:30:13,000 --> 06:30:15,520 We are going to talk about what we're going to do with it. 8781 06:30:15,520 --> 06:30:17,760 So the socket basically functions at this level. 8782 06:30:17,760 --> 06:30:20,160 Your application is saying, make me a socket, 8783 06:30:20,160 --> 06:30:21,720 which is sort of this end point. 8784 06:30:21,720 --> 06:30:23,360 And then the connect actually connects 8785 06:30:23,360 --> 06:30:25,440 to an application on the far side. 8786 06:30:25,440 --> 06:30:28,160 And there's a port involved, so that might be port 80. 8787 06:30:28,160 --> 06:30:30,680 And this is the far host and that could be 8788 06:30:30,680 --> 06:30:35,680 www.py4e.org or data.py4e.org. 8789 06:30:36,080 --> 06:30:39,080 Okay, and so the socket is solving this. 
8790 06:30:39,080 --> 06:30:43,720 And the question then is what are we going to send 8791 06:30:43,720 --> 06:30:45,340 and what are we going to expect to get back? 8792 06:30:45,340 --> 06:30:47,640 And that's what we call the application protocol. 8793 06:30:47,640 --> 06:30:50,040 So we know that these two have made a phone call. 8794 06:30:50,040 --> 06:30:51,880 There's no different than making the phone call 8795 06:30:51,880 --> 06:30:54,880 and saying, you know, hello, right? 8796 06:30:54,880 --> 06:30:57,760 And everyone knows that when the phone rings 8797 06:30:57,760 --> 06:31:00,040 and you pick it up, you're supposed to say hello. 8798 06:31:00,040 --> 06:31:01,440 And that's part of our protocol. 8799 06:31:01,440 --> 06:31:03,320 So who talks first, right? 8800 06:31:03,320 --> 06:31:07,320 So the dominant protocol that we use in this section 8801 06:31:07,320 --> 06:31:08,680 is the HTTP protocol. 8802 06:31:08,680 --> 06:31:11,720 The key is hypertext transfer protocol. 8803 06:31:11,720 --> 06:31:13,520 It's dominant, it's really easy to use. 8804 06:31:13,520 --> 06:31:15,000 That's why I use it as an example. 8805 06:31:15,000 --> 06:31:16,640 But realize that there are many others, 8806 06:31:16,640 --> 06:31:19,880 like mail and file transfer and remote login 8807 06:31:19,880 --> 06:31:21,700 and all kinds of other protocols. 8808 06:31:21,700 --> 06:31:23,600 Each is a different application protocol. 8809 06:31:23,600 --> 06:31:26,320 They all use sort of sockets at their lower level. 8810 06:31:26,320 --> 06:31:29,480 But then on top of that, they layer the rules of the road 8811 06:31:29,480 --> 06:31:33,000 for retrieving hypertext web pages. 8812 06:31:33,000 --> 06:31:36,960 And we have used these for all kinds of other things. 8813 06:31:36,960 --> 06:31:38,240 So the protocol, like I said, 8814 06:31:38,240 --> 06:31:39,800 is like who answers the phone first? 8815 06:31:39,800 --> 06:31:40,920 What do they say? 
8816 06:31:40,920 --> 06:31:43,000 What happens if the person doesn't answer right? 8817 06:31:43,000 --> 06:31:44,240 Can you hear me now? 8818 06:31:44,240 --> 06:31:45,560 Those kinds of things. 8819 06:31:45,560 --> 06:31:47,240 And it's a real simple thing. 8820 06:31:47,240 --> 06:31:48,360 And all you really need to do 8821 06:31:48,360 --> 06:31:50,480 is so that both sides can agree, 8822 06:31:50,480 --> 06:31:51,640 you have to write a thing 8823 06:31:51,640 --> 06:31:53,280 that's like the rules in the middle 8824 06:31:53,280 --> 06:31:56,120 and say, okay, everybody, as long as we all do this, 8825 06:31:56,120 --> 06:31:57,560 we'll be fine. 8826 06:31:57,560 --> 06:31:59,440 It's as simple as picking on which side of the road 8827 06:31:59,440 --> 06:32:00,640 the cars can drive on. 8828 06:32:00,640 --> 06:32:02,580 It works fine no matter which side. 8829 06:32:04,880 --> 06:32:06,160 But if each car randomly picked, 8830 06:32:06,160 --> 06:32:07,900 it would be really kind of a mess. 8831 06:32:09,060 --> 06:32:10,800 So if you look at the typical URL, 8832 06:32:10,800 --> 06:32:12,100 and this is one of the things 8833 06:32:12,100 --> 06:32:15,280 that the web innovators in 1990 8834 06:32:15,280 --> 06:32:16,920 really invented that was wonderful. 8835 06:32:16,920 --> 06:32:19,100 And it seems second nature today, 8836 06:32:19,100 --> 06:32:21,280 but in 1990, it was rather revolutionary. 8837 06:32:21,280 --> 06:32:23,960 And that is that these uniform resource locators 8838 06:32:23,960 --> 06:32:27,200 included in themselves a protocol, 8839 06:32:27,200 --> 06:32:29,820 the host to connect to, and the document to retrieve. 8840 06:32:29,820 --> 06:32:33,440 So this is one of the clever, clever ideas 8841 06:32:33,440 --> 06:32:35,160 that the web came up with, 8842 06:32:35,160 --> 06:32:37,880 because we used to have to pick a program 8843 06:32:37,880 --> 06:32:42,480 like FTP or Telnet or whatever, SMTP.
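The three pieces of a URL mentioned above (protocol, host, document) can be pulled apart with a rough sketch like the following. It assumes the standard urllib.parse module. The URL is built from the host name and the romeo.txt file mentioned in this lecture and is only an example, and the helper name url_pieces is hypothetical.

```python
from urllib.parse import urlsplit

def url_pieces(url):
    """Split a URL into (protocol, host, document)."""
    parts = urlsplit(url)
    return parts.scheme, parts.hostname, parts.path

protocol, host, document = url_pieces("http://data.pr4e.org/romeo.txt")
print("protocol:", protocol)   # which application protocol to speak
print("host:", host)           # which computer to connect to
print("document:", document)   # what to ask that server for
```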
8844 06:32:42,480 --> 06:32:44,280 Then we had to go to the right host, 8845 06:32:44,280 --> 06:32:47,520 and then we had to talk to that host a certain way. 8846 06:32:47,520 --> 06:32:50,520 So in HTTP, it's a really simple protocol 8847 06:32:50,520 --> 06:32:54,480 invented in 1989 and 1990 by Tim Berners-Lee 8848 06:32:54,480 --> 06:32:58,760 and Robert Cailliau at CERN. 8849 06:32:59,840 --> 06:33:03,640 And they created a protocol that we have grown to know 8850 06:33:03,640 --> 06:33:06,700 and love and use for way more than retrieving documents, 8851 06:33:06,700 --> 06:33:09,340 as we'll see in the upcoming chapters. 8852 06:33:09,340 --> 06:33:11,160 So we're gonna talk a little bit about what happens 8853 06:33:11,160 --> 06:33:13,280 when you click on a page that has a link. 8854 06:33:13,280 --> 06:33:15,440 Now, there's all kinds of fancy stuff that can go on, 8855 06:33:15,440 --> 06:33:17,000 but this is the basics. 8856 06:33:17,000 --> 06:33:19,040 And so let's just imagine for the moment 8857 06:33:19,040 --> 06:33:21,320 you start sitting looking at a web page, 8858 06:33:21,320 --> 06:33:22,960 drchuck.com slash page one, 8859 06:33:22,960 --> 06:33:25,400 and inside that there is a hyperlink. 8860 06:33:25,400 --> 06:33:29,600 It is an indication that says when you click on this page, 8861 06:33:29,600 --> 06:33:31,160 go to a different page. 8862 06:33:31,160 --> 06:33:34,600 And in that, you see the name of the page 8863 06:33:34,600 --> 06:33:36,160 that you're supposed to go to. 8864 06:33:37,240 --> 06:33:41,520 So we click on this link, and that is a browser. 8865 06:33:41,520 --> 06:33:44,780 This is an application, this is a process, 8866 06:33:46,440 --> 06:33:49,000 or an app that's running on your computer. 8867 06:33:49,000 --> 06:33:50,560 This is the browser, okay?
8868 06:33:50,560 --> 06:33:53,880 And when the browser sees the click inside your computer, 8869 06:33:53,880 --> 06:33:56,560 then the browser makes a connection 8870 06:33:56,560 --> 06:34:00,200 to port 80 on the web server, drchuck.com, 8871 06:34:00,200 --> 06:34:02,520 and sends the request. 8872 06:34:02,520 --> 06:34:06,660 This request that it sends is precisely specified 8873 06:34:06,660 --> 06:34:09,040 by a standard, which we will see in a second. 8874 06:34:09,040 --> 06:34:12,440 Then the web server does some magic work. 8875 06:34:12,440 --> 06:34:14,320 Oops, let's go back. 8876 06:34:14,320 --> 06:34:16,400 Then the web server does some magic work in here, 8877 06:34:16,400 --> 06:34:19,080 reads some files, runs some code, does whatever, 8878 06:34:19,080 --> 06:34:23,680 constructs an answer to our phone call, and sends it back. 8879 06:34:23,680 --> 06:34:26,240 And it sends, in this case, back a web page 8880 06:34:26,240 --> 06:34:30,000 in the format of HTML, the HyperText Markup Language, 8881 06:34:30,000 --> 06:34:32,020 which is different than HTTP, 8882 06:34:32,020 --> 06:34:33,960 which is the protocol that we're exchanging. 8883 06:34:33,960 --> 06:34:36,340 HTML is the format of the document we're getting back. 8884 06:34:36,340 --> 06:34:38,480 And this has an anchor tag, 8885 06:34:38,480 --> 06:34:41,440 href and an end of anchor tag, and some highlighted text. 8886 06:34:41,440 --> 06:34:44,240 And now your browser gets this back 8887 06:34:44,240 --> 06:34:47,120 and then renders it according to the rules of HTML 8888 06:34:47,120 --> 06:34:50,120 and CSS and JavaScript, et cetera, parses it, 8889 06:34:50,120 --> 06:34:51,600 and then makes a pretty web page. 8890 06:34:51,600 --> 06:34:53,520 And this web page happens to have a link 8891 06:34:53,520 --> 06:34:55,480 back to the first page, and if you click there, 8892 06:34:55,480 --> 06:34:57,900 it will do this over and over and over again.
8893 06:34:57,900 --> 06:35:00,380 And that is the request response cycle. 8894 06:35:00,380 --> 06:35:03,680 And that's governed by a series of internet standards. 8895 06:35:03,680 --> 06:35:05,640 These are standards that were built 8896 06:35:05,640 --> 06:35:08,000 from the 60s, 70s, 80s, and 90s, 8897 06:35:08,000 --> 06:35:09,840 and continue to this day, 8898 06:35:09,840 --> 06:35:12,140 by a group called the Internet Engineering Task Force, 8899 06:35:12,140 --> 06:35:14,120 or IETF. 8900 06:35:14,120 --> 06:35:17,000 The documents they produce are called RFCs, 8901 06:35:17,000 --> 06:35:19,280 which stands for Request for Comments. 8902 06:35:20,640 --> 06:35:24,760 The RFC, the word RFC is kind of like a sort of joke, 8903 06:35:24,760 --> 06:35:25,600 as it were. 8904 06:35:28,600 --> 06:35:30,900 They're trying to be kind of funny in that, 8905 06:35:30,900 --> 06:35:32,600 funny is not the right word. 8906 06:35:32,600 --> 06:35:34,520 It's ironic in that they're trying to say, 8907 06:35:34,520 --> 06:35:36,400 even so in the protocols of the internet 8908 06:35:36,400 --> 06:35:38,980 that we've used for several decades, 8909 06:35:38,980 --> 06:35:40,920 they're always interested in improvements. 8910 06:35:40,920 --> 06:35:42,400 And that's what the RFC stands for. 8911 06:35:42,400 --> 06:35:44,720 And they're all named RFC-whatever. 8912 06:35:45,560 --> 06:35:47,080 And if we were gonna cruise around, 8913 06:35:47,080 --> 06:35:49,240 we could find some various RFCs. 8914 06:35:49,240 --> 06:35:52,000 And this is RFC 2616. 8915 06:35:52,000 --> 06:35:53,800 It might have been revised since then. 8916 06:35:53,800 --> 06:35:55,440 But this is like a document, 8917 06:35:55,440 --> 06:35:56,560 and this is what they look like. 8918 06:35:56,560 --> 06:35:59,240 Hypertext Transfer Protocol version one. 
8919 06:35:59,240 --> 06:36:00,320 And so you're reading this document, 8920 06:36:00,320 --> 06:36:01,400 you're gonna write a browser, 8921 06:36:01,400 --> 06:36:04,640 and you wanna talk the application protocol that is HTTP. 8922 06:36:04,640 --> 06:36:06,440 This is one of many documents 8923 06:36:06,440 --> 06:36:09,440 that helps define what HTTP is. 8924 06:36:09,440 --> 06:36:11,000 So if you look down and you look down and say, 8925 06:36:11,000 --> 06:36:12,560 oh, here's what a request looks like. 8926 06:36:12,560 --> 06:36:15,120 This is how I'm gonna get a document from the server. 8927 06:36:15,120 --> 06:36:17,040 And you keep reading, and you keep reading, 8928 06:36:17,040 --> 06:36:21,320 and it says, you're supposed to have the request method 8929 06:36:21,320 --> 06:36:23,560 with a space, with the request URL, 8930 06:36:23,560 --> 06:36:25,560 the request method with a space, 8931 06:36:25,560 --> 06:36:27,880 with a URI with a space, the HTTP version, 8932 06:36:27,880 --> 06:36:29,200 and the carriage return line feed. 8933 06:36:29,200 --> 06:36:30,600 That's what it's saying. 8934 06:36:30,600 --> 06:36:32,960 And so it looks kind of like this, right? 8935 06:36:32,960 --> 06:36:36,120 We say get the document followed by a space. 8936 06:36:36,120 --> 06:36:37,120 There's gotta be one space. 8937 06:36:37,120 --> 06:36:38,560 You do two spaces, 8938 06:36:38,560 --> 06:36:41,560 and it's going to be quite frustrating, okay? 8939 06:36:41,560 --> 06:36:45,080 And so this is an example that you can run 8940 06:36:50,080 --> 06:36:51,640 on Linux operating systems 8941 06:36:51,640 --> 06:36:54,520 and Macintosh operating systems with no changes. 8942 06:36:54,520 --> 06:36:57,040 If you install Telnet on your Windows box, 8943 06:36:57,040 --> 06:36:59,560 you should be able to run something like this as well. 8944 06:36:59,560 --> 06:37:04,320 So Telnet is a program that we used in the old days.
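The request line format read out above (method, one space, URI, one space, HTTP version, carriage return and line feed) can be sketched as a small helper. The name request_line, the HTTP/1.0 default, and the example URL are assumptions for illustration, not something the RFC or the lecture defines.

```python
def request_line(method, uri, version="HTTP/1.0"):
    """Build: Method SP Request-URI SP HTTP-Version CRLF."""
    # Exactly one space between the three parts, then carriage return
    # (\r) and line feed (\n) to end the line.
    return f"{method} {uri} {version}\r\n"

line = request_line("GET", "http://data.pr4e.org/romeo.txt")
print(repr(line))
```

Printing with repr makes the single spaces and the trailing carriage return and line feed visible.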
8945 06:37:04,320 --> 06:37:05,920 It used to be how we logged into servers, 8946 06:37:05,920 --> 06:37:08,040 but because it doesn't encrypt your data back and forth, 8947 06:37:08,040 --> 06:37:09,360 we don't use it anymore, 8948 06:37:09,360 --> 06:37:13,480 but it basically is a program that can open a socket 8949 06:37:13,480 --> 06:37:15,800 to a host on a port. 8950 06:37:15,800 --> 06:37:18,600 And I'm saying Telnet to this host on port 80. 8951 06:37:18,600 --> 06:37:20,240 And at this point, I am connected, 8952 06:37:20,240 --> 06:37:21,800 and whatever I type on my keyboard 8953 06:37:21,800 --> 06:37:23,560 is gonna be sent to that server. 8954 06:37:23,560 --> 06:37:24,400 Now if you're doing this, 8955 06:37:24,400 --> 06:37:27,280 you probably wanna cut and paste this really fast, 8956 06:37:27,280 --> 06:37:28,800 because if you take too long, 8957 06:37:28,800 --> 06:37:30,800 most web servers will be like, you're a human. 8958 06:37:30,800 --> 06:37:31,760 I don't wanna talk to humans. 8959 06:37:31,760 --> 06:37:32,880 I wanna talk to programs. 8960 06:37:32,880 --> 06:37:35,200 So remember to type this fast enough, 8961 06:37:35,200 --> 06:37:38,000 and then you have to hit Enter twice. 8962 06:37:38,000 --> 06:37:39,800 So you have to have a blank line here. 8963 06:37:39,800 --> 06:37:42,040 Just type this exactly as it's shown, 8964 06:37:42,040 --> 06:37:44,560 and then you will get back the server. 8965 06:37:44,560 --> 06:37:45,400 If you do it right, 8966 06:37:45,400 --> 06:37:47,960 the server and the server is properly configured. 8967 06:37:47,960 --> 06:37:50,080 The server will give you back some headers, 8968 06:37:51,600 --> 06:37:53,800 and this is metadata about the document you're going to get. 
8969 06:37:53,800 --> 06:37:56,680 For example, it's saying it's got text slash HTML, 8970 06:37:56,680 --> 06:37:58,120 which means that the remaining stuff 8971 06:37:58,120 --> 06:38:00,520 is gonna be in HTML, Hypertext Markup Language. 8972 06:38:00,520 --> 06:38:03,480 It has a blank line, and then the actual document, 8973 06:38:03,480 --> 06:38:05,200 and then the connection is closed. 8974 06:38:05,200 --> 06:38:08,320 And so if you do this, you can set this up in a way 8975 06:38:08,320 --> 06:38:11,040 that you can run this on your own computer, 8976 06:38:11,040 --> 06:38:15,200 and in effect, hack through the back door a web server. 8977 06:38:15,200 --> 06:38:18,160 Now you can't hack the secure web servers, 8978 06:38:18,160 --> 06:38:20,680 and mail servers used to be easy to hack, 8979 06:38:20,680 --> 06:38:21,900 but they're harder to hack now 8980 06:38:21,900 --> 06:38:24,080 because they challenge you for information. 8981 06:38:24,080 --> 06:38:28,080 But part of the reason I'm so obsessed with the command line 8982 06:38:28,080 --> 06:38:29,680 is this is how real hackers work, 8983 06:38:29,680 --> 06:38:32,480 and they know how to talk some of these protocols 8984 06:38:32,480 --> 06:38:33,320 more directly. 8985 06:38:33,320 --> 06:38:36,600 And so we think of this beautiful sophisticated application 8986 06:38:36,600 --> 06:38:39,040 talking to some other thing, and it's all pretty, 8987 06:38:39,040 --> 06:38:42,040 and we got wonderful clicky buttons and nice usability, 8988 06:38:42,040 --> 06:38:46,160 but the reality is, like in the Matrix Reloaded here, 8989 06:38:46,160 --> 06:38:49,240 the kinds of things that really talented hackers are doing 8990 06:38:49,240 --> 06:38:52,480 use command lines, and they really know what's going on, 8991 06:38:52,480 --> 06:38:53,320 and that's how they do it. 
8992 06:38:53,320 --> 06:38:56,240 They understand what's going on better than the developers 8993 06:38:56,240 --> 06:38:58,560 of the computers that are trying to be resistant 8994 06:38:58,560 --> 06:38:59,400 to the hacking. 8995 06:38:59,400 --> 06:39:02,360 So I come from a long line of using the command line, 8996 06:39:02,360 --> 06:39:05,160 and that's why I encourage you to use the command line 8997 06:39:05,160 --> 06:39:07,120 in this course. 8998 06:39:07,120 --> 06:39:08,240 So the next thing we're going to do 8999 06:39:08,240 --> 06:39:10,640 is we're going to go up into the application layer, 9000 06:39:10,640 --> 06:39:12,960 and instead of typing those commands by hand, 9001 06:39:12,960 --> 06:39:15,980 we're going to actually send them from Python 9002 06:39:15,980 --> 06:39:19,180 and write a very simple Python web browser. 9003 06:39:22,920 --> 06:39:25,040 In this section, we're going to write a web browser 9004 06:39:25,040 --> 06:39:27,080 using Python, so we've already got a socket. 9005 06:39:27,080 --> 06:39:28,520 We know how to write a socket. 9006 06:39:28,520 --> 06:39:31,880 In the previous section, we played with the protocol, 9007 06:39:31,880 --> 06:39:34,000 and used Telnet to do it by hand, 9008 06:39:34,000 --> 06:39:35,560 and now we're going to do it in Python. 9009 06:39:35,560 --> 06:39:38,600 And what you're going to find is it's not that hard. 9010 06:39:40,520 --> 06:39:41,700 So here we go. 9011 06:39:41,700 --> 06:39:45,440 So the first three lines of this program, import socket, 9012 06:39:45,440 --> 06:39:46,280 make the socket. 9013 06:39:46,280 --> 06:39:48,960 Remember, the socket isn't really got the connection, 9014 06:39:48,960 --> 06:39:51,580 so when you make the socket, again, 9015 06:39:51,580 --> 06:39:53,200 we're going to make a stream-based socket, 9016 06:39:53,200 --> 06:39:55,300 and it's suitable for going across the internet. 
9017 06:39:55,300 --> 06:39:58,480 The connection, it's like ring, phone call, 9018 06:39:58,480 --> 06:40:02,640 connect to data.pr4e.org and port 80, 9019 06:40:02,640 --> 06:40:06,160 and so that basically says extend the socket across 9020 06:40:06,160 --> 06:40:08,060 and connect to a web server, 9021 06:40:08,060 --> 06:40:10,400 and so there's got to be a piece of software running, 9022 06:40:10,400 --> 06:40:13,920 and this will blow up if the software is not running, okay? 9023 06:40:14,900 --> 06:40:18,640 So then, now we've got a phone, we've made a phone call. 9024 06:40:18,640 --> 06:40:22,480 Now, whether or not the remote side says hello or not 9025 06:40:22,480 --> 06:40:24,240 is up to the application protocol, 9026 06:40:24,240 --> 06:40:26,520 and in this case, the web servers say nothing, 9027 06:40:26,520 --> 06:40:27,880 and they wait for you to talk first, 9028 06:40:27,880 --> 06:40:30,080 so we're the web browser in this case, 9029 06:40:30,080 --> 06:40:31,760 and so we're going to talk first, 9030 06:40:31,760 --> 06:40:34,280 and we know what, because we read the documentation, 9031 06:40:34,280 --> 06:40:35,400 we know that we're going to send get, 9032 06:40:35,400 --> 06:40:37,480 blah, blah, blah, blah, blah, blah, blah, blah, blah, 9033 06:40:37,480 --> 06:40:38,760 space, blah, blah, blah, blah, blah, 9034 06:40:38,760 --> 06:40:41,080 HTTP/1.0, and then two new lines. 9035 06:40:41,080 --> 06:40:43,880 Return, return, remember you had to have a blank line.
9036 06:40:43,880 --> 06:40:45,500 We'll talk a little bit about this encode, 9037 06:40:45,500 --> 06:40:48,120 it's preparing the data to go across the internet, 9038 06:40:48,120 --> 06:40:49,800 and then we say send it, 9039 06:40:49,800 --> 06:40:52,040 and so this basically takes that little string 9040 06:40:52,040 --> 06:40:54,080 and sends it across the network, 9041 06:40:54,080 --> 06:40:57,180 and then this piece of software is waiting for it, 9042 06:40:57,180 --> 06:40:59,580 and then the software goes and reads a file 9043 06:40:59,580 --> 06:41:00,500 or does some other stuff, 9044 06:41:00,500 --> 06:41:02,760 and then it starts sending us data back, 9045 06:41:02,760 --> 06:41:05,740 which we can then choose to receive. 9046 06:41:05,740 --> 06:41:07,740 So now we write a real simple loop. 9047 06:41:07,740 --> 06:41:08,780 We're going to receive the first, 9048 06:41:08,780 --> 06:41:09,840 we're going to receive these things 9049 06:41:09,840 --> 06:41:11,760 512 characters at a time, 9050 06:41:11,760 --> 06:41:15,140 so we're going to loop through 512 each time, 9051 06:41:15,140 --> 06:41:18,020 and if we get zero characters, 9052 06:41:18,020 --> 06:41:20,840 that means it's end of the stream, the stream is closed, 9053 06:41:20,840 --> 06:41:23,280 and if you look at the little example from the previous one, 9054 06:41:23,280 --> 06:41:24,880 you saw a connection closed. 9055 06:41:24,880 --> 06:41:27,260 When the connection is closed, we get an indication 9056 06:41:27,260 --> 06:41:29,400 that it is because we ask for some data 9057 06:41:29,400 --> 06:41:30,800 and we get zero data. 9058 06:41:30,800 --> 06:41:33,520 Otherwise, if there might be more data, this'll wait.
9059 06:41:33,520 --> 06:41:34,960 If the network is slow, you'll see, 9060 06:41:34,960 --> 06:41:36,640 if you do a print statement in here, 9061 06:41:36,640 --> 06:41:38,680 you will see that this will pause from time to time 9062 06:41:38,680 --> 06:41:40,160 on a really slow network. 9063 06:41:40,160 --> 06:41:41,880 If your network is fast, it'll just go blank 9064 06:41:41,880 --> 06:41:44,280 and it'll be so fast it won't matter. 9065 06:41:44,280 --> 06:41:45,360 But this is how we go. 9066 06:41:45,360 --> 06:41:48,680 So this is basically until the entire socket, 9067 06:41:48,680 --> 06:41:50,920 until the socket is closed, 9068 06:41:50,920 --> 06:41:52,600 we are going to read this data, 9069 06:41:52,600 --> 06:41:55,120 and because this data's coming from the outside world, 9070 06:41:55,120 --> 06:41:57,720 we have to decode it before we print it, 9071 06:41:57,720 --> 06:41:59,640 and then when we're all done, we break out of here 9072 06:41:59,640 --> 06:42:00,980 and we close the socket. 9073 06:42:00,980 --> 06:42:05,980 So literally, that is an entire web browser 9074 06:42:07,200 --> 06:42:11,180 written in 10 lines of Python, 9075 06:42:11,180 --> 06:42:14,360 and again, this is why everybody loves Python. 9076 06:42:14,360 --> 06:42:17,160 So this is what this program will show if you run. 9077 06:42:18,360 --> 06:42:22,360 The get is sent, it looks exactly like doing it by hand. 9078 06:42:22,360 --> 06:42:25,120 You get some headers, again, this is metadata 9079 06:42:25,120 --> 06:42:27,120 that tells you something about the file. 9080 06:42:27,120 --> 06:42:28,520 In this case, one of the important things 9081 06:42:28,520 --> 06:42:30,100 is what kind of thing is coming next. 
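The program walked through above might look roughly like the following sketch. It is wrapped in a function (the name fetch is made up here) so it can be pointed at any host. One deliberate difference from the lecture: the lecture prints each 512-byte chunk as it is decoded, while this sketch collects the raw bytes and decodes once at the end so that a multi-byte character is never split across two chunks.

```python
import socket

def fetch(host, port, url):
    """Send one HTTP/1.0 GET and return everything the server sends back."""
    # Make the socket, then "make the phone call" to host and port.
    mysock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    mysock.connect((host, port))

    # The browser talks first: request line, then a blank line.
    cmd = f"GET {url} HTTP/1.0\r\n\r\n".encode()
    mysock.sendall(cmd)

    # Receive up to 512 bytes at a time until the server closes
    # the connection, which shows up as a zero-length read.
    chunks = []
    while True:
        data = mysock.recv(512)
        if len(data) < 1:
            break
        chunks.append(data)
    mysock.close()

    # Data from the outside world arrives as bytes; decode it to a string.
    return b"".join(chunks).decode()
```

With a working network connection, print(fetch('data.pr4e.org', 80, 'http://data.pr4e.org/romeo.txt')) should show the headers, a blank line, and then the document, assuming that server is up and still serves that file.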
9082 06:42:30,100 --> 06:42:31,520 There's always a blank line, 9083 06:42:31,520 --> 06:42:34,720 there's a break between the headers and the actual data, 9084 06:42:34,720 --> 06:42:36,320 the metadata and the data, 9085 06:42:36,320 --> 06:42:41,240 and then here is the actual text of that romeo.txt file, 9086 06:42:41,240 --> 06:42:43,860 and then it's gonna run this, gonna print data.decode, 9087 06:42:43,860 --> 06:42:45,900 all this is coming from the print statement. 9088 06:42:45,900 --> 06:42:48,080 If you were gonna parse this, you have to know 9089 06:42:48,080 --> 06:42:51,240 that you're gonna read the headers up to a little blank line. 9090 06:42:51,240 --> 06:42:54,480 The blank line is your indication as a software developer 9091 06:42:54,480 --> 06:42:56,960 that the headers have stopped and the actual text begins, 9092 06:42:56,960 --> 06:42:58,200 and you know the syntax. 9093 06:42:58,200 --> 06:43:02,120 This actually could be a JPEG or PNG 9094 06:43:02,120 --> 06:43:03,600 or some kind of image, right? 9095 06:43:03,600 --> 06:43:05,760 And this data here would look like, blah, blah, blah. 9096 06:43:05,760 --> 06:43:07,940 So if you type this and you change that code 9097 06:43:07,940 --> 06:43:10,880 to actually go retrieve a JPEG URL, 9098 06:43:10,880 --> 06:43:12,820 gibberish will come out, okay? 9099 06:43:13,760 --> 06:43:15,920 And so that's exactly what you will see, 9100 06:43:15,920 --> 06:43:20,700 and so now you have built a very simple web browser. 9101 06:43:20,700 --> 06:43:23,680 Next, I wanna talk a little bit about what happens 9102 06:43:23,680 --> 06:43:28,600 when characters transition outside your computer, 9103 06:43:28,600 --> 06:43:30,760 I mean from inside the computer in strings, 9104 06:43:30,760 --> 06:43:33,900 out across these sockets to servers and then back. 9105 06:43:40,140 --> 06:43:43,560 Hello, everybody, and welcome to some worked sample code.
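The blank-line rule from the previous segment (headers first, then a blank line, then the data) might be applied in code along these lines. The function name split_response and the sample response bytes are made up for illustration; a real server sends more headers than this.

```python
def split_response(raw):
    """Split a raw HTTP response into (header_lines, body_bytes).

    Hypothetical helper: the blank line (two line endings in a row)
    marks where the headers stop and the data begins.
    """
    head, separator, body = raw.partition(b"\r\n\r\n")
    # ISO-8859-1 maps every byte to a character, so this decode cannot fail.
    header_lines = head.decode("iso-8859-1").split("\r\n")
    return header_lines, body

# Illustrative response, not captured from a real server.
sample = (b"HTTP/1.1 200 OK\r\n"
          b"Content-Type: text/plain\r\n"
          b"\r\n"
          b"But soft what light through yonder window breaks\n")

headers, body = split_response(sample)
print(headers)
print(body.decode())
```

The body is left as bytes on purpose: if the Content-Type says it is an image rather than text, decoding it would only produce gibberish.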
9106 06:43:43,560 --> 06:43:45,060 If you are interested in the source code, 9107 06:43:45,060 --> 06:43:49,320 you go to materials and download this sample code.zip. 9108 06:43:49,320 --> 06:43:51,800 I have this downloaded. 9109 06:43:51,800 --> 06:43:54,800 It'll be in a folder called code3 on my computer. 9110 06:43:54,800 --> 06:43:57,160 This is where I'm at, I'm in the code3 folder, 9111 06:43:57,160 --> 06:44:00,800 and this has a ton of bits of code here. 9112 06:44:00,800 --> 06:44:04,480 So if I do an ls, you'll see I got all these files here, 9113 06:44:04,480 --> 06:44:08,120 and so we'll just leave those there. 9114 06:44:08,120 --> 06:44:10,880 And so this is the one I wanna work through right now 9115 06:44:10,880 --> 06:44:13,440 is this socket1.py code. 9116 06:44:13,440 --> 06:44:17,240 And basically what we're doing here is we're simulating 9117 06:44:17,240 --> 06:44:20,080 what is gonna happen in a web browser. 9118 06:44:20,080 --> 06:44:24,480 And the cool thing about the HTML, the HTTP protocol, 9119 06:44:24,480 --> 06:44:26,800 is that we can do this by hand, 9120 06:44:26,800 --> 06:44:29,880 and I'm actually gonna hack this HTTP protocol. 9121 06:44:29,880 --> 06:44:34,720 This is gonna go to data.pr4e.org and retrieve a document. 9122 06:44:36,340 --> 06:44:40,880 And so I'm gonna do telnet to, 9123 06:44:40,880 --> 06:44:43,040 now you can do this on a Mac and Linux, 9124 06:44:43,040 --> 06:44:45,920 and if you put telnet on a Windows box, you can do it here, 9125 06:44:45,920 --> 06:44:50,680 data.pr4e.org, and I wanna talk to port 80, 9126 06:44:50,680 --> 06:44:52,320 and the port 80 is a different port, 9127 06:44:52,320 --> 06:44:54,560 it's a non-standard port, but what we're doing here 9128 06:44:54,560 --> 06:44:57,880 is talking to the HTTP port. 9129 06:44:57,880 --> 06:45:02,760 And so I'm going to be able to hand send commands 9130 06:45:02,760 --> 06:45:05,280 to the web server and retrieve a document. 
9131 06:45:05,280 --> 06:45:09,440 So I'm gonna cut, I've already copied this string, 9132 06:45:09,440 --> 06:45:14,440 this get HTTP romeo.txt, I'm copying that into my buffer 9133 06:45:14,440 --> 06:45:17,640 because if I wait too long, this won't work. 9134 06:45:17,640 --> 06:45:20,440 So here I go, and now I'm gonna type that, 9135 06:45:20,440 --> 06:45:22,240 and I have to hit enter twice, 9136 06:45:22,240 --> 06:45:25,440 and that literally was the HTTP protocol. 9137 06:45:25,440 --> 06:45:27,760 What I typed there was the HTTP protocol, 9138 06:45:27,760 --> 06:45:30,240 and the web server responds with some metadata 9139 06:45:30,240 --> 06:45:33,720 about the document, how much data there is, 9140 06:45:33,720 --> 06:45:35,520 the kind of data is there. 9141 06:45:36,680 --> 06:45:40,160 A blank line separates the header information 9142 06:45:40,160 --> 06:45:42,680 from the body of the document. 9143 06:45:42,680 --> 06:45:45,600 If I was to go to this in a browser, right there, 9144 06:45:45,600 --> 06:45:50,600 you would see, and if I turned on developer console, 9145 06:45:55,160 --> 06:45:57,280 and I went to the network, let's make this 9146 06:45:57,280 --> 06:46:02,280 a little bit bigger, you would see that 9147 06:46:04,960 --> 06:46:07,640 it retrieves this file romeo.txt, 9148 06:46:07,640 --> 06:46:10,600 and it gets back, it tells us, it shows us the headers, 9149 06:46:10,600 --> 06:46:11,880 and it shows us the response. 9150 06:46:11,880 --> 06:46:15,480 And so this is all the same way of doing the same thing, 9151 06:46:15,480 --> 06:46:20,000 and that is how to do the HTTP protocol, okay? 9152 06:46:20,000 --> 06:46:21,880 But now we're gonna do this in Python, 9153 06:46:21,880 --> 06:46:24,080 and so here's the code we're gonna write. 9154 06:46:24,080 --> 06:46:26,280 So we're gonna import the socket library, 9155 06:46:26,280 --> 06:46:27,640 and we're gonna make a socket. 
9156 06:46:27,640 --> 06:46:29,400 Now this doesn't actually make a connection, 9157 06:46:29,400 --> 06:46:31,560 think of a socket as a file handle 9158 06:46:31,560 --> 06:46:34,360 that doesn't have any data associated with it yet. 9159 06:46:34,360 --> 06:46:36,000 And then what we're going to do is we're going to 9160 06:46:36,000 --> 06:46:40,720 reach out and connect that socket to a destination 9161 06:46:40,720 --> 06:46:42,640 across the internet with the domain name 9162 06:46:42,640 --> 06:46:45,720 of data.pr4e.org, and the second parameter 9163 06:46:45,720 --> 06:46:48,400 in this tuple, this is a function call 9164 06:46:48,400 --> 06:46:50,640 with a single tuple as a parameter, 9165 06:46:50,640 --> 06:46:53,280 and so tuple sub zero is data.pr4e.org, 9166 06:46:53,280 --> 06:46:55,120 and tuple sub one is the 80, which says 9167 06:46:55,120 --> 06:46:56,480 I wanna talk to port 80. 9168 06:46:57,560 --> 06:47:01,360 That could fail, it will make the connection, 9169 06:47:01,360 --> 06:47:05,680 and if the port 80 is there, away it goes. 9170 06:47:05,680 --> 06:47:08,220 And then we're gonna actually send the HTTP command, 9171 06:47:08,220 --> 06:47:10,680 so get, this is the HTTP rules, 9172 06:47:10,680 --> 06:47:14,080 followed by an end of line, followed by a blank line. 9173 06:47:14,080 --> 06:47:17,160 So you saw me do this, this was what I typed here, 9174 06:47:17,160 --> 06:47:18,480 and then I had to type a blank line. 9175 06:47:18,480 --> 06:47:21,600 Now if you wanna go read the RFCs for how to do this, 9176 06:47:21,600 --> 06:47:22,960 you can figure this out. 
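The steps just described (make a socket, connect it with a single (host, port) tuple, send the GET line followed by a blank line) could look roughly like this. The function name send_get, the full URL, and the HTTP/1.0 suffix are assumptions based on the usual form of this request, and a small stand-in server on the local machine replaces data.pr4e.org so the sketch runs offline.

```python
import socket
import threading

def send_get(host, port, url):
    """Connect to host:port and send a minimal GET request (illustrative helper)."""
    mysock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)  # no connection yet
    mysock.connect((host, port))          # one parameter: the tuple (host, port)
    # GET line, end of line, then a blank line; encode turns the string into bytes.
    cmd = ("GET " + url + " HTTP/1.0\r\n\r\n").encode()
    mysock.send(cmd)
    return mysock

# Stand-in server on this machine so the sketch needs no internet access.
listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
listener.bind(("127.0.0.1", 0))           # port 0 lets the system pick a free port
listener.listen(1)
port = listener.getsockname()[1]

seen = []

def serve_once():
    conn, _ = listener.accept()
    request = b""
    while b"\r\n\r\n" not in request:     # read until the blank line arrives
        part = conn.recv(1024)
        if not part:
            break
        request += part
    seen.append(request)
    conn.close()

worker = threading.Thread(target=serve_once)
worker.start()
sock = send_get("127.0.0.1", port, "http://data.pr4e.org/romeo.txt")
worker.join()
sock.close()
listener.close()
print(seen[0])
```

Against the real server the call would be send_get('data.pr4e.org', 80, 'http://data.pr4e.org/romeo.txt'), and the connect step is the one that can fail if nothing is listening on port 80.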
9177 06:47:22,960 --> 06:47:25,120 So the only other thing that's kinda weird here 9178 06:47:25,120 --> 06:47:28,440 is we have to add this .encode(), 9179 06:47:29,600 --> 06:47:32,900 and that's because there are strings inside of Python 9180 06:47:32,900 --> 06:47:35,840 that are in Unicode, and we have to send them out 9181 06:47:35,840 --> 06:47:38,480 as what's called UTF-8, and encode() 9182 06:47:38,480 --> 06:47:41,440 converts from Unicode internally to UTF-8. 9183 06:47:41,440 --> 06:47:45,480 So this command is a set of UTF-8 bytes 9184 06:47:45,480 --> 06:47:46,760 that we're then going to send. 9185 06:47:46,760 --> 06:47:49,080 It still has that same set of characters in it, 9186 06:47:50,000 --> 06:47:51,000 and now we're gonna send it. 9187 06:47:51,000 --> 06:47:54,400 And that's, after we've made the connection, 9188 06:47:54,400 --> 06:47:55,880 we're gonna send these two things, 9189 06:47:55,880 --> 06:47:57,960 and then we're going to wait. 9190 06:47:57,960 --> 06:48:00,920 And mysock is like a file handle at that point, 9191 06:48:00,920 --> 06:48:03,160 because it's been opened and we've sent data. 9192 06:48:03,160 --> 06:48:06,120 The HTTP protocol told us what we had to send 9193 06:48:06,120 --> 06:48:07,880 and the fact that we did have to send it. 9194 06:48:07,880 --> 06:48:10,200 So now I have just a simple while loop, 9195 06:48:10,200 --> 06:48:14,060 and I'm going to ask for up to 512 characters, 9196 06:48:14,060 --> 06:48:18,240 and receive up to 512 characters and get that back. 9197 06:48:18,240 --> 06:48:21,200 I will know that this is the end of file 9198 06:48:21,200 --> 06:48:24,080 if I got no data back, so if the length of the data, 9199 06:48:24,080 --> 06:48:26,840 the byte array that I got back is less than one, 9200 06:48:26,840 --> 06:48:28,080 then I'm gonna quit.
9201 06:48:28,080 --> 06:48:29,420 Otherwise, I'm gonna print the data, 9202 06:48:29,420 --> 06:48:30,620 and I'm gonna use this decode, 9203 06:48:30,620 --> 06:48:32,860 which is kinda the opposite of this encode. 9204 06:48:32,860 --> 06:48:37,320 What I'm getting is UTF-8 encoded data, most likely, 9205 06:48:37,320 --> 06:48:40,400 and decode basically converts it to the internal format 9206 06:48:40,400 --> 06:48:43,000 called Unicode that runs inside. 9207 06:48:43,000 --> 06:48:44,560 So this is gonna run a bunch of times, 9208 06:48:44,560 --> 06:48:47,120 pulling in the blocks, basically 512, 9209 06:48:47,120 --> 06:48:50,080 up to 512 characters at a time, printing it out, 9210 06:48:50,080 --> 06:48:51,360 and then when it's all said and done, 9211 06:48:51,360 --> 06:48:53,160 we will close that connection. 9212 06:48:53,160 --> 06:48:55,880 And so, it's not too exciting. 9213 06:48:55,880 --> 06:49:00,880 python3 socket1.py. 9214 06:49:00,880 --> 06:49:03,040 And you'll see that it's just gonna, 9215 06:49:03,040 --> 06:49:05,900 Python is now gonna do what I did by hand. 9216 06:49:05,900 --> 06:49:07,400 Now, of course, the interesting thing is 9217 06:49:07,400 --> 06:49:09,040 these are all in strings, right? 9218 06:49:09,040 --> 06:49:12,480 And so, you know, this way we could write code 9219 06:49:12,480 --> 06:49:13,520 that does stuff with this. 9220 06:49:13,520 --> 06:49:15,200 But all we're really trying to do 9221 06:49:15,200 --> 06:49:19,400 in this particular situation is show how you open a socket, 9222 06:49:19,400 --> 06:49:22,480 send a command, and then retrieve the data. 9223 06:49:26,800 --> 06:49:30,560 Okay, so now it's time to teach you a bit of complexity 9224 06:49:30,560 --> 06:49:32,200 about text processing. 9225 06:49:32,200 --> 06:49:35,040 Up till now, we've kind of been ignoring 9226 06:49:35,040 --> 06:49:36,900 the complexity of text processing.
9227 06:49:37,840 --> 06:49:40,400 Everything that I have been doing, 9228 06:49:40,400 --> 06:49:43,040 most of what I've been doing is in ASCII, 9229 06:49:43,920 --> 06:49:47,000 the Latin character set, the character set that, 9230 06:49:47,000 --> 06:49:49,320 you know, United States, Europe, 9231 06:49:49,320 --> 06:49:53,060 lots of Western civilizations use this character set. 9232 06:49:53,060 --> 06:49:57,360 And if you go back to the 1950s and 1960s, 9233 06:49:57,360 --> 06:50:00,160 they, we were happy to have one computer 9234 06:50:00,160 --> 06:50:02,000 and we didn't care what the character set was 9235 06:50:02,000 --> 06:50:04,040 as long as what you typed on the keyboard 9236 06:50:04,040 --> 06:50:05,280 came out on the printer, 9237 06:50:05,280 --> 06:50:08,360 the internal representation didn't matter. 9238 06:50:08,360 --> 06:50:13,320 And as the 70s and 80s came along, certainly 70s, 9239 06:50:13,320 --> 06:50:14,720 we needed some interoperability. 9240 06:50:14,720 --> 06:50:16,840 And so they standardized that character set, 9241 06:50:16,840 --> 06:50:18,740 but they standardized that character set, 9242 06:50:18,740 --> 06:50:22,120 certainly in the West, that did not represent anything. 9243 06:50:22,120 --> 06:50:26,020 And so if you look at this sheet, 9244 06:50:26,020 --> 06:50:27,880 basically what it's telling you 9245 06:50:27,880 --> 06:50:30,680 is for the various characters, 9246 06:50:30,680 --> 06:50:32,280 there's some non-printing characters, 9247 06:50:32,280 --> 06:50:34,560 white space, non-printing characters, 9248 06:50:34,560 --> 06:50:35,860 and then here's some printing characters 9249 06:50:35,860 --> 06:50:37,880 like the and key, the zero, 9250 06:50:37,880 --> 06:50:39,820 and then the uppercase characters, 9251 06:50:39,820 --> 06:50:41,520 and then the lowercase characters. 9252 06:50:41,520 --> 06:50:45,040 And there's 128 of these possible values. 
9253 06:50:45,040 --> 06:50:48,880 And there are nothing even for Spanish or French in here. 9254 06:50:48,880 --> 06:50:50,840 And it's also why, by the way, 9255 06:50:50,840 --> 06:50:54,000 uppercase letters in Latin sort lower 9256 06:50:54,000 --> 06:50:55,200 than lowercase letters, 9257 06:50:55,200 --> 06:50:57,480 and we saw that in some of the string stuff. 9258 06:50:57,480 --> 06:51:00,560 And what these do is it maps and says, okay. 9259 06:51:03,240 --> 06:51:07,460 And a lowercase a maps to the number integer number 97, 9260 06:51:07,460 --> 06:51:12,460 which in base 16 is 61, and in octal it's 141. 9261 06:51:12,480 --> 06:51:14,840 But in binary, it's eight bit numbers. 9262 06:51:14,840 --> 06:51:16,860 And so these are eight bits, 9263 06:51:18,160 --> 06:51:19,520 otherwise known as a byte. 9264 06:51:20,760 --> 06:51:22,200 And they're very efficient. 9265 06:51:22,200 --> 06:51:24,520 Like when you buy a disk drive, 9266 06:51:24,520 --> 06:51:26,520 it's megabytes or gigabytes or whatever, 9267 06:51:26,520 --> 06:51:30,080 that's how many of these kind of characters it can store. 9268 06:51:30,080 --> 06:51:33,520 But unfortunately, this doesn't work 9269 06:51:33,520 --> 06:51:35,280 for more complex characters. 9270 06:51:35,280 --> 06:51:38,680 You can figure out these numbers inside of Python 9271 06:51:38,680 --> 06:51:41,820 by using the ord function. 9272 06:51:41,820 --> 06:51:44,000 And so you say, what is the ordinal 9273 06:51:44,000 --> 06:51:47,360 or the numeric representation of the uppercase h, 9274 06:51:47,360 --> 06:51:49,920 lowercase e, and newline is a character as well. 9275 06:51:49,920 --> 06:51:53,320 And so like 10 is the ordinal position of newline. 9276 06:51:53,320 --> 06:51:54,920 And this actually has to do with sorting 9277 06:51:54,920 --> 06:51:59,360 so that lowercase e is higher than uppercase h. 
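The ord examples mentioned here can be tried directly. This short sketch only uses built-in functions; the variable names are arbitrary.

```python
# ord() gives the number that stands for a character.
code_H = ord('H')          # uppercase H
code_e = ord('e')          # lowercase e
code_newline = ord('\n')   # newline (line feed) is a character too
print(code_H, code_e, code_newline)

# The same number for lowercase a, shown in decimal, base 16, and octal.
code_a = ord('a')
print(code_a, hex(code_a), oct(code_a))

# A simple sort compares these numbers, so uppercase comes before lowercase.
ordered = sorted(['e', 'H'])
print(ordered)
```

Because ord('H') is smaller than ord('e'), the sorted list starts with the uppercase letter.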
9278 06:51:59,360 --> 06:52:01,920 And that's just because in the simplest of sorts, 9279 06:52:01,920 --> 06:52:04,240 we just sort them numerically. 9280 06:52:04,240 --> 06:52:07,360 So newline, if you go back to the previous little sheet, 9281 06:52:07,360 --> 06:52:10,680 newline is this 10 right here, it's that 10, 9282 06:52:10,680 --> 06:52:12,680 which is a line feed and that's a 10. 9283 06:52:12,680 --> 06:52:15,400 And that's why when we print newline out, we get a 10. 9284 06:52:16,680 --> 06:52:19,900 And so again, in the early days when strings were simple, 9285 06:52:19,900 --> 06:52:22,560 we just represented them as one byte per character. 9286 06:52:22,560 --> 06:52:27,560 But the problem is that as we have gotten more complex 9287 06:52:28,080 --> 06:52:30,360 and in today's modern world, it's simply unacceptable 9288 06:52:30,360 --> 06:52:32,980 to say that the only thing computers can understand 9289 06:52:32,980 --> 06:52:34,040 is ASCII. 9290 06:52:34,040 --> 06:52:37,420 And so this leads to a very, very, 9291 06:52:37,420 --> 06:52:39,080 from the simplest of character sets 9292 06:52:39,080 --> 06:52:42,720 to a super complex character set called Unicode, 9293 06:52:42,720 --> 06:52:46,120 which basically is billions of characters, 9294 06:52:46,120 --> 06:52:49,780 potential billions of characters for every language 9295 06:52:49,780 --> 06:52:51,400 and every character set. 9296 06:52:51,400 --> 06:52:53,640 And because there's so much space in Unicode, 9297 06:52:53,640 --> 06:52:57,680 it's easy to take very small variations of characters 9298 06:52:57,680 --> 06:52:58,680 and give them a space. 9299 06:52:58,680 --> 06:53:01,080 It's so large that you can have, 9300 06:53:02,800 --> 06:53:05,960 you can have pretty much any character that you want. 9301 06:53:05,960 --> 06:53:07,220 So that's Unicode. 
9302 06:53:08,480 --> 06:53:12,840 The problem is that if we sent Unicode across the network, 9303 06:53:12,840 --> 06:53:14,600 it would be way too large. 9304 06:53:14,600 --> 06:53:17,920 It'd be this UTF-32, which instead of being 9305 06:53:17,920 --> 06:53:21,160 eight bits per character would be four bytes per character. 9306 06:53:21,160 --> 06:53:24,440 And so it would take all of the data that we build 9307 06:53:24,440 --> 06:53:29,440 and make it four times larger and it'd be very difficult. 9308 06:53:29,440 --> 06:53:33,780 And so what they've come up with is ways to compress this. 9309 06:53:35,020 --> 06:53:37,600 And UTF-16 is this weird thing. 9310 06:53:37,600 --> 06:53:42,160 UTF-32 is really sort of the full Unicode pretty much. 9311 06:53:42,160 --> 06:53:44,920 UTF-16 is a subset of Unicode. 9312 06:53:44,920 --> 06:53:47,860 It's used in some countries. 9313 06:53:47,860 --> 06:53:52,580 But the best practice for moving data across the internet 9314 06:53:52,580 --> 06:53:56,000 or in a file that you're gonna move between computers 9315 06:53:56,000 --> 06:53:58,200 is what's called UTF-8. 9316 06:53:58,200 --> 06:54:01,800 And so what happens is that UTF-32 is fixed length. 9317 06:54:01,800 --> 06:54:06,060 ASCII is one byte. 9318 06:54:08,080 --> 06:54:10,680 UTF-16 is two bytes, UTF-32 is four bytes. 9319 06:54:10,680 --> 06:54:14,560 And UTF-8 has dynamic length, 9320 06:54:14,560 --> 06:54:16,940 meaning that it is one to four bytes. 9321 06:54:16,940 --> 06:54:18,560 And if it's only one byte long, 9322 06:54:18,560 --> 06:54:20,680 it's perfectly compatible with ASCII, 9323 06:54:20,680 --> 06:54:24,400 meaning that an ASCII file is also UTF-8. 9324 06:54:24,400 --> 06:54:25,840 And so here's this little sheet.
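The difference between fixed-length UTF-32 and variable-length UTF-8 can be checked with a few characters. The four sample characters below are arbitrary choices for illustration; "utf-32-le" is used so that no byte-order mark is counted.

```python
# Four sample characters, written as escapes so the source stays plain ASCII.
samples = {
    "a": "latin small a",
    "\u00e9": "e with acute accent",
    "\u20ac": "euro sign",
    "\U0001F600": "grinning face emoji",
}

utf8_sizes = []
utf32_sizes = []
for ch, name in samples.items():
    n8 = len(ch.encode("utf-8"))        # one to four bytes, depending on the character
    n32 = len(ch.encode("utf-32-le"))   # always four bytes per character
    utf8_sizes.append(n8)
    utf32_sizes.append(n32)
    print(name, "utf-8:", n8, "utf-32:", n32)

# Plain ASCII text produces exactly the same bytes when encoded as UTF-8.
same_bytes = "Hello".encode("ascii") == "Hello".encode("utf-8")
print(same_bytes)
```

For text that is mostly ASCII, UTF-8 stays at one byte per character, which is the size saving the lecture is describing.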
9325 06:54:25,840 --> 06:54:28,420 It's not critical that you understand this graph too much, 9326 06:54:28,420 --> 06:54:30,640 but basically as time passed, 9327 06:54:30,640 --> 06:54:34,960 2000, the internet's coming, coming, coming, coming, now 2014, 9328 06:54:34,960 --> 06:54:38,040 pretty much overwhelmingly the documents on the internet 9329 06:54:38,040 --> 06:54:40,800 that you might retrieve are UTF-8. 9330 06:54:40,800 --> 06:54:44,100 Now, so UTF-8 is the recommended practice 9331 06:54:44,100 --> 06:54:48,040 and it's sort of a compression, UTF-8 can represent 9332 06:54:48,040 --> 06:54:50,640 all the things UTF-32 can represent. 9333 06:54:50,640 --> 06:54:52,760 It's just a compression of it 9334 06:54:52,760 --> 06:54:56,160 with an overlap with ASCII, which is awesome. 9335 06:54:56,160 --> 06:54:57,260 It's what you want. 9336 06:54:58,520 --> 06:54:59,740 I don't even talk anymore. 9337 06:54:59,740 --> 06:55:02,440 So in Python, we have always had 9338 06:55:02,440 --> 06:55:05,540 sort of two ways of representing strings. 9339 06:55:05,540 --> 06:55:10,480 In Python 2, the normal string was a byte string, 9340 06:55:10,480 --> 06:55:13,680 was an ASCII string, was a Latin string. 9341 06:55:13,680 --> 06:55:15,640 And if you wanted to represent Unicode, 9342 06:55:15,640 --> 06:55:18,540 there was a separate kind of object that we had. 9343 06:55:18,540 --> 06:55:21,560 And so you would do that. 9344 06:55:21,560 --> 06:55:25,960 And in Python 3.0 or later, 9345 06:55:25,960 --> 06:55:28,720 one of the main features of Python 3 9346 06:55:28,720 --> 06:55:31,400 was to make Unicode and string the same. 9347 06:55:31,400 --> 06:55:33,640 So that means inside of Python, 9348 06:55:33,640 --> 06:55:36,940 when you have a string variable, it's a Unicode. 9349 06:55:36,940 --> 06:55:40,840 Whereas inside of Python 2, it was a byte variable.
9350 06:55:40,840 --> 06:55:44,160 And so now we have this notion, 9351 06:55:44,160 --> 06:55:47,520 separately in Python 2 and on Python 3, 9352 06:55:47,520 --> 06:55:49,740 where we have byte variables. 9353 06:55:49,740 --> 06:55:54,360 And so byte variables are, in effect, an array of bytes. 9354 06:55:54,360 --> 06:55:57,280 So if there's ABC, that means it's three bytes, 9355 06:55:57,280 --> 06:55:58,560 it's three bytes long. 9356 06:55:58,560 --> 06:56:01,240 Whereas a string might be, a three-character string 9357 06:56:01,240 --> 06:56:04,140 might be anywhere from three to 12 bytes long. 9358 06:56:05,120 --> 06:56:08,720 So Python 2 had bytes and strings that were the same. 9359 06:56:08,720 --> 06:56:11,680 So bytes and strings are the same, 9360 06:56:11,680 --> 06:56:13,000 and Unicode is weird. 9361 06:56:13,000 --> 06:56:14,580 And in Python 3, 9362 06:56:18,040 --> 06:56:21,320 strings and Unicode are the same, and bytes are weird. 9363 06:56:21,320 --> 06:56:24,520 Okay, and so that's what we've got to deal with. 9364 06:56:24,520 --> 06:56:29,520 And there'll be times when we get bytes from APIs, 9365 06:56:29,600 --> 06:56:32,320 when we call things, we have to then figure out 9366 06:56:32,320 --> 06:56:33,960 what kind of thing those bytes contain. 9367 06:56:33,960 --> 06:56:35,920 Because the bytes might contain ASCII, 9368 06:56:35,920 --> 06:56:39,460 they might contain UTF-8, they might contain various things. 9369 06:56:39,460 --> 06:56:43,080 And so internally, all the strings in Python 3 are Unicode. 9370 06:56:44,040 --> 06:56:46,660 Most of the time, if you're inside the program, 9371 06:56:46,660 --> 06:56:48,960 or reading and writing files, we just work. 9372 06:56:48,960 --> 06:56:50,880 And that's why we haven't mentioned it. 
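In Python 3 the two kinds of values described here are the types bytes and str. The following sketch shows how they differ; the sample strings are arbitrary and chosen only to show the three-to-twelve-byte range for a three-character string in UTF-8.

```python
raw = b"abc"      # bytes: an array of three byte values
text = "abc"      # str: a Unicode string of three characters

print(type(raw), len(raw))
print(type(text), len(text))
print(raw[0])         # indexing bytes gives a number, here the code for 'a'
print(raw == text)    # bytes and str never compare equal

# Three characters can need anywhere from 3 to 12 bytes once encoded as UTF-8.
three_ascii = "abc"
three_cjk = "\u65e5\u672c\u8a9e"    # three CJK characters, three bytes each
three_emoji = "\U0001F600" * 3      # three emoji, four bytes each
sizes = [len(s.encode("utf-8")) for s in (three_ascii, three_cjk, three_emoji)]
print(sizes)
```

Each of the three strings has a length of 3 as a str; only the encoded byte counts differ.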
9373 06:56:50,880 --> 06:56:52,560 But now that we're talking over sockets, 9374 06:56:52,560 --> 06:56:55,080 and we're talking to the sort of random world out there, 9375 06:56:55,080 --> 06:56:57,240 we have to be a little more aware 9376 06:56:57,240 --> 06:56:58,360 of the data we're dealing with. 9377 06:56:58,360 --> 06:57:01,280 Now, the good news is 98% of the time, 9378 06:57:01,280 --> 06:57:04,800 or 95% of the time, it's UTF-8, 9379 06:57:04,800 --> 06:57:08,000 which might also include ASCII, and so it's quite nice. 9380 06:57:08,000 --> 06:57:10,440 But we have to be aware of this. 9381 06:57:11,440 --> 06:57:15,040 And so if we are going to take data 9382 06:57:15,040 --> 06:57:17,640 that comes off of the network in the bytes, 9383 06:57:17,640 --> 06:57:20,520 then we have to make sure that we interpret it, 9384 06:57:20,520 --> 06:57:23,380 or decode it in the right way, 9385 06:57:23,380 --> 06:57:25,960 so that internally the strings, which are Unicode, 9386 06:57:25,960 --> 06:57:27,680 are properly represented. 9387 06:57:27,680 --> 06:57:30,760 And so that's why when we read data in 9388 06:57:30,760 --> 06:57:33,500 from a network connection like a socket, 9389 06:57:33,500 --> 06:57:35,560 we have to say, hey, decode it. 9390 06:57:35,560 --> 06:57:37,440 Now, there's a couple things going on 9391 06:57:37,440 --> 06:57:39,320 at that moment of decode. 9392 06:57:40,320 --> 06:57:42,040 And so this is where we're doing it. 9393 06:57:42,040 --> 06:57:45,280 We see this, we have to manage this in this code, 9394 06:57:45,280 --> 06:57:47,320 where we, before we send this stuff, 9395 06:57:47,320 --> 06:57:50,760 we're gonna encode it, which takes a Unicode string 9396 06:57:50,760 --> 06:57:53,240 and turns it into UTF-8 bytes. 9397 06:57:53,240 --> 06:57:54,320 There's actually a parameter here 9398 06:57:54,320 --> 06:57:56,320 that you could do it different than UTF-8, 9399 06:57:56,320 --> 06:57:57,840 but no one ever does. 
9400 06:57:57,840 --> 06:57:59,700 You might have to for certain situations, 9401 06:57:59,700 --> 06:58:03,640 but so that says that we're gonna encode this into UTF-8 9402 06:58:03,640 --> 06:58:06,540 before we send it, and then when we get something back, 9403 06:58:06,540 --> 06:58:08,760 before we print it, we're gonna decode it. 9404 06:58:08,760 --> 06:58:10,900 And that's how this ends up working out. 9405 06:58:11,920 --> 06:58:14,180 And if you look at the documentation, 9406 06:58:14,180 --> 06:58:17,040 you will see that sometimes it says it's a string, 9407 06:58:17,040 --> 06:58:18,120 or it's bytes. 9408 06:58:18,120 --> 06:58:22,120 And so you take a byte array, 9409 06:58:22,120 --> 06:58:23,820 and you decode it to get a string, 9410 06:58:23,820 --> 06:58:26,680 and you take a string and encode it to get a byte array. 9411 06:58:26,680 --> 06:58:29,280 And so that's what we're doing. 9412 06:58:29,280 --> 06:58:32,120 So you can think of the process as this way, 9413 06:58:32,120 --> 06:58:36,860 and that is the network has these UTF-8, 9414 06:58:36,860 --> 06:58:40,680 mostly UTF-8 resources, not ASCII. 9415 06:58:40,680 --> 06:58:42,160 If it's ASCII, it's okay. 9416 06:58:42,160 --> 06:58:44,840 So you read with the receive. 9417 06:58:44,840 --> 06:58:47,160 So this receive here pulls data, 9418 06:58:47,160 --> 06:58:49,480 well, we have a Unicode string, 9419 06:58:49,480 --> 06:58:50,780 let's start with the send. 9420 06:58:50,780 --> 06:58:53,480 So up here, we have a Unicode string, 9421 06:58:53,480 --> 06:58:54,720 that's a Unicode string, 9422 06:58:54,720 --> 06:58:56,680 even though there's no special characters in it, 9423 06:58:56,680 --> 06:58:58,880 no Asian characters or French characters, 9424 06:58:58,880 --> 06:59:00,800 that's a Unicode string. 9425 06:59:00,800 --> 06:59:02,400 And before we can send it, 9426 06:59:02,400 --> 06:59:04,320 we have to send it in UTF-8. 
9427 06:59:04,320 --> 06:59:07,440 If that had Asian characters, it'd be okay, 9428 06:59:07,440 --> 06:59:09,240 and that would be set up just right, 9429 06:59:09,240 --> 06:59:11,000 so that the UTF-8 would be right. 9430 06:59:11,000 --> 06:59:13,820 So we encode it first, and that's the CMD. 9431 06:59:13,820 --> 06:59:15,240 This is now bytes, okay? 9432 06:59:15,240 --> 06:59:18,440 CMD is bytes, and then we actually send the bytes. 9433 06:59:18,440 --> 06:59:19,800 And that goes across the network. 9434 06:59:19,800 --> 06:59:22,280 We get back our thing, and we receive, 9435 06:59:22,280 --> 06:59:26,820 and we receive into data, well, data is bytes, not string. 9436 06:59:26,820 --> 06:59:27,760 It's bytes. 9437 06:59:27,760 --> 06:59:29,440 We can say how big it is. 9438 06:59:29,440 --> 06:59:31,760 Function's kinda like a string, and it has len, 9439 06:59:31,760 --> 06:59:34,520 except that it is one byte per character, 9440 06:59:34,520 --> 06:59:37,520 which means some of it might be UTF-8. 9441 06:59:37,520 --> 06:59:39,200 And then all we have to do is say decode. 9442 06:59:39,200 --> 06:59:41,480 Again, you could, if you were dealing with a situation 9443 06:59:41,480 --> 06:59:42,600 where you weren't expecting it 9444 06:59:42,600 --> 06:59:45,560 to typically be UTF-8 or ASCII, 9445 06:59:45,560 --> 06:59:48,040 you could tell it UTF-16 or something, 9446 06:59:48,040 --> 06:59:49,320 and it's more complex, 9447 06:59:49,320 --> 06:59:51,480 but the simple thing is to just say, 9448 06:59:51,480 --> 06:59:53,240 I'm gonna clean up my data on the way in, 9449 06:59:53,240 --> 06:59:54,940 I'm gonna clean it up by running it through decode, 9450 06:59:54,940 --> 06:59:56,960 and I'm gonna encode stuff on the way out. 9451 06:59:56,960 --> 07:00:01,440 And so sockets are the place where this comes into play. 
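The encode-on-the-way-out, decode-on-the-way-in pattern can be seen without a socket at all. In this sketch the "incoming" bytes are built by hand to stand in for what recv would return, and the request text is an assumption about what the cmd variable holds.

```python
# On the way out: a str (Unicode) becomes bytes before it is sent.
request_text = "GET http://data.pr4e.org/romeo.txt HTTP/1.0\r\n\r\n"
cmd = request_text.encode()          # UTF-8 is the default encoding
print(type(cmd), len(cmd))

# On the way in: pretend these bytes arrived from the network.
incoming = "caf\u00e9\n".encode("utf-8")
print(len(incoming))                 # counts bytes; the accented letter takes two
decoded = incoming.decode()          # back to a str, UTF-8 by default
print(len(decoded))                  # counts characters

# If the data is known to use another encoding, name it explicitly.
utf16_bytes = "hi".encode("utf-16")
round_trip = utf16_bytes.decode("utf-16")
print(round_trip)
```

The len of the bytes and the len of the decoded string differ as soon as any character needs more than one byte, which is why the two types should not be mixed up.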
9452 07:00:01,440 --> 07:00:04,500 And so you'll see, we'll always do this encode and decode 9453 07:00:04,500 --> 07:00:07,720 every time we're sending data kind of outside of Python 9454 07:00:07,720 --> 07:00:09,680 and inside of Python. 9455 07:00:09,680 --> 07:00:11,960 So now that we've talked a little bit about character sets, 9456 07:00:11,960 --> 07:00:15,080 we're going to make this even easier 9457 07:00:15,080 --> 07:00:16,280 so you don't have to use sockets. 9458 07:00:16,280 --> 07:00:20,040 urllib is a bit of Python code in the library 9459 07:00:20,040 --> 07:00:22,040 that does all the socket stuff for you. 9460 07:00:22,040 --> 07:00:27,040 Okay, so now we're going to write a web browser again 9461 07:00:28,480 --> 07:00:31,120 in Python, but it's going to even be shorter 9462 07:00:31,120 --> 07:00:32,320 than what we did before. 9463 07:00:32,320 --> 07:00:34,540 We did it in 10 lines using sockets. 9464 07:00:34,540 --> 07:00:37,900 Now we're gonna do it in four lines with urllib. 9465 07:00:37,900 --> 07:00:41,380 So urllib really is just because the idea 9466 07:00:41,380 --> 07:00:43,520 of opening a connection, sending a GET request, 9467 07:00:43,520 --> 07:00:45,700 sending the new line, retrieving the stuff, 9468 07:00:45,700 --> 07:00:48,080 breaking the headers out, doing all this stuff, 9469 07:00:48,080 --> 07:00:50,520 that's so common, why not put it in a library 9470 07:00:50,520 --> 07:00:52,280 to save ourselves some effort. 9471 07:00:52,280 --> 07:00:54,520 So here's how we do it. 9472 07:00:54,520 --> 07:00:58,600 We're going to read it in, we're gonna import this library 9473 07:00:58,600 --> 07:01:01,000 so it's not part, we had to import sockets before, 9474 07:01:01,000 --> 07:01:02,920 but we're gonna import urllib now. 9475 07:01:02,920 --> 07:01:04,880 And so this is really quite simple. 9476 07:01:04,880 --> 07:01:07,240 It's like elegantly simple.
9477 07:01:07,240 --> 07:01:09,880 You say, urllib, that's a library, 9478 07:01:09,880 --> 07:01:11,920 that's part of a module within the library 9479 07:01:11,920 --> 07:01:12,960 and this is a function. 9480 07:01:12,960 --> 07:01:17,120 So let's call urlopen and then give it the URL. 9481 07:01:17,120 --> 07:01:19,760 Now that's a string which it's gonna encode automatically 9482 07:01:19,760 --> 07:01:22,360 for us, so it's taking care of all kind of pretty things 9483 07:01:22,360 --> 07:01:24,520 for us, it does the GET, it does the encode. 9484 07:01:24,520 --> 07:01:26,120 Look back at that previous code. 9485 07:01:26,120 --> 07:01:29,360 That's kind of what urllib is doing for us, okay? 9486 07:01:29,360 --> 07:01:32,720 Now what urllib also does is it makes the connection, 9487 07:01:32,720 --> 07:01:36,720 encodes the GET request, and then it actually retrieves, 9488 07:01:36,720 --> 07:01:38,840 at this moment, it retrieves all the headers 9489 07:01:38,840 --> 07:01:40,080 and keeps them for you for later. 9490 07:01:40,080 --> 07:01:42,560 You can get the headers, but we're not gonna see the headers. 9491 07:01:42,560 --> 07:01:45,560 And it returns to you an object that looks 9492 07:01:45,560 --> 07:01:47,440 pretty much like a file handle. 9493 07:01:47,440 --> 07:01:51,480 Because you can put this in the for clause after the in. 9494 07:01:51,480 --> 07:01:56,480 Now it's going to read, run that loop one time 9495 07:01:56,900 --> 07:01:58,540 for every line of this file. 9496 07:01:59,480 --> 07:02:02,020 And so the lines we're gonna get back are bytes 9497 07:02:02,020 --> 07:02:03,240 and so we have to say decode. 9498 07:02:03,240 --> 07:02:05,160 It doesn't do that for us automatically. 9499 07:02:05,160 --> 07:02:06,720 We are gonna have to decode them 9500 07:02:06,720 --> 07:02:08,780 and that's because we might need to decode them 9501 07:02:08,780 --> 07:02:10,560 with a particular character set here.
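The urllib version described here might look like the sketch below. The wrapper function read_lines is an assumption added for illustration, and the demonstration uses a data: URL (which urlopen also accepts) so that it runs without a network connection.

```python
import urllib.request

def read_lines(url):
    """Open a URL like a file and return its lines as stripped strings.

    Illustrative wrapper; the lecture's version prints inside the loop.
    """
    fhand = urllib.request.urlopen(url)   # connects, sends the GET, reads the headers
    lines = []
    for line in fhand:                    # the handle goes after "in", like a file
        lines.append(line.decode().strip())   # each line arrives as bytes
    fhand.close()
    return lines

# Offline demonstration: a data: URL that carries two short lines of text.
demo_url = "data:text/plain,first%20line%0Asecond%20line%0A"
demo_lines = read_lines(demo_url)
for line in demo_lines:
    print(line)
```

With a network connection, calling read_lines('http://data.pr4e.org/romeo.txt') should retrieve the file used in the lecture; note that no headers appear in the loop, only the data.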
9502 07:02:10,560 --> 07:02:11,880 And then we're gonna do our strip 9503 07:02:11,880 --> 07:02:12,920 and we're gonna just print this out. 9504 07:02:12,920 --> 07:02:15,380 So that's just, that's like open a file, 9505 07:02:15,380 --> 07:02:16,440 read through it and print it. 9506 07:02:16,440 --> 07:02:18,880 This is open a URL, read through and print it. 9507 07:02:18,880 --> 07:02:22,040 And that's as simple as it is. 9508 07:02:22,040 --> 07:02:23,080 And so that's what happens. 9509 07:02:23,080 --> 07:02:26,400 This is Romeo.txt and it prints out. 9510 07:02:26,400 --> 07:02:29,440 Now the thing to notice is that there are no headers here. 9511 07:02:29,440 --> 07:02:32,760 The headers have been sort of consumed in the URL open. 9512 07:02:32,760 --> 07:02:36,080 Again, there is a way to say, hey, give me my headers. 9513 07:02:36,080 --> 07:02:38,920 But for now, this is just gonna eat the headers 9514 07:02:38,920 --> 07:02:41,660 and keep them and then you get to read all the data 9515 07:02:41,660 --> 07:02:44,240 and the loop runs and this loop runs four times 9516 07:02:44,240 --> 07:02:45,220 and I'll count the four lines. 9517 07:02:45,220 --> 07:02:47,000 You can go ahead and run this one. 9518 07:02:47,000 --> 07:02:48,680 It's super easy. 9519 07:02:48,680 --> 07:02:51,360 I mean literally super easy. 9520 07:02:51,360 --> 07:02:53,640 And if you, you can do anything you want. 9521 07:02:53,640 --> 07:02:55,040 I mean treat it like a file. 9522 07:02:55,040 --> 07:02:57,400 You just have to remember to do the decode bit 9523 07:02:57,400 --> 07:02:59,160 when you treat it like a file. 9524 07:02:59,160 --> 07:03:01,360 And so we, that code import it. 9525 07:03:01,360 --> 07:03:02,320 We're gonna open it. 9526 07:03:02,320 --> 07:03:04,240 We're going to make a dictionary. 9527 07:03:04,240 --> 07:03:05,560 We're gonna loop through. 9528 07:03:05,560 --> 07:03:06,400 We're gonna split it. 
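The short program described above (open a URL, loop over it like a file, decode each line, strip it, print it) can be sketched roughly as follows. This is a minimal sketch, not necessarily the exact code on the slide: the helper names `decoded_lines` and `print_url` are made up for illustration, and the URL in the final comment is an assumption.

```python
import urllib.request


def decoded_lines(handle, encoding="utf-8"):
    """Yield each line from a bytes-producing handle as a stripped str."""
    for raw_line in handle:
        # Each line arrives as bytes, so decode it before treating it as text.
        yield raw_line.decode(encoding).strip()


def print_url(url):
    """Open a URL like a file and print its body line by line."""
    # urlopen makes the connection, sends the GET request, and
    # consumes the headers before handing back a file-like object.
    with urllib.request.urlopen(url) as handle:
        for line in decoded_lines(handle):
            print(line)


# Example call (the URL is an assumption; any plain-text page would do):
# print_url("http://data.pr4e.org/romeo.txt")
```

Because `decoded_lines` only needs something that yields bytes, it works the same on a `urlopen` handle, a file opened in binary mode, or a plain list of byte strings.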
9529 07:03:06,400 --> 07:03:08,260 We have to add the decode just to make sure 9530 07:03:08,260 --> 07:03:10,680 because that line is bytes, not string. 9531 07:03:11,560 --> 07:03:14,140 And then we're gonna go, you know, our words. 9532 07:03:14,140 --> 07:03:16,980 We're gonna go through the line and then each line 9533 07:03:16,980 --> 07:03:18,220 we're gonna bounce through the words. 9534 07:03:18,220 --> 07:03:20,500 The inner for loop is bouncing through the words 9535 07:03:20,500 --> 07:03:22,320 and then we're gonna go to the next line 9536 07:03:22,320 --> 07:03:25,360 and then we make ourselves a dictionary 9537 07:03:25,360 --> 07:03:26,620 and we print that dictionary out. 9538 07:03:26,620 --> 07:03:31,340 Now this is, this in effect, other than, you know, 9539 07:03:31,340 --> 07:03:34,140 importing this, opening it differently and doing the decode, 9540 07:03:34,140 --> 07:03:36,820 this is exactly how we would process a file. 9541 07:03:36,820 --> 07:03:39,980 And so by using URL lib, you really sort of reduce 9542 07:03:39,980 --> 07:03:44,500 the complexity of retrieving and reading network resources 9543 07:03:44,500 --> 07:03:46,420 to the same complexity of reading 9544 07:03:46,420 --> 07:03:49,260 and dealing with a file locally on your hard drive, 9545 07:03:49,260 --> 07:03:51,460 which is kind of pretty. 9546 07:03:51,460 --> 07:03:55,960 So one of the things then we can do is read web pages. 9547 07:03:55,960 --> 07:03:58,900 That was a text file but you can get HTML 9548 07:03:58,900 --> 07:04:01,540 and so here's how you read a web page. 9549 07:04:01,540 --> 07:04:03,380 And it's the same kind of code. 9550 07:04:03,380 --> 07:04:07,300 We open a, we open a URL. 9551 07:04:07,300 --> 07:04:10,060 This one happens to have HTML in it and we read through it 9552 07:04:10,060 --> 07:04:11,300 and out comes the HTML. 
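The word-count walkthrough above (make a dictionary, loop through the lines, decode, split, count each word) could look something like this. It is a sketch under assumptions: `count_words` is a hypothetical helper name, and the URL in the closing comment is an assumption as well.

```python
import urllib.request


def count_words(byte_lines, encoding="utf-8"):
    """Count how often each word appears in an iterable of byte lines."""
    counts = {}
    for raw_line in byte_lines:
        # The line is bytes, not a string, so decode before splitting.
        words = raw_line.decode(encoding).split()
        for word in words:
            counts[word] = counts.get(word, 0) + 1
    return counts


# Example call against a network resource (URL is an assumption):
# with urllib.request.urlopen("http://data.pr4e.org/romeo.txt") as fhand:
#     print(count_words(fhand))
```

Apart from the import, the `urlopen` call, and the `decode`, the loop is the same one you would write for a local file.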
9553 07:04:11,300 --> 07:04:13,660 Remember that the headers are there 9554 07:04:13,660 --> 07:04:16,440 but they've been eaten by URL open for us. 9555 07:04:16,440 --> 07:04:18,500 And now we could write a browser that would parse 9556 07:04:18,500 --> 07:04:19,840 these less thans and greater thans 9557 07:04:19,840 --> 07:04:23,180 and make links, et cetera, et cetera, et cetera. 9558 07:04:25,260 --> 07:04:30,120 So if you can come up with ways to find these links, 9559 07:04:30,120 --> 07:04:32,820 you could actually write a bit of code 9560 07:04:32,820 --> 07:04:34,340 that would then have a loop that would go up 9561 07:04:34,340 --> 07:04:35,300 and open a new one. 9562 07:04:35,300 --> 07:04:36,740 Pull out the links, open a new one. 9563 07:04:36,740 --> 07:04:38,700 Pull out the links, open a new one. 9564 07:04:38,700 --> 07:04:39,540 And so you could. 9565 07:04:39,540 --> 07:04:42,020 You could make a thing that would retrieve a program 9566 07:04:42,020 --> 07:04:45,340 that would retrieve a page, find the links in the page 9567 07:04:45,340 --> 07:04:46,780 and then retrieve those links. 9568 07:04:46,780 --> 07:04:49,580 And we'll actually do that before the end of the class. 9569 07:04:50,540 --> 07:04:53,020 And so Python is a very popular language at Google 9570 07:04:53,020 --> 07:04:57,220 and I wonder if I'm gonna, I think it's a pretty safe bet 9571 07:04:57,220 --> 07:04:59,480 that the first crawler that they wrote 9572 07:04:59,480 --> 07:05:02,380 to crawl the web to build the index was Python 9573 07:05:02,380 --> 07:05:06,700 because literally that's all it takes to read web pages. 9574 07:05:06,700 --> 07:05:10,900 And pull those web pages into your web crawler database. 9575 07:05:10,900 --> 07:05:12,020 So I don't know. 9576 07:05:12,020 --> 07:05:14,460 Are those the first four lines ever written to Google? 9577 07:05:14,460 --> 07:05:16,340 Who knows? 
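The "pull out the links, open a new one" loop described above can be sketched without any network code by passing in the link-finding step as a function. Everything here is illustrative: `crawl`, `fetch_links`, and `max_pages` are hypothetical names, and a real crawler would also need politeness delays and error handling.

```python
from collections import deque


def crawl(start_url, fetch_links, max_pages=10):
    """Visit pages breadth-first, following links until max_pages is reached.

    fetch_links is a hypothetical callable: given a URL, it returns the
    list of URLs found on that page.
    """
    to_visit = deque([start_url])   # links we know about but have not retrieved
    seen = {start_url}              # every link ever queued, to avoid repeats
    visited = []                    # pages in the order they were retrieved
    while to_visit and len(visited) < max_pages:
        url = to_visit.popleft()
        visited.append(url)
        for link in fetch_links(url):
            if link not in seen:
                seen.add(link)
                to_visit.append(link)
    return visited
```

Separating the loop from the fetching makes it easy to try the logic on a small in-memory "web" (a dictionary mapping page names to lists of links) before pointing it at real pages.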
9578 07:05:16,340 --> 07:05:18,240 So the next thing that we'll talk about 9579 07:05:18,240 --> 07:05:21,420 is how you handle that HTML. 9580 07:05:21,420 --> 07:05:23,480 HTML is kind of yucky and nasty 9581 07:05:23,480 --> 07:05:26,220 and so it's not as simple as regular expressions. 9582 07:05:26,220 --> 07:05:28,080 Regular expressions might help. 9583 07:05:28,080 --> 07:05:29,980 String parsing and split might help 9584 07:05:29,980 --> 07:05:31,520 but it's just too crazy. 9585 07:05:31,520 --> 07:05:34,100 So we'll talk a little bit about how to use a library 9586 07:05:34,100 --> 07:05:36,760 to make HTML parsing a lot easier. 9587 07:05:41,500 --> 07:05:44,580 We are going to be talking about some code. 9588 07:05:44,580 --> 07:05:47,060 If you wanna download all the code, it's right here. 9589 07:05:47,060 --> 07:05:49,220 It's all a single big zip file. 9590 07:05:49,220 --> 07:05:51,740 And of all the sample code, 9591 07:05:51,740 --> 07:05:54,740 the one I'm gonna talk about is urllib1.py. 9592 07:05:54,740 --> 07:05:57,380 It is not very exciting. 9593 07:05:57,380 --> 07:05:58,980 It's short. 9594 07:05:58,980 --> 07:06:02,960 That's what's kinda nice about Python code. 9595 07:06:02,960 --> 07:06:05,700 And it's really, if we go and take a look at the code 9596 07:06:05,700 --> 07:06:08,700 we played with just previously, which is socket, 9597 07:06:08,700 --> 07:06:11,660 the idea here is urllib is something that Python 9598 07:06:11,660 --> 07:06:15,160 has produced for us to make socket communications 9599 07:06:15,160 --> 07:06:17,660 and HTTP communications a lot better. 9600 07:06:17,660 --> 07:06:21,440 So socket, this is making socket calls underneath it 9601 07:06:21,440 --> 07:06:24,360 but there's a library that makes this quite simple. 9602 07:06:24,360 --> 07:06:27,020 And so we have to do some imports. 9603 07:06:27,020 --> 07:06:29,300 So instead of importing socket, we'll import these.
9604 07:06:29,300 --> 07:06:30,580 We are going to create a handle. 9605 07:06:30,580 --> 07:06:34,180 We call urllib.request.urlopen and just pass in a string. 9606 07:06:34,180 --> 07:06:35,780 So we're not encoding this. 9607 07:06:35,780 --> 07:06:37,140 We're not sending the GET command. 9608 07:06:37,140 --> 07:06:39,820 All the stuff we did in the previous sockets example 9609 07:06:39,820 --> 07:06:42,700 is gone and then we can just put this in a for loop. 9610 07:06:42,700 --> 07:06:45,660 And so we're not using this lower level read and write code. 9611 07:06:45,660 --> 07:06:46,820 We're just using a for loop. 9612 07:06:46,820 --> 07:06:51,320 And so that literally is gonna read the text line by line. 9613 07:06:51,320 --> 07:06:53,940 And the line does come back as an array of bytes 9614 07:06:53,940 --> 07:06:56,580 so we have to do a decode but then we got a string 9615 07:06:56,580 --> 07:06:57,880 and then we can do a strip on it. 9616 07:06:57,880 --> 07:07:02,880 So this is like super simple, super simple. 9617 07:07:06,420 --> 07:07:07,440 So there we go. 9618 07:07:07,440 --> 07:07:10,060 Now the interesting thing is you also don't see the headers. 9619 07:07:10,060 --> 07:07:11,700 We just read the contents. 9620 07:07:11,700 --> 07:07:13,580 Now it turns out in urllib, 9621 07:07:13,580 --> 07:07:16,480 and we'll see this in a later, more complex application, 9622 07:07:16,480 --> 07:07:18,520 you can get the headers if you want. 9623 07:07:18,520 --> 07:07:20,860 You can get various other things. 9624 07:07:20,860 --> 07:07:27,860 So that's urllib, a simple urllib tool. 9625 07:07:30,380 --> 07:07:33,180 Now we can also use this in urlwords 9626 07:07:33,180 --> 07:07:35,400 to show you something quite interesting.
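The headers that urlopen "eats" are still reachable on the handle it returns. One possible use, sketched here, is to read the character set the server declared and decode with that instead of always assuming UTF-8. The helper name `fetch_text` and the `default_encoding` parameter are made up for illustration; `handle.headers` and `get_content_charset()` come from the standard library.

```python
import urllib.request


def fetch_text(url, default_encoding="utf-8"):
    """Return (headers, text) for a URL, decoding with the declared charset."""
    with urllib.request.urlopen(url) as handle:
        # The headers that urlopen consumed are kept on the handle.
        headers = handle.headers
        # Use the charset from Content-Type when one is declared.
        encoding = headers.get_content_charset() or default_encoding
        text = handle.read().decode(encoding)
    return headers, text


# Example call (URL is an assumption):
# headers, text = fetch_text("http://data.pr4e.org/romeo.txt")
# print(headers.get("Content-Type"))
```

This reads the whole body in one go rather than line by line, which is fine for small documents.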
9627 07:07:35,400 --> 07:07:38,860 And that is if you look at this from right here 9628 07:07:38,860 --> 07:07:42,260 other than the decode, this is exactly the code we wrote 9629 07:07:42,260 --> 07:07:45,140 to count the words, right? 9630 07:07:45,140 --> 07:07:47,360 So other than this line.decode, 9631 07:07:47,360 --> 07:07:49,880 this is just open something up. 9632 07:07:49,880 --> 07:07:51,880 In this case, we're gonna open a URL. 9633 07:07:51,880 --> 07:07:52,920 We're gonna create a dictionary. 9634 07:07:52,920 --> 07:07:55,900 We're gonna loop through each of the lines in that thing. 9635 07:07:55,900 --> 07:07:57,560 We're gonna decode them and then split them. 9636 07:07:57,560 --> 07:07:59,100 So once you do line.decode, 9637 07:07:59,100 --> 07:08:02,160 this is now a legitimate internal Python string. 9638 07:08:02,160 --> 07:08:05,640 We split it, we run through the words, and run the counts. 9639 07:08:05,640 --> 07:08:09,320 And so this is exactly like the code that we did before 9640 07:08:09,320 --> 07:08:10,680 to run counts. 9641 07:08:10,680 --> 07:08:13,700 And so python3, 9642 07:08:14,620 --> 07:08:16,400 urlwords. 9643 07:08:16,400 --> 07:08:21,040 And so that gives us a dictionary 9644 07:08:21,040 --> 07:08:22,640 which is the word frequency. 9645 07:08:22,640 --> 07:08:25,920 And we could do all kinds of crazy stuff in here 9646 07:08:25,920 --> 07:08:28,280 with sorting and all the kinds of things. 9647 07:08:28,280 --> 07:08:31,000 The important thing is once you've done this and this, 9648 07:08:31,000 --> 07:08:33,440 the code just needs to decode these lines 9649 07:08:33,440 --> 07:08:34,680 when you first get them. 9650 07:08:35,760 --> 07:08:39,800 It really works. urllib 9651 07:08:39,800 --> 07:08:44,800 makes URLs function inside Python very much like files.
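As one example of the "sorting" follow-up mentioned above, a word-frequency dictionary can be turned into a most-common list. This is only a sketch; `top_words` is a hypothetical helper, and the tie-breaking rule (alphabetical) is an arbitrary choice.

```python
def top_words(counts, n=5):
    """Return the n most frequent (word, count) pairs, highest count first."""
    # Sort by descending count, breaking ties alphabetically
    # so the output order is predictable.
    return sorted(counts.items(), key=lambda item: (-item[1], item[0]))[:n]


sample = {"the": 3, "sun": 1, "moon": 1}
print(top_words(sample, 2))
```

The same function works on any dictionary of counts, whether it was built from a URL or from a local file.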
9652 07:08:44,800 --> 07:08:48,320 So these are short and to the point and very simple 9653 07:08:48,320 --> 07:08:50,320 and I hope that they were useful to you. 9654 07:08:53,640 --> 07:08:55,720 So now we're going to talk about what you would do 9655 07:08:55,720 --> 07:08:57,560 with a web page once you've retrieved it 9656 07:08:57,560 --> 07:08:58,960 in a Python program. 9657 07:08:58,960 --> 07:09:00,500 We call this web scraping. 9658 07:09:01,480 --> 07:09:03,560 And so web scraping or web spidering 9659 07:09:03,560 --> 07:09:05,240 is the act of retrieving a web page, 9660 07:09:05,240 --> 07:09:06,880 extracting the links from that web page, 9661 07:09:06,880 --> 07:09:10,000 making a queue of unretrieved links, and then moving on. 9662 07:09:10,000 --> 07:09:12,360 And eventually the idea is if you had enough time, 9663 07:09:12,360 --> 07:09:14,200 energy, bandwidth, and storage, 9664 07:09:14,200 --> 07:09:16,960 you could find your way to most of the web pages 9665 07:09:16,960 --> 07:09:20,120 on the internet that are pointing to 9666 07:09:20,120 --> 07:09:22,340 or are pointed to by other web pages. 9667 07:09:23,320 --> 07:09:26,400 And so you might have all kinds of reasons to scrape data. 9668 07:09:26,400 --> 07:09:28,840 You might have a blog that you posted. 9669 07:09:28,840 --> 07:09:32,280 You might have, who knows, maybe you put some data 9670 07:09:32,280 --> 07:09:35,840 in a system, maybe the system's being shut down 9671 07:09:35,840 --> 07:09:38,100 because it's being retired. 9672 07:09:38,100 --> 07:09:39,660 You can do all kinds of things. 9673 07:09:39,660 --> 07:09:41,040 You could write a little thing, 9674 07:09:41,040 --> 07:09:43,000 I was just talking to somebody who wrote a thing 9675 07:09:43,000 --> 07:09:44,720 to retrieve something and check, 9676 07:09:44,720 --> 07:09:46,920 and then send a text when something changed. 9677 07:09:46,920 --> 07:09:47,960 All kinds of stuff.
9678 07:09:47,960 --> 07:09:50,480 Or you might make yourself a search engine. 9679 07:09:50,480 --> 07:09:52,260 But be careful. 9680 07:09:52,260 --> 07:09:55,520 Not all websites are happy about you 9681 07:09:55,520 --> 07:09:58,120 using a robot to retrieve their content. 9682 07:09:58,120 --> 07:09:59,600 Some of the websites, as we'll see, 9683 07:09:59,600 --> 07:10:01,900 demand that you log in and they track what you do, 9684 07:10:01,900 --> 07:10:03,600 and if they think you're doing something bad, 9685 07:10:03,600 --> 07:10:05,340 they will shut your account off. 9686 07:10:05,340 --> 07:10:07,800 Other websites will track what you're doing 9687 07:10:07,800 --> 07:10:11,400 without you logging in, but then shut your address off. 9688 07:10:11,400 --> 07:10:13,240 And so you have to be careful. 9689 07:10:13,240 --> 07:10:14,080 You should read up. 9690 07:10:14,080 --> 07:10:17,240 You should figure out what sites allow you to scrape them. 9691 07:10:17,240 --> 07:10:18,920 Now I have some sites that I've set up 9692 07:10:18,920 --> 07:10:23,720 that you can play with to make it so that it's legit. 9693 07:10:23,720 --> 07:10:27,360 So parsing HTML is difficult. 9694 07:10:27,360 --> 07:10:29,640 Some of the simple examples, 9695 07:10:29,640 --> 07:10:31,560 you could probably write a regular expression, 9696 07:10:31,560 --> 07:10:35,000 or certainly some splitting and some whatever. 9697 07:10:35,000 --> 07:10:37,920 And what you would find is you would write that code. 9698 07:10:37,920 --> 07:10:40,160 And you would retrieve your first five webpages 9699 07:10:40,160 --> 07:10:41,000 and it would seem to work 9700 07:10:41,000 --> 07:10:43,240 and then it would encounter some really weird 9701 07:10:43,240 --> 07:10:45,560 but legitimate HTML, 9702 07:10:45,560 --> 07:10:47,820 or maybe even sort of slightly broken HTML. 
9703 07:10:47,820 --> 07:10:50,200 So the web is full of broken HTML, 9704 07:10:50,200 --> 07:10:52,000 and your browsers just look at it and go like, 9705 07:10:52,000 --> 07:10:53,920 oh wow, more broken HTML. 9706 07:10:53,920 --> 07:10:55,280 But they don't put up error messages, 9707 07:10:55,280 --> 07:10:57,440 and so people just leave broken pages up. 9708 07:10:58,360 --> 07:11:01,120 But your Python program is gonna see those broken pages. 9709 07:11:01,120 --> 07:11:02,040 So what you would do is you'd be like, 9710 07:11:02,040 --> 07:11:05,640 oh, here's a new weird way to do an anchor tag. 9711 07:11:05,640 --> 07:11:07,480 I'll change my code. 9712 07:11:07,480 --> 07:11:09,320 And then run for another 100 pages, 9713 07:11:09,320 --> 07:11:12,200 I'm like, oh no, here's a new weird way to do an anchor tag. 9714 07:11:12,200 --> 07:11:14,360 And the problem is is that you're gonna find 9715 07:11:14,360 --> 07:11:17,200 a lot of different ways to mess up an anchor tag. 9716 07:11:17,200 --> 07:11:18,840 And someone's already done that. 9717 07:11:18,840 --> 07:11:21,160 There's a software called BeautifulSoup. 9718 07:11:21,160 --> 07:11:24,040 And we have installation instructions on how to use it. 9719 07:11:24,040 --> 07:11:28,080 And really what it is is it's somebody just spent months 9720 07:11:28,080 --> 07:11:30,480 figuring out all the nasty things that could happen 9721 07:11:30,480 --> 07:11:34,840 and compensated for it and gave you a nice wrapped interface 9722 07:11:34,840 --> 07:11:36,760 that just says, look, you give me the HTML 9723 07:11:36,760 --> 07:11:39,080 and I'll give you back the tags, okay? 9724 07:11:39,080 --> 07:11:40,680 And so it's called BeautifulSoup. 9725 07:11:41,560 --> 07:11:43,280 And so you have to install this. 9726 07:11:43,280 --> 07:11:46,040 There's a couple of ways that you can install this. 
9727 07:11:46,040 --> 07:11:47,960 If you're good at extending your Python, 9728 07:11:47,960 --> 07:11:51,680 you can just extend and install BeautifulSoup 9729 07:11:51,680 --> 07:11:53,240 for all Python programs. 9730 07:11:53,240 --> 07:11:57,360 If you can't change your computer's configuration 9731 07:11:57,360 --> 07:12:00,040 because you're on a school computer 9732 07:12:00,040 --> 07:12:02,680 or you're using a USB stick or something, 9733 07:12:02,680 --> 07:12:05,720 then there's a way to download this file that I've created 9734 07:12:05,720 --> 07:12:07,080 called bs4.zip. 9735 07:12:07,080 --> 07:12:08,760 And so what you do is you end up with your file 9736 07:12:08,760 --> 07:12:13,000 called urllinks.py 9737 07:12:13,000 --> 07:12:15,440 and then a little folder called bs4, 9738 07:12:15,440 --> 07:12:17,920 which is a folder that has a bunch of files in it 9739 07:12:17,920 --> 07:12:21,080 from the zip file, and then you can run it. 9740 07:12:21,080 --> 07:12:24,960 And so it'll pull it in and you'll import from bs4, 9741 07:12:24,960 --> 07:12:27,000 BeautifulSoup, and that's either gonna pull it in 9742 07:12:27,000 --> 07:12:30,440 from the folder you do, or if you have installed it 9743 07:12:30,440 --> 07:12:34,280 using the Python installer, it will also just, 9744 07:12:34,280 --> 07:12:36,000 you don't have to put this file in. 9745 07:12:36,000 --> 07:12:36,840 So it's up to you. 9746 07:12:36,840 --> 07:12:38,760 You can either do it one or two ways. 9747 07:12:39,680 --> 07:12:41,600 So this is a little bit of code. 9748 07:12:41,600 --> 07:12:44,120 Now BeautifulSoup is a complex library, 9749 07:12:44,120 --> 07:12:46,720 and so just because this looks easy, 9750 07:12:46,720 --> 07:12:48,080 doing things in BeautifulSoup, 9751 07:12:48,080 --> 07:12:51,320 you might have to actually read a bit more to figure it out. 9752 07:12:51,320 --> 07:12:53,560 But we're going to just read this. 
9753 07:12:53,560 --> 07:12:54,400 We're going to 9754 07:12:57,800 --> 07:12:59,000 import BeautifulSoup. 9755 07:12:59,000 --> 07:13:01,440 We're gonna ask for a url right here. 9756 07:13:01,440 --> 07:13:03,000 We're going to take that url. 9757 07:13:03,000 --> 07:13:03,840 We're gonna open it. 9758 07:13:03,840 --> 07:13:07,400 The url open, they give the url and read the whole thing. 9759 07:13:07,400 --> 07:13:08,600 That means we're not writing a loop. 9760 07:13:08,600 --> 07:13:09,640 We've read the whole thing. 9761 07:13:09,640 --> 07:13:13,720 That's okay as long as you know that the file's not so large. 9762 07:13:13,720 --> 07:13:16,920 And then we're going to pass the data we got back. 9763 07:13:16,920 --> 07:13:19,080 And this is gonna be bytes, but BeautifulSoup knows 9764 07:13:19,080 --> 07:13:21,200 all about bytes and all about UTF-8, 9765 07:13:21,200 --> 07:13:22,400 and it figures that out. 9766 07:13:22,400 --> 07:13:25,520 And you just say, hey, take that stuff I just got 9767 07:13:25,520 --> 07:13:27,600 and tear it apart using HTML. 9768 07:13:27,600 --> 07:13:30,760 And give me back an object, a soup object. 9769 07:13:30,760 --> 07:13:32,520 Now the soup object is something 9770 07:13:32,520 --> 07:13:33,920 that you can run queries against. 9771 07:13:33,920 --> 07:13:34,800 So it parses it. 9772 07:13:34,800 --> 07:13:38,020 It deals with all the imperfections and inconsistencies 9773 07:13:38,020 --> 07:13:40,360 in this HTML byte array. 9774 07:13:42,560 --> 07:13:44,440 And it fixes that and gives that back. 9775 07:13:44,440 --> 07:13:45,680 And so there's various things you can do. 9776 07:13:45,680 --> 07:13:47,920 And you gotta go look at the BeautifulSoup documentation. 9777 07:13:47,920 --> 07:13:50,720 It could be a whole class on BeautifulSoup. 
9778 07:13:50,720 --> 07:13:52,880 So here's a thing you can do with this object: 9779 07:13:54,200 --> 07:13:56,440 you can sort of call it like a function 9780 07:13:56,440 --> 07:13:59,160 and say, hey, give me back the anchor tags. 9781 07:13:59,160 --> 07:14:00,640 And anchor tags, of course, are the tags 9782 07:14:00,640 --> 07:14:05,640 that say a, href equals blah blah blah, slash a. 9783 07:14:06,280 --> 07:14:08,720 So all of this is an anchor tag. 9784 07:14:08,720 --> 07:14:10,520 And then we're gonna loop through the tags 9785 07:14:10,520 --> 07:14:11,640 because there could be more than one 9786 07:14:11,640 --> 07:14:13,920 of those anchor tags in the file. 9787 07:14:13,920 --> 07:14:15,720 And then we're going to pull out that href. 9788 07:14:15,720 --> 07:14:16,680 And that's what this does. 9789 07:14:16,680 --> 07:14:19,240 We're gonna loop through all the tags and print out the href. 9790 07:14:19,240 --> 07:14:21,400 So if you tell it to go to drchuck.com, 9791 07:14:21,400 --> 07:14:25,980 it will tell you the one external link in drchuck.com. 9792 07:14:26,880 --> 07:14:29,740 And so I've got an assignment that sort of goes into that 9793 07:14:29,740 --> 07:14:31,560 in some more detail. 9794 07:14:31,560 --> 07:14:35,180 But this chapter has been a whole bunch 9795 07:14:35,180 --> 07:14:36,080 of interesting stuff. 9796 07:14:36,080 --> 07:14:40,120 We started with the TCP/IP model and talked about sockets 9797 07:14:40,120 --> 07:14:42,240 that are phone calls between computers. 9798 07:14:42,240 --> 07:14:46,200 And then how application protocols are developed 9799 07:14:46,200 --> 07:14:48,240 to say what we say on those phone calls. 9800 07:14:48,240 --> 07:14:51,280 And we've explored then the HTTP protocol, 9801 07:14:51,280 --> 07:14:54,600 which is probably the most likely thing you're going to see.
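The anchor-tag extraction described above (hand the HTML to BeautifulSoup, call the soup object to get the anchor tags, loop, pull out each href) might be sketched like this. It assumes the third-party `bs4` package is available, either installed or unzipped next to the script, and it uses Python's built-in `html.parser` as the parser. The helper name `extract_links` is hypothetical.

```python
from bs4 import BeautifulSoup


def extract_links(html):
    """Return the href value of every anchor tag found in the HTML."""
    # html may be bytes straight from urlopen(...).read(), or a str.
    soup = BeautifulSoup(html, "html.parser")
    # Calling the soup object like a function is shorthand for find_all.
    tags = soup("a")
    # tag.get returns None when an anchor has no href attribute.
    return [tag.get("href", None) for tag in tags]


sample = b'<p>Hi</p><a href="http://www.example.com/one">One</a> <A HREF=two.html>Two</a>'
for link in extract_links(sample):
    print(link)
```

The sample deliberately mixes upper-case tag names and an unquoted attribute value, the sort of inconsistency that a hand-written regular expression tends to trip over.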
9802 07:14:54,600 --> 07:14:56,720 And then we played with all this in Python 9803 07:14:56,720 --> 07:15:00,520 and saw that Python is really good at this. 9804 07:15:00,520 --> 07:15:03,460 You can write extremely simple and small programs 9805 07:15:03,460 --> 07:15:07,120 to do some extremely complex and powerful things. 9806 07:15:07,120 --> 07:15:09,640 And again, that's why people like Python 9807 07:15:09,640 --> 07:15:13,260 is because it makes the complex simple. 9808 07:15:18,940 --> 07:15:20,720 We're gonna do a little bit of sample code. 9809 07:15:20,720 --> 07:15:22,880 If you're interested in getting the sample code, 9810 07:15:22,880 --> 07:15:24,320 you can download this zip here 9811 07:15:24,320 --> 07:15:28,080 at Pythonforeverybody.com, materials.php. 9812 07:15:28,080 --> 07:15:31,220 And you will download and you will get all the files. 9813 07:15:32,320 --> 07:15:34,720 And all the files that I'm looking at here. 9814 07:15:34,720 --> 07:15:37,160 And so the one I'm gonna play with today 9815 07:15:37,160 --> 07:15:40,240 is the file called URL links.py. 9816 07:15:40,240 --> 07:15:44,520 So the first thing you gotta do before URL links.py works 9817 07:15:44,520 --> 07:15:47,200 is you have got to install beautiful soup. 9818 07:15:47,200 --> 07:15:48,720 And I've got some simple instructions 9819 07:15:48,720 --> 07:15:50,480 at the beginning of the file. 9820 07:15:50,480 --> 07:15:55,000 And so one way to do it is install it using Python 9821 07:15:55,000 --> 07:15:58,220 install process to install this beautiful soup 9822 07:15:58,220 --> 07:16:00,400 for all Python applications. 9823 07:16:00,400 --> 07:16:02,240 And if you are the owner of your computer 9824 07:16:02,240 --> 07:16:03,680 and you're gonna use beautiful soup a lot, 9825 07:16:03,680 --> 07:16:05,600 it's a fine idea to do that. 
9826 07:16:05,600 --> 07:16:08,460 But I wanna show you a simpler way 9827 07:16:08,460 --> 07:16:10,560 that if you don't own your own computer 9828 07:16:10,560 --> 07:16:13,780 and you just wanna make it so that BeautifulSoup works, 9829 07:16:14,720 --> 07:16:19,280 you can download this file, this file right here. 9830 07:16:19,280 --> 07:16:22,700 The BeautifulSoup 4 zip, unzip it 9831 07:16:22,700 --> 07:16:25,540 and put it in the same folder as here. 9832 07:16:25,540 --> 07:16:27,780 And so if you look in this folder, 9833 07:16:27,780 --> 07:16:30,260 I have a subfolder called bs4. 9834 07:16:30,260 --> 07:16:32,940 And that's the unzipped version of this. 9835 07:16:32,940 --> 07:16:33,760 And it has these things. 9836 07:16:33,760 --> 07:16:36,280 I didn't write this code, so I'm sorry if the name is bad, 9837 07:16:36,280 --> 07:16:38,580 but this is the code to bs4. 9838 07:16:38,580 --> 07:16:40,980 And this is what's in bs4.zip. 9839 07:16:40,980 --> 07:16:43,940 And it's in the same folder as 9840 07:16:45,860 --> 07:16:48,100 urllinks.py. 9841 07:16:48,100 --> 07:16:49,960 And so what happens is when you do this 9842 07:16:49,960 --> 07:16:52,340 from bs4 import BeautifulSoup, 9843 07:16:52,340 --> 07:16:55,180 that either can go to sort of this global magic place 9844 07:16:55,180 --> 07:16:56,820 where Python installs stuff 9845 07:16:56,820 --> 07:16:59,220 and pull in the BeautifulSoup object, 9846 07:16:59,220 --> 07:17:03,820 or it can go to the folder bs4 and pull it in, okay? 9847 07:17:03,820 --> 07:17:05,960 And so that's how that works. 9848 07:17:05,960 --> 07:17:08,480 So you have to do one of these two things. 9849 07:17:08,480 --> 07:17:10,380 I prefer to keep it simple, 9850 07:17:10,380 --> 07:17:11,620 download and unzip this file 9851 07:17:11,620 --> 07:17:15,220 and put it in the same folder as this code 9852 07:17:15,220 --> 07:17:16,160 and away you go.
9853 07:17:16,160 --> 07:17:18,560 So from the previous example, 9854 07:17:18,560 --> 07:17:20,600 we're gonna use urllib, of course, 9855 07:17:20,600 --> 07:17:22,540 and then we're going to pull in BeautifulSoup. 9856 07:17:22,540 --> 07:17:24,400 From the BeautifulSoup 4 library, 9857 07:17:24,400 --> 07:17:25,880 we're gonna get the BeautifulSoup object. 9858 07:17:25,880 --> 07:17:28,040 Now, if you do this with SSL, 9859 07:17:28,040 --> 07:17:30,600 if these websites we're gonna play with have SSL, 9860 07:17:30,600 --> 07:17:33,840 you pretty much have to do this little hack. 9861 07:17:33,840 --> 07:17:36,680 And these three lines, don't worry too much about it. 9862 07:17:36,680 --> 07:17:39,640 The whole idea, you can Google it or look on Stack Overflow 9863 07:17:39,640 --> 07:17:40,960 and figure this out. 9864 07:17:40,960 --> 07:17:42,880 But this is the way that you ignore errors 9865 07:17:42,880 --> 07:17:46,340 when you have SSL certificate errors. 9866 07:17:46,340 --> 07:17:50,240 And so we have to add this parameter context equals ctx, 9867 07:17:50,240 --> 07:17:52,000 which is this variable that we create. 9868 07:17:52,000 --> 07:17:56,040 So this part and this part, sort of just do them. 9869 07:17:56,040 --> 07:17:59,040 If you don't need them, you can take them out, actually. 9870 07:17:59,040 --> 07:18:01,120 Otherwise, you won't be able to do HTTPS sites. 9871 07:18:01,120 --> 07:18:03,240 So let's take a look at what we're doing 9872 07:18:03,240 --> 07:18:06,860 other than dealing with the HTTPS problem. 9873 07:18:08,760 --> 07:18:10,680 Gonna ask the user for a URL. 9874 07:18:10,680 --> 07:18:14,800 We are going to retrieve all the HTML. 9875 07:18:14,800 --> 07:18:17,960 We're gonna do a urlopen, just like we did before. 9876 07:18:17,960 --> 07:18:19,480 Now, this would return us something 9877 07:18:19,480 --> 07:18:21,800 we could loop through line by line with a for loop.
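The "three lines" for ignoring SSL certificate errors, plus the `context=ctx` parameter, are commonly written along these lines. This is a sketch of that common pattern, not necessarily the exact lines on screen; the wrapper names `make_unverified_context` and `open_ignoring_cert_errors` are made up for illustration. Skipping certificate checks removes the protection HTTPS is meant to give, so it only makes sense for experiments like this one.

```python
import ssl
import urllib.request


def make_unverified_context():
    """Build an SSL context that skips certificate checks (experiments only)."""
    ctx = ssl.create_default_context()
    ctx.check_hostname = False        # must be turned off before verify_mode
    ctx.verify_mode = ssl.CERT_NONE   # accept any certificate
    return ctx


def open_ignoring_cert_errors(url):
    """Open a URL while ignoring SSL certificate errors."""
    return urllib.request.urlopen(url, context=make_unverified_context())


# Example call (URL is an assumption):
# html = open_ignoring_cert_errors("https://www.example.com/").read()
```

For plain `http://` URLs the context is simply not needed, which is why those lines can be taken out when no HTTPS sites are involved.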
9878 07:18:21,800 --> 07:18:24,120 But instead, we're gonna say, hey, read the whole thing. 9879 07:18:24,120 --> 07:18:28,560 And that basically returns us the entire document 9880 07:18:28,560 --> 07:18:32,320 at that webpage in a single big string 9881 07:18:32,320 --> 07:18:34,380 with new lines at the end of each line. 9882 07:18:34,380 --> 07:18:38,760 And this is not an Unicode, but it's probably UTF-8 string. 9883 07:18:38,760 --> 07:18:41,920 But it turns out BeautifulSoup knows how to deal with UTF-8, 9884 07:18:41,920 --> 07:18:44,240 and it also knows how to deal with Unicode strings. 9885 07:18:44,240 --> 07:18:47,080 So what we're saying is BeautifulSoup read through 9886 07:18:47,080 --> 07:18:49,980 and deal with all the nasty bits, right? 9887 07:18:49,980 --> 07:18:54,240 So HTML is like very, very flexible. 9888 07:18:54,240 --> 07:18:59,240 So drchuck.com slash page one, HTML. 9889 07:19:02,760 --> 07:19:04,840 And so if we take a look at the source of this, 9890 07:19:04,840 --> 07:19:09,160 view page source, make this bigger, 9891 07:19:09,160 --> 07:19:10,960 you might be able to do regular expressions, 9892 07:19:10,960 --> 07:19:13,680 but it does things like break stuff across lines. 9893 07:19:13,680 --> 07:19:15,280 There could be a line break here. 9894 07:19:15,280 --> 07:19:17,240 There could be all kinds of things, right? 9895 07:19:17,240 --> 07:19:21,400 And so writing regular expressions or splits or whatever 9896 07:19:21,400 --> 07:19:23,720 is really hard for HTML. 9897 07:19:23,720 --> 07:19:26,280 And so what we do is someone has written this. 9898 07:19:26,280 --> 07:19:27,680 It's called BeautifulSoup. 9899 07:19:30,800 --> 07:19:34,720 And it's basically, this is the code, 9900 07:19:34,720 --> 07:19:38,100 and it's based on a joke from a children's story. 
9901 07:19:40,440 --> 07:19:42,240 It basically, someone has just went through 9902 07:19:42,240 --> 07:19:44,460 and figured all the bad things that could possibly happen 9903 07:19:44,460 --> 07:19:46,960 when you're reading and parsing HTML. 9904 07:19:46,960 --> 07:19:49,200 So either you use it or you will slowly but surely 9905 07:19:49,200 --> 07:19:52,700 derive all the things that it doesn't work. 9906 07:19:52,700 --> 07:19:56,240 And so when we look at this line right here, 9907 07:19:56,240 --> 07:19:58,760 this line at a high level is saying, 9908 07:19:58,760 --> 07:20:00,860 we're giving you ugly, nasty HTML 9909 07:20:00,860 --> 07:20:03,180 that could make no sense whatsoever. 9910 07:20:03,180 --> 07:20:06,400 Please read it and have all the brains that you have 9911 07:20:06,400 --> 07:20:08,820 and all the weird stuff figure that out for us 9912 07:20:08,820 --> 07:20:10,860 and give us back an object. 9913 07:20:10,860 --> 07:20:11,800 I happen to call it soup. 9914 07:20:11,800 --> 07:20:13,240 You don't have to call it soup. 9915 07:20:13,240 --> 07:20:15,920 An object, and that is a proxy for that HTML, 9916 07:20:15,920 --> 07:20:18,560 but this soup object is clean. 9917 07:20:18,560 --> 07:20:20,840 And so what we can do is we can sort of retrieve 9918 07:20:20,840 --> 07:20:22,200 all the anchor tags. 9919 07:20:22,200 --> 07:20:25,200 So we can talk to this object and say, ask it, 9920 07:20:25,200 --> 07:20:26,780 give me the anchor tags. 9921 07:20:26,780 --> 07:20:28,060 What's an anchor tag? 9922 07:20:28,060 --> 07:20:29,640 Well, if we take a look at this source, 9923 07:20:29,640 --> 07:20:32,560 the anchor tag is the A through the slash A. 9924 07:20:32,560 --> 07:20:33,560 That is the tag. 9925 07:20:33,560 --> 07:20:36,880 It is the tag, it is attributes that are on the tag, 9926 07:20:36,880 --> 07:20:39,460 it is the text within the tag, and everything. 9927 07:20:39,460 --> 07:20:40,920 And so that's what we're gonna get. 
9928 07:20:40,920 --> 07:20:43,280 Now, I call it tags plural, 9929 07:20:43,280 --> 07:20:45,360 not because plural matters at all, 9930 07:20:45,360 --> 07:20:47,000 but because we're gonna get a list of tags. 9931 07:20:47,000 --> 07:20:51,280 Because even though this web page has lots and lots of tags, 9932 07:20:51,280 --> 07:20:53,760 if we look at, say, dr-chuck.com, 9933 07:20:58,720 --> 07:21:01,980 and view source, whoa, that's kinda small. 9934 07:21:01,980 --> 07:21:05,480 View page source, right? 9935 07:21:05,480 --> 07:21:09,540 And we go look for anchor tags. 9936 07:21:09,540 --> 07:21:11,200 We got 45 of them, 9937 07:21:11,200 --> 07:21:13,720 and they all kinda have weird stuff in them, right? 9938 07:21:13,720 --> 07:21:17,440 So this line will give us back a list of tags. 9939 07:21:17,440 --> 07:21:20,240 It will give us all the anchor tags in this document. 9940 07:21:20,240 --> 07:21:22,880 So it goes, the tag goes from there to there. 9941 07:21:23,920 --> 07:21:25,280 And then what we're gonna do is we're gonna write a loop 9942 07:21:25,280 --> 07:21:26,360 to loop through all the tags. 9943 07:21:26,360 --> 07:21:28,160 So that's basically hopping, 9944 07:21:28,160 --> 07:21:29,800 like it's hopping through the document, 9945 07:21:29,800 --> 07:21:31,880 sort of like this, that's what it's doing. 9946 07:21:31,880 --> 07:21:35,560 Hop, hop, hop, hop, hop, hop. 9947 07:21:35,560 --> 07:21:38,960 And it's pulling out the text of the href attributes. 9948 07:21:38,960 --> 07:21:42,120 So it's gonna pull out this bit right here. 9949 07:21:42,120 --> 07:21:44,800 Oh, whoops, oh darn, that was so cool. 9950 07:21:44,800 --> 07:21:46,680 Cause that's a flaw, look at that. 9951 07:21:46,680 --> 07:21:48,040 This is my own page. 9952 07:21:48,040 --> 07:21:49,920 There is no closing quote here, 9953 07:21:49,920 --> 07:21:52,520 but it's gonna work because BeautifulSoup is like, 9954 07:21:52,520 --> 07:21:54,100 oh, I know what to do about that. 
9955 07:21:54,100 --> 07:21:55,420 I can deal with that. 9956 07:21:55,420 --> 07:21:56,800 So let's check to see if that one works, 9957 07:21:56,800 --> 07:21:58,220 cause that's like a mistake. 9958 07:21:58,220 --> 07:22:00,600 But that's one of the things we like about BeautifulSoup. 9959 07:22:00,600 --> 07:22:01,520 So we're gonna read through, 9960 07:22:01,520 --> 07:22:03,240 and then we're gonna pull out all the hrefs. 9961 07:22:03,240 --> 07:22:08,240 So, this is probably thousands of lines of code 9962 07:22:08,520 --> 07:22:10,400 that you really don't want to write. 9963 07:22:10,400 --> 07:22:15,400 So python3 urllinks.py. 9964 07:22:15,760 --> 07:22:17,280 And so let's start with a simple one. 9965 07:22:17,280 --> 07:22:22,280 HTTP colon slash slash www.dr-chuck.com. 9966 07:22:24,840 --> 07:22:26,040 And it reads it. 9967 07:22:26,040 --> 07:22:27,680 Oh, that's the, no, that's, 9968 07:22:28,680 --> 07:22:30,040 that's actually the card one, 9969 07:22:30,040 --> 07:22:30,880 cause we got a whole bunch. 9970 07:22:30,880 --> 07:22:33,520 So let's see if Tsugi, see if the Tsugi one worked. 9971 07:22:33,520 --> 07:22:34,820 It found that one. 9972 07:22:36,360 --> 07:22:38,200 It's right after sakaiproject.org. 9973 07:22:38,200 --> 07:22:39,040 Where is that? 9974 07:22:39,040 --> 07:22:39,880 Is there another Tsugi? 9975 07:22:41,560 --> 07:22:42,840 Oh, no, it didn't find that one. 9976 07:22:42,840 --> 07:22:43,680 That's kind of funky. 9977 07:22:43,680 --> 07:22:46,100 Look, it found it wrong, but that's okay. 9978 07:22:46,100 --> 07:22:47,980 So you see it found all these 9979 07:22:47,980 --> 07:22:50,440 and did a lot of nice stuff for us. 9980 07:22:50,440 --> 07:22:53,440 If we do it, python3 urllinks.py 9981 07:22:53,440 --> 07:22:54,280 and do the easy one. 9982 07:22:54,280 --> 07:22:57,040 HTTP colon slash slash www. 9983 07:22:57,040 --> 07:23:02,040 dr-chuck.com slash page1.htm. 
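The walkthrough above (read the page, ask for the anchor tags, loop, pull out each href) can be sketched without any network access. This is not urllinks.py and it is not BeautifulSoup: it is a stand-in built on Python's standard html.parser module, and the AnchorCollector class name and the sample HTML are assumptions made up for illustration. It only shows the idea of hopping from anchor tag to anchor tag, even when a tag is broken across lines.

```python
from html.parser import HTMLParser


class AnchorCollector(HTMLParser):
    """Collect the href attribute of every <a> start tag that is fed in."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        # attrs arrives as a list of (name, value) pairs
        if tag == "a":
            href = dict(attrs).get("href")
            if href is not None:
                self.hrefs.append(href)


# Sample HTML (an assumption): the anchor tag is split across several lines.
html = '''<h1>The First Page</h1>
<p>If you like, you can switch to the
<a
   href="http://www.dr-chuck.com/page2.htm">
Second Page</a>.</p>'''

parser = AnchorCollector()
parser.feed(html)
for href in parser.hrefs:
    print(href)
```

BeautifulSoup goes much further in repairing bad HTML, which is the reason the lecture reaches for it instead of writing something like this by hand.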
9984 07:23:03,400 --> 07:23:04,880 We will only see one. 9985 07:23:06,440 --> 07:23:07,480 And there we go. 9986 07:23:07,480 --> 07:23:12,480 Now, the SSL is if you are looking at a page 9987 07:23:12,480 --> 07:23:15,820 that has SSL, python, 9988 07:23:15,820 --> 07:23:19,480 and then you can see that there's a lot of code 9989 07:23:19,480 --> 07:23:24,480 at a page that has SSL, python3 urllinks.py too. 9990 07:23:24,680 --> 07:23:28,680 So I'll go to like https colon slash slash www.si.umich.edu 9991 07:23:34,640 --> 07:23:35,960 and that will get a bunch of links. 9992 07:23:35,960 --> 07:23:37,640 And so you'll see. 9993 07:23:37,640 --> 07:23:41,400 If it wasn't for that, so all kinds of stuff coming back. 9994 07:23:41,400 --> 07:23:43,520 And if it wasn't for this bit right here 9995 07:23:43,520 --> 07:23:46,600 and this bit right here, this HTTPS wouldn't have worked. 9996 07:23:46,600 --> 07:23:49,400 And it's not that that website had a bad URL. 9997 07:23:49,400 --> 07:23:54,280 It has a certificate that's not in Python's official list. 9998 07:23:55,120 --> 07:23:56,860 And so the URL is okay. 9999 07:23:57,720 --> 07:24:02,160 So that gives you a quick summary 10000 07:24:02,160 --> 07:24:06,520 of using the BeautifulSoup library in Python 10001 07:24:06,520 --> 07:24:08,480 along with urllib. 10002 07:24:12,520 --> 07:24:15,400 Hello and welcome to chapter 13, web services. 10003 07:24:15,400 --> 07:24:17,320 So what we've been doing so far 10004 07:24:17,320 --> 07:24:19,440 is we've been using the request response cycle. 10005 07:24:19,440 --> 07:24:20,760 We've learned about sockets. 10006 07:24:20,760 --> 07:24:22,560 We've learned about urllib. 10007 07:24:22,560 --> 07:24:24,520 And we've actually learned how to pull HTML 10008 07:24:24,520 --> 07:24:27,160 and even flat text off the internet. 
10009 07:24:27,160 --> 07:24:28,900 But what we're gonna talk about now 10010 07:24:28,900 --> 07:24:31,000 is using that same request response cycle 10011 07:24:31,000 --> 07:24:35,240 to retrieve information that is specifically designed 10012 07:24:35,240 --> 07:24:36,800 for programmatic consumption. 10013 07:24:36,800 --> 07:24:39,600 So that we had to have this beautiful soup 10014 07:24:39,600 --> 07:24:42,240 which sort of did a hack job 10015 07:24:42,240 --> 07:24:45,200 or solved the hard problem of parsing HTML. 10016 07:24:45,200 --> 07:24:47,980 Well, why not produce data in a format 10017 07:24:47,980 --> 07:24:49,680 that makes good sense to a program 10018 07:24:49,680 --> 07:24:51,740 because programs wanna talk to each other. 10019 07:24:51,740 --> 07:24:54,240 If you recall, the whole idea of a socket 10020 07:24:54,240 --> 07:24:57,320 is to have one application process sending data 10021 07:24:57,320 --> 07:24:59,440 to another application process. 10022 07:24:59,440 --> 07:25:02,600 And so if we think about this for a moment 10023 07:25:02,600 --> 07:25:05,600 and we realize that we have all these programs, 10024 07:25:05,600 --> 07:25:07,580 they could be written in different programming languages 10025 07:25:07,580 --> 07:25:08,720 and they're all connected. 10026 07:25:08,720 --> 07:25:11,360 And so they might wanna send data back and forth 10027 07:25:11,360 --> 07:25:13,200 or through the network. 10028 07:25:13,200 --> 07:25:16,520 PHP programs, JavaScript programs, Java programs. 10029 07:25:16,520 --> 07:25:20,200 And so we have to decide on a protocol 10030 07:25:20,200 --> 07:25:22,320 that is independent of any programming language. 
10031 07:25:22,320 --> 07:25:24,320 And then we call that the wire protocol 10032 07:25:24,320 --> 07:25:26,420 because if you were to sort of take some connection 10033 07:25:26,420 --> 07:25:30,060 and watch the exact characters that go back and forth, 10034 07:25:30,060 --> 07:25:32,880 that's what you would see if you were monitoring the wire. 10035 07:25:32,880 --> 07:25:35,540 So that's why we call that the wire protocol. 10036 07:25:35,540 --> 07:25:40,440 And so the idea is that we have to agree on a format 10037 07:25:40,440 --> 07:25:41,800 that is going to represent the data 10038 07:25:41,800 --> 07:25:44,080 and we can't make it a Python specific format 10039 07:25:44,080 --> 07:25:45,640 or a Java format. 10040 07:25:45,640 --> 07:25:49,580 And when we take the data from the internal representation, 10041 07:25:49,580 --> 07:25:52,120 maybe a Python dictionary, to send it to the wire, 10042 07:25:52,120 --> 07:25:53,960 we call that act serialization. 10043 07:25:53,960 --> 07:25:56,600 And that is going from sort of the internal representation 10044 07:25:56,600 --> 07:25:59,600 to the serial representation or the wire representation. 10045 07:25:59,600 --> 07:26:02,520 And then here is an example of a person 10046 07:26:02,520 --> 07:26:03,960 with a name and phone number 10047 07:26:03,960 --> 07:26:05,760 using less thans and greater thans. 10048 07:26:05,760 --> 07:26:07,360 This is an XML example. 10049 07:26:07,360 --> 07:26:08,480 And then at the far end, 10050 07:26:08,480 --> 07:26:10,480 a different programming language receives this 10051 07:26:10,480 --> 07:26:12,920 and then deserializes it and then turns it 10052 07:26:12,920 --> 07:26:16,560 into some useful structure inside that programming language. 10053 07:26:16,560 --> 07:26:19,800 And so this is an example of a wire protocol 10054 07:26:19,800 --> 07:26:21,160 that's using XML. 
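The serialize, send, deserialize round trip described above can be sketched with Python's standard library. This is only an illustration under assumptions: the person dictionary and its values are made up, and nothing is actually sent over a network. The strings xml_wire and json_wire stand in for what you would see on the wire.

```python
import json
import xml.etree.ElementTree as ET

# Internal representation: an ordinary Python dictionary (made-up values).
person = {"name": "Chuck", "phone": "303 4456"}

# Serialize to XML: one <person> element with one child tag per key.
root = ET.Element("person")
for key, value in person.items():
    ET.SubElement(root, key).text = value
xml_wire = ET.tostring(root, encoding="unicode")

# Serialize the same data to JSON.
json_wire = json.dumps(person)

# Deserialize at the "far end" back into dictionaries.
from_xml = {child.tag: child.text for child in ET.fromstring(xml_wire)}
from_json = json.loads(json_wire)

print(xml_wire)
print(json_wire)
print(from_xml == person, from_json == person)
```

Either wire format could just as well be read by a program written in another language, which is the point of agreeing on the format rather than on the language.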
10055 07:26:21,160 --> 07:26:23,840 And that's one of the formats we're going to talk about. 10056 07:26:23,840 --> 07:26:25,640 Another format that we're gonna talk about 10057 07:26:25,640 --> 07:26:29,500 is a format called JSON, JavaScript Object Notation. 10058 07:26:29,500 --> 07:26:31,960 And it is simpler and easier, 10059 07:26:31,960 --> 07:26:36,320 but it's not as precise and descriptive as XML is. 10060 07:26:36,320 --> 07:26:38,880 And so in most of the things you run 10061 07:26:38,880 --> 07:26:40,760 into, especially if you're talking to APIs 10062 07:26:40,760 --> 07:26:42,120 of one form or another, 10063 07:26:42,120 --> 07:26:44,840 you'll find that JSON is very common. 10064 07:26:44,840 --> 07:26:47,880 XML still holds sway in places like documents. 10065 07:26:47,880 --> 07:26:49,960 So if you look at docx 10066 07:26:49,960 --> 07:26:52,360 at the end of a Microsoft Word document, 10067 07:26:52,360 --> 07:26:54,840 docx means that it's an XML version 10068 07:26:54,840 --> 07:26:58,520 of the representation of a word processing document. 10069 07:26:58,520 --> 07:27:01,060 So the first thing we'll talk about is XML. 10070 07:27:04,780 --> 07:27:07,920 So one of the two ways that we mark up data is XML. 10071 07:27:07,920 --> 07:27:10,160 The other is JSON. First we'll talk about XML. 10072 07:27:10,160 --> 07:27:12,840 We'll talk about XML for a longer time 10073 07:27:12,840 --> 07:27:14,480 than we talk about JSON. 10074 07:27:14,480 --> 07:27:17,640 XML stands for Extensible Markup Language. 10075 07:27:18,760 --> 07:27:21,480 There were a number of markup languages in the 90s 10076 07:27:21,480 --> 07:27:25,040 that were out there, ways to send data between computers. 
10077 07:27:25,040 --> 07:27:28,720 And none of them was like amazingly better than the other, 10078 07:27:28,720 --> 07:27:33,040 but in the early 1990s, as HTML came out, 10079 07:27:33,040 --> 07:27:36,000 the idea that we could use less thans and greater thans, 10080 07:27:36,000 --> 07:27:38,860 you know, or angle brackets, some people call them. 10081 07:27:41,160 --> 07:27:43,040 Once HTML made angle brackets popular 10082 07:27:43,040 --> 07:27:45,200 as a representation format, 10083 07:27:45,200 --> 07:27:46,920 it was pretty natural that we would find 10084 07:27:46,920 --> 07:27:48,560 a data representation format 10085 07:27:48,560 --> 07:27:50,580 that would take a similar approach. 10086 07:27:50,580 --> 07:27:54,520 And so inside XML, we're gonna talk about tags, 10087 07:27:54,520 --> 07:27:55,600 we're gonna talk about attributes, 10088 07:27:55,600 --> 07:27:56,520 we're gonna talk about data, 10089 07:27:56,520 --> 07:27:58,800 and we've already talked about serialization 10090 07:27:58,800 --> 07:28:00,200 and deserialization. 10091 07:28:00,200 --> 07:28:01,920 Serialization is the act of taking data 10092 07:28:01,920 --> 07:28:04,540 inside of a computer in one programming language, 10093 07:28:04,540 --> 07:28:07,040 setting it up for transport, transporting it across, 10094 07:28:07,040 --> 07:28:09,120 and then deserialization is taking it back apart 10095 07:28:09,120 --> 07:28:11,440 and turning it back into 10096 07:28:11,440 --> 07:28:13,880 whatever internal data it needs to be 10097 07:28:13,880 --> 07:28:16,020 in the destination system. 10098 07:28:16,020 --> 07:28:17,600 So here's some basic XML, 10099 07:28:17,600 --> 07:28:19,120 so we can take a look at the various things 10100 07:28:19,120 --> 07:28:20,240 that make up the XML. 10101 07:28:20,240 --> 07:28:23,860 So it's very much like HTML in that we have tags, 10102 07:28:23,860 --> 07:28:24,700 less than, greater than. 
10103 07:28:24,700 --> 07:28:26,280 The difference is we get to name the tags 10104 07:28:26,280 --> 07:28:28,920 anything we want rather than the A tag 10105 07:28:28,920 --> 07:28:30,880 or the P tag or the H1 tag. 10106 07:28:30,880 --> 07:28:32,820 And there is a beginning tag and an ending tag, 10107 07:28:32,820 --> 07:28:33,820 and they're bracketed together. 10108 07:28:33,820 --> 07:28:36,120 And there's syntax errors in XML. 10109 07:28:36,120 --> 07:28:38,360 Syntax errors in XML are more severe 10110 07:28:38,360 --> 07:28:40,800 than syntax errors in HTML. 10111 07:28:40,800 --> 07:28:41,960 It's supposed to be right. 10112 07:28:41,960 --> 07:28:44,280 And if you send that XML, 10113 07:28:44,280 --> 07:28:46,980 it's likely that the far end will not understand it. 10114 07:28:47,880 --> 07:28:50,040 So we have a beginning tag and ending tag, 10115 07:28:50,040 --> 07:28:51,580 and so like name and slash name 10116 07:28:51,580 --> 07:28:53,120 are a beginning and ending pair. 10117 07:28:53,120 --> 07:28:55,440 Then there is the actual textual content, 10118 07:28:55,440 --> 07:28:58,040 and that is the material between it. 10119 07:28:58,040 --> 07:28:59,520 And then here's a phone and slash phone, 10120 07:28:59,520 --> 07:29:01,440 and we have this thing called the attribute. 10121 07:29:01,440 --> 07:29:03,320 Key equals value. 10122 07:29:03,320 --> 07:29:04,640 The key doesn't have double quotes. 10123 07:29:04,640 --> 07:29:06,140 The value always has double quotes. 10124 07:29:06,140 --> 07:29:10,980 And this is like href equals on an anchor tag. 10125 07:29:10,980 --> 07:29:13,540 And sometimes you have what's called a self-closing tag 10126 07:29:13,540 --> 07:29:15,360 where you don't actually have a closing tag. 
10127 07:29:15,360 --> 07:29:18,640 You have all the data that you need in the attributes, 10128 07:29:18,640 --> 07:29:20,200 and so you don't even bother putting 10129 07:29:20,200 --> 07:29:22,720 an empty text area and a closing tag. 10130 07:29:22,720 --> 07:29:26,120 So that is a start tag, an end tag, attribute, 10131 07:29:26,120 --> 07:29:27,520 and then a self-closing tag. 10132 07:29:27,520 --> 07:29:31,080 Those are some basics of XML. 10133 07:29:31,080 --> 07:29:35,240 In general, XML doesn't care too much about whitespace. 10134 07:29:35,240 --> 07:29:39,000 It does in the text areas, so in here it matters, 10135 07:29:39,000 --> 07:29:40,680 and in here it matters, but things like 10136 07:29:40,680 --> 07:29:42,360 we can indent this a little bit differently, 10137 07:29:42,360 --> 07:29:44,200 and we tend to indent it in a way 10138 07:29:44,200 --> 07:29:45,760 to make it look reasonable. 10139 07:29:45,760 --> 07:29:47,840 Although once you have programs sending it back and forth, 10140 07:29:47,840 --> 07:29:50,600 they tend to send it more compacted 10141 07:29:50,600 --> 07:29:52,440 just for efficiency purposes. 10142 07:29:53,920 --> 07:29:57,640 So one of the concepts is that there is 10143 07:29:57,640 --> 07:30:00,900 a hierarchical structure within an XML document, 10144 07:30:00,900 --> 07:30:03,200 and there are parent nodes and child nodes, 10145 07:30:03,200 --> 07:30:05,720 and you can think of these as simple elements 10146 07:30:05,720 --> 07:30:10,000 that are a tag and some data, or a complex element 10147 07:30:10,000 --> 07:30:14,240 that has a tag that includes other tags, some child tags. 10148 07:30:14,240 --> 07:30:15,520 And there's a couple of different ways 10149 07:30:15,520 --> 07:30:17,120 we can take a look at this. 10150 07:30:18,240 --> 07:30:20,860 The simple and more natural way to think about this 10151 07:30:20,860 --> 07:30:23,640 is a tree with parent-child relationships. 
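The whitespace point above can be checked with a small sketch. The two person documents and the padded name below are made-up examples, not the slide itself. Indentation between tags does not change what the parser hands back for the tags, while spaces inside a text area are kept, and a self-closing tag simply has no text.

```python
import xml.etree.ElementTree as ET

# The same data, once compacted and once indented for humans.
compact = '<person><name>Chuck</name><email hide="yes"/></person>'
pretty = '''<person>
    <name>Chuck</name>
    <email hide="yes" />
</person>'''

a = ET.fromstring(compact)
b = ET.fromstring(pretty)

# Indentation between tags does not affect the name text or the attributes.
same_name = a.find("name").text == b.find("name").text
same_attrs = a.find("email").attrib == b.find("email").attrib

# A self-closing tag carries its data in attributes and has no text.
email_text = b.find("email").text

# Whitespace inside a text area is preserved exactly.
padded = ET.fromstring("<name>  Chuck  </name>")

print(same_name, same_attrs, email_text, repr(padded.text))
```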
10152 07:30:23,640 --> 07:30:25,920 So here we have this A tag on the outside, 10153 07:30:25,920 --> 07:30:27,720 and that's the top level one. 10154 07:30:27,720 --> 07:30:29,820 You can only have one outer tag, 10155 07:30:29,820 --> 07:30:32,880 and you can only, you can't have another tag down here, 10156 07:30:32,880 --> 07:30:35,520 so you have to have one tag that's sort of the root tag 10157 07:30:35,520 --> 07:30:37,840 for everything in this XML document, 10158 07:30:37,840 --> 07:30:41,400 and it has two children, so the C tag and the B tag 10159 07:30:41,400 --> 07:30:45,000 are two children, so the B tag is a child of A, 10160 07:30:45,000 --> 07:30:48,840 and then C has a D and an E tag that are children there, 10161 07:30:48,840 --> 07:30:53,840 and then the textual data we model as a child 10162 07:30:54,220 --> 07:30:56,880 of each of those tags, and you'll see in a bit 10163 07:30:56,880 --> 07:30:58,740 why it's best to do that. 10164 07:30:58,740 --> 07:31:02,520 So that is the way to think about this as a tree, 10165 07:31:02,520 --> 07:31:05,720 to represent that XML as a tree. 10166 07:31:05,720 --> 07:31:08,200 If we add attributes to it, and this is where you kind of 10167 07:31:08,200 --> 07:31:10,800 see why it's nice to take the text area 10168 07:31:10,800 --> 07:31:12,580 and make that be a child of the node, 10169 07:31:12,580 --> 07:31:14,360 an attribute is different. 10170 07:31:14,360 --> 07:31:17,120 So the text is a special kind of child, 10171 07:31:17,120 --> 07:31:18,960 and you can literally have more than one attribute. 10172 07:31:18,960 --> 07:31:23,600 You could have X equals two, you know, zap equals whatever, 10173 07:31:23,600 --> 07:31:26,480 and these could have a couple of different attributes. 10174 07:31:26,480 --> 07:31:29,200 The W attribute has a value of five, 10175 07:31:29,200 --> 07:31:30,720 and that's the five down there, 10176 07:31:30,720 --> 07:31:32,120 and so you could have multiple ones. 
10177 07:31:32,120 --> 07:31:34,200 You can only have one text node. 10178 07:31:34,200 --> 07:31:37,000 Now, in the case of A, you have a whole bunch of text nodes, 10179 07:31:37,000 --> 07:31:39,720 but these are because there are child nodes. 10180 07:31:39,720 --> 07:31:43,800 Within one simple node, you can only have one text element. 10181 07:31:43,800 --> 07:31:46,140 You can also think of XML as paths, 10182 07:31:46,140 --> 07:31:48,560 and the easiest way is to sort of look down 10183 07:31:48,560 --> 07:31:52,000 this tree version and look at the path from the parent. 10184 07:31:52,000 --> 07:31:54,840 So you go to A, then the child B, and then X. 10185 07:31:54,840 --> 07:31:58,040 So at position AB, you find X. 10186 07:31:58,040 --> 07:32:01,040 So AB is the path from the root to X, 10187 07:32:01,040 --> 07:32:04,780 so ACD, that's this one, is the path to Y, 10188 07:32:05,760 --> 07:32:08,660 and ACE is the path to Z, 10189 07:32:08,660 --> 07:32:10,840 and so you can think of these as paths. 10190 07:32:10,840 --> 07:32:12,840 Part of what we're doing is we're coming up with ways 10191 07:32:12,840 --> 07:32:16,840 to walk through and parse trees of XML data. 10192 07:32:18,000 --> 07:32:20,880 So the next thing we'll talk about is how we determine 10193 07:32:20,880 --> 07:32:24,620 if a particular XML document is legal 10194 07:32:24,620 --> 07:32:28,780 or meets the contracts that two applications have set up. 10195 07:32:34,160 --> 07:32:36,060 We're going to do a little bit of code. 10196 07:32:36,060 --> 07:32:38,120 If you want to get your hands on the code, 10197 07:32:38,120 --> 07:32:41,620 go to the materials website, materials.php, 10198 07:32:43,660 --> 07:32:48,460 actually materials.php, and download the sample code. 10199 07:32:48,460 --> 07:32:51,880 The code that we're going to work on today is the XML code, 10200 07:32:51,880 --> 07:32:56,720 and we need to be able to talk XML to work with web services. 
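The tree view and the path view described above can be tried out directly. The sketch below assumes a small document shaped like the one on the slide (an a tag with children b and c, where c has children d and e, and b carries a w attribute of 5); the exact slide text is not in the transcript, so treat the string as a reconstruction. Paths passed to find are relative to the root tag, so the path written AB in the lecture is just "b" here.

```python
import xml.etree.ElementTree as ET

# Reconstructed sample: a is the single root tag, b and c are its children.
doc = '<a><b w="5">X</b><c><d>Y</d><e>Z</e></c></a>'
root = ET.fromstring(doc)

# Tree view: iterate over the direct children of the root.
children = [child.tag for child in root]

# Path view: follow a path of tag names down from the root to the text.
paths = {path: root.find(path).text for path in ("b", "c/d", "c/e")}

# Attributes hang off the tag itself, separately from its text.
w_value = root.find("b").get("w")

print(root.tag, children)
print(paths)
print(w_value)
```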
10201 07:32:56,720 --> 07:33:01,560 So here's one of the examples from the book, it's xml1.py. 10202 07:33:01,560 --> 07:33:05,480 And so later we'll be pulling XML and JSON from the web, 10203 07:33:05,480 --> 07:33:06,760 but for now we're just going to put it 10204 07:33:06,760 --> 07:33:10,420 in a triple-quoted string, so data, 10205 07:33:10,420 --> 07:33:13,440 and we're going to use a built-in XML parser 10206 07:33:13,440 --> 07:33:15,840 in Python called ElementTree, 10207 07:33:15,840 --> 07:33:19,320 and when we say import xml.etree.ElementTree, 10208 07:33:19,320 --> 07:33:23,480 this as ET gives us basically a shortcut handle for it. 10209 07:33:25,620 --> 07:33:27,560 And so the idea, this is a string, 10210 07:33:27,560 --> 07:33:28,880 it has less thans and greater thans, 10211 07:33:28,880 --> 07:33:31,400 it looks like structured information, and it is, 10212 07:33:31,400 --> 07:33:33,760 but really at this point it's only a string. 10213 07:33:33,760 --> 07:33:35,800 Now we have to call this ET.fromstring 10214 07:33:35,800 --> 07:33:38,960 to read this and give us back a tree object. 10215 07:33:38,960 --> 07:33:40,660 And what it does is this might blow up, 10216 07:33:40,660 --> 07:33:42,520 this code might blow up right here 10217 07:33:42,520 --> 07:33:45,620 if there was a mistake in it. 10218 07:33:45,620 --> 07:33:47,540 Matter of fact, I can probably put a mistake in, 10219 07:33:47,540 --> 07:33:50,200 let's see if I can delete this and save it 10220 07:33:50,200 --> 07:33:54,360 and run this code, and we'll see that it will blow up. 10221 07:33:59,800 --> 07:34:03,040 Right, and so it blew up, here in line eight, 10222 07:34:03,040 --> 07:34:06,040 ElementTree blew up, I mean it blew up 10223 07:34:07,920 --> 07:34:10,720 in line 12 of the code, which is right here. 10224 07:34:10,720 --> 07:34:15,160 This failed because line eight of the XML string 10225 07:34:15,160 --> 07:34:17,840 was wrong, so let's put that back in. 
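The blow-up shown above can be reproduced in a few lines. This is a sketch with a made-up broken string (the closing name tag is missing), not the exact edit made on screen. ET.fromstring raises ET.ParseError when the XML is not well-formed, so a program that receives XML from somewhere else may want to catch it.

```python
import xml.etree.ElementTree as ET

# Deliberately broken XML: the </name> closing tag has been deleted.
broken = "<person><name>Chuck</person>"

try:
    ET.fromstring(broken)
    parsed = True
except ET.ParseError as err:
    parsed = False
    # The message reports what went wrong and where in the string.
    print("Not well-formed XML:", err)

print("Parsed:", parsed)
```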
10226 07:34:17,840 --> 07:34:20,120 So now it's properly formed XML. 10227 07:34:20,120 --> 07:34:22,600 So this tree we get back, I name it tree 10228 07:34:22,600 --> 07:34:24,480 just because I always name it tree, 10229 07:34:24,480 --> 07:34:26,200 but you could name it X. 10230 07:34:26,200 --> 07:34:30,480 So the key is tree.find goes and looks for a tag by name, 10231 07:34:30,480 --> 07:34:33,900 and tree has no longer got less thans and greater thans in it, 10232 07:34:33,900 --> 07:34:36,940 it went and turned these into objects 10233 07:34:36,940 --> 07:34:39,800 within objects within objects. 10234 07:34:39,800 --> 07:34:44,560 So tree.find('name') says I would like to find the name tag, 10235 07:34:44,560 --> 07:34:46,880 and that's what this bit is right here, 10236 07:34:46,880 --> 07:34:49,960 and then .text is going within that 10237 07:34:49,960 --> 07:34:52,400 and grabbing that text, okay? 10238 07:34:52,400 --> 07:34:55,320 And if we say tree.find('email'), 10239 07:34:55,320 --> 07:34:57,780 then that's going to give us this, 10240 07:34:57,780 --> 07:35:01,200 and then that's that object, and then .get 10241 07:35:01,200 --> 07:35:04,400 asks for the contents of the hide attribute, 10242 07:35:04,400 --> 07:35:07,200 which is the string yes, okay? 10243 07:35:07,200 --> 07:35:10,240 And so if we run this, now that it's fixed, 10244 07:35:10,240 --> 07:35:13,080 python3 xml1.py, it will pull in 10245 07:35:13,080 --> 07:35:16,820 and get the name and the attribute. 10246 07:35:16,820 --> 07:35:19,880 So it pulled the Chuck out, and so you get this object 10247 07:35:19,880 --> 07:35:21,960 and then you kind of dive into that object. 10248 07:35:21,960 --> 07:35:24,540 And so that's xml1.py. 10249 07:35:24,540 --> 07:35:28,120 If you've got a tag, you can either get the text 10250 07:35:28,120 --> 07:35:31,880 out of the tag, or you can get an attribute out of the tag. 
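A minimal sketch in the spirit of the xml1.py walkthrough above follows. The transcript does not show the file itself, so the XML string is an assumption built from what is said on screen: a name tag containing Chuck and an email tag whose hide attribute is yes. The phone tag is made up to round out the example.

```python
import xml.etree.ElementTree as ET

# At this point the data is only a string with angle brackets in it.
data = '''<person>
  <name>Chuck</name>
  <phone type="intl">
    +1 734 303 4456
  </phone>
  <email hide="yes" />
</person>'''

# fromstring parses the string and gives back the outer element.
tree = ET.fromstring(data)

# find returns the first matching child tag; .text is the text inside it.
name = tree.find('name').text

# .get reads an attribute off the tag that find returned.
hide = tree.find('email').get('hide')

print('Name:', name)
print('Attr:', hide)
```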
10251 07:35:31,880 --> 07:35:34,480 So now let's take a look at xml2.py. 10252 07:35:34,480 --> 07:35:37,760 Again, we import ElementTree, and we have a tag, 10253 07:35:37,760 --> 07:35:41,320 and XML's always got to have a single outer tag. 10254 07:35:41,320 --> 07:35:45,400 But this time we're going to have, in effect, a list. 10255 07:35:45,400 --> 07:35:48,740 Now, let's line this up a little better. 10256 07:35:48,740 --> 07:35:52,280 There we go, that looks a little prettier. 10257 07:35:52,280 --> 07:35:56,920 And so users, the fact that it's users doesn't mean anything, 10258 07:35:56,920 --> 07:36:00,080 but we often come up with semantically meaningful names 10259 07:36:00,080 --> 07:36:01,420 for these things. 10260 07:36:01,420 --> 07:36:05,580 Users is going to have, as children, a list of user tags. 10261 07:36:05,580 --> 07:36:08,720 Okay, so the children under users, 10262 07:36:08,720 --> 07:36:13,720 user under users, and then this has each of these as a tag. 10263 07:36:13,720 --> 07:36:16,720 So we want to parse this, and this is a common thing 10264 07:36:16,720 --> 07:36:17,720 we want to do. 10265 07:36:19,720 --> 07:36:22,720 And so, again, the first thing we do is we read the string 10266 07:36:22,720 --> 07:36:24,720 to just take this, it's a triple-quoted string 10267 07:36:24,720 --> 07:36:26,720 going from here to here. 10268 07:36:26,720 --> 07:36:28,720 And then we're going to, instead of doing find, 10269 07:36:28,720 --> 07:36:31,720 which gives us one tag, we're going to do findall 10270 07:36:31,720 --> 07:36:36,720 of users slash user, the user tag that is a child of users. 10271 07:36:36,720 --> 07:36:40,720 And we get back a Python list of the tags, 10272 07:36:40,720 --> 07:36:43,720 not of the text, but of the tags. 10273 07:36:43,720 --> 07:36:46,720 So there is one tag, and there is another tag. 10274 07:36:46,720 --> 07:36:49,720 And so we can do len of that, so we can see that we got two. 
10275 07:36:49,720 --> 07:36:53,720 And then we can write a for loop, and this item is going 10276 07:36:53,720 --> 07:36:56,720 to iterate through the tags that are, the user tags 10277 07:36:56,720 --> 07:36:58,720 that are children of users. 10278 07:36:58,720 --> 07:37:01,720 So the first time item is going to be this tag, a tag, 10279 07:37:01,720 --> 07:37:04,720 remember, and then the second time is going to be this tag. 10280 07:37:04,720 --> 07:37:07,720 And so we can do things like find and get, 10281 07:37:07,720 --> 07:37:11,720 just like we did in xml1. 10282 07:37:11,720 --> 07:37:16,720 So running this is not too exciting, python3 xml2.py. 10283 07:37:17,720 --> 07:37:22,720 You see that there are two users, that comes from this print 10284 07:37:22,720 --> 07:37:24,720 right here, there are two users in there. 10285 07:37:24,720 --> 07:37:28,720 And the first one, if we go into name, and we go find 10286 07:37:28,720 --> 07:37:33,720 the text within the name tag, within user, then we get Chuck 10287 07:37:33,720 --> 07:37:37,720 and then we get the ID, which is 001, so we find the ID 10288 07:37:37,720 --> 07:37:39,720 within that item, and then we get the text. 10289 07:37:39,720 --> 07:37:44,720 And then we look and we grab the x attribute off of that. 10290 07:37:44,720 --> 07:37:51,720 And so we see Chuck, 001 and 2, and then in the next 10291 07:37:51,720 --> 07:37:55,720 tag, the for loop continues, and we print that out, okay? 10292 07:37:55,720 --> 07:38:02,720 And so that's just a basic run through of the XML from 10293 07:38:02,720 --> 07:38:06,720 the chapter in the Python book, okay? 10294 07:38:06,720 --> 07:38:07,720 Thanks. 10295 07:38:10,720 --> 07:38:12,720 So now we're going to talk a little bit about XML schema. 
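The xml2.py walkthrough above (findall, len, then a loop using find and get on each item) can be sketched as follows. The file is not shown in the transcript, so the XML is an assumption: the first user matches what is read out (Chuck, 001, x of 2), while the outer stuff tag and the second user are made up for illustration.

```python
import xml.etree.ElementTree as ET

# A single outer tag wrapping, in effect, a list of user tags.
data = '''<stuff>
  <users>
    <user x="2">
      <id>001</id>
      <name>Chuck</name>
    </user>
    <user x="7">
      <id>009</id>
      <name>Brent</name>
    </user>
  </users>
</stuff>'''

stuff = ET.fromstring(data)

# findall returns a Python list of tags (elements), not of text.
lst = stuff.findall('users/user')
print('User count:', len(lst))

rows = []
for item in lst:
    # Each item is one <user> tag; dive into it with find and get.
    row = (item.find('name').text, item.find('id').text, item.get('x'))
    rows.append(row)
    print('Name', row[0])
    print('Id', row[1])
    print('Attribute', row[2])
```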
10296 07:38:12,720 --> 07:38:17,720 XML schema is a language that allows you to decide 10297 07:38:17,720 --> 07:38:21,720 whether or not a particular XML document meets a contract 10298 07:38:21,720 --> 07:38:22,720 or arrangement. 10299 07:38:22,720 --> 07:38:25,720 So you have two pieces of software exchanging data using XML 10300 07:38:25,720 --> 07:38:28,720 and what if one of them, if they're all working, nobody 10301 07:38:28,720 --> 07:38:30,720 really worries too much about it, but if all of a sudden 10302 07:38:30,720 --> 07:38:33,720 one breaks, you change one side and another one breaks, 10303 07:38:33,720 --> 07:38:34,720 whose fault was it, right? 10304 07:38:34,720 --> 07:38:37,720 Was it the side that got changed or the other side? 10305 07:38:37,720 --> 07:38:38,720 And so you could argue. 10306 07:38:38,720 --> 07:38:41,720 So what you like to do is before you set up these arrangements 10307 07:38:41,720 --> 07:38:44,720 between these applications, set up a contract, in a way 10308 07:38:44,720 --> 07:38:48,720 they're kind of like the RFCs are, except that their scope 10309 07:38:48,720 --> 07:38:51,720 is between pairs of applications. 10310 07:38:51,720 --> 07:38:58,720 And so it itself is XML, and basically, what we do is we 10311 07:38:58,720 --> 07:39:02,720 take an XML document and an XML schema contract, and then we 10312 07:39:02,720 --> 07:39:05,720 either say that's good or that is bad, and that's called 10313 07:39:05,720 --> 07:39:10,720 validation. A piece of software that validates XML when given 10314 07:39:10,720 --> 07:39:13,720 a schema is called a validator. 10315 07:39:13,720 --> 07:39:17,720 And so an XML document, here we have our little XML document. 10316 07:39:17,720 --> 07:39:19,720 We're passing it to the validator. 10317 07:39:19,720 --> 07:39:23,720 And then we have a schema contract, which is itself XML. 
10318 07:39:23,720 --> 07:39:27,720 It's kind of a particular kind of XML, that xs:complexType, 10319 07:39:27,720 --> 07:39:29,720 that's just a tag. 10320 07:39:29,720 --> 07:39:32,720 Colon is a legitimate character for the name of a tag. 10321 07:39:32,720 --> 07:39:34,720 Name equals person, that's just an attribute. 10322 07:39:34,720 --> 07:39:39,720 And so XML schema is a particular format of XML that 10323 07:39:39,720 --> 07:39:43,720 renders an opinion about what XML is supposed to look like. 10324 07:39:43,720 --> 07:39:47,720 So there's a number of different XML schema languages, the one 10325 07:39:47,720 --> 07:39:50,720 we're going to look at is one that kind of came a little bit 10326 07:39:50,720 --> 07:39:55,720 later, that's very common, called XSD, which is the 10327 07:39:55,720 --> 07:39:58,720 World Wide Web Consortium's schema specification. 10328 07:39:58,720 --> 07:40:02,720 Often you'll find files that have suffixes of .xsd that 10329 07:40:02,720 --> 07:40:06,720 actually contain the XML just like we're going to show you. 10330 07:40:06,720 --> 07:40:10,720 So if you recall, there are simple elements which have text 10331 07:40:10,720 --> 07:40:14,720 children, and then there are complex elements where other 10332 07:40:14,720 --> 07:40:16,720 nodes are children of other nodes. 10333 07:40:16,720 --> 07:40:18,720 And so we can say this. 10334 07:40:18,720 --> 07:40:21,720 And so here we have a little bit of XML, and the XML schema 10335 07:40:21,720 --> 07:40:23,720 that makes sense with that. 10336 07:40:23,720 --> 07:40:27,720 So what we're saying is the outer tag of this legitimate 10337 07:40:27,720 --> 07:40:31,720 XML is supposed to be a complex tag with a name of person. 10338 07:40:31,720 --> 07:40:34,720 And so there we go, that looks good, good, good. 10339 07:40:34,720 --> 07:40:38,720 Then there is a sequence, and then there is a simple element, 10340 07:40:38,720 --> 07:40:41,720 a name of last name, looks good. 
10341 07:40:41,720 --> 07:40:44,720 And it's a string, that looks good. 10342 07:40:44,720 --> 07:40:48,720 Another tag that's named age, that's of type integer, 10343 07:40:48,720 --> 07:40:49,720 that's good. 10344 07:40:49,720 --> 07:40:53,720 And then a thing that's called date born, and then it looks 10345 07:40:53,720 --> 07:40:54,720 like a date. 10346 07:40:54,720 --> 07:40:57,720 So we check all these things, and we can basically say, 10347 07:40:57,720 --> 07:41:02,720 yup, that is a good XML document according to this schema. 10348 07:41:02,720 --> 07:41:05,720 And you don't have to write this generally, but there is 10349 07:41:05,720 --> 07:41:08,720 software that reads these two things and comes back with a 10350 07:41:08,720 --> 07:41:12,720 true or a false, and might even have some detail as to what 10351 07:41:12,720 --> 07:41:17,720 went wrong with this particular schema. 10352 07:41:17,720 --> 07:41:21,720 Here's some more that you can do with a schema. 10353 07:41:21,720 --> 07:41:24,720 We can do things like have a complex type, we have a 10354 07:41:24,720 --> 07:41:25,720 sequence. 10355 07:41:25,720 --> 07:41:29,720 Here we have a string, full name, and a string child name. 10356 07:41:29,720 --> 07:41:31,720 But we have this minOccurs and maxOccurs. 10357 07:41:31,720 --> 07:41:34,720 So minOccurs is the minimum number of times it can occur, 10358 07:41:34,720 --> 07:41:36,720 and maxOccurs is the maximum. 10359 07:41:36,720 --> 07:41:39,720 So minOccurs equals one, maxOccurs equals one means it's 10360 07:41:39,720 --> 07:41:40,720 required. 10361 07:41:40,720 --> 07:41:42,720 And so this is required, and we don't have two of them. 10362 07:41:42,720 --> 07:41:44,720 Two of them would be an error. 10363 07:41:44,720 --> 07:41:46,720 One of them is fine, so that's good. 10364 07:41:46,720 --> 07:41:50,720 Here the child name is minOccurs zero, maxOccurs ten.
10365 07:41:50,720 --> 07:41:53,720 So we have four here, and so that's good too. 10366 07:41:53,720 --> 07:41:57,720 And so that is another kind of XML schema constraint that 10367 07:41:57,720 --> 07:41:59,720 you can have. 10368 07:41:59,720 --> 07:42:03,720 Here's a few other data types that we can do. 10369 07:42:03,720 --> 07:42:06,720 We've done the string, we've done the date. 10370 07:42:06,720 --> 07:42:07,720 The date looks like this. 10371 07:42:07,720 --> 07:42:12,720 Dates are four digit year, two digit month, two digit day 10372 07:42:12,720 --> 07:42:13,720 with dashes. 10373 07:42:13,720 --> 07:42:15,720 Now there's lots of different ways to represent dates, but 10374 07:42:15,720 --> 07:42:19,720 the nice thing about this, and you have to put the zeros in. 10375 07:42:19,720 --> 07:42:21,720 So zero, nine for September. 10376 07:42:21,720 --> 07:42:23,720 It means that these are sortable as strings. 10377 07:42:23,720 --> 07:42:26,720 So that if you do all your dates this way, they're 10378 07:42:26,720 --> 07:42:27,720 sortable as strings. 10379 07:42:27,720 --> 07:42:29,720 So you could argue what is prettier, but for computers we 10380 07:42:29,720 --> 07:42:30,720 don't worry about that. 10381 07:42:30,720 --> 07:42:33,720 We're arguing about what's the most functional. 10382 07:42:33,720 --> 07:42:37,720 And then the date time is that same date format with zeros 10383 07:42:37,720 --> 07:42:41,720 followed by the letter T, and then followed by hours, 10384 07:42:41,720 --> 07:42:44,720 minutes, seconds, zero filled, right? 10385 07:42:44,720 --> 07:42:49,720 So nine o'clock is zero, nine, and then the time zone, 10386 07:42:49,720 --> 07:42:51,720 which we'll talk about a second in the next slide. 10387 07:42:51,720 --> 07:42:54,720 You can have decimal numbers and you can have integer 10388 07:42:54,720 --> 07:42:55,720 numbers as well. 
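As a small sketch of the point about zero-filled dates, the snippet below sorts the same three dates written both ways. The sample dates and variable names are made up for illustration; they are not from the slides. Year-month-day strings sort into calendar order as plain text, while the slash style does not.

```python
from datetime import datetime

# Illustrative dates only (not taken from the lecture slides).
iso_dates = ["2002-09-10", "2001-12-05", "2003-05-30"]
slash_dates = ["9/10/2002", "12/5/2001", "5/30/2003"]

# Plain string sort: correct calendar order for the zero-filled format.
iso_sorted = sorted(iso_dates)

# Plain string sort of the slash format compares "1", "5", "9" first,
# so the result is not in calendar order.
slash_sorted = sorted(slash_dates)

# Calendar order for the slash format requires parsing each date first.
slash_chronological = sorted(
    slash_dates, key=lambda s: datetime.strptime(s, "%m/%d/%Y")
)

# The dateTime layout described above: date, the letter T, zero-filled
# hours, minutes, seconds, then Z. This value is assumed to already be UTC.
stamp = datetime(2002, 5, 30, 9, 30, 10).strftime("%Y-%m-%dT%H:%M:%SZ")

print(iso_sorted)
print(slash_sorted)
print(slash_chronological)
print(stamp)
```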
10389 07:42:55,720 --> 07:42:58,720 And so we are able to sort of render an opinion as to what 10390 07:42:58,720 --> 07:43:03,720 is good and what is bad in the resulting XML. 10391 07:43:03,720 --> 07:43:05,720 So dates are kind of interesting. 10392 07:43:05,720 --> 07:43:08,720 There's, again, we have lots of different formats of dates, 10393 07:43:08,720 --> 07:43:15,720 you know, nine slash 10 slash 2002, right? 10394 07:43:15,720 --> 07:43:18,720 You know, that's a format of date, but that's one. 10395 07:43:18,720 --> 07:43:21,720 There's another format of the date, which is, you know, 10396 07:43:21,720 --> 07:43:24,720 12 December, whatever. 10397 07:43:24,720 --> 07:43:26,720 And so this is how people show dates. 10398 07:43:26,720 --> 07:43:29,720 Computers don't want to have all those different dates 10399 07:43:29,720 --> 07:43:31,720 and don't want to figure those out. 10400 07:43:31,720 --> 07:43:34,720 They have libraries that produce dates and make them look 10401 07:43:34,720 --> 07:43:36,720 pretty for particular locales. 10402 07:43:36,720 --> 07:43:40,720 But computers really want dates that work best for them. 10403 07:43:40,720 --> 07:43:42,720 So we just say, okay, we're going to have this year, 10404 07:43:42,720 --> 07:43:46,720 month, day, time, and then zero fill, hours, minutes, 10405 07:43:46,720 --> 07:43:50,720 seconds, h, m, s, and then time zone. 10406 07:43:50,720 --> 07:43:53,720 Now computers even prefer a time zone. 10407 07:43:53,720 --> 07:43:56,720 I don't know if you've used something like your Google 10408 07:43:56,720 --> 07:43:59,720 calendar and you take a flight or take a train trip and you 10409 07:43:59,720 --> 07:44:02,720 have a different time zone, everything switches. 
10410 07:44:02,720 --> 07:44:05,720 And that's because Google Calendar is not really storing 10411 07:44:05,720 --> 07:44:09,720 the time zone that you're, it's not storing the dates 10412 07:44:09,720 --> 07:44:13,720 in your current time zone, it's storing them in what we call 10413 07:44:13,720 --> 07:44:16,720 universal time or Greenwich Mean Time. 10414 07:44:16,720 --> 07:44:18,720 Zulu Time is another word for that. 10415 07:44:18,720 --> 07:44:22,720 And Z means this time that is the time in, you know, 10416 07:44:22,720 --> 07:44:25,720 London, England, Greenwich Mean Time. 10417 07:44:25,720 --> 07:44:29,720 And so the thing is that that means if this data moves 10418 07:44:29,720 --> 07:44:32,720 between time zones or crosses the international date line 10419 07:44:32,720 --> 07:44:35,720 or standard or daylight savings time or anything like that, 10420 07:44:35,720 --> 07:44:37,720 none of that changes. 10421 07:44:37,720 --> 07:44:41,720 And so we have this internal date and time that's very 10422 07:44:41,720 --> 07:44:45,720 common in situations where computers are exchanging data 10423 07:44:45,720 --> 07:44:49,720 that then gets shown with a time zone converted to the 10424 07:44:49,720 --> 07:44:52,720 time zone or the local format that's the right way to do that. 10425 07:44:52,720 --> 07:44:55,720 And there's a standard for how dates and times are supposed 10426 07:44:55,720 --> 07:44:56,720 to look. 10427 07:44:56,720 --> 07:44:59,720 So here's another little example of some stuff. 10428 07:44:59,720 --> 07:45:00,720 Let's see what we got. 10429 07:45:00,720 --> 07:45:03,720 Now, if you see this little question mark XML, 10430 07:45:03,720 --> 07:45:04,720 that's not a problem. 10431 07:45:04,720 --> 07:45:07,720 That just is a way of sort of putting a header on the whole 10432 07:45:07,720 --> 07:45:09,720 document that says it's an XML document, 10433 07:45:09,720 --> 07:45:12,720 telling it that it's a UTF-8 document.
10434 07:45:12,720 --> 07:45:14,720 And that's not really a tag. 10435 07:45:14,720 --> 07:45:17,720 That's sort of like a marker on the file so that you can put 10436 07:45:17,720 --> 07:45:20,720 that there but it doesn't harm the XML. 10437 07:45:20,720 --> 07:45:24,720 The outer tag is this tag right here, xs:schema. 10438 07:45:24,720 --> 07:45:27,720 And then what else we got? 10439 07:45:27,720 --> 07:45:28,720 We got an address. 10440 07:45:28,720 --> 07:45:30,720 We got a string, string, string, string, string. 10441 07:45:30,720 --> 07:45:32,720 We've seen all those. 10442 07:45:32,720 --> 07:45:35,720 Here we have country and we're going to have a restriction that 10443 07:45:35,720 --> 07:45:39,720 basically says this is a simple string but we're going to make 10444 07:45:39,720 --> 07:45:44,720 it so that you have to list one of these four as the country 10445 07:45:44,720 --> 07:45:45,720 code. 10446 07:45:45,720 --> 07:45:49,720 And so here we are down here and that's UK and that's UK and 10447 07:45:49,720 --> 07:45:53,720 so that is valid XML. 10448 07:45:53,720 --> 07:45:56,720 Another couple of examples here. 10449 07:45:56,720 --> 07:46:00,720 Let's see, string, string, string, string, string. 10450 07:46:00,720 --> 07:46:02,720 maxOccurs unbounded. 10451 07:46:02,720 --> 07:46:04,720 That means infinite number. 10452 07:46:04,720 --> 07:46:05,720 There's no limit on the number. 10453 07:46:05,720 --> 07:46:06,720 You can do that. 10454 07:46:06,720 --> 07:46:08,720 minOccurs of zero. 10455 07:46:08,720 --> 07:46:09,720 xs:positiveInteger. 10456 07:46:09,720 --> 07:46:12,720 We've seen integer but you can also say it's got to be positive 10457 07:46:12,720 --> 07:46:13,720 integer. 10458 07:46:13,720 --> 07:46:14,720 Decimal, we've seen that. 10459 07:46:14,720 --> 07:46:17,720 And then use equals required is just another statement that you 10460 07:46:17,720 --> 07:46:18,720 can make.
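To make the minOccurs and maxOccurs idea concrete, here is a rough sketch that reads a tiny schema as ordinary XML and counts child tags by hand. This is not real XSD validation (Python's standard library does not include an XSD validator); the function occurrence_problems, the element names, and the sample documents are all invented for illustration. It only shows that the schema is itself XML and that an occurrence rule is a simple count check.

```python
import xml.etree.ElementTree as ET

# Namespace that the xs: prefix is bound to in the schema text below.
XS = "{http://www.w3.org/2001/XMLSchema}"

# Hypothetical schema: one required full_name, any number of child_name tags.
schema_text = """<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
  <xs:element name="person">
    <xs:complexType>
      <xs:sequence>
        <xs:element name="full_name" type="xs:string" minOccurs="1" maxOccurs="1"/>
        <xs:element name="child_name" type="xs:string" minOccurs="0" maxOccurs="unbounded"/>
      </xs:sequence>
    </xs:complexType>
  </xs:element>
</xs:schema>"""

good_doc = """<person>
  <full_name>Example Parent</full_name>
  <child_name>Child A</child_name>
  <child_name>Child B</child_name>
</person>"""

bad_doc = """<person>
  <full_name>Example Parent</full_name>
  <full_name>Second Name</full_name>
</person>"""


def occurrence_problems(schema_xml, document_xml):
    """Hand-rolled count check; covers only minOccurs and maxOccurs."""
    schema = ET.fromstring(schema_xml)
    document = ET.fromstring(document_xml)
    problems = []
    for sequence in schema.iter(XS + "sequence"):
        for rule in sequence.findall(XS + "element"):
            name = rule.get("name")
            low = int(rule.get("minOccurs", "1"))   # XSD default is 1
            high = rule.get("maxOccurs", "1")       # XSD default is 1
            count = len(document.findall(name))     # direct children only
            if count < low:
                problems.append(f"{name}: found {count}, need at least {low}")
            if high != "unbounded" and count > int(high):
                problems.append(f"{name}: found {count}, allowed at most {high}")
    return problems


print(occurrence_problems(schema_text, good_doc))
print(occurrence_problems(schema_text, bad_doc))
```

A real validator checks far more than counts (types, ordering, restrictions), so this is only a way to see what one kind of schema statement means.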
10461 07:46:18,720 --> 07:46:21,720 I'm not trying to get you to the point where you can do XML 10462 07:46:21,720 --> 07:46:22,720 schema. 10463 07:46:22,720 --> 07:46:25,720 Just get you a sense of the kinds of statements that we can 10464 07:46:25,720 --> 07:46:28,720 speak about when we're talking about what is and is not 10465 07:46:28,720 --> 07:46:31,720 legitimate XML. 10466 07:46:31,720 --> 07:46:34,720 So let's talk a little bit about how we might parse XML inside 10467 07:46:34,720 --> 07:46:35,720 Python. 10468 07:46:35,720 --> 07:46:40,720 And so like most things that are in this extended part of Python 10469 07:46:40,720 --> 07:46:42,720 we have to import something. 10470 07:46:42,720 --> 07:46:45,720 And so this is the name of a library, xml.etree.ElementTree, 10471 07:46:45,720 --> 07:46:48,720 and then as ET, this ends up being a shortcut. 10472 07:46:48,720 --> 07:46:51,720 So we don't have to type these long things. 10473 07:46:51,720 --> 07:46:53,720 And so ET is the same as typing that. 10474 07:46:53,720 --> 07:46:55,720 It's almost like a macro. 10475 07:46:55,720 --> 07:46:58,720 Now normally this XML is going to come somewhere from the 10476 07:46:58,720 --> 07:47:01,720 network but I'm just going to put this in a string. 10477 07:47:01,720 --> 07:47:04,720 I'm using a triple quoted string and so that means that this 10478 07:47:04,720 --> 07:47:07,720 triple quoted string starts here and ends here and all these 10479 07:47:07,720 --> 07:47:10,720 new lines that are here are actually part of the string. 10480 07:47:10,720 --> 07:47:12,720 So this is kind of like I opened a file and read the whole 10481 07:47:12,720 --> 07:47:13,720 thing in. 10482 07:47:13,720 --> 07:47:16,720 But just to keep this totally self-contained I'm putting it 10483 07:47:16,720 --> 07:47:17,720 in a string.
10484 07:47:17,720 --> 07:47:21,720 So the XML would come from some server on the other side of the 10485 07:47:21,720 --> 07:47:23,720 network we would get this XML. 10486 07:47:23,720 --> 07:47:25,720 So that's how it would normally work. 10487 07:47:25,720 --> 07:47:26,720 Okay? 10488 07:47:26,720 --> 07:47:31,720 So this is the XML right there. 10489 07:47:31,720 --> 07:47:37,720 And we parse a string of data and we call ET.fromstring. 10490 07:47:37,720 --> 07:47:40,720 So we're passing in the less thans, the new lines, the 10491 07:47:40,720 --> 07:47:43,720 greater thans, all of this stuff we're passing in. 10492 07:47:43,720 --> 07:47:45,720 And this could have syntax errors in it. 10493 07:47:45,720 --> 07:47:50,720 So this might blow up if this had a syntax error like we forgot 10494 07:47:50,720 --> 07:47:51,720 the little slash or something. 10495 07:47:51,720 --> 07:47:52,720 There was a syntax error. 10496 07:47:52,720 --> 07:47:54,720 But this doesn't have a syntax error. 10497 07:47:54,720 --> 07:47:58,720 So then what we do is we get back an object. 10498 07:47:58,720 --> 07:48:01,720 I just happen to call it tree because it kind of is like that 10499 07:48:01,720 --> 07:48:04,720 tree version of the XML. 10500 07:48:04,720 --> 07:48:07,720 That is an object that we can then query to pull data out of 10501 07:48:07,720 --> 07:48:08,720 it. 10502 07:48:08,720 --> 07:48:13,720 So we say tree.find and look for a tag named name. 10503 07:48:13,720 --> 07:48:16,720 So that finds the tag named name, which is this. 10504 07:48:16,720 --> 07:48:18,720 It's everything. 10505 07:48:18,720 --> 07:48:20,720 It's the tag and the text. 10506 07:48:20,720 --> 07:48:22,720 If we want the text, we add dot text. 10507 07:48:22,720 --> 07:48:26,720 And then that dot text, that dot text, that actually 10508 07:48:26,720 --> 07:48:29,720 refines it to only the word chuck.
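The steps just described can be sketched as follows. The XML text here is a cut-down stand-in for the one on the slide, so treat its exact contents as an assumption; the calls ET.fromstring, find, and .text are the ones named in the lecture. The second half shows the "blow up" case, using the ET.ParseError exception that ElementTree raises on malformed input.

```python
import xml.etree.ElementTree as ET

# Stand-in for the slide's XML (exact contents assumed for illustration).
data = """<person>
  <name>Chuck</name>
  <email hide="yes"/>
</person>"""

# Parse the whole string into a tree of elements.
tree = ET.fromstring(data)

# find() returns the element (tag plus contents); .text narrows to the text.
name_element = tree.find("name")
name_text = name_element.text
print("Name:", name_text)

# A missing closing tag is a syntax error, so parsing fails.
broken = "<person><name>Chuck</name>"
try:
    ET.fromstring(broken)
    parsed_ok = True
except ET.ParseError as err:
    parsed_ok = False
    print("Parse failed:", err)
```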
10509 07:48:29,720 --> 07:48:36,720 And similarly, if we do tree.find('email'), that tree.find('email'), 10510 07:48:36,720 --> 07:48:39,720 that finds the email tag which is this tag. 10511 07:48:39,720 --> 07:48:42,720 It has a child attribute and you can get any of the 10512 07:48:42,720 --> 07:48:43,720 attributes. 10513 07:48:43,720 --> 07:48:44,720 You say dot get. 10514 07:48:44,720 --> 07:48:46,720 There's only one text child. 10515 07:48:46,720 --> 07:48:48,720 But there are many attribute children. 10516 07:48:48,720 --> 07:48:50,720 And so you have to tell it which one you want. 10517 07:48:50,720 --> 07:48:55,720 And so this here, this bit right here, all of that 10518 07:48:55,720 --> 07:48:58,720 will resolve down to that string yes. 10519 07:48:58,720 --> 07:49:00,720 That's what you're going to get there. 10520 07:49:00,720 --> 07:49:01,720 Yes. 10521 07:49:01,720 --> 07:49:05,720 And so you kind of build up these little finds and 10522 07:49:05,720 --> 07:49:06,720 call methods. 10523 07:49:06,720 --> 07:49:10,720 This is clearly not a full introduction to ElementTree. 10524 07:49:10,720 --> 07:49:13,720 But you get the idea that you sort of dive down in with 10525 07:49:13,720 --> 07:49:16,720 these methods, the call methods, the call methods, 10526 07:49:16,720 --> 07:49:20,720 to get little pieces out and parse all of that. 10527 07:49:20,720 --> 07:49:23,720 Here is a different example. 10528 07:49:23,720 --> 07:49:26,720 In this one, again, we're using triple quoted string. 10529 07:49:26,720 --> 07:49:30,720 We always have a single tag on the outside. 10530 07:49:30,720 --> 07:49:32,720 And then I have a complex type of users. 10531 07:49:32,720 --> 07:49:35,720 And in it, there are two user objects. 10532 07:49:35,720 --> 07:49:37,720 So this is kind of like a list. 10533 07:49:37,720 --> 07:49:39,720 So this is more than one of these things. 10534 07:49:39,720 --> 07:49:42,720 So this user can occur more than one time.
10535 07:49:42,720 --> 07:49:47,720 And again, we take this, we pass that into fromstring 10536 07:49:47,720 --> 07:49:51,720 and get back an object that represents it. The name stuff 10537 07:49:51,720 --> 07:49:55,720 does not necessarily have to be the same as this outer tag. 10538 07:49:55,720 --> 07:49:56,720 Just a variable. 10539 07:49:56,720 --> 07:50:00,720 This could just as easily be X if I wanted. 10540 07:50:00,720 --> 07:50:03,720 So now what I'm going to say is, hey, stuff, 10541 07:50:03,720 --> 07:50:07,720 I want to find the tag, the path users/user. 10542 07:50:07,720 --> 07:50:10,720 I want to find all tags that match users/user. 10543 07:50:10,720 --> 07:50:14,720 So that's going to give me a list of two tags, one tag, 10544 07:50:14,720 --> 07:50:18,720 two tags in a list. 10545 07:50:18,720 --> 07:50:22,720 Tag, tag. 10546 07:50:22,720 --> 07:50:23,720 Oops. 10547 07:50:23,720 --> 07:50:24,720 So two tags. 10548 07:50:24,720 --> 07:50:26,720 Now I can print out how many I get. 10549 07:50:26,720 --> 07:50:29,720 That'll be two in this case because I got two tags. 10550 07:50:29,720 --> 07:50:33,720 And I can actually iterate through the list. 10551 07:50:33,720 --> 07:50:36,720 So I can iterate through the list. 10552 07:50:36,720 --> 07:50:39,720 So this item is going to iterate first to this tag 10553 07:50:39,720 --> 07:50:43,720 and that tag, now it's like in the previous example, 10554 07:50:43,720 --> 07:50:46,720 we can look for the name tag within there 10555 07:50:46,720 --> 07:50:48,720 and pull the text out. 10556 07:50:48,720 --> 07:50:50,720 So we pull that text out, find the name tag, 10557 07:50:50,720 --> 07:50:53,720 find the name tag, and then within that find the text. 10558 07:50:53,720 --> 07:50:58,720 And we can find the ID tag and pull the text of that out. 10559 07:50:58,720 --> 07:51:05,720 So that pulls out this 001 and I've scribbled too much.
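The loop being walked through here can be sketched like this. The first user's values (001, x of 2, Chuck) follow the lecture; the second user's values are made up, since the transcript does not state them. findall with the path users/user returns a list of elements, and each one is queried with find, .text, and .get just as in the single-element example.

```python
import xml.etree.ElementTree as ET

# Second <user> is invented for illustration; the first follows the lecture.
data = """<stuff>
  <users>
    <user x="2">
      <id>001</id>
      <name>Chuck</name>
    </user>
    <user x="7">
      <id>009</id>
      <name>Brent</name>
    </user>
  </users>
</stuff>"""

stuff = ET.fromstring(data)

# Every <user> tag that sits directly under <users>.
users = stuff.findall("users/user")
print("User count:", len(users))

rows = []
for item in users:
    name = item.find("name").text   # text inside <name>
    user_id = item.find("id").text  # text inside <id>
    x_value = item.get("x")         # attribute on the <user> tag itself
    rows.append((name, user_id, x_value))
    print("Name", name, "Id", user_id, "Attribute", x_value)
```

Everything comes back as a string, including the id and the attribute, so any numeric use needs an explicit int() conversion.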
10560 07:51:05,720 --> 07:51:10,720 And then we can item, which is, this is item, 10561 07:51:10,720 --> 07:51:13,720 is that whole tag, dot get x. 10562 07:51:13,720 --> 07:51:16,720 So that gets the attribute, that gets the two, 10563 07:51:16,720 --> 07:51:18,720 that two comes down here. 10564 07:51:18,720 --> 07:51:24,720 And then item goes to the next one 10565 07:51:24,720 --> 07:51:28,720 because item is looping through so item iterates down to that one 10566 07:51:28,720 --> 07:51:33,720 and pulls out the name dot text, the ID dot text, 10567 07:51:33,720 --> 07:51:37,720 and the attribute dot x and pulls all those pieces out. 10568 07:51:37,720 --> 07:51:39,720 So this is the basic pattern. 10569 07:51:39,720 --> 07:51:43,720 You saw one where you're tearing into a single thing 10570 07:51:43,720 --> 07:51:45,720 and here you're tearing into something 10571 07:51:45,720 --> 07:51:48,720 that is expected to occur more than one time. 10572 07:51:48,720 --> 07:51:54,720 So that's a quick summary of how you talk to XML in Python. 10573 07:51:54,720 --> 07:51:57,720 Up next we're going to talk about the other serialization format, 10574 07:51:57,720 --> 07:52:03,720 JavaScript Object Notation. 10575 07:52:03,720 --> 07:52:06,720 So now we're going to talk about the other serialization format, 10576 07:52:06,720 --> 07:52:08,720 JavaScript Object Notation. 10577 07:52:08,720 --> 07:52:10,720 Chances are good as you go out there, 10578 07:52:10,720 --> 07:52:13,720 you will very likely encounter more JSON than you will XML. 10579 07:52:13,720 --> 07:52:15,720 Not that XML is bad. 10580 07:52:15,720 --> 07:52:19,720 XML is better for rich and hierarchical documents, 10581 07:52:19,720 --> 07:52:23,720 whereas JSON is best for just pulling data out of a system 10582 07:52:23,720 --> 07:52:27,720 and moving it between two systems with the minimum of fuss. 10583 07:52:27,720 --> 07:52:29,720 This is Douglas Crockford. 
10584 07:52:29,720 --> 07:52:31,720 I have a great interview from him. 10585 07:52:31,720 --> 07:52:34,720 He's a funny guy, very, very smart. 10586 07:52:34,720 --> 07:52:37,720 He claims he didn't invent JSON, he discovered it 10587 07:52:37,720 --> 07:52:41,720 because it really is based on the literal notation for JavaScript. 10588 07:52:41,720 --> 07:52:44,720 And it actually looks a lot like the Python literal notation 10589 07:52:44,720 --> 07:52:47,720 for objects and for lists. 10590 07:52:47,720 --> 07:52:50,720 Now Douglas Crockford has quite a sense of humor. 10591 07:52:50,720 --> 07:52:53,720 He wrote this book called JavaScript the Good Parts, 10592 07:52:53,720 --> 07:52:54,720 that's the little one right there, 10593 07:52:54,720 --> 07:52:56,720 and then JavaScript the Comprehensive Guide, 10594 07:52:56,720 --> 07:52:59,720 and the sense of humor is all the stuff that's in JavaScript 10595 07:52:59,720 --> 07:53:00,720 that's not too useful. 10596 07:53:00,720 --> 07:53:02,720 And while this is sort of a tongue in cheek, 10597 07:53:02,720 --> 07:53:04,720 it also is trying to say that JavaScript, 10598 07:53:04,720 --> 07:53:08,720 what Crockford is really saying here is JavaScript is a great language 10599 07:53:08,720 --> 07:53:10,720 as long as you avoid the tricky bits 10600 07:53:10,720 --> 07:53:12,720 and sort of keep it very, very simple. 10601 07:53:12,720 --> 07:53:15,720 And JavaScript is indeed a great language. 10602 07:53:15,720 --> 07:53:17,720 But JSON comes from JavaScript. 10603 07:53:17,720 --> 07:53:20,720 You can read about JSON at JSON.org. 10604 07:53:20,720 --> 07:53:23,720 JSON is not an international standard. 10605 07:53:23,720 --> 07:53:25,720 It's not like an RFC. 10606 07:53:25,720 --> 07:53:26,720 It really is. 
10607 07:53:26,720 --> 07:53:29,720 Douglas Crockford decided to register JSON.org 10608 07:53:29,720 --> 07:53:32,720 and typed in some pages, and people started reading it 10609 07:53:32,720 --> 07:53:33,720 and people started using it. 10610 07:53:33,720 --> 07:53:36,720 And partly that was because it was truly derived 10611 07:53:36,720 --> 07:53:42,720 from the JavaScript literal syntax. 10612 07:53:42,720 --> 07:53:44,720 So we're all ready to code. 10613 07:53:44,720 --> 07:53:48,720 Here is some Python that's going to process some JSON. 10614 07:53:48,720 --> 07:53:49,720 Keep it straight. 10615 07:53:49,720 --> 07:53:51,720 Python process JSON. 10616 07:53:51,720 --> 07:53:54,720 So again, I'm using the triple-quoted string here. 10617 07:53:54,720 --> 07:53:56,720 Now you'll notice the syntax that we are using 10618 07:53:56,720 --> 07:53:59,720 is not angle brackets, but instead curly braces. 10619 07:53:59,720 --> 07:54:02,720 And so the curly brace, and then within the curly brace 10620 07:54:02,720 --> 07:54:05,720 you have key value pairs, name colon chuck, 10621 07:54:05,720 --> 07:54:08,720 and the key colon value, and both sides have quotes. 10622 07:54:08,720 --> 07:54:12,720 You can also have objects within objects, curly brace, 10623 07:54:12,720 --> 07:54:14,720 key value pairs, key value, key value. 10624 07:54:14,720 --> 07:54:16,720 Looks a lot like Python. 10625 07:54:16,720 --> 07:54:18,720 And then you can do this. 10626 07:54:18,720 --> 07:54:21,720 And so this is a structure that has one key value pair 10627 07:54:21,720 --> 07:54:23,720 that's a string, another key value pair that's an object, 10628 07:54:23,720 --> 07:54:25,720 another key value pair that's an object, 10629 07:54:25,720 --> 07:54:28,720 and then these are key values within those contained objects. 10630 07:54:28,720 --> 07:54:33,720 So this is a string that again probably was retrieved 10631 07:54:33,720 --> 07:54:36,720 across the network from some other place. 
10632 07:54:36,720 --> 07:54:40,720 And we're going to pass that string into the JSON library 10633 07:54:40,720 --> 07:54:42,720 called loadS, loadS stands for load from string. 10634 07:54:42,720 --> 07:54:45,720 So it reads this, parses it, looks at all the white space. 10635 07:54:45,720 --> 07:54:47,720 White space again doesn't matter too much here 10636 07:54:47,720 --> 07:54:49,720 unless it's in between double quotes. 10637 07:54:49,720 --> 07:54:51,720 The white space doesn't matter. 10638 07:54:51,720 --> 07:54:56,720 And so it parses it and then returns us a dictionary. 10639 07:54:56,720 --> 07:54:58,720 So the thing that's different about JSON 10640 07:54:58,720 --> 07:55:03,720 is that its structure and representation are simpler than XML. 10641 07:55:03,720 --> 07:55:07,720 So in Python, everything either comes back as a dictionary 10642 07:55:07,720 --> 07:55:09,720 or a list, or a dictionary within a dictionary 10643 07:55:09,720 --> 07:55:12,720 or a list within a dictionary, but it's all dictionaries. 10644 07:55:12,720 --> 07:55:15,720 It's not a separate structure that you have to do gets 10645 07:55:15,720 --> 07:55:17,720 and finds and findalls and lookups. 10646 07:55:17,720 --> 07:55:18,720 So it's right there. 10647 07:55:18,720 --> 07:55:23,720 So when we get this back, because this is a curly brace, 10648 07:55:23,720 --> 07:55:25,720 info is a dictionary. 10649 07:55:27,720 --> 07:55:32,720 And so we can just use the standard syntax of Python, 10650 07:55:32,720 --> 07:55:34,720 info sub name. 10651 07:55:34,720 --> 07:55:39,720 Well, that will bring, let's clear this. 10652 07:55:41,720 --> 07:55:44,720 So info sub name, we'll go find Chuck. 10653 07:55:44,720 --> 07:55:47,720 So if you compare that with the XML, that's just a lot easier. 10654 07:55:47,720 --> 07:55:51,720 Now, when we have info sub email, that's this thing. 10655 07:55:51,720 --> 07:55:53,720 So info sub email is that thing. 
10656 07:55:53,720 --> 07:55:57,720 And then sub hide is this. 10657 07:55:57,720 --> 07:55:59,720 So that's what comes out here. 10658 07:55:59,720 --> 07:56:02,720 So it's really nested dictionaries and lists. 10659 07:56:02,720 --> 07:56:05,720 We haven't seen a list yet, but this is a set of nested 10660 07:56:05,720 --> 07:56:08,720 dictionaries that it parses. 10661 07:56:08,720 --> 07:56:11,720 And it's equally simple in other programming languages. 10662 07:56:11,720 --> 07:56:15,720 This is a little more complex version where the outer element 10663 07:56:15,720 --> 07:56:20,720 is a square bracket, which means it's going to be a list. 10664 07:56:20,720 --> 07:56:24,720 And so we have a list of one, comma, two things. 10665 07:56:24,720 --> 07:56:28,720 So this is a list of two dictionaries. 10666 07:56:28,720 --> 07:56:31,720 So there's two dictionaries inside that list. 10667 07:56:31,720 --> 07:56:35,720 So again, we take this string and we load it into, 10668 07:56:35,720 --> 07:56:39,720 use the JSON parser to read the string and give us back. 10669 07:56:39,720 --> 07:56:42,720 In this case, info is a list. 10670 07:56:42,720 --> 07:56:43,720 It's got two items. 10671 07:56:43,720 --> 07:56:46,720 If we print out info, it'll give us two. 10672 07:56:46,720 --> 07:56:48,720 And we're going to iterate through. 10673 07:56:48,720 --> 07:56:51,720 And so if we're going to iterate through, 10674 07:56:51,720 --> 07:56:55,720 item is going to first be this, 10675 07:56:55,720 --> 07:56:58,720 and then it's going to iterate to this. 10676 07:56:58,720 --> 07:57:00,720 And it's going to print out item sub name, 10677 07:57:00,720 --> 07:57:02,720 which is going to print out chuck, item sub id, 10678 07:57:02,720 --> 07:57:05,720 which is going to print out 001. 10679 07:57:05,720 --> 07:57:08,720 Now you'll notice that there is no attributes. 10680 07:57:08,720 --> 07:57:11,720 And that's because JSON is simpler. 
10681 07:57:11,720 --> 07:57:14,720 But we can have the x just as another item. 10682 07:57:14,720 --> 07:57:18,720 So we say item sub x, and that's going to print the two out. 10683 07:57:18,720 --> 07:57:20,720 And then it'll iterate to the next one, 10684 07:57:20,720 --> 07:57:23,720 and it'll print out the same thing for those guys. 10685 07:57:23,720 --> 07:57:26,720 And so JSON is simpler because it is, 10686 07:57:26,720 --> 07:57:30,720 you can't represent as complex a data structure, 10687 07:57:30,720 --> 07:57:32,720 or you have to compromise and map it 10688 07:57:32,720 --> 07:57:34,720 into a simpler data structure. 10689 07:57:34,720 --> 07:57:37,720 But then it is lists and dictionaries. 10690 07:57:37,720 --> 07:57:39,720 And so once you've got it parsed, 10691 07:57:39,720 --> 07:57:44,720 it is easier to understand and to make use of. 10692 07:57:44,720 --> 07:57:45,720 So that was quick. 10693 07:57:45,720 --> 07:57:48,720 So that's partly why everyone likes JSON better, 10694 07:57:48,720 --> 07:57:50,720 is once you have come up with the format 10695 07:57:50,720 --> 07:57:52,720 that you're going to send it back and forth, 10696 07:57:52,720 --> 07:57:54,720 it's easy to make it, and it's easy to read it. 10697 07:57:54,720 --> 07:57:57,720 Now what we're going to talk about is sort of moving up a level. 10698 07:57:57,720 --> 07:58:00,720 If you've got all these data formats 10699 07:58:00,720 --> 07:58:03,720 and URLs that you can hit to pull those data formats down, 10700 07:58:03,720 --> 07:58:08,720 what approach do you do as you start to construct applications 10701 07:58:08,720 --> 07:58:11,720 that increasingly go from a single application 10702 07:58:11,720 --> 07:58:13,720 to a networked application? 10703 07:58:18,720 --> 07:58:21,720 We're playing with the web services chapter right now. 
10704 07:58:21,720 --> 07:58:28,720 And if you want to get the materials for this course, 10705 07:58:28,720 --> 07:58:33,720 you can go here and download the sample zip, 10706 07:58:33,720 --> 07:58:34,720 samplecode.zip. 10707 07:58:34,720 --> 07:58:37,720 I've got this all sitting already on my computer. 10708 07:58:37,720 --> 07:58:39,720 I also have the whole thing in GitHub 10709 07:58:39,720 --> 07:58:40,720 if you want to get it out of GitHub. 10710 07:58:40,720 --> 07:58:42,720 So the thing we're talking about now 10711 07:58:42,720 --> 07:58:46,720 is we're talking about the json1.py example from the book. 10712 07:58:46,720 --> 07:58:50,720 And so JSON is kind of like XML except a lot simpler. 10713 07:58:50,720 --> 07:58:52,720 And that's why a lot of people like it. 10714 07:58:52,720 --> 07:58:54,720 It's not that JSON is always better, 10715 07:58:54,720 --> 07:58:57,720 but JSON is better in a lot of situations 10716 07:58:57,720 --> 07:59:00,720 that don't require the complexity of XML. 10717 07:59:00,720 --> 07:59:02,720 So we start with import json. 10718 07:59:02,720 --> 07:59:05,720 JSON is built into Python, but we have to ask to import it. 10719 07:59:05,720 --> 07:59:10,720 Again, we're using a triple-quoted string to put the JSON in there. 10720 07:59:10,720 --> 07:59:14,720 And JSON looks a lot like Python dictionaries, key-value pairs. 10721 07:59:14,720 --> 07:59:15,720 Key-value pairs. 10722 07:59:15,720 --> 07:59:17,720 In this case, this is a key, 10723 07:59:17,720 --> 07:59:21,720 and the value itself is another dictionary, 10724 07:59:21,720 --> 07:59:23,720 or in JSON terms, an object. 10725 07:59:23,720 --> 07:59:25,720 But again, key-value pairs within key-value pairs 10726 07:59:25,720 --> 07:59:27,720 within key-value pairs. 10727 07:59:27,720 --> 07:59:29,720 And all these little cursor guys have to, 10728 07:59:29,720 --> 07:59:33,720 all these little curly-brace guys have to line up properly.
10729 07:59:33,720 --> 07:59:38,720 And so, like all the time, this is a string, 10730 07:59:38,720 --> 07:59:41,720 which we normally would read and decode from the Internet. 10731 07:59:41,720 --> 07:59:43,720 But for now, we're just going to have it in there. 10732 07:59:43,720 --> 07:59:47,720 json.loads says go into the json library, pull out load string, 10733 07:59:47,720 --> 07:59:53,720 and parse this, which turns this set of curly braces, spaces, commas, 10734 07:59:53,720 --> 07:59:56,720 and perhaps syntax errors into a structured object. 10735 07:59:56,720 --> 08:00:00,720 And if we'd made a syntax error in here, then this would blow up. 10736 08:00:00,720 --> 08:00:02,720 But if this doesn't make a syntax error, 10737 08:00:02,720 --> 08:00:06,720 if this doesn't blow up, then we have a structured representation. 10738 08:00:06,720 --> 08:00:10,720 Now, the difference between XML and JSON in Python 10739 08:00:10,720 --> 08:00:15,720 is that this turns into a Python dictionary with key-value pairs. 10740 08:00:15,720 --> 08:00:19,720 And so, once we have this, this is a dictionary. 10741 08:00:19,720 --> 08:00:22,720 And we can say info sub name, 10742 08:00:22,720 --> 08:00:25,720 and that's the exact syntax that we would use with a dictionary. 10743 08:00:25,720 --> 08:00:28,720 And that's going to extract this value out of there. 10744 08:00:28,720 --> 08:00:33,720 And if we want to go in deeper, we can say info sub email, 10745 08:00:33,720 --> 08:00:37,720 and that's what info sub email is right there, and then sub hide. 10746 08:00:37,720 --> 08:00:40,720 So that's a dictionary within a dictionary. 10747 08:00:40,720 --> 08:00:50,720 So if we run this, python3 json1.py, it digs in really fast.
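A minimal sketch of what this walkthrough describes is below. It is not the book's json1.py verbatim: the phone entry and the "yes" value for hide are assumptions filled in for illustration. The point is that json.loads hands back an ordinary Python dictionary, so "info sub name" is just info["name"].

```python
import json

# Illustrative JSON: a string value plus two nested objects.
# The phone details and the "yes" value are assumed, not from the lecture.
data = """{
  "name": "Chuck",
  "phone": {"type": "intl", "number": "000-0000"},
  "email": {"hide": "yes"}
}"""

# Parse the text; curly braces on the outside give a Python dict.
info = json.loads(data)

print("Type:", type(info).__name__)
print("Name:", info["name"])            # plain dictionary lookup
print("Hide:", info["email"]["hide"])   # dictionary within a dictionary
```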
10748 08:00:50,720 --> 08:00:52,720 And so this is why people tend to like JSON, 10749 08:00:52,720 --> 08:00:54,720 is because you'll read the JSON, 10750 08:00:54,720 --> 08:00:57,720 which is actually a syntax derived from JavaScript, 10751 08:00:57,720 --> 08:01:00,720 but it looks just like the syntax for Python. 10752 08:01:00,720 --> 08:01:04,720 So that's moving an object, a JSON object 10753 08:01:04,720 --> 08:01:07,720 that turns directly into a Python dictionary 10754 08:01:07,720 --> 08:01:09,720 with nested dictionaries. 10755 08:01:09,720 --> 08:01:11,720 Now we're going to look at json2. 10756 08:01:11,720 --> 08:01:14,720 And so in json2, we're going to see a list, 10757 08:01:14,720 --> 08:01:16,720 or an array in JSON terms, 10758 08:01:16,720 --> 08:01:18,720 but it turns into a list in Python terms. 10759 08:01:18,720 --> 08:01:22,720 So this is a list of dictionaries. 10760 08:01:22,720 --> 08:01:25,720 In JavaScript, that would be an array of objects, 10761 08:01:25,720 --> 08:01:27,720 but in Python, it's a list of dictionaries. 10762 08:01:27,720 --> 08:01:30,720 So we'll just pretend that it's a list of dictionaries. 10763 08:01:30,720 --> 08:01:35,720 Again, we load the string, parsing, looking for syntax errors. 10764 08:01:35,720 --> 08:01:42,720 So let's just make a syntax error here and run python json2.py, 10765 08:01:42,720 --> 08:01:44,720 and you'll see where it blows up. 10766 08:01:44,720 --> 08:01:47,720 It blows up at line 15, which is right here. 10767 08:01:47,720 --> 08:01:49,720 It's like this loads blows up. 10768 08:01:49,720 --> 08:01:51,720 Now you could put a try/except around it to save it, 10769 08:01:51,720 --> 08:01:53,720 but we're not going to do that. 10770 08:01:53,720 --> 08:01:54,720 And it even complains. 10771 08:01:54,720 --> 08:01:57,720 It says, look, we're expecting something here in line 11. 
10772 08:01:57,720 --> 08:01:59,720 And that's line 11 of the JSON, 10773 08:01:59,720 --> 08:02:02,720 which starts at line 4. 10774 08:02:02,720 --> 08:02:05,720 And so I'll put my little square brace back in 10775 08:02:05,720 --> 08:02:07,720 so it's not syntactically broken. 10776 08:02:07,720 --> 08:02:10,720 So let's run it again and make sure that she runs, 10777 08:02:10,720 --> 08:02:11,720 and yes, she does. 10778 08:02:11,720 --> 08:02:16,720 So this parses it and converts from the JSON syntax 10779 08:02:16,720 --> 08:02:19,720 into a Python, in this case, list, 10780 08:02:19,720 --> 08:02:22,720 because it's got square braces instead of curly braces. 10781 08:02:22,720 --> 08:02:25,720 The previous example had curly braces. 10782 08:02:25,720 --> 08:02:28,720 And we can then take a len of it, and it's an array, 10783 08:02:28,720 --> 08:02:31,720 it's a list, and we see that there are two things in there. 10784 08:02:31,720 --> 08:02:33,720 And then we're going to iterate through, 10785 08:02:33,720 --> 08:02:36,720 and this item is going to iterate through these dictionaries, 10786 08:02:36,720 --> 08:02:38,720 that dictionary followed by that dictionary. 10787 08:02:38,720 --> 08:02:42,720 So the first time it's item sub name, 10788 08:02:42,720 --> 08:02:44,720 which is this value right here, 10789 08:02:44,720 --> 08:02:47,720 and then item sub id, which is this value. 10790 08:02:47,720 --> 08:02:49,720 So you can dig right into this, 10791 08:02:49,720 --> 08:02:51,720 but you're using, you're not using get 10792 08:02:51,720 --> 08:02:55,720 and you're not using the weird extra find or find all or anything. 10793 08:02:55,720 --> 08:02:59,720 You just are going at these structures directly. 10794 08:02:59,720 --> 08:03:02,720 And so you can quickly extract this stuff out, 10795 08:03:02,720 --> 08:03:07,720 and we read through id's name is Chuck. 10796 08:03:07,720 --> 08:03:09,720 Oops, name is Chuck. 
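As a rough sketch of the list-of-dictionaries case and of what happens when the closing square bracket is missing: the keys "name", "id", and "x" come from the walkthrough, while the second record, the exact values, and the variable names broken and parsed_ok are assumptions made for illustration. The try/except here is the optional safeguard mentioned in passing, not something the walkthrough itself adds.

```python
import json

# Illustrative data: two records, so len() reports 2 as in the walkthrough.
data = '''
[
  {"id": "001", "x": "2", "name": "Chuck"},
  {"id": "009", "x": "7", "name": "Brent"}
]'''

info = json.loads(data)            # a JSON array becomes a Python list
print('User count:', len(info))
for item in info:                  # each item is a dictionary
    print('Name', item['name'])
    print('Id', item['id'])

# Dropping the closing square bracket makes the text invalid JSON.
broken = data.rstrip()[:-1]
try:
    json.loads(broken)
    parsed_ok = True
except json.JSONDecodeError as err:
    parsed_ok = False
    # lineno counts lines within the JSON text, not within the .py file
    print('Parse failed at line', err.lineno, 'of the JSON text')
```

The line number reported by the exception refers to the JSON string itself, which matches the remark that "line 11" is counted from where the JSON starts rather than from the top of the program.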
10797 08:03:09,720 --> 08:03:12,720 There are no attributes, by the way. 10798 08:03:12,720 --> 08:03:15,720 x is two, and so we had to make x. 10799 08:03:15,720 --> 08:03:16,720 So if you look at the XML, 10800 08:03:16,720 --> 08:03:19,720 we had this concept of attributes on the outer tag. 10801 08:03:19,720 --> 08:03:21,720 These things are also not named. 10802 08:03:21,720 --> 08:03:23,720 We just have to know what we're looking for. 10803 08:03:23,720 --> 08:03:25,720 JSON represents simple structures, 10804 08:03:25,720 --> 08:03:29,720 but it's much simpler to use. 10805 08:03:29,720 --> 08:03:34,720 So I hope this has been useful to you 10806 08:03:34,720 --> 08:03:37,720 and talk to you in a bit about some more JSON. 10807 08:03:41,720 --> 08:03:45,720 So the service-oriented approach is a way we approach solving 10808 08:03:45,720 --> 08:03:47,720 a complex application problem 10809 08:03:47,720 --> 08:03:50,720 where all the data really isn't present in one computer system. 10810 08:03:50,720 --> 08:03:53,720 It's somehow spread out over the internet, 10811 08:03:53,720 --> 08:03:56,720 connected via the internet or internal network. 10812 08:03:56,720 --> 08:04:00,720 And so the idea is that some applications 10813 08:04:00,720 --> 08:04:02,720 just can't contain everything. 10814 08:04:02,720 --> 08:04:05,720 The perfect example is a travel website 10815 08:04:05,720 --> 08:04:08,720 that can book you a flight, book you a car, 10816 08:04:08,720 --> 08:04:11,720 buy tickets, book you a hotel, and do all these things. 10817 08:04:11,720 --> 08:04:15,720 Well, that travel website is neither a hotel 10818 08:04:15,720 --> 08:04:17,720 nor a rental car company nor an airline, 10819 08:04:17,720 --> 08:04:19,720 but what it really does is it talks to all these services 10820 08:04:19,720 --> 08:04:21,720 somewhere else on the web on your behalf, 10821 08:04:21,720 --> 08:04:23,720 and it makes reservations for you. 
10822 08:04:23,720 --> 08:04:26,720 And so you have this convenient user interface that says, 10823 08:04:26,720 --> 08:04:28,720 oh, here's your whole vacation. 10824 08:04:28,720 --> 08:04:30,720 I'm going to figure all this stuff out. 10825 08:04:30,720 --> 08:04:33,720 Now you say go, and it goes book, book, book, book, 10826 08:04:33,720 --> 08:04:35,720 and books on all these other systems. 10827 08:04:35,720 --> 08:04:38,720 Now it requires a lot of infrastructure, 10828 08:04:38,720 --> 08:04:41,720 a lot of coordination, and a lot of effort 10829 08:04:41,720 --> 08:04:44,720 to make sure that your application can talk. 10830 08:04:44,720 --> 08:04:46,720 And these other services that are out there in the internet 10831 08:04:46,720 --> 08:04:49,720 have good contracts, and you know exactly 10832 08:04:49,720 --> 08:04:53,720 how to send data to them and get data back from them. 10833 08:04:53,720 --> 08:04:55,720 And so initially, when you're building a service-oriented 10834 08:04:55,720 --> 08:04:58,720 architecture, often you have one application, 10835 08:04:58,720 --> 08:05:01,720 and it's all internal, often it's all one language, 10836 08:05:01,720 --> 08:05:03,720 and then maybe you'll say, oh, wait a sec. 10837 08:05:03,720 --> 08:05:05,720 We want to take part of what we do and put it 10838 08:05:05,720 --> 08:05:08,720 in a second system, and then sort of come up 10839 08:05:08,720 --> 08:05:10,720 with a set of rules between the systems, 10840 08:05:10,720 --> 08:05:15,720 and then more and more and more. 10841 08:05:15,720 --> 08:05:17,720 So now that we're solving our problem 10842 08:05:17,720 --> 08:05:20,720 using a series of cooperating applications 10843 08:05:20,720 --> 08:05:22,720 communicating across the network, 10844 08:05:22,720 --> 08:05:24,720 we're going to talk in a little bit more detail 10845 08:05:24,720 --> 08:05:27,720 about the notion of what we call web services. 
10846 08:05:27,720 --> 08:05:30,720 And in this, we're going to take a different perspective. 10847 08:05:30,720 --> 08:05:32,720 Instead of building our application 10848 08:05:32,720 --> 08:05:34,720 and breaking it into pieces, we're 10849 08:05:34,720 --> 08:05:36,720 going to have an application that's going to really consume 10850 08:05:36,720 --> 08:05:37,720 an API from somebody else. 10851 08:05:37,720 --> 08:05:43,720 So there is some other provider of this API that's not us. 10852 08:05:43,720 --> 08:05:46,720 And so if you're going to talk to somebody's data, 10853 08:05:46,720 --> 08:05:50,720 like Google or Amazon or Twitter, 10854 08:05:50,720 --> 08:05:52,720 they're going to say, you have to use our API. 10855 08:05:52,720 --> 08:05:53,720 So what's that? 10856 08:05:53,720 --> 08:05:57,720 So an API is a contract that says, look, if you do this, 10857 08:05:57,720 --> 08:05:59,720 and this and this and this, we're going to give you data this way. 10858 08:05:59,720 --> 08:06:01,720 And they set the rules. 10859 08:06:01,720 --> 08:06:03,720 They tell you what the URLs are. 10860 08:06:03,720 --> 08:06:05,720 They'll tell you if it's XML or JSON. 10861 08:06:05,720 --> 08:06:07,720 And this is called the Application Program Interface. 10862 08:06:07,720 --> 08:06:11,720 And it's something you read and you understand. 10863 08:06:11,720 --> 08:06:14,720 And so you go look at the documentation. 10864 08:06:14,720 --> 08:06:17,720 This is the documentation for the Google Maps API. 10865 08:06:17,720 --> 08:06:20,720 So it turns out that Google knows a lot about maps. 10866 08:06:20,720 --> 08:06:21,720 It knows a lot of data. 10867 08:06:21,720 --> 08:06:23,720 It knows how to search maps. 10868 08:06:23,720 --> 08:06:26,720 And it actually provides some of those features to you 10869 08:06:26,720 --> 08:06:29,720 that your application can take advantage of. 
10870 08:06:29,720 --> 08:06:31,720 I took advantage of this at one point 10871 08:06:31,720 --> 08:06:35,720 by asking all the students in one section of one of my online courses 10872 08:06:35,720 --> 08:06:36,720 where they were from. 10873 08:06:36,720 --> 08:06:38,720 And I just let them type in where it was. 10874 08:06:38,720 --> 08:06:41,720 And then I said, well, I don't know how to code any of that. 10875 08:06:41,720 --> 08:06:44,720 So I used this API doing what's called geocoding 10876 08:06:44,720 --> 08:06:48,720 to look all those places up and get precise latitudes and longitudes 10877 08:06:48,720 --> 08:06:50,720 for the ones Google could figure out. 10878 08:06:50,720 --> 08:06:52,720 And that saved me a lot of work. 10879 08:06:52,720 --> 08:06:54,720 Now, these are expensive resources, 10880 08:06:54,720 --> 08:06:58,720 but I could be patient and make use of these resources, 10881 08:06:58,720 --> 08:07:02,720 which, as long as you don't use them too much, can be free. 10882 08:07:02,720 --> 08:07:04,720 We'll talk a little bit more about rate limiting 10883 08:07:04,720 --> 08:07:06,720 and what's free and what's not in a bit. 10884 08:07:06,720 --> 08:07:08,720 But you start by reading documentation. 10885 08:07:08,720 --> 08:07:12,720 It says, do this, hit this URL, hit that URL. 10886 08:07:12,720 --> 08:07:14,720 So if you read that documentation, 10887 08:07:14,720 --> 08:07:18,720 you will find that there is a URL that you can hit. 10888 08:07:18,720 --> 08:07:20,720 And they tell you where to go. 10889 08:07:20,720 --> 08:07:22,720 And then you go to this URL. 10890 08:07:22,720 --> 08:07:23,720 You add a question mark. 10891 08:07:23,720 --> 08:07:25,720 And then you say address equals. 10892 08:07:25,720 --> 08:07:27,720 And then Ann Arbor plus. 10893 08:07:27,720 --> 08:07:28,720 And there's all these rules. 10894 08:07:28,720 --> 08:07:30,720 These are called URL encoding rules. 
10895 08:07:30,720 --> 08:07:32,720 When you have key values on URLs, 10896 08:07:32,720 --> 08:07:36,720 the plus means space and %2C means comma. 10897 08:07:36,720 --> 08:07:40,720 So these are called URL encoded. 10898 08:07:40,720 --> 08:07:42,720 But don't worry too much about that 10899 08:07:42,720 --> 08:07:44,720 because we're going to have a magic library 10900 08:07:44,720 --> 08:07:46,720 like we always do in Python that takes care of this. 10901 08:07:46,720 --> 08:07:49,720 And so if you were to hit this URL, 10902 08:07:49,720 --> 08:07:52,720 you type it in the exact right way in your browser, 10903 08:07:52,720 --> 08:07:54,720 you will get back a JSON document. 10904 08:07:54,720 --> 08:07:56,720 It's an object that has key value pairs. 10905 08:07:56,720 --> 08:07:58,720 The first value is the status, 10906 08:07:58,720 --> 08:08:00,720 then it has these results and it's a list. 10907 08:08:00,720 --> 08:08:02,720 And you dive down and eventually you can kind of find 10908 08:08:02,720 --> 08:08:04,720 the latitude and longitude of the thing 10909 08:08:04,720 --> 08:08:06,720 that you are looking for. 10910 08:08:06,720 --> 08:08:10,720 And so the idea is can we write a program that can read this? 10911 08:08:10,720 --> 08:08:13,720 And so here's our little program that reads this. 10912 08:08:13,720 --> 08:08:16,720 And a lot of this is sort of comfortable. 10913 08:08:16,720 --> 08:08:19,720 You've already seen some of this. 10914 08:08:19,720 --> 08:08:21,720 You import urllib. 10915 08:08:21,720 --> 08:08:23,720 We have to parse some JSON. 10916 08:08:23,720 --> 08:08:24,720 We grab the URL. 10917 08:08:24,720 --> 08:08:26,720 And then we're going to write a little while loop 10918 08:08:26,720 --> 08:08:28,720 that's going to ask for a location. 10919 08:08:28,720 --> 08:08:30,720 And we can type that location in. 
10920 08:08:30,720 --> 08:08:34,720 And we've got to concatenate with this URL 10921 08:08:34,720 --> 08:08:36,720 the address equals. 10922 08:08:36,720 --> 08:08:38,720 And there is a bit of code, a library, 10923 08:08:38,720 --> 08:08:40,720 that's called urllib.parse.urlencode 10924 08:08:40,720 --> 08:08:42,720 that takes the key and the value. 10925 08:08:42,720 --> 08:08:45,720 So the address equals and then whatever this text is 10926 08:08:45,720 --> 08:08:48,720 that we read in from the user, that goes in here. 10927 08:08:48,720 --> 08:08:50,720 And it does that URL encoding with the pluses 10928 08:08:50,720 --> 08:08:51,720 and the %2C. 10929 08:08:51,720 --> 08:08:53,720 And all that stuff is taken care of. 10930 08:08:53,720 --> 08:08:57,720 And that is our URL that we're going to pass to urlopen. 10931 08:08:57,720 --> 08:08:59,720 So we print out that we're going to retrieve it. 10932 08:08:59,720 --> 08:09:00,720 Prints this out. 10933 08:09:00,720 --> 08:09:02,720 And if you look at this, it's too long. 10934 08:09:02,720 --> 08:09:04,720 It has all that fancy stuff on it. 10935 08:09:04,720 --> 08:09:06,720 And then we read it. 10936 08:09:06,720 --> 08:09:08,720 I mean, we open it with urlopen. 10937 08:09:08,720 --> 08:09:10,720 And then we read it and decode it. 10938 08:09:10,720 --> 08:09:14,720 So these two things, hit this URL, decode it. 10939 08:09:14,720 --> 08:09:17,720 And then we retrieved 1669 characters 10940 08:09:17,720 --> 08:09:19,720 because it's just a, in this case, 10941 08:09:19,720 --> 08:09:22,720 because we've decoded it, data is a string now. 10942 08:09:22,720 --> 08:09:26,720 The read gives us bytes, and after the decode data is a string. 10943 08:09:26,720 --> 08:09:30,720 So we read that many characters, 1669 characters. 10944 08:09:30,720 --> 08:09:32,720 And then we're going to take this data 10945 08:09:32,720 --> 08:09:34,720 and we're going to parse it with JSON. 
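To make the URL-encoding step concrete, here is a small sketch using urllib.parse.urlencode. The service URL below is a placeholder assumption (the real one comes from the API documentation), and "Ann Arbor, MI" is just a sample input; no network request is made.

```python
import urllib.parse

# Placeholder service URL, assumed for illustration; note the trailing "?"
serviceurl = 'http://example.com/geocode/json?'

address = 'Ann Arbor, MI'   # the kind of text a user might type in

# urlencode takes a dictionary of key/value pairs and applies the
# URL-encoding rules: spaces become "+", commas become "%2C"
query = urllib.parse.urlencode({'address': address})
url = serviceurl + query

print(query)   # address=Ann+Arbor%2C+MI
print(url)
```

Because the question mark is already at the end of the service URL, the encoded query can simply be concatenated onto it.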
10946 08:09:34,720 --> 08:09:37,720 And we might get to bad data here. 10947 08:09:37,720 --> 08:09:39,720 It might blow up, but it might work. 10948 08:09:39,720 --> 08:09:41,720 So in this case, it works. 10949 08:09:41,720 --> 08:09:43,720 We have an error that basically says, 10950 08:09:43,720 --> 08:09:45,720 if we got a bad thing, we're going to blow up. 10951 08:09:45,720 --> 08:09:47,720 But in this case, it doesn't blow up. 10952 08:09:47,720 --> 08:09:49,720 And so now we're going to sort of dig through. 10953 08:09:49,720 --> 08:09:55,720 And if you go back, let me just go back. 10954 08:09:55,720 --> 08:09:58,720 So the results sub-zero geometry. 10955 08:09:58,720 --> 08:10:00,720 Let's show you how that works. 10956 08:10:00,720 --> 08:10:03,720 So results is the first key. 10957 08:10:03,720 --> 08:10:06,720 So this is a dictionary with a key of results. 10958 08:10:06,720 --> 08:10:08,720 But then it has a list. 10959 08:10:08,720 --> 08:10:12,720 And the zero item, this list starts here and goes there. 10960 08:10:12,720 --> 08:10:14,720 And I'm only going to show part of it, 10961 08:10:14,720 --> 08:10:16,720 but there's many things here. 10962 08:10:16,720 --> 08:10:18,720 So the zero item is this. 10963 08:10:18,720 --> 08:10:20,720 This is the sub-zero. 10964 08:10:20,720 --> 08:10:23,720 And then geometry within that sub-zero item. 10965 08:10:23,720 --> 08:10:29,720 So if we look at that, it is the outer dictionary, 10966 08:10:29,720 --> 08:10:32,720 the first item in the list, sub-geometry. 10967 08:10:32,720 --> 08:10:35,720 So that grabs one part. 10968 08:10:35,720 --> 08:10:40,720 That grabs this part right here. 10969 08:10:40,720 --> 08:10:43,720 And then we're going to go into location and lat. 10970 08:10:43,720 --> 08:10:45,720 And those are just keys within keys, 10971 08:10:45,720 --> 08:10:47,720 a dictionary within a dictionary. 10972 08:10:47,720 --> 08:10:50,720 And so you see it says sub-location, sub-lat. 
10973 08:10:50,720 --> 08:10:53,720 And so that is literally going to pull out 10974 08:10:53,720 --> 08:10:55,720 of that complex structure. 10975 08:10:55,720 --> 08:10:57,720 That will pull the latitude out. 10976 08:10:57,720 --> 08:11:00,720 And then in the next line, pull the longitude out. 10977 08:11:00,720 --> 08:11:03,720 So we can pull the latitude and longitude out. 10978 08:11:03,720 --> 08:11:04,720 And then we print it out. 10979 08:11:04,720 --> 08:11:07,720 And we can go into results sub-zero formatted address. 10980 08:11:07,720 --> 08:11:13,720 And that goes into results zero formatted address. 10981 08:11:13,720 --> 08:11:14,720 And that pulls this little bit out. 10982 08:11:14,720 --> 08:11:16,720 Now it takes a little while to write this stuff. 10983 08:11:16,720 --> 08:11:18,720 And you have to put a lot of debug. 10984 08:11:18,720 --> 08:11:20,720 And you don't necessarily figure out 10985 08:11:20,720 --> 08:11:22,720 this complex bit here at the end. 10986 08:11:22,720 --> 08:11:24,720 But you print it. 10987 08:11:24,720 --> 08:11:25,720 You don't get what you want. 10988 08:11:25,720 --> 08:11:26,720 You say, oh, wait a sec. 10989 08:11:26,720 --> 08:11:27,720 That was an array. 10990 08:11:27,720 --> 08:11:29,720 So I've got to add a little sub-zero there 10991 08:11:29,720 --> 08:11:30,720 to get the first one out of the array. 10992 08:11:30,720 --> 08:11:32,720 But eventually you figure it out. 10993 08:11:32,720 --> 08:11:34,720 And it's not all that difficult. 10994 08:11:34,720 --> 08:11:36,720 It's the first time, first few times you do it. 10995 08:11:36,720 --> 08:11:37,720 I'm like, what am I doing? 10996 08:11:37,720 --> 08:11:39,720 But after a while, you realize, oh, I'm just 10997 08:11:39,720 --> 08:11:42,720 sort of tearing this apart and digging deeper and deeper 10998 08:11:42,720 --> 08:11:45,720 into this data structure, which I just retrieved 10999 08:11:45,720 --> 08:11:47,720 over the internet from Google. 
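The digging described here can be sketched against a trimmed-down stand-in for the response. The keys "status", "results", "geometry", "location", and "lat" are the ones named in the walkthrough; the key names "lng" and "formatted_address" and all of the values are assumptions for illustration, and a real response contains many more fields.

```python
import json

# A trimmed, illustrative stand-in for the geocoding response.
data = '''
{
  "status": "OK",
  "results": [
    {
      "formatted_address": "Ann Arbor, MI, USA",
      "geometry": {
        "location": {"lat": 42.28, "lng": -83.74}
      }
    }
  ]
}'''

js = json.loads(data)

# outer dictionary -> "results" list -> item 0 -> nested dictionaries
first = js["results"][0]
lat = first["geometry"]["location"]["lat"]
lng = first["geometry"]["location"]["lng"]
location = first["formatted_address"]

print('lat', lat, 'lng', lng)
print(location)
```

Forgetting the [0] is the typical mistake mentioned above: "results" is a list, so indexing it with a string key raises a TypeError until the sub-zero is added.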
11000 08:11:47,720 --> 08:11:51,720 And I learned something good from that. 11001 08:11:51,720 --> 08:11:54,720 So up next, we're going to talk about how sometimes these APIs 11002 08:11:54,720 --> 08:11:58,720 protect themselves with keys or signatures 11003 08:11:58,720 --> 08:12:01,720 and why that happens and how to solve those problems. 11004 08:12:07,720 --> 08:12:09,720 We are doing some code samples here. 11005 08:12:09,720 --> 08:12:12,720 If you want to follow along, you can download the sample code. 11006 08:12:12,720 --> 08:12:14,720 All is in a big zip file. 11007 08:12:14,720 --> 08:12:15,720 I've got it. 11008 08:12:15,720 --> 08:12:19,720 We are going to be working with the Google Maps API. 11009 08:12:19,720 --> 08:12:22,720 In the old days, this Maps API was free 11010 08:12:22,720 --> 08:12:26,720 and did 2,500 requests per day. 11011 08:12:26,720 --> 08:12:28,720 But now they've made it so that parts of it 11012 08:12:28,720 --> 08:12:31,720 are behind API keys, and you start 11013 08:12:31,720 --> 08:12:33,720 having to be using OAuth and stuff. 11014 08:12:33,720 --> 08:12:36,720 But they haven't put it all behind this one address service 11015 08:12:36,720 --> 08:12:37,720 that we've been using. 11016 08:12:37,720 --> 08:12:39,720 That continues to work. 11017 08:12:39,720 --> 08:12:41,720 And the basically idea of an API is 11018 08:12:41,720 --> 08:12:42,720 you go read the documentation. 11019 08:12:42,720 --> 08:12:45,720 You find a URL. 11020 08:12:45,720 --> 08:12:47,720 And this is going to Google servers. 11021 08:12:47,720 --> 08:12:49,720 And you pass in the address. 11022 08:12:49,720 --> 08:12:51,720 And we have to pass in the address 11023 08:12:51,720 --> 08:12:53,720 using what's called URL encoding. 11024 08:12:53,720 --> 08:12:55,720 So spaces are pluses. 11025 08:12:55,720 --> 08:12:56,720 That's a comma. 11026 08:12:56,720 --> 08:12:57,720 And then that's a space. 
11027 08:12:57,720 --> 08:13:00,720 And so we have to pass this in a certain way. 11028 08:13:00,720 --> 08:13:02,720 But if we do it right, we hit this. 11029 08:13:02,720 --> 08:13:04,720 We're going to get ourselves some JSON back. 11030 08:13:04,720 --> 08:13:05,720 And that's really cool. 11031 08:13:05,720 --> 08:13:09,720 And so deep inside here, we get the real address, a good address. 11032 08:13:09,720 --> 08:13:12,720 We get a geometry. 11033 08:13:12,720 --> 08:13:14,720 We have the location. 11034 08:13:14,720 --> 08:13:15,720 We got the latitude and longitude. 11035 08:13:15,720 --> 08:13:17,720 And we can extract stuff out of here. 11036 08:13:17,720 --> 08:13:18,720 And so we're talking. 11037 08:13:18,720 --> 08:13:21,720 And this one here is still rate limited to 2,500. 11038 08:13:21,720 --> 08:13:23,720 But it's one of the few parts of the Google Maps API 11039 08:13:23,720 --> 08:13:26,720 that is not hidden behind an API key. 11040 08:13:26,720 --> 08:13:27,720 In a later chapter, we'll show you 11041 08:13:27,720 --> 08:13:32,720 how to actually talk with the API key in the geodata code. 11042 08:13:32,720 --> 08:13:36,720 The geoload shows you how to use an API key 11043 08:13:36,720 --> 08:13:39,720 if you want to jump ahead and take a look at that. 11044 08:13:39,720 --> 08:13:42,720 But for now, we're just going to take a look at GeoJSON, 11045 08:13:42,720 --> 08:13:44,720 which is going to retrieve one page and tear it apart. 11046 08:13:44,720 --> 08:13:46,720 So let's take a look. 11047 08:13:46,720 --> 08:13:50,720 So we're going to grab the URL.lib stuff and import JSON. 11048 08:13:50,720 --> 08:13:51,720 So now we're going to use JSON. 11049 08:13:51,720 --> 08:13:56,720 But we're going to actually pull the data out of the internet. 11050 08:13:56,720 --> 08:14:00,720 And so I just take that service URL for Google Maps API. 11051 08:14:00,720 --> 08:14:02,720 I found that somewhere in the documentation. 
11052 08:14:02,720 --> 08:14:05,720 And then I'm going to have a loop that's going to run forever. 11053 08:14:05,720 --> 08:14:08,720 I'm going to ask for the location. 11054 08:14:08,720 --> 08:14:11,720 And then if I hit enter, that's what this is saying, 11055 08:14:11,720 --> 08:14:12,720 get out of the loop. 11056 08:14:12,720 --> 08:14:15,720 And then what I'm going to do is I'm going to concatenate 11057 08:14:15,720 --> 08:14:18,720 the service URL, which is this. 11058 08:14:18,720 --> 08:14:24,720 And to this urllib.parse.urlencode we give a dictionary of address equals. 11059 08:14:24,720 --> 08:14:29,720 And this bit right here gives me the string 11060 08:14:29,720 --> 08:14:32,720 that leads to putting this address equals 11061 08:14:32,720 --> 08:14:34,720 but then encoding these spaces the right way. 11062 08:14:34,720 --> 08:14:38,720 So if you type a space, that bit of code turns it into the plus. 11063 08:14:38,720 --> 08:14:39,720 So that's important. 11064 08:14:39,720 --> 08:14:42,720 And I've got the question mark sitting here at the end of that. 11065 08:14:42,720 --> 08:14:45,720 Then what we're going to do is we're just going to do a urlopen 11066 08:14:45,720 --> 08:14:46,720 to get a handle. 11067 08:14:46,720 --> 08:14:48,720 We're going to read the whole document. 11068 08:14:48,720 --> 08:14:51,720 And because it's UTF-8 coming from the outside world 11069 08:14:51,720 --> 08:14:54,720 and we want it turned into Unicode inside our application, 11070 08:14:54,720 --> 08:14:56,720 we say .decode. 11071 08:14:56,720 --> 08:14:58,720 We can ask how many characters we got. 11072 08:14:58,720 --> 08:15:00,720 And we put our json.loads. 11073 08:15:00,720 --> 08:15:02,720 Now up till now we've been just doing loads 11074 08:15:02,720 --> 08:15:03,720 from internal strings. 11075 08:15:03,720 --> 08:15:07,720 But this is now a string that came from the outside world. 11076 08:15:07,720 --> 08:15:11,720 And we'll put a try/except in. 
11077 08:15:11,720 --> 08:15:14,720 And we'll set JS to be none and that'll be our little trigger. 11078 08:15:14,720 --> 08:15:18,720 Now we can look for, they give us, if we take a look at the output, 11079 08:15:18,720 --> 08:15:20,720 they give us this okay. 11080 08:15:20,720 --> 08:15:23,720 And that status can be a problem and it can complain about things. 11081 08:15:23,720 --> 08:15:26,720 So we have to check to see if we got a good status. 11082 08:15:26,720 --> 08:15:30,720 So at this point, if you look at the outer bit of this, 11083 08:15:30,720 --> 08:15:35,720 the outer bit that we get is a curly brace, so it's a dictionary. 11084 08:15:35,720 --> 08:15:40,720 Then there is, within that dictionary, a key results, which is a list. 11085 08:15:40,720 --> 08:15:43,720 But then the second thing in the outer dictionary is status. 11086 08:15:43,720 --> 08:15:54,720 And so we can ask if the word, if we got a false, 11087 08:15:54,720 --> 08:15:56,720 if we got nothing, that will quit. 11088 08:15:56,720 --> 08:16:02,720 If we don't have a status key in that object, or that dictionary, 11089 08:16:02,720 --> 08:16:06,720 or it's not equal to okay, any number of those things, 11090 08:16:06,720 --> 08:16:14,720 if this, or this, or this, either of those are true, we're going to quit. 11091 08:16:14,720 --> 08:16:16,720 Failure to retrieve and print the data out. 11092 08:16:16,720 --> 08:16:18,720 And when you're starting to read stuff all over the net, 11093 08:16:18,720 --> 08:16:20,720 you often have to put debugging in here like this, 11094 08:16:20,720 --> 08:16:22,720 like oh, something quit, I've got to figure out. 11095 08:16:22,720 --> 08:16:24,720 And so debugging it. 
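The try/except trigger and the three-part status check described above might look roughly like this. The function name parse_response is hypothetical, and the isinstance test is a slightly stricter stand-in for "if we got nothing", so that valid JSON that is not an object is also rejected; the rest follows the logic as narrated.

```python
import json

def parse_response(text):
    """Return the parsed dictionary, or None if the response is unusable.

    parse_response is a hypothetical helper name used for this sketch.
    """
    try:
        js = json.loads(text)
    except json.JSONDecodeError:
        js = None   # our little trigger: parsing failed

    # Quit on any of: nothing parsed, no "status" key, status not "OK"
    if not isinstance(js, dict) or 'status' not in js or js['status'] != 'OK':
        print('==== Failure To Retrieve ====')
        print(text)   # debugging output: show what actually came back
        return None
    return js

good = parse_response('{"status": "OK", "results": []}')
bad_json = parse_response('this is not JSON')
bad_status = parse_response('{"status": "ZERO_RESULTS", "results": []}')
```

Printing the raw text on failure is the debugging habit recommended above: when data comes from the outside world, seeing what was actually returned is usually the fastest way to find the problem.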
11096 08:16:24,720 --> 08:16:27,720 Next thing we're going to do is call json.dumps, 11097 08:16:27,720 --> 08:16:33,720 which is the opposite of loads, which takes this dictionary that includes arrays, 11098 08:16:33,720 --> 08:16:36,720 and we're going to pretty print it with an indent of four. 11099 08:16:36,720 --> 08:16:38,720 And then we're going to print that out. 11100 08:16:38,720 --> 08:16:41,720 And so if you look at my code, we'll see that the first thing we do, 11101 08:16:41,720 --> 08:16:44,720 once we've parsed it, is we print it back out so we can see it. 11102 08:16:44,720 --> 08:16:46,720 And then we're going to dig into it. 11103 08:16:46,720 --> 08:16:48,720 So let's go ahead and run this code. 11104 08:16:48,720 --> 08:16:55,720 python geojson.py. 11105 08:16:55,720 --> 08:16:58,720 One of these days, I will always type python3. 11106 08:16:58,720 --> 08:17:01,720 Ann Arbor, Michigan. 11107 08:17:01,720 --> 08:17:03,720 Okay, so it ran. 11108 08:17:03,720 --> 08:17:05,720 And so you see that it retrieved this URL. 11109 08:17:05,720 --> 08:17:10,720 This URL was constructed and retrieved 1736 characters. 11110 08:17:10,720 --> 08:17:13,720 And it's JSON pretty printed with an indent of four. 11111 08:17:13,720 --> 08:17:17,720 And this is that json.dumps all the way down to here. 11112 08:17:17,720 --> 08:17:19,720 So that's just json.dumps. 11113 08:17:19,720 --> 08:17:21,720 And then it starts extracting. 11114 08:17:21,720 --> 08:17:23,720 So it's going to pull things out. 11115 08:17:23,720 --> 08:17:26,720 Now, when you write this code, it's really easy to look at this and say, 11116 08:17:26,720 --> 08:17:27,720 oh, great, it's easy. 11117 08:17:27,720 --> 08:17:30,720 I tend to have to print this stuff out over and over and over 11118 08:17:30,720 --> 08:17:32,720 as I kind of construct this expression. 
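A short sketch of the pretty-printing step, assuming a small already-parsed structure (the contents are illustrative, not the real response). json.dumps goes from Python objects back to JSON text, and indent=4 produces the indented layout that makes the nesting easier to follow.

```python
import json

# Illustrative parsed structure: a dictionary that includes a list.
js = {
    "status": "OK",
    "results": [
        {"formatted_address": "Ann Arbor, MI, USA"}
    ]
}

pretty = json.dumps(js, indent=4)   # the opposite direction from json.loads
print(pretty)

# Parsing the pretty-printed text gives back an equal structure.
round_trip = json.loads(pretty)
```

Printing the indented form first, and only then writing the indexing expression, is the workflow described above: the indentation shows which keys sit inside which object or list.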
11119 08:17:32,720 --> 08:17:36,720 But if we look at it, the outer dictionary, 11120 08:17:36,720 --> 08:17:41,720 the outer dictionary sub results leads to this array. 11121 08:17:41,720 --> 08:17:43,720 And if you go look at this array carefully, 11122 08:17:43,720 --> 08:17:46,720 you find there is only one thing in it. 11123 08:17:46,720 --> 08:17:48,720 And so the results is an array. 11124 08:17:48,720 --> 08:17:53,720 Sub zero gets us this dictionary. 11125 08:17:53,720 --> 08:17:56,720 I keep wanting to say object because that's what it's called. 11126 08:17:56,720 --> 08:17:58,720 And that goes all the way down to here. 11127 08:17:58,720 --> 08:18:00,720 So that's what we get there. 11128 08:18:00,720 --> 08:18:03,720 And then within that, we now have an object. 11129 08:18:03,720 --> 08:18:08,720 And we look for geometry within that object. 11130 08:18:08,720 --> 08:18:10,720 Where is geometry? 11131 08:18:10,720 --> 08:18:11,720 Right there. 11132 08:18:11,720 --> 08:18:13,720 Geometry. 11133 08:18:13,720 --> 08:18:17,720 Geometry goes from there to there. 11134 08:18:17,720 --> 08:18:19,720 There's geometry in there. 11135 08:18:19,720 --> 08:18:20,720 You've got to get used to it. 11136 08:18:20,720 --> 08:18:22,720 That's why it's nice to have this stuff indented. 11137 08:18:22,720 --> 08:18:25,720 Geometry sub low. 11138 08:18:25,720 --> 08:18:26,720 Oops, come back. 11139 08:18:26,720 --> 08:18:27,720 Come back. 11140 08:18:27,720 --> 08:18:30,720 And then we go to location within that. 11141 08:18:30,720 --> 08:18:32,720 So location within geometry. 11142 08:18:32,720 --> 08:18:35,720 And then within location, we have lat and long. 11143 08:18:35,720 --> 08:18:41,720 And so this is pulling out this 42 and 83. 11144 08:18:41,720 --> 08:18:43,720 And then so we print that out. 11145 08:18:43,720 --> 08:18:45,720 Take a look. 11146 08:18:45,720 --> 08:18:46,720 And that prints that out. 
11147 08:18:46,720 --> 08:18:48,720 Pulls that right out of the JSON. 11148 08:18:48,720 --> 08:18:50,720 These are tricky to write, but after a while you win 11149 08:18:50,720 --> 08:18:53,720 and you get it right and it's just fine. 11150 08:18:53,720 --> 08:18:54,720 Okay. 11151 08:18:54,720 --> 08:18:56,720 And so we do the same thing. 11152 08:18:56,720 --> 08:19:00,720 Results of zero formatted address gets us this. 11153 08:19:00,720 --> 08:19:03,720 And so that's how we print the location out. 11154 08:19:03,720 --> 08:19:08,720 And so that's a real quick look at how we would do that 11155 08:19:08,720 --> 08:19:16,720 with the JSON talking to the Google Maps API. 11156 08:19:16,720 --> 08:19:18,720 Okay. Hope this helps. 11157 08:19:21,720 --> 08:19:24,720 Now we're going to talk about API rate limiting and security. 11158 08:19:24,720 --> 08:19:27,720 The key thing is that the Google API 11159 08:19:27,720 --> 08:19:29,720 and the Google data is super valuable. 11160 08:19:29,720 --> 08:19:31,720 And you could build a website that did nothing 11161 08:19:31,720 --> 08:19:34,720 but sort of like asked the person for something 11162 08:19:34,720 --> 08:19:37,720 and then showed them that place and make them be a map searcher. 11163 08:19:37,720 --> 08:19:41,720 And you added so little value and Google did all the hard work. 11164 08:19:41,720 --> 08:19:43,720 And so they protect these somewhat. 11165 08:19:43,720 --> 08:19:46,720 Sometimes they'll say you can only do 50 of these a day 11166 08:19:46,720 --> 08:19:48,720 or 500 a day or whatever. 11167 08:19:48,720 --> 08:19:49,720 That's called rate limiting. 11168 08:19:49,720 --> 08:19:51,720 And sometimes they say you've got to log in. 11169 08:19:51,720 --> 08:19:54,720 You've got to create an account and get a key with us 11170 08:19:54,720 --> 08:19:55,720 and then present your key. 11171 08:19:55,720 --> 08:19:58,720 So that means that your account only gets so many. 
11172 08:19:58,720 --> 08:20:00,720 And they keep track of who's using their service 11173 08:20:00,720 --> 08:20:02,720 and how much they're using it. 11174 08:20:02,720 --> 08:20:04,720 Google gives you even sort of a dashboard 11175 08:20:04,720 --> 08:20:05,720 that tells you some of this stuff. 11176 08:20:05,720 --> 08:20:07,720 It's kind of nice. 11177 08:20:07,720 --> 08:20:12,720 And so the other thing is that sometimes an API is free 11178 08:20:12,720 --> 08:20:14,720 and then it becomes popular and they decide 11179 08:20:14,720 --> 08:20:17,720 they're going to put a key on it or a rate limit on it. 11180 08:20:17,720 --> 08:20:19,720 So you've got to kind of play this game with them 11181 08:20:19,720 --> 08:20:23,720 and the rules kind of change as things progress. 11182 08:20:23,720 --> 08:20:26,720 So that geocoding API that we're talking about 11183 08:20:26,720 --> 08:20:31,720 has at one point in time 2500 requests a day. 11184 08:20:31,720 --> 08:20:34,720 You can get more requests if you get a key. 11185 08:20:34,720 --> 08:20:38,720 Now another API that we can talk about is the Twitter API. 11186 08:20:38,720 --> 08:20:41,720 Now Twitter API started out as a free public API 11187 08:20:41,720 --> 08:20:44,720 but then Twitter realized that people were making more money 11188 08:20:44,720 --> 08:20:47,720 off of Twitter's data than Twitter was making off of Twitter's data. 11189 08:20:47,720 --> 08:20:51,720 And so Twitter makes it so that you have to have an account. 11190 08:20:51,720 --> 08:20:54,720 You can only request data from their API 11191 08:20:54,720 --> 08:20:57,720 if you use your account key to sign that. 11192 08:20:57,720 --> 08:21:01,720 And so there's a whole series of getting and issuing keys 11193 08:21:01,720 --> 08:21:03,720 and then using those keys. 
11194 08:21:03,720 --> 08:21:07,720 And I'll just give you a short summary of the kind of code 11195 08:21:07,720 --> 08:21:13,720 that it takes to build those requests up that have to be signed. 11196 08:21:13,720 --> 08:21:16,720 So you'll look through the Twitter documentation 11197 08:21:16,720 --> 08:21:21,720 and it'll say, oh, this URL to get the tweets, et cetera, et cetera. 11198 08:21:21,720 --> 08:21:24,720 And it says do a GET request to this URL and that URL 11199 08:21:24,720 --> 08:21:26,720 and maybe substitute a little bit of things here 11200 08:21:26,720 --> 08:21:28,720 for the screen name you're looking for 11201 08:21:28,720 --> 08:21:30,720 or how many tweets you want. 11202 08:21:30,720 --> 08:21:35,720 And they tell you how to carefully construct these URLs. 11203 08:21:35,720 --> 08:21:39,720 And so here's an example bit of code that talks to Twitter. 11204 08:21:39,720 --> 08:21:42,720 For now, I'll ignore the security bit. 11205 08:21:42,720 --> 08:21:45,720 That's all hidden in this twurl. 11206 08:21:45,720 --> 08:21:46,720 So it looks a lot like the last one. 11207 08:21:46,720 --> 08:21:48,720 We're going to use json and urllib. 11208 08:21:48,720 --> 08:21:52,720 And we have found that this is the API name, blah, blah, blah, blah, blah, 11209 08:21:52,720 --> 08:21:56,720 list.json, getting a friend list for a particular person. 11210 08:21:56,720 --> 08:22:01,720 And so that is the base URL that we're going to use. 11211 08:22:01,720 --> 08:22:04,720 And we're going to ask for a Twitter account. 11212 08:22:04,720 --> 08:22:06,720 If we hit enter, we're going to break out. 11213 08:22:06,720 --> 08:22:08,720 And with twurl.augment, we're going to say, 11214 08:22:08,720 --> 08:22:12,720 give me the first five friends of this particular screen name, 11215 08:22:12,720 --> 08:22:14,720 the one we just read in from input.
11216 08:22:14,720 --> 08:22:16,720 And this twurl you'll see in a second, 11217 08:22:16,720 --> 08:22:20,720 it adds a bunch of stuff to prove that you are who you are. 11218 08:22:20,720 --> 08:22:21,720 It's signing that URL. 11219 08:22:21,720 --> 08:22:23,720 So you're sending a signed URL, 11220 08:22:23,720 --> 08:22:26,720 which is nothing more than a whole bunch of crazy characters. 11221 08:22:26,720 --> 08:22:27,720 We'll see that in a second. 11222 08:22:27,720 --> 08:22:28,720 We retrieve it. 11223 08:22:28,720 --> 08:22:30,720 And this is pretty straightforward. 11224 08:22:30,720 --> 08:22:34,720 We can just open the URL, read it, and decode it. 11225 08:22:34,720 --> 08:22:37,720 Decode solves the UTF-8 thing. 11226 08:22:37,720 --> 08:22:39,720 Makes it all so that data is a real string 11227 08:22:39,720 --> 08:22:41,720 and it's in Unicode internally. 11228 08:22:41,720 --> 08:22:43,720 Now we can actually get the headers. 11229 08:22:43,720 --> 08:22:48,720 Remember I told you earlier that urlopen bypasses the headers, 11230 08:22:48,720 --> 08:22:49,720 but it has stored them for later. 11231 08:22:49,720 --> 08:22:51,720 And we can say, hey, give me back those headers. 11232 08:22:51,720 --> 08:22:54,720 And that gives us back a dictionary of headers. 11233 08:22:54,720 --> 08:22:56,720 And the headers, if you go all the way back, 11234 08:22:56,720 --> 08:22:59,720 are a bunch of key value pairs. 11235 08:22:59,720 --> 08:23:02,720 Key colon value in the headers. 11236 08:23:02,720 --> 08:23:04,720 And in Twitter, if you read the documentation, 11237 08:23:04,720 --> 08:23:06,720 there's this x-rate-limit-remaining header 11238 08:23:06,720 --> 08:23:09,720 that tells you, each time it returns 11239 08:23:09,720 --> 08:23:11,720 a response to the API call that you made, 11240 08:23:11,720 --> 08:23:13,720 it says, look, you've got 12 left. 11241 08:23:13,720 --> 08:23:14,720 You've got 11 left.
11242 08:23:14,720 --> 08:23:15,720 You've got 10. 11243 08:23:15,720 --> 08:23:16,720 So you can print that out. 11244 08:23:16,720 --> 08:23:19,720 So this prints out how many you've got left. 11245 08:23:19,720 --> 08:23:21,720 Then we parse the JSON data. 11246 08:23:21,720 --> 08:23:24,720 We're going to print it so we can debug it. 11247 08:23:24,720 --> 08:23:27,720 This dump to string and then print it. 11248 08:23:27,720 --> 08:23:29,720 Indent equals four. 11249 08:23:29,720 --> 08:23:31,720 This is called pretty printing. 11250 08:23:31,720 --> 08:23:34,720 And it's indenting things really nicely 11251 08:23:34,720 --> 08:23:36,720 so that you can make more sense of it. 11252 08:23:36,720 --> 08:23:38,720 Whereas when these things are talking, 11253 08:23:38,720 --> 08:23:39,720 when programs are talking to each other, 11254 08:23:39,720 --> 08:23:44,720 they don't really make the output look particularly pretty. 11255 08:23:44,720 --> 08:23:48,720 And then if you, we're going to go through, 11256 08:23:48,720 --> 08:23:50,720 we have the outer thing of users. 11257 08:23:50,720 --> 08:23:52,720 And we're going to print out the screen name 11258 08:23:52,720 --> 08:23:55,720 and go grab the, for each user and users, 11259 08:23:55,720 --> 08:23:56,720 we're going to print their screen name. 11260 08:23:56,720 --> 08:23:59,720 We're going to grab their status text and print that out. 11261 08:23:59,720 --> 08:24:02,720 And so this is what that data looks like. 11262 08:24:02,720 --> 08:24:04,720 Kind of chopped a bit. 11263 08:24:04,720 --> 08:24:08,720 So the thing we get is an outer layer. 11264 08:24:08,720 --> 08:24:11,720 We get users and then we get a list. 11265 08:24:11,720 --> 08:24:13,720 And here's the first user. 11266 08:24:13,720 --> 08:24:14,720 Now, if you look at the actual data, 11267 08:24:14,720 --> 08:24:15,720 it's much larger than this. 
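The parse, pretty-print, and loop steps described above might look something like this sketch. The sample data is hypothetical and carries only the fields mentioned in the lecture; nesting the status text under a `status` dictionary is an assumption about the response shape, and `summarize` is an illustrative helper name.

```python
import json

# Hypothetical sample; real responses are much larger. The layout
# users -> status -> text is an assumption for illustration.
data = '''{"users": [
  {"screen_name": "alice", "status": {"text": "Hello"}},
  {"screen_name": "bob", "status": {"text": "Learning Python"}}
]}'''

js = json.loads(data)

# Pretty printing: dump the parsed data back to a string with indentation
# so the nesting is easy to read while debugging.
print(json.dumps(js, indent=4))

def summarize(js):
    """Return one 'screen_name: status text' line per user."""
    lines = []
    for u in js["users"]:
        # Use .get() so a user without a status does not raise KeyError.
        text = u.get("status", {}).get("text", "")
        lines.append(u["screen_name"] + ": " + text)
    return lines

for line in summarize(js):
    print(line)
```

The indented dump is only for human eyes; the loop works on the parsed dictionaries and lists, not on the printed text.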
11268 08:24:15,720 --> 08:24:18,720 Here's the second user and then we have status text, 11269 08:24:18,720 --> 08:24:22,720 status text and the screen name. 11270 08:24:22,720 --> 08:24:26,720 And so those are the bits that we're extracting from that. 11271 08:24:26,720 --> 08:24:28,720 If you look, we're going to grab the screen name. 11272 08:24:28,720 --> 08:24:31,720 We're going to grab the status text and away you go. 11273 08:24:31,720 --> 08:24:37,720 So you can start with this, 11274 08:24:37,720 --> 08:24:39,720 but you realize that once you're looking at this 11275 08:24:39,720 --> 08:24:41,720 and you're printing this out with pretty printing, 11276 08:24:41,720 --> 08:24:43,720 you can sort of work your way in 11277 08:24:43,720 --> 08:24:46,720 knowing that it's either a dictionary or a list. 11278 08:24:46,720 --> 08:24:47,720 If it's a dictionary, you look up the key. 11279 08:24:47,720 --> 08:24:50,720 If it's a list, you say which position it is 11280 08:24:50,720 --> 08:24:52,720 and then you get more dictionaries within dictionaries 11281 08:24:52,720 --> 08:24:54,720 within dictionaries and away you go. 11282 08:24:54,720 --> 08:24:58,720 And so this code actually, when it runs, 11283 08:24:58,720 --> 08:25:01,720 it prints out the screen name and then that status 11284 08:25:01,720 --> 08:25:02,720 and the next person. 11285 08:25:02,720 --> 08:25:04,720 So it's my first five, in that case, 11286 08:25:04,720 --> 08:25:07,720 my first five friends and their most recent status, 11287 08:25:07,720 --> 08:25:09,720 the first five people. 11288 08:25:09,720 --> 08:25:12,720 Now, let's talk a little bit about how this security works. 11289 08:25:12,720 --> 08:25:15,720 And so you have to go to the website. 11290 08:25:15,720 --> 08:25:16,720 You have to have a Twitter account. 11291 08:25:16,720 --> 08:25:19,720 You can't talk to Twitter API without a Twitter account. 
11292 08:25:19,720 --> 08:25:23,720 And then you go to this website and then you set up a key. 11293 08:25:23,720 --> 08:25:26,720 You say, I'm going to build an application 11294 08:25:26,720 --> 08:25:28,720 that is going to consume the Twitter API. 11295 08:25:28,720 --> 08:25:30,720 And then you go in, you have to work through. 11296 08:25:30,720 --> 08:25:33,720 There's documentation on how all this stuff works. 11297 08:25:33,720 --> 08:25:35,720 You set up an API key. 11298 08:25:35,720 --> 08:25:36,720 You set the application. 11299 08:25:36,720 --> 08:25:39,720 So I made a key called Python on my laptop. 11300 08:25:39,720 --> 08:25:42,720 And it gives us some values. 11301 08:25:42,720 --> 08:25:44,720 It gives us a consumer key, a consumer secret, 11302 08:25:44,720 --> 08:25:46,720 a token key, and a token secret. 11303 08:25:46,720 --> 08:25:48,720 And you get to regenerate these. 11304 08:25:48,720 --> 08:25:51,720 And there's this file called hidden.py. 11305 08:25:51,720 --> 08:25:54,720 And you edit it and copy and paste all the stuff 11306 08:25:54,720 --> 08:25:58,720 from those pages, those four values, into these strings. 11307 08:25:58,720 --> 08:26:02,720 Now, if you download my code, I don't have my keys in there. 11308 08:26:02,720 --> 08:26:04,720 I've got some placeholders for this stuff. 11309 08:26:04,720 --> 08:26:07,720 So you've got to get to this web page that's on Twitter, 11310 08:26:07,720 --> 08:26:09,720 copy these things in, 11311 08:26:09,720 --> 08:26:13,720 and then the twurl code will start to work. 11312 08:26:13,720 --> 08:26:16,720 It uses a technology called OAuth, 11313 08:26:16,720 --> 08:26:19,720 which is a way to sign a URL 11314 08:26:19,720 --> 08:26:22,720 in a way that proves that you have the key and the secret 11315 08:26:22,720 --> 08:26:25,720 and the tokens. 11316 08:26:25,720 --> 08:26:27,720 And it can't be modified in the middle.
11317 08:26:27,720 --> 08:26:29,720 So once you send this URL, 11318 08:26:29,720 --> 08:26:31,720 they can check the key and the secret 11319 08:26:31,720 --> 08:26:32,720 to make sure that you truly signed it 11320 08:26:32,720 --> 08:26:34,720 without actually sending the secret. 11321 08:26:34,720 --> 08:26:36,720 It's actually kind of cool and fascinating, 11322 08:26:36,720 --> 08:26:39,720 but we won't go into it in great detail here. 11323 08:26:39,720 --> 08:26:44,720 And so if you look at the code in twurl.py, 11324 08:26:44,720 --> 08:26:46,720 this is the code that does it. 11325 08:26:46,720 --> 08:26:51,720 It actually pulls in an OAuth library and that hidden.py. 11326 08:26:51,720 --> 08:26:54,720 That is that code that you've got. 11327 08:26:54,720 --> 08:26:58,720 And it's got the consumer key, the consumer secret. 11328 08:26:58,720 --> 08:26:59,720 Secrets. 11329 08:26:59,720 --> 08:27:03,720 This is pulling that from hidden.py. 11330 08:27:03,720 --> 08:27:06,720 This is a lot of stuff that's using this OAuth library. 11331 08:27:06,720 --> 08:27:08,720 Don't worry too much about that. 11332 08:27:08,720 --> 08:27:11,720 Eventually it produces a URL that looks like this. 11333 08:27:11,720 --> 08:27:13,720 And what happens is this was the base URL 11334 08:27:13,720 --> 08:27:14,720 you were told to use. 11335 08:27:14,720 --> 08:27:16,720 Then you have count equals two 11336 08:27:16,720 --> 08:27:18,720 and screen_name equals drchuck. 11337 08:27:18,720 --> 08:27:22,720 Those parts are your parameters to that web service call. 11338 08:27:22,720 --> 08:27:26,720 And then all this OAuth stuff is produced 11339 08:27:26,720 --> 08:27:28,720 by this OAuth code 11340 08:27:28,720 --> 08:27:30,720 and the consumer key and the secret.
11341 08:27:30,720 --> 08:27:33,720 What happens is the key gets sent, 11342 08:27:33,720 --> 08:27:37,720 the key gets sent and the secret does not get sent, 11343 08:27:37,720 --> 08:27:40,720 but they send the signature, which is based on the secret, 11344 08:27:40,720 --> 08:27:43,720 and then what it does is it rechecks the signature 11345 08:27:43,720 --> 08:27:44,720 on the far end. 11346 08:27:44,720 --> 08:27:48,720 The signature is a long string, and they check it by regenerating the signature, 11347 08:27:48,720 --> 08:27:51,720 because the secret is available to both you 11348 08:27:51,720 --> 08:27:54,720 to generate the signature and to them to check the signature. 11349 08:27:54,720 --> 08:27:57,720 So it's kind of like a hash, et cetera, et cetera. 11350 08:27:57,720 --> 08:27:59,720 You don't have to worry about all this. 11351 08:27:59,720 --> 08:28:01,720 These URLs get really long, 11352 08:28:01,720 --> 08:28:03,720 and the values that you need are in there, 11353 08:28:03,720 --> 08:28:06,720 the base URL is in there, and you call this routine 11354 08:28:06,720 --> 08:28:09,720 that's called augment, which takes a URL and then parameters 11355 08:28:09,720 --> 08:28:12,720 and then augments it by adding all this OAuth stuff. 11356 08:28:12,720 --> 08:28:16,720 And so that's why it's called augment, to augment the URL. 11357 08:28:16,720 --> 08:28:18,720 And once you've got this set up and hidden working, 11358 08:28:18,720 --> 08:28:21,720 then you sort of just augment the URL and then hit it. 11359 08:28:21,720 --> 08:28:24,720 Now, you know, if you don't have the right keys or secrets 11360 08:28:24,720 --> 08:28:25,720 or you don't have an account on Twitter, 11361 08:28:25,720 --> 08:28:26,720 then it's going to blow up. 11362 08:28:26,720 --> 08:28:27,720 But if you get it set up, 11363 08:28:27,720 --> 08:28:31,720 you will be able to talk to the Twitter API with this.
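The idea that a signature can be rechecked on the far end without the secret ever being sent can be illustrated with Python's standard `hmac` module. This is a simplified sketch of shared-secret signing, not the course's twurl or oauth code and not a complete OAuth implementation (real OAuth 1.0 also normalizes the request and mixes in a nonce, a timestamp, and the token secret). The function names `sign` and `verify` and the sample values are assumptions for illustration.

```python
import base64
import hashlib
import hmac

def sign(message, secret):
    """Compute a base64-encoded HMAC-SHA1 signature of message using secret."""
    digest = hmac.new(secret.encode("utf-8"),
                      message.encode("utf-8"),
                      hashlib.sha1).digest()
    return base64.b64encode(digest).decode("ascii")

def verify(message, signature, secret):
    """Recompute the signature from the shared secret and compare."""
    return hmac.compare_digest(sign(message, secret), signature)

# The sender knows the secret and sends only the message and the signature.
params = "count=2&screen_name=drchuck"
signature = sign(params, "placeholder-secret")

# The receiver, who also knows the secret, regenerates and compares.
print(verify(params, signature, "placeholder-secret"))

# Changing the message in transit makes the check fail.
print(verify("count=200&screen_name=drchuck", signature, "placeholder-secret"))
```

Because only someone holding the secret can produce a matching signature, the receiver learns both who sent the request and that it was not modified along the way.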
11364 08:28:31,720 --> 08:28:33,720 So this whole web services section, 11365 08:28:33,720 --> 08:28:36,720 we've done quite a bit of stuff, right? 11366 08:28:36,720 --> 08:28:40,720 We've looked at how instead of reading HTML or flat text, 11367 08:28:40,720 --> 08:28:44,720 we are creating structured data according to contracts, 11368 08:28:44,720 --> 08:28:46,720 whether it be XML or JSON. 11369 08:28:46,720 --> 08:28:48,720 We can retrieve and parse that information 11370 08:28:48,720 --> 08:28:50,720 in a deterministic way. 11371 08:28:50,720 --> 08:28:53,720 We talked about schemas that define the contracts 11372 08:28:53,720 --> 08:28:56,720 so that you know if the data you're getting is wrong, 11373 08:28:56,720 --> 08:28:57,720 you could know who to blame 11374 08:28:57,720 --> 08:28:59,720 because the schema gets violated. 11375 08:28:59,720 --> 08:29:03,720 And we've played with APIs where you're talking to someone else 11376 08:29:03,720 --> 08:29:05,720 who's defining what the rules are 11377 08:29:05,720 --> 08:29:07,720 and how to read their documentation. 11378 08:29:07,720 --> 08:29:11,720 And even if they have an API key or need to sign URLs, 11379 08:29:11,720 --> 08:29:14,720 showed a little bit about how to do that. 11380 08:29:19,720 --> 08:29:21,720 We're doing some code, sample code, 11381 08:29:21,720 --> 08:29:24,720 playing through with some sample code samples. 11382 08:29:24,720 --> 08:29:27,720 And you can get this by downloading it. 11383 08:29:27,720 --> 08:29:29,720 I've got this whole thing downloaded. 11384 08:29:29,720 --> 08:29:32,720 And I've got all the files here. 11385 08:29:32,720 --> 08:29:35,720 And these are the files we're going to play with today. 11386 08:29:35,720 --> 08:29:38,720 Today what we're going to do is talk to about the Twitter API. 11387 08:29:38,720 --> 08:29:41,720 And the one thing we've got to learn about the Twitter API 11388 08:29:41,720 --> 08:29:43,720 is we have to authorize ourselves. 
11389 08:29:43,720 --> 08:29:48,720 And so we have to make sure that we have a Twitter account 11390 08:29:48,720 --> 08:29:50,720 and then we get some keys. 11391 08:29:50,720 --> 08:29:52,720 And so in this particular application, 11392 08:29:52,720 --> 08:29:54,720 if you want to duplicate what I'm doing, 11393 08:29:54,720 --> 08:29:56,720 you have to go to apps.twitter.com, 11394 08:29:56,720 --> 08:29:58,720 click this create new application button, 11395 08:29:58,720 --> 08:30:00,720 and then get some codes. 11396 08:30:00,720 --> 08:30:03,720 And the codes show up as soon as you hit this button 11397 08:30:03,720 --> 08:30:04,720 and then one more button, 11398 08:30:04,720 --> 08:30:07,720 which I'm not going to do on screen. 11399 08:30:07,720 --> 08:30:10,720 And so what happens is there are four codes 11400 08:30:10,720 --> 08:30:13,720 that you've got to put in this file hidden.py. 11401 08:30:13,720 --> 08:30:15,720 The consumer key, the consumer secret, the token key, 11402 08:30:15,720 --> 08:30:16,720 and token secret. 11403 08:30:16,720 --> 08:30:18,720 These are just messed up, 11404 08:30:18,720 --> 08:30:20,720 so I'll show you how this works 11405 08:30:20,720 --> 08:30:22,720 and it blows up if first, 11406 08:30:22,720 --> 08:30:25,720 and then I'll put my keys in here without showing you. 11407 08:30:25,720 --> 08:30:28,720 But basically, this is a little file you've got to edit 11408 08:30:28,720 --> 08:30:30,720 or these Twitter ones don't work. 11409 08:30:30,720 --> 08:30:32,720 You'll see what happens. 11410 08:30:32,720 --> 08:30:35,720 So the first one I'm going to do is do the simplest one of all. 11411 08:30:35,720 --> 08:30:38,720 And that is I call this thing Twitter Test 11412 08:30:38,720 --> 08:30:42,720 and it just is going to go ask for the user timeline. 11413 08:30:42,720 --> 08:30:43,720 And we can take a look at this. 
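A file like hidden.py could be laid out along these lines. This is a guess at the shape, assuming a single function that returns the four values as strings; the function name `oauth` and the dictionary key names are assumptions, and the values are deliberately fake placeholders.

```python
# Sketch of a hidden.py-style file. Replace each placeholder string with the
# value copied from your own application's page. Never share the two secrets.

def oauth():
    """Return the four credentials needed to sign requests."""
    return {
        "consumer_key": "PLACEHOLDER_CONSUMER_KEY",
        "consumer_secret": "PLACEHOLDER_CONSUMER_SECRET",
        "token_key": "PLACEHOLDER_TOKEN_KEY",
        "token_secret": "PLACEHOLDER_TOKEN_SECRET",
    }
```

Keeping the credentials in their own file makes it easier to share the rest of the code without leaking them.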
11414 08:30:43,720 --> 08:30:46,720 And we're going to take the URL 11415 08:30:46,720 --> 08:30:48,720 and we're going to augment the URL. 11416 08:30:48,720 --> 08:30:49,720 This is the base. 11417 08:30:49,720 --> 08:30:52,720 We found this looking at the Twitter API documentation. 11418 08:30:52,720 --> 08:30:54,720 We're going to pass a parameter of screen name, 11419 08:30:54,720 --> 08:30:56,720 Dr. Chuck, and a count of two. 11420 08:30:56,720 --> 08:30:58,720 So this is just a Python dictionary. 11421 08:30:58,720 --> 08:31:03,720 And augment comes from this little bit of code called twurl. 11422 08:31:03,720 --> 08:31:07,720 And this uses a bit of code called oauth, 11423 08:31:07,720 --> 08:31:12,720 which is built into Python as well, right? 11424 08:31:12,720 --> 08:31:14,720 Yeah, that's built into Python as well. 11425 08:31:14,720 --> 08:31:18,720 And it augments the URL and it takes the key, 11426 08:31:18,720 --> 08:31:22,720 the secret, the token key, and does a thing and signs it 11427 08:31:22,720 --> 08:31:24,720 and then makes this big, long, ugly URL, 11428 08:31:24,720 --> 08:31:26,720 which you will soon see, 11429 08:31:26,720 --> 08:31:29,720 and it's a signature of the URL. 11430 08:31:29,720 --> 08:31:32,720 So we pass this data back and forth to Twitter 11431 08:31:32,720 --> 08:31:36,720 with a signature and then they recheck the signature 11432 08:31:36,720 --> 08:31:38,720 and it's a digital signature that knows that 11433 08:31:38,720 --> 08:31:41,720 this URL came from a program that knows the key, 11434 08:31:41,720 --> 08:31:43,720 secret, and token and token secret. 11435 08:31:43,720 --> 08:31:47,720 And so this augment basically is something 11436 08:31:47,720 --> 08:31:52,720 that I wrote, twurl, augment, is something I wrote 11437 08:31:52,720 --> 08:31:55,720 to make it easier to add all these oauth parameters. 11438 08:31:55,720 --> 08:32:00,720 And you feed this code by putting your data into hidden.py. 
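Leaving the signing aside, the "base URL plus a dictionary of parameters" part can be sketched with the standard library alone. This is not the twurl code itself: `add_params` is a hypothetical helper, and the base URL shown is an assumed example, since the lecture only says it came from the Twitter API documentation.

```python
import urllib.parse

def add_params(url, parameters):
    """Append a dictionary of parameters to a URL as a query string."""
    return url + "?" + urllib.parse.urlencode(parameters)

# Assumed base URL for illustration only.
base = "https://api.twitter.com/1.1/statuses/user_timeline.json"

url = add_params(base, {"screen_name": "drchuck", "count": "2"})
print(url)
```

A signing routine would do this same step and then add the extra OAuth parameters, including the signature, to the query string.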
11439 08:32:00,720 --> 08:32:02,720 Lots of people get this to work, so don't worry. 11440 08:32:02,720 --> 08:32:06,720 It's kind of cool when you finally get it to work. 11441 08:32:06,720 --> 08:32:08,720 So let's take a look at what it does. 11442 08:32:08,720 --> 08:32:10,720 Just know that this makes an awesome URL 11443 08:32:10,720 --> 08:32:12,720 that does all the security. 11444 08:32:12,720 --> 08:32:15,720 And we'll see one of those URLs. 11445 08:32:15,720 --> 08:32:18,720 So ignore the certificate errors. 11446 08:32:18,720 --> 08:32:22,720 This has to do with the fact that we're using HTTPS 11447 08:32:22,720 --> 08:32:24,720 and Python doesn't have enough certificates 11448 08:32:24,720 --> 08:32:26,720 put into it by default for a lot of reasons, 11449 08:32:26,720 --> 08:32:29,720 but our quick and dirty way is to turn them off. 11450 08:32:29,720 --> 08:32:32,720 Thank you, Python, for reducing security by teaching us 11451 08:32:32,720 --> 08:32:34,720 so that this is the best way to do it. 11452 08:32:34,720 --> 08:32:37,720 That's a grumpy moment from on my part. 11453 08:32:37,720 --> 08:32:40,720 So what we're going to do is we're going to do a URL open. 11454 08:32:40,720 --> 08:32:43,720 This bit here is to shut off the security checking 11455 08:32:43,720 --> 08:32:45,720 for the SSL certificate. 11456 08:32:45,720 --> 08:32:47,720 And then we're going to read all the data. 11457 08:32:47,720 --> 08:32:49,720 And then we're going to print it out. 11458 08:32:49,720 --> 08:32:53,720 And we're also going to ask the connection, 11459 08:32:53,720 --> 08:32:56,720 this URL, remember I told you a long time ago 11460 08:32:56,720 --> 08:33:00,720 that URL lib eats the headers, but you can get them back. 11461 08:33:00,720 --> 08:33:02,720 And now we're going to ask to get a dictionary 11462 08:33:02,720 --> 08:33:03,720 of the headers back. 11463 08:33:03,720 --> 08:33:05,720 And so we'll print those out. 
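The retrieve step described above, with certificate checking switched off and the headers pulled back out as a dictionary, might be sketched like this. The helper name `fetch` is an assumption. Disabling verification mirrors the quick and dirty approach from the lecture and is not safe for real applications.

```python
import ssl
import urllib.request

def fetch(url):
    """Open a URL, returning the raw body (bytes) and the headers (dict)."""
    # Quick and dirty: ignore SSL certificate errors. Unsafe outside experiments.
    ctx = ssl.create_default_context()
    ctx.check_hostname = False          # must be disabled before verify_mode
    ctx.verify_mode = ssl.CERT_NONE

    with urllib.request.urlopen(url, context=ctx) as connection:
        body = connection.read()                    # raw bytes, not yet decoded
        headers = dict(connection.headers.items())  # header name -> value
    return body, headers
```

A call such as `body, headers = fetch(signed_url)` would then give you both halves of the response to print, where `signed_url` stands for whatever augmented URL you built.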
11464 08:33:05,720 --> 08:33:08,720 So this is really kind of just testing the body 11465 08:33:08,720 --> 08:33:10,720 and the headers and printing them out 11466 08:33:10,720 --> 08:33:12,720 in as raw a way as we can. 11467 08:33:12,720 --> 08:33:14,720 So let's go run this. 11468 08:33:14,720 --> 08:33:17,720 Now, this is going to fail the first time we do it 11469 08:33:17,720 --> 08:33:21,720 because we haven't put the hidden variables in there. 11470 08:33:21,720 --> 08:33:27,720 So if I say python3 twtest.py, it's going to run and blow up. 11471 08:33:27,720 --> 08:33:31,720 And it's going to give you this 401 authorization required. 11472 08:33:31,720 --> 08:33:34,720 That's a good sign because that means that you haven't yet 11473 08:33:34,720 --> 08:33:37,720 updated your values in hidden.py. 11474 08:33:37,720 --> 08:33:42,720 And so this is that augmented URL. 11475 08:33:42,720 --> 08:33:45,720 And you can see the consumer key and the consumer secret 11476 08:33:45,720 --> 08:33:48,720 and the OAuth token and whatever. 11477 08:33:48,720 --> 08:33:51,720 Okay, so these tokens are like wrong. 11478 08:33:51,720 --> 08:33:54,720 These aren't, oops, control C. 11479 08:33:54,720 --> 08:33:56,720 They aren't real. 11480 08:33:56,720 --> 08:33:58,720 But you'll notice it doesn't have the secrets, 11481 08:33:58,720 --> 08:34:01,720 the token secret and the consumer secret. 11482 08:34:01,720 --> 08:34:03,720 And that's all actually encoded in this signature. 11483 08:34:03,720 --> 08:34:07,720 It turns out that you need to have the key and the token, 11484 08:34:07,720 --> 08:34:11,720 I mean the secret and the token secret to generate the signature. 11485 08:34:11,720 --> 08:34:13,720 And where is the signature? 11486 08:34:13,720 --> 08:34:15,720 Oh, there's the signature, right? 11487 08:34:15,720 --> 08:34:16,720 There's the signature.
11488 08:34:16,720 --> 08:34:20,720 And so this signature combined with the nonce, 11489 08:34:20,720 --> 08:34:22,720 you can only do, this signature has a time 11490 08:34:22,720 --> 08:34:24,720 and includes all kinds of things. 11491 08:34:24,720 --> 08:34:28,720 So even if you type this in, well, you'll see these go by. 11492 08:34:28,720 --> 08:34:31,720 And it's not really breaking my security too much 11493 08:34:31,720 --> 08:34:32,720 when you see these afterwards. 11494 08:34:32,720 --> 08:34:34,720 So don't get all excited when you say, 11495 08:34:34,720 --> 08:34:37,720 oh, you revealed your token and your key. 11496 08:34:37,720 --> 08:34:39,720 Well, I can reveal my token and key, 11497 08:34:39,720 --> 08:34:41,720 but I'm not gonna reveal the secret. 11498 08:34:41,720 --> 08:34:45,720 Okay, so this adds all this OAuth stuff, OAuth nonce, 11499 08:34:45,720 --> 08:34:47,720 OAuth timestamp. 11500 08:34:47,720 --> 08:34:49,720 And these timestamps and nonces make it 11501 08:34:49,720 --> 08:34:53,720 so that you can't replay my URL even if you see the exact URL. 11502 08:34:53,720 --> 08:34:55,720 Once I hit it, then you can't hit it again. 11503 08:34:55,720 --> 08:34:57,720 And so that's what the nonce does. 11504 08:34:57,720 --> 08:35:00,720 So I'm gonna close hidden.py here. 11505 08:35:00,720 --> 08:35:05,720 And I'm going to update hidden.py in another window. 11506 08:35:20,720 --> 08:35:24,720 Okay, so I just, in another window, I updated hidden.py. 11507 08:35:24,720 --> 08:35:26,720 I'm not gonna show you that. 11508 08:35:26,720 --> 08:35:29,720 But now I'm gonna run python3 twtest.py. 11509 08:35:29,720 --> 08:35:32,720 So twurl is going to read hidden. 11510 08:35:32,720 --> 08:35:34,720 And now these keys and secrets are my real ones 11511 08:35:34,720 --> 08:35:36,720 that I haven't shown you. 11512 08:35:36,720 --> 08:35:37,720 So this should work. 11513 08:35:37,720 --> 08:35:38,720 Fingers crossed.
11514 08:35:40,720 --> 08:35:41,720 Yay, it worked. 11515 08:35:41,720 --> 08:35:44,720 Okay, so it worked. 11516 08:35:44,720 --> 08:35:46,720 So I'm calling Twitter. 11517 08:35:46,720 --> 08:35:47,720 Here's the URL. 11518 08:35:47,720 --> 08:35:51,720 Now don't worry, the token and the consumer key 11519 08:35:51,720 --> 08:35:53,720 are not enough to break into my account. 11520 08:35:53,720 --> 08:35:56,720 And neither is the signature because you can't replay this. 11521 08:35:56,720 --> 08:36:00,720 In about five minutes, you can't replay this anymore, okay? 11522 08:36:00,720 --> 08:36:04,720 So you can't generate the signature. 11523 08:36:04,720 --> 08:36:06,720 I've done one. 11524 08:36:06,720 --> 08:36:09,720 The signature includes the time and date. 11525 08:36:09,720 --> 08:36:14,720 So you can't, trust me, go read up on OAuth. 11526 08:36:14,720 --> 08:36:15,720 Don't worry. 11527 08:36:15,720 --> 08:36:16,720 I haven't really revealed anything. 11528 08:36:16,720 --> 08:36:18,720 But, so the first thing we see is this. 11529 08:36:18,720 --> 08:36:21,720 So we see, and we should put like the line of dashes here. 11530 08:36:21,720 --> 08:36:23,720 This is the JSON. 11531 08:36:23,720 --> 08:36:24,720 It ain't very pretty. 11532 08:36:24,720 --> 08:36:25,720 It's not very pretty. 11533 08:36:25,720 --> 08:36:28,720 Okay, and so that's the JSON from there to there. 11534 08:36:28,720 --> 08:36:30,720 It's just what most APIs give us back. 11535 08:36:30,720 --> 08:36:32,720 It's really dense JSON, right? 11536 08:36:32,720 --> 08:36:34,720 And so this is a byte array. 11537 08:36:34,720 --> 08:36:37,720 Remember how you have to do a .decode()? 11538 08:36:37,720 --> 08:36:39,720 I didn't do a .decode() here.
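The difference between the raw bytes that come back from the network and a decoded string can be seen in a few lines of Python. This is a small standalone sketch; the sample text is made up, and the UTF-8 encoding is the assumption the lecture also makes.

```python
# Simulate what arrives from the network: UTF-8 encoded bytes.
raw = '{"text": "caf\u00e9"}'.encode("utf-8")

print(type(raw))   # a bytes object
print(raw)         # printed with a leading b and escaped non-ASCII bytes

# Decoding turns the bytes into a real Unicode string.
text = raw.decode("utf-8")

print(type(text))  # a str object
print(text)
```

Non-ASCII characters take more than one byte in UTF-8, so the byte count and the character count can differ, which is why the decode step matters before treating the data as text.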
11539 08:36:39,720 --> 08:36:41,720 And so Python is telling us, 11540 08:36:41,720 --> 08:36:44,720 this is a byte array, which is a raw set of bytes 11541 08:36:44,720 --> 08:36:49,720 that came from the internet, which probably are UTF-8. 11542 08:36:49,720 --> 08:36:52,720 And if I put a decode here, then it would decode, 11543 08:36:52,720 --> 08:36:56,720 if I say data.decode() there, then it would be fine. 11544 08:36:56,720 --> 08:36:57,720 But we don't care. 11545 08:36:57,720 --> 08:36:58,720 This was just a dump. 11546 08:36:58,720 --> 08:36:59,720 Do we get anything? 11547 08:36:59,720 --> 08:37:03,720 And so then, here, let's do this. 11548 08:37:03,720 --> 08:37:04,720 Print. 11549 08:37:04,720 --> 08:37:07,720 I'll just make this code different. 11550 08:37:07,720 --> 08:37:10,720 Put some equal signs here, a lot of equal signs. 11551 08:37:10,720 --> 08:37:16,720 So we can easily see where the thing starts and stops. 11552 08:37:16,720 --> 08:37:17,720 So we'll run that again. 11553 08:37:17,720 --> 08:37:19,720 If you look at those URLs. 11554 08:37:19,720 --> 08:37:21,720 So that was all of that stuff. 11555 08:37:21,720 --> 08:37:26,720 And then, this is the headers. 11556 08:37:26,720 --> 08:37:28,720 And so the headers, again, are not pretty. 11557 08:37:28,720 --> 08:37:30,720 If you get the headers, it's a dictionary. 11558 08:37:30,720 --> 08:37:33,720 You got cache-control, no-cache, comma. 11559 08:37:33,720 --> 08:37:36,720 This is the string, key value. 11560 08:37:36,720 --> 08:37:38,720 You got to find your commas, key value. 11561 08:37:38,720 --> 08:37:43,720 But the one that's really interesting here is, 11562 08:37:43,720 --> 08:37:45,720 which one is it? 11563 08:37:45,720 --> 08:37:47,720 x-rate-limit-remaining, right there. 11564 08:37:47,720 --> 08:37:49,720 x-rate-limit-remaining.
11565 08:37:49,720 --> 08:37:52,720 So that means that for this particular API, 11566 08:37:52,720 --> 08:37:57,720 and this header tells me that I've got 898 calls left. 11567 08:37:57,720 --> 08:38:00,720 And this is when I will get more calls, 11568 08:38:00,720 --> 08:38:05,720 and yeah, so let's see, yeah. 11569 08:38:05,720 --> 08:38:06,720 So watch. 11570 08:38:06,720 --> 08:38:07,720 I'm going to do this again, 11571 08:38:07,720 --> 08:38:11,720 and you will see that I can only do this 897 more times now. 11572 08:38:11,720 --> 08:38:13,720 Do, do, do, run it. 11573 08:38:13,720 --> 08:38:15,720 I can only do this 897. 11574 08:38:15,720 --> 08:38:18,720 So I am being tracked at this point. 11575 08:38:18,720 --> 08:38:20,720 I am being tracked by Twitter. 11576 08:38:20,720 --> 08:38:23,720 Twitter knows that it's Dr. Chuck that's doing this, 11577 08:38:23,720 --> 08:38:26,720 and Dr. Chuck has done 900. 11578 08:38:26,720 --> 08:38:28,720 He's done 899, 897. 11579 08:38:28,720 --> 08:38:30,720 And if I keep running this, 11580 08:38:30,720 --> 08:38:32,720 eventually Twitter will tell me, 11581 08:38:32,720 --> 08:38:34,720 you got to wait for a while. 11582 08:38:34,720 --> 08:38:36,720 And that's because Twitter doesn't want me, 11583 08:38:36,720 --> 08:38:37,720 under my Dr. Chuck account, 11584 08:38:37,720 --> 08:38:40,720 pulling out like lots and lots of stuff out of Twitter 11585 08:38:40,720 --> 08:38:43,720 and making my own website. 11586 08:38:43,720 --> 08:38:45,720 I do actually have my own Twitter website, 11587 08:38:45,720 --> 08:38:47,720 using some cool software. 11588 08:38:47,720 --> 08:38:58,720 www.drchuck.com slash Twitter. 11589 08:38:58,720 --> 08:38:59,720 And this I have to run, 11590 08:38:59,720 --> 08:39:02,720 and it rate limits and causes all kinds of, you know, whatever. 11591 08:39:02,720 --> 08:39:08,720 So, okay, so Twitter rate limit. 11592 08:39:08,720 --> 08:39:11,720 So, I'll save that. 
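Reading the remaining-calls count out of the headers dictionary could be sketched like this. The helper name `calls_remaining` is an assumption, and the sample headers are made up to match the numbers mentioned in the lecture. Header names are compared case-insensitively because servers may capitalize them differently.

```python
def calls_remaining(headers):
    """Return the x-rate-limit-remaining value as an int, or None if absent."""
    for name, value in headers.items():
        if name.lower() == "x-rate-limit-remaining":
            return int(value)
    return None

# Hypothetical headers resembling what the lecture prints out.
headers = {"cache-control": "no-cache", "x-rate-limit-remaining": "898"}
print("Remaining", calls_remaining(headers))
```

A program that makes many calls could check this number and pause when it gets close to zero instead of waiting for the API to refuse the request.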
11593 08:39:11,720 --> 08:39:12,720 So that's tweet. 11594 08:39:12,720 --> 08:39:14,720 This is just to test it, okay? 11595 08:39:14,720 --> 08:39:16,720 Because we're doing, I want to do something interesting. 11596 08:39:16,720 --> 08:39:19,720 So we're not parsing the JSON that comes back. 11597 08:39:19,720 --> 08:39:21,720 We're not doing anything tricky with this. 11598 08:39:21,720 --> 08:39:23,720 And away we go. 11599 08:39:23,720 --> 08:39:28,720 So, let's take a look at some more code. 11600 08:39:28,720 --> 08:39:30,720 I think I don't need this anymore. 11601 08:39:30,720 --> 08:39:34,720 So now, I am going to parse this. 11602 08:39:34,720 --> 08:39:36,720 So most of this looks the same. 11603 08:39:36,720 --> 08:39:38,720 I've got that same user timeline JSON. 11604 08:39:38,720 --> 08:39:40,720 I'm going to ignore the SSL certificates. 11605 08:39:40,720 --> 08:39:41,720 I'm going to write a loop. 11606 08:39:41,720 --> 08:39:43,720 So I'm going to ask the Twitter, 11607 08:39:43,720 --> 08:39:48,720 I'm going to print, 11608 08:39:48,720 --> 08:39:50,720 I'm going to get a Twitter account and quit 11609 08:39:50,720 --> 08:39:52,720 if it's a blank line or if I had to enter it. 11610 08:39:52,720 --> 08:39:55,720 I'm going to use the Twitter URL augment the same way. 11611 08:39:55,720 --> 08:39:58,720 That's going to do all the signing using from hidden.py. 11612 08:39:58,720 --> 08:39:59,720 I retrieve it. 11613 08:39:59,720 --> 08:40:02,720 And I'm going to retrieve it, ignoring the SSL errors. 11614 08:40:02,720 --> 08:40:03,720 And then I'm going to decode it. 11615 08:40:03,720 --> 08:40:04,720 This time I'm going to decode it 11616 08:40:04,720 --> 08:40:06,720 so that I get a real Unicode string. 11617 08:40:06,720 --> 08:40:08,720 And I'm going to print the first 250 characters of it. 11618 08:40:08,720 --> 08:40:10,720 I'm going to grab the headers. 
11619 08:40:10,720 --> 08:40:15,720 And I'm going to print the remaining, the rate limit. 11620 08:40:15,720 --> 08:40:20,720 So this is sort of a very simple version of this same thing. 11621 08:40:20,720 --> 08:40:22,720 It really is decoding the data 11622 08:40:22,720 --> 08:40:24,720 and only printing the first 250 characters. 11623 08:40:24,720 --> 08:40:36,720 So let's run that. 11624 08:40:36,720 --> 08:40:40,720 Dr. Chuck, boom, and it's got 896. 11625 08:40:40,720 --> 08:40:43,720 So that's just a little simpler version of that 11626 08:40:43,720 --> 08:40:45,720 with a little less brutal debugging. 11627 08:40:45,720 --> 08:40:47,720 Okay, so now let's do something even more fun. 11628 08:40:47,720 --> 08:40:51,720 Let's go to twitter2.py and tear it apart. 11629 08:40:51,720 --> 08:40:55,720 And so again, we're going to look at my friends list 11630 08:40:55,720 --> 08:40:58,720 or someone else, anybody's friends list. 11631 08:40:58,720 --> 08:41:00,720 We're going to ask for the friends 11632 08:41:00,720 --> 08:41:02,720 and ask for the screen name, 11633 08:41:02,720 --> 08:41:04,720 ask for the first five friends, 11634 08:41:04,720 --> 08:41:07,720 and then look at their statuses, 11635 08:41:07,720 --> 08:41:09,720 open it, decode it, get the headers, 11636 08:41:09,720 --> 08:41:10,720 print the rate limit 11637 08:41:10,720 --> 08:41:13,720 remaining. All this stuff is the same as in twitter1. 11638 08:41:13,720 --> 08:41:16,720 But now we're going to parse the JSON. 11639 08:41:16,720 --> 08:41:18,720 I'm not even putting this in a try and except 11640 08:41:18,720 --> 08:41:20,720 because, hey, I'm talking to Twitter. 11641 08:41:20,720 --> 08:41:24,720 I'm going to guess that Twitter's going to give me the right stuff. 11642 08:41:24,720 --> 08:41:26,720 You'll probably want to put a try and except here. 11643 08:41:26,720 --> 08:41:28,720 Then I'm going to do a debug print.
11644 08:41:28,720 --> 08:41:31,720 I'm going to do a JSON pretty print. 11645 08:41:31,720 --> 08:41:33,720 Let's make that be 2 so it looks a little better. 11646 08:41:33,720 --> 08:41:36,720 And then, well, I'm going to run it 11647 08:41:36,720 --> 08:41:39,720 and then you're going to see how we have to parse this 11648 08:41:39,720 --> 08:41:41,720 and we're going to see that it's a list. 11649 08:41:41,720 --> 08:41:43,720 So we're done with that. 11650 08:41:43,720 --> 08:41:46,720 And now we're running Twitter2.py. 11651 08:41:46,720 --> 08:41:48,720 So I'm going to go to Dr. Chuck 11652 08:41:48,720 --> 08:41:51,720 and this is going to ask the question 11653 08:41:51,720 --> 08:41:53,720 who Dr. Chuck's friends are. 11654 08:41:53,720 --> 08:41:56,720 Okay, let's go to the top. 11655 08:41:56,720 --> 08:42:01,720 So it hit this API and it has the screen name 11656 08:42:01,720 --> 08:42:04,720 Dr. Chuck count equals 5 and all this OAuth stuff. 11657 08:42:04,720 --> 08:42:08,720 Again, this is not a security breach by showing you all of this 11658 08:42:08,720 --> 08:42:11,720 because the signature, the secrets aren't there. 11659 08:42:11,720 --> 08:42:18,720 Okay, so if we look at it, it's an outer object or dictionary 11660 08:42:18,720 --> 08:42:23,720 and then the outer has a users which is a list. 11661 08:42:23,720 --> 08:42:25,720 And then each user has some stuff in it. 11662 08:42:25,720 --> 08:42:27,720 So this one's Stephanie Teasley. 11663 08:42:27,720 --> 08:42:29,720 It's got her screen name. 11664 08:42:29,720 --> 08:42:31,720 It's got some descriptions. 11665 08:42:31,720 --> 08:42:32,720 Keep on going. 11666 08:42:32,720 --> 08:42:35,720 It's got her status, her latest status. 11667 08:42:35,720 --> 08:42:37,720 For my friend, her status. 11668 08:42:37,720 --> 08:42:40,720 Her source, where she's at. 11669 08:42:40,720 --> 08:42:43,720 I don't know, man, she's got a lot of stuff here. 
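The decode, parse, and pretty-print steps described above can be sketched as follows. The `raw` bytes are a tiny invented stand-in for what the network read returns, and the screen name and status text are hypothetical; the overall shape (an outer dictionary with a `users` list) follows the description in the lecture.

```python
import json

# Hypothetical raw bytes standing in for the data read from the network.
raw = b'{"users": [{"screen_name": "friend_one", "status": {"text": "Hello"}}]}'

data = raw.decode()                  # bytes -> str (UTF-8 is the default)
js = json.loads(data)                # str -> nested Python dicts and lists
pretty = json.dumps(js, indent=2)    # indent=2 gives the compact pretty print
print(pretty)

print(type(js).__name__)             # the outer object is a dict
print(type(js['users']).__name__)    # its 'users' entry is a list
```

Wrapping the `json.loads` call in a try and except is a reasonable addition when the data source is less predictable.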
11670 08:42:43,720 --> 08:42:44,720 Okay, there we go. 11671 08:42:44,720 --> 08:42:46,720 That was the first one. 11672 08:42:46,720 --> 08:42:50,720 And then the next one that I'm following is live EDU, etc. 11673 08:42:50,720 --> 08:42:52,720 So you'll see that this is an array. 11674 08:42:52,720 --> 08:42:55,720 So inside that outer thing is an array of users. 11675 08:42:55,720 --> 08:43:00,720 Now, js here is a dictionary. 11676 08:43:00,720 --> 08:43:03,720 So I can say for u in js['users']. 11677 08:43:03,720 --> 08:43:05,720 Well, js['users'] is a list. 11678 08:43:05,720 --> 08:43:08,720 So the first u is gonna be this Stephanie Teasley u 11679 08:43:08,720 --> 08:43:10,720 and the second u is gonna be live EDU. 11680 08:43:10,720 --> 08:43:14,720 So that's all it took to get through all that stuff 11681 08:43:14,720 --> 08:43:16,720 and figure that out. 11682 08:43:16,720 --> 08:43:21,720 And then I'm gonna say, get me the screen name of my person. 11683 08:43:21,720 --> 08:43:22,720 So let's go in here. 11684 08:43:22,720 --> 08:43:27,720 So that's gonna pull Stephanie Teasley out. 11685 08:43:27,720 --> 08:43:31,720 Then I'm gonna go find her status. 11686 08:43:31,720 --> 08:43:35,720 Let's find her somewhere in here. 11687 08:43:35,720 --> 08:43:39,720 u['status']['text']. 11688 08:43:39,720 --> 08:43:40,720 Come on. 11689 08:43:40,720 --> 08:43:41,720 Okay, there's status. 11690 08:43:41,720 --> 08:43:45,720 status is all this stuff. 11691 08:43:45,720 --> 08:43:48,720 More, more, more, more, more, more, more. 11692 08:43:48,720 --> 08:43:50,720 Right there, that's status. 11693 08:43:50,720 --> 08:43:52,720 u['status'] is that. 11694 08:43:52,720 --> 08:43:56,720 And then u['status']['text'] is this stuff. 11695 08:43:56,720 --> 08:44:00,720 So it's gonna extract this bit right here. 11696 08:44:00,720 --> 08:44:03,720 And so u['status']['text'].
11697 08:44:03,720 --> 08:44:05,720 And I print out the first 50 characters 11698 08:44:05,720 --> 08:44:07,720 of that screen name's status. 11699 08:44:07,720 --> 08:44:10,720 And I do that for the first five 11700 08:44:10,720 --> 08:44:15,720 because I told it I only wanted five. 11701 08:44:15,720 --> 08:44:17,720 And then of course I get to see the rate limit. 11702 08:44:17,720 --> 08:44:18,720 So let's go down to the bottom. 11703 08:44:18,720 --> 08:44:21,720 So all of this is the debug print 11704 08:44:21,720 --> 08:44:23,720 of the JSON I got back. 11705 08:44:23,720 --> 08:44:25,720 Here is the program starting to print. 11706 08:44:25,720 --> 08:44:27,720 Here is the screen name of my first friend. 11707 08:44:27,720 --> 08:44:29,720 And here's the first 50 characters 11708 08:44:29,720 --> 08:44:31,720 of her most recent status. 11709 08:44:31,720 --> 08:44:34,720 Here is the screen name of my, 11710 08:44:34,720 --> 08:44:37,720 and these are in reverse order of who I've been following. 11711 08:44:37,720 --> 08:44:39,720 So I've been playing with this live coding stuff. 11712 08:44:39,720 --> 08:44:42,720 So I'm following them. 11713 08:44:42,720 --> 08:44:44,720 What? 11714 08:44:44,720 --> 08:44:51,720 KeyError 'status', that didn't work. 11715 08:44:51,720 --> 08:44:56,720 Why not? 11716 08:44:56,720 --> 08:44:59,720 Oh, that's because live coding TV 11717 08:44:59,720 --> 08:45:01,720 somehow doesn't have a status. 11718 08:45:01,720 --> 08:45:03,720 So most of these work, 11719 08:45:03,720 --> 08:45:05,720 so now you'll get to see me fix something. 11720 08:45:05,720 --> 08:45:08,720 And when you download it, it'll be fixed. 11721 08:45:08,720 --> 08:45:09,720 And so it says KeyError 'status'. 11722 08:45:09,720 --> 08:45:13,720 So that means that I've got to do a thing that says, 11723 08:45:13,720 --> 08:45:34,720 if 'status' not in u, print 'No status found'. 11724 08:45:34,720 --> 08:45:36,720 Continue.
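A small sketch of the fix described here: loop through the users list, skip any user with no status entry, and otherwise show the first 50 characters of the status text. The `js` data and the `summarize` helper are invented for illustration; the real program prints directly instead of collecting lines.

```python
# Hypothetical parsed JSON: one friend with a status, one without.
js = {
    'users': [
        {'screen_name': 'friend_one', 'status': {'text': 'A' * 80}},
        {'screen_name': 'friend_two'},  # no 'status' key at all
    ]
}


def summarize(js):
    """Collect the screen name and a short status line for each user."""
    lines = []
    for u in js['users']:
        lines.append(u['screen_name'])
        if 'status' not in u:
            # Without this guard, u['status'] raises KeyError: 'status'.
            lines.append('   * No status found')
            continue
        s = u['status']['text']
        lines.append('  ' + s[:50])   # only the first 50 characters
    return lines


for line in summarize(js):
    print(line)
```

Checking with `in` before indexing is one way to handle the missing key; `u.get('status')` or a try and except around the lookup would work as well.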
11725 08:45:36,720 --> 08:45:38,720 Since sometimes there's no statuses. 11726 08:45:38,720 --> 08:45:39,720 Who would have thought? 11727 08:45:39,720 --> 08:45:42,720 I did not know that. 11728 08:45:42,720 --> 08:45:48,720 Yeah, so you. 11729 08:45:48,720 --> 08:45:51,720 Okay, so let's run this again. 11730 08:45:51,720 --> 08:45:55,720 Did I get to see my remaining? 11731 08:45:55,720 --> 08:45:58,720 Actually, let me change the order of this. 11732 08:45:58,720 --> 08:46:01,720 Let me put this down here. 11733 08:46:01,720 --> 08:46:03,720 That'll be wrong from the slides, 11734 08:46:03,720 --> 08:46:08,720 but it'll be prettier now. 11735 08:46:08,720 --> 08:46:14,720 Let's put the headers after the dump of the data. 11736 08:46:14,720 --> 08:46:17,720 Okay, so let's run it again. 11737 08:46:17,720 --> 08:46:18,720 Did I save it? 11738 08:46:18,720 --> 08:46:19,720 Yeah. 11739 08:46:19,720 --> 08:46:21,720 Dr. Chuck. 11740 08:46:21,720 --> 08:46:22,720 Blah. 11741 08:46:22,720 --> 08:46:23,720 Whole bunch of stuff. 11742 08:46:23,720 --> 08:46:25,720 So I got 13 remaining calls on this one. 11743 08:46:25,720 --> 08:46:27,720 So it's not the same as the other one. 11744 08:46:27,720 --> 08:46:29,720 I don't get to call this too many more times, 11745 08:46:29,720 --> 08:46:32,720 so hopefully I'll get the debugging to work. 11746 08:46:32,720 --> 08:46:35,720 Sort of. 11747 08:46:35,720 --> 08:46:37,720 I got a bad space here. 11748 08:46:37,720 --> 08:46:39,720 No, not status found. 11749 08:46:39,720 --> 08:46:40,720 No status found. 11750 08:46:40,720 --> 08:46:43,720 And I need to put three spaces there. 11751 08:46:43,720 --> 08:46:44,720 No status found. 11752 08:46:44,720 --> 08:46:45,720 I'll make an asterisk. 11753 08:46:45,720 --> 08:46:48,720 So let's run it again. 11754 08:46:48,720 --> 08:46:50,720 See, I got 13 remaining. 11755 08:46:50,720 --> 08:46:53,720 So it's important you write code that's aware of your remaining. 
11756 08:46:53,720 --> 08:46:56,720 That's why I made it so obvious. 11757 08:46:56,720 --> 08:46:57,720 I'll retrieve all that. 11758 08:46:57,720 --> 08:47:00,720 I got 12 remaining, but my code starts to look. 11759 08:47:00,720 --> 08:47:01,720 Dang it. 11760 08:47:01,720 --> 08:47:03,720 I now have another space here. 11761 08:47:03,720 --> 08:47:04,720 Hang on. 11762 08:47:04,720 --> 08:47:06,720 Got to fix that. 11763 08:47:06,720 --> 08:47:08,720 I need yet another space. 11764 08:47:08,720 --> 08:47:23,720 Hopefully, I can make this as pretty as I want it to be. 11765 08:47:23,720 --> 08:47:25,720 Oh, wait a sec. 11766 08:47:25,720 --> 08:47:26,720 I didn't even do Dr. Chuck. 11767 08:47:26,720 --> 08:47:28,720 I did that wrong. 11768 08:47:28,720 --> 08:47:29,720 Typed my name wrong. 11769 08:47:29,720 --> 08:47:30,720 OK. 11770 08:47:30,720 --> 08:47:33,720 So now it works. 11771 08:47:33,720 --> 08:47:34,720 Oh, well. 11772 08:47:34,720 --> 08:47:40,720 So now I have my five most recent friends, and they are these. 11773 08:47:40,720 --> 08:47:42,720 Steph Teasley, live EDU official, 11774 08:47:42,720 --> 08:47:47,720 LiveCodingTV, Nancy Gilby, and Greg E. Kruger. 11775 08:47:47,720 --> 08:47:49,720 And so there are their statuses. 11776 08:47:49,720 --> 08:47:55,720 And I tore all this JSON apart using twitter2.py. 11777 08:47:55,720 --> 08:48:01,720 Of course, after fixing hidden.py, which I'm not going to show you, 11778 08:48:01,720 --> 08:48:05,720 because it actually contains my real consumer key and consumer secret, 11779 08:48:05,720 --> 08:48:10,720 you're seeing the consumer key and the token key go by on each of these URLs. 11780 08:48:10,720 --> 08:48:12,720 But what you're not seeing is these two things, 11781 08:48:12,720 --> 08:48:16,720 which are the thing I'm protecting, so that it's not a problem. 11782 08:48:16,720 --> 08:48:17,720 OK. 11783 08:48:17,720 --> 08:48:19,720 So I will send that up.
11784 08:48:19,720 --> 08:48:21,720 But there you go. 11785 08:48:21,720 --> 08:48:22,720 Welcome. 11786 08:48:22,720 --> 08:48:25,720 I hope you found this useful. 11787 08:48:25,720 --> 08:48:27,720 The code will be fixed when you take a look at it 11788 08:48:27,720 --> 08:48:31,720 and download it here from samplecode.zip. 11789 08:48:34,720 --> 08:48:36,720 Hello, and welcome to Python Objects. 11790 08:48:36,720 --> 08:48:39,720 I'm Charles Severance, and we're well on our way 11791 08:48:39,720 --> 08:48:43,720 to getting through all this material in Python. 11792 08:48:43,720 --> 08:48:45,720 So this lecture is in a weird place. 11793 08:48:45,720 --> 08:48:48,720 I even debated where to put it in the book. 11794 08:48:48,720 --> 08:48:52,720 I don't really want to teach you how to write a lot of object-oriented programming, 11795 08:48:52,720 --> 08:48:55,720 but we're going to start using objects. 11796 08:48:55,720 --> 08:48:58,720 And I want to be able to use the terminology. 11797 08:48:58,720 --> 08:49:01,720 And so as much as anything, this lecture is about terminology 11798 08:49:01,720 --> 08:49:04,720 and understanding the words, things like methods 11799 08:49:04,720 --> 08:49:08,720 and method signatures and variables and inheritance. 11800 08:49:08,720 --> 08:49:10,720 And so think of this as a terminology lecture 11801 08:49:10,720 --> 08:49:14,720 rather than a learn-how-to program or learn-how-to use this. 11802 08:49:14,720 --> 08:49:17,720 It's not something you're going to figure out right away. 11803 08:49:17,720 --> 08:49:19,720 And there'll come a time when you as a programmer 11804 08:49:19,720 --> 08:49:21,720 really want to start using object-oriented programming. 11805 08:49:21,720 --> 08:49:24,720 It's really a powerful and wonderful technique. 11806 08:49:24,720 --> 08:49:27,720 But I think it's too early as a beginning programmer 11807 08:49:27,720 --> 08:49:30,720 to really say, oh, let's write a bunch of objects. 
11808 08:49:30,720 --> 08:49:34,720 So just relax and enjoy and learn this material 11809 08:49:34,720 --> 08:49:38,720 and think of it as sort of a theoretical thing 11810 08:49:38,720 --> 08:49:43,720 rather than a how-to program thing. 11811 08:49:43,720 --> 08:49:47,720 And so part of this is we're going to start reading data structures 11812 08:49:47,720 --> 08:49:52,720 and data on how to use all these libraries, etc. 11813 08:49:52,720 --> 08:49:54,720 And we're going to see the word objects, right? 11814 08:49:54,720 --> 08:49:56,720 And then we're going to start hearing them. 11815 08:49:56,720 --> 08:49:58,720 And I want you to be able to read the Python documentation 11816 08:49:58,720 --> 08:50:00,720 so that you understand what's going on. 11817 08:50:00,720 --> 08:50:04,720 And so the word objects should make sense to you 11818 08:50:04,720 --> 08:50:06,720 even though you're not going to write a lot of objects 11819 08:50:06,720 --> 08:50:07,720 or any programming. 11820 08:50:07,720 --> 08:50:11,720 And so page upon page upon page, database stuff, 11821 08:50:11,720 --> 08:50:13,720 which we're going to talk about soon, 11822 08:50:13,720 --> 08:50:15,720 uses objects all over the place. 11823 08:50:15,720 --> 08:50:18,720 And the beautiful soup uses objects. 11824 08:50:18,720 --> 08:50:21,720 We've kind of been using them, and I've been waving my hands 11825 08:50:21,720 --> 08:50:23,720 and I use the word method without defining it. 11826 08:50:23,720 --> 08:50:28,720 But now it's really time to define it and go to it. 11827 08:50:28,720 --> 08:50:33,720 So I want to review from the very beginning 11828 08:50:33,720 --> 08:50:35,720 what we think of as a program. 
11829 08:50:35,720 --> 08:50:38,720 So the classic program, my favorite little minimum program, 11830 08:50:38,720 --> 08:50:41,720 is our little elevator floor converter, 11831 08:50:41,720 --> 08:50:43,720 which converts from European elevator floors 11832 08:50:43,720 --> 08:50:45,720 to United States elevator floors. 11833 08:50:45,720 --> 08:50:50,720 And the key to this is that it's input, processing, and output. 11834 08:50:50,720 --> 08:50:54,720 And this is a good way to model any program. 11835 08:50:54,720 --> 08:50:57,720 And in that process, we've got variables, 11836 08:50:57,720 --> 08:51:00,720 and we've got logic, we've got algorithms, 11837 08:51:00,720 --> 08:51:02,720 we've got loops that we write, we've got all kinds of things. 11838 08:51:02,720 --> 08:51:09,720 And we construct a series of steps to achieve some goal. 11839 08:51:09,720 --> 08:51:11,720 In object-oriented, and frankly, 11840 08:51:11,720 --> 08:51:13,720 you've been using object-oriented all along, 11841 08:51:13,720 --> 08:51:16,720 the program has lots of objects. 11842 08:51:16,720 --> 08:51:18,720 And we're sort of putting stuff into these objects, 11843 08:51:18,720 --> 08:51:21,720 taking stuff out of one object and putting it into another object. 11844 08:51:21,720 --> 08:51:23,720 And you've actually been doing this all along. 11845 08:51:23,720 --> 08:51:25,720 As soon as you're looking at dictionaries and lists, 11846 08:51:25,720 --> 08:51:27,720 you're doing objects. 11847 08:51:27,720 --> 08:51:31,720 And so an object is quite a little thing. 11848 08:51:31,720 --> 08:51:34,720 It's sort of its own little space inside of a program 11849 08:51:34,720 --> 08:51:38,720 that contains code and data. 11850 08:51:38,720 --> 08:51:40,720 And so we're working together. 11851 08:51:40,720 --> 08:51:42,720 All these objects are now working together. 11852 08:51:42,720 --> 08:51:44,720 It's a bit of self-contained code and data. 
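Going back to the elevator floor converter mentioned a moment ago, here is a rough sketch of the input, processing, output pattern. The function name `europe_to_us` is made up for this example, and a fixed string stands in for what the user would type so the snippet runs without a prompt.

```python
def europe_to_us(floor):
    """Processing step: European floor 0 is United States floor 1."""
    return floor + 1


# Input: a fixed string stands in for input('Europe floor? ').
inp = '0'

# Processing: convert the text to a number and apply the rule.
usf = europe_to_us(int(inp))

# Output: show the result.
print('US floor', usf)
```

Even this tiny program already touches objects: `inp` is a string object and `usf` is an integer object.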
11853 08:51:44,720 --> 08:51:48,720 And it is one way to take a very complex problem 11854 08:51:48,720 --> 08:51:52,720 and make it easier by breaking it into separate things 11855 08:51:52,720 --> 08:51:54,720 that can be engineered and developed separately. 11856 08:51:54,720 --> 08:51:56,720 So you'd be using string objects, 11857 08:51:56,720 --> 08:51:58,720 or maybe you'd use beautiful soup or something. 11858 08:51:58,720 --> 08:52:00,720 These are powerful capabilities, 11859 08:52:00,720 --> 08:52:02,720 and if you had to look at all of them, 11860 08:52:02,720 --> 08:52:04,720 it's just, hey, here's a thing, use this object, 11861 08:52:04,720 --> 08:52:06,720 it'll do these things for you, 11862 08:52:06,720 --> 08:52:08,720 and there's lots of details inside of it. 11863 08:52:08,720 --> 08:52:10,720 Just don't look at it, don't worry about it. 11864 08:52:10,720 --> 08:52:12,720 And so there's boundaries, the things that you can use, 11865 08:52:12,720 --> 08:52:14,720 things that you can look at, 11866 08:52:14,720 --> 08:52:16,720 and things that really you don't bother looking at. 11867 08:52:16,720 --> 08:52:18,720 You go read the documentation and use it, 11868 08:52:18,720 --> 08:52:20,720 and away it goes. 11869 08:52:20,720 --> 08:52:23,720 But then someone had to write that, and so they built an object. 11870 08:52:23,720 --> 08:52:25,720 So what we're going to do is look a little bit 11871 08:52:25,720 --> 08:52:32,720 under the covers of what it takes to build some of these objects. 11872 08:52:32,720 --> 08:52:34,720 And so if we think of this program 11873 08:52:34,720 --> 08:52:36,720 that originally just sort of did processing, 11874 08:52:36,720 --> 08:52:39,720 we can think of it as having some kind of an input, right, 11875 08:52:39,720 --> 08:52:41,720 coming into our program. 
11876 08:52:41,720 --> 08:52:43,720 And we have a string object, a dictionary object, 11877 08:52:43,720 --> 08:52:46,720 maybe eventually some objects like a database object 11878 08:52:46,720 --> 08:52:48,720 or an object that we eventually define. 11879 08:52:48,720 --> 08:52:50,720 And you can think of us, we're receiving data, 11880 08:52:50,720 --> 08:52:52,720 it comes in an object, which is a string object, 11881 08:52:52,720 --> 08:52:55,720 or you start putting the strings in dictionaries 11882 08:52:55,720 --> 08:52:58,720 and do whatever, we pull out a list of them, 11883 08:52:58,720 --> 08:53:01,720 and so you can think of data as moving between these objects. 11884 08:53:01,720 --> 08:53:05,720 And like I say, even strings, in the first week, 11885 08:53:05,720 --> 08:53:08,720 first lecture, first week, first everything, 11886 08:53:08,720 --> 08:53:13,720 we were using objects, and we've been using them all along. 11887 08:53:13,720 --> 08:53:16,720 And so you can think of every string and every dictionary 11888 08:53:16,720 --> 08:53:20,720 as a little program all by itself that has a bit of code 11889 08:53:20,720 --> 08:53:22,720 and a bit of data. 11890 08:53:22,720 --> 08:53:25,720 And so a string has the data, which includes all the characters 11891 08:53:25,720 --> 08:53:28,720 that make up the string, but then there is a method called 11892 08:53:28,720 --> 08:53:31,720 upper that does uppercase, or rstrip, 11893 08:53:31,720 --> 08:53:34,720 that strips off the right white space from the right. 11894 08:53:34,720 --> 08:53:36,720 And so it's like they're almost little programs 11895 08:53:36,720 --> 08:53:38,720 that have inputs and outputs themselves, 11896 08:53:38,720 --> 08:53:40,720 and we can make lots of them. 11897 08:53:40,720 --> 08:53:45,720 And there's lots of cooperating objects that make up an application. 
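To make the "string as a little program with code and data" idea concrete, here is a short sketch. The sample text is arbitrary; `upper` and `rstrip` are the two methods named above.

```python
# The data: the characters that make up the string (note the trailing whitespace).
s = 'Hello world   \n'

# The code: methods that live inside the string object.
shouted = s.upper()     # a new string, uppercased
trimmed = s.rstrip()    # a new string with whitespace stripped from the right

print(repr(shouted))
print(repr(trimmed))
print(repr(s))          # the original string object is unchanged
print(type(s))          # every string is an instance of the str class
```

Each method takes the object's own data as its input and hands back a result, which is why it can feel like calling a small program that lives inside the string.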
11898 08:53:45,720 --> 08:53:48,720 And one of the nice things about the object-oriented pattern 11899 08:53:48,720 --> 08:53:53,720 is that they form boundaries, and within the boundary, 11900 08:53:53,720 --> 08:53:55,720 if you're inside the object, you can say, 11901 08:53:55,720 --> 08:53:58,720 look, I'm going to build you a string object or a database object 11902 08:53:58,720 --> 08:54:01,720 or a beautiful soup object, and I'm going to build this capability 11903 08:54:01,720 --> 08:54:03,720 and I'm going to give it to you in the form of an interface, 11904 08:54:03,720 --> 08:54:05,720 and I'm not really going to care how you use it. 11905 08:54:05,720 --> 08:54:08,720 And so we have this sort of visibility wall 11906 08:54:08,720 --> 08:54:11,720 where I'm going to make an object and I'm going to let you use it, 11907 08:54:11,720 --> 08:54:14,720 and the maker of the object doesn't necessarily have to know 11908 08:54:14,720 --> 08:54:17,720 every single thing about the use of that object. 11909 08:54:17,720 --> 08:54:20,720 But so just like inside the object, they don't have to worry 11910 08:54:20,720 --> 08:54:23,720 about what you're doing with the object outside of it. 11911 08:54:23,720 --> 08:54:25,720 When you're outside the object, you don't have to worry 11912 08:54:25,720 --> 08:54:27,720 about what's going on inside of it. 11913 08:54:27,720 --> 08:54:30,720 We, as the user of the object, we talk to its interface 11914 08:54:30,720 --> 08:54:32,720 and we get things from it and give things to it 11915 08:54:32,720 --> 08:54:34,720 and use functionality within that object, 11916 08:54:34,720 --> 08:54:36,720 but we don't have to look inside of this. 11917 08:54:36,720 --> 08:54:38,720 We can just say, oh, it's a nice little magical thing. 11918 08:54:38,720 --> 08:54:40,720 We read the documentation, we read a web page, 11919 08:54:40,720 --> 08:54:43,720 and it told us to do this, this, and this, and away you go. 
11920 08:54:43,720 --> 08:54:46,720 And so it is sort of this isolation boundary 11921 08:54:46,720 --> 08:54:50,720 that works both for the programmer who's writing the object 11922 08:54:50,720 --> 08:54:53,720 and the programmer who's using the object. 11923 08:54:53,720 --> 08:54:56,720 And so it's a very nice pattern, 11924 08:54:56,720 --> 08:54:59,720 and so you'll see how we're going to build code 11925 08:54:59,720 --> 08:55:01,720 and we're going to group it together, 11926 08:55:01,720 --> 08:55:06,720 and then we're going to be using it sort of as a big blob of stuff. 11927 08:55:06,720 --> 08:55:09,720 So some definitions in this space, 11928 08:55:09,720 --> 08:55:13,720 words that I want you to understand. 11929 08:55:13,720 --> 08:55:16,720 When we're going to create one of these things, 11930 08:55:16,720 --> 08:55:20,720 one of these objects, instances, that has some data in it 11931 08:55:20,720 --> 08:55:23,720 and some code in it, we have to be able to define 11932 08:55:23,720 --> 08:55:24,720 the shape of this object. 11933 08:55:24,720 --> 08:55:26,720 What code will each object have in it 11934 08:55:26,720 --> 08:55:29,720 and what data will each object have in it? 11935 08:55:29,720 --> 08:55:31,720 And that's called a class. 11936 08:55:31,720 --> 08:55:33,720 The key to a class in this little picture 11937 08:55:33,720 --> 08:55:36,720 that I've got up here in all these slides is a key. 11938 08:55:36,720 --> 08:55:37,720 The class is a template. 11939 08:55:37,720 --> 08:55:39,720 It's not the thing itself, so it's a cookie cutter. 11940 08:55:39,720 --> 08:55:42,720 It knows a lot about how cookies are made, 11941 08:55:42,720 --> 08:55:45,720 and if you have cookie dough and you hit the thing, 11942 08:55:45,720 --> 08:55:47,720 then you make as many cookies as you want. 11943 08:55:47,720 --> 08:55:52,720 And so this nice little cookie picture is a great, you know, 11944 08:55:52,720 --> 08:55:54,720 mental model of how it works. 
11945 08:55:54,720 --> 08:56:03,720 The class is the template, 11946 08:56:03,720 --> 08:56:07,720 and then the objects are all of the cookies 11947 08:56:07,720 --> 08:56:09,720 that are made from that template. 11948 08:56:09,720 --> 08:56:12,720 But the template defines the shape and the nature of the objects. 11949 08:56:12,720 --> 08:56:16,720 So the code that we write is the template for each of the objects. 11950 08:56:16,720 --> 08:56:19,720 The code we write is the class code, 11951 08:56:19,720 --> 08:56:21,720 and then later we say, oh, let's take that template 11952 08:56:21,720 --> 08:56:24,720 and make ourselves an object or an instance. 11953 08:56:24,720 --> 08:56:27,720 Now, as we're defining a class, 11954 08:56:27,720 --> 08:56:30,720 we have two basic things that we put in the class. 11955 08:56:30,720 --> 08:56:32,720 And there's a couple of different terminologies for this. 11956 08:56:32,720 --> 08:56:34,720 One is method, which is code. 11957 08:56:34,720 --> 08:56:36,720 It's like a function that lives inside of a class. 11958 08:56:36,720 --> 08:56:39,720 Not a function that lives inside your program, 11959 08:56:39,720 --> 08:56:40,720 but one that lives inside of a class. 11960 08:56:40,720 --> 08:56:42,720 And so this is a scoping thing. 11961 08:56:42,720 --> 08:56:45,720 A method is really just a function, 11962 08:56:45,720 --> 08:56:47,720 but it lives inside the class. 11963 08:56:47,720 --> 08:56:50,720 And then fields or attributes are data items that are in the class. 11964 08:56:50,720 --> 08:56:53,720 And so they're variables that are defined in the class. 11965 08:56:53,720 --> 08:56:56,720 You can define variables outside the class that you use in your program, 11966 08:56:56,720 --> 08:56:57,720 and you've been doing that all along.
11967 08:56:57,720 --> 08:56:59,720 But if you're saying, I'm going to build this capability 11968 08:56:59,720 --> 08:57:02,720 and it's going to have data inside of it and code inside of it, 11969 08:57:02,720 --> 08:57:05,720 the code is the method or message, and the data is the field or attribute. 11970 08:57:05,720 --> 08:57:13,720 And these are just two different sets of terminology. 11971 08:57:13,720 --> 08:57:17,720 Method is what I'll probably use. If you look in some object-oriented languages 11972 08:57:17,720 --> 08:57:20,720 like Smalltalk or Apple, 11973 08:57:20,720 --> 08:57:22,720 they often call these messages. 11974 08:57:22,720 --> 08:57:26,720 So you can either access a method inside of a class or an object, 11975 08:57:26,720 --> 08:57:28,720 or you can send a message to the object. 11976 08:57:28,720 --> 08:57:30,720 The same is true for field and attribute. 11977 08:57:30,720 --> 08:57:32,720 It's just a chunk of data that's in the object 11978 08:57:32,720 --> 08:57:37,720 that you may or may not have the right to access. 11979 08:57:37,720 --> 08:57:40,720 So like I said, a class is a template. 11980 08:57:40,720 --> 08:57:44,720 It defines the characteristics of the objects that we're going to make from it. 11981 08:57:44,720 --> 08:57:48,720 It is the cookie cutter. 11982 08:57:48,720 --> 08:57:52,720 So dog is sort of the exemplar. 11983 08:57:52,720 --> 08:57:54,720 Lassie is a particular dog. 11984 08:57:54,720 --> 08:57:57,720 And so dog has fur and dog barks, and dogs do all these things. 11985 08:57:57,720 --> 08:58:01,720 And so we know something about dogs, but it doesn't mean we have a dog, right? 11986 08:58:01,720 --> 08:58:06,720 And the class is a more abstract concept, so that when it's time to get a dog, 11987 08:58:06,720 --> 08:58:09,720 we know certain things about dogs. 11988 08:58:09,720 --> 08:58:16,720 Instances or objects are once we say, oh, time to make a cookie from the template.
11989 08:58:16,720 --> 08:58:17,720 Time to get a dog. 11990 08:58:17,720 --> 08:58:19,720 We know something about dogs. 11991 08:58:19,720 --> 08:58:23,720 That's the creation of an object, and we call them instances, 11992 08:58:23,720 --> 08:58:24,720 instance of a class. 11993 08:58:24,720 --> 08:58:28,720 So the class doesn't exist, but we say, 11994 08:58:28,720 --> 08:58:31,720 make me a new object using this class as its template. 11995 08:58:31,720 --> 08:58:33,720 Oh, and now make me another one. 11996 08:58:33,720 --> 08:58:36,720 And so we can have many, many objects from one class. 11997 08:58:36,720 --> 08:58:41,720 So just like many cookies from one cookie cutter. 11998 08:58:41,720 --> 08:58:45,720 Method is a bit of code that lives inside of an object. 11999 08:58:45,720 --> 08:58:50,720 It's like a function, but it's scoped to within the object or within the class. 12000 08:58:50,720 --> 08:58:54,720 Okay, so that kind of gets us started on some of the terminology, 12001 08:58:54,720 --> 08:59:03,720 and we'll come back and we'll take a look at how we write code that's object oriented. 12002 08:59:03,720 --> 08:59:06,720 Okay, so now that we've gotten through the definitions, 12003 08:59:06,720 --> 08:59:08,720 let's work into some sample code. 12004 08:59:08,720 --> 08:59:10,720 But hey, look at this. 12005 08:59:10,720 --> 08:59:13,720 We've got ourselves a cookie cutter and some cookies. 12006 08:59:13,720 --> 08:59:17,720 So remember that a class is a template. 12007 08:59:17,720 --> 08:59:19,720 It's not the actual thing. 12008 08:59:19,720 --> 08:59:23,720 An object is an instance of a class. 12009 08:59:23,720 --> 08:59:27,720 So you have to take the class and do something to make the object. 12010 08:59:27,720 --> 08:59:30,720 And actually you can see here some other classes. 12011 08:59:30,720 --> 08:59:34,720 Clearly a sort of a snowflake class and a gingerbread man class. 
12012 08:59:34,720 --> 08:59:36,720 That's an object, object, object. 12013 08:59:36,720 --> 08:59:41,720 Somewhere out here there is a snowflake class and a gingerbread class. 12014 08:59:41,720 --> 08:59:46,720 But we've got a snowman object and a snowman object and a snowman class. 12015 08:59:46,720 --> 08:59:51,720 So class is the template. 12016 08:59:51,720 --> 08:59:53,720 Object is the instance. 12017 08:59:53,720 --> 08:59:54,720 So here's a bit of Python code. 12018 08:59:54,720 --> 08:59:57,720 So let's take a look at what we've got here. 12019 08:59:57,720 --> 08:59:59,720 Class is a new reserved word, kind of like def. 12020 08:59:59,720 --> 09:00:01,720 We have the name of the class. 12021 09:00:01,720 --> 09:00:03,720 That is a name that we choose. 12022 09:00:03,720 --> 09:00:07,720 That's the name by which we'll refer to this class for the rest of this program. 12023 09:00:07,720 --> 09:00:09,720 And it has a colon at the end of it, 12024 09:00:09,720 --> 09:00:13,720 which means it starts an indented block, which ends when we deindent. 12025 09:00:13,720 --> 09:00:16,720 Inside the class there are generally two things. 12026 09:00:16,720 --> 09:00:19,720 There is some data, and this just looks like an assignment statement in the class, 12027 09:00:19,720 --> 09:00:20,720 x equals zero. 12028 09:00:20,720 --> 09:00:22,720 And then there is a def. 12029 09:00:22,720 --> 09:00:24,720 This looks just like a function. 12030 09:00:24,720 --> 09:00:27,720 And then it starts with a def, has a colon, indents. 12031 09:00:27,720 --> 09:00:29,720 That function finishes right there. 12032 09:00:29,720 --> 09:00:33,720 The difference is this is a method because it lives inside of a class. 12033 09:00:33,720 --> 09:00:36,720 And so there is no function called party. 12034 09:00:36,720 --> 09:00:40,720 There's a function called party within party animal class. 12035 09:00:40,720 --> 09:00:43,720 And we'll talk in a second about this self thing. 
12036 09:00:43,720 --> 09:00:47,720 It is the way that inside this code we refer back to that variable. 12037 09:00:47,720 --> 09:00:50,720 So this is not actually executing any code. 12038 09:00:50,720 --> 09:00:54,720 It's sort of remembering the template, defining the class party animal. 12039 09:00:54,720 --> 09:00:57,720 This is what we call constructing. 12040 09:00:57,720 --> 09:01:00,720 We're constructing, using the party animal template or class, 12041 09:01:00,720 --> 09:01:02,720 we are making a party animal. 12042 09:01:02,720 --> 09:01:06,720 And then once we make that, we stick it in the variable an. 12043 09:01:06,720 --> 09:01:10,720 And then we're going to call this party animal, this party method, 12044 09:01:10,720 --> 09:01:12,720 three times one, two, three. 12045 09:01:12,720 --> 09:01:15,720 Now this self thing, and we'll take a look at the self. 12046 09:01:15,720 --> 09:01:19,720 The self ends up being an alias of an. 12047 09:01:19,720 --> 09:01:21,720 And so you can look at this syntax. 12048 09:01:21,720 --> 09:01:23,720 It's just kind of an equivalent of this syntax. 12049 09:01:23,720 --> 09:01:27,720 It's calling the party method within the party animal class 12050 09:01:27,720 --> 09:01:30,720 and passing the instance in as the first parameter. 12051 09:01:30,720 --> 09:01:36,720 And so self ends up being an alias of an each time these are called. 12052 09:01:36,720 --> 09:01:39,720 Now if we make a different variable and a second object, 12053 09:01:39,720 --> 09:01:43,720 which we will eventually, you will see that that works a little bit differently. 12054 09:01:43,720 --> 09:01:48,720 And so this syntax is a short version of that syntax. 12055 09:01:48,720 --> 09:01:55,720 So if we watch how this executes, it starts up here, 12056 09:01:55,720 --> 09:01:59,720 it just defines it, and then we construct it. 
12057 09:01:59,720 --> 09:02:04,720 And that's basically constructing it; we know how to construct it 12058 09:02:04,720 --> 09:02:08,720 because we look at the class and we make a variable x, we make some code party, 12059 09:02:08,720 --> 09:02:11,720 and then we construct that, that's what the party animal does, 12060 09:02:11,720 --> 09:02:13,720 and then we assign that into an. 12061 09:02:13,720 --> 09:02:17,720 And so an is now pointing at that. 12062 09:02:17,720 --> 09:02:22,720 And then when we call the party method, that basically takes this an 12063 09:02:22,720 --> 09:02:27,720 and passes it in as the first parameter, which is used as self. 12064 09:02:27,720 --> 09:02:32,720 And so self.x, which is what we're doing in this line right here, 12065 09:02:32,720 --> 09:02:37,720 self.x is a variable, x starts out as zero. 12066 09:02:37,720 --> 09:02:41,720 x starts out as zero because when it was constructed it was set to zero. 12067 09:02:41,720 --> 09:02:44,720 So we're in here, self is an alias of an. 12068 09:02:44,720 --> 09:02:48,720 It looks up self.x, which is zero, adds one to it, and so this becomes one. 12069 09:02:48,720 --> 09:02:51,720 And then we print so far, so far one. 12070 09:02:51,720 --> 09:02:54,720 And then the code returns and it goes down and does it again. 12071 09:02:54,720 --> 09:02:58,720 And x becomes two, prints out so far two, comes back down, 12072 09:02:58,720 --> 09:03:02,720 and does the last time, calls it again, self.x is two, 12073 09:03:02,720 --> 09:03:06,720 add one to it and stick it back in, so this becomes three, 12074 09:03:06,720 --> 09:03:09,720 and we print out three, and then the program finishes. 12075 09:03:09,720 --> 09:03:13,720 And so you can think of this as constructing the object, 12076 09:03:13,720 --> 09:03:19,720 and then associating it with this an variable.
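The slide code being walked through here is not part of the transcript, so the following is only a sketch reconstructed from the narration: a class with a data item x and a method party, one object stored in an, and three calls. The exact print text and the final long-form call are assumptions added for illustration.

```python
class PartyAnimal:
    x = 0  # data: every party animal starts at zero

    def party(self):
        # self is an alias of whichever object the method was called on
        self.x = self.x + 1
        print("So far", self.x)


an = PartyAnimal()  # construct an object from the template
an.party()          # So far 1
an.party()          # So far 2
an.party()          # So far 3

# The long form of the same call: pass the instance in as the first parameter.
PartyAnimal.party(an)  # So far 4
```

The short form an.party() and the long form PartyAnimal.party(an) do the same work, which is why self ends up referring to an inside the method.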
12077 09:03:19,720 --> 09:03:22,720 Now that we've created this object, we can play around with things 12078 09:03:22,720 --> 09:03:24,720 we've played around before with dir and type. 12079 09:03:24,720 --> 09:03:31,720 We use dir and type to kind of inspect variables and types and objects. 12080 09:03:31,720 --> 09:03:34,720 So we've been using objects all along. 12081 09:03:34,720 --> 09:03:37,720 This code here says, hey, make me an empty list. 12082 09:03:37,720 --> 09:03:42,720 Well, it turns out that what we're saying is there is already a list class 12083 09:03:42,720 --> 09:03:46,720 inside of Python, and we're constructing an empty list. 12084 09:03:46,720 --> 09:03:51,720 And when we get back this empty list, we're assigning that into x. 12085 09:03:51,720 --> 09:03:55,720 So x, in a sense, contains or points to an empty list. 12086 09:03:55,720 --> 09:03:57,720 So then we say, hey, what is in x? 12087 09:03:57,720 --> 09:03:59,720 What kind of thing is x? Well, it's a list. 12088 09:03:59,720 --> 09:04:01,720 This is a thing. It's a list type. 12089 09:04:01,720 --> 09:04:04,720 Lists have lists of things in them. 12090 09:04:04,720 --> 09:04:07,720 And, you know, use append and all the things we've been doing before, 12091 09:04:07,720 --> 09:04:08,720 they're just objects. 12092 09:04:08,720 --> 09:04:12,720 And then the dir, if you remember the dir, the dir is the capabilities. 12093 09:04:12,720 --> 09:04:15,720 And there's all these internal capabilities that do things like 12094 09:04:15,720 --> 09:04:19,720 implement the bracket operator, et cetera, those double underscore ones. 12095 09:04:19,720 --> 09:04:21,720 We can ignore them, although you can even look them up 12096 09:04:21,720 --> 09:04:23,720 and figure out what they mean if you feel like it. 12097 09:04:23,720 --> 09:04:27,720 But the methods that we tend to call are in this class. 
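A short sketch of the inspection described above, using only the built-in list class, type, and dir. The filtering step that hides the double-underscore names is an addition for readability, not something shown in the lecture.

```python
x = list()          # construct an empty list from the built-in list class
print(type(x))      # the kind of thing x points to
print(dir(x))       # every capability, including the double-underscore ones

# Assumption for illustration: hide the internal double-underscore names
# so only the methods we normally call remain.
public_methods = [name for name in dir(x) if not name.startswith("__")]
print(public_methods)
```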
12098 09:04:27,720 --> 09:04:32,720 And so things like x.sort, I've always told you, 12099 09:04:32,720 --> 09:04:35,720 that is the sort method within the x thing. 12100 09:04:35,720 --> 09:04:38,720 And the dot operator is the operator that we use 12101 09:04:38,720 --> 09:04:40,720 to look something up within an object. 12102 09:04:40,720 --> 09:04:43,720 And so you've been using the syntax all along. 12103 09:04:43,720 --> 09:04:47,720 x.sort, dictionary.items, all of those are methods 12104 09:04:47,720 --> 09:04:50,720 within the corresponding class. 12105 09:04:50,720 --> 09:04:53,720 If we take a look at this line of code that we've been doing 12106 09:04:53,720 --> 09:04:57,720 for a very long time, which says, oh, stick hello there into y. 12107 09:04:57,720 --> 09:05:01,720 It's, if I reword that as more oo or object oriented, 12108 09:05:01,720 --> 09:05:06,720 what this single quote does says, make me a string object 12109 09:05:06,720 --> 09:05:11,720 and put some text in it, and then when that is done being constructed, 12110 09:05:11,720 --> 09:05:13,720 stick that into y. 12111 09:05:13,720 --> 09:05:17,720 Right? And so y now points to a string object 12112 09:05:17,720 --> 09:05:20,720 that's been preinitialized to the string hello there. 12113 09:05:20,720 --> 09:05:23,720 Now that's a long way of saying hello there ends up in y. 12114 09:05:23,720 --> 09:05:26,720 But in oo terms we can talk about that. 12115 09:05:26,720 --> 09:05:30,720 If we do a dir of that, we see a whole bunch of internal methods, 12116 09:05:30,720 --> 09:05:32,720 which have double underscores. 12117 09:05:32,720 --> 09:05:34,720 And then we see all kinds of methods that we've been using. 12118 09:05:34,720 --> 09:05:37,720 We've been using methods like upper. 12119 09:05:37,720 --> 09:05:39,720 We've been using methods like find. 12120 09:05:39,720 --> 09:05:44,720 We've been using methods like rstrip, right? 
12121 09:05:44,720 --> 09:05:46,720 We've been using these methods. 12122 09:05:46,720 --> 09:05:51,720 So we're going to call, like, y.rstrip, parentheses. 12123 09:05:51,720 --> 09:05:54,720 Again, that's a method, that's an object. 12124 09:05:54,720 --> 09:05:59,720 Not a class, it's an object, and that is the object lookup operator. 12125 09:05:59,720 --> 09:06:03,720 Now if we do the same thing to code that we've built, 12126 09:06:03,720 --> 09:06:07,720 or a class that we've built, so now we have a party animal class. 12127 09:06:07,720 --> 09:06:10,720 Remember this up to here is just definition. 12128 09:06:10,720 --> 09:06:13,720 Now we construct it, and we store it in an. 12129 09:06:13,720 --> 09:06:17,720 So an is a variable that contains an object of type party animal. 12130 09:06:17,720 --> 09:06:21,720 We ask it what type it is, and it prints out here. 12131 09:06:21,720 --> 09:06:25,720 It says this is a class, and it's __main__.PartyAnimal. 12132 09:06:25,720 --> 09:06:28,720 And this part here is the __main__. 12133 09:06:28,720 --> 09:06:30,720 It's scoped to __main__. 12134 09:06:30,720 --> 09:06:32,720 But you can see that you have made a new type. 12135 09:06:32,720 --> 09:06:35,720 You built a type by using this class keyword. 12136 09:06:35,720 --> 09:06:37,720 And then we use the dir. 12137 09:06:37,720 --> 09:06:39,720 Remember, dir looks for capabilities. 12138 09:06:39,720 --> 09:06:44,720 And again, you will see a whole bunch of underscore things. 12139 09:06:44,720 --> 09:06:46,720 They have meaning, you can look them up. 12140 09:06:46,720 --> 09:06:49,720 But eventually you'll see the two things that you've put in it. 12141 09:06:49,720 --> 09:06:53,720 One is the method party, and the other is the attribute, or field x. 12142 09:06:53,720 --> 09:06:57,720 And again, these are the things that you can say, an.x. 12143 09:06:57,720 --> 09:07:00,720 Or an.party.
12144 09:07:00,720 --> 09:07:06,720 Because this dot is the object operator, the object lookup operator that says, 12145 09:07:06,720 --> 09:07:09,720 look up in the object an, the thing x. 12146 09:07:09,720 --> 09:07:11,720 Or look up in the object an, the thing party. 12147 09:07:11,720 --> 09:07:13,720 Okay? 12148 09:07:15,720 --> 09:07:19,720 So up next we'll talk a little bit about how objects are created and destroyed. 12149 09:07:19,720 --> 09:07:22,720 We also call that object life cycle. 12150 09:07:22,720 --> 09:07:27,720 Now I'm going to talk a little bit about object life cycle. 12151 09:07:27,720 --> 09:07:32,720 And what we mean by object life cycle is the act of creating and destroying these objects. 12152 09:07:32,720 --> 09:07:35,720 And I've been using this term constructor already. 12153 09:07:35,720 --> 09:07:41,720 And so when we declare a variable, whether it's a string or a dictionary or a party animal, 12154 09:07:41,720 --> 09:07:43,720 we create them and then they're discarded, 12155 09:07:43,720 --> 09:07:46,720 and there's all this dynamic memory that comes and goes. 12156 09:07:46,720 --> 09:07:53,720 And we as the writers of objects have the ability to insert ourselves at the moment of object creation 12157 09:07:53,720 --> 09:07:55,720 and at the moment of object destruction. 12158 09:07:55,720 --> 09:08:00,720 And we make special functions that we call the constructor, the object constructor, 12159 09:08:00,720 --> 09:08:03,720 or the class constructor, and the destructor. 12160 09:08:03,720 --> 09:08:05,720 And we don't actually explicitly call them. 12161 09:08:05,720 --> 09:08:10,720 They're called automatically by Python on our behalf. 12162 09:08:10,720 --> 09:08:13,720 And so the constructor is much more commonly used. 12163 09:08:13,720 --> 09:08:18,720 It's used to set up any initial values of variables if necessary, etc., etc.
12164 09:08:18,720 --> 09:08:24,720 Destructors, we'll cover them, but they're used very rarely. 12165 09:08:24,720 --> 09:08:26,720 So here's a bit of code that we've got. 12166 09:08:26,720 --> 09:08:31,720 It's our party animal, and a lot of it is the same as what we've been doing so far. 12167 09:08:31,720 --> 09:08:35,720 So we have this variable x, and the constructor has a special name, 12168 09:08:35,720 --> 09:08:38,720 underscore, underscore, init, underscore, underscore (__init__). 12169 09:08:38,720 --> 09:08:42,720 Again, we pass in the instance of the object, self. 12170 09:08:42,720 --> 09:08:45,720 And in this one, all we're going to do is print out that you're constructed. 12171 09:08:45,720 --> 09:08:47,720 And here's this code that we've had before. 12172 09:08:47,720 --> 09:08:51,720 And now we have underscore, underscore, del (__del__), and then we pass in self. 12173 09:08:51,720 --> 09:08:54,720 And we'll just print out that we're being destructed 12174 09:08:54,720 --> 09:09:01,720 and what the current value of x is for that particular instance. 12175 09:09:01,720 --> 09:09:04,720 So let's go ahead and run this. 12176 09:09:04,720 --> 09:09:07,720 And so, again, this doesn't really do any code up to here. 12177 09:09:07,720 --> 09:09:11,720 That just defines party animal, but this is the constructing of it. 12178 09:09:11,720 --> 09:09:15,720 And basically that says, oh, and it really kind of creates these variables, 12179 09:09:15,720 --> 09:09:17,720 and then it also runs the constructor. 12180 09:09:17,720 --> 09:09:23,720 And so in this case, this line right here is causing the I am constructed message to come out. 12181 09:09:23,720 --> 09:09:29,720 Then we do an.party, an.party, and that prints, you know, one and two. 12182 09:09:29,720 --> 09:09:31,720 And here's an interesting thing. 12183 09:09:31,720 --> 09:09:37,720 We're actually going to destroy this variable by throwing away an. an no longer points at that object.
12184 09:09:37,720 --> 09:09:39,720 an is going to point to 42. 12185 09:09:39,720 --> 09:09:42,720 So we're going to sort of overwrite an and put 42 in it. 12186 09:09:42,720 --> 09:09:46,720 And at that point, Python's like, oh, this whole little object that I just created, 12187 09:09:46,720 --> 09:09:52,720 somewhere it's out here, it's vaporizing it and throwing it away. 12188 09:09:52,720 --> 09:09:57,720 And so before this line completes, it actually calls our destructor on our behalf. 12189 09:09:57,720 --> 09:09:59,720 And so that message comes out. 12190 09:09:59,720 --> 09:10:05,720 So we are allowed as the builder of these objects to add these little chunks of code that says, 12191 09:10:05,720 --> 09:10:11,720 I want to be involved at the moment this object is created, and I want to be involved at the moment that this object is destroyed. 12192 09:10:11,720 --> 09:10:16,720 Now, in this last line, an is no longer a party animal. 12193 09:10:16,720 --> 09:10:18,720 an is now an integer. 12194 09:10:18,720 --> 09:10:20,720 It's got a 42 in it. 12195 09:10:20,720 --> 09:10:23,720 It's gone. It's been created. It was used, and then it was destroyed. 12196 09:10:23,720 --> 09:10:26,720 So you've got to be careful if you overwrite something. 12197 09:10:26,720 --> 09:10:29,720 You can sort of throw the object away. 12198 09:10:29,720 --> 09:10:38,720 So the constructor is a special block of code that's called when the object is created to set the object up. 12199 09:10:38,720 --> 09:10:40,720 So we can create lots of instances. 12200 09:10:40,720 --> 09:10:46,720 Everything we've done so far is we make a class, and then we create one instance, one object. 12201 09:10:46,720 --> 09:10:49,720 And each of these objects ends up being stored in its own variable. 12202 09:10:49,720 --> 09:10:51,720 We have a variable an, and we've been using it. 
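A sketch of the life-cycle example as narrated, assuming the message texts shown here (the transcript only quotes "I am constructed"). One hedge worth stating: in CPython the destructor usually runs right away when the last reference is overwritten, but the language does not promise that timing, and other Python implementations may run it later.

```python
class PartyAnimal:
    x = 0

    def __init__(self):
        # Called automatically by Python when the object is constructed
        print("I am constructed")

    def party(self):
        self.x = self.x + 1
        print("So far", self.x)

    def __del__(self):
        # Called automatically when the object is being thrown away
        print("I am destructed", self.x)


an = PartyAnimal()  # runs __init__
an.party()
an.party()
an = 42             # the PartyAnimal object is no longer referenced
print("an contains", an)
```

After the last assignment, an is an ordinary integer and the party animal object can no longer be reached.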
12203 09:10:51,720 --> 09:10:58,720 But the more interesting thing begins to happen when we have multiple instances of the same class sitting in different variables. 12204 09:10:58,720 --> 09:11:01,720 And it has its own copy of the instance variables. 12205 09:11:01,720 --> 09:11:03,720 So let's take a look at this. 12206 09:11:03,720 --> 09:11:11,720 So this code here, I've taken out the destructor, and it shows a little bit more information. 12207 09:11:11,720 --> 09:11:13,720 So now we're going to put two variables in here. 12208 09:11:13,720 --> 09:11:18,720 We're going to have a current score or whatever and a name, and we're going to start it out as blank. 12209 09:11:18,720 --> 09:11:23,720 And this time we're going to add a parameter onto the constructor. 12210 09:11:23,720 --> 09:11:28,720 And so the self comes in sort of automatically as the object is being constructed. 12211 09:11:28,720 --> 09:11:36,720 But if we put a parameter on the constructor call, which is this party animal call, then this comes in as the z variable. 12212 09:11:36,720 --> 09:11:42,720 And so self is the object itself, and z, this first parameter, is whatever parameter we put here. 12213 09:11:42,720 --> 09:11:46,720 Everything we've done so far has no parameter here, but now we have a parameter here. 12214 09:11:46,720 --> 09:11:51,720 And then that means that when we call this constructor, this line of code comes, 12215 09:11:51,720 --> 09:11:56,720 and then name is no longer blank, name is going to be Sally in this particular thing. 12216 09:11:56,720 --> 09:12:01,720 And then it'll say, oh, self.name, which will be Sally who has been constructed. 12217 09:12:01,720 --> 09:12:07,720 And so then we have this, and that object is now constructed, and then we put it in the variable s. 12218 09:12:07,720 --> 09:12:11,720 And then we call the party method on that, and we construct a different one. 
12219 09:12:11,720 --> 09:12:20,720 And so this time it calls, and z is Jim, and we basically have a, oops, 12220 09:12:20,720 --> 09:12:24,720 another copy of this. And so this is how it's going to look. 12221 09:12:24,720 --> 09:12:35,720 As it runs down here, when this is called, it makes one instance and stores that in the variable s. 12222 09:12:35,720 --> 09:12:41,720 And there's a variable x in there, there's a name in there, there's an init method in party, and that's all in here. 12223 09:12:41,720 --> 09:12:48,720 All that stuff is in here. And now we say, let's make, and that's going to have Sally in there. 12224 09:12:48,720 --> 09:12:52,720 All right, Sally in there. 12225 09:12:52,720 --> 09:12:56,720 And then we're going to do another constructor, and so it's going to make a whole new thing, 12226 09:12:56,720 --> 09:13:00,720 and it's going to store that in j, and this one's going to have Jim in it. 12227 09:13:00,720 --> 09:13:05,720 S party, then this turns into a one, and then we're going to call j party, 12228 09:13:05,720 --> 09:13:10,720 that turns that into a one, and then s party will cause this to be a two. 12229 09:13:10,720 --> 09:13:17,720 And so what happens is we have now two objects, one in the variable s and one in the variable j, 12230 09:13:17,720 --> 09:13:20,720 and they have separate copies of their instance variables. 12231 09:13:20,720 --> 09:13:25,720 These are the instance variables, or the object fields, or whatever, but they're the variables. 12232 09:13:25,720 --> 09:13:32,720 But the key is that every time we do a new construction, it duplicates this, and there's another copy of it. 12233 09:13:32,720 --> 09:13:40,720 So there's an x within s. So s.x is this variable, and j.x is that variable. 12234 09:13:40,720 --> 09:13:54,720 Okay? So the next thing we'll talk about is inheritance, and that's the idea of taking one class and extending it to make something new. 
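The two-instance walkthrough above can be sketched as follows. The parameter name z and the variables s and j come from the narration; the printed wording is an assumption.

```python
class PartyAnimal:
    x = 0
    name = ""

    def __init__(self, z):
        # z is whatever was passed on the constructor call
        self.name = z
        print(self.name, "constructed")

    def party(self):
        self.x = self.x + 1
        print(self.name, "party count", self.x)


s = PartyAnimal("Sally")  # first object, stored in s
j = PartyAnimal("Jim")    # a separate object, stored in j

s.party()  # s.x becomes 1
j.party()  # j.x becomes 1
s.party()  # s.x becomes 2, j.x is untouched
```

Each object keeps its own x and name, so s.x and j.x change independently.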
12235 09:13:54,720 --> 09:13:59,720 So the last topic we'll talk about here in object orientation is the notion of inheritance. 12236 09:13:59,720 --> 09:14:06,720 And this is a form of code reuse, and it's one of the more advanced aspects of object-oriented programming. 12237 09:14:06,720 --> 09:14:14,720 So just kind of understand what it is at a high level, and then you know where to come back to when you need to learn a bit more about inheritance. 12238 09:14:14,720 --> 09:14:21,720 So the idea is instead of making a new class from scratch, we actually make a new class by starting with an existing class. 12239 09:14:21,720 --> 09:14:25,720 We are extending it, or another word for this is subclassing. 12240 09:14:25,720 --> 09:14:33,720 And it's sort of a situation where you're like, I've got this code, and I've got this data, and I just need to add a few things to it, 12241 09:14:33,720 --> 09:14:35,720 and then I'll have a whole new thing. 12242 09:14:35,720 --> 09:14:44,720 And as you design objects and what we call object hierarchies, you often do this, and it's a form of sort of real clever code reuse. 12243 09:14:44,720 --> 09:14:51,720 But again, don't necessarily think that you're supposed to know when to use this or why to use this. 12244 09:14:51,720 --> 09:14:55,720 Right now, it's just terminology, okay? Just terminology. 12245 09:14:55,720 --> 09:14:57,720 We call these parent-child relationships. 12246 09:14:57,720 --> 09:15:02,720 The original class is called the parent, and the new class is called the child class. 12247 09:15:02,720 --> 09:15:05,720 So subclasses are another word for this. 12248 09:15:05,720 --> 09:15:07,720 You have a class, and then you subclass it. 12249 09:15:07,720 --> 09:15:14,720 I think extending and inheriting and parent-child are probably better ways of expressing it than subclassing. 12250 09:15:14,720 --> 09:15:17,720 So here's a bit of code. Let's take a look at this.
12251 09:15:17,720 --> 09:15:22,720 This code's unchanged. It's the party animal code that we've been seeing all along. 12252 09:15:22,720 --> 09:15:25,720 It's the one that we construct and put a name in. 12253 09:15:25,720 --> 09:15:27,720 And now what we're going to do is extend it. 12254 09:15:27,720 --> 09:15:31,720 And so you'll notice that this code down here is the part that's doing the extending. 12255 09:15:31,720 --> 09:15:34,720 So we're making a new class, football fan. 12256 09:15:34,720 --> 09:15:39,720 And by putting in parentheses before the colon, party animal, that says, 12257 09:15:39,720 --> 09:15:45,720 football fan inherits everything that is party animal, meaning the x, the name, the init, the party. 12258 09:15:45,720 --> 09:15:48,720 All those methods and data are sitting there. 12259 09:15:48,720 --> 09:15:50,720 And now we're going to add a new variable. 12260 09:15:50,720 --> 09:15:56,720 So football fan has, in addition to all those other variables, it has points, and it has a touchdown method. 12261 09:15:56,720 --> 09:16:03,720 And self.points is added to, we add seven to the points, and then we call party. 12262 09:16:03,720 --> 09:16:04,720 And that does that. 12263 09:16:04,720 --> 09:16:11,720 So this is calling this method because football fan includes x, name, and party, and init, and everything. 12264 09:16:11,720 --> 09:16:18,720 And all this constructor, so this football fan is really an amalgamation of all these things together. 12265 09:16:18,720 --> 09:16:22,720 Party animal is just this stuff, right? 12266 09:16:22,720 --> 09:16:25,720 And so we still have two classes. We don't just have one. 12267 09:16:25,720 --> 09:16:27,720 We didn't erase the party animal class. 12268 09:16:27,720 --> 09:16:29,720 And so we take a look at the code that we can run here. 12269 09:16:29,720 --> 09:16:32,720 We can say, oh, okay, let's make a party animal, Sally.
12270 09:16:32,720 --> 09:16:41,720 And so that constructs an object like this, and then stores that in s, with an x starting out at zero. 12271 09:16:41,720 --> 09:16:48,720 And then we call this party, oops, better change that color, starts out at zero. 12272 09:16:48,720 --> 09:16:51,720 And then we call the party method, and that changes it to one. 12273 09:16:51,720 --> 09:16:57,720 And so this bit of code, it's as if this part doesn't matter at all because it is a party animal. 12274 09:16:57,720 --> 09:16:59,720 It's not a football fan. 12275 09:16:59,720 --> 09:17:06,720 But now if we take a look at this code down here, take this code down here, 12276 09:17:06,720 --> 09:17:10,720 we're going to construct a football fan and pass in Jim. 12277 09:17:10,720 --> 09:17:13,720 But football fan has no underscore, underscore, init (__init__). 12278 09:17:13,720 --> 09:17:20,720 So that actually uses the __init__ from party animal because we extended party animal to make football fan. 12279 09:17:20,720 --> 09:17:23,720 So we inherited all of the good that was in there. 12280 09:17:23,720 --> 09:17:27,720 So there it's going to make a name, a variable x, which is going to start at zero, 12281 09:17:27,720 --> 09:17:32,720 a variable name that's going to have Jim in it, and a variable points that's going to have a zero in it. 12282 09:17:32,720 --> 09:17:38,720 So this j variable has more things in it than the s variable has. 12283 09:17:38,720 --> 09:17:46,720 And so we can call j.party, and if we call j.party, that goes here and adds one to x, right? 12284 09:17:46,720 --> 09:17:48,720 So that adds one to x. 12285 09:17:48,720 --> 09:17:50,720 And then we call j.touchdown. 12286 09:17:50,720 --> 09:17:55,720 Well, that comes down in here and adds seven to the points, right? 12287 09:17:55,720 --> 09:17:58,720 And then calls party within us. 12288 09:17:58,720 --> 09:18:04,720 So self.party is the current object, i.e. self and j are the same thing, right?
12289 09:18:04,720 --> 09:18:13,720 Self.party, and then it goes up here and passes self in, and it adds one to the x, in this case, of this j variable. 12290 09:18:13,720 --> 09:18:15,720 So this becomes two. 12291 09:18:15,720 --> 09:18:20,720 And that's where it prints out seven and two, and away you go. 12292 09:18:20,720 --> 09:18:28,720 And so it's a way for you to kind of take all this stuff and stuff it into a class by making a new class 12293 09:18:28,720 --> 09:18:33,720 and just add the extending bits, the bits that are in addition to the other stuff. 12294 09:18:33,720 --> 09:18:38,720 So like I said, inheritance is a powerful and wonderful concept. 12295 09:18:38,720 --> 09:18:47,720 It's a form of, excellent form of reuse, but basically the whole purpose of this lecture was 12296 09:18:47,720 --> 09:18:52,720 so that I could in the future just use these words and you would understand them as compared to, 12297 09:18:52,720 --> 09:18:56,720 I just want to say method, and I've been saying method all along, and it's high time that I defined it. 12298 09:18:56,720 --> 09:18:59,720 So let's just review one last time. 12299 09:18:59,720 --> 09:19:01,720 Class is a template. 12300 09:19:01,720 --> 09:19:03,720 It is not actually a thing. 12301 09:19:03,720 --> 09:19:05,720 It is a shape of a thing. 12302 09:19:05,720 --> 09:19:09,720 And we define it and say when we make one of these things, it's going to have these variables in it, 12303 09:19:09,720 --> 09:19:11,720 it's going to have these methods in it. 12304 09:19:11,720 --> 09:19:16,720 Attributes are variables within a class; a method is a function that's inside of a class. 12305 09:19:16,720 --> 09:19:21,720 Object is once we construct a class, we get back an object. 12306 09:19:21,720 --> 09:19:26,720 And so object here is the snowman cookies. 12307 09:19:26,720 --> 09:19:29,720 Class is the snowman cookie cutter.
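The football fan walkthrough above, and the terms being reviewed here, can be sketched as one small program. The class and method names follow the narration; the print wording is an assumption, since the slide itself is not in the transcript.

```python
class PartyAnimal:
    x = 0
    name = ""

    def __init__(self, nam):
        self.name = nam
        print(self.name, "constructed")

    def party(self):
        self.x = self.x + 1
        print(self.name, "party count", self.x)


class FootballFan(PartyAnimal):  # FootballFan extends PartyAnimal
    points = 0

    def touchdown(self):
        self.points = self.points + 7
        self.party()  # inherited method; self is the same object as j below
        print(self.name, "points", self.points)


s = PartyAnimal("Sally")
s.party()        # Sally's x becomes 1

j = FootballFan("Jim")  # no __init__ in FootballFan, so the parent's runs
j.party()        # Jim's x becomes 1
j.touchdown()    # points becomes 7, then party() makes x 2
```

FootballFan only lists the extra pieces (points and touchdown); everything else comes from the parent class.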
12308 09:19:29,720 --> 09:19:38,720 And a constructor is a bit of code that sets up our object, our instance, when it first is created. 12309 09:19:38,720 --> 09:19:47,720 And inheritance is this ability to create a new class but take, and import in effect, all the capabilities of an existing class. 12310 09:19:47,720 --> 09:19:51,720 So object-oriented is awesome. 12311 09:19:51,720 --> 09:19:54,720 For the rest of this class, we're not going to write any object code. 12312 09:19:54,720 --> 09:19:57,720 We're not going to use class at all, but we are going to use objects. 12313 09:19:57,720 --> 09:20:00,720 Literally, you've been using objects from the beginning of this course. 12314 09:20:00,720 --> 09:20:09,720 As soon as you said, print, whoops, as soon as you said, you know, x equals 'hi', that's an object. 12315 09:20:09,720 --> 09:20:15,720 And as soon as you said x.upper, you were calling a method, right? 12316 09:20:15,720 --> 09:20:17,720 You've been calling a method all along. 12317 09:20:17,720 --> 09:20:25,720 When you're doing something like fh equals open, this thing you're getting back, that's an object. 12318 09:20:25,720 --> 09:20:28,720 And then you do fh.read or whatever. 12319 09:20:28,720 --> 09:20:31,720 You're calling a method with the dot operator. 12320 09:20:31,720 --> 09:20:33,720 So you've been using objects all along. 12321 09:20:33,720 --> 09:20:41,720 Now I'm just finally explaining to you when I say call the read method or call the upper method or what's this little dot and why is that there? 12322 09:20:41,720 --> 09:20:54,720 So again, it's time for us to understand that, but it will take you a long time before you encounter a problem that's large enough where as part of your solution, you're going to make a new object. 12323 09:20:54,720 --> 09:20:56,720 But when you do, it's really a powerful thing. 12324 09:20:56,720 --> 09:21:02,720 I mean, it's a really bad idea for me as a teacher to say, oh, write a bunch of objects.
12325 09:21:02,720 --> 09:21:04,720 It's premature for that. 12326 09:21:04,720 --> 09:21:08,720 It's later is when you will actually learn how to use objects. 12327 09:21:08,720 --> 09:21:11,720 And you'll be like, oh, thank heaven that these objects are here. 12328 09:21:11,720 --> 09:21:12,720 Okay? 12329 09:21:12,720 --> 09:21:14,720 So that's all for now. 12330 09:21:14,720 --> 09:21:15,720 Thanks for listening. 12331 09:21:15,720 --> 09:21:16,720 See you on the net. 12332 09:21:20,720 --> 09:21:23,720 Hello and welcome to our chapter on databases. 12333 09:21:23,720 --> 09:21:30,720 We're going to learn a lot in this chapter, learn a whole new programming language, SQL, and learn how to use that. 12334 09:21:30,720 --> 09:21:37,720 So you're going to need a new piece of software to run all of the exercises that I'm going to do called SQLite Browser. 12335 09:21:37,720 --> 09:21:39,720 We're using a database called SQLite. 12336 09:21:39,720 --> 09:21:40,720 Go ahead and download this. 12337 09:21:40,720 --> 09:21:42,720 You might have to pause and come back if you like. 12338 09:21:42,720 --> 09:21:46,720 Go to sqlitebrowser.org and download it and install it. 12339 09:21:46,720 --> 09:21:50,720 While you're doing that, we'll talk a little bit about the history. 12340 09:21:50,720 --> 09:22:01,720 So in the old days, 1960s, 1970s, I started doing computing in 1975, we didn't have a lot of storage. 12341 09:22:01,720 --> 09:22:06,720 I mean, this is 16 gigabytes right here, and we didn't even have megabytes. 12342 09:22:06,720 --> 09:22:10,720 I mean, the computer I had had a few megabytes of stuff. 12343 09:22:10,720 --> 09:22:12,720 Well, so we didn't have a lot of disk drives. 12344 09:22:12,720 --> 09:22:18,720 And so permanent storage was often sequential in these tapes, these tape drives that we had. 
12345 09:22:18,720 --> 09:22:24,720 Tapes and tape drives were the scalable part of storage because you could just make more tapes and you could rack them up. 12346 09:22:24,720 --> 09:22:28,720 And so that was our way of greatly increasing the storage of the computer. 12347 09:22:28,720 --> 09:22:30,720 The problem they had was, is they were sequential. 12348 09:22:30,720 --> 09:22:33,720 You read it, it advances, read it, advance, read and advance. 12349 09:22:33,720 --> 09:22:39,720 Now, interestingly, we've been writing programs that do this, that everything we've written so far pretty much reads the whole file, 12350 09:22:39,720 --> 09:22:42,720 reads the whole web page, reads this, everything we read it. 12351 09:22:42,720 --> 09:22:44,720 We read either a loop or read the whole thing. 12352 09:22:44,720 --> 09:22:46,720 And that's because we have plenty of memory. 12353 09:22:46,720 --> 09:22:49,720 But we're still reading sequentially. 12354 09:22:49,720 --> 09:22:57,720 And so the way you would do this when you didn't have enough spinning storage or online storage is you'd use offline storage. 12355 09:22:57,720 --> 09:22:59,720 But the trick would be that you would sort it. 12356 09:22:59,720 --> 09:23:04,720 So let's imagine that you're a bank and you have a bunch of accounts, only a few of which are active on any day. 12357 09:23:04,720 --> 09:23:14,720 And you have a tape that has, in account number order from low to high, the prior balance, last night's balance of every one of your bank accounts. 12358 09:23:14,720 --> 09:23:21,720 And then you do all the transactions and you record how much money was taken in or out for each account number. 12359 09:23:21,720 --> 09:23:22,720 And then you sort those transactions. 12360 09:23:22,720 --> 09:23:26,720 And then what you do is what we call the sequential master update. 12361 09:23:26,720 --> 09:23:31,720 And that is, you would write a program that would read the first transaction and hold on to it. 
12362 09:23:31,720 --> 09:23:34,720 Say, okay, this is account 45. 12363 09:23:34,720 --> 09:23:36,720 Then it would read the first account, like one. 12364 09:23:36,720 --> 09:23:37,720 And it would copy one. 12365 09:23:37,720 --> 09:23:41,720 And then it would read two and read like seven, eight, 42, 43. 12366 09:23:41,720 --> 09:23:43,720 Then it would read like 44. 12367 09:23:43,720 --> 09:23:49,720 And then it would read 45, but now it would change that and write the new 45 and read the next transaction. 12368 09:23:49,720 --> 09:23:50,720 And so this might be 60. 12369 09:23:50,720 --> 09:23:53,720 And it would read a bunch of stuff and copy a bunch of stuff. 12370 09:23:53,720 --> 09:23:57,720 And then it would finally get to 60 and it would merge the add or subtract. 12371 09:23:57,720 --> 09:23:59,720 And so the old balance ended up here. 12372 09:23:59,720 --> 09:24:01,720 And the new balance ended up here. 12373 09:24:01,720 --> 09:24:03,720 And you had to only make one pass through the data. 12374 09:24:03,720 --> 09:24:05,720 So it was super efficient. 12375 09:24:05,720 --> 09:24:07,720 So we had all these mechanisms to sort. 12376 09:24:07,720 --> 09:24:10,720 We used to do punch cards and have sorters and all these things. 12377 09:24:10,720 --> 09:24:14,720 And then these things would run for hours. 12378 09:24:14,720 --> 09:24:19,720 And if you watch old TV shows, these tapes are spinning and these things are running back and forth. 12379 09:24:19,720 --> 09:24:22,720 These are simply reading and writing tapes. 12380 09:24:22,720 --> 09:24:29,720 And that's how we did a lot of data processing because we could store far more on a tape drive than we could on a disk. 12381 09:24:29,720 --> 09:24:35,720 And with racks of tape drives, we could scale the storage that our computers had. 12382 09:24:35,720 --> 09:24:37,720 And so that's the way we did data processing. 
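The sequential master update described above can be sketched in a few lines of Python. This is only an illustration, not code from the lecture: the function name, the list-of-tuples layout, and the sample balances are all made up for the example, and it assumes both "tapes" are already sorted by account number and that every transaction refers to an account that exists on the master tape.

```python
def sequential_master_update(master, transactions):
    """Merge sorted transactions into a sorted master list in one pass.

    master:       list of (account_number, balance), sorted by account number
    transactions: list of (account_number, amount), sorted by account number
    Returns the new master list (the "new tape").
    """
    new_master = []
    t = 0  # current position on the transaction "tape"
    for account, balance in master:
        # Apply every transaction we are holding for this account.
        while t < len(transactions) and transactions[t][0] == account:
            balance += transactions[t][1]
            t += 1
        # Accounts with no transactions are simply copied through.
        new_master.append((account, balance))
    return new_master


# Hypothetical data: last night's balances and today's sorted transactions.
old_master = [(1, 100), (2, 50), (44, 10), (45, 200), (60, 75)]
todays_transactions = [(45, -20), (60, 30), (60, -5)]

new_master = sequential_master_update(old_master, todays_transactions)
print(new_master)
```

Each tape is read exactly once from front to back, which is why sorting the transactions first mattered so much.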
12383 09:24:37,720 --> 09:24:41,720 But it meant that the only way you knew what the old balance was 12384 09:24:41,720 --> 09:24:45,720 was it was the balance as of this morning before your bank started. 12385 09:24:45,720 --> 09:24:47,720 You don't know what the balance was for the day. 12386 09:24:47,720 --> 09:24:54,720 And that led to things like you can never withdraw more than $100 a day or something like that 12387 09:24:54,720 --> 09:24:56,720 because you don't know what the old balance was. 12388 09:24:56,720 --> 09:24:59,720 Or you might go withdraw $100 at a couple of different branches. 12389 09:24:59,720 --> 09:25:04,720 And so they weren't able to look your stuff up right away. 12390 09:25:04,720 --> 09:25:09,720 Now, it didn't take long until the disk drives got better and better and better. 12391 09:25:09,720 --> 09:25:15,720 And you could store the entire accounts, all the accounts and their current balances, on computers. 12392 09:25:15,720 --> 09:25:21,720 And then the problem becomes is what happens if sort of in the middle of the afternoon you want to update a balance? 12393 09:25:21,720 --> 09:25:25,720 Well, do you want to read all your data and then write a brand new one? 12394 09:25:25,720 --> 09:25:27,720 And say that takes like 10 minutes. 12395 09:25:27,720 --> 09:25:32,720 That means for that 10 minutes, only one person can be updating their bank balance. 12396 09:25:32,720 --> 09:25:37,720 And so because we could randomly access this data, we didn't have to read it all sequentially. 12397 09:25:37,720 --> 09:25:40,720 The trick was is how do you spread the data out? 12398 09:25:40,720 --> 09:25:43,720 And then how do you make it so you can change a balance? 12399 09:25:43,720 --> 09:25:45,720 This is, of course, second nature today. 12400 09:25:45,720 --> 09:25:49,720 But how do you make it so you change the balance here without changing the balance there? 
12401 09:25:49,720 --> 09:25:52,720 And you can have multiple people going simultaneously to these things. 12402 09:25:52,720 --> 09:25:57,720 And make sure that you can't say withdraw money at two different locations simultaneously 12403 09:25:57,720 --> 09:26:00,720 and somehow have your bank balance get corrupted by that. 12404 09:26:00,720 --> 09:26:02,720 So there's a lot of debate on how to do that. 12405 09:26:02,720 --> 09:26:05,720 And in early days, we just did sequential master update. 12406 09:26:05,720 --> 09:26:12,720 But increasingly, we wanted to make better use of the random nature of our computers and our storage. 12407 09:26:12,720 --> 09:26:15,720 And so that's what led to databases. 12408 09:26:15,720 --> 09:26:24,720 Databases are the science of how you make use of rotating random access data, permanent data, 12409 09:26:24,720 --> 09:26:29,720 in a way that allows you to read, modify, and update that simultaneously from many different locations. 12410 09:26:29,720 --> 09:26:32,720 And yet keep the data completely consistent. 12411 09:26:32,720 --> 09:26:36,720 And so this led to a study of a thing called relational databases. 12412 09:26:36,720 --> 09:26:41,720 And relational databases are not the only databases that happened. 12413 09:26:41,720 --> 09:26:43,720 We had many other kinds of databases. 12414 09:26:43,720 --> 09:26:44,720 And there was a debate. 12415 09:26:44,720 --> 09:26:49,720 And I remember in the 70s and the 80s, there was a folks that says, oh, no, no, there. 12416 09:26:49,720 --> 09:26:50,720 You can do index sequential. 12417 09:26:50,720 --> 09:26:51,720 That's the way to do it. 12418 09:26:51,720 --> 09:26:58,720 And relational databases weren't all that popular the first time that I saw them. 12419 09:26:58,720 --> 09:27:01,720 I didn't like relational databases. 12420 09:27:01,720 --> 09:27:06,720 Relational databases had an inherent advantage because they were based on some really powerful mathematics. 
12421 09:27:06,720 --> 09:27:11,720 And the interesting thing is, early on, the relational databases were slower. 12422 09:27:11,720 --> 09:27:17,720 But eventually, they figured out how to sort of bring all the cleverness to bear to make relational databases fast. 12423 09:27:17,720 --> 09:27:21,720 And so relational databases are a pretty advanced technology. 12424 09:27:21,720 --> 09:27:24,720 And there are companies like Oracle that are very, very wealthy. 12425 09:27:24,720 --> 09:27:29,720 And their primary product for many, many years was nothing more than a clever database product, 12426 09:27:29,720 --> 09:27:32,720 a clever piece of software that was really good at solving this problem. 12427 09:27:32,720 --> 09:27:36,720 And that's how important this problem was to computing. 12428 09:27:36,720 --> 09:27:39,720 If you read about databases, you're going to see two sets of terminology. 12429 09:27:39,720 --> 09:27:46,720 One set of terminology comes from the mathematical background and has to do with the underlying math, 12430 09:27:46,720 --> 09:27:49,720 things like relations, tuples, and attributes. 12431 09:27:49,720 --> 09:27:54,720 That's kind of like the fancy math version of it. 12432 09:27:54,720 --> 09:27:58,720 And programmers kind of think of them as rows and columns inside of a table. 12433 09:27:58,720 --> 09:28:03,720 And so if you look at sort of fancy theory, you'll see words that look like this. 12434 09:28:03,720 --> 09:28:05,720 And they're just full of this and the connection. 12435 09:28:05,720 --> 09:28:07,720 Now, all this is important and true. 12436 09:28:07,720 --> 09:28:13,720 And if you really want to get good, you sort of begin to understand the nature that we model data at connections 12437 09:28:13,720 --> 09:28:19,720 rather than at sort of intersection points rather than just modeling data as a flat file the way we do. 
12438 09:28:19,720 --> 09:28:26,720 But for now, we're going to, as programmers, think of this as just like, oh, it's like a super fast spreadsheet. 12439 09:28:26,720 --> 09:28:28,720 The super fast part is the math. 12440 09:28:28,720 --> 09:28:30,720 For us, the rows, columns, and tables are spreadsheets. 12441 09:28:30,720 --> 09:28:34,720 So think in a spreadsheet of sheets, sheet, sheet, sheet. 12442 09:28:34,720 --> 09:28:39,720 And that's like a table, a named thing like tracks or albums, artists or genres. 12443 09:28:39,720 --> 09:28:43,720 And then there is rows, and each row has a different kind of data. 12444 09:28:43,720 --> 09:28:44,720 And then there's columns. 12445 09:28:44,720 --> 09:28:49,720 And we sort of specialize the first column in many spreadsheets to say what's in there. 12446 09:28:49,720 --> 09:28:50,720 This is not really the data. 12447 09:28:50,720 --> 09:28:52,720 This is like metadata. 12448 09:28:52,720 --> 09:28:54,720 It's like the titles in this first column. 12449 09:28:54,720 --> 09:28:56,720 That's not really the data, and the data starts here. 12450 09:28:56,720 --> 09:29:02,720 And we have different kinds of data like strings and numbers, et cetera, et cetera, for each of the rows. 12451 09:29:02,720 --> 09:29:08,720 And literally, you can get away with this as sort of about 80% of databases. 12452 09:29:08,720 --> 09:29:10,720 It's just a really super cool spreadsheet. 12453 09:29:10,720 --> 09:29:15,720 But under the covers, it is far more powerful than that. 12454 09:29:15,720 --> 09:29:21,720 So one of the early arguments that happened was, again, what the programming model for this was. 12455 09:29:21,720 --> 09:29:27,720 And a lot of folks wanted a programming model that reflected how the data was actually stored. 
12456 09:29:27,720 --> 09:29:34,720 The notion of structured query language came about as a way to express what you wanted to happen 12457 09:29:34,720 --> 09:29:37,720 and allow that to be sort of a very abstract expression. 12458 09:29:37,720 --> 09:29:40,720 Select all records that meet this criteria. 12459 09:29:40,720 --> 09:29:43,720 Not read, read, read, read, read, read. 12460 09:29:43,720 --> 09:29:47,720 And so structured query language is not a procedural language. 12461 09:29:47,720 --> 09:29:52,720 It is a declarative language where you're simply saying what you want. 12462 09:29:52,720 --> 09:29:54,720 And then somebody writes the loop. 12463 09:29:54,720 --> 09:29:58,720 The database actually does the loop, but it's a way for you to avoid actually writing the loop. 12464 09:29:58,720 --> 09:30:00,720 Now, that turns out to be the power of databases. 12465 09:30:00,720 --> 09:30:05,720 Because the cleverness in how to write the loop is a way that you would probably never figure out 12466 09:30:05,720 --> 09:30:09,720 how to be most supremely optimal when it comes to writing the loop. 12467 09:30:09,720 --> 09:30:13,720 As you'll see toward the end, when joining many tables together and selecting and throwing away 12468 09:30:13,720 --> 09:30:15,720 and getting down to a count or whatever. 12469 09:30:15,720 --> 09:30:18,720 Someone has figured out how to do that really, really well. 12470 09:30:18,720 --> 09:30:22,720 So the idea was, is you would express, you know, we're going to create some data, 12471 09:30:22,720 --> 09:30:25,720 we're going to retrieve some data, we're going to insert and delete it. 12472 09:30:25,720 --> 09:30:27,720 Create, read, crud. 12473 09:30:27,720 --> 09:30:30,720 C-R-U-D. 12474 09:30:30,720 --> 09:30:33,720 Create, read, update, and delete, crud. 12475 09:30:33,720 --> 09:30:34,720 And so that's what this does. 12476 09:30:34,720 --> 09:30:37,720 It's a language that does this very simply. 
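To make the "say what you want, let the database write the loop" idea concrete, here is a small sketch that does the same filtering twice: once with a hand-written Python loop and once with a single SQL statement. The table name `Accounts`, its columns, and the sample rows are invented for this illustration; the database lives in memory, so nothing is written to disk.

```python
import sqlite3

# Hypothetical sample data: (name, account number)
rows = [("Chuck", 45), ("Sally", 60), ("Ted", 7)]

# Procedural style: we write the read, read, read loop ourselves.
loop_result = []
for name, account in rows:
    if account > 40:
        loop_result.append(name)

# Declarative style: we state the criteria and the database does the loop.
conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE Accounts (name TEXT, account INTEGER)")
cur.executemany("INSERT INTO Accounts (name, account) VALUES (?, ?)", rows)
cur.execute("SELECT name FROM Accounts WHERE account > 40 ORDER BY name")
sql_result = [row[0] for row in cur.fetchall()]
conn.close()

print(loop_result)
print(sql_result)
```

With three rows the two approaches are equally easy. The payoff of the SQL version shows up when the data is large and the database can choose a smarter strategy than a plain scan.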
12477 09:30:37,720 --> 09:30:43,720 Now, the applications that we're going to use this for are more of a data analysis application. 12478 09:30:43,720 --> 09:30:46,720 We've been doing data analysis through the whole course. 12479 09:30:46,720 --> 09:30:51,720 And the kinds of things that we'll see in the remaining chapters is we'll take some raw data file. 12480 09:30:51,720 --> 09:30:53,720 These might actually come across the network. 12481 09:30:53,720 --> 09:31:00,720 And we'll write some Python programs to play with that data, parse it, clean it up, make sense of it, you know. 12482 09:31:00,720 --> 09:31:02,720 And then write it into a database. 12483 09:31:02,720 --> 09:31:07,720 And this might be a slow processor, this might be really nasty, and this might be a way to have very clean data. 12484 09:31:07,720 --> 09:31:13,720 And then we'll write another Python program to sort of read this, read through it, and it's all efficient and pretty. 12485 09:31:13,720 --> 09:31:22,720 And then we can produce files and maybe we'll visualize it or do further analysis in our Excel or JavaScript visualization framework. 12486 09:31:22,720 --> 09:31:30,720 And so in this situation, you will be the person who is both sort of writing the programs, database administrator, 12487 09:31:30,720 --> 09:31:35,720 and you can, using SQLite Browser, play and look at the database kind of in a raw way. 12488 09:31:35,720 --> 09:31:40,720 And the first part of this, we are mostly going to be using SQLite Browser just to talk straight to a database. 12489 09:31:40,720 --> 09:31:46,720 Later, we'll write Python programs that read and write data and visualize the data. 12490 09:31:46,720 --> 09:31:48,720 So this is what we're going to do first. 12491 09:31:48,720 --> 09:31:51,720 And then second, we're going to do this part right here. 12492 09:31:51,720 --> 09:31:53,720 That's the second thing we're going to do. 
12493 09:31:53,720 --> 09:31:59,720 Now, another really common use of applications and something that if you continue learning more about programming, 12494 09:31:59,720 --> 09:32:08,720 is that you will want to write an online application like Amazon or a company or Twitter 12495 09:32:08,720 --> 09:32:12,720 that's got a website and it stores dynamic data in databases. 12496 09:32:12,720 --> 09:32:17,720 And so the picture for that is similar but different than the picture we're going to start out with. 12497 09:32:17,720 --> 09:32:23,720 And so the way this usually works is that you, the end user, uses a web browser, talks to the application, 12498 09:32:23,720 --> 09:32:27,720 and the developer writes the application software. 12499 09:32:27,720 --> 09:32:31,720 And that application software stores its data in a database. 12500 09:32:31,720 --> 09:32:35,720 And inside that database, we talk to the database using SQL. 12501 09:32:35,720 --> 09:32:38,720 And all the data is actually stored here and the magic happens. 12502 09:32:38,720 --> 09:32:42,720 The data server is that database software that's so precious and valuable. 12503 09:32:42,720 --> 09:32:48,720 And then there's another person often called the database administrator who has access to the direct access to the data. 12504 09:32:48,720 --> 09:33:01,720 And these roles in medium and large projects are kept separate mostly because the production, 12505 09:33:01,720 --> 09:33:07,720 while it's running and live, the developer leaves the data alone and works on, say, the next version of the software. 12506 09:33:07,720 --> 09:33:15,720 And then the developer has a test version of the application that they run on their computer where they're doing all that stuff. 12507 09:33:15,720 --> 09:33:23,720 And so this database administrator is a role in a large project where we have to run production and keep production careful, 12508 09:33:23,720 --> 09:33:26,720 keep production in good shape. 
12509 09:33:26,720 --> 09:33:30,720 So the database administrator has this responsibility for the production aspects of the data. 12510 09:33:30,720 --> 09:33:34,720 And you may be working in a situation where you're not actually controlling the data. 12511 09:33:34,720 --> 09:33:36,720 The database server is on different computers. 12512 09:33:36,720 --> 09:33:40,720 You have a little special access and you write programs to sort of read the data. 12513 09:33:40,720 --> 09:33:47,720 And so the database administrator is the person who is asked by the organization to administer that data. 12514 09:33:47,720 --> 09:33:54,720 The data that we develop, and we'll do this in the second part of these lectures, conforms to a data model. 12515 09:33:54,720 --> 09:33:55,720 That's the metadata. 12516 09:33:55,720 --> 09:33:56,720 Is this an integer? 12517 09:33:56,720 --> 09:33:57,720 Is this a string? 12518 09:33:57,720 --> 09:33:59,720 You know, how many columns is this? 12519 09:33:59,720 --> 09:34:02,720 And the data model turns out to be very, very important. 12520 09:34:02,720 --> 09:34:06,720 And there's a lot of science to building an effective data model that leads to really good performance. 12521 09:34:06,720 --> 09:34:14,720 And it's a collaborative activity between the application developers and the database administrator to make it so it's efficient, 12522 09:34:14,720 --> 09:34:17,720 runs in production, et cetera, et cetera, et cetera. 12523 09:34:17,720 --> 09:34:21,720 There's a lot of products out there that you may encounter. 12524 09:34:21,720 --> 09:34:22,720 We're going to be using SQLite. 12525 09:34:22,720 --> 09:34:27,720 SQLite's a little tiny database server, and it's built into so many things, and that's why we like it. 12526 09:34:27,720 --> 09:34:34,720 But if you're going to work at a large organization, you can easily run into Oracle, which is the number one commercial product. 
12527 09:34:34,720 --> 09:34:40,720 Microsoft has a thing called SQL Server, which is a commercial product, and it's also very popular and very effective. 12528 09:34:40,720 --> 09:34:45,720 Among the more popular open source ones, there's a thing called Postgres. 12529 09:34:45,720 --> 09:34:46,720 There's MySQL. 12530 09:34:46,720 --> 09:34:49,720 And MySQL recently was sort of bought by Oracle. 12531 09:34:49,720 --> 09:34:56,720 And there is a copy of that called MariaDB that doesn't belong to Oracle, MariaDB. 12532 09:34:56,720 --> 09:35:06,720 And so most of the SQL that we're going to learn is common across these database systems because SQL is a standard. 12533 09:35:06,720 --> 09:35:11,720 But then there are parts that weren't part of the original standard where each database vendor has done things a little bit different. 12534 09:35:11,720 --> 09:35:18,720 But there is a core common subset that does the basic create, read, update, and delete operations. 12535 09:35:18,720 --> 09:35:21,720 So SQLite is very popular. 12536 09:35:21,720 --> 09:35:24,720 You probably have it in your cell phone 10, 12 times. 12537 09:35:24,720 --> 09:35:26,720 Your web browser has a database engine in it. 12538 09:35:26,720 --> 09:35:29,720 Your car has a few databases in it. 12539 09:35:29,720 --> 09:35:33,720 And so SQLite is what's called an embedded database system. 12540 09:35:33,720 --> 09:35:35,720 Python comes built in with it. 12541 09:35:35,720 --> 09:35:39,720 You just import sqlite3 and away you go. 12542 09:35:39,720 --> 09:35:49,720 And so it's very, very popular because it's free, it's open source, and it's such a tiny little piece of software that you just include it in other pieces of software 12543 09:35:49,720 --> 09:35:53,720 and use it to solve the data management problems of those pieces of software. 12544 09:35:53,720 --> 09:35:56,720 Like your browser might use SQLite to store your bookmarks. 
12545 09:35:56,720 --> 09:35:59,720 Now you think, oh, there's only how many bookmarks can you have. 12546 09:35:59,720 --> 09:36:01,720 But what if there you need it to be fast? 12547 09:36:01,720 --> 09:36:03,720 And what if there's like people that have 10,000 bookmarks? 12548 09:36:03,720 --> 09:36:04,720 There probably are. 12549 09:36:04,720 --> 09:36:05,720 Do you still want it fast? 12550 09:36:05,720 --> 09:36:06,720 Do you want to be able to search? 12551 09:36:06,720 --> 09:36:11,720 And so you get all that by using a database like SQLite. 12552 09:36:11,720 --> 09:36:18,720 And so again, we're going to encourage you to download the SQLite browser so you can follow along with what we're going to do coming up next. 12553 09:36:18,720 --> 09:36:20,720 And so here is the SQLite browser. 12554 09:36:20,720 --> 09:36:22,720 Here's what it looks like. 12555 09:36:22,720 --> 09:36:24,720 And it's just a desktop application. 12556 09:36:24,720 --> 09:36:33,720 And coming up next, we'll start playing with this desktop application and see how it works. 12557 09:36:33,720 --> 09:36:34,720 So now we're going to make a database. 12558 09:36:34,720 --> 09:36:36,720 We're going to use SQLite browser. 12559 09:36:36,720 --> 09:36:39,720 Hopefully you've downloaded it so you can follow along. 12560 09:36:39,720 --> 09:36:44,720 And I've got this handout, this basic database handout that saves you from having to type all these things. 12561 09:36:44,720 --> 09:36:48,720 So bring that up in your web browser. 12562 09:36:48,720 --> 09:36:51,720 And so that gives you all of the commands that I'm going to type now. 12563 09:36:51,720 --> 09:37:03,720 And so you could pull them out of the, either the web page or the, you can pull them out of the slides or you can pull them out of that, out of that. 12564 09:37:03,720 --> 09:37:07,720 So I'm going to bring up the database browser here. 12565 09:37:07,720 --> 09:37:09,720 Database browser. 
12566 09:37:09,720 --> 09:37:12,720 Now the thing that's going to happen, you'll see this happen on my desktop. 12567 09:37:12,720 --> 09:37:15,720 I'm going to make a new database and you have to store it somewhere. 12568 09:37:15,720 --> 09:37:24,720 And so I'm going to put it on my desktop and I'm going to call it py4efund. 12569 09:37:24,720 --> 09:37:29,720 And so we should see a new file on my desktop right there, py4efund. 12570 09:37:29,720 --> 09:37:34,720 Now that's a file that you don't want to edit with a text editor or anything like that. 12571 09:37:34,720 --> 09:37:42,720 This is a file that's to be read by SQLite browser and nothing else. 12572 09:37:42,720 --> 09:37:51,720 Okay, so we're going to create a table and I'm going to make a table called users with a column called name that's text and a column called email. 12573 09:37:51,720 --> 09:37:54,720 So I'm going to, it's already asking me to make a table. 12574 09:37:54,720 --> 09:38:02,720 I'm going to call this users and I'm going to add a field that is called name and I'm going to make that be text. 12575 09:38:02,720 --> 09:38:07,720 And I'm going to add another field called email and I'm going to make that be text. 12576 09:38:07,720 --> 09:38:15,720 Now the key thing here is we are in effect making columns and rendering an opinion as to exactly what the column is supposed to be used for. 12577 09:38:15,720 --> 09:38:17,720 And we're not allowed to violate that. 12578 09:38:17,720 --> 09:38:26,720 It's not like, oh, we'll do whatever you want because the database is optimizing its storage based on our contract, and we're effectively making the contract ourselves. 12579 09:38:26,720 --> 09:38:33,720 We could make these columns anything we wanted, but we're going to contract with ourselves. 12580 09:38:33,720 --> 09:38:34,720 And you can see it's kind of small here. 
12581 09:38:34,720 --> 09:38:41,720 You can see there's a create table and that's on the slide and that's the SQL way of doing that. 12582 09:38:41,720 --> 09:38:44,720 This user interface is just helping us write SQL. 12583 09:38:44,720 --> 09:38:46,720 So now I'm going to just say, okay. 12584 09:38:46,720 --> 09:38:58,720 And if you take a look, you can see that I now have a table users and I can look at my database structure and the table users and away we go. 12585 09:38:58,720 --> 09:39:08,720 And so now that is creating it. And like I said, here in the slides is the create statement or on the web page, there's the create statement that could have done it. 12586 09:39:08,720 --> 09:39:14,720 Now we can insert some data. 12587 09:39:14,720 --> 09:39:28,720 Let's add a new record to this table users and we'll give this one the name Charles and the email csev at umich.edu. 12588 09:39:28,720 --> 09:39:32,720 So now we have a record. So it's kind of like a database spreadsheet. 12589 09:39:32,720 --> 09:39:43,720 Now that's not the SQL way to do it. There's SQL sort of going on in the background, but if we really want to do this using SQL, we're going to use the insert statement. 12590 09:39:43,720 --> 09:39:51,720 And the insert statement looks like this. 12591 09:39:51,720 --> 09:39:58,720 The SQL syntax sometimes has extra words. Insert into is actually a pair of SQL keywords. 12592 09:39:58,720 --> 09:40:08,720 The name of the table, the columns, and then the word values, and then a one to one correspondence between the columns and the values in parentheses. 12593 09:40:08,720 --> 09:40:13,720 So it looks kind of like a tuple in Python, but we're nowhere near Python right now. 12594 09:40:13,720 --> 09:40:20,720 Okay, and so that's what we're going to do. And so I'm going to grab this. 12595 09:40:20,720 --> 09:40:28,720 Kristen, and I'm going to go over here to my SQLite browser and say execute SQL. 
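The lecture does this step by hand in SQLite Browser, but the same two statements can be tried from Python as a rough sketch. The `Users` table with `name` and `email` text columns follows the description above; the in-memory database and the second row's email address (`kristen@example.com`) are placeholders made up for this example, not values from the handout.

```python
import sqlite3

# An in-memory database keeps this sketch from touching any file on disk.
conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# CREATE TABLE names the table and states what each column holds.
cur.execute("CREATE TABLE Users (name TEXT, email TEXT)")

# INSERT INTO <table> (<columns>) VALUES (<values>), matched one to one.
cur.execute("INSERT INTO Users (name, email) VALUES ('Charles', 'csev@umich.edu')")

# The same statement with ? placeholders; the email here is hypothetical.
cur.execute(
    "INSERT INTO Users (name, email) VALUES (?, ?)",
    ("Kristen", "kristen@example.com"),
)
conn.commit()

cur.execute("SELECT COUNT(*) FROM Users")
user_count = cur.fetchone()[0]
print(user_count)
conn.close()
```

The `?` placeholder form is what a program would normally use, since it lets the library handle quoting of the values.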
12596 09:40:28,720 --> 09:40:37,720 So now I can say paste that in and then hit this little run button and that's going to submit the SQL to SQLite and then update that file. 12597 09:40:37,720 --> 09:40:41,720 And it says query executed successfully and away we go. 12598 09:40:41,720 --> 09:40:46,720 So if I go back now and I look at the data, I see that there's two things in here. 12599 09:40:46,720 --> 09:40:51,720 And now I can actually insert all the rest of these. Let's go back to my little bit of stuff here. 12600 09:40:51,720 --> 09:41:04,720 Let's put all these other rows in. It turns out that if I go into the execute SQL and I want to do more than one command at a time, 12601 09:41:04,720 --> 09:41:10,720 I can put a semicolon at the end of each one of these things and then I can run them all at the same time. 12602 09:41:10,720 --> 09:41:13,720 I mean, one after another actually is what's going on here. 12603 09:41:13,720 --> 09:41:18,720 So boom, boom, boom, and I take a look at the data and look, I've got all those things in there. 12604 09:41:18,720 --> 09:41:23,720 Now, eventually the thing that's going to generate that SQL is a program, not us. 12605 09:41:23,720 --> 09:41:27,720 This is we're being the database administrator, so we're sort of doing things manually. 12606 09:41:27,720 --> 09:41:32,720 Once things get going, you write programs, do that insert over and over and over again in Python 12607 09:41:32,720 --> 09:41:35,720 or a web language like PHP or something like that. 12608 09:41:35,720 --> 09:41:39,720 And so that is the insert. 12609 09:41:39,720 --> 09:41:42,720 Now, we can get rid of data. 12610 09:41:42,720 --> 09:41:45,720 And so I'm going to say delete from, that's the key word. 12611 09:41:45,720 --> 09:41:48,720 Users is the name of a table. Where is a where clause? 12612 09:41:48,720 --> 09:41:52,720 We'll have lots of where clauses in SQL, which is, it's not like an if. 
12613 09:41:52,720 --> 09:41:58,720 In effect, the delete is going towards the whole table and being turned on and off by this where clause. 12614 09:41:58,720 --> 09:42:02,720 So delete from users, if you didn't put the where clause on, will actually delete all the rows. 12615 09:42:02,720 --> 09:42:11,720 But where email equals ted at umich.edu, well, that one is going to make it so it only applies to the rows where that is true. 12616 09:42:11,720 --> 09:42:18,720 So I'm going to go over here in SQL and I'm going to say delete from users where email equals ted at umich.edu 12617 09:42:18,720 --> 09:42:20,720 and then I'm going to run it because it's only one. 12618 09:42:20,720 --> 09:42:22,720 I don't need a semicolon at the end of it. 12619 09:42:22,720 --> 09:42:26,720 And now if I go back and I look at the data, Ted is gone. 12620 09:42:26,720 --> 09:42:29,720 Okay. 12621 09:42:29,720 --> 09:42:34,720 Update. So the update says, 12622 09:42:34,720 --> 09:42:38,720 update is a keyword, users is the name of the table, set is a keyword, 12623 09:42:38,720 --> 09:42:41,720 and then this is column equals new value, and then a where clause. 12624 09:42:41,720 --> 09:42:46,720 Again, this update, if we didn't have a where clause, would change every row in the table. 12625 09:42:46,720 --> 09:42:55,720 And so where email equals csev at umich.edu. 12626 09:42:55,720 --> 09:42:59,720 Oh, I got to change that because I already got the name to be Charles. 12627 09:42:59,720 --> 09:43:00,720 So you see the name is already Charles. 12628 09:43:00,720 --> 09:43:05,720 So I'll just execute here. 12629 09:43:05,720 --> 09:43:07,720 Make this be Chuck. So we see it. 12630 09:43:07,720 --> 09:43:09,720 And then I run it. 12631 09:43:09,720 --> 09:43:12,720 Then you take a look at the data and it's changed. 12632 09:43:12,720 --> 09:43:13,720 That's it. 12633 09:43:13,720 --> 09:43:15,720 That's an update statement. 12634 09:43:15,720 --> 09:43:16,720 We're doing, you're doing great. 
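The delete and update steps above can be reproduced as a short Python sketch. The two statements mirror what was typed into SQLite Browser; the in-memory database, the setup rows, and the third user (`sally@example.com`) are assumptions added so the example runs on its own.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE Users (name TEXT, email TEXT)")
cur.executemany(
    "INSERT INTO Users (name, email) VALUES (?, ?)",
    [
        ("Charles", "csev@umich.edu"),
        ("Ted", "ted@umich.edu"),
        ("Sally", "sally@example.com"),  # hypothetical extra row
    ],
)

# Without the WHERE clause this would delete every row in the table.
cur.execute("DELETE FROM Users WHERE email = 'ted@umich.edu'")
deleted = cur.rowcount  # number of rows the DELETE removed

# Without the WHERE clause this would rename every user to Chuck.
cur.execute("UPDATE Users SET name = 'Chuck' WHERE email = 'csev@umich.edu'")
updated = cur.rowcount  # number of rows the UPDATE changed
conn.commit()

cur.execute("SELECT name, email FROM Users ORDER BY email")
remaining = cur.fetchall()
print(deleted, updated, remaining)
conn.close()
```

Checking `rowcount` after a delete or update is a cheap way to confirm that the where clause matched the rows you expected and nothing more.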
12635 09:43:16,720 --> 09:43:18,720 You're doing great. 12636 09:43:18,720 --> 09:43:30,720 And so the next thing we're going to do is we're going to take a look at how we retrieve data. 12637 09:43:30,720 --> 09:43:33,720 Now this is the select statement, select star. 12638 09:43:33,720 --> 09:43:38,720 You have a list of columns and star means all columns from is a keyword and then the name of a table. 12639 09:43:38,720 --> 09:43:42,720 So this select star from users is the kind of thing you type all the time. 12640 09:43:42,720 --> 09:43:46,720 As a matter of fact, it's what SQLite browser is doing internally to cause this to happen. 12641 09:43:46,720 --> 09:43:51,720 But we can do it by hand by saying select star from users and then run it. 12642 09:43:51,720 --> 09:43:56,720 And so then we get a little record set that is those four records that are sitting there. 12643 09:43:56,720 --> 09:43:58,720 We can also throw a where clause on the end of it. 12644 09:43:58,720 --> 09:44:05,720 So we say select star from users where email equals csev at umich.edu. 12645 09:44:05,720 --> 09:44:11,720 And that again, the select star from users goes at the whole table and the where clause goes at the whole table 12646 09:44:11,720 --> 09:44:14,720 and then filters out all of the things except one record. 12647 09:44:14,720 --> 09:44:19,720 So the where clause is send it to the table but then filter based on whatever. 12648 09:44:19,720 --> 09:44:22,720 And so it only shows us that. 12649 09:44:22,720 --> 09:44:26,720 Okay, we're cruising right along here. 12650 09:44:26,720 --> 09:44:30,720 You can also put an order by clause on there. 12651 09:44:30,720 --> 09:44:34,720 So we can say select star from users order by email. 12652 09:44:34,720 --> 09:44:37,720 So that's a column. 12653 09:44:37,720 --> 09:44:40,720 Select star from users order by email. 12654 09:44:40,720 --> 09:44:42,720 And so that orders by email. 
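Here is a sketch of the select statements from this part of the lecture, run from Python so the results can be inspected. The `Users` table and the `csev@umich.edu` row come from the walkthrough; the other three rows and their `example.com` addresses are invented to give the sorting something to work on. The last query sorts by name in descending order, which is a small variation on the order by clause.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE Users (name TEXT, email TEXT)")
cur.executemany(
    "INSERT INTO Users (name, email) VALUES (?, ?)",
    [
        ("Chuck", "csev@umich.edu"),
        ("Sally", "sally@example.com"),      # hypothetical row
        ("Fred", "fred@example.com"),        # hypothetical row
        ("Kristen", "kristen@example.com"),  # hypothetical row
    ],
)

# SELECT * means all columns; with no WHERE clause we get every row.
cur.execute("SELECT * FROM Users")
all_rows = cur.fetchall()

# The WHERE clause filters the rows down to the ones that match.
cur.execute("SELECT * FROM Users WHERE email = 'csev@umich.edu'")
one_row = cur.fetchall()

# ORDER BY sorts on a column; DESC reverses the direction.
cur.execute("SELECT email FROM Users ORDER BY email")
emails_sorted = [row[0] for row in cur.fetchall()]
cur.execute("SELECT name FROM Users ORDER BY name DESC")
names_descending = [row[0] for row in cur.fetchall()]

print(len(all_rows), one_row, emails_sorted, names_descending)
conn.close()
```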
12655 09:44:42,720 --> 09:44:48,720 Or we can change it to order by name and we can say descending. 12656 09:44:48,720 --> 09:44:52,720 So that's the name and descending order. 12657 09:44:52,720 --> 09:44:59,720 Sorting and selecting are things that databases are really good at. 12658 09:44:59,720 --> 09:45:02,720 So this is the summary of what I've told you. 12659 09:45:02,720 --> 09:45:05,720 So databases do create, read, update and delete: CRUD. 12660 09:45:05,720 --> 09:45:10,720 And we've done all those things, except we did create, delete, update, read. 12661 09:45:10,720 --> 09:45:11,720 That's what we did. 12662 09:45:11,720 --> 09:45:13,720 And that's the summary of SQL. 12663 09:45:13,720 --> 09:45:19,720 And so you might be saying why did I take so long to learn such a simple and elegant and beautiful language 12664 09:45:19,720 --> 09:45:21,720 because it's not really exciting. 12665 09:45:21,720 --> 09:45:27,720 It's an extremely simple language that's very predictable and you're like, that's pretty easy. 12666 09:45:27,720 --> 09:45:34,720 And it turns out that some of you may have been using SQL in situations maybe with Microsoft Access or something, 12667 09:45:34,720 --> 09:45:37,720 or actually typing in this stuff, and you just kind of typed it 12668 09:45:37,720 --> 09:45:40,720 and you never realized that you were learning a programming language. 12669 09:45:40,720 --> 09:45:44,720 That's why I like SQL: it's a very declarative language and it's very straightforward. 12670 09:45:44,720 --> 09:45:49,720 It's much easier to learn SQL than it is to learn Python. 12671 09:45:49,720 --> 09:45:53,720 Because in Python you have to figure out how loops work and how iteration variables work 12672 09:45:53,720 --> 09:45:55,720 and you'll notice there's none of that. 12673 09:45:55,720 --> 09:45:59,720 But the key is we've only started to understand the power.
12674 09:45:59,720 --> 09:46:06,720 That's the simple ability to move around and update data and read data randomly using these simple sets of commands. 12675 09:46:06,720 --> 09:46:14,720 But up next we're going to look at how you do this with data models and relationships and really multiple tables. 12676 09:46:19,720 --> 09:46:21,720 Hello and welcome to a code walkthrough. 12677 09:46:21,720 --> 09:46:25,720 In this bit of code we're talking about emaildb.py. 12678 09:46:25,720 --> 09:46:34,720 This is a beautiful little example in that it sort of reduces talking to the database to kind of its pure essence. 12679 09:46:34,720 --> 09:46:40,720 And so we'll start out this code and we import sqlite3 just to get the library there. 12680 09:46:40,720 --> 09:46:46,720 We make a connection and in databases we sort of end up with an open that's two steps. 12681 09:46:46,720 --> 09:46:53,720 There's the connection to the database which checks access to the file and the cursor is kind of like our handle. 12682 09:46:53,720 --> 09:47:00,720 It's not as simple as you just open it and read it but you open it and then you send SQL commands through the cursor 12683 09:47:00,720 --> 09:47:03,720 and then you get your responses through that same cursor. 12684 09:47:03,720 --> 09:47:07,720 So cur here is the variable that we're interested in. 12685 09:47:07,720 --> 09:47:13,720 And the first thing to notice is that we've got this file. 12686 09:47:13,720 --> 09:47:17,720 It will either open this file or create it, and right now this file doesn't exist. 12687 09:47:17,720 --> 09:47:21,720 It's going to be in the same directory. 12688 09:47:21,720 --> 09:47:25,720 There's no emaildb.sqlite. 12689 09:47:25,720 --> 09:47:28,720 So this is actually going to create the file when it runs. 12690 09:47:28,720 --> 09:47:33,720 And then the first thing we're going to do is drop the table if it exists. Drop table is a bit of SQL.
12691 09:47:33,720 --> 09:47:37,720 The if exists just keeps this from blowing up if we start it with a fresh database. 12692 09:47:37,720 --> 09:47:41,720 And in this case there is no file there so we are starting with a fresh database. 12693 09:47:41,720 --> 09:47:45,720 So this will accomplish absolutely nothing which is just fine. 12694 09:47:45,720 --> 09:47:47,720 Now we're using triple quotes here. 12695 09:47:47,720 --> 09:47:50,720 I'm just kind of using that to make this a little bit easier to read. 12696 09:47:50,720 --> 09:47:53,720 I probably could pull those lines up a bit. 12697 09:47:53,720 --> 09:47:57,720 This one's actually small enough that I could, maybe I'll just do that. 12698 09:47:57,720 --> 09:48:03,720 Let's do that. Let's bring that baby right up and turn this into a single-quoted string. 12699 09:48:03,720 --> 09:48:06,720 That's short enough. 12700 09:48:06,720 --> 09:48:10,720 But triple quote is just, this one here is a little longer so I'll use triple quote. 12701 09:48:10,720 --> 09:48:13,720 So we're going to drop table. That's going to do nothing first time through. 12702 09:48:13,720 --> 09:48:15,720 Then we're going to do a create table. 12703 09:48:15,720 --> 09:48:18,720 Now sometimes your application will have like a README or something. 12704 09:48:18,720 --> 09:48:21,720 It says go run these commands to set the database up. 12705 09:48:21,720 --> 09:48:25,720 But we're able to just set this database up in this particular application. 12706 09:48:25,720 --> 09:48:30,720 We'll see later ones where we're going to leave the database and not start it fresh. 12707 09:48:30,720 --> 09:48:33,720 And in this one we can do the same. 12708 09:48:33,720 --> 09:48:38,720 But in this one we could but we're just going to start fresh by dropping the table. 12709 09:48:38,720 --> 09:48:40,720 So we'll create it. 12710 09:48:40,720 --> 09:48:44,720 We're going to have an email and a count.
12711 09:48:44,720 --> 09:48:49,720 Basically what we're doing here is we're really going to pretend that this is a dictionary. 12712 09:48:49,720 --> 09:48:53,720 If you recall when I said dictionary, a dictionary is like an in-memory database. 12713 09:48:53,720 --> 09:48:56,720 Well, now we're using a database to do a database. 12714 09:48:56,720 --> 09:49:00,720 But the first thing we're going to do here is pretend it's a dictionary. 12715 09:49:00,720 --> 09:49:01,720 So that's a little crazy. 12716 09:49:01,720 --> 09:49:04,720 So these next lines of code hopefully are pretty familiar to you, right? 12717 09:49:04,720 --> 09:49:12,720 Get a file name, loop through it, check to see if it's, you know, grab mbox-short by default 12718 09:49:12,720 --> 09:49:15,720 so we can press the enter key and then loop through it, right? 12719 09:49:15,720 --> 09:49:21,720 And so this little part right here, this is our basic loop that we're doing. 12720 09:49:21,720 --> 09:49:25,720 And so, you know, that is pretty normal. 12721 09:49:25,720 --> 09:49:31,720 And if we look at this line right here, that line right there is the line that is, 12722 09:49:31,720 --> 09:49:36,720 that line right there makes sure that we only get the From lines. 12723 09:49:36,720 --> 09:49:38,720 We've done that a bunch of times and we're going to split it. 12724 09:49:38,720 --> 09:49:42,720 We're not going to strip the right because the split's going to take care of that. 12725 09:49:42,720 --> 09:49:48,720 And then we're going to grab the email address, which of course in the From line is the second part. 12726 09:49:48,720 --> 09:49:52,720 And then we will have that. 12727 09:49:52,720 --> 09:49:53,720 So now we're going to do some database. 12728 09:49:53,720 --> 09:50:01,720 So the first thing we're going to do, this bit right here is kind of like the dictionary part.
12729 09:50:01,720 --> 09:50:05,720 So the first thing that we're going to do is we're going to select count from our database, 12730 09:50:05,720 --> 09:50:08,720 that is an integer, where email equals. 12731 09:50:08,720 --> 09:50:12,720 And this part right here bears some explaining. 12732 09:50:12,720 --> 09:50:15,720 This is going to be csev@umich.edu or whatever. 12733 09:50:15,720 --> 09:50:24,720 Now, it is dangerous to put those strings, especially from user-entered data, into your SQL. 12734 09:50:24,720 --> 09:50:25,720 You technically could. 12735 09:50:25,720 --> 09:50:30,720 I could make this be email equals csev@umich.edu. 12736 09:50:30,720 --> 09:50:31,720 I'd have to escape the quotes and stuff. 12737 09:50:31,720 --> 09:50:34,720 But this question mark is a placeholder. 12738 09:50:34,720 --> 09:50:39,720 And this is a way to basically make sure that we don't allow SQL injection. 12739 09:50:39,720 --> 09:50:43,720 Go Google SQL injection to get a sense of what that is. 12740 09:50:43,720 --> 09:50:48,720 It's more of an issue in online applications. 12741 09:50:48,720 --> 09:50:53,720 But in this application, we're just being good. 12742 09:50:53,720 --> 09:50:58,720 So the way this works is this is a placeholder in this SQL that will ultimately be replaced by this. 12743 09:50:58,720 --> 09:51:00,720 Now, you could have several question marks. 12744 09:51:00,720 --> 09:51:02,720 We only have one in here. 12745 09:51:02,720 --> 09:51:04,720 And so you give a tuple. 12746 09:51:04,720 --> 09:51:07,720 And if we just put email, it won't turn into a tuple. 12747 09:51:07,720 --> 09:51:09,720 This is a one-tuple, basically. 12748 09:51:09,720 --> 09:51:14,720 This little weird parenthesis, email, comma, parenthesis. 12749 09:51:14,720 --> 09:51:16,720 That is a tuple with only one thing in it. 12750 09:51:16,720 --> 09:51:19,720 And that's just the weird Python syntax.
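The question mark placeholder and the one-item tuple are easier to see in a small experiment. This is a sketch, not the lecture's emaildb.py: the table is created in memory and the sample row is an assumption.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE Counts (email TEXT, count INTEGER)")
cur.execute("INSERT INTO Counts (email, count) VALUES (?, ?)", ("csev@umich.edu", 4))

email = "csev@umich.edu"
one_tuple = (email,)    # the trailing comma makes a tuple with one item
not_a_tuple = (email)   # parentheses alone leave it a plain string

# Each ? is filled, in order, from the tuple passed as the second argument.
cur.execute("SELECT count FROM Counts WHERE email = ?", one_tuple)
row = cur.fetchone()    # a tuple of the selected columns, or None if no match

# A bare string is treated as a sequence of characters, so the number of
# values no longer matches the single placeholder and sqlite3 raises an error.
try:
    cur.execute("SELECT count FROM Counts WHERE email = ?", not_a_tuple)
    bare_string_worked = True
except sqlite3.Error:
    bare_string_worked = False

conn.close()
```

Because the value travels separately from the SQL text, quotes inside it never need escaping.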
12751 09:51:19,720 --> 09:51:21,720 It's rare that I apologize for Python syntax. 12752 09:51:21,720 --> 09:51:24,720 But that's a little bit less than pretty. 12753 09:51:24,720 --> 09:51:25,720 But it's OK. 12754 09:51:25,720 --> 09:51:26,720 It's a tuple. 12755 09:51:26,720 --> 09:51:31,720 And normally, if there were two of these, then there would be email, name, dot, dot, dot, dot. 12756 09:51:31,720 --> 09:51:33,720 OK? 12757 09:51:33,720 --> 09:51:38,720 So this cur.execute is actually not really retrieving the data. 12758 09:51:38,720 --> 09:51:44,720 In a way, it's looking at the SQL and making sure that maybe it might verify that the table name is right 12759 09:51:44,720 --> 09:51:46,720 or if there's any syntax errors, et cetera, et cetera. 12760 09:51:46,720 --> 09:51:49,720 So this actually is not really reading the data. 12761 09:51:49,720 --> 09:51:52,720 But we have prepared this cursor. 12762 09:51:52,720 --> 09:51:55,720 This is kind of like the opening of a file. 12763 09:51:55,720 --> 09:51:57,720 But what we're opening is a record set. 12764 09:51:57,720 --> 09:52:03,720 We're opening a set of records that are going to be this wherever it's true. 12765 09:52:03,720 --> 09:52:06,720 So it's like we're going to read this like a file. 12766 09:52:06,720 --> 09:52:08,720 Now, later things will loop through this. 12767 09:52:08,720 --> 09:52:10,720 But we're only going to say, hey, grab that first one. 12768 09:52:10,720 --> 09:52:13,720 We could have even put maybe a limit clause on there or something. 12769 09:52:13,720 --> 09:52:16,720 Grab the first one and give it back in row. 12770 09:52:16,720 --> 09:52:25,720 And so row is going to be the information that we get from the database. 12771 09:52:25,720 --> 09:52:31,720 And so if there are no records that meet this, then row is going to be none. 12772 09:52:31,720 --> 09:52:34,720 So here's kind of, again, like the get. 
12773 09:52:34,720 --> 09:52:38,720 Here's like the get, where if the row wasn't there, 12774 09:52:38,720 --> 09:52:43,720 because the way we're doing this is we're going to end up with this row in the database. 12775 09:52:43,720 --> 09:52:45,720 Here is this database. 12776 09:52:45,720 --> 09:52:46,720 And there's going to be two columns. 12777 09:52:46,720 --> 09:52:47,720 And there's a bunch of rows. 12778 09:52:47,720 --> 09:52:55,720 And then here's going to be csev4 and gen3 and steven6, right? 12779 09:52:55,720 --> 09:52:56,720 So these are the counts. 12780 09:52:56,720 --> 09:53:02,720 And so we're grabbing this variable out if it's csev that we're grabbing. 12781 09:53:02,720 --> 09:53:03,720 And that's going to come into here, right? 12782 09:53:03,720 --> 09:53:05,720 That's going to show up in here. 12783 09:53:05,720 --> 09:53:13,720 And that row is actually, it turns out that the row is a list, 12784 09:53:13,720 --> 09:53:15,720 but we're only getting one thing. 12785 09:53:15,720 --> 09:53:18,720 And what we really are doing is if we searched through and we got through 12786 09:53:18,720 --> 09:53:22,720 and there was nothing, then row is none means that there was none 12787 09:53:22,720 --> 09:53:27,720 and we're seeing like gens for the first time and we have to insert it. 12788 09:53:27,720 --> 09:53:31,720 So if row is none, we're going to run an insert statement. 12789 09:53:31,720 --> 09:53:34,720 Insert into counts, email count. 12790 09:53:34,720 --> 09:53:37,720 Now we've got to set it to one because it's the first time we've seen it. 12791 09:53:37,720 --> 09:53:39,720 So values, and then again the question mark. 12792 09:53:39,720 --> 09:53:43,720 The question mark basically says, hey, I'm going to have a value in this tuple 12793 09:53:43,720 --> 09:53:45,720 and there's an ordering to the tuple. 
12794 09:53:45,720 --> 09:53:49,720 And so there's only one question here, one question mark placeholder here 12795 09:53:49,720 --> 09:53:50,720 and then one is the initial count. 12796 09:53:50,720 --> 09:53:54,720 So email, question mark, count, one, away we go. 12797 09:53:54,720 --> 09:53:59,720 And so then we have, again, we have a tuple that gives to this execute statement 12798 09:53:59,720 --> 09:54:02,720 just like in that execute statement, the corresponding sort of strings 12799 09:54:02,720 --> 09:54:06,720 or integers that are to be substituted for each of the question marks. 12800 09:54:06,720 --> 09:54:09,720 So when this runs, there's going to be a new record 12801 09:54:09,720 --> 09:54:13,720 and there's going to be a one that's put in there into that new record. 12802 09:54:13,720 --> 09:54:16,720 If on the other hand we pull back a row that exists, 12803 09:54:16,720 --> 09:54:18,720 we're going to get this four number. 12804 09:54:18,720 --> 09:54:21,720 And you might think we want to take this four number and add to it, 12805 09:54:21,720 --> 09:54:25,720 but in databases it's always better to do an update 12806 09:54:25,720 --> 09:54:29,720 because there might be multiple applications 12807 09:54:29,720 --> 09:54:31,720 that are talking to this database at the same time. 12808 09:54:31,720 --> 09:54:36,720 So what update does, no matter what, is in a single atomic operation 12809 09:54:36,720 --> 09:54:39,720 it turns whatever this number is into one higher 12810 09:54:39,720 --> 09:54:43,720 and we don't have to worry about other pieces of code potentially modifying.
12811 09:54:43,720 --> 09:54:45,720 Now in this case we don't have to worry about that 12812 09:54:45,720 --> 09:54:47,720 because we're the only piece of code, 12813 09:54:47,720 --> 09:54:52,720 but using update to increment something is way better than reading the value 12814 09:54:52,720 --> 09:54:56,720 and then adding one inside of Python 12815 09:54:56,720 --> 09:54:59,720 and then updating the new value, which is two SQL statements 12816 09:54:59,720 --> 09:55:03,720 but it's also not atomic. 12817 09:55:03,720 --> 09:55:09,720 So if the row is not None, if the row exists, we just know that it exists 12818 09:55:09,720 --> 09:55:11,720 and we just want to add one to the number. 12819 09:55:11,720 --> 09:55:14,720 We do have the number sitting here in the row variable 12820 09:55:14,720 --> 09:55:16,720 but we don't need it. 12821 09:55:16,720 --> 09:55:21,720 And so we're going to say update counts set count equals count plus one, 12822 09:55:21,720 --> 09:55:24,720 count being the column name, where email equals and then another placeholder 12823 09:55:24,720 --> 09:55:27,720 and then another tuple for the question mark. 12824 09:55:27,720 --> 09:55:30,720 And so that's what this little bit of code does. 12825 09:55:30,720 --> 09:55:34,720 That is kind of the read it, parse it, check to see if it's there, 12826 09:55:34,720 --> 09:55:37,720 if it's not, insert it, if it is, update it. 12827 09:55:37,720 --> 09:55:41,720 And so then we see this conn.commit. 12828 09:55:41,720 --> 09:55:46,720 And this conn.commit, basically the way it works is that the database 12829 09:55:46,720 --> 09:55:50,720 is efficiently keeping some of the information in memory 12830 09:55:50,720 --> 09:55:53,720 and at some point it has to write all that stuff out to disk. 12831 09:55:53,720 --> 09:55:56,720 So you can choose at times where you put this commit.
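The check, then insert or increment step described above can be sketched as a small function. This is a sketch under assumptions rather than the lecture's exact code: the function name `bump`, the in-memory database, and the sample addresses are all made up for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE Counts (email TEXT, count INTEGER)")


def bump(cur, email):
    """Insert the address with count 1, or add one to its existing count."""
    cur.execute("SELECT count FROM Counts WHERE email = ?", (email,))
    row = cur.fetchone()
    if row is None:
        # First time this address is seen.
        cur.execute("INSERT INTO Counts (email, count) VALUES (?, 1)", (email,))
    else:
        # Let the database do the arithmetic in a single statement instead
        # of reading the value into Python and writing back value + 1.
        cur.execute("UPDATE Counts SET count = count + 1 WHERE email = ?", (email,))


for address in ["csev@umich.edu", "gene@umich.edu", "csev@umich.edu"]:
    bump(cur, address)
conn.commit()

# Each result row is an (email, count) pair, so dict() can consume the cursor.
counts = dict(cur.execute("SELECT email, count FROM Counts"))
conn.close()
```

The value in `row` is deliberately ignored in the update branch, since the increment happens inside the database.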
12832 09:55:56,720 --> 09:55:59,720 Right now we're going to commit every time through this loop 12833 09:55:59,720 --> 09:56:01,720 but you might commit every tenth time through the loop 12834 09:56:01,720 --> 09:56:05,720 because the commit will take some time because it forces everything 12835 09:56:05,720 --> 09:56:08,720 to be written to disk and these can run really fast 12836 09:56:08,720 --> 09:56:10,720 and the commit is the slowest part here. 12837 09:56:10,720 --> 09:56:13,720 So sometimes we do things like commit every tenth record 12838 09:56:13,720 --> 09:56:15,720 or every hundredth record. 12839 09:56:15,720 --> 09:56:18,720 If it's an online system which is not what this is, 12840 09:56:18,720 --> 09:56:21,720 you have to commit at the end of every sort of screenping. 12841 09:56:21,720 --> 09:56:24,720 But for this kind of a system because we're putting so much in, 12842 09:56:24,720 --> 09:56:26,720 this is kind of a bulk insert, 12843 09:56:26,720 --> 09:56:28,720 we might come up with a thing where we, 12844 09:56:28,720 --> 09:56:31,720 you know, every tenth time we do a commit. 12845 09:56:31,720 --> 09:56:34,720 But ultimately what this will do when this is running 12846 09:56:34,720 --> 09:56:37,720 is it will build up slowly but surely adding new records 12847 09:56:37,720 --> 09:56:40,720 and then one one and then it will build two and a three 12848 09:56:40,720 --> 09:56:42,720 and all these things and add another one, that will be one. 12849 09:56:42,720 --> 09:56:43,720 It will do this thing, right? 12850 09:56:43,720 --> 09:56:47,720 And then at the end of the day that is what's going to be in the database. 12851 09:56:47,720 --> 09:56:55,720 Now, so now we're, so let's take a look what's in the database 12852 09:56:55,720 --> 09:56:57,720 and now we can actually read the database. 
12853 09:56:57,720 --> 09:57:01,720 And so in the database we're going to run a select 12854 09:57:01,720 --> 09:57:04,720 and we're going to say we're going to select the email and count 12855 09:57:04,720 --> 09:57:06,720 from counts, order by count, descending. 12856 09:57:06,720 --> 09:57:08,720 So look at that, isn't that cool? 12857 09:57:08,720 --> 09:57:11,720 We're getting the top ten because databases are good at sorting 12858 09:57:11,720 --> 09:57:13,720 and they're good at all these other things. 12859 09:57:13,720 --> 09:57:15,720 So we're going to then execute this 12860 09:57:15,720 --> 09:57:19,720 and then we're going to ask for the rows one at a time 12861 09:57:19,720 --> 09:57:23,720 and the rows are going to be a tuple 12862 09:57:23,720 --> 09:57:26,720 and row sub zero will be email and row sub one will be count. 12863 09:57:26,720 --> 09:57:29,720 So we run all this stuff and then we close the connection 12864 09:57:29,720 --> 09:57:31,720 and away we go, okay? 12865 09:57:31,720 --> 09:57:34,720 So let's go ahead and run this. 12866 09:57:34,720 --> 09:57:37,720 Let's go ahead and run all this stuff. 12867 09:57:37,720 --> 09:57:42,720 python3 emaildb.py. 12868 09:57:42,720 --> 09:57:49,720 It asks for a file name, mbox-short. 12869 09:57:49,720 --> 09:57:50,720 I can hit enter, right? 12870 09:57:50,720 --> 09:57:53,720 mbox-short and that's it and it looks just like that 12871 09:57:53,720 --> 09:57:55,720 and it counts it and away we go. 12872 09:57:55,720 --> 09:57:59,720 Now the difference is at this point we have a file, 12873 09:57:59,720 --> 09:58:06,720 emaildb.sqlite and we can run the SQLite browser 12874 09:58:06,720 --> 09:58:12,720 and we can then open this database 12875 09:58:12,720 --> 09:58:13,720 and we can see what's in there. 12876 09:58:13,720 --> 09:58:14,720 So here we go. 12877 09:58:14,720 --> 09:58:16,720 It has made an SQLite database.
12878 09:58:16,720 --> 09:58:18,720 We have a table of counts 12879 09:58:18,720 --> 09:58:21,720 and then we can take a look at the data and there we go. 12880 09:58:21,720 --> 09:58:25,720 We've got the data and we can do this. 12881 09:58:25,720 --> 09:58:27,720 And so let me close this. 12882 09:58:27,720 --> 09:58:33,720 It's important at times when you don't want necessarily to have, 12883 09:58:33,720 --> 09:58:35,720 well let's see if we can cause it to lock up. 12884 09:58:35,720 --> 09:58:39,720 Let me run this again and it's going to drop this table. 12885 09:58:39,720 --> 09:58:44,720 So I'm going to run the code again 12886 09:58:44,720 --> 09:58:51,720 but this time I am going to do the full one, mbox.txt. 12887 09:58:51,720 --> 09:58:55,720 Now we'll see what happens here but it ran 12888 09:58:55,720 --> 09:58:58,720 and now so what we have to do then to see this date 12889 09:58:58,720 --> 09:59:01,720 is from the previous run but if we want the most recent one 12890 09:59:01,720 --> 09:59:03,720 we hit refresh and then away we go 12891 09:59:03,720 --> 09:59:05,720 and so we can see this stuff. 12892 09:59:05,720 --> 09:59:09,720 And so this is just a real simple start 12893 09:59:09,720 --> 09:59:12,720 to see how you can connect some of the stuff that we've been doing 12894 09:59:12,720 --> 09:59:14,720 but store the data in a database. 12895 09:59:14,720 --> 09:59:16,720 But the nice thing about the database 12896 09:59:16,720 --> 09:59:20,720 is that it can store this stuff from run to run. 12897 09:59:20,720 --> 09:59:23,720 Even though in this case we're dropping the table every time 12898 09:59:23,720 --> 09:59:26,720 in later things we will see how we can store data from run to run 12899 09:59:26,720 --> 09:59:29,720 to give ourselves more restartable processes. 12900 09:59:29,720 --> 09:59:35,720 Cheers. 
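Putting the whole walkthrough together, a sketch of the same flow might look like the following. It is written under assumptions and is not a copy of emaildb.py: it reads lines from a list instead of prompting for a file, the function name `count_senders` and the `commit_every` parameter are hypothetical, and the `"From: "` prefix test is one plausible way to select the sender lines.

```python
import sqlite3


def count_senders(lines, db_path=":memory:", commit_every=10):
    """Count sender addresses in mail lines and return the top ten."""
    conn = sqlite3.connect(db_path)
    cur = conn.cursor()
    # Start fresh on every run, as the walkthrough does.
    cur.execute("DROP TABLE IF EXISTS Counts")
    cur.execute("CREATE TABLE Counts (email TEXT, count INTEGER)")

    seen = 0
    for line in lines:
        if not line.startswith("From: "):
            continue
        email = line.split()[1]  # split() also discards the trailing newline
        cur.execute("SELECT count FROM Counts WHERE email = ?", (email,))
        if cur.fetchone() is None:
            cur.execute("INSERT INTO Counts (email, count) VALUES (?, 1)", (email,))
        else:
            cur.execute("UPDATE Counts SET count = count + 1 WHERE email = ?", (email,))
        seen += 1
        # Committing forces a write to disk, so do it only every few records.
        if seen % commit_every == 0:
            conn.commit()
    conn.commit()  # flush whatever is left after the loop

    cur.execute("SELECT email, count FROM Counts ORDER BY count DESC LIMIT 10")
    top = cur.fetchall()
    conn.close()
    return top


sample_lines = [
    "From: csev@umich.edu\n",
    "Subject: hello\n",
    "From: gene@umich.edu\n",
    "From: csev@umich.edu\n",
]
top = count_senders(sample_lines)
for email, count in top:
    print(email, count)
```

Passing a file name as `db_path` instead of `":memory:"` would leave a database file behind that the SQLite browser can open.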
12901 09:59:35,720 --> 09:59:37,720 We're going to do some code walkthrough 12902 09:59:37,720 --> 09:59:39,720 and if you want to follow through with the code 12903 09:59:39,720 --> 09:59:44,720 you can download the sample code from Python for Everybody. 12904 09:59:44,720 --> 09:59:48,720 And so the code that we're going to play with is the Twitter Spider code 12905 09:59:48,720 --> 09:59:54,720 that is both talking to the Twitter API and talking to the database. 12906 09:59:54,720 --> 09:59:58,720 And so what we're going to be doing is we're going to run code 12907 09:59:58,720 --> 10:00:01,720 that's going to hit the Twitter API much like we did in a previous chapter 12908 10:00:01,720 --> 10:00:04,720 and we're going to retrieve the data but we're going to remember the data 12909 10:00:04,720 --> 10:00:07,720 so we don't have to retrieve it again. 12910 10:00:07,720 --> 10:00:10,720 And so we're going to keep track of people's friends 12911 10:00:10,720 --> 10:00:14,720 and what we're doing here is sort of illicitly pulling down, 12912 10:00:14,720 --> 10:00:17,720 slowly but surely, subject to our rate limit, 12913 10:00:17,720 --> 10:00:20,720 we're pulling down who our friends are. 12914 10:00:20,720 --> 10:00:22,720 And so let's take a look. 12915 10:00:22,720 --> 10:00:25,720 We're going to use urllib and urllib.error, and twurl, 12916 10:00:25,720 --> 10:00:30,720 which was code that augments my URL to do all the OAuth calculation. 12917 10:00:30,720 --> 10:00:32,720 We're going to get JSON data back. 12918 10:00:32,720 --> 10:00:34,720 We're going to make a database and we have to import ssl 12919 10:00:34,720 --> 10:00:39,720 because of the way Python doesn't trust any certificates 12920 10:00:39,720 --> 10:00:41,720 no matter how good they are. 12921 10:00:41,720 --> 10:00:44,720 So this is our URL to talk to the Twitter API.
12922 10:00:44,720 --> 10:00:47,720 We're going to make a database and again the way SQLite works 12923 10:00:47,720 --> 10:00:52,720 is if this spider.sqlite doesn't exist, it creates it. 12924 10:00:52,720 --> 10:00:57,720 And we get ourselves a cursor and we're going to do a create table. 12925 10:00:57,720 --> 10:01:01,720 This if not exists, not all SQLs have it, but SQLite 3 does this. 12926 10:01:01,720 --> 10:01:03,720 Create table if it doesn't exist. 12927 10:01:03,720 --> 10:01:07,720 We want to start this over and over, unlike the tracks example. 12928 10:01:07,720 --> 10:01:10,720 I want to start this over and over and not lose data. 12929 10:01:10,720 --> 10:01:13,720 And this is a spidering process and we'll see a lot of these 12930 10:01:13,720 --> 10:01:17,720 where we want a restartable process where we use a database. 12931 10:01:17,720 --> 10:01:21,720 So if we're starting with nothing and there's no file spider.sqlite 12932 10:01:21,720 --> 10:01:24,720 it creates this table and it's the name of the person, 12933 10:01:24,720 --> 10:01:27,720 whether we retrieved it or not and how many friends this person has 12934 10:01:27,720 --> 10:01:29,720 that we know of in our database. 12935 10:01:29,720 --> 10:01:33,720 Now this little bit is to deal with the SSL certificate errors. 12936 10:01:33,720 --> 10:01:36,720 The certificates are totally fine but Python doesn't trust any certificates 12937 10:01:36,720 --> 10:01:40,720 by default which is frustrating but whatever. 12938 10:01:40,720 --> 10:01:41,720 So here we're going to have a loop. 12939 10:01:41,720 --> 10:01:43,720 We're going to ask for a Twitter account. 12940 10:01:43,720 --> 10:01:45,720 We have to type quit to quit. 12941 10:01:45,720 --> 10:01:49,720 If we hit enter in this case we're going to actually read from the database 12942 10:01:49,720 --> 10:01:54,720 an unretrieved Twitter person and then grab all that person's friends.
12943 10:01:54,720 --> 10:02:02,720 And so then we're going to do a fetchone, get one 12944 10:02:02,720 --> 10:02:06,720 and that's going to get the name of the first person, the sub zero. 12945 10:02:06,720 --> 10:02:10,720 If we had more things than name here, sub zero is the first of those. 12946 10:02:10,720 --> 10:02:13,720 fetchone means get one row from the database 12947 10:02:13,720 --> 10:02:16,720 and sub zero means the first column of that first row. 12948 10:02:16,720 --> 10:02:20,720 And if this fails then we've retrieved all the Twitter accounts. 12949 10:02:20,720 --> 10:02:26,720 And so we're going to augment this Twitter URL using this, and 12950 10:02:26,720 --> 10:02:29,720 you can look at the twurl.py code. 12951 10:02:29,720 --> 10:02:34,720 This basically requires the hidden.py file 12952 10:02:34,720 --> 10:02:37,720 which has your keys and secrets in it. 12953 10:02:37,720 --> 10:02:39,720 You've got to get hidden.py updated. 12954 10:02:39,720 --> 10:02:41,720 I've got it updated but I'm not going to show you 12955 10:02:41,720 --> 10:02:43,720 because it has my keys and secrets in it. 12956 10:02:43,720 --> 10:02:45,720 And so we're only going to take the first five 12957 10:02:45,720 --> 10:02:48,720 which means we're probably not going to find friends of friends of friends. 12958 10:02:48,720 --> 10:02:50,720 It's only the five most recent ones. 12959 10:02:50,720 --> 10:02:54,720 We could run this with a much higher number to get to the 12960 10:02:54,720 --> 10:02:55,720 so we have more than one friend. 12961 10:02:55,720 --> 10:02:58,720 We'll show the URL while we retrieve it. 12962 10:02:58,720 --> 10:03:00,720 We will do our urlopen. 12963 10:03:00,720 --> 10:03:04,720 We'll do a read and then we'll do a decode, to make sure that, 12964 10:03:04,720 --> 10:03:07,720 this will give us data in UTF-8 and then decode 12965 10:03:07,720 --> 10:03:11,720 will give us data in Unicode which is what we need inside of Python.
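The read and decode step can be seen without touching the network. In this sketch the bytes are produced locally as a stand-in for what a network read hands back; the sample text is an assumption.

```python
# Stand-in for the bytes a network read() would return (an assumption:
# the real program gets them from the opened URL).
raw = "café friends".encode("utf-8")

# decode() defaults to UTF-8 and turns the bytes into a Python str.
text = raw.decode()

print(type(raw).__name__, type(text).__name__)
```

The encoded form is longer than the string here because the accented letter takes two bytes in UTF-8.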
12966 10:03:11,720 --> 10:03:16,720 We will ask for the headers from the connection. 12967 10:03:16,720 --> 10:03:19,720 We'll say give me the headers, give me a dictionary of the headers 12968 10:03:19,720 --> 10:03:23,720 and the x-rate-limit-remaining header from the Twitter API 12969 10:03:23,720 --> 10:03:27,720 tells us when we're going to be told we can't use this API anymore 12970 10:03:27,720 --> 10:03:29,720 because this is one of those things. 12971 10:03:29,720 --> 10:03:34,720 And then we're going to parse and load the data that we got from Twitter 12972 10:03:34,720 --> 10:03:39,720 and get a, I think it's a list. 12973 10:03:39,720 --> 10:03:41,720 Yeah, it's a list. 12974 10:03:41,720 --> 10:03:45,720 And then we could dump this if you want, and in yours you can uncomment that. 12975 10:03:45,720 --> 10:03:49,720 And then what we're going to do is we've just retrieved 12976 10:03:49,720 --> 10:03:52,720 this person's screen name and their friends. 12977 10:03:52,720 --> 10:03:56,720 And so the first thing we want to do is update the database 12978 10:03:56,720 --> 10:03:58,720 and change retrieved from zero to one. 12979 10:03:58,720 --> 10:04:01,720 And that's because we want, we're going to use this to know about unretrieved. 12980 10:04:01,720 --> 10:04:05,720 So retrieved being one means we've already retrieved it 12981 10:04:05,720 --> 10:04:08,720 and we did retrieve it so for that account we've retrieved it. 12982 10:04:08,720 --> 10:04:10,720 And then what we're going to do is we're going to parse that. 12983 10:04:10,720 --> 10:04:13,720 And so this is similar to the Twitter code we did previously 12984 10:04:13,720 --> 10:04:15,720 in the web services chapter. 12985 10:04:15,720 --> 10:04:16,720 We're going to go through all the users. 12986 10:04:16,720 --> 10:04:17,720 We're going to find their screen name. 12987 10:04:17,720 --> 10:04:20,720 We're going to print the screen name out.
12988 10:04:20,720 --> 10:04:29,720 And then what we're going to do is see if, let's see. 12989 10:04:29,720 --> 10:04:34,720 So we're going through all the users who are the friends of this person 12990 10:04:34,720 --> 10:04:38,720 and we're going to say, oh okay, let's select the friends from Twitter 12991 10:04:38,720 --> 10:04:42,720 where the name is the friend person. 12992 10:04:42,720 --> 10:04:48,720 And what we're going to do is we're going to, 12993 10:04:48,720 --> 10:04:52,720 we're going to do a cur.fetchone of this Twitter, 12994 10:04:52,720 --> 10:04:57,720 the name of the friends, this is the friend screen name, right? 12995 10:04:57,720 --> 10:05:01,720 So we're going to say, oh okay, if we get this, 12996 10:05:01,720 --> 10:05:03,720 we're going to get that friend screen name 12997 10:05:03,720 --> 10:05:07,720 and we're going to get how many friends this particular screen name has. 12998 10:05:07,720 --> 10:05:12,720 If we find it in there, 12999 10:05:12,720 --> 10:05:15,720 we're going to do an update statement and add one to their friend count, 13000 10:05:15,720 --> 10:05:18,720 how many friends they have, and then keep track. 13001 10:05:18,720 --> 10:05:20,720 This count here is not in the database. 13002 10:05:20,720 --> 10:05:22,720 It's just so I can print it out at the end. 13003 10:05:22,720 --> 10:05:26,720 If there is no record for this particular friend, 13004 10:05:26,720 --> 10:05:30,720 we're going to insert them into it new and we're going to say, 13005 10:05:30,720 --> 10:05:33,720 here's the new person that we just saw. 13006 10:05:33,720 --> 10:05:35,720 Here, that's their name. 13007 10:05:35,720 --> 10:05:37,720 We're going to set retrieved to zero 13008 10:05:37,720 --> 10:05:41,720 and we're going to say that they have one friend, okay?
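The database side of the spider, as described above, can be sketched without calling Twitter at all. Everything here is an assumption made for illustration: the function names `record_friends` and `next_unretrieved`, the in-memory database, the column types, and the screen names are hypothetical, and the friend lists are passed in directly instead of coming from the API.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
# IF NOT EXISTS lets the program restart without losing earlier rows.
cur.execute(
    "CREATE TABLE IF NOT EXISTS Twitter "
    "(name TEXT, retrieved INTEGER, friends INTEGER)"
)


def record_friends(cur, account, friend_names):
    """Mark an account as retrieved and count each of its friends."""
    cur.execute("UPDATE Twitter SET retrieved = 1 WHERE name = ?", (account,))
    for friend in friend_names:
        cur.execute("SELECT friends FROM Twitter WHERE name = ? LIMIT 1", (friend,))
        if cur.fetchone() is None:
            # New person: not yet retrieved, seen as a friend once.
            cur.execute(
                "INSERT INTO Twitter (name, retrieved, friends) VALUES (?, 0, 1)",
                (friend,),
            )
        else:
            cur.execute(
                "UPDATE Twitter SET friends = friends + 1 WHERE name = ?", (friend,)
            )


def next_unretrieved(cur):
    """Return the name of some account not yet retrieved, or None."""
    cur.execute("SELECT name FROM Twitter WHERE retrieved = 0 LIMIT 1")
    row = cur.fetchone()
    return None if row is None else row[0]


# Hypothetical screen names standing in for API results.
record_friends(cur, "start_account", ["friend_a", "friend_b"])
record_friends(cur, "friend_a", ["friend_b", "friend_c"])
conn.commit()

table = {name: (retrieved, friends)
         for name, retrieved, friends in cur.execute("SELECT * FROM Twitter")}
candidate = next_unretrieved(cur)
conn.close()
```

The starting account never gets a row of its own in this sketch, because the first statement only updates rows that already exist.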
13009 10:05:41,720 --> 10:05:44,720 And then we're going to commit the transaction 13010 10:05:44,720 --> 10:05:47,720 and then we're going to close this at the end, okay? 13011 10:05:47,720 --> 10:05:49,720 So let's go ahead and run this. 13012 10:05:49,720 --> 10:05:52,720 The first time it's going to create an empty database. 13013 10:05:52,720 --> 10:05:56,720 So I'm going to say python3 twspider. 13014 10:05:56,720 --> 10:06:02,720 So ls star SQLite, nothing there. 13015 10:06:02,720 --> 10:06:06,720 Python3, oops, that's because I removed it. 13016 10:06:06,720 --> 10:06:11,720 Python3 twspider.py. 13017 10:06:11,720 --> 10:06:17,720 Okay, so I'm going to start with a Twitter account, Dr. Chuck. 13018 10:06:17,720 --> 10:06:20,720 And so it's doing its retrieval and don't worry, 13019 10:06:20,720 --> 10:06:24,720 showing the token and the signature is not dangerous 13020 10:06:24,720 --> 10:06:26,720 because you don't have the keys or the token, 13021 10:06:26,720 --> 10:06:28,720 I mean the secrets and the token secrets. 13022 10:06:28,720 --> 10:06:29,720 So don't get all too worried. 13023 10:06:29,720 --> 10:06:33,720 So I have 11 calls left, so I got to hope this all works. 13024 10:06:33,720 --> 10:06:35,720 One of my friends is Stephanie Teasley 13025 10:06:35,720 --> 10:06:37,720 and I do these are in reverse order. 13026 10:06:37,720 --> 10:06:45,720 So let's grab Stephanie and ask for Stephanie's friends. 13027 10:06:45,720 --> 10:06:47,720 So now we just retrieve Stephanie's friends 13028 10:06:47,720 --> 10:06:51,720 and here are Stephanie's most recent friends. 13029 10:06:51,720 --> 10:06:54,720 And I can just hit enter and it will randomly pick. 13030 10:06:54,720 --> 10:06:57,720 Let's see if I can in the database. 13031 10:06:57,720 --> 10:07:00,720 Let's open this up, file open database. 13032 10:07:00,720 --> 10:07:01,720 Hope I don't lock myself. 
13033 10:07:01,720 --> 10:07:05,720 Sometimes it's a little scary when you look at the database 13034 10:07:05,720 --> 10:07:07,720 and you're just checking. 13035 10:07:07,720 --> 10:07:10,720 So this is what my database looks like. 13036 10:07:10,720 --> 10:07:16,720 We retrieved Stephanie and she has, this is how many people. 13037 10:07:16,720 --> 10:07:20,720 So these are the friends of Stephanie and me 13038 10:07:20,720 --> 10:07:22,720 and these are how many, I'm not in there. 13039 10:07:22,720 --> 10:07:25,720 So we retrieved Stephanie, which was a friend. 13040 10:07:25,720 --> 10:07:28,720 So let's go grab, oh I don't know. 13041 10:07:28,720 --> 10:07:31,720 Let's grab Tim McKay and get that one. 13042 10:07:31,720 --> 10:07:34,720 Remaining 10, I don't have too many of these. 13043 10:07:34,720 --> 10:07:36,720 Tim McKay, right? 13044 10:07:36,720 --> 10:07:38,720 So there we go. 13045 10:07:38,720 --> 10:07:40,720 Remaining nine. 13046 10:07:40,720 --> 10:07:43,720 And so if I do a refresh on this, 13047 10:07:43,720 --> 10:07:45,720 then you see I've got some more folks. 13048 10:07:45,720 --> 10:07:47,720 If I hit enter here, it will retrieve, 13049 10:07:47,720 --> 10:07:51,720 it will pick one randomly based on retrieved being zero. 13050 10:07:51,720 --> 10:07:54,720 So it won't pick Stephanie or Tim because they're one, 13051 10:07:54,720 --> 10:07:56,720 but we have lots of other folks to pick randomly. 13052 10:07:56,720 --> 10:07:58,720 And we'll hit enter. 13053 10:07:58,720 --> 10:08:01,720 So it picked, who did it pick? 13054 10:08:01,720 --> 10:08:06,720 It picked screen name LiveEduTV, which is ironic 13055 10:08:06,720 --> 10:08:09,720 because I'm recording this on LiveEduTV right now. 13056 10:08:09,720 --> 10:08:12,720 And so we can keep hitting refresh and away we go. 13057 10:08:12,720 --> 10:08:16,720 So I'm gonna stop now because I only have eight remaining. 13058 10:08:16,720 --> 10:08:18,720 And so I'm gonna type quit. 
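The bookkeeping the spider walkthrough describes (flip retrieved to one, bump or insert each friend, then pick an unretrieved account at random) can be sketched roughly as follows. This is a minimal sketch, not the actual twspider.py: it assumes a single table named Twitter with columns name, retrieved and friends, skips the Twitter API call entirely, and uses made-up screen names.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # assumption: in-memory here; the lecture uses a file on disk
cur = conn.cursor()
# Assumed schema: one row per account, a retrieved flag, and a friend count
cur.execute("CREATE TABLE Twitter (name TEXT, retrieved INTEGER, friends INTEGER)")

def record_friends(cur, account, friend_names):
    """Mark `account` as retrieved, then count or insert each of its friends."""
    cur.execute("UPDATE Twitter SET retrieved = 1 WHERE name = ?", (account,))
    for friend in friend_names:
        cur.execute("SELECT friends FROM Twitter WHERE name = ? LIMIT 1", (friend,))
        row = cur.fetchone()
        if row is not None:
            # Already known: add one to the stored friend count
            cur.execute("UPDATE Twitter SET friends = ? WHERE name = ?",
                        (row[0] + 1, friend))
        else:
            # New account: not yet retrieved, seen once so far
            cur.execute("INSERT INTO Twitter (name, retrieved, friends) VALUES (?, 0, 1)",
                        (friend,))

def pick_unretrieved(cur):
    """Return a random account whose retrieved flag is still zero, or None."""
    cur.execute("SELECT name FROM Twitter WHERE retrieved = 0 ORDER BY RANDOM() LIMIT 1")
    row = cur.fetchone()
    return None if row is None else row[0]

# Hypothetical screen names standing in for real API results
cur.execute("INSERT INTO Twitter (name, retrieved, friends) VALUES ('account_a', 0, 0)")
record_friends(cur, "account_a", ["account_b", "account_c"])
record_friends(cur, "account_b", ["account_c"])
conn.commit()

next_account = pick_unretrieved(cur)
```

In this toy run account_a and account_b have retrieved set to one, so only account_c is left for the random pick, which mirrors why the spider will not pick an account it has already fetched.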
13059 10:08:18,720 --> 10:08:22,720 And so we will see how that works. 13060 10:08:22,720 --> 10:08:23,720 So that's how it works. 13061 10:08:23,720 --> 10:08:29,720 Now remember that you've got to edit the hidden.py file 13062 10:08:29,720 --> 10:08:33,720 to make this work because we are talking to the Twitter API. 13063 10:08:33,720 --> 10:08:39,720 If you don't edit that file, it won't work for you. 13064 10:08:39,720 --> 10:08:41,720 Okay, so I hope you find this useful. 13065 10:08:41,720 --> 10:08:46,720 Cheers. 13066 10:08:46,720 --> 10:08:48,720 So now we're gonna take a look at how we deal with 13067 10:08:48,720 --> 10:08:50,720 more than one table, multiple tables. 13068 10:08:50,720 --> 10:08:54,720 Because the real power of SQL and the power of database performance 13069 10:08:54,720 --> 10:08:56,720 has to do with when you start connecting tables together. 13070 10:08:56,720 --> 10:08:58,720 If you go back to that original mathematics, 13071 10:08:58,720 --> 10:09:02,720 it models data at the intersections between the row and the columns. 13072 10:09:02,720 --> 10:09:06,720 And these intersections are the magical bits. 13073 10:09:06,720 --> 10:09:10,720 And so breaking an application to use multiple tables is an art form. 13074 10:09:10,720 --> 10:09:12,720 It takes a while. 13075 10:09:12,720 --> 10:09:15,720 There are some simple basic things that you can learn 13076 10:09:15,720 --> 10:09:17,720 and we'll teach you here. 13077 10:09:17,720 --> 10:09:19,720 And so it's not too hard to learn the basics, 13078 10:09:19,720 --> 10:09:24,720 but then it's much more complex to be super skilled at it. 13079 10:09:24,720 --> 10:09:26,720 And in general, advanced databases, in my mind, 13080 10:09:26,720 --> 10:09:29,720 it's hard to teach advanced databases 13081 10:09:29,720 --> 10:09:33,720 because they're always so contextually grounded. 
13082 10:09:33,720 --> 10:09:37,720 You know, something like Twitter or Google, 13083 10:09:37,720 --> 10:09:39,720 the databases are so specialized. 13084 10:09:39,720 --> 10:09:43,720 By the time you make, everyone can do small to medium-sized databases 13085 10:09:43,720 --> 10:09:45,720 using the basic techniques, but at some point, 13086 10:09:45,720 --> 10:09:47,720 once you escape medium-sized databases, 13087 10:09:47,720 --> 10:09:49,720 you end up in these sort of narrow things 13088 10:09:49,720 --> 10:09:52,720 and optimize each database very separately. 13089 10:09:52,720 --> 10:09:54,720 And so I just tell people, you know, 13090 10:09:54,720 --> 10:09:57,720 learn the basics really, really well, write programs, 13091 10:09:57,720 --> 10:10:00,720 and then go do real work. 13092 10:10:00,720 --> 10:10:06,720 But database design is the act of figuring out 13093 10:10:06,720 --> 10:10:09,720 the data that your application is going to want to store 13094 10:10:09,720 --> 10:10:11,720 and spreading that across multiple tables. 13095 10:10:11,720 --> 10:10:12,720 But we don't just do it randomly. 13096 10:10:12,720 --> 10:10:14,720 We do it very much cleverly. 13097 10:10:14,720 --> 10:10:17,720 And if you look at a data model, this is what it looks like. 13098 10:10:17,720 --> 10:10:20,720 And what we're showing here in this data model 13099 10:10:20,720 --> 10:10:24,720 is we are showing five tables, 13100 10:10:24,720 --> 10:10:27,720 and this is kind of a calendar kind of a system, 13101 10:10:27,720 --> 10:10:30,720 and we're seeing the columns that are in each of the tables, 13102 10:10:30,720 --> 10:10:33,720 and then we're seeing the relationships between the tables. 
13103 10:10:33,720 --> 10:10:35,720 And even in these relationships, 13104 10:10:35,720 --> 10:10:37,720 there's kind of a little bit of code, 13105 10:10:37,720 --> 10:10:39,720 and when you have an arrow that looks like that, 13106 10:10:39,720 --> 10:10:40,720 there's many of those to one, 13107 10:10:40,720 --> 10:10:43,720 and this is a many-to-one relationship. 13108 10:10:43,720 --> 10:10:44,720 Many-to-one relationship. 13109 10:10:44,720 --> 10:10:46,720 We'll talk all about that stuff. 13110 10:10:46,720 --> 10:10:48,720 But if you go into an organization 13111 10:10:48,720 --> 10:10:51,720 and you have a really large and complex data application, 13112 10:10:51,720 --> 10:10:53,720 they might have something printed out on the wall 13113 10:10:53,720 --> 10:10:54,720 that looks about like this, 13114 10:10:54,720 --> 10:10:57,720 which shows the database tables and connections, 13115 10:10:57,720 --> 10:10:58,720 et cetera, et cetera. 13116 10:10:58,720 --> 10:10:59,720 And they might say, 13117 10:10:59,720 --> 10:11:01,720 oh, your job is to go down in this little corner, 13118 10:11:01,720 --> 10:11:03,720 add one column field there, 13119 10:11:03,720 --> 10:11:05,720 and then do this, and then connect it with this thing over there, 13120 10:11:05,720 --> 10:11:08,720 and then make a screen that shows all these things 13121 10:11:08,720 --> 10:11:09,720 that pulls from this table, this table, 13122 10:11:09,720 --> 10:11:11,720 this table, and that table, 13123 10:11:11,720 --> 10:11:13,720 and that's your job if you're a programmer 13124 10:11:13,720 --> 10:11:16,720 on a large software development project. 13125 10:11:16,720 --> 10:11:19,720 These database models become sort of like 13126 10:11:19,720 --> 10:11:21,720 the core backbone of the knowledge 13127 10:11:21,720 --> 10:11:25,720 that applications are managing and using. 
13128 10:11:25,720 --> 10:11:28,720 So the idea is that you take your application, 13129 10:11:28,720 --> 10:11:29,720 we're going to start really simple, 13130 10:11:29,720 --> 10:11:31,720 we're going to take your application, 13131 10:11:31,720 --> 10:11:32,720 and you have to draw a picture. 13132 10:11:32,720 --> 10:11:36,720 And the basic rule, and literally you could spend 13133 10:11:36,720 --> 10:11:40,720 course upon course learning about database normalization, 13134 10:11:40,720 --> 10:11:42,720 but I'm going to distill it into one basic rule, 13135 10:11:42,720 --> 10:11:47,720 and that is never put the same string data in twice. 13136 10:11:47,720 --> 10:11:49,720 So my name, Charles Severance, 13137 10:11:49,720 --> 10:11:51,720 if I build a database well, 13138 10:11:51,720 --> 10:11:53,720 you should go into that database and you'd say, 13139 10:11:53,720 --> 10:11:55,720 okay, the words Charles Severance, 13140 10:11:55,720 --> 10:11:58,720 which is the name of a person, me, in that database, 13141 10:11:58,720 --> 10:11:59,720 only shows up once. 13142 10:11:59,720 --> 10:12:02,720 And what we do instead is we connect things together 13143 10:12:02,720 --> 10:12:05,720 and model my name as a connection to the record 13144 10:12:05,720 --> 10:12:07,720 that has my actual name in it, 13145 10:12:07,720 --> 10:12:09,720 rather than putting my name all these other places. 13146 10:12:09,720 --> 10:12:12,720 And so the idea is to pull duplicate data out 13147 10:12:12,720 --> 10:12:14,720 and make only one copy of it. 13148 10:12:14,720 --> 10:12:17,720 So there is the users, and in there is the user's name, 13149 10:12:17,720 --> 10:12:20,720 and the user name shows up only here, 13150 10:12:20,720 --> 10:12:24,720 and everything else points to the particular user entry. 13151 10:12:24,720 --> 10:12:26,720 So that's the idea. 13152 10:12:26,720 --> 10:12:29,720 And so here is our first application. 
13153 10:12:29,720 --> 10:12:32,720 We are working as a startup. 13154 10:12:32,720 --> 10:12:34,720 We just quit all of our jobs, 13155 10:12:34,720 --> 10:12:37,720 and we are going to build a music management application. 13156 10:12:37,720 --> 10:12:38,720 I mean, what a great idea. 13157 10:12:38,720 --> 10:12:40,720 Don't you think that'll be quite successful? 13158 10:12:40,720 --> 10:12:42,720 And so we have mocked up, 13159 10:12:42,720 --> 10:12:44,720 and we have figured out that this is what our 13160 10:12:44,720 --> 10:12:46,720 music management application. 13161 10:12:46,720 --> 10:12:48,720 We want to track people's tracks, 13162 10:12:48,720 --> 10:12:51,720 know something about what artists and albums 13163 10:12:51,720 --> 10:12:52,720 and genre they are, 13164 10:12:52,720 --> 10:12:54,720 and have ratings and how many times we've played them, 13165 10:12:54,720 --> 10:12:55,720 and how long they are. 13166 10:12:55,720 --> 10:12:58,720 Well, that's the data that our application needs to represent. 13167 10:12:58,720 --> 10:13:01,720 And we've done testing on this, and wireframes, 13168 10:13:01,720 --> 10:13:02,720 and everyone loves this. 13169 10:13:02,720 --> 10:13:04,720 It's a great user interface. 13170 10:13:04,720 --> 10:13:06,720 And so this is how it's got to look. 13171 10:13:06,720 --> 10:13:09,720 But we're going to have billions and billions of tracks 13172 10:13:09,720 --> 10:13:10,720 in these things, and so we want to come up 13173 10:13:10,720 --> 10:13:13,720 with an efficient database to handle this. 
13174 10:13:13,720 --> 10:13:15,720 And so we're going to take a look at this 13175 10:13:15,720 --> 10:13:16,720 and look at each of the columns, 13176 10:13:16,720 --> 10:13:18,720 and we're going to ask ourselves, 13177 10:13:18,720 --> 10:13:23,720 is this column part of one of our existing objects, 13178 10:13:23,720 --> 10:13:27,720 our existing tables, or is this object 13179 10:13:27,720 --> 10:13:29,720 have to create a new table? 13180 10:13:29,720 --> 10:13:31,720 And then once we've defined those different objects, 13181 10:13:31,720 --> 10:13:34,720 we connect the tables together and model the connections. 13182 10:13:34,720 --> 10:13:37,720 Now, a little trick to kind of make it a little easier 13183 10:13:37,720 --> 10:13:40,720 on ourselves is we can look in these columns, 13184 10:13:40,720 --> 10:13:42,720 and look in the columns that have duplicate information 13185 10:13:42,720 --> 10:13:44,720 vertically that's string information. 13186 10:13:44,720 --> 10:13:47,720 So a rating is just a number like zero through five. 13187 10:13:47,720 --> 10:13:50,720 So we don't worry too much about integers and numbers 13188 10:13:50,720 --> 10:13:52,720 and that kind of stuff, or whatever. 13189 10:13:52,720 --> 10:13:53,720 But we do look for strings. 13190 10:13:53,720 --> 10:13:55,720 And the problem here is we got like these strings 13191 10:13:55,720 --> 10:13:58,720 occur many times, and so these are the problems. 13192 10:13:58,720 --> 10:14:01,720 And so we have to put these things where there is 13193 10:14:01,720 --> 10:14:04,720 replication of string data kind of in the vertical dimension. 13194 10:14:04,720 --> 10:14:07,720 We have to put those in different tables. 13195 10:14:07,720 --> 10:14:09,720 And so we'll start up. 
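The "look for vertical duplication of strings" trick described above can be tried out on a small flat table. This is only an illustrative sketch: the rows are made-up sample data and the helper name `repeated_string_columns` is hypothetical, not something from the course materials.

```python
from collections import Counter

# Made-up flat "spreadsheet" rows, one per track
rows = [
    {"title": "Song A", "artist": "Band X", "album": "Album 1", "genre": "Rock", "rating": 5},
    {"title": "Song B", "artist": "Band X", "album": "Album 1", "genre": "Rock", "rating": 5},
    {"title": "Song C", "artist": "Band Y", "album": "Album 2", "genre": "Metal", "rating": 4},
]

def repeated_string_columns(rows):
    """Return the columns whose string values repeat from row to row."""
    repeated = []
    for column in rows[0]:
        values = [row[column] for row in rows]
        # Numbers such as ratings are ignored; only strings are checked
        if all(isinstance(value, str) for value in values):
            counts = Counter(values)
            if any(n > 1 for n in counts.values()):
                repeated.append(column)
    return repeated

candidates = repeated_string_columns(rows)
print(candidates)
```

With this sample data the artist, album and genre columns are flagged as candidates for their own tables, while title (no repeats) and rating (a number) stay with the track.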
13196 10:14:09,720 --> 10:14:12,720 Now, the first question that you have to ask yourself 13197 10:14:12,720 --> 10:14:14,720 when you're going to draw this picture of how this data 13198 10:14:14,720 --> 10:14:16,720 is in multiple tables and connected together 13199 10:14:16,720 --> 10:14:19,720 is what is the first one that you're going to write down? 13200 10:14:19,720 --> 10:14:21,720 And this is an interesting debate, 13201 10:14:21,720 --> 10:14:23,720 and often people are sitting in a conference room, 13202 10:14:23,720 --> 10:14:25,720 and people who have experience kind of know what to do. 13203 10:14:25,720 --> 10:14:28,720 Usually if it's a multi-user system, 13204 10:14:28,720 --> 10:14:30,720 like a learning management system, 13205 10:14:30,720 --> 10:14:32,720 the users might be the central concept. 13206 10:14:32,720 --> 10:14:34,720 Perhaps the courses might be the central concept. 13207 10:14:34,720 --> 10:14:36,720 This is a single user system, 13208 10:14:36,720 --> 10:14:38,720 and so you can think, well, 13209 10:14:38,720 --> 10:14:40,720 what is really this application about? 13210 10:14:40,720 --> 10:14:41,720 It's not about people. 13211 10:14:41,720 --> 10:14:43,720 It's one person. 13212 10:14:43,720 --> 10:14:45,720 But it is about tracks. 13213 10:14:45,720 --> 10:14:47,720 And so we can say, okay, 13214 10:14:47,720 --> 10:14:51,720 here we'll take the track is probably the sort of 13215 10:14:51,720 --> 10:14:55,720 most foundational notion of this application. 13216 10:14:55,720 --> 10:14:57,720 And then we can take and say, okay, 13217 10:14:57,720 --> 10:15:01,720 now that we've decided that tracks are the foundational notion, 13218 10:15:01,720 --> 10:15:05,720 which of these columns are simply an attribute of the track? 13219 10:15:05,720 --> 10:15:09,720 Not really the cheating way and the easy way. 
13220 10:15:09,720 --> 10:15:11,720 And this particular one is like these numbers, 13221 10:15:11,720 --> 10:15:14,720 all these numbers, like this number and these numbers. 13222 10:15:14,720 --> 10:15:16,720 Not that one. 13223 10:15:16,720 --> 10:15:18,720 They just go along with track. 13224 10:15:18,720 --> 10:15:19,720 And so we'll put that in. 13225 10:15:19,720 --> 10:15:22,720 We've got the track title, rating, length, and count, 13226 10:15:22,720 --> 10:15:25,720 and we put that in. 13227 10:15:25,720 --> 10:15:28,720 And then the question is we've got the remaining things are, 13228 10:15:28,720 --> 10:15:30,720 we've got the artist, we've got the album, 13229 10:15:30,720 --> 10:15:32,720 and we've got the genre. 13230 10:15:32,720 --> 10:15:34,720 And so we can say, okay, well, we can't, 13231 10:15:34,720 --> 10:15:36,720 we've got some vertical duplication, 13232 10:15:36,720 --> 10:15:37,720 so we're going to say, okay, 13233 10:15:37,720 --> 10:15:40,720 this track probably belongs to an album. 13234 10:15:40,720 --> 10:15:45,720 So let's pull out the album into its own table. 13235 10:15:45,720 --> 10:15:48,720 Oops. 13236 10:15:48,720 --> 10:15:51,720 Pull the album out into its own table. 13237 10:16:03,720 --> 10:16:06,720 Pull the album out into its own table. 13238 10:16:06,720 --> 10:16:07,720 And so that pulls that out. 13239 10:16:07,720 --> 10:16:08,720 And then you say, okay, 13240 10:16:08,720 --> 10:16:10,720 what would be the next thing that we're going to pull out? 13241 10:16:10,720 --> 10:16:12,720 So we've pulled out the track. 13242 10:16:12,720 --> 10:16:14,720 We've got this taken care of, this taken care of, that taken, 13243 10:16:14,720 --> 10:16:16,720 now we've got the album. 13244 10:16:16,720 --> 10:16:18,720 Well, albums belong to artists. 13245 10:16:18,720 --> 10:16:21,720 So let's take out the artist. 
13246 10:16:21,720 --> 10:16:24,720 And then we'll pick where the genre belongs, 13247 10:16:24,720 --> 10:16:26,720 and we'll just say that the genre belongs to the track. 13248 10:16:26,720 --> 10:16:28,720 And so because there might be albums 13249 10:16:28,720 --> 10:16:30,720 with more than one different genre. 13250 10:16:30,720 --> 10:16:32,720 So each album is not necessarily a rock album. 13251 10:16:32,720 --> 10:16:34,720 It could have a rock track and a country track, 13252 10:16:34,720 --> 10:16:36,720 et cetera, et cetera, et cetera. 13253 10:16:36,720 --> 10:16:38,720 And so now what we've got is we've got four tables, right? 13254 10:16:38,720 --> 10:16:39,720 We've got a track table. 13255 10:16:39,720 --> 10:16:41,720 We've got an album table, an artist table, 13256 10:16:41,720 --> 10:16:42,720 and a genre table. 13257 10:16:42,720 --> 10:16:44,720 And if we sort of double check, 13258 10:16:44,720 --> 10:16:46,720 all of the columns that had vertical duplication in them 13259 10:16:46,720 --> 10:16:50,720 now have their own little table. 13260 10:16:50,720 --> 10:16:52,720 So we can eliminate, 13261 10:16:52,720 --> 10:16:55,720 the next thing we'll do is to show how we're going to eliminate 13262 10:16:55,720 --> 10:16:59,720 this vertical data replication 13263 10:16:59,720 --> 10:17:02,720 by showing how you represent these relationships 13264 10:17:02,720 --> 10:17:08,720 that we just created inside of the database. 13265 10:17:08,720 --> 10:17:11,720 Now we're going to represent these relationships in the database. 13266 10:17:11,720 --> 10:17:13,720 And again, what we're trying to solve here 13267 10:17:13,720 --> 10:17:15,720 is this notion of database normalization, 13268 10:17:15,720 --> 10:17:17,720 third normal form. 13269 10:17:17,720 --> 10:17:19,720 There is so much theory, right? 
13270 10:17:19,720 --> 10:17:21,720 But in this lecture, 13271 10:17:21,720 --> 10:17:23,720 I'm just going to condense this down to 13272 10:17:23,720 --> 10:17:25,720 don't replicate string data 13273 10:17:25,720 --> 10:17:27,720 and use what are called keys, 13274 10:17:27,720 --> 10:17:30,720 use integer keys to point at those things. 13275 10:17:30,720 --> 10:17:32,720 And we're going to use these integers then to point. 13276 10:17:32,720 --> 10:17:34,720 So assign each row an integer, 13277 10:17:34,720 --> 10:17:36,720 and then we're going to point from one row to another 13278 10:17:36,720 --> 10:17:37,720 using those integers. 13279 10:17:37,720 --> 10:17:40,720 And so we're going to add these special key columns 13280 10:17:40,720 --> 10:17:42,720 to each of the tables. 13281 10:17:42,720 --> 10:17:44,720 And help in the database will even give us help 13282 10:17:44,720 --> 10:17:46,720 managing those. 13283 10:17:46,720 --> 10:17:48,720 So we still need to keep track of 13284 10:17:48,720 --> 10:17:51,720 who is the creator of the album, 13285 10:17:51,720 --> 10:17:53,720 which album a track belongs to. 13286 10:17:53,720 --> 10:17:55,720 We've got to create these relationships 13287 10:17:55,720 --> 10:17:58,720 and we have to come up with ways to store those relationships. 13288 10:17:58,720 --> 10:18:01,720 And so the idea is we're going to have 13289 10:18:01,720 --> 10:18:04,720 a column in a table which is the key column. 13290 10:18:04,720 --> 10:18:06,720 And we're going to call this the ID column. 13291 10:18:06,720 --> 10:18:07,720 And so this is a row, 13292 10:18:07,720 --> 10:18:08,720 it might have many bits of data here, 13293 10:18:08,720 --> 10:18:11,720 but in this case it's just the name of an artist. 13294 10:18:11,720 --> 10:18:14,720 So this album is going to belong to an artist. 13295 10:18:14,720 --> 10:18:17,720 And we're going to assign a number inside the database. 
13296 10:18:17,720 --> 10:18:21,720 And so that Led Zeppelin is one and AC/DC is two. 13297 10:18:21,720 --> 10:18:24,720 And so we have this key, this is called a primary key. 13298 10:18:24,720 --> 10:18:26,720 And then later when we want to say that the 13299 10:18:26,720 --> 10:18:31,720 Who Made Who album really was done by AC/DC, 13300 10:18:31,720 --> 10:18:33,720 we put the number two in. 13301 10:18:33,720 --> 10:18:36,720 And so the difference here is instead of saying AC/DC 13302 10:18:36,720 --> 10:18:39,720 in this record we just put the number two 13303 10:18:39,720 --> 10:18:41,720 once we've established this number. 13304 10:18:41,720 --> 10:18:44,720 So we assign keys and then we have these pointers 13305 10:18:44,720 --> 10:18:45,720 that point back. 13306 10:18:45,720 --> 10:18:47,720 And so that's how we model a relationship 13307 10:18:47,720 --> 10:18:50,720 with these small integer numbers. 13308 10:18:50,720 --> 10:18:53,720 And so there are three basic kinds of keys that we use. 13309 10:18:53,720 --> 10:18:56,720 One is the primary key and that is that little ID column 13310 10:18:56,720 --> 10:18:58,720 that is just a number. 13311 10:18:58,720 --> 10:19:00,720 But once we give Led Zeppelin the number one, 13312 10:19:00,720 --> 10:19:05,720 Led Zeppelin has got the key one for the rest of that database. 13313 10:19:05,720 --> 10:19:08,720 The logical key is the text area that we use 13314 10:19:08,720 --> 10:19:09,720 that you might look up. 13315 10:19:09,720 --> 10:19:12,720 So the title of the band or the title of the album, 13316 10:19:12,720 --> 10:19:13,720 that's the logical key. 13317 10:19:13,720 --> 10:19:15,720 And then the foreign key is one of these keys 13318 10:19:15,720 --> 10:19:18,720 that is really pointing to the primary key of another row. 13319 10:19:18,720 --> 10:19:21,720 So that's called a foreign key. 
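The primary key, logical key and foreign key idea above can be made concrete with Python's built-in sqlite3 module. This is a rough sketch under assumptions: the table and column names (Artist, Album, id, artist_id, name, title) follow the naming used later in the lecture, and the JOIN at the end goes slightly ahead of what has been covered so far, purely to show the pointer being followed.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# id is the primary key; name and title are the logical keys
cur.execute("CREATE TABLE Artist (id INTEGER PRIMARY KEY AUTOINCREMENT, name TEXT)")
cur.execute("CREATE TABLE Album (id INTEGER PRIMARY KEY AUTOINCREMENT, "
            "artist_id INTEGER, title TEXT)")

# The database hands out the primary keys: 1, then 2
cur.execute("INSERT INTO Artist (name) VALUES ('Led Zeppelin')")
cur.execute("INSERT INTO Artist (name) VALUES ('AC/DC')")

# Look the artist up by its logical key to learn its primary key
cur.execute("SELECT id FROM Artist WHERE name = 'AC/DC'")
acdc_id = cur.fetchone()[0]

# The album row stores the number, not the string 'AC/DC' (foreign key)
cur.execute("INSERT INTO Album (artist_id, title) VALUES (?, ?)",
            (acdc_id, "Who Made Who"))
conn.commit()

# Follow the foreign key back to the artist row
cur.execute("SELECT Album.title, Artist.name FROM Album "
            "JOIN Artist ON Album.artist_id = Artist.id")
result = cur.fetchall()
print(acdc_id, result)
```

The string AC/DC is stored exactly once, in the Artist table; the album only carries the integer that points at it.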
13320 10:19:21,720 --> 10:19:24,720 And you might think that you want to use something 13321 10:19:24,720 --> 10:19:26,720 like an email address as the primary key 13322 10:19:26,720 --> 10:19:28,720 for a user table or something like that. 13323 10:19:28,720 --> 10:19:30,720 The logical key should always be separate 13324 10:19:30,720 --> 10:19:32,720 and there should always be a primary key, 13325 10:19:32,720 --> 10:19:33,720 that integer number. 13326 10:19:33,720 --> 10:19:35,720 Because things like logical keys do change. 13327 10:19:35,720 --> 10:19:37,720 People do get new email addresses. 13328 10:19:37,720 --> 10:19:39,720 And if you've got that email address as a foreign key 13329 10:19:39,720 --> 10:19:42,720 pointing all over the place, it doesn't work out so well. 13330 10:19:42,720 --> 10:19:45,720 And so that's why you use these small integer numbers 13331 10:19:45,720 --> 10:19:47,720 that have no meaning outside. 13332 10:19:47,720 --> 10:19:49,720 So sometimes if you're on a system and you see a URL 13333 10:19:49,720 --> 10:19:52,720 and you see some number like 422,016, 13334 10:19:52,720 --> 10:19:54,720 you're like, oh, that turns out to probably be 13335 10:19:54,720 --> 10:19:56,720 my primary key in their database. 13336 10:19:56,720 --> 10:19:58,720 So sometimes you can look in a URL 13337 10:19:58,720 --> 10:20:00,720 and you can see these primary keys in the URL, 13338 10:20:00,720 --> 10:20:04,720 but they don't mean anything outside of that particular system. 13339 10:20:04,720 --> 10:20:07,720 So like I said, a foreign key is a key that is 13340 10:20:07,720 --> 10:20:10,720 really pointing at a row in a different table. 13341 10:20:10,720 --> 10:20:13,720 So the album has a primary key for it, 13342 10:20:13,720 --> 10:20:16,720 but the artist underscore ID points to a row 13343 10:20:16,720 --> 10:20:19,720 in the artist table, as we will soon see. 13344 10:20:19,720 --> 10:20:21,720 I have a naming convention. 
13345 10:20:21,720 --> 10:20:24,720 And in my naming convention, on this lecture, 13346 10:20:24,720 --> 10:20:26,720 I use ID for the primary key. 13347 10:20:26,720 --> 10:20:29,720 And then artist underscore ID, I use uppercase 13348 10:20:29,720 --> 10:20:30,720 for the table names. 13349 10:20:30,720 --> 10:20:34,720 And then artist underscore ID says this is a key, 13350 10:20:34,720 --> 10:20:37,720 this is just a key that points to the ID key 13351 10:20:37,720 --> 10:20:38,720 of the artist table. 13352 10:20:38,720 --> 10:20:40,720 And so that's what I do, so you'll see. 13353 10:20:40,720 --> 10:20:41,720 And all my stuff, I'll use that. 13354 10:20:41,720 --> 10:20:43,720 It's a convention. 13355 10:20:43,720 --> 10:20:46,720 It's not something SQL forces you to do. 13356 10:20:46,720 --> 10:20:48,720 But you will find when you go to organizations 13357 10:20:48,720 --> 10:20:49,720 and work on their databases, 13358 10:20:49,720 --> 10:20:51,720 these conventions are very important. 13359 10:20:51,720 --> 10:20:53,720 So I can do something and you can understand 13360 10:20:53,720 --> 10:20:55,720 the rules in which I created. 13361 10:20:55,720 --> 10:20:58,720 Some of these, you'll find this used by some people. 13362 10:20:58,720 --> 10:21:00,720 You'll find completely different conventions, 13363 10:21:00,720 --> 10:21:01,720 and that'll be okay. 13364 10:21:01,720 --> 10:21:03,720 Whatever convention your organization uses, 13365 10:21:03,720 --> 10:21:05,720 learn that convention. 13366 10:21:05,720 --> 10:21:08,720 So now we're going to talk about how we put these keys in 13367 10:21:08,720 --> 10:21:11,720 and then how we actually make the connections 13368 10:21:11,720 --> 10:21:13,720 from one row to another row. 
13369 10:21:17,720 --> 10:21:19,720 So now that we know what a primary key, 13370 10:21:19,720 --> 10:21:20,720 logical key, and foreign key are, 13371 10:21:20,720 --> 10:21:22,720 we're going to actually start putting these together 13372 10:21:22,720 --> 10:21:26,720 and creating tables that have these kind of values in them. 13373 10:21:26,720 --> 10:21:28,720 So when we were done, we drew this picture 13374 10:21:28,720 --> 10:21:30,720 that was sort of a logical model 13375 10:21:30,720 --> 10:21:33,720 of how our data would be spread across four tables 13376 10:21:33,720 --> 10:21:35,720 and how those tables are connected. 13377 10:21:35,720 --> 10:21:38,720 Now we have to take this and we have to map it in a way 13378 10:21:38,720 --> 10:21:41,720 that leads to the columns 13379 10:21:41,720 --> 10:21:44,720 and the needed columns in each of our database tables. 13380 10:21:44,720 --> 10:21:45,720 And so here's what we do. 13381 10:21:45,720 --> 10:21:47,720 We basically have to take, 13382 10:21:47,720 --> 10:21:49,720 and for each of these, 13383 10:21:49,720 --> 10:21:51,720 when we're going to build a track table, 13384 10:21:51,720 --> 10:21:53,720 when we're going to build a track table, 13385 10:21:53,720 --> 10:21:54,720 we add a primary key. 13386 10:21:54,720 --> 10:21:57,720 So we just added an ID field to every one of these things. 13387 10:21:57,720 --> 10:21:59,720 And that's so we have a place to store 13388 10:21:59,720 --> 10:22:02,720 the sequence number of this particular row. 13389 10:22:02,720 --> 10:22:03,720 We have logical keys. 13390 10:22:03,720 --> 10:22:04,720 We've just marked those. 13391 10:22:04,720 --> 10:22:05,720 Those are strings. 13392 10:22:05,720 --> 10:22:07,720 And then we have things like, you know, 13393 10:22:07,720 --> 10:22:08,720 rating, length, and count. 13394 10:22:08,720 --> 10:22:09,720 They just kind of go in here. 13395 10:22:09,720 --> 10:22:12,720 And now we have to model a relationship. 
13396 10:22:12,720 --> 10:22:14,720 So what we do is we, in the table, 13397 10:22:14,720 --> 10:22:16,720 the relationship starts from, 13398 10:22:16,720 --> 10:22:18,720 we put one more column in, 13399 10:22:18,720 --> 10:22:20,720 and this is the one I will name album ID, 13400 10:22:20,720 --> 10:22:22,720 and that just is an integer column 13401 10:22:22,720 --> 10:22:25,720 that's going to record the album ID. 13402 10:22:25,720 --> 10:22:28,720 So this might be 16, and then 16 goes in here. 13403 10:22:28,720 --> 10:22:31,720 So there's one of these columns that's a foreign key 13404 10:22:31,720 --> 10:22:32,720 that points to this. 13405 10:22:32,720 --> 10:22:33,720 And that's why it's foreign. 13406 10:22:33,720 --> 10:22:35,720 This is a key that's not in the track table. 13407 10:22:35,720 --> 10:22:38,720 This is a key in the album table that we're pointing to. 13408 10:22:38,720 --> 10:22:40,720 And so there's a foreign key. 13409 10:22:40,720 --> 10:22:42,720 And that's what we have to do. 13410 10:22:42,720 --> 10:22:44,720 And we just do that over and over and over again. 13411 10:22:44,720 --> 10:22:47,720 And we quickly convert that picture 13412 10:22:47,720 --> 10:22:49,720 that was a logical picture 13413 10:22:49,720 --> 10:22:52,720 to having every table has a primary key. 13414 10:22:52,720 --> 10:22:54,720 And every time we have a starting point, 13415 10:22:54,720 --> 10:22:56,720 we have a foreign key, foreign key, 13416 10:22:56,720 --> 10:22:57,720 and then foreign key. 13417 10:22:57,720 --> 10:22:59,720 And then we mark these things as logical key, 13418 10:22:59,720 --> 10:23:00,720 logical key, logical key, 13419 10:23:00,720 --> 10:23:02,720 and we'll see how we do that. 13420 10:23:02,720 --> 10:23:03,720 And so that's the picture. 
13421 10:23:03,720 --> 10:23:05,720 Now we have a picture of exactly 13422 10:23:05,720 --> 10:23:07,720 how we're going to lay these tables out 13423 10:23:07,720 --> 10:23:10,720 in the fields that we need in these tables. 13424 10:23:10,720 --> 10:23:19,720 So we're going to do a create table statement. 13425 10:23:19,720 --> 10:23:26,720 And I've got this create table statement sitting there. 13426 10:23:26,720 --> 10:23:29,720 And so this one's going to be a little bit different. 13427 10:23:29,720 --> 10:23:32,720 We're going to say create table artist. 13428 10:23:32,720 --> 10:23:35,720 And the ID field is integer. 13429 10:23:35,720 --> 10:23:38,720 And we're going to add all of this stuff. 13430 10:23:38,720 --> 10:23:43,720 This is adding to the column to tell it additional stuff. 13431 10:23:43,720 --> 10:23:44,720 It's a primary key, 13432 10:23:44,720 --> 10:23:46,720 which means we're going to use it to look up a lot. 13433 10:23:46,720 --> 10:23:47,720 It's automatically incremented, 13434 10:23:47,720 --> 10:23:49,720 which means the database is actually going to provide 13435 10:23:49,720 --> 10:23:51,720 this number for us as we insert records. 13436 10:23:51,720 --> 10:23:53,720 It's not allowed to be null. 13437 10:23:53,720 --> 10:23:55,720 It's not allowed to be empty. 13438 10:23:55,720 --> 10:23:56,720 And it's supposed to be unique. 13439 10:23:56,720 --> 10:24:01,720 And then the artist is going to have a name column, 13440 10:24:01,720 --> 10:24:03,720 a name column that's just text. 13441 10:24:03,720 --> 10:24:07,720 So let's do that. 13442 10:24:07,720 --> 10:24:08,720 We already have our users. 13443 10:24:08,720 --> 10:24:11,720 And now we're going to do a create table in this SQL. 13444 10:24:11,720 --> 10:24:12,720 And you can do that. 13445 10:24:12,720 --> 10:24:13,720 That's okay. 13446 10:24:13,720 --> 10:24:14,720 That's totally fine. 13447 10:24:14,720 --> 10:24:15,720 And we have to get this right. 
13448 10:24:15,720 --> 10:24:17,720 And we say away we go. 13449 10:24:17,720 --> 10:24:20,720 And so now if I take a look at database structure, 13450 10:24:20,720 --> 10:24:23,720 I've got a users table as well as that users table 13451 10:24:23,720 --> 10:24:26,720 we were playing with before and this artist table. 13452 10:24:26,720 --> 10:24:31,720 Let me go ahead and delete this users table just to say goodbye. 13453 10:24:31,720 --> 10:24:33,720 Okay, so now we have the artist table. 13454 10:24:33,720 --> 10:24:34,720 And we take a look. 13455 10:24:34,720 --> 10:24:35,720 And it's got an ID. 13456 10:24:35,720 --> 10:24:36,720 And it knows all about this stuff. 13457 10:24:36,720 --> 10:24:39,720 Okay? 13458 10:24:39,720 --> 10:24:42,720 So that created the table. 13459 10:24:42,720 --> 10:24:43,720 We're going to keep doing this. 13460 10:24:43,720 --> 10:24:46,720 The next thing that we're going to show here is we're going to show 13461 10:24:46,720 --> 10:24:48,720 the foreign key, right? 13462 10:24:48,720 --> 10:24:50,720 So artist ID is just an integer. 13463 10:24:50,720 --> 10:24:53,720 In some database languages like MySQL and Oracle, 13464 10:24:53,720 --> 10:24:56,720 you would put more stuff here to say this is a foreign key, blah, blah, blah. 13465 10:24:56,720 --> 10:25:00,720 But in SQLite, we keep it simple and just say that is an integer column. 13466 10:25:00,720 --> 10:25:01,720 That's a foreign key. 13467 10:25:01,720 --> 10:25:04,720 The album table has a primary key and a foreign key, 13468 10:25:04,720 --> 10:25:12,720 and then the title. 13469 10:25:12,720 --> 10:25:17,720 So we'll go back and we'll grab that text out of my little page. 13470 10:25:17,720 --> 10:25:19,720 This create table. 13471 10:25:19,720 --> 10:25:22,720 Go back to execute SQL. 13472 10:25:22,720 --> 10:25:28,720 And then run that. 13473 10:25:28,720 --> 10:25:32,720 And we'll continue with just the genre table has an ID on it. 
13474 10:25:32,720 --> 10:25:36,720 And primary key, you'll just copy and paste these. 13475 10:25:36,720 --> 10:25:38,720 That whole thing, you do that over and over and over again. 13476 10:25:38,720 --> 10:25:43,720 So we'll go in here and run that one. 13477 10:25:43,720 --> 10:25:50,720 And so the last one we're going to do is the track table. 13478 10:25:50,720 --> 10:25:52,720 And the only thing that's kind of weird about the track table 13479 10:25:52,720 --> 10:25:54,720 is it's got two foreign keys, right? 13480 10:25:54,720 --> 10:25:56,720 It's got an album ID and a genre ID. 13481 10:25:56,720 --> 10:25:59,720 Once you draw the picture, you just sort of literally translate these things. 13482 10:25:59,720 --> 10:26:02,720 It's got two foreign keys and a primary key that's pretty much 13483 10:26:02,720 --> 10:26:08,720 just like all those other primary keys. 13484 10:26:08,720 --> 10:26:12,720 And count's an integer and length's an integer, all that stuff. 13485 10:26:12,720 --> 10:26:14,720 And now we've got it. 13486 10:26:14,720 --> 10:26:15,720 So if we take a look at our database structure, 13487 10:26:15,720 --> 10:26:19,720 we're going to see that our album, genre, and track are all set up. 13488 10:26:19,720 --> 10:26:23,720 And these are the new columns that we just made with those create statements. 13489 10:26:23,720 --> 10:26:29,720 Okay? 13490 10:26:29,720 --> 10:26:30,720 So now let's insert some data. 13491 10:26:30,720 --> 10:26:34,720 This first insert statement is kind of important to take a look at. 13492 10:26:34,720 --> 10:26:37,720 So insert into, by the way, the keywords can be upper or lowercase, 13493 10:26:37,720 --> 10:26:39,720 table name, columns. 13494 10:26:39,720 --> 10:26:41,720 Now, this table has two columns. 13495 10:26:41,720 --> 10:26:43,720 It has ID and name. 13496 10:26:43,720 --> 10:26:46,720 But we told the database that ID was auto increment.
13497 10:26:46,720 --> 10:26:48,720 So it's going to actually give us the number. 13498 10:26:48,720 --> 10:26:51,720 It's going to assign the number rather than make us assign. 13499 10:26:51,720 --> 10:26:53,720 We could make it be one, two, three. 13500 10:26:53,720 --> 10:26:55,720 But we say, hey, database, you're good at this. 13501 10:26:55,720 --> 10:26:57,720 Why don't you make it one, two, three? 13502 10:26:57,720 --> 10:27:01,720 And so there is going to be a record that it adds Led Zeppelin. 13503 10:27:01,720 --> 10:27:05,720 So let's take a look at that. 13504 10:27:05,720 --> 10:27:10,720 So we'll insert Led Zeppelin. 13505 10:27:10,720 --> 10:27:13,720 Oops. 13506 10:27:13,720 --> 10:27:16,720 Over to SQL. 13507 10:27:16,720 --> 10:27:18,720 Insert Led Zeppelin and run it. 13508 10:27:18,720 --> 10:27:22,720 So now if I look at database structure and I look at the, 13509 10:27:22,720 --> 10:27:25,720 let's look at browse data and look at the artist database, 13510 10:27:25,720 --> 10:27:27,720 you will see that I put Led Zeppelin in, 13511 10:27:27,720 --> 10:27:30,720 but this ID field here was auto incremented. 13512 10:27:30,720 --> 10:27:33,720 And so it was put there by the database. 13513 10:27:33,720 --> 10:27:44,720 And now when we do the next insert, which is ACDC, 13514 10:27:44,720 --> 10:27:47,720 and we take a look at the data, 13515 10:27:47,720 --> 10:27:49,720 we will see that ACDC is two. 13516 10:27:49,720 --> 10:27:52,720 Now, if you're writing this in a program, 13517 10:27:52,720 --> 10:27:54,720 if you're going to write this in a program, 13518 10:27:54,720 --> 10:27:57,720 you can get these numbers back from the database in your program, 13519 10:27:57,720 --> 10:27:59,720 but I'm not writing this in a program, 13520 10:27:59,720 --> 10:28:02,720 so I have to remember that one is Zeppelin 13521 10:28:02,720 --> 10:28:04,720 and two is ACDC. 
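The remark that a program can get these numbers back from the database can be sketched with sqlite3, where the cursor's lastrowid attribute holds the row ID of the most recent insert. The table layout and the variable names zeppelin_id and acdc_id are assumptions for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute(
    "CREATE TABLE Artist ("
    "id INTEGER NOT NULL PRIMARY KEY AUTOINCREMENT UNIQUE, name TEXT)"
)

# Only the name is supplied; the database assigns the id.
cur.execute("INSERT INTO Artist (name) VALUES (?)", ("Led Zeppelin",))
zeppelin_id = cur.lastrowid   # the id the database just assigned

cur.execute("INSERT INTO Artist (name) VALUES (?)", ("AC/DC",))
acdc_id = cur.lastrowid
conn.commit()

rows = cur.execute("SELECT id, name FROM Artist ORDER BY id").fetchall()
print(zeppelin_id, acdc_id, rows)
conn.close()
```

Holding the IDs in variables plays the role of the handwritten cheat sheet.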
13522 10:28:04,720 --> 10:28:06,720 So I'm going to keep myself a little cheat sheet here 13523 10:28:06,720 --> 10:28:09,720 to remember that because everywhere else in the program 13524 10:28:09,720 --> 10:28:11,720 that we're going to say Led Zeppelin, 13525 10:28:11,720 --> 10:28:13,720 I've got to say one now because the artist, 13526 10:28:13,720 --> 10:28:17,720 the artist ID of one means Led Zeppelin in those rows. 13527 10:28:17,720 --> 10:28:19,720 And so now we're going to go back 13528 10:28:19,720 --> 10:28:22,720 and we're going to take a look at the next one. 13529 10:28:22,720 --> 10:28:24,720 And now we're going to put the genre in. 13530 10:28:24,720 --> 10:28:27,720 If you think about it, we're working from the leaves out. 13531 10:28:27,720 --> 10:28:29,720 The track will be the last table that will update 13532 10:28:29,720 --> 10:28:31,720 because you have to define the keys 13533 10:28:31,720 --> 10:28:33,720 for things like rock and metal and Led Zeppelin 13534 10:28:33,720 --> 10:28:35,720 and all those other things. 13535 10:28:35,720 --> 10:28:38,720 And again, even though the genre table has two columns, 13536 10:28:38,720 --> 10:28:40,720 ID and name, we're only going to specify the name 13537 10:28:40,720 --> 10:28:45,720 and let the database assign the value. 13538 10:28:45,720 --> 10:28:47,720 So I'm going to insert both of these 13539 10:28:47,720 --> 10:28:51,720 and use the semicolon trick. 13540 10:28:51,720 --> 10:28:55,720 Put a semicolon here and a semicolon there. 13541 10:28:55,720 --> 10:28:58,720 And run that. 13542 10:28:58,720 --> 10:29:00,720 And so if I take a look at my browse data 13543 10:29:00,720 --> 10:29:02,720 and I look at the genre, 13544 10:29:02,720 --> 10:29:05,720 it's assigned one to rock and two to metal. 13545 10:29:05,720 --> 10:29:07,720 I'm going to write that down. 13546 10:29:07,720 --> 10:29:11,720 One rock, two metal. 
13547 10:29:11,720 --> 10:29:13,720 I should have done something like rock and country 13548 10:29:13,720 --> 10:29:14,720 because I can't even tell the difference 13549 10:29:14,720 --> 10:29:17,720 between rock and metal, but whatever. 13550 10:29:17,720 --> 10:29:22,720 My musical skill is not what's at issue in this class. 13551 10:29:22,720 --> 10:29:25,720 So now we're going to put an album in. 13552 10:29:25,720 --> 10:29:27,720 The album is the first thing that has a foreign key. 13553 10:29:27,720 --> 10:29:31,720 So if you remember the thing, the album points to artist. 13554 10:29:31,720 --> 10:29:34,720 And so that means it has a foreign key of artist ID. 13555 10:29:34,720 --> 10:29:36,720 And so we have to explicitly say this 13556 10:29:36,720 --> 10:29:40,720 because the system doesn't know which artist who made who is. 13557 10:29:40,720 --> 10:29:44,720 But we know that who made who is ACDC and that's two. 13558 10:29:44,720 --> 10:29:46,720 And so we know to put artist ID in. 13559 10:29:46,720 --> 10:29:49,720 So we'll say insert into album title artist ID. 13560 10:29:49,720 --> 10:29:51,720 And so we have to know what this two number is. 13561 10:29:51,720 --> 10:29:59,720 And of course because we have our handy little cheat sheet, 13562 10:29:59,720 --> 10:30:02,720 we can go over to execute and run that. 13563 10:30:02,720 --> 10:30:07,720 And I'll put a semicolon there and a semicolon there and run it. 13564 10:30:07,720 --> 10:30:16,720 And so now we have in the album field, we now have this. 13565 10:30:16,720 --> 10:30:18,720 And so this was assigned. 13566 10:30:18,720 --> 10:30:22,720 And so who made who, you still have to write down that. 13567 10:30:22,720 --> 10:30:30,720 Who made who is album one and album two is Led Zeppelin four. 13568 10:30:30,720 --> 10:30:34,720 That makes it even more complex because the name of the album is at Roman numeral four. 13569 10:30:34,720 --> 10:30:36,720 I'm sure I can figure that out. 
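Inserting albums with an explicit foreign key might look like the sketch below. The column names title and artist_id and the album title 'IV' are assumptions for illustration; the IDs 1 (Led Zeppelin) and 2 (AC/DC) come from the cheat sheet. The two inserts are separated by semicolons and run as one script.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# artist_id is just an integer column; in this SQLite schema nothing
# more is declared to mark it as a foreign key.
cur.executescript("""
CREATE TABLE Album (
    id        INTEGER NOT NULL PRIMARY KEY AUTOINCREMENT UNIQUE,
    artist_id INTEGER,
    title     TEXT
);
INSERT INTO Album (title, artist_id) VALUES ('Who Made Who', 2);
INSERT INTO Album (title, artist_id) VALUES ('IV', 1);
""")

albums = cur.execute(
    "SELECT id, title, artist_id FROM Album ORDER BY id"
).fetchall()
print(albums)
conn.close()
```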
13570 10:30:36,720 --> 10:30:37,720 Okay. 13571 10:30:37,720 --> 10:30:42,720 So the next thing that we're going to do is we're going to insert the track record. 13572 10:30:42,720 --> 10:30:45,720 Now if you think about the track record, the track has two foreign keys. 13573 10:30:45,720 --> 10:30:50,720 And it's got a lot of stuff. 13574 10:30:50,720 --> 10:30:51,720 It's got the title. 13575 10:30:51,720 --> 10:30:52,720 It's got the rating length count. 13576 10:30:52,720 --> 10:30:53,720 But then we got the two foreign keys. 13577 10:30:53,720 --> 10:30:56,720 And so we have to know these numbers. 13578 10:30:56,720 --> 10:31:02,720 So this two one, this two one, this one two is the genre. 13579 10:31:02,720 --> 10:31:07,720 We're specifying the genre and the album that this track is from by those numbers. 13580 10:31:07,720 --> 10:31:10,720 Now, again, we have to use this cheat sheet. 13581 10:31:10,720 --> 10:31:15,720 But if this was a program, the program would know that one was Zeppelin 13582 10:31:15,720 --> 10:31:19,720 and our one was who made who and two was Led Zeppelin four. 13583 10:31:19,720 --> 10:31:24,720 And so this kind of stuff is easier for the program to understand 13584 10:31:24,720 --> 10:31:26,720 than for us to keep track of and understand. 13585 10:31:26,720 --> 10:31:29,720 But just so we can get through these few records. 13586 10:31:29,720 --> 10:31:32,720 And that's why I rely so heavily on my cheat sheet. 13587 10:31:32,720 --> 10:31:36,720 So here we are all with all these numbers. 13588 10:31:36,720 --> 10:31:38,720 The foreign keys are the tricky part here. 13589 10:31:38,720 --> 10:31:40,720 Everything else is really quite straightforward. 13590 10:31:40,720 --> 10:31:50,720 So now I'm going to insert four records into my track table. 13591 10:31:50,720 --> 10:31:53,720 And then run that. 13592 10:31:53,720 --> 10:31:54,720 Okay. 13593 10:31:54,720 --> 10:31:57,720 So I'll browse data and I look at my track table. 
13594 10:31:57,720 --> 10:32:01,720 This column here, this ID, that's the primary key of the track table. 13595 10:32:01,720 --> 10:32:03,720 And then here are the two foreign keys. 13596 10:32:03,720 --> 10:32:08,720 Now, the interesting thing is now there is replication in these columns, 13597 10:32:08,720 --> 10:32:12,720 but the numbers are what's being replicated and that's okay. 13598 10:32:12,720 --> 10:32:17,720 We went a long way just not to put Led Zeppelin IV in twice. 13599 10:32:17,720 --> 10:32:20,720 We could have made this a string, but by making this an integer, 13600 10:32:20,720 --> 10:32:23,720 it saves tons of storage and makes it super fast. 13601 10:32:23,720 --> 10:32:28,720 It turns out that one of the key things that makes databases super fast 13602 10:32:28,720 --> 10:32:32,720 is using these integers. 13603 10:32:32,720 --> 10:32:33,720 So we take a look at all this stuff. 13604 10:32:33,720 --> 10:32:37,720 We see that in a sense by using these little numbers, 13605 10:32:37,720 --> 10:32:39,720 we are pointing to rows in other tables. 13606 10:32:39,720 --> 10:32:41,720 The foreign keys are always pointing. 13607 10:32:41,720 --> 10:32:43,720 They always point to an ID. 13608 10:32:43,720 --> 10:32:44,720 So these foreign keys are out here. 13609 10:32:44,720 --> 10:32:46,720 This is the primary key up here. 13610 10:32:46,720 --> 10:32:49,720 And they always point to a row in another table. 13611 10:32:49,720 --> 10:32:51,720 And so we have modeled all those relationships. 13612 10:32:51,720 --> 10:32:54,720 And you will notice that in this entire database, 13613 10:32:54,720 --> 10:33:01,720 the title Who Made Who only appears once. 13614 10:33:01,720 --> 10:33:03,720 The word rock only appears once. 13615 10:33:03,720 --> 10:33:06,720 The word ACDC only appears once.
13616 10:33:06,720 --> 10:33:09,720 What we have is we have duplication in our data, 13617 10:33:09,720 --> 10:33:12,720 but we are duplicating the relationships, 13618 10:33:12,720 --> 10:33:16,720 i.e. these little integer numbers, not duplicating the data itself. 13619 10:33:16,720 --> 10:33:20,720 And in something this small, it seems irrelevant. 13620 10:33:20,720 --> 10:33:22,720 But if you have billions of records, 13621 10:33:22,720 --> 10:33:25,720 or hundreds of millions of records, it is very relevant. 13622 10:33:25,720 --> 10:33:27,720 Very, very relevant. 13623 10:33:27,720 --> 10:33:29,720 So the next thing we are going to do is take a look 13624 10:33:29,720 --> 10:33:31,720 at how you actually reconnect all this stuff together 13625 10:33:31,720 --> 10:33:36,720 once we have sort of blown it out using these foreign keys 13626 10:33:36,720 --> 10:33:39,720 and hand-constructing all these relationships, 13627 10:33:39,720 --> 10:33:45,720 now how we bring it back together to show the data to the user. 13628 10:33:45,720 --> 10:33:48,720 So now that we have carefully constructed our relationships 13629 10:33:48,720 --> 10:33:53,720 in the tables, we need to reconstruct the data to show our users. 13630 10:33:53,720 --> 10:33:56,720 And you can kind of see how you would go pull this stuff together, 13631 10:33:56,720 --> 10:33:58,720 but there is a wonderful capability in relational databases 13632 10:33:58,720 --> 10:34:02,720 called join that brings this all back together. 13633 10:34:02,720 --> 10:34:06,720 And so we have done this for efficiency of storage, 13634 10:34:06,720 --> 10:34:08,720 efficiency of scanning, etc. 13635 10:34:08,720 --> 10:34:12,720 But we do need to traverse these foreign keys at times. 13636 10:34:12,720 --> 10:34:16,720 And the database software will do this for us automatically. 
13637 10:34:16,720 --> 10:34:20,720 So the join operation basically is a way to specify in a select statement 13638 10:34:20,720 --> 10:34:23,720 that you want to pull data out of more than one table 13639 10:34:23,720 --> 10:34:26,720 and then specifying using what is called the on clause 13640 10:34:26,720 --> 10:34:30,720 exactly how you want that data pulled out. 13641 10:34:30,720 --> 10:34:32,720 And so here we go. 13642 10:34:32,720 --> 10:34:37,720 We already have a table, an album table to the artist table, 13643 10:34:37,720 --> 10:34:39,720 and the foreign key. 13644 10:34:39,720 --> 10:34:43,720 And we want to, in effect, pull data from both the album and the artist, 13645 10:34:43,720 --> 10:34:45,720 the album title and the artist name. 13646 10:34:45,720 --> 10:34:47,720 And we want to show that. 13647 10:34:47,720 --> 10:34:51,720 And so we're going to say select, which is the same select statement. 13648 10:34:51,720 --> 10:34:52,720 Here's a little different syntax. 13649 10:34:52,720 --> 10:34:54,720 This is the list of fields. 13650 10:34:54,720 --> 10:34:56,720 This is table.field. 13651 10:34:56,720 --> 10:35:02,720 So it's the album title and the artist.name, comma there, from the album. 13652 10:35:02,720 --> 10:35:05,720 And I always start with where the little arrow starts from, 13653 10:35:05,720 --> 10:35:07,720 album joined with. 13654 10:35:07,720 --> 10:35:11,720 So that is going to walk down this connection from album to artist. 13655 10:35:11,720 --> 10:35:14,720 Album joined with artist. 13656 10:35:14,720 --> 10:35:16,720 Don't say with, I just say it. 13657 10:35:16,720 --> 10:35:20,720 On, and then this is the conditions upon which that join is going to happen. 13658 10:35:20,720 --> 10:35:24,720 When the album's artist ID, which is this column here, 13659 10:35:24,720 --> 10:35:31,720 album's artist ID matches, think of that as is equal to or matches the artist's ID. 
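The album-to-artist join with its on clause can be sketched as follows. The schema is a simplified assumption (a plain INTEGER PRIMARY KEY for brevity), and the ORDER BY is added only so the output order is predictable.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.executescript("""
CREATE TABLE Artist (id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE Album  (id INTEGER PRIMARY KEY, artist_id INTEGER, title TEXT);
INSERT INTO Artist (name) VALUES ('Led Zeppelin');
INSERT INTO Artist (name) VALUES ('AC/DC');
INSERT INTO Album (title, artist_id) VALUES ('Who Made Who', 2);
INSERT INTO Album (title, artist_id) VALUES ('IV', 1);
""")

# Start from the table holding the foreign key (Album) and follow it to
# Artist; ON keeps only the row pairs where the two IDs match.
joined = cur.execute("""
SELECT Album.title, Artist.name
FROM Album JOIN Artist
ON Album.artist_id = Artist.id
ORDER BY Album.id
""").fetchall()
print(joined)
conn.close()
```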
13660 10:35:31,720 --> 10:35:36,720 And so it only connects the rows here when there is a match between these two tables. 13661 10:35:36,720 --> 10:35:43,720 And so if we look at this and we see that this one matches this one and this one matches that one. 13662 10:35:43,720 --> 10:35:50,720 And so the join connects conditionally and it connects when the on clause is satisfied. 13663 10:35:50,720 --> 10:35:54,720 And so when this whole join runs, this is what we get. 13664 10:35:54,720 --> 10:35:56,720 So you select all this stuff. 13665 10:35:56,720 --> 10:35:57,720 Now this is an abstraction. 13666 10:35:57,720 --> 10:35:58,720 Are you writing a loop? 13667 10:35:58,720 --> 10:36:00,720 Are you doing two nested loops? 13668 10:36:00,720 --> 10:36:02,720 How are you exactly bringing all this data together? 13669 10:36:02,720 --> 10:36:05,720 We don't care about that because that's the beauty of SQL. 13670 10:36:05,720 --> 10:36:09,720 That's the beauty of how we do this in a database. 13671 10:36:09,720 --> 10:36:13,720 So now if we can just run this command, so let's grab this command. 13672 10:36:13,720 --> 10:36:19,720 Select track title, genre name, from track, join genre, that exact query. 13673 10:36:19,720 --> 10:36:21,720 Case of keywords doesn't matter. 13674 10:36:21,720 --> 10:36:24,720 And we go over here and we run this as SQL. 13675 10:36:24,720 --> 10:36:26,720 And we run it. 13676 10:36:26,720 --> 10:36:34,720 We get, oops, I got too far. 13677 10:36:34,720 --> 10:36:35,720 Let's do this one. 13678 10:36:35,720 --> 10:36:37,720 So let's do that one there. 13679 10:36:37,720 --> 10:36:40,720 Select artist name. 13680 10:36:40,720 --> 10:36:42,720 I have to add that one to my little cheat sheet. 13681 10:36:42,720 --> 10:36:46,720 The next time you see the cheat sheet, it'll be right. 13682 10:36:46,720 --> 10:36:53,720 So the title, so this is coming from one table and that's coming from another table. 
13683 10:36:53,720 --> 10:36:55,720 And so that's one. 13684 10:36:55,720 --> 10:37:00,720 So here is something we can do that gives us a little more detail on that. 13685 10:37:00,720 --> 10:37:04,720 We can say, so this is where the connection is. 13686 10:37:04,720 --> 10:37:09,720 So you can think of the join as sort of spreading one table and connecting it to the other table. 13687 10:37:09,720 --> 10:37:13,720 And so what we're going to show here is it's exactly the same. 13688 10:37:13,720 --> 10:37:18,720 The thing we're going to do is we're going to add these two columns so you can see where the match happens. 13689 10:37:18,720 --> 10:37:20,720 And so this is one table. 13690 10:37:20,720 --> 10:37:22,720 This is another table. 13691 10:37:22,720 --> 10:37:26,720 And these are the kind of columns in common, even though they're not. 13692 10:37:26,720 --> 10:37:28,720 They're the columns that match. 13693 10:37:28,720 --> 10:37:30,720 This is where the on clause is happening, right? 13694 10:37:30,720 --> 10:37:44,720 We have taken this table joined with this table on these two things connecting with each other. 13695 10:37:44,720 --> 10:37:48,720 So you can almost, in some language, some variants of SQL, this would even be a where clause. 13696 10:37:48,720 --> 10:37:54,720 So you connect these two rows, but only connect them when those two numbers match. 13697 10:37:54,720 --> 10:38:04,720 So you can see, I mean, if we run this, I'll just run this. 13698 10:38:04,720 --> 10:38:08,720 And again, you just see this is where it connects. 13699 10:38:08,720 --> 10:38:18,720 Now, interestingly, we can see what happens and what the purpose of the on clause is if we omit it. 13700 10:38:18,720 --> 10:38:23,720 So this is exactly the same as that previous query, except there's no on clause. 13701 10:38:23,720 --> 10:38:27,720 So it's select all four of those fields from the track joined with the genre. 
13702 10:38:27,720 --> 10:38:33,720 So it's basically taking the track table and the genre with a join, but no on clause. 13703 10:38:33,720 --> 10:38:36,720 So it's not filtering for matches. 13704 10:38:36,720 --> 10:38:37,720 This is a match. 13705 10:38:37,720 --> 10:38:38,720 This is a match. 13706 10:38:38,720 --> 10:38:39,720 That's a match. 13707 10:38:39,720 --> 10:38:40,720 That's a match. 13708 10:38:40,720 --> 10:38:44,720 But we don't have an on clause, so the matchingness doesn't matter. 13709 10:38:44,720 --> 10:38:47,720 And so you're going to get all possible combinations. 13710 10:38:47,720 --> 10:38:56,720 And literally, if there were 10 on one side and 30 on the other side, you would get 300 rows in that join. 13711 10:38:56,720 --> 10:39:00,720 So it'd be all combinations, except the on clause reduces the combinations. 13712 10:39:00,720 --> 10:39:02,720 And you might think, whoa, this is really inefficient. 13713 10:39:02,720 --> 10:39:08,720 And I will say that's what my first reaction was when I first saw this, but it's not inefficient. 13714 10:39:08,720 --> 10:39:09,720 That's the beauty of abstraction. 13715 10:39:09,720 --> 10:39:10,720 That's the beauty of SQL. 13716 10:39:10,720 --> 10:39:14,720 You say, do it, and it just figures that out. 13717 10:39:14,720 --> 10:39:20,720 So let me grab this, and you will see that we can run this one as well. 13718 10:39:20,720 --> 10:39:26,720 And that kind of gives you why the on clause is important, because now we have a whole bunch of these things. 13719 10:39:26,720 --> 10:39:29,720 And the on clause just filters that out. 13720 10:39:29,720 --> 10:39:35,720 So if we would just add the on clause back in, then that would only show the ones we showed on the previous slide. 13721 10:39:35,720 --> 10:39:37,720 So that's why the on clause is important. 13722 10:39:37,720 --> 10:39:42,720 The join is like all possible combinations of all pairs of rows between these two tables. 
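The effect of leaving out the on clause can be checked by counting rows. In this sketch the track titles and the simplified schema are assumptions for illustration; with four tracks and two genres, the join without ON should return every pairing, and the join with ON only the matching ones.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.executescript("""
CREATE TABLE Genre (id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE Track (id INTEGER PRIMARY KEY, title TEXT, genre_id INTEGER);
INSERT INTO Genre (name) VALUES ('Rock');
INSERT INTO Genre (name) VALUES ('Metal');
INSERT INTO Track (title, genre_id) VALUES ('Black Dog', 1);
INSERT INTO Track (title, genre_id) VALUES ('Stairway', 1);
INSERT INTO Track (title, genre_id) VALUES ('About to Rock', 2);
INSERT INTO Track (title, genre_id) VALUES ('Who Made Who', 2);
""")

# No ON clause: every track row is paired with every genre row.
no_on = cur.execute("""
SELECT Track.title, Track.genre_id, Genre.id, Genre.name
FROM Track JOIN Genre
""").fetchall()

# With ON: only the pairs where the foreign key matches the primary key.
with_on = cur.execute("""
SELECT Track.title, Track.genre_id, Genre.id, Genre.name
FROM Track JOIN Genre
ON Track.genre_id = Genre.id
""").fetchall()

print(len(no_on), len(with_on))
conn.close()
```

Every row that survives the ON filter has the same number in its second and third columns, which is exactly the match the on clause asks for.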
13723 10:39:42,720 --> 10:39:45,720 On is, oh, but only where these two things match. 13724 10:39:45,720 --> 10:39:57,720 And you might think that it's inefficient, but the on clause turns out to be the way it becomes efficient. 13725 10:39:57,720 --> 10:40:12,720 So now we're going to do the same thing where we're just going to take the track title and the genre. 13726 10:40:12,720 --> 10:40:14,720 We're going to connect that together. 13727 10:40:14,720 --> 10:40:15,720 So we select this. 13728 10:40:15,720 --> 10:40:20,720 We need to join from one table, join to the genre table with an on clause. 13729 10:40:20,720 --> 10:40:22,720 And so we're going to make those connections. 13730 10:40:22,720 --> 10:40:32,720 And the only thing we're going to look at is the title and the genre name. 13731 10:40:32,720 --> 10:40:34,720 Oh, oops. 13732 10:40:34,720 --> 10:40:37,720 And then run that. 13733 10:40:37,720 --> 10:40:39,720 And so we got the title and genre name. 13734 10:40:39,720 --> 10:40:47,720 Now the thing you'll notice is for the first time, we now have replication of string data in a vertical dimension. 13735 10:40:47,720 --> 10:40:51,720 That's okay, because the data is not replicated in the database. 13736 10:40:51,720 --> 10:40:54,720 The data is now replicated as a result of the join. 13737 10:40:54,720 --> 10:41:00,720 And so we are going to reconstruct what the user wants to see, which the user originally all the way back to the beginning 13738 10:41:00,720 --> 10:41:04,720 wanted to see the duplicate information in the vertical axis. 13739 10:41:04,720 --> 10:41:06,720 But now we're reconstructing it. 13740 10:41:06,720 --> 10:41:11,720 We didn't waste the space or performance in our database, but we still have to show them. 13741 10:41:11,720 --> 10:41:16,720 And so now the next thing we're going to do is a monster. 13742 10:41:16,720 --> 10:41:19,720 We are going to reconstruct across all four tables. 
13743 10:41:19,720 --> 10:41:21,720 And you might think this is really hard. 13744 10:41:21,720 --> 10:41:26,720 And sure, it's going to be a little tricky, but as long as you follow the naming convention 13745 10:41:26,720 --> 10:41:31,720 and the naming convention makes sense, we're going to do a select from the track's title, the artist's name, 13746 10:41:31,720 --> 10:41:33,720 the album's title, and the genre name. 13747 10:41:33,720 --> 10:41:38,720 From the track, join genre, join the album, join artists. 13748 10:41:38,720 --> 10:41:41,720 And so the joins follow the little arrows, right? 13749 10:41:41,720 --> 10:41:46,720 And then the on clause qualifies each of those arrows when to follow the arrow. 13750 10:41:46,720 --> 10:41:49,720 And then this becomes pretty easy. 13751 10:41:49,720 --> 10:41:50,720 It's a foreign key. 13752 10:41:50,720 --> 10:41:55,720 The track's genre ID, that's a foreign key, equals genre.id. 13753 10:41:55,720 --> 10:42:00,720 The primary, that's primary key, that's a foreign key because I name it that way. 13754 10:42:00,720 --> 10:42:04,720 And I know that this goes to that genre table because I name it that way. 13755 10:42:04,720 --> 10:42:10,720 And track's album ID is equal to the album's ID, foreign key, primary key. 13756 10:42:10,720 --> 10:42:14,720 And album's artist ID is equal to artist's ID. 13757 10:42:14,720 --> 10:42:18,720 After a while, you can type these pretty fast as long as you follow a naming convention 13758 10:42:18,720 --> 10:42:20,720 and you know the naming convention. 13759 10:42:20,720 --> 10:42:25,720 So this looks like it's really hard to do, but after a while, it's really just a pattern. 13760 10:42:25,720 --> 10:42:30,720 So let's go ahead and run that one. 13761 10:42:30,720 --> 10:42:36,720 And it will, assuming we've done everything right, replicate all the data. 13762 10:42:36,720 --> 10:42:39,720 So there's all kinds of vertical data now being replicated. 
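The four-table join can be sketched end to end. The data values and the simplified schema are assumptions for illustration; the shape of the query (one select, three joins, one on clause with a foreign key = primary key comparison per arrow) follows the description above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.executescript("""
CREATE TABLE Artist (id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE Genre  (id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE Album  (id INTEGER PRIMARY KEY, artist_id INTEGER, title TEXT);
CREATE TABLE Track  (id INTEGER PRIMARY KEY, title TEXT,
                     album_id INTEGER, genre_id INTEGER);
INSERT INTO Artist (name) VALUES ('Led Zeppelin');
INSERT INTO Artist (name) VALUES ('AC/DC');
INSERT INTO Genre (name) VALUES ('Rock');
INSERT INTO Genre (name) VALUES ('Metal');
INSERT INTO Album (title, artist_id) VALUES ('Who Made Who', 2);
INSERT INTO Album (title, artist_id) VALUES ('IV', 1);
INSERT INTO Track (title, album_id, genre_id) VALUES ('Black Dog', 2, 1);
INSERT INTO Track (title, album_id, genre_id) VALUES ('Stairway', 2, 1);
INSERT INTO Track (title, album_id, genre_id) VALUES ('About to Rock', 1, 2);
INSERT INTO Track (title, album_id, genre_id) VALUES ('Who Made Who', 1, 2);
""")

# Each condition in ON follows one arrow: foreign key = primary key.
full = cur.execute("""
SELECT Track.title, Artist.name, Album.title, Genre.name
FROM Track JOIN Genre JOIN Album JOIN Artist
ON Track.genre_id = Genre.id
   AND Track.album_id = Album.id
   AND Album.artist_id = Artist.id
ORDER BY Track.id
""").fetchall()

for row in full:
    print(row)
conn.close()
```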
13763 10:42:39,720 --> 10:42:41,720 Every column has vertical data. 13764 10:42:41,720 --> 10:42:46,720 Again, it's not in the database, the select and the join are reconstructing vertical data 13765 10:42:46,720 --> 10:42:50,720 as it needs to be shown to the user. 13766 10:42:50,720 --> 10:42:55,720 And so, if you've been following along, probably a couple hours later now, 13767 10:42:55,720 --> 10:42:59,720 we started with a picture that was our mock-up of what we wanted our user interface to look like. 13768 10:42:59,720 --> 10:43:04,720 And it had vertical stuff, and we're like, ah, we can't put that in a database model. 13769 10:43:04,720 --> 10:43:07,720 And then we carefully built a database model that didn't have the data, 13770 10:43:07,720 --> 10:43:09,720 and then we're like, ah, we've got to reconstruct it. 13771 10:43:09,720 --> 10:43:11,720 So we use join to reconstruct it. 13772 10:43:11,720 --> 10:43:16,720 And so, after all that, we went here with a clean little model with four tables 13773 10:43:16,720 --> 10:43:20,720 all beautifully connected together, and then we had to join it all back together. 13774 10:43:20,720 --> 10:43:22,720 So join reconstructs it. 13775 10:43:22,720 --> 10:43:27,720 And again, the key is the storage is efficient, the scanning is efficient, 13776 10:43:27,720 --> 10:43:31,720 and we still use the join to produce the output that we ultimately want 13777 10:43:31,720 --> 10:43:38,720 with all the vertical replication that our users really want to see. 13778 10:43:38,720 --> 10:43:44,720 So one more kind of relationship, that was called a one-to-many relationship. 13779 10:43:44,720 --> 10:43:46,720 That was actually three one-to-many relationships. 13780 10:43:46,720 --> 10:43:56,720 And the other major relationship is what's called a many-to-many relationship. 13781 10:43:56,720 --> 10:44:00,720 We're going to do some code walkthroughs, actually running some code. 
13782 10:44:00,720 --> 10:44:03,720 And if you want to follow along with the code, 13783 10:44:03,720 --> 10:44:07,720 the sample code is here in the materials of my Python for Everybody website. 13784 10:44:07,720 --> 10:44:09,720 So you can take a look at that. 13785 10:44:09,720 --> 10:44:13,720 So the code we're going to look at is from the database chapter. 13786 10:44:13,720 --> 10:44:17,720 And we're going to look at tracks.py. 13787 10:44:17,720 --> 10:44:21,720 So a lot of the lectures that I give in this database chapter are just about SQL. 13788 10:44:21,720 --> 10:44:24,720 And this is really about SQL and Python. 13789 10:44:24,720 --> 10:44:27,720 So I'll go through this in some detail. 13790 10:44:27,720 --> 10:44:30,720 So the code that I'm going through is in tracks. 13791 10:44:30,720 --> 10:44:34,720 There's also tracks.zip that you can grab that has these two things. 13792 10:44:34,720 --> 10:44:42,720 It's got this library.xml file, which you can export from your, if you have iTunes, 13793 10:44:42,720 --> 10:44:45,720 you can export this, or you can just play with my iTunes. 13794 10:44:45,720 --> 10:44:48,720 And so this is also going to review how to read XML. 13795 10:44:48,720 --> 10:44:51,720 So we're going to actually pull all this data. 13796 10:44:51,720 --> 10:44:58,720 And this XML that Apple produces out of iTunes is a little weird 13797 10:44:58,720 --> 10:45:00,720 in that it's kind of key values. 13798 10:45:00,720 --> 10:45:02,720 And so you see key value pairs. 13799 10:45:02,720 --> 10:45:04,720 And it even uses the word dictionary. 13800 10:45:04,720 --> 10:45:07,720 And so it's like, I'm going to make a dictionary that has this, 13801 10:45:07,720 --> 10:45:09,720 then a dictionary within a dictionary. 13802 10:45:09,720 --> 10:45:12,720 This, to me, would be so nice if it was JSON, 13803 10:45:12,720 --> 10:45:16,720 because it's really a list of dictionaries. 
13804 10:45:16,720 --> 10:45:20,720 This is a dictionary, then another dictionary, then another dictionary, 13805 10:45:20,720 --> 10:45:22,720 and then the key for that dictionary. 13806 10:45:22,720 --> 10:45:27,720 And it's a weird, weird format. 13807 10:45:27,720 --> 10:45:29,720 But we'll write some Python to be able to read it. 13808 10:45:29,720 --> 10:45:34,720 And so you export that from iTunes. 13809 10:45:34,720 --> 10:45:38,720 And you can use my file, or you can use your file. 13810 10:45:38,720 --> 10:45:41,720 It might be more fun to use your file. 13811 10:45:41,720 --> 10:45:43,720 So here's tracks.py. 13812 10:45:43,720 --> 10:45:45,720 We're going to do some XML. 13813 10:45:45,720 --> 10:45:47,720 And so we import that. 13814 10:45:47,720 --> 10:45:51,720 We're going to import SQLite 3 because we want to talk to the database. 13815 10:45:51,720 --> 10:45:53,720 And then we're going to make a database connection. 13816 10:45:53,720 --> 10:45:58,720 And in this, once we run this, you'll see that that file will exist. 13817 10:45:58,720 --> 10:46:03,720 And so right now, if I'm in my tracks data, that file doesn't exist. 13818 10:46:03,720 --> 10:46:07,720 But what we'll see is this is going to actually create it. 13819 10:46:07,720 --> 10:46:11,720 Now remember that we have a cursor, which is sort of our, like a file handle. 13820 10:46:11,720 --> 10:46:13,720 It's really a database handle, as it were. 13821 10:46:13,720 --> 10:46:17,720 And in order to sort of bootstrap this nicely, 13822 10:46:17,720 --> 10:46:21,720 we are going, because this code is going to run all the time, 13823 10:46:21,720 --> 10:46:24,720 it's going to run and read all of library.xml. 13824 10:46:24,720 --> 10:46:28,720 And later things, we won't wipe out the database every time. 13825 10:46:28,720 --> 10:46:33,720 And so I'm executing a script, which is a series of SQL commands 13826 10:46:33,720 --> 10:46:35,720 separated by semicolons. 
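Running a series of SQL commands separated by semicolons is what the cursor's executescript method does. The sketch below is an assumption about what such a bootstrap script might look like, not the actual contents of tracks.py; it uses an in-memory database rather than a file and runs the script twice to show that dropping the tables first makes it safe to rerun.

```python
import sqlite3

# Assumption: in-memory database; a file name here would create that file.
conn = sqlite3.connect(":memory:")
cur = conn.cursor()

SCHEMA = """
DROP TABLE IF EXISTS Artist;
DROP TABLE IF EXISTS Album;
DROP TABLE IF EXISTS Track;

CREATE TABLE Artist (
    id   INTEGER NOT NULL PRIMARY KEY AUTOINCREMENT UNIQUE,
    name TEXT UNIQUE
);
CREATE TABLE Album (
    id        INTEGER NOT NULL PRIMARY KEY AUTOINCREMENT UNIQUE,
    artist_id INTEGER,
    title     TEXT UNIQUE
);
CREATE TABLE Track (
    id       INTEGER NOT NULL PRIMARY KEY AUTOINCREMENT UNIQUE,
    title    TEXT UNIQUE,
    album_id INTEGER
);
"""

# Running it twice works because each run starts by dropping the tables.
cur.executescript(SCHEMA)
cur.executescript(SCHEMA)

tables = [row[0] for row in cur.execute(
    "SELECT name FROM sqlite_master "
    "WHERE type = 'table' AND name NOT LIKE 'sqlite_%' ORDER BY name"
)]
print(tables)
conn.close()
```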
13827 10:46:35,720 --> 10:46:38,720 So I'm going to throw away the artist table, album table, and track table. 13828 10:46:38,720 --> 10:46:41,720 Very similar to the stuff we covered in lecture. 13829 10:46:41,720 --> 10:46:43,720 And then I'm going to do the create table. 13830 10:46:43,720 --> 10:46:45,720 And I'm doing this all automatically. 13831 10:46:45,720 --> 10:46:47,720 And you'll notice this is a triple-quoted string. 13832 10:46:47,720 --> 10:46:50,720 So this is just one big, long string here. 13833 10:46:50,720 --> 10:46:53,720 And it happens to know that it's SQL. 13834 10:46:53,720 --> 10:46:55,720 Thank you, Atom, for that. 13835 10:46:55,720 --> 10:46:57,720 And so it creates all these things. 13836 10:46:57,720 --> 10:47:00,720 Now it's not quite as rich as the data model we built, 13837 10:47:00,720 --> 10:47:02,720 because there are no genres in here. 13838 10:47:02,720 --> 10:47:05,720 And so it's artist, album, track. 13839 10:47:05,720 --> 10:47:08,720 And then there's a foreign key for album ID and a foreign key for artist ID, 13840 10:47:08,720 --> 10:47:15,720 which is sort of a subset of what we're doing. 13841 10:47:15,720 --> 10:47:20,720 And so when that's done, that actually creates all the tables. 13842 10:47:20,720 --> 10:47:23,720 And we'll see those in a moment once we run the code. 13843 10:47:23,720 --> 10:47:26,720 Then it asks for a file name for the XML. 13844 10:47:26,720 --> 10:47:29,720 And so that's what that is. 13845 10:47:29,720 --> 10:47:34,720 And I wrote a function that does a lookup. 13846 10:47:34,720 --> 10:47:38,720 It's really weird, because if you look at these files, 13847 10:47:38,720 --> 10:47:42,720 like in this dictionary, there is a key. 13848 10:47:42,720 --> 10:47:45,720 And so the key of this dictionary, 13849 10:47:45,720 --> 10:47:47,720 this really should have been a key value pair.
13850 10:47:47,720 --> 10:47:52,720 But so there's this weird thing where the key for an object 13851 10:47:52,720 --> 10:47:54,720 is inside of the object. 13852 10:47:54,720 --> 10:48:00,720 And so we're going to loop through all the children 13853 10:48:00,720 --> 10:48:05,720 in this outer dictionary and find a child tag 13854 10:48:05,720 --> 10:48:06,720 that has a particular key. 13855 10:48:06,720 --> 10:48:08,720 And so you'll see how this works. 13856 10:48:08,720 --> 10:48:12,720 And this was something I was going to use over and over again. 13857 10:48:12,720 --> 10:48:14,720 And so the first thing we're going to do 13858 10:48:14,720 --> 10:48:17,720 is we're going to just parse the string, and this is the string. 13859 10:48:17,720 --> 10:48:21,720 And then this, of course, is an XML ET object. 13860 10:48:21,720 --> 10:48:24,720 And then we're going to say, we're going to do a find all. 13861 10:48:24,720 --> 10:48:26,720 And so this shows how the find all, 13862 10:48:26,720 --> 10:48:28,720 we're going to go the third level dictionaries. 13863 10:48:28,720 --> 10:48:32,720 We want to see all of the tracks. 13864 10:48:32,720 --> 10:48:35,720 And so we have a dictionary, and a dictionary, and a dictionary. 13865 10:48:35,720 --> 10:48:40,720 And so what we want is all of these guys. 13866 10:48:40,720 --> 10:48:43,720 All those guys right there. 13867 10:48:43,720 --> 10:48:45,720 Track ID. 13868 10:48:45,720 --> 10:48:47,720 So we're going to get a list of all those. 13869 10:48:47,720 --> 10:48:50,720 That'll be the first one. 13870 10:48:50,720 --> 10:48:51,720 This will be the second one. 13871 10:48:51,720 --> 10:48:59,720 Because the find all says, go to the, find the dictionary key, 13872 10:48:59,720 --> 10:49:02,720 then a dictionary tag within that, and a dictionary tag. 13873 10:49:02,720 --> 10:49:05,720 And then we'll tell how many things we got. 
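The lookup idea and the three-level find all can be sketched as follows. This is an illustrative sketch under stated assumptions, not the course file: the XML is a tiny hand-written sample in the iTunes key/value style (the second track is invented), and it is parsed from a string with ET.fromstring so the example is self-contained, whereas the walkthrough reads the XML from a file name.

```python
import xml.etree.ElementTree as ET

# A tiny hand-written sample in the iTunes export style (assumed for
# illustration): each <key> is followed by a sibling holding its value.
SAMPLE = """<plist version="1.0">
<dict>
  <key>Tracks</key>
  <dict>
    <key>369</key>
    <dict>
      <key>Track ID</key><integer>369</integer>
      <key>Name</key><string>Another One Bites The Dust</string>
      <key>Artist</key><string>Queen</string>
    </dict>
    <key>370</key>
    <dict>
      <key>Track ID</key><integer>370</integer>
      <key>Name</key><string>Example Song</string>
      <key>Artist</key><string>Example Artist</string>
    </dict>
  </dict>
</dict>
</plist>"""


def lookup(d, key):
    """Return the text of the element that follows <key>key</key> in d."""
    found = False
    for child in d:
        if found:
            return child.text  # the sibling right after the matching key
        if child.tag == 'key' and child.text == key:
            found = True
    return None  # key not present in this dictionary


root = ET.fromstring(SAMPLE)

# dict/dict/dict: the outer dict, the Tracks dict inside it, then one
# dict per track at the third level.
entries = root.findall('dict/dict/dict')
print('Dict count:', len(entries))

for entry in entries:
    if lookup(entry, 'Track ID') is None:
        continue  # skip anything that is not a track
    print(lookup(entry, 'Name'), '|', lookup(entry, 'Artist'))
```

Every value comes back as a string (or None when the key is missing), so a missing field such as an album can be detected and the entry skipped.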
13874 10:49:05,720 --> 10:49:06,720 And then we're going to loop through, 13875 10:49:06,720 --> 10:49:12,720 and entry is going to iterate through each of these. 13876 10:49:12,720 --> 10:49:15,720 And see, we'll get our name, and our artist. 13877 10:49:15,720 --> 10:49:18,720 Another One Bites the Dust, Queen, and away we go. 13878 10:49:18,720 --> 10:49:21,720 And then the next time through the loop, we'll hit this one. 13879 10:49:21,720 --> 10:49:22,720 Okay? 13880 10:49:22,720 --> 10:49:27,720 So then what we're going to do is we're going to go through all those entries, 13881 10:49:27,720 --> 10:49:31,720 and if there is no track ID, and that's this track ID field, 13882 10:49:31,720 --> 10:49:32,720 where are you hiding? 13883 10:49:32,720 --> 10:49:34,720 Track ID. 13884 10:49:34,720 --> 10:49:37,720 If we don't have that, we're going to continue. 13885 10:49:37,720 --> 10:49:39,720 And then we're going to look up the name, artist, album, 13886 10:49:39,720 --> 10:49:42,720 play count, rating, and total time. 13887 10:49:42,720 --> 10:49:45,720 Okay? 13888 10:49:45,720 --> 10:49:49,720 And so here they are, play count. 13889 10:49:49,720 --> 10:49:54,720 A lot of those things that we had in the sample lecture that I did. 13890 10:49:54,720 --> 10:49:56,720 And we're going to look those things up. 13891 10:49:56,720 --> 10:49:58,720 And we're going to do some sanity checking. 13892 10:49:58,720 --> 10:50:01,720 If we didn't get a name or an artist or an album, we're going to continue. 13893 10:50:01,720 --> 10:50:03,720 We're going to print them out. 13894 10:50:03,720 --> 10:50:09,720 And then we are going to ask for, get, 13895 10:50:09,720 --> 10:50:13,720 remember how you have to get the primary key of a row so you can use it. 13896 10:50:13,720 --> 10:50:18,720 So the way we're going to do this is we're going to do an insert or ignore.
13897 10:50:18,720 --> 10:50:23,720 And so this or ignore basically says, because I said that the artist's name, 13898 10:50:23,720 --> 10:50:27,720 go up here, I said the artist's name is unique. 13899 10:50:27,720 --> 10:50:30,720 Which means if I try to attempt to insert the same artist twice, 13900 10:50:30,720 --> 10:50:32,720 it will blow up. 13901 10:50:32,720 --> 10:50:35,720 Okay, because I put this constraint on that. 13902 10:50:35,720 --> 10:50:40,720 Except when I say insert or ignore, that basically says, hey, 13903 10:50:40,720 --> 10:50:43,720 if it's already there, don't insert it again. 13904 10:50:43,720 --> 10:50:47,720 So what I'm doing here is insert or ignore into artist. 13905 10:50:47,720 --> 10:50:49,720 So this is putting a new row into the artist table, 13906 10:50:49,720 --> 10:50:54,720 unless there's already a row in that artist table. 13907 10:50:54,720 --> 10:50:57,720 And the syntax right here, you know, 13908 10:50:57,720 --> 10:51:01,720 the question mark is sort of where this artist variable goes. 13909 10:51:01,720 --> 10:51:03,720 And this is a tuple. 13910 10:51:03,720 --> 10:51:06,720 But I have to sort of put this comma in to force it to be a tuple. 13911 10:51:06,720 --> 10:51:09,720 So this is the way you have a one tuple. 13912 10:51:09,720 --> 10:51:12,720 And then what I need to know is I need to know the primary key 13913 10:51:12,720 --> 10:51:14,720 of this particular artist row. 13914 10:51:14,720 --> 10:51:18,720 Now this line may or may not have actually done the insert. 13915 10:51:18,720 --> 10:51:23,720 And so I need to know what the ID for that particular artist is. 13916 10:51:23,720 --> 10:51:25,720 So I do a select ID from artist where name equals. 13917 10:51:25,720 --> 10:51:30,720 Now it either was already there or I'm getting it fresh and brand new. 
13918 10:51:30,720 --> 10:51:33,720 So I do an artist ID equals I fetch one row 13919 10:51:33,720 --> 10:51:36,720 and it's going to be the first thing given that I only selected ID. 13920 10:51:36,720 --> 10:51:40,720 And so this artist ID is going to be the ID. 13921 10:51:40,720 --> 10:51:47,720 Now I have the foreign key for the album title, right? 13922 10:51:47,720 --> 10:51:51,720 And so now I'm going to insert into the title artist ID. 13923 10:51:51,720 --> 10:51:53,720 This is the foreign key to the artist table. 13924 10:51:53,720 --> 10:51:57,720 And I got this value that I just moments ago retrieved. 13925 10:51:57,720 --> 10:51:59,720 And I got the album title. 13926 10:51:59,720 --> 10:52:02,720 But this also is insert or ignore. 13927 10:52:02,720 --> 10:52:07,720 Because now if you look, I have unique on the album title. 13928 10:52:07,720 --> 10:52:09,720 Yep, unique's on the album title. 13929 10:52:09,720 --> 10:52:12,720 So that'll do nothing. 13930 10:52:12,720 --> 10:52:13,720 It doesn't blow up. 13931 10:52:13,720 --> 10:52:15,720 Or ignore says don't blow up. 13932 10:52:15,720 --> 10:52:17,720 Just do nothing. 13933 10:52:17,720 --> 10:52:19,720 Because this next line is going to select it. 13934 10:52:19,720 --> 10:52:23,720 And I grab the album's foreign key for either the existing row 13935 10:52:23,720 --> 10:52:25,720 or the new row. 13936 10:52:25,720 --> 10:52:28,720 And then I'm going to insert or replace. 13937 10:52:28,720 --> 10:52:32,720 So what this basically says is if the unique constraint would be violated, 13938 10:52:32,720 --> 10:52:35,720 this turns into an update. 13939 10:52:35,720 --> 10:52:38,720 Now not all SQLs have this but SQLite has this 13940 10:52:38,720 --> 10:52:42,720 that basically says insert or replace. 13941 10:52:42,720 --> 10:52:44,720 Some SQLs are totally standard. 
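The insert or ignore, then select, then fetch one pattern can be sketched like this. It is a minimal sketch assuming an in-memory database and an Artist table with a unique name column; the helper name artist_id_for is hypothetical and only there to make the three steps reusable.

```python
import sqlite3

conn = sqlite3.connect(':memory:')
cur = conn.cursor()
cur.execute('''
CREATE TABLE Artist (
    id   INTEGER NOT NULL PRIMARY KEY AUTOINCREMENT UNIQUE,
    name TEXT UNIQUE
)''')


def artist_id_for(cur, name):
    """Hypothetical helper: return the primary key for an artist name,
    inserting a new row only if the name is not already there."""
    # OR IGNORE: if the UNIQUE constraint on name would be violated,
    # do nothing instead of raising an error.
    cur.execute('INSERT OR IGNORE INTO Artist (name) VALUES (?)', (name,))
    # The trailing comma in (name,) makes a one-element tuple.
    cur.execute('SELECT id FROM Artist WHERE name = ?', (name,))
    # Only one column was selected, so the id is the first item of the row.
    return cur.fetchone()[0]


first = artist_id_for(cur, 'Queen')
again = artist_id_for(cur, 'Queen')   # second call finds the existing row
other = artist_id_for(cur, 'AC/DC')

artist_count = cur.execute('SELECT COUNT(*) FROM Artist').fetchone()[0]
print(first, again, other, artist_count)
```

Whether the row was new or already present, the caller ends up with a valid primary key to store as a foreign key in the next table.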
13942 10:52:44,720 --> 10:52:47,720 Some things we do, like this select statement, 13943 10:52:47,720 --> 10:52:50,720 are a totally standard part of SQL. 13944 10:52:50,720 --> 10:52:53,720 The insert is totally standard but insert or replace 13945 10:52:53,720 --> 10:52:56,720 and insert or ignore are not totally standard. 13946 10:52:56,720 --> 10:52:57,720 But that's okay. 13947 10:52:57,720 --> 10:52:59,720 It works for SQLite which is what we're doing. 13948 10:52:59,720 --> 10:53:03,720 And so we have the title, album ID, length, rating, and count. 13949 10:53:03,720 --> 10:53:05,720 And then we have a tuple that does all that stuff. 13950 10:53:05,720 --> 10:53:10,720 And of course the title is unique. 13951 10:53:10,720 --> 10:53:12,720 The title is unique in the track table as well. 13952 10:53:12,720 --> 10:53:14,720 And so we've inserted that. 13953 10:53:14,720 --> 10:53:19,720 So the clever bit here is dealing with new or existing names 13954 10:53:19,720 --> 10:53:21,720 in these three lines. 13955 10:53:21,720 --> 10:53:24,720 And we see that pattern twice here where we're doing that. 13956 10:53:24,720 --> 10:53:28,720 Okay, so there's not much left to do except run this code. 13957 10:53:28,720 --> 10:53:30,720 Hopefully it runs. 13958 10:53:30,720 --> 10:53:36,720 python3 tracks.py 13959 10:53:36,720 --> 10:53:39,720 and library.xml. 13960 10:53:39,720 --> 10:53:41,720 Whoosh! 13961 10:53:41,720 --> 10:53:45,720 Okay, so that is my... 13962 10:53:45,720 --> 10:53:51,720 So we found 404 of those dictionaries, third-level dictionaries. 13963 10:53:51,720 --> 10:53:53,720 And now it's starting to insert them. 13964 10:53:53,720 --> 10:53:55,720 Insert them, insert them, insert them. 13965 10:53:55,720 --> 10:53:58,720 And we can take a look at... 13966 10:53:58,720 --> 10:54:01,720 So we can do an ls -l or dir on Windows. 13967 10:54:01,720 --> 10:54:04,720 We'll see that we made a track database.
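The insert or replace step for the track table, described a moment ago, can be sketched on its own. This is a minimal sketch assuming an in-memory database and a Track table whose title is unique; the song title and numbers are invented sample values. One detail worth knowing: SQLite carries out the replace by deleting the conflicting row and inserting the new one, so it behaves like an update of the data, but the row is technically a fresh one.

```python
import sqlite3

conn = sqlite3.connect(':memory:')
cur = conn.cursor()
cur.execute('''
CREATE TABLE Track (
    id       INTEGER NOT NULL PRIMARY KEY AUTOINCREMENT UNIQUE,
    title    TEXT UNIQUE,
    album_id INTEGER,
    len      INTEGER,
    rating   INTEGER,
    count    INTEGER
)''')

SQL = '''INSERT OR REPLACE INTO Track
    (title, album_id, len, rating, count)
    VALUES (?, ?, ?, ?, ?)'''

# First time: a plain insert (sample values are made up).
cur.execute(SQL, ('Example Song', 1, 200, 80, 3))

# Same title again: the UNIQUE constraint on title would be violated,
# so OR REPLACE swaps in the new values instead of raising an error.
cur.execute(SQL, ('Example Song', 1, 200, 100, 4))
conn.commit()

rows = cur.execute('SELECT title, rating, count FROM Track').fetchall()
print(rows)
```

There is still exactly one row for the title, now carrying the most recent rating and play count.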
13968 10:54:04,720 --> 10:54:06,720 We extracted the data from this library 13969 10:54:06,720 --> 10:54:08,720 and we made a track database. 13970 10:54:08,720 --> 10:54:10,720 And we have all these foreign keys. 13971 10:54:10,720 --> 10:54:13,720 So let's go and take a look at the SQLite browser. 13972 10:54:13,720 --> 10:54:18,720 File, open database, trackdb.sqlite. 13973 10:54:18,720 --> 10:54:20,720 And come on up. 13974 10:54:20,720 --> 10:54:22,720 Where'd you hide? 13975 10:54:22,720 --> 10:54:24,720 I got it minimized, so there you go. 13976 10:54:24,720 --> 10:54:26,720 Let's look at the database structure. 13977 10:54:26,720 --> 10:54:29,720 We have an album, this is the structure. 13978 10:54:29,720 --> 10:54:32,720 Artist and track, we have no genre. 13979 10:54:32,720 --> 10:54:34,720 And this is all like we did it by hand 13980 10:54:34,720 --> 10:54:37,720 except Python did all this work for us. 13981 10:54:37,720 --> 10:54:40,720 If we take a look at the data and we start from the outside in, 13982 10:54:40,720 --> 10:54:45,720 we have the artist names and their primary keys. 13983 10:54:45,720 --> 10:54:48,720 There are the artist names and primary keys. 13984 10:54:48,720 --> 10:54:54,720 And then we have the albums and we have the artist IDs. 13985 10:54:54,720 --> 10:54:57,720 See the artist IDs, how nice those are. 13986 10:54:57,720 --> 10:55:00,720 So we have the primary key here and the foreign key there 13987 10:55:00,720 --> 10:55:02,720 and then we have the title. 13988 10:55:02,720 --> 10:55:05,720 And if we get to the track, 13989 10:55:05,720 --> 10:55:08,720 we have the album ID and away we go. 13990 10:55:08,720 --> 10:55:12,720 So if I was clever, I would be able to type some SQL. 13991 10:55:12,720 --> 10:55:14,720 Oh, great. 13992 10:55:14,720 --> 10:55:17,720 If I was smart, I'd have had this in a paste buffer. 13993 10:55:17,720 --> 10:55:37,720 So select track.title, album.title, artist.name, I think.
13994 10:55:37,720 --> 10:55:42,720 Artist has names and albums have titles, yes. 13995 10:55:42,720 --> 10:55:52,720 Okay, so I can do that from track, join, album. 13996 10:55:52,720 --> 10:55:57,720 Oops, album, join. 13997 10:55:57,720 --> 10:56:02,720 Let me make that a little bigger. 13998 10:56:02,720 --> 10:56:04,720 Bring that over here. 13999 10:56:04,720 --> 10:56:10,720 Album, track, join, album, join, artist. 14000 10:56:10,720 --> 10:56:21,720 I need an on clause and I can say track.album. 14001 10:56:21,720 --> 10:56:23,720 ID equals album. 14002 10:56:23,720 --> 10:56:28,720 Notice how I know the name that I named these things 14003 10:56:28,720 --> 10:56:34,720 and album.artist. 14004 10:56:34,720 --> 10:56:42,720 This is so great when you use a naming convention, artist.id. 14005 10:56:42,720 --> 10:56:45,720 Golly, I think that might work. 14006 10:56:45,720 --> 10:56:49,720 So let's just see what we get when we type that into the SQL box here. 14007 10:56:49,720 --> 10:56:53,720 Execute SQL. 14008 10:56:53,720 --> 10:56:54,720 Run. 14009 10:56:54,720 --> 10:56:57,720 Yay, I got it right the first time. 14010 10:56:57,720 --> 10:57:01,720 So that's basically my nice little joined up track list. 14011 10:57:01,720 --> 10:57:04,720 Oh, I'm so happy that I got that right the first time. 14012 10:57:04,720 --> 10:57:09,720 Okay, well, so you can play with this yourself. 14013 10:57:09,720 --> 10:57:14,720 Play with this tracks, maybe make an export of your own iTunes library 14014 10:57:14,720 --> 10:57:16,720 and run it with that. 14015 10:57:16,720 --> 10:57:23,720 And so I hope that you found this particular bit of code useful, okay? 14016 10:57:23,720 --> 10:57:28,720 Cheers. 14017 10:57:28,720 --> 10:57:31,720 So our last major topic is called many-to-many relationships 14018 10:57:31,720 --> 10:57:33,720 and up till now everything that we've done 14019 10:57:33,720 --> 10:57:36,720 is what's called a one-to-many relationship. 
14020 10:57:36,720 --> 10:57:40,720 And that is there are many tracks associated with one album. 14021 10:57:40,720 --> 10:57:43,720 There are many albums associated with one artist. 14022 10:57:43,720 --> 10:57:46,720 There are many tracks associated with one genre. 14023 10:57:46,720 --> 10:57:49,720 And you can think of labeling and as you look at data models 14024 10:57:49,720 --> 10:57:51,720 they put little labels on each arrow 14025 10:57:51,720 --> 10:57:54,720 that tell you which end of the arrow is the many 14026 10:57:54,720 --> 10:57:56,720 and which end of the arrow is the one. 14027 10:57:56,720 --> 10:57:59,720 And so in this case, the foreign key is pointing to 14028 10:57:59,720 --> 10:58:01,720 there are many of these rows over here, 14029 10:58:01,720 --> 10:58:04,720 many rows that point to one row over here. 14030 10:58:04,720 --> 10:58:06,720 So it's a many-to-one relationship. 14031 10:58:06,720 --> 10:58:07,720 There are various ways. 14032 10:58:07,720 --> 10:58:11,720 Sometimes I'll put two arrows at this end and one arrow at that end. 14033 10:58:11,720 --> 10:58:14,720 But whatever it is, this kind of thing we've been showing 14034 10:58:14,720 --> 10:58:16,720 is a many-to-one relationship. 14035 10:58:16,720 --> 10:58:18,720 And that's probably the most common thing. 14036 10:58:18,720 --> 10:58:22,720 But there are times when you just can't model things 14037 10:58:22,720 --> 10:58:25,720 with a one-to-many relationship. 14038 10:58:25,720 --> 10:58:27,720 So like if you have a mother and children, 14039 10:58:27,720 --> 10:58:31,720 well that's a many-to-one relationship and it's just fine 14040 10:58:31,720 --> 10:58:32,720 and that works fine. 14041 10:58:32,720 --> 10:58:35,720 But sometimes you have a many-to-many relationship 14042 10:58:35,720 --> 10:58:38,720 in that there might be many books. 14043 10:58:38,720 --> 10:58:42,720 One book has many authors and each author has many books. 
14044 10:58:42,720 --> 10:58:44,720 And so you don't have like the one side. 14045 10:58:44,720 --> 10:58:45,720 There's no one. 14046 10:58:45,720 --> 10:58:49,720 And so you have to end up building a table that what we call 14047 10:58:49,720 --> 10:58:50,720 I call it a connector table. 14048 10:58:50,720 --> 10:58:53,720 They call it a junction table on Wikipedia. 14049 10:58:53,720 --> 10:58:56,720 But we need a little table that allows us to break 14050 10:58:56,720 --> 10:58:59,720 a many-to-many relationship into an effect 14051 10:58:59,720 --> 10:59:02,720 two many-to-one relationships and a connector table. 14052 10:59:02,720 --> 10:59:04,720 And so this is a connector table. 14053 10:59:04,720 --> 10:59:06,720 So you could think of this as, you know, 14054 10:59:06,720 --> 10:59:08,720 there are many, many links here 14055 10:59:08,720 --> 10:59:12,720 but we don't have a way to model the many over here to here. 14056 10:59:12,720 --> 10:59:15,720 And so what you do is you basically say, 14057 10:59:15,720 --> 10:59:16,720 oh there's a lot of these things. 14058 10:59:16,720 --> 10:59:18,720 There's many that go to the one. 14059 10:59:18,720 --> 10:59:20,720 The many that go to the one. 14060 10:59:20,720 --> 10:59:23,720 And in here you sort of create that manyness 14061 10:59:23,720 --> 10:59:24,720 that you want to create. 14062 10:59:24,720 --> 10:59:28,720 So it's probably just as easy to look at a sample of this. 14063 10:59:28,720 --> 10:59:32,720 So let's imagine a learning management system 14064 10:59:32,720 --> 10:59:35,720 where you're taking a class and there are some people 14065 10:59:35,720 --> 10:59:37,720 that are teachers and some people that are students 14066 10:59:37,720 --> 10:59:40,720 and many students are members of many classes. 14067 10:59:40,720 --> 10:59:43,720 A student can be part of many classes 14068 10:59:43,720 --> 10:59:45,720 and a class has many students in it. 
14069 10:59:45,720 --> 10:59:47,720 So you can't really find the one end. 14070 10:59:47,720 --> 10:59:50,720 And so what we do is we make a table called a membership. 14071 10:59:50,720 --> 10:59:53,720 And in that table of membership we actually often 14072 10:59:53,720 --> 10:59:55,720 don't put a primary key in at all. 14073 10:59:55,720 --> 10:59:58,720 We simply put in two foreign keys. 14074 10:59:58,720 --> 11:00:00,720 And if we're going to put a uniqueness constraint 14075 11:00:00,720 --> 11:00:04,720 we put a combination of the two foreign keys 14076 11:00:04,720 --> 11:00:06,720 as the uniqueness constraint. 14077 11:00:06,720 --> 11:00:09,720 So we say there can be duplicate user IDs 14078 11:00:09,720 --> 11:00:11,720 and duplicate course IDs but there can only be, 14079 11:00:11,720 --> 11:00:14,720 you know, user ID, course ID combinations. 14080 11:00:14,720 --> 11:00:15,720 That has to be unique. 14081 11:00:15,720 --> 11:00:20,720 So you can make unique be more than one column. 14082 11:00:20,720 --> 11:00:23,720 And so if you imagine a course table and a user table 14083 11:00:23,720 --> 11:00:25,720 there's a user ID, the name and email 14084 11:00:25,720 --> 11:00:27,720 and the course has a title and an ID. 14085 11:00:27,720 --> 11:00:29,720 And then we have this little table that just is 14086 11:00:29,720 --> 11:00:33,720 the connector table that shows the points out. 14087 11:00:33,720 --> 11:00:35,720 And so we can expand this membership. 14088 11:00:35,720 --> 11:00:37,720 So let's take a look at how that works. 14089 11:00:37,720 --> 11:00:41,720 So we're going to create some tables 14090 11:00:41,720 --> 11:00:46,720 and these are very classic tables 14091 11:00:46,720 --> 11:00:49,720 because these are the one end of it. 14092 11:00:49,720 --> 11:00:51,720 So these are the one end of it. 14093 11:00:51,720 --> 11:00:55,720 So it has a primary key, a title, a logical key, email. 
14094 11:00:55,720 --> 11:00:57,720 There's a primary key for course and then there's text. 14095 11:00:57,720 --> 11:00:59,720 So we have this unique to kind of indicate 14096 11:00:59,720 --> 11:01:00,720 that it's a logical key. 14097 11:01:00,720 --> 11:01:02,720 We're not going to allow ourselves 14098 11:01:02,720 --> 11:01:04,720 to put any duplicates in here. 14099 11:01:04,720 --> 11:01:09,720 Now the connector table here is the table member 14100 11:01:09,720 --> 11:01:12,720 and it has two foreign keys, user ID and course ID. 14101 11:01:12,720 --> 11:01:15,720 And you can easily model some data here. 14102 11:01:15,720 --> 11:01:17,720 So I'm going to model role which is going to be 14103 11:01:17,720 --> 11:01:21,720 zero equals student and one equals instructor. 14104 11:01:21,720 --> 11:01:23,720 And then I'm going to indicate that the primary key 14105 11:01:23,720 --> 11:01:26,720 or uniqueness constraint is the combination 14106 11:01:26,720 --> 11:01:28,720 of the user ID and the course ID. 14107 11:01:28,720 --> 11:01:30,720 Now when we say the primary key, 14108 11:01:30,720 --> 11:01:34,720 it both limits our ability to insert duplicates 14109 11:01:34,720 --> 11:01:37,720 but it also allows the database to optimize its scanning 14110 11:01:37,720 --> 11:01:40,720 because it knows that that combination is always unique 14111 11:01:40,720 --> 11:01:43,720 and so it can organize its disk structure 14112 11:01:43,720 --> 11:01:46,720 and storage structure to understand 14113 11:01:46,720 --> 11:01:48,720 how to look things up more efficiently. 14114 11:01:48,720 --> 11:01:50,720 Knowing that once it's found a user ID, 14115 11:01:50,720 --> 11:01:52,720 course ID combination, it doesn't have to look any farther 14116 11:01:52,720 --> 11:01:53,720 because they're unique. 14117 11:01:53,720 --> 11:01:55,720 And so all of these constraints that we add 14118 11:01:55,720 --> 11:01:59,720 speed things up, save storage and make things more efficient.
14119 11:01:59,720 --> 11:02:02,720 But in ways we don't always know exactly how they happen. 14120 11:02:02,720 --> 11:02:05,720 And so let's go ahead and make these. 14121 11:02:05,720 --> 11:02:07,720 Let's go ahead and make these guys. 14122 11:02:07,720 --> 11:02:10,720 I think I will start with a new database. 14123 11:02:10,720 --> 11:02:16,720 I'm going to call it LMS for Learning Management System. 14124 11:02:16,720 --> 11:02:19,720 No, I don't really want to do that one. 14125 11:02:19,720 --> 11:02:22,720 And so I'm going to not create the table. 14126 11:02:22,720 --> 11:02:24,720 I'm going to do everything in SQL. 14127 11:02:24,720 --> 11:02:27,720 And so let me see if it's in my cheat sheet. 14128 11:02:27,720 --> 11:02:28,720 Nope, that's not in my cheat sheet. 14129 11:02:28,720 --> 11:02:30,720 So I have to fix the cheat sheet again for you. 14130 11:02:30,720 --> 11:02:32,720 By the time you see the cheat sheet, 14131 11:02:32,720 --> 11:02:33,720 all these things will be in there. 14132 11:02:33,720 --> 11:02:39,720 So I'm going to go in here and I'm going to grab create table user. 14133 11:02:39,720 --> 11:02:41,720 Actually, I'm going to grab them all. 14134 11:02:41,720 --> 11:02:44,720 Watch this. 14135 11:02:44,720 --> 11:02:46,720 Grab them all. 14136 11:02:46,720 --> 11:02:48,720 Highlight all these. 14137 11:02:48,720 --> 11:02:50,720 Go over to SQLite Browser. 14138 11:02:50,720 --> 11:02:52,720 Blast them all in. 14139 11:02:52,720 --> 11:02:54,720 And then I'll put a semicolon at the end 14140 11:02:54,720 --> 11:02:59,720 of each one of the statements. 14141 11:02:59,720 --> 11:03:01,720 And I want to run them. 14142 11:03:01,720 --> 11:03:05,720 So does it look good? 14143 11:03:05,720 --> 11:03:07,720 Yep, yep, yep. 14144 11:03:07,720 --> 11:03:08,720 So I got a course. 14145 11:03:08,720 --> 11:03:11,720 I got membership, two foreign keys, and I got user. 14146 11:03:11,720 --> 11:03:14,720 So that all looks good.
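The three tables just created can be sketched from Python as well. This is a sketch under stated assumptions rather than the exact SQL pasted in the lecture: it uses an in-memory database, the column names are inferred from the description, and which User column carries the UNIQUE logical key is an assumption. The part that matters is the Member table, whose primary key is the combination of its two foreign keys.

```python
import sqlite3

conn = sqlite3.connect(':memory:')
cur = conn.cursor()

# Column details are assumed for illustration.
cur.executescript('''
CREATE TABLE User (
    id    INTEGER NOT NULL PRIMARY KEY AUTOINCREMENT UNIQUE,
    name  TEXT UNIQUE,
    email TEXT
);

CREATE TABLE Course (
    id    INTEGER NOT NULL PRIMARY KEY AUTOINCREMENT UNIQUE,
    title TEXT UNIQUE
);

-- Connector table: two foreign keys plus data modeled at the connection.
-- role: 0 = student, 1 = instructor.
CREATE TABLE Member (
    user_id   INTEGER,
    course_id INTEGER,
    role      INTEGER,
    PRIMARY KEY (user_id, course_id)
);
''')

INSERT = 'INSERT INTO Member (user_id, course_id, role) VALUES (?, ?, ?)'

# The same user in two different courses is fine.
cur.execute(INSERT, (1, 1, 1))
cur.execute(INSERT, (1, 2, 0))

# The same (user_id, course_id) pair a second time violates the
# composite primary key, so SQLite refuses it.
try:
    cur.execute(INSERT, (1, 1, 0))
    duplicate_rejected = False
except sqlite3.IntegrityError:
    duplicate_rejected = True

member_count = cur.execute('SELECT COUNT(*) FROM Member').fetchone()[0]
print(duplicate_rejected, member_count)
```

Duplicate user IDs and duplicate course IDs are both allowed on their own; only a repeated pair is rejected.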
14147 11:03:14,720 --> 11:03:20,720 So now we're going to have to insert some data in. 14148 11:03:20,720 --> 11:03:22,720 And we're going to insert from the outside in. 14149 11:03:22,720 --> 11:03:24,720 And so we're going to just put the name and email. 14150 11:03:24,720 --> 11:03:27,720 The ID will be automatically assigned for the users. 14151 11:03:27,720 --> 11:03:29,720 And we're going to do the same thing. 14152 11:03:29,720 --> 11:03:33,720 And the ID in the courses will be automatically assigned. 14153 11:03:33,720 --> 11:03:37,720 So let me just grab all this stuff. 14154 11:03:37,720 --> 11:03:38,720 Go into SQL. 14155 11:03:38,720 --> 11:03:40,720 That has the semicolons at the end already for me. 14156 11:03:40,720 --> 11:03:42,720 Thank you very much. 14157 11:03:42,720 --> 11:03:44,720 Now I'm going to run it. 14158 11:03:44,720 --> 11:03:49,720 And if I take a look at my data, now I've got primary keys for the courses. 14159 11:03:49,720 --> 11:03:52,720 And I've got primary keys for the users. 14160 11:03:52,720 --> 11:03:54,720 And I've got nothing in the membership table. 14161 11:03:54,720 --> 11:03:57,720 And I, of course, have to remember what these values are 14162 11:03:57,720 --> 11:04:01,720 because Jane is one, and Ed is two, and Sue is three, right? 14163 11:04:01,720 --> 11:04:05,720 And Python is one, SQL is two, PHP is three. 14164 11:04:05,720 --> 11:04:09,720 And so when I go into membership, I've got two foreign keys here and a role. 14165 11:04:09,720 --> 11:04:13,720 And they just have to be for the course person combination. 14166 11:04:13,720 --> 11:04:17,720 And so it's a little tricky to figure all this stuff out. 14167 11:04:17,720 --> 11:04:19,720 But again, these are just numbers. 14168 11:04:19,720 --> 11:04:22,720 And if you look at these numbers, user ID, course ID, role. 14169 11:04:22,720 --> 11:04:24,720 Well, user ID one is in course one. 14170 11:04:24,720 --> 11:04:26,720 User ID one is in course one as the teacher.
14171 11:04:26,720 --> 11:04:30,720 User ID two is in course one as the student, et cetera, et cetera, et cetera. 14172 11:04:30,720 --> 11:04:34,720 So I'm making these connections by just putting these little numbers in. 14173 11:04:34,720 --> 11:04:39,720 And once again, conveniently, I have all my semicolons perfectly in place. 14174 11:04:39,720 --> 11:04:42,720 So I go to SQL. 14175 11:04:42,720 --> 11:04:44,720 And then I run that. 14176 11:04:44,720 --> 11:04:48,720 And then I take and I look at my membership data, and there it is. 14177 11:04:48,720 --> 11:04:52,720 So two foreign keys and a bit of data modeled at the connection. 14178 11:04:52,720 --> 11:04:53,720 That's the way we say that. 14179 11:04:53,720 --> 11:04:56,720 The role is modeled at the connection. 14180 11:04:56,720 --> 11:05:01,720 So now we build all this stuff up, we can write some queries that take a look at this. 14181 11:05:01,720 --> 11:05:07,720 And so what we're going to do is we're going to look at who's in what course and what role are they. 14182 11:05:07,720 --> 11:05:11,720 And we're going to sort this in a nice way. 14183 11:05:11,720 --> 11:05:14,720 So let's just take a quick look at the code we're writing. 14184 11:05:14,720 --> 11:05:20,720 We're going to do a select from three tables, the user name, the member role, the course title. 14185 11:05:20,720 --> 11:05:25,720 So in effect, we're not showing any of the foreign keys or the primary keys. 14186 11:05:25,720 --> 11:05:29,720 We're going to go from the user table, join to the member table, join to the course table. 14187 11:05:29,720 --> 11:05:31,720 This is pretty easy to write. 14188 11:05:31,720 --> 11:05:33,720 You know there are three tables you want to go across. 14189 11:05:33,720 --> 11:05:37,720 The on clause is also very easy to write, right? 14190 11:05:37,720 --> 11:05:46,720 The on clause models each of these connections, where the member's user ID is equal to the user's ID. 
14191 11:05:46,720 --> 11:05:51,720 And where the member's course ID is equal to the course ID. 14192 11:05:51,720 --> 11:05:58,720 So we're going to concatenate all three of these tables together, but we're going to only keep rows where it matters. 14193 11:05:58,720 --> 11:06:03,720 Now this role doesn't participate, but we're going to print that out. 14194 11:06:03,720 --> 11:06:11,720 And we're going to order it by the course title first, and then the member role second, and the name third. 14195 11:06:11,720 --> 11:06:24,720 And so let's run that. 14196 11:06:24,720 --> 11:06:25,720 So we've reconnected it. 14197 11:06:25,720 --> 11:06:27,720 So Ed's the teacher of the PHP class. 14198 11:06:27,720 --> 11:06:29,720 Sue is the student in the PHP class. 14199 11:06:29,720 --> 11:06:31,720 Jane is the teacher in the Python class. 14200 11:06:31,720 --> 11:06:33,720 Ed and Sue are students in the Python class. 14201 11:06:33,720 --> 11:06:38,720 Ed's the teacher in the SQL class, and Jane is the student in the SQL class. 14202 11:06:38,720 --> 11:06:45,720 And so we have many people, there are many students in many classes there, and so we have modeled that. 14203 11:06:45,720 --> 11:06:48,720 But we model that with this sort of table. 14204 11:06:48,720 --> 11:06:55,720 And if you look at a piece of software that I've written called Tsugi, which is a standalone learning management system that's built with learning tools, 14205 11:06:55,720 --> 11:07:04,720 you will see the same thing in membership, where we have a user table, we have a context, which is also the course table, 14206 11:07:04,720 --> 11:07:07,720 and then we have a membership table, and you look, here's these foreign keys.
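The roster query that was just run can be sketched end to end. This is a sketch under stated assumptions: an in-memory database, assumed column names, and placeholder email addresses. The membership rows are chosen to reproduce the result read out above, and the role is sorted in descending order so that teachers come before students in each course, which matches that output.

```python
import sqlite3

conn = sqlite3.connect(':memory:')
cur = conn.cursor()

# Schema and emails are assumptions for illustration.
cur.executescript('''
CREATE TABLE User (
    id INTEGER NOT NULL PRIMARY KEY AUTOINCREMENT UNIQUE,
    name TEXT UNIQUE, email TEXT);
CREATE TABLE Course (
    id INTEGER NOT NULL PRIMARY KEY AUTOINCREMENT UNIQUE,
    title TEXT UNIQUE);
CREATE TABLE Member (
    user_id INTEGER, course_id INTEGER, role INTEGER,
    PRIMARY KEY (user_id, course_id));

-- Inserted in this order, the ids come out as Jane=1, Ed=2, Sue=3
-- and Python=1, SQL=2, PHP=3.
INSERT INTO User (name, email) VALUES ('Jane', 'jane@example.com');
INSERT INTO User (name, email) VALUES ('Ed', 'ed@example.com');
INSERT INTO User (name, email) VALUES ('Sue', 'sue@example.com');
INSERT INTO Course (title) VALUES ('Python');
INSERT INTO Course (title) VALUES ('SQL');
INSERT INTO Course (title) VALUES ('PHP');

-- role: 1 = teacher, 0 = student
INSERT INTO Member (user_id, course_id, role) VALUES (1, 1, 1);
INSERT INTO Member (user_id, course_id, role) VALUES (2, 1, 0);
INSERT INTO Member (user_id, course_id, role) VALUES (3, 1, 0);
INSERT INTO Member (user_id, course_id, role) VALUES (1, 2, 0);
INSERT INTO Member (user_id, course_id, role) VALUES (2, 2, 1);
INSERT INTO Member (user_id, course_id, role) VALUES (2, 3, 1);
INSERT INTO Member (user_id, course_id, role) VALUES (3, 3, 0);
''')

# Join across all three tables; the ON clause keeps only the rows
# where the foreign keys in Member match the primary keys.
rows = cur.execute('''
SELECT User.name, Member.role, Course.title
FROM User JOIN Member JOIN Course
ON Member.user_id = User.id AND Member.course_id = Course.id
ORDER BY Course.title, Member.role DESC, User.name
''').fetchall()

for name, role, title in rows:
    print(title, 'teacher' if role == 1 else 'student', name)
```

None of the primary or foreign keys appear in the output; they only do the work of reconnecting the rows.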
14207 11:07:07,720 --> 11:07:17,720 Like that's the many side, that's the one side, many to one, and so this is now, in effect, a many-to-many between these two, 14208 11:07:17,720 --> 11:07:22,720 but then it's modeled as a series of many to one, many to one relationships. 14209 11:07:22,720 --> 11:07:28,720 And you see this all the time in all kinds of things where membership or other kinds of things are necessary, 14210 11:07:28,720 --> 11:07:31,720 many to one, or many to many. 14211 11:07:31,720 --> 11:07:38,720 So, with all that, there's so much to learn. It's both easy and complex at the same time. 14212 11:07:38,720 --> 11:07:42,720 It's easy when someone shows you how to do it, but at some point you will learn how to build database models, 14213 11:07:42,720 --> 11:07:46,720 and you realize, oh, it wasn't so bad. It takes a while to get used to them. 14214 11:07:46,720 --> 11:07:49,720 This really just is a quick walk. 14215 11:07:49,720 --> 11:07:58,720 The bottom line is, what we just did seems like it was, wow, that's nice. Do you really have to do that? 14216 11:07:58,720 --> 11:08:02,720 And the answer is, if you're going to scale at all, you absolutely have to, 14217 11:08:02,720 --> 11:08:05,720 because you simply can't read and write data sequentially. 14218 11:08:05,720 --> 11:08:10,720 You can't read through and update one little piece of data in a file by reading all the way through 14219 11:08:10,720 --> 11:08:13,720 and then writing a new copy of the file. That could take seconds, 14220 11:08:13,720 --> 11:08:18,720 and in a system like an online system, you get a hundredth of a second to do something like that, 14221 11:08:18,720 --> 11:08:22,720 and the databases make it so that happens in a thousandth of a second. 14222 11:08:22,720 --> 11:08:25,720 So, ultimately, you simply have to take advantage of this.
14223 11:08:25,720 --> 11:08:29,720 You just can't, if you're going to modify data, you can read data from flat files, 14224 11:08:29,720 --> 11:08:33,720 but even if you're going to read a lot of data, if it's big, it slows down terribly. 14225 11:08:33,720 --> 11:08:38,720 So, it might seem like there's a trade-off that you could debate whether this is worth it, 14226 11:08:38,720 --> 11:08:42,720 but if you're going to deal with a lot of data, you've got no choice. 14227 11:08:42,720 --> 11:08:45,720 It's really not as much a trade-off as you think. 14228 11:08:45,720 --> 11:08:49,720 So, this has been a quick romp through databases. 14229 11:08:49,720 --> 11:08:52,720 We talked a little bit about indexes. There are constraints. 14230 11:08:52,720 --> 11:08:55,720 We talked a little bit about the not null stuff. We've talked about that. 14231 11:08:55,720 --> 11:08:57,720 The uniqueness, that's a constraint. 14232 11:08:57,720 --> 11:09:00,720 Another whole area is what's called transactions, 14233 11:09:00,720 --> 11:09:03,720 and that's the locking of little areas. 14234 11:09:03,720 --> 11:09:07,720 So, you can read an area, then lock it, and then update it to make sure no one else reads it. 14235 11:09:07,720 --> 11:09:12,720 And so, they make sure they either get the version before you looked at it 14236 11:09:12,720 --> 11:09:14,720 or before you change it or after you change it. 14237 11:09:14,720 --> 11:09:23,720 And so, that's how you make sure that you can't do things having to do with bank account balances 14238 11:09:23,720 --> 11:09:24,720 and get yourself in trouble. 14239 11:09:24,720 --> 11:09:27,720 So, these are a lot of SQL. It's really fascinating. 14240 11:09:27,720 --> 11:09:33,720 SQL is a fascinating thing to use and learn and performance tune and enjoy. 14241 11:09:33,720 --> 11:09:38,720 So, relational databases are cool. This gets us started. 
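The bank-balance point about transactions can be illustrated with a small sketch. The lecture describes transactions in terms of locking; this example only shows the all-or-nothing side of the same idea, using sqlite3's connection as a context manager. The Account table, the CHECK constraint, and the transfer function are hypothetical names made up for the illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical table: a balance is never allowed to go below zero.
conn.execute(
    "CREATE TABLE Account (name TEXT PRIMARY KEY, "
    "balance INTEGER CHECK (balance >= 0))"
)
conn.execute("INSERT INTO Account VALUES ('alice', 100), ('bob', 0)")
conn.commit()


def transfer(conn, src, dst, amount):
    """Move amount from src to dst; either both updates happen or neither."""
    try:
        with conn:  # commits on success, rolls back if an exception is raised
            conn.execute(
                "UPDATE Account SET balance = balance + ? WHERE name = ?",
                (amount, dst),
            )
            conn.execute(
                "UPDATE Account SET balance = balance - ? WHERE name = ?",
                (amount, src),
            )
        return True
    except sqlite3.IntegrityError:
        # The debit would have made the balance negative, so the credit
        # made just before it was rolled back too.
        return False


def balances(conn):
    return dict(conn.execute("SELECT name, balance FROM Account"))


too_much = transfer(conn, "alice", "bob", 500)
after_failure = balances(conn)
ok = transfer(conn, "alice", "bob", 30)
after_success = balances(conn)
```

The failed transfer leaves both balances untouched, which is the property that keeps account balances from getting into trouble.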
14242 11:09:38,720 --> 11:09:43,720 The big thing is don't allow replication vertically of string data. 14243 11:09:43,720 --> 11:09:46,720 Pull that out into a separate table, establish a primary key, 14244 11:09:46,720 --> 11:09:48,720 and then have foreign keys that point to that primary key. 14245 11:09:48,720 --> 11:09:51,720 It is not just how much data you store. 14246 11:09:51,720 --> 11:09:54,720 It's sort of a way of compressing data. 14247 11:09:54,720 --> 11:09:57,720 You might think strings take no data, but they do. 14248 11:09:57,720 --> 11:10:01,720 Numbers take a lot less data, and it's both how much data that's stored 14249 11:10:01,720 --> 11:10:03,720 but also how much data has to be scanned. 14250 11:10:03,720 --> 11:10:10,720 And that way joins work. That's part of the magic of why Oracle is such a successful company. 14251 11:10:10,720 --> 11:10:14,720 It's a bit of an art form, and it's something that you can work your whole life 14252 11:10:14,720 --> 11:10:16,720 and always get better at. 14253 11:10:22,720 --> 11:10:25,720 Hello, and welcome to our code walkthrough on the roster code. 14254 11:10:25,720 --> 11:10:29,720 So, the learning objective of this is to do a many-to-many table. 14255 11:10:29,720 --> 11:10:34,720 And so, the idea is that we're going to, just like we talked about in lecture, 14256 11:10:34,720 --> 11:10:37,720 we're going to have a set of users, we're going to have a set of courses, 14257 11:10:37,720 --> 11:10:41,720 and then we're going to have a connector table or a many-to-many table 14258 11:10:41,720 --> 11:10:44,720 that basically has two foreign keys. 14259 11:10:44,720 --> 11:10:50,720 So, we are going to use INTEGER NOT NULL PRIMARY KEY AUTOINCREMENT UNIQUE 14260 11:10:50,720 --> 11:10:58,720 as the way to get auto-assignment of the primary keys in the user table and the course table.
14261 11:10:58,720 --> 11:11:03,720 And then we're going to say that the name, which is like a logical key, 14262 11:11:03,720 --> 11:11:06,720 and then the course title, we're going to mark those as unique. 14263 11:11:06,720 --> 11:11:09,720 And we're going to take advantage of that in a moment. 14264 11:11:09,720 --> 11:11:11,720 So, you'll see how we take advantage of that. 14265 11:11:11,720 --> 11:11:16,720 So, what unique means is if you try to insert the same string into this column, 14266 11:11:16,720 --> 11:11:21,720 you know, like Chuck twice, then it's going to fail the second time 14267 11:11:21,720 --> 11:11:24,720 because it's going to refuse to create a new record. 14268 11:11:24,720 --> 11:11:26,720 And so, if we just kind of like take a look, 14269 11:11:26,720 --> 11:11:29,720 we're going to get our roster data from this sample JSON, 14270 11:11:29,720 --> 11:11:32,720 which is just an array of arrays. 14271 11:11:32,720 --> 11:11:35,720 And this is the person's name, the class that they're in, 14272 11:11:35,720 --> 11:11:38,720 and whether they are a teacher or a student. 14273 11:11:38,720 --> 11:11:40,720 And so, we're going to read that. 14274 11:11:40,720 --> 11:11:43,720 So, we need the JSON library and the SQLite library. 14275 11:11:43,720 --> 11:11:46,720 We make a database connection, and we get a cursor. 14276 11:11:46,720 --> 11:11:50,720 The cursor is kind of more like the file handle. 14277 11:11:50,720 --> 11:11:52,720 You send SQL commands to the cursor, 14278 11:11:52,720 --> 11:11:54,720 and then you read the cursor to get the data back. 14279 11:11:54,720 --> 11:11:57,720 The connection can create more than one cursor, 14280 11:11:57,720 --> 11:11:59,720 so you can have more than one set of commands. 14281 11:11:59,720 --> 11:12:05,720 But the cursor is generally like the file handle to the database server. 
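A minimal sketch of the connection, the cursor, and what UNIQUE does when the same name is inserted twice. It uses an in-memory database rather than the walkthrough's file, and the table layout is an assumption based on the description above.

```python
import sqlite3

# The connection is the link to the database; the cursor is the
# file-handle-like object that SQL commands are sent through.
conn = sqlite3.connect(":memory:")
cur = conn.cursor()

cur.execute("""
CREATE TABLE User (
    id   INTEGER NOT NULL PRIMARY KEY AUTOINCREMENT UNIQUE,
    name TEXT UNIQUE
)
""")

cur.execute("INSERT INTO User (name) VALUES (?)", ("Chuck",))

# A plain INSERT of the same logical key violates the UNIQUE constraint.
try:
    cur.execute("INSERT INTO User (name) VALUES (?)", ("Chuck",))
    second_insert_failed = False
except sqlite3.IntegrityError:
    second_insert_failed = True

count = cur.execute("SELECT COUNT(*) FROM User").fetchone()[0]
print(second_insert_failed, count)
```

The second insert is refused, so the table still holds a single row for Chuck.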
14282 11:12:05,720 --> 11:12:08,720 And we are going to execute a big script, 14283 11:12:08,720 --> 11:12:10,720 and you'll notice this is a triple-quoted string 14284 11:12:10,720 --> 11:12:12,720 that goes all the way down to here. 14285 11:12:12,720 --> 11:12:15,720 And so, some people would just give this to you in a text file 14286 11:12:15,720 --> 11:12:17,720 and have you cut and paste this, 14287 11:12:17,720 --> 11:12:21,720 and then go run that in your SQLite browser to create them. 14288 11:12:21,720 --> 11:12:24,720 But that's okay, because what we're going to do 14289 11:12:24,720 --> 11:12:26,720 is we're going to set this up. 14290 11:12:26,720 --> 11:12:31,720 It will either reconnect to an existing file named rosterdb.sqlite, 14291 11:12:31,720 --> 11:12:33,720 and if I look where I'm at, I do an ls, 14292 11:12:33,720 --> 11:12:35,720 we find that that file is not there. 14293 11:12:35,720 --> 11:12:37,720 So, the first time I run it, it's going to create it. 14294 11:12:37,720 --> 11:12:40,720 But I want this to start fresh every time, 14295 11:12:40,720 --> 11:12:42,720 so I'm going to wipe out the tables if they exist. 14296 11:12:42,720 --> 11:12:44,720 That way, you can run it over and over and over again, 14297 11:12:44,720 --> 11:12:46,720 in case you make a mistake here. 14298 11:12:46,720 --> 11:12:47,720 Now, I don't have a mistake, 14299 11:12:47,720 --> 11:12:50,720 or hopefully I don't have a mistake on this. 14300 11:12:50,720 --> 11:12:53,720 So, we're going to drop three tables, 14301 11:12:53,720 --> 11:12:55,720 and we're going to create three tables. 14302 11:12:55,720 --> 11:12:59,720 And here, we're going to create the table 14303 11:12:59,720 --> 11:13:03,720 that has two foreign keys, user ID, course ID, 14304 11:13:03,720 --> 11:13:06,720 that are sort of going outwards from the member table, 14305 11:13:06,720 --> 11:13:09,720 and then we're going to model a little bit of the data, the role.
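The start-fresh script described above might look something like the following sketch. The exact column names and types are assumptions based on the spoken description, and an in-memory database stands in for the rosterdb.sqlite file so the example has no side effects.

```python
import sqlite3

# One triple-quoted string holding several statements, run with executescript.
# Dropping first means the program can be re-run from a clean state.
SCHEMA = """
DROP TABLE IF EXISTS User;
DROP TABLE IF EXISTS Member;
DROP TABLE IF EXISTS Course;

CREATE TABLE User (
    id   INTEGER NOT NULL PRIMARY KEY AUTOINCREMENT UNIQUE,
    name TEXT UNIQUE
);

CREATE TABLE Course (
    id    INTEGER NOT NULL PRIMARY KEY AUTOINCREMENT UNIQUE,
    title TEXT UNIQUE
);

CREATE TABLE Member (
    user_id   INTEGER,
    course_id INTEGER,
    role      INTEGER,
    PRIMARY KEY (user_id, course_id)
);
"""


def reset_schema(conn):
    """Wipe and recreate the three tables."""
    conn.executescript(SCHEMA)


conn = sqlite3.connect(":memory:")
reset_schema(conn)
reset_schema(conn)  # running it again is harmless because of DROP ... IF EXISTS

tables = sorted(
    row[0]
    for row in conn.execute(
        "SELECT name FROM sqlite_master "
        "WHERE type = 'table' AND name NOT LIKE 'sqlite_%'"
    )
)
print(tables)
```

The trade-off of this design is that every run throws away whatever was loaded before, which suits a loader that always rebuilds from the same input file.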
14306 11:13:09,720 --> 11:13:14,720 And I guess this, and again, this is straight from the lecture. 14307 11:13:14,720 --> 11:13:18,720 And the primary key is actually a composite primary key, 14308 11:13:18,720 --> 11:13:20,720 because we're going to look up, 14309 11:13:20,720 --> 11:13:23,720 and it's going to force this to be the combination of user ID 14310 11:13:23,720 --> 11:13:24,720 and course ID to be unique. 14311 11:13:24,720 --> 11:13:27,720 But there can be many user IDs and many course IDs, 14312 11:13:27,720 --> 11:13:32,720 but only one particular combination of a value for user ID and course ID. 14313 11:13:32,720 --> 11:13:34,720 And so, that's what we're basically saying. 14314 11:13:34,720 --> 11:13:37,720 You can be a member of a course, but you can only do that once. 14315 11:13:37,720 --> 11:13:41,720 You can't be like a member of the course a bunch of times. 14316 11:13:41,720 --> 11:13:46,720 So, we're going to, oh, that should be roster data sample. 14317 11:13:46,720 --> 11:13:51,720 That's okay to, oops, fix a bug. 14318 11:13:51,720 --> 11:13:54,720 Save that, roster data sample. 14319 11:13:54,720 --> 11:13:57,720 And so, that's just this file, and it's really just an array, 14320 11:13:57,720 --> 11:13:59,720 and then each row is an array, 14321 11:13:59,720 --> 11:14:02,720 and it's a way for us to get this roster data in. 14322 11:14:02,720 --> 11:14:07,720 And so, once we do json.loads, we're parsing it, 14323 11:14:07,720 --> 11:14:10,720 and then this is going to be an array of arrays. 14324 11:14:10,720 --> 11:14:14,720 And so, for entry in JSON data, 14325 11:14:14,720 --> 11:14:17,720 so entry is going to be one of these things. 14326 11:14:17,720 --> 11:14:20,720 So, entry itself is a row.
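A small sketch of parsing an array of arrays and walking its rows. The JSON text is inlined here instead of being read from the sample file, and the names, course codes, and role values in it are made up for illustration.

```python
import json

# Stand-in for the contents of the roster file: each inner array is
# [person name, course title, role]. The values are invented.
str_data = """[
  ["Charley", "si110", 1],
  ["Mea", "si110", 0],
  ["Charley", "si106", 0]
]"""

json_data = json.loads(str_data)  # a Python list of lists

pairs = []
for entry in json_data:
    name = entry[0]   # first item in the row
    title = entry[1]  # second item in the row
    pairs.append((name, title))
    print((name, title))  # the inner parentheses build a two-tuple
```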
14327 11:14:20,720 --> 11:14:23,720 So, an entry sub zero is the name, 14328 11:14:23,720 --> 11:14:27,720 and entry sub one is the title, name, that's the sub zero, 14329 11:14:27,720 --> 11:14:33,720 and that's the sub one of the particular entry that we're looking at. 14330 11:14:33,720 --> 11:14:36,720 And we're going to print it out just for yuks as a tuple. 14331 11:14:36,720 --> 11:14:38,720 So, we make, that's what the two parentheses are. 14332 11:14:38,720 --> 11:14:41,720 This inner thing is a two tuple. 14333 11:14:41,720 --> 11:14:45,720 And we're then going to take the person, 14334 11:14:45,720 --> 11:14:49,720 and we're going to do an insert, and this is new, or ignore. 14335 11:14:49,720 --> 11:14:54,720 So, what the, or ignore means is if this insert would cause an error, 14336 11:14:54,720 --> 11:14:58,720 please don't blow up, don't, just ignore that I tried to insert it. 14337 11:14:58,720 --> 11:15:01,720 And so, this is our trick, and it's a beautiful trick. 14338 11:15:01,720 --> 11:15:05,720 It's like a gorgeously beautiful trick here. 14339 11:15:05,720 --> 11:15:09,720 If we insert the name Chuck twice, 14340 11:15:09,720 --> 11:15:13,720 or ignore will just mean that nothing happens, meaning it's already there. 14341 11:15:13,720 --> 11:15:16,720 Okay, so if it's already there, if it's not there, it'll put it in. 14342 11:15:16,720 --> 11:15:19,720 And the unique will guarantee that it only goes in once. 14343 11:15:19,720 --> 11:15:23,720 So, we just, in effect, always attempt to insert it. 14344 11:15:23,720 --> 11:15:25,720 And if it's been there once, then it's all set. 14345 11:15:25,720 --> 11:15:29,720 And so, this insert or ignore is a super powerful mechanism. 14346 11:15:29,720 --> 11:15:31,720 I use it all the time.
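The insert-or-ignore trick can be tried in isolation. This is a minimal sketch with an assumed User table and made-up names; the point is only that repeating a name does not raise an error and does not create a second row.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute(
    "CREATE TABLE User (id INTEGER PRIMARY KEY AUTOINCREMENT, name TEXT UNIQUE)"
)

# Always attempt the insert; UNIQUE plus OR IGNORE makes repeats a no-op.
for name in ["Chuck", "Chuck", "Sue"]:
    cur.execute("INSERT OR IGNORE INTO User (name) VALUES (?)", (name,))
conn.commit()

names = [row[0] for row in cur.execute("SELECT name FROM User ORDER BY id")]
print(names)
```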
14347 11:15:31,720 --> 11:15:35,720 And we have a placeholder in the form of a question mark, 14348 11:15:35,720 --> 11:15:37,720 and then we have, so one of these days, 14349 11:15:37,720 --> 11:15:39,720 we'll have two things that we're asking for. 14350 11:15:39,720 --> 11:15:40,720 As a matter of fact, here it is. 14351 11:15:40,720 --> 11:15:41,720 There's a tuple down here. 14352 11:15:41,720 --> 11:15:44,720 But this is kind of a tuple with one item in it, name, 14353 11:15:44,720 --> 11:15:47,720 and that name is then going to substitute in for there 14354 11:15:47,720 --> 11:15:51,720 while avoiding SQL injection. 14355 11:15:51,720 --> 11:15:53,720 So, this runs. 14356 11:15:53,720 --> 11:15:55,720 It may or may not insert a new record, 14357 11:15:55,720 --> 11:15:58,720 but if Chuck or whomever the name is is not there, 14358 11:15:58,720 --> 11:16:00,720 it will give us a new record. 14359 11:16:00,720 --> 11:16:02,720 And then we are going to get back the ID. 14360 11:16:02,720 --> 11:16:06,720 And so, this is the logical key, and this is the primary key. 14361 11:16:06,720 --> 11:16:10,720 And that primary key is going to be auto-constructed for us, 14362 11:16:10,720 --> 11:16:12,720 and so we need to know what it is. 14363 11:16:12,720 --> 11:16:16,720 So, we say select ID from user where name equals 14364 11:16:16,720 --> 11:16:17,720 and then that same name. 14365 11:16:17,720 --> 11:16:20,720 So, that's Chuck, and so that gives us one. 14366 11:16:20,720 --> 11:16:23,720 And then what we do is we're going to fetch one record 14367 11:16:23,720 --> 11:16:24,720 from the cursor because that's a select 14368 11:16:24,720 --> 11:16:26,720 and it gives us back a cursor. 14369 11:16:26,720 --> 11:16:28,720 There's only hopefully one record there because it's unique. 
14370 11:16:28,720 --> 11:16:30,720 I could put a limit one in there, 14371 11:16:30,720 --> 11:16:31,720 but that would be kind of redundant 14372 11:16:31,720 --> 11:16:34,720 because the name is a unique key. 14373 11:16:34,720 --> 11:16:37,720 And then the sub-zero just means if there were more than one thing 14374 11:16:37,720 --> 11:16:40,720 that I was selecting, which we'll see in a bit, 14375 11:16:40,720 --> 11:16:42,720 the sub-zero is just the first thing. 14376 11:16:42,720 --> 11:16:46,720 And so, this is going to give us the integer user ID 14377 11:16:46,720 --> 11:16:50,720 that was assigned, or if we're coming through later 14378 11:16:50,720 --> 11:16:53,720 for Chuck, you know, Chuck later, Charlie later, 14379 11:16:53,720 --> 11:16:54,720 that will be the old one. 14380 11:16:54,720 --> 11:16:57,720 So, this is inserted if it doesn't exist, 14381 11:16:57,720 --> 11:17:00,720 and this is get the newly created ID field 14382 11:17:00,720 --> 11:17:02,720 or the original ID field. 14383 11:17:02,720 --> 11:17:06,720 And so, part of this works by having both a logical key 14384 11:17:06,720 --> 11:17:07,720 and a primary key. 14385 11:17:07,720 --> 11:17:10,720 The primary key is auto-generated, 14386 11:17:10,720 --> 11:17:12,720 but the name is a logical key and it's unique. 14387 11:17:12,720 --> 11:17:15,720 And so, that's our trick to get that assigned thing. 14388 11:17:15,720 --> 11:17:18,720 Before, we just looked at it in the user interface 14389 11:17:18,720 --> 11:17:22,720 of SQLite browser and wrote it down, 14390 11:17:22,720 --> 11:17:23,720 but this is how we do it in code. 14391 11:17:23,720 --> 11:17:25,720 So, we need to know what that key is, 14392 11:17:25,720 --> 11:17:26,720 whether it was new or not. 14393 11:17:26,720 --> 11:17:29,720 And then we do the exact same pattern for the course, 14394 11:17:29,720 --> 11:17:31,720 except we're inserting the course title. 14395 11:17:31,720 --> 11:17:36,720 So, that's no big deal. 
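The two-step pattern above, insert if missing and then select the primary key, can be wrapped in a small helper. This is a sketch: the helper name `user_id_for` is made up, and the table layout is assumed from the description. The course table would follow the same pattern with the title as the logical key.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute(
    "CREATE TABLE User (id INTEGER PRIMARY KEY AUTOINCREMENT, name TEXT UNIQUE)"
)


def user_id_for(cur, name):
    """Return the primary key for name, creating the row if needed."""
    # Step 1: insert the logical key if it is not there yet.
    cur.execute("INSERT OR IGNORE INTO User (name) VALUES (?)", (name,))
    # Step 2: look up the primary key, new or old. The placeholder keeps
    # the value out of the SQL text, which avoids SQL injection.
    cur.execute("SELECT id FROM User WHERE name = ?", (name,))
    # fetchone() returns a tuple of the selected columns; [0] is the id.
    return cur.fetchone()[0]


first = user_id_for(cur, "Chuck")
second = user_id_for(cur, "Sue")
again = user_id_for(cur, "Chuck")  # same person later: the old id comes back
conn.commit()
```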
14396 11:17:36,720 --> 11:17:40,720 And so, we're going to get the user ID, course ID. 14397 11:17:40,720 --> 11:17:43,720 And then what we're going to do is we're going to insert 14398 11:17:43,720 --> 11:17:44,720 or replace. 14399 11:17:44,720 --> 11:17:47,720 So, this is basically if they're, 14400 11:17:47,720 --> 11:17:50,720 remember that this user ID, course ID combination 14401 11:17:50,720 --> 11:17:54,720 is the primary key for this member table. 14402 11:17:54,720 --> 11:17:56,720 If there is a duplicate, 14403 11:17:56,720 --> 11:18:00,720 if this combination is already there, 14404 11:18:00,720 --> 11:18:02,720 this becomes effectively an update statement. 14405 11:18:02,720 --> 11:18:04,720 And we have these two number values. 14406 11:18:04,720 --> 11:18:07,720 Now, what's missing here is the role is not there. 14407 11:18:07,720 --> 11:18:12,720 And so, user ID, course ID, this is the SQL bit. 14408 11:18:12,720 --> 11:18:15,720 And now we have a tuple with two items in it. 14409 11:18:15,720 --> 11:18:18,720 And that's because we have two question marks. 14410 11:18:18,720 --> 11:18:19,720 And then we commit it. 14411 11:18:19,720 --> 11:18:21,720 And as I mentioned before, 14412 11:18:21,720 --> 11:18:23,720 sometimes you want to commit every time through. 14413 11:18:23,720 --> 11:18:27,720 The commit is, it turns out that these things are less costly, 14414 11:18:27,720 --> 11:18:30,720 but that's because it's not always writing all the way to disk. 14415 11:18:30,720 --> 11:18:32,720 Whereas when you enter the commit, 14416 11:18:32,720 --> 11:18:34,720 it's going to go and write everything to disk, 14417 11:18:34,720 --> 11:18:36,720 pause until it's complete, 14418 11:18:36,720 --> 11:18:38,720 and then your program doesn't continue. 14419 11:18:38,720 --> 11:18:42,720 So, sometimes we don't run this every single time through. 14420 11:18:42,720 --> 11:18:43,720 Okay? 14421 11:18:43,720 --> 11:18:44,720 So, let's just go ahead and run this.
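A sketch of insert-or-replace against a composite primary key. The walkthrough's statement passes only the two IDs; the role column is included here purely to make the replacement visible, which is an assumption about how the role might be added, not the code shown in the walkthrough.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("""
CREATE TABLE Member (
    user_id   INTEGER,
    course_id INTEGER,
    role      INTEGER,
    PRIMARY KEY (user_id, course_id)
)
""")

SQL = "INSERT OR REPLACE INTO Member (user_id, course_id, role) VALUES (?, ?, ?)"

cur.execute(SQL, (1, 1, 0))  # new combination: inserted
cur.execute(SQL, (1, 1, 1))  # same combination: acts like an update
cur.execute(SQL, (1, 2, 0))  # same user, different course: inserted

# One commit after the batch rather than one per statement.
conn.commit()

members = cur.execute(
    "SELECT user_id, course_id, role FROM Member ORDER BY user_id, course_id"
).fetchall()
print(members)
```

Committing once at the end is faster than committing on every pass through a loop, at the cost of losing the uncommitted rows if the program stops partway.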
14422 11:18:44,720 --> 11:18:46,720 The only thing we're going to see is the output 14423 11:18:46,720 --> 11:18:49,720 of the name and the title as it's running. 14424 11:18:49,720 --> 11:18:56,720 So, if I do python3 roster.py, 14425 11:18:56,720 --> 11:18:58,720 hopefully I can hit enter. 14426 11:18:58,720 --> 11:18:59,720 So, you'll notice, by the way, 14427 11:18:59,720 --> 11:19:02,720 that this SQLite now exists, right? 14428 11:19:02,720 --> 11:19:04,720 And it has no data in it. 14429 11:19:04,720 --> 11:19:08,720 So, let me see if I can open this database and see it. 14430 11:19:08,720 --> 11:19:10,720 So, you see that there's no data. 14431 11:19:10,720 --> 11:19:11,720 So, we're the code. 14432 11:19:11,720 --> 11:19:15,720 We've run this code, in effect, up to this point. 14433 11:19:15,720 --> 11:19:18,720 So, we've done all the create tables and all that stuff. 14434 11:19:18,720 --> 11:19:20,720 So, the create tables are there. 14435 11:19:20,720 --> 11:19:21,720 So, all this data is here. 14436 11:19:21,720 --> 11:19:23,720 It did it. 14437 11:19:23,720 --> 11:19:25,720 We haven't started putting any data into it yet 14438 11:19:25,720 --> 11:19:26,720 because if we look at browse data, 14439 11:19:26,720 --> 11:19:29,720 we're not finding anything in here. 14440 11:19:29,720 --> 11:19:30,720 Okay? 14441 11:19:30,720 --> 11:19:32,720 There's no data to browse. 14442 11:19:32,720 --> 11:19:34,720 Now, hopefully we won't have locked ourselves 14443 11:19:34,720 --> 11:19:37,720 because we are sitting right here. 14444 11:19:37,720 --> 11:19:39,720 And when I hit enter over here, 14445 11:19:39,720 --> 11:19:40,720 then it's going to go, 14446 11:19:40,720 --> 11:19:42,720 and it's just going to run really fast. 14447 11:19:42,720 --> 11:19:43,720 So, I'll hit enter. 14448 11:19:43,720 --> 11:19:44,720 It'll read it. 14449 11:19:44,720 --> 11:19:46,720 And so, it inserted all of those things.
14450 11:19:46,720 --> 11:19:48,720 And now it's been changed. 14451 11:19:48,720 --> 11:19:50,720 And if I hit refresh over here, 14452 11:19:50,720 --> 11:19:52,720 we will see in the user, 14453 11:19:52,720 --> 11:19:55,720 it just sort of assigned user IDs, right? 14454 11:19:55,720 --> 11:19:57,720 The column's auto-assigned. 14455 11:19:57,720 --> 11:19:59,720 We will find in the course that those courses 14456 11:19:59,720 --> 11:20:01,720 are all auto-assigned. 14457 11:20:01,720 --> 11:20:02,720 There's the courses. 14458 11:20:02,720 --> 11:20:05,720 And there's no duplicates because this is unique, right? 14459 11:20:05,720 --> 11:20:08,720 And so, these are the newly created things. 14460 11:20:08,720 --> 11:20:12,720 But then membership is user ID, course ID. 14461 11:20:12,720 --> 11:20:14,720 And so, again, the primary key, 14462 11:20:14,720 --> 11:20:17,720 as it were, the unique constraint slash primary key 14463 11:20:17,720 --> 11:20:19,720 is the combination of these things. 14464 11:20:19,720 --> 11:20:20,720 And I haven't put anything in role. 14465 11:20:20,720 --> 11:20:22,720 And so, if you scroll through these, 14466 11:20:22,720 --> 11:20:25,720 you'll see all of the users who are members 14467 11:20:25,720 --> 11:20:29,720 of the courses that they're part of, okay? 14468 11:20:29,720 --> 11:20:31,720 So, there you go. 14469 11:20:31,720 --> 11:20:35,720 And I'll leave it up to you to come up with a join. 14470 11:20:35,720 --> 11:20:38,720 I'll leave it up to you to figure out 14471 11:20:38,720 --> 11:20:39,720 how to put the role in. 14472 11:20:39,720 --> 11:20:42,720 But I just wanted to kind of give you 14473 11:20:42,720 --> 11:20:44,720 a bit of a walkthrough of this code base.
14474 11:20:44,720 --> 11:20:48,720 And in particular, the tricks of the uniqueness keys, 14475 11:20:48,720 --> 11:20:51,720 the auto-increment keys, the logical key uniqueness, 14476 11:20:51,720 --> 11:20:53,720 kind of composite primary key, 14477 11:20:53,720 --> 11:20:56,720 and then the trick of insert or ignore. 14478 11:20:56,720 --> 11:20:59,720 And then the quick select that comes right afterwards 14479 11:20:59,720 --> 11:21:04,720 to get the newly generated ID or to get the old ID. 14480 11:21:04,720 --> 11:21:06,720 You can insert or replace, 14481 11:21:06,720 --> 11:21:10,720 which is a combination of a insert and an update. 14482 11:21:10,720 --> 11:21:14,720 So, I hope you found this example useful 14483 11:21:14,720 --> 11:21:25,720 and can apply it and basically create many-to-many tables. 14484 11:21:25,720 --> 11:21:27,720 We are doing some code walkthroughs. 14485 11:21:27,720 --> 11:21:29,720 If you want to follow along with the code, 14486 11:21:29,720 --> 11:21:31,720 you can download the source code 14487 11:21:31,720 --> 11:21:36,720 from the Python for Everybody website, okay? 14488 11:21:36,720 --> 11:21:40,720 So, the code we're playing with today is twfriends.py. 14489 11:21:40,720 --> 11:21:44,720 And this is a step beyond the simple TW Spider. 14490 11:21:44,720 --> 11:21:46,720 It is a restartable spider. 14491 11:21:46,720 --> 11:21:48,720 But we're going to data model things a little bit differently. 14492 11:21:48,720 --> 11:21:49,720 We're going to have two tables, 14493 11:21:49,720 --> 11:21:53,720 and we're going to have a many-to-many relationship, 14494 11:21:53,720 --> 11:21:56,720 except that it's sort of a many-to-many relationship 14495 11:21:56,720 --> 11:21:59,720 between the same table, which is okay. 14496 11:21:59,720 --> 11:22:05,720 Friends is a, Twitter Friends are a directional relationship. 14497 11:22:05,720 --> 11:22:09,720 And so, we start out here in twfriends.py. 
14498 11:22:09,720 --> 11:22:12,720 Remember that the file hidden.py, 14499 11:22:12,720 --> 11:22:14,720 I'll show it to you, but I'm not going to open it 14500 11:22:14,720 --> 11:22:16,720 because I've got my keys and secrets in it. 14501 11:22:16,720 --> 11:22:19,720 So, this hidden.py file, you've got to edit that, 14502 11:22:19,720 --> 11:22:22,720 and you've got to go to apps.twitter.com 14503 11:22:22,720 --> 11:22:24,720 and get your keys and put them in there. 14504 11:22:24,720 --> 11:22:26,720 Otherwise, these things won't work. 14505 11:22:26,720 --> 11:22:29,720 But, if you have Twitter and you set your API keys up 14506 11:22:29,720 --> 11:22:32,720 and you put them in hidden.py, then all these things will work. 14507 11:22:32,720 --> 11:22:34,720 It's kind of fun, actually, and impressive. 14508 11:22:34,720 --> 11:22:37,720 Not hard to do, actually. 14509 11:22:37,720 --> 11:22:41,720 So, the Twitter URL, that's my library 14510 11:22:41,720 --> 11:22:44,720 that reads hidden.py and augments the URL 14511 11:22:44,720 --> 11:22:46,720 and does all the OAuth stuff. 14512 11:22:46,720 --> 11:22:48,720 JSON and SSL because Twitter doesn't, 14513 11:22:48,720 --> 11:22:51,720 I mean, because Python doesn't accept any certificates, 14514 11:22:51,720 --> 11:22:55,720 even if they're good certificates, so we kind of crush that. 14515 11:22:55,720 --> 11:22:57,720 Here's our friends list that we're going to hit. 14516 11:22:57,720 --> 11:23:00,720 We're going to make a database, friends.sqlite. 14517 11:23:00,720 --> 11:23:03,720 Now, here we're doing create table if not exists. 14518 11:23:03,720 --> 11:23:05,720 So, what this really is saying is, 14519 11:23:05,720 --> 11:23:07,720 I want this to be a restartable process 14520 11:23:07,720 --> 11:23:09,720 and I don't want to lose the data. 
14521 11:23:09,720 --> 11:23:15,720 We're starting out, we do not have SQLite, 14522 11:23:15,720 --> 11:23:18,720 any SQLite files, and so this is going to create the database 14523 11:23:18,720 --> 11:23:21,720 and create these tables, but the second time we run it, 14524 11:23:21,720 --> 11:23:23,720 we're not going to recreate the tables. 14525 11:23:23,720 --> 11:23:25,720 We're going to be able to restart this 14526 11:23:25,720 --> 11:23:31,720 because we're going to run out of rate limit 14527 11:23:31,720 --> 11:23:34,720 before we finish this, so we just have to wait. 14528 11:23:34,720 --> 11:23:36,720 We're going to have a people table, 14529 11:23:36,720 --> 11:23:39,720 and we're going to have a primary key in the name. 14530 11:23:39,720 --> 11:23:41,720 The name is going to be unique, 14531 11:23:41,720 --> 11:23:43,720 and whether or not we've retrieved it, 14532 11:23:43,720 --> 11:23:45,720 and that's kind of from a previous one, 14533 11:23:45,720 --> 11:23:49,720 but then there's the who follows who, the from ID to to ID, 14534 11:23:49,720 --> 11:23:51,720 and so this is a direction, 14535 11:23:51,720 --> 11:23:53,720 and we're going to put a uniqueness constraint in, 14536 11:23:53,720 --> 11:23:56,720 just like we do in many to manys that basically says, 14537 11:23:56,720 --> 11:23:59,720 the combination of from ID and to ID has got to be unique. 
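A sketch of the restartable schema described above: create the tables only if they are missing, and let a uniqueness constraint on the pair of IDs stop duplicate follow links. The table and column names (People, Follows, retrieved, from_id, to_id), the sample account name, and the in-memory database are assumptions for the illustration; the walkthrough uses a file so the data survives between runs.

```python
import sqlite3

SCHEMA = """
CREATE TABLE IF NOT EXISTS People (
    id        INTEGER PRIMARY KEY,
    name      TEXT UNIQUE,
    retrieved INTEGER
);
CREATE TABLE IF NOT EXISTS Follows (
    from_id INTEGER,
    to_id   INTEGER,
    UNIQUE (from_id, to_id)
);
"""

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

cur.executescript(SCHEMA)  # first run: tables are created
cur.execute("INSERT INTO People (name, retrieved) VALUES (?, 0)", ("drchuck",))
conn.commit()

cur.executescript(SCHEMA)  # "restart": nothing is dropped or recreated
people_count = cur.execute("SELECT COUNT(*) FROM People").fetchone()[0]

# from_id and to_id can each repeat, but a given (from_id, to_id) pair
# is stored only once; the direction matters, so (1, 2) and (2, 1) differ.
for pair in [(1, 2), (1, 3), (2, 1), (1, 2)]:
    cur.execute("INSERT OR IGNORE INTO Follows (from_id, to_id) VALUES (?, ?)", pair)
conn.commit()

follows = cur.execute(
    "SELECT from_id, to_id FROM Follows ORDER BY from_id, to_id"
).fetchall()
print(people_count, follows)
```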
14538 11:23:59,720 --> 11:24:01,720 We don't allow ourselves, 14539 11:24:01,720 --> 11:24:04,720 to put duplicates of the combination, 14540 11:24:04,720 --> 11:24:06,720 so from ID can be one in many records, 14541 11:24:06,720 --> 11:24:08,720 and to ID can be one in many records, 14542 11:24:08,720 --> 11:24:11,720 but one one is only allowed once, 14543 11:24:11,720 --> 11:24:14,720 and this is the crud we have to do to convince Python 14544 11:24:14,720 --> 11:24:17,720 to accept the Twitter certificate, 14545 11:24:17,720 --> 11:24:20,720 and so this is similar to some of the other stuff that we've done. 14546 11:24:20,720 --> 11:24:23,720 We're going to enter a Twitter account or quit, 14547 11:24:23,720 --> 11:24:26,720 and if we hit enter by itself, 14548 11:24:26,720 --> 11:24:29,720 then we will actually go and retrieve the data 14549 11:24:29,720 --> 11:24:32,720 then we will actually go and retrieve a record 14550 11:24:32,720 --> 11:24:34,720 that was not yet retrieved, 14551 11:24:34,720 --> 11:24:39,720 and now we're actually pulling out two values, ID and name, 14552 11:24:39,720 --> 11:24:41,720 and so we will grab, 14553 11:24:41,720 --> 11:24:44,720 fetch one is going to give us a two-tuple basically, 14554 11:24:44,720 --> 11:24:46,720 and we're going to store that in ID and account. 14555 11:24:46,720 --> 11:24:48,720 Of course that's like, 14556 11:24:48,720 --> 11:24:50,720 this is coming back with a two-tuple, 14557 11:24:50,720 --> 11:24:52,720 first of which is the ID from the database. 14558 11:24:52,720 --> 11:24:55,720 Limit one means we're only going to get one of these, 14559 11:24:55,720 --> 11:24:56,720 or zero of these. 14560 11:24:56,720 --> 11:24:57,720 If there are zero of these, 14561 11:24:57,720 --> 11:25:00,720 that means there are no unretrieved Twitter accounts. 14562 11:25:00,720 --> 11:25:01,720 Retrieved equals zero. 
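Picking the next account that has not been retrieved yet can be sketched as follows. The helper name `next_unretrieved` and the sample account names are made up; the sketch checks for None explicitly to handle the case where LIMIT 1 returns zero rows.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute(
    "CREATE TABLE People (id INTEGER PRIMARY KEY, name TEXT UNIQUE, retrieved INTEGER)"
)
cur.executemany(
    "INSERT INTO People (name, retrieved) VALUES (?, ?)",
    [("drchuck", 1), ("opencontent", 0)],
)
conn.commit()


def next_unretrieved(cur):
    """Return (id, name) for one unretrieved account, or None if there is none."""
    cur.execute("SELECT id, name FROM People WHERE retrieved = 0 LIMIT 1")
    row = cur.fetchone()  # a two-tuple, or None when no row matched
    if row is None:
        return None
    person_id, account = row
    return person_id, account


found = next_unretrieved(cur)

# Mark it as retrieved; now nothing is left to pick.
cur.execute("UPDATE People SET retrieved = 1 WHERE id = ?", (found[0],))
conn.commit()
nothing_left = next_unretrieved(cur)
```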
14563 11:25:01,720 --> 11:25:02,720 Well, you'll see in a second 14564 11:25:02,720 --> 11:25:06,720 that all the new accounts we put in 14565 11:25:06,720 --> 11:25:08,720 are the ones for which we haven't retrieved, 14566 11:25:08,720 --> 11:25:09,720 and again, given that our rate limit, 14567 11:25:09,720 --> 11:25:13,720 we want to know which ones we've retrieved, okay? 14568 11:25:13,720 --> 11:25:18,720 And so what we're going to do next 14569 11:25:18,720 --> 11:25:20,720 is we're going to check to see 14570 11:25:20,720 --> 11:25:24,720 if the person that we just checked, 14571 11:25:24,720 --> 11:25:25,720 which means the length of the account is greater 14572 11:25:25,720 --> 11:25:27,720 than we just were entered, 14573 11:25:27,720 --> 11:25:30,720 we're going to check to see if they're already there, okay? 14574 11:25:30,720 --> 11:25:33,720 And we're going to select ID from people where name equals, 14575 11:25:33,720 --> 11:25:35,720 so that's the one we just entered, 14576 11:25:35,720 --> 11:25:37,720 and we're going to fetch one and grab the first thing 14577 11:25:37,720 --> 11:25:41,720 because we only got one thing in the select statement here. 14578 11:25:41,720 --> 11:25:45,720 And if this person that we just asked to see 14579 11:25:45,720 --> 11:25:47,720 is not in the table, 14580 11:25:47,720 --> 11:25:49,720 that means this is going to fail, 14581 11:25:49,720 --> 11:25:51,720 we're going to do an insert or ignore. 
14582 11:25:51,720 --> 11:25:52,720 This or ignore is kind of redundant 14583 11:25:52,720 --> 11:25:54,720 because we just checked to see if it was there, 14584 11:25:54,720 --> 11:25:57,720 but we'll put that in just to be safe, 14585 11:25:57,720 --> 11:25:59,720 and we're going to put the name in 14586 11:25:59,720 --> 11:26:02,720 for the new account that we're looking at, 14587 11:26:02,720 --> 11:26:06,720 and we're indicating that retrieved is zero, 14588 11:26:06,720 --> 11:26:09,720 so that we will know that we haven't retrieved it yet. 14589 11:26:09,720 --> 11:26:11,720 You'll see that we'll update that in a second. 14590 11:26:11,720 --> 11:26:14,720 We commit it so that later selects will see this, 14591 11:26:14,720 --> 11:26:18,720 so that you've got to do the commit. 14592 11:26:18,720 --> 11:26:21,720 This later select wouldn't see the one we just inserted, 14593 11:26:21,720 --> 11:26:24,720 and we're going to ask how many rows were affected, 14594 11:26:24,720 --> 11:26:26,720 and if it's not equal to one, 14595 11:26:26,720 --> 11:26:29,720 then we're going to complain about we inserted it, 14596 11:26:29,720 --> 11:26:31,720 and we are going to do this thing. 14597 11:26:31,720 --> 11:26:35,720 We're going to ask, hey, remember there was an ID up there? 14598 11:26:35,720 --> 11:26:37,720 Doo doo doo. 14599 11:26:37,720 --> 11:26:39,720 Right here, ID, integer, primary key, 14600 11:26:39,720 --> 11:26:43,720 and we did not insert this here, 14601 11:26:43,720 --> 11:26:45,720 but we want to know what that ID is, 14602 11:26:45,720 --> 11:26:47,720 and every time I was showing you that in lectures, 14603 11:26:47,720 --> 11:26:49,720 I was saying it's really easy in Python to do this, 14604 11:26:49,720 --> 11:26:51,720 and that's what we're saying. 
14605 11:26:51,720 --> 11:26:53,720 This cursor did the insert, 14606 11:26:53,720 --> 11:26:55,720 but one of the things that happens is after the insert, 14607 11:26:55,720 --> 11:26:57,720 we're going to grab the last row ID, 14608 11:26:57,720 --> 11:27:03,720 which is the primary key that was assigned by SQL. 14609 11:27:03,720 --> 11:27:06,720 Okay, and so that means that one way or another, 14610 11:27:06,720 --> 11:27:08,720 coming through this code here in line 45, 14611 11:27:08,720 --> 11:27:10,720 one way or another, we're either going to know 14612 11:27:10,720 --> 11:27:13,720 the ID of the user that was there before, 14613 11:27:13,720 --> 11:27:15,720 or we just inserted one, 14614 11:27:15,720 --> 11:27:17,720 and so we're going to know the primary key of the current user, 14615 11:27:17,720 --> 11:27:18,720 and you'll see why we need that. 14616 11:27:18,720 --> 11:27:22,720 So ID is the primary key of the current user 14617 11:27:22,720 --> 11:27:24,720 that we entered right here. 14618 11:27:24,720 --> 11:27:25,720 Okay? 14619 11:27:25,720 --> 11:27:28,720 And now what we're going to do is do the Twitter URL augment 14620 11:27:28,720 --> 11:27:30,720 with the OAuth and all the keys and the secrets 14621 11:27:30,720 --> 11:27:31,720 and hidden.py. 14622 11:27:31,720 --> 11:27:33,720 Instead, we're going to go through, let's count 1000. 14623 11:27:33,720 --> 11:27:36,720 Let's go count, what the heck, let's go 200, 14624 11:27:36,720 --> 11:27:38,720 up to 200 friends. 14625 11:27:38,720 --> 11:27:39,720 Save. 14626 11:27:39,720 --> 11:27:40,720 No, let's do 100. 14627 11:27:40,720 --> 11:27:41,720 We'll keep it that way. 14628 11:27:41,720 --> 11:27:43,720 And then we're going to retrieve it, 14629 11:27:43,720 --> 11:27:46,720 and we're retrieving the account. 14630 11:27:46,720 --> 11:27:48,720 We're not going to print the nasty URL out. 14631 11:27:48,720 --> 11:27:49,720 We could.
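The look-up-or-insert step, with the row count check and the last row ID, can be sketched like this. It is a reconstruction from the spoken description, not the file itself: the function name `lookup_or_insert` is hypothetical and the table layout is assumed.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute(
    "CREATE TABLE People (id INTEGER PRIMARY KEY, name TEXT UNIQUE, retrieved INTEGER)"
)


def lookup_or_insert(conn, cur, account):
    """Return the primary key for account, inserting it if it is new."""
    cur.execute("SELECT id FROM People WHERE name = ? LIMIT 1", (account,))
    row = cur.fetchone()
    if row is not None:
        return row[0]  # the account was already in the table

    # Not there yet: insert it, marked as not yet retrieved.
    cur.execute(
        "INSERT OR IGNORE INTO People (name, retrieved) VALUES (?, 0)",
        (account,),
    )
    conn.commit()
    if cur.rowcount != 1:  # how many rows the insert affected
        raise RuntimeError("Error inserting account: " + account)
    # The key SQLite assigned to the row this cursor just inserted.
    return cur.lastrowid


first = lookup_or_insert(conn, cur, "drchuck")
again = lookup_or_insert(conn, cur, "drchuck")
other = lookup_or_insert(conn, cur, "opencontent")
```

Either branch ends with the primary key of the current account in hand, which is what the follow links need.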
14632 11:27:49,720 --> 11:27:52,720 Then we're going to open the URL with a connection, 14633 11:27:52,720 --> 11:27:53,720 and then we're going to read that, 14634 11:27:53,720 --> 11:27:55,720 and we're going to get the UTF-8 data from this, 14635 11:27:55,720 --> 11:27:57,720 and then we're going to decode that, 14636 11:27:57,720 --> 11:27:59,720 and we're going to have the Unicode data, 14637 11:27:59,720 --> 11:28:02,720 so the data string is an internal Python string 14638 11:28:02,720 --> 11:28:05,720 with all that data representing all the wonderful characters. 14639 11:28:05,720 --> 11:28:08,720 And of course, we're going to ask urlopen 14640 11:28:08,720 --> 11:28:12,720 to give us back the headers as a dictionary using this call, 14641 11:28:12,720 --> 11:28:17,720 and we can see how many we have left, 14642 11:28:17,720 --> 11:28:21,720 what the remaining rate limit is that we have. 14643 11:28:21,720 --> 11:28:23,720 So then what we're going to do is parse the data 14644 11:28:23,720 --> 11:28:25,720 with json.loads. 14645 11:28:25,720 --> 11:28:29,720 If, oh wait, I need to continue in here. 14646 11:28:29,720 --> 11:28:31,720 Continue. 14647 11:28:31,720 --> 11:28:33,720 Save. 14648 11:28:33,720 --> 11:28:36,720 If we can't parse this data, we'll print it out. 14649 11:28:36,720 --> 11:28:38,720 So that means that this died, 14650 11:28:38,720 --> 11:28:41,720 which means it's not syntactically correct JSON, basically. 14651 11:28:41,720 --> 11:28:44,720 And who knows if we're ever going to see that, 14652 11:28:44,720 --> 11:28:47,720 but at least when it blows up, it'll print this data out. 14653 11:28:47,720 --> 11:28:50,720 We'll have to catch it, and then it'll continue. 14654 11:28:50,720 --> 11:28:52,720 Actually, I'll make this a break. 14655 11:28:52,720 --> 11:28:55,720 Because if that's blowing up that bad, we should quit.
14656 11:28:55,720 --> 11:28:59,720 Now, I don't yet know what happens 14657 11:28:59,720 --> 11:29:01,720 when this rate limit says you can't have it. 14658 11:29:01,720 --> 11:29:05,720 But I do know that I expect when it's successful 14659 11:29:05,720 --> 11:29:08,720 that there will be a key of users 14660 11:29:08,720 --> 11:29:11,720 in this outer dictionary that we're going to get. 14661 11:29:11,720 --> 11:29:14,720 And if this outer dictionary, 14662 11:29:14,720 --> 11:29:17,720 if users is not in the parse dictionary, 14663 11:29:17,720 --> 11:29:19,720 then I'm going to dump out this data 14664 11:29:19,720 --> 11:29:22,720 so that at least I can debug what happens 14665 11:29:22,720 --> 11:29:25,720 when I've got some broken JSON. 14666 11:29:25,720 --> 11:29:28,720 So the difference between this code, 14667 11:29:28,720 --> 11:29:33,720 this code is going to fail when the JSON is syntactically bad, 14668 11:29:33,720 --> 11:29:36,720 meaning a curly brace isn't right or whatever. 14669 11:29:36,720 --> 11:29:39,720 This code will trigger when I get good JSON, 14670 11:29:39,720 --> 11:29:43,720 but I don't have a users key in it. 14671 11:29:43,720 --> 11:29:47,720 So then, once we've retrieved it, 14672 11:29:47,720 --> 11:29:49,720 we're pretty happy with it. 14673 11:29:49,720 --> 11:29:52,720 We're going to update for our account that we're retrieving. 14674 11:29:52,720 --> 11:29:56,720 We're going to set this as one of our retrieved accounts. 14675 11:29:56,720 --> 11:29:59,720 And then what we're going to do is write a loop 14676 11:29:59,720 --> 11:30:03,720 that goes through all the friends of this particular user 14677 11:30:03,720 --> 11:30:05,720 that we're asking and gets their screen name. 14678 11:30:05,720 --> 11:30:07,720 Prints it out. 
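
The two failure cases described above (text that is not valid JSON at all, versus valid JSON with no users key) might be separated like this. It is a sketch under assumptions: the function name extract_users and the sample payloads are made up for illustration, and the real program reads its bytes from a network connection rather than from literals.

```python
import json

def extract_users(raw_bytes):
    """Decode UTF-8 bytes and return the list stored under 'users', or None."""
    data = raw_bytes.decode()          # bytes -> internal Python string (UTF-8 by default)
    try:
        js = json.loads(data)
    except json.JSONDecodeError as err:
        # Case 1: syntactically bad JSON, e.g. a missing curly brace
        print("Unable to parse JSON:", err)
        print(data)
        return None
    if not isinstance(js, dict) or "users" not in js:
        # Case 2: well-formed JSON, but not the structure we expected
        print("Incorrect JSON received")
        print(json.dumps(js, indent=4))
        return None
    return js["users"]

# Hypothetical sample payloads
good = b'{"users": [{"screen_name": "example_friend"}]}'
for user in extract_users(good):
    print(user["screen_name"])
```

Printing the offending data in both branches keeps the text around for debugging, which is the point made in the narration.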
14679 11:30:07,720 --> 11:30:10,720 And then we're going to check to see 14680 11:30:10,720 --> 11:30:13,720 if this one is already in our people database 14681 11:30:13,720 --> 11:30:15,720 because this is a spider. 14682 11:30:15,720 --> 11:30:17,720 We're grabbing accounts. 14683 11:30:17,720 --> 11:30:20,720 And so we'll do a friend ID 14684 11:30:20,720 --> 11:30:23,720 and do a fetchone, grab the sub-zero thing. 14685 11:30:23,720 --> 11:30:26,720 And if that works, if this person's not in there, 14686 11:30:26,720 --> 11:30:28,720 this fetchone is going to blow up, 14687 11:30:28,720 --> 11:30:30,720 which means we're going to drop down to the except code. 14688 11:30:30,720 --> 11:30:34,720 But if it does work, we have friend ID is, 14689 11:30:34,720 --> 11:30:37,720 you know, they're in there 14690 11:30:37,720 --> 11:30:39,720 and they're already in our database. 14691 11:30:39,720 --> 11:30:41,720 They just weren't retrieved. 14692 11:30:41,720 --> 11:30:44,720 And so now, if the friend ID wasn't there, 14693 11:30:44,720 --> 11:30:47,720 we're going to do an insert into setting retrieved to zero 14694 11:30:47,720 --> 11:30:50,720 and then we're going to commit. 14695 11:30:50,720 --> 11:30:54,720 Now, remember rowcount is how many rows 14696 11:30:54,720 --> 11:30:56,720 were affected by this last transaction, 14697 11:30:56,720 --> 11:30:58,720 cur.rowcount, and we're going to die 14698 11:30:58,720 --> 11:31:02,720 if that insert doesn't work. This is unlikely, 14699 11:31:02,720 --> 11:31:05,720 unless somehow we've run out of disk drive or something. 14700 11:31:05,720 --> 11:31:09,720 And we're going to grab the friend ID as the key, 14701 11:31:09,720 --> 11:31:11,720 the last row that was inserted. 14702 11:31:11,720 --> 11:31:12,720 We're only going to insert one row, 14703 11:31:12,720 --> 11:31:14,720 so it's basically the primary key 14704 11:31:14,720 --> 11:31:16,720 of the row that we just inserted.
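
The look-up-or-insert step described above can be sketched as below. The table layout and the helper name friend_id_for are assumptions for illustration. One detail worth spelling out: fetchone() itself returns None when no row matches, and it is the [0] subscript on that None that raises, which is what sends the code into the except branch.

```python
import sqlite3

# Assumption: same People layout as used elsewhere in this walkthrough.
conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("""CREATE TABLE People
    (id INTEGER PRIMARY KEY, name TEXT UNIQUE, retrieved INTEGER)""")

def friend_id_for(cur, conn, name):
    """Return (primary key, was_new) for a screen name, inserting it if needed."""
    cur.execute("SELECT id FROM People WHERE name = ? LIMIT 1", (name,))
    try:
        friend_id = cur.fetchone()[0]      # None[0] raises TypeError when there is no row
        return friend_id, False            # already in the database
    except TypeError:
        cur.execute("INSERT OR IGNORE INTO People (name, retrieved) VALUES (?, 0)",
                    (name,))
        conn.commit()
        if cur.rowcount != 1:              # the insert should touch exactly one row
            raise RuntimeError("Error inserting account: " + name)
        return cur.lastrowid, True         # key of the row just inserted

print(friend_id_for(cur, conn, "example_friend"))   # inserted on the first call
print(friend_id_for(cur, conn, "example_friend"))   # found on the second call
```

Either way the function comes out the bottom with a usable key, and the was_new flag is the kind of thing a new/old counter could be built on.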
14705 11:31:16,720 --> 11:31:19,720 So if you look at this code right here, 14706 11:31:19,720 --> 11:31:21,720 it comes out the bottom one way or another 14707 11:31:21,720 --> 11:31:23,720 with friend ID successful. 14708 11:31:23,720 --> 11:31:27,720 Friend ID is either they're already in our database 14709 11:31:27,720 --> 11:31:28,720 or they're not. 14710 11:31:28,720 --> 11:31:30,720 And if we insert them, then we have it. 14711 11:31:30,720 --> 11:31:33,720 And so now, this count new and count old 14712 11:31:33,720 --> 11:31:35,720 is just so I can print out a nice print out. 14713 11:31:35,720 --> 11:31:38,720 Now we are going to insert into the friend table, 14714 11:31:38,720 --> 11:31:40,720 which is called the follows table in this case, 14715 11:31:40,720 --> 11:31:42,720 from ID and to ID. 14716 11:31:42,720 --> 11:31:47,720 Those are the two outward pointing foreign keys. 14717 11:31:47,720 --> 11:31:50,720 And we have the ID of the account 14718 11:31:50,720 --> 11:31:52,720 that we are retrieving the friends of 14719 11:31:52,720 --> 11:31:54,720 and then this particular friend. 14720 11:31:54,720 --> 11:31:56,720 And so we're inserting the connection 14721 11:31:56,720 --> 11:31:58,720 from this person to that person. 14722 11:31:58,720 --> 11:32:00,720 And then we commit it. 14723 11:32:00,720 --> 11:32:03,720 We want to commit these again so that later selects, 14724 11:32:03,720 --> 11:32:05,720 when the loop goes back up, 14725 11:32:05,720 --> 11:32:09,720 later selects get all of that data that's going on. 14726 11:32:09,720 --> 11:32:11,720 So we do want to commit from time to time 14727 11:32:11,720 --> 11:32:13,720 and then we close the cursor at the very end. 14728 11:32:13,720 --> 11:32:19,720 So let's run this and see what happens. 14729 11:32:19,720 --> 11:32:29,720 Okay, so Python twfriends.py. Oh, of course. 
14730 11:32:29,720 --> 11:32:32,720 I am a refugee from Python 2, 14731 11:32:32,720 --> 11:32:35,720 so I always forget to type Python 3. 14732 11:32:35,720 --> 11:32:38,720 Okay, so we're going to start. 14733 11:32:38,720 --> 11:32:40,720 If we take a look right now, 14734 11:32:40,720 --> 11:32:42,720 I'm going to start another tab over here 14735 11:32:42,720 --> 11:32:46,720 and ls-l star sqlite. 14736 11:32:46,720 --> 11:32:49,720 Now that sqlite file is there, right? 14737 11:32:49,720 --> 11:32:51,720 And it's actually made the tables. 14738 11:32:51,720 --> 11:32:53,720 If you go up here, it ran all this stuff. 14739 11:32:53,720 --> 11:32:55,720 Create the tables, yada yada, 14740 11:32:55,720 --> 11:32:57,720 and we're sitting right here at this line. 14741 11:32:57,720 --> 11:32:58,720 As a matter of fact, I think, 14742 11:32:58,720 --> 11:33:00,720 without causing too much trouble, 14743 11:33:00,720 --> 11:33:03,720 I can open that database 14744 11:33:03,720 --> 11:33:05,720 and get into this database right here 14745 11:33:05,720 --> 11:33:07,720 and there is no data in the follows table 14746 11:33:07,720 --> 11:33:09,720 and there is no data in the people table. 14747 11:33:09,720 --> 11:33:11,720 It's completely empty, okay? 14748 11:33:11,720 --> 11:33:13,720 So we're waiting for the first one. 14749 11:33:13,720 --> 11:33:17,720 And I'll go with mine, Dr. Chuck. 14750 11:33:17,720 --> 11:33:19,720 So it's retrieving the 100 friends 14751 11:33:19,720 --> 11:33:21,720 and they all were brand new. 14752 11:33:21,720 --> 11:33:24,720 They're all inserted, right? 14753 11:33:24,720 --> 11:33:26,720 And so now if I hit refresh, 14754 11:33:26,720 --> 11:33:31,720 we will see that Dr. Chuck is retrieved. 14755 11:33:31,720 --> 11:33:32,720 Who follows? 14756 11:33:32,720 --> 11:33:34,720 So these are all the people I follow. 14757 11:33:34,720 --> 11:33:36,720 One follows two. 
14758 11:33:36,720 --> 11:33:37,720 So if we look at here, 14759 11:33:37,720 --> 11:33:39,720 we see that Dr. Chuck follows Stephanie Teasley. 14760 11:33:39,720 --> 11:33:41,720 Because we grabbed the followers of Dr. Chuck, 14761 11:33:41,720 --> 11:33:43,720 you know, we're gonna have a record 14762 11:33:43,720 --> 11:33:45,720 in all of the follows 14763 11:33:45,720 --> 11:33:47,720 for all the ones that I did, right? 14764 11:33:47,720 --> 11:33:49,720 So these are all the people I followed 14765 11:33:49,720 --> 11:33:52,720 and we put them in, okay? 14766 11:33:52,720 --> 11:33:55,720 So we can go back 14767 11:33:55,720 --> 11:33:58,720 and we can, let's see, grab somebody. 14768 11:33:58,720 --> 11:34:02,720 Let's go grab Stephanie Teasley. 14769 11:34:02,720 --> 11:34:06,720 And let's pull out her friends. 14770 11:34:06,720 --> 11:34:10,720 So we grabbed a hundred of her folks. 14771 11:34:10,720 --> 11:34:11,720 I got 14 left. 14772 11:34:11,720 --> 11:34:13,720 That's my x-rate limit. 14773 11:34:13,720 --> 11:34:14,720 So I did Stephanie Teasley, 14774 11:34:14,720 --> 11:34:16,720 so let's go back here. 14775 11:34:16,720 --> 11:34:18,720 So you'll notice there's 101. 14776 11:34:18,720 --> 11:34:21,720 There's probably gonna be, oh, 182. 14777 11:34:21,720 --> 11:34:22,720 That's interesting. 14778 11:34:22,720 --> 11:34:24,720 So we've retrieved Dr. Chuck and Stephanie Teasley 14779 11:34:24,720 --> 11:34:27,720 and let's go take a look in the friends table, 14780 11:34:27,720 --> 11:34:30,720 the follows table, okay? 14781 11:34:30,720 --> 11:34:32,720 So we have all the people I follow. 14782 11:34:32,720 --> 11:34:34,720 Now all the people Stephanie follows. 14783 11:34:34,720 --> 11:34:37,720 Okay, so there we go. 14784 11:34:37,720 --> 11:34:39,720 So let's go ahead and do somebody else. 14785 11:34:39,720 --> 11:34:43,720 Let's see, I think we both follow Tim McKay. 14786 11:34:43,720 --> 11:34:50,720 Where's Tim McKay? 
14787 11:34:50,720 --> 11:34:52,720 Yeah, let's follow Tim McKay. 14788 11:34:52,720 --> 11:34:53,720 Let's see who Tim follows. 14789 11:34:53,720 --> 11:34:57,720 See if we can get like an overlap. 14790 11:34:57,720 --> 11:34:59,720 Oh, we revisited some. 14791 11:34:59,720 --> 11:35:05,720 Let's see if we can see this in the follows. 14792 11:35:05,720 --> 11:35:07,720 Let's see people. 14793 11:35:07,720 --> 11:35:09,720 So we've got Dr. Chuck retrieved 14794 11:35:09,720 --> 11:35:17,720 and Tim McKay's somewhere down here. 14795 11:35:17,720 --> 11:35:19,720 You know, it might take us a while 14796 11:35:19,720 --> 11:35:23,720 before we get any really good overlaps. 14797 11:35:23,720 --> 11:35:25,720 Let's see. 14798 11:35:25,720 --> 11:35:28,720 Let's do a database call. 14799 11:35:28,720 --> 11:35:35,720 Let's see, let's do a database SQL. 14800 11:35:35,720 --> 11:35:40,720 Select. 14801 11:35:40,720 --> 11:35:46,720 Count. 14802 11:35:46,720 --> 11:35:51,720 Eh. 14803 11:35:51,720 --> 11:35:53,720 Okay, so let's just run this some more. 14804 11:35:53,720 --> 11:35:54,720 It's clearly working. 14805 11:35:54,720 --> 11:35:57,720 Now one thing I can do here is I can hit enter 14806 11:35:57,720 --> 11:36:00,720 and it will just pick one randomly. 14807 11:36:00,720 --> 11:36:03,720 So it grabbed live EDU TV and I can, 14808 11:36:03,720 --> 11:36:05,720 and let's see how many I got left. 14809 11:36:05,720 --> 11:36:06,720 We got 12 left. 14810 11:36:06,720 --> 11:36:10,720 And now I can hit enter again and it picks another one. 14811 11:36:10,720 --> 11:36:12,720 That was the next one. 14812 11:36:12,720 --> 11:36:13,720 I was kind of picking them in order. 14813 11:36:13,720 --> 11:36:14,720 Is it picking them in order? 14814 11:36:14,720 --> 11:36:16,720 Let's go to people. 14815 11:36:16,720 --> 11:36:18,720 Yeah, it's picking these. 
14816 11:36:18,720 --> 11:36:20,720 So we can see that it's going to just do 14817 11:36:20,720 --> 11:36:23,720 the first unretrieved person, who's Nancy. 14818 11:36:23,720 --> 11:36:25,720 Let's let it retrieve Nancy. 14819 11:36:25,720 --> 11:36:27,720 So it grabbed Nancy, new. 14820 11:36:27,720 --> 11:36:28,720 So we're finding some. 14821 11:36:28,720 --> 11:36:30,720 And this table's getting really big. 14822 11:36:30,720 --> 11:36:32,720 And so if we look at the people table, 14823 11:36:32,720 --> 11:36:35,720 we now have 455 people. 14824 11:36:35,720 --> 11:36:40,720 And we have 467 following records. 14825 11:36:40,720 --> 11:36:43,720 And so there we go. 14826 11:36:43,720 --> 11:36:44,720 Oops. 14827 11:36:44,720 --> 11:36:45,720 Hit enter. 14828 11:36:45,720 --> 11:36:47,720 It does another one. 14829 11:36:47,720 --> 11:36:48,720 And away we go. 14830 11:36:48,720 --> 11:36:50,720 So you get the idea. 14831 11:36:50,720 --> 11:36:54,720 I can type quit to finish. 14832 11:36:54,720 --> 11:37:00,720 And just to give you a little interesting 14833 11:37:00,720 --> 11:37:02,720 bit of code to show you how to do selects, 14834 11:37:02,720 --> 11:37:04,720 I'm going to do this twjoin. 14835 11:37:04,720 --> 11:37:06,720 Now you'll notice that we're not talking. 14836 11:37:06,720 --> 11:37:08,720 Oh, let's show you one thing. 14837 11:37:08,720 --> 11:37:14,720 ls -l friends star sqlite. 14838 11:37:14,720 --> 11:37:16,720 So this database has it. 14839 11:37:16,720 --> 11:37:20,720 So I can restart this process and run it again. 14840 11:37:20,720 --> 11:37:22,720 And the database is still there. 14841 11:37:22,720 --> 11:37:27,720 And so we just grab Swear Trek. 14842 11:37:27,720 --> 11:37:29,720 And so we can keep doing this. 14843 11:37:29,720 --> 11:37:32,720 And so this data, it keeps extending. 14844 11:37:32,720 --> 11:37:35,720 And so this is a restartable process. 14845 11:37:35,720 --> 11:37:36,720 I can run it.
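
Picking the first unretrieved person when the user just hits enter could look something like the sketch below. The table contents, the helper names and the ORDER BY id clause are assumptions added here to make the choice deterministic; they are not quoted from the course code.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("""CREATE TABLE People
    (id INTEGER PRIMARY KEY, name TEXT UNIQUE, retrieved INTEGER)""")
# Hypothetical starting state: one account retrieved, two still waiting
cur.executemany("INSERT INTO People (name, retrieved) VALUES (?, ?)",
                [("drchuck", 1), ("first_friend", 0), ("second_friend", 0)])
conn.commit()

def next_unretrieved(cur):
    """Return (id, name) of the first account not yet retrieved, or None."""
    cur.execute("SELECT id, name FROM People WHERE retrieved = 0 ORDER BY id LIMIT 1")
    return cur.fetchone()

def mark_retrieved(cur, conn, name):
    """Flag an account as done so a restarted run skips it."""
    cur.execute("UPDATE People SET retrieved = 1 WHERE name = ?", (name,))
    conn.commit()

first = next_unretrieved(cur)
mark_retrieved(cur, conn, first[1])
second = next_unretrieved(cur)
print(first, second)
```

Because the retrieved flag lives in the database file rather than in program memory, stopping and restarting the program picks up exactly where the previous run left off.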
14846 11:37:36,720 --> 11:37:39,720 And then tell it to grab the next unretrieved one. 14847 11:37:39,720 --> 11:37:42,720 And so away we go, right? 14848 11:37:42,720 --> 11:37:46,720 And so that's part of it. 14849 11:37:46,720 --> 11:37:51,720 So if I run out of my, I've got eight left. 14850 11:37:51,720 --> 11:37:53,720 Oh, how many do I have left, really? 14851 11:37:53,720 --> 11:37:58,720 Let's keep going. 14852 11:37:58,720 --> 11:37:59,720 How many do I got left? 14853 11:37:59,720 --> 11:38:01,720 I got five left. 14854 11:38:01,720 --> 11:38:02,720 Okay. 14855 11:38:02,720 --> 11:38:03,720 Wait. 14856 11:38:03,720 --> 11:38:04,720 Oh, I guess we'll just run it out. 14857 11:38:04,720 --> 11:38:06,720 So I got four left. 14858 11:38:06,720 --> 11:38:08,720 You know what I should do is I can't change the code. 14859 11:38:08,720 --> 11:38:10,720 At least I can't change the code. 14860 11:38:10,720 --> 11:38:13,720 I can stop the code and I can quit the code. 14861 11:38:13,720 --> 11:38:18,720 So what I'm going to do is I'm going to change this code a little bit really quick. 14862 11:38:18,720 --> 11:38:24,720 And I'm going to print the headers are rate limiting at the beginning 14863 11:38:24,720 --> 11:38:31,720 and at the end. 14864 11:38:31,720 --> 11:38:32,720 So now I can run it again. 14865 11:38:32,720 --> 11:38:33,720 I changed the code. 14866 11:38:33,720 --> 11:38:35,720 Hopefully I didn't make a Python error. 14867 11:38:35,720 --> 11:38:37,720 Tell it to go get another one and a Navarro. 14868 11:38:37,720 --> 11:38:41,720 And so I got three left. 14869 11:38:41,720 --> 11:38:42,720 Oops. 14870 11:38:42,720 --> 11:38:46,720 We'll see what happens when I run out of rate limit. 14871 11:38:46,720 --> 11:38:49,720 Run out of rate limit. 14872 11:38:49,720 --> 11:38:51,720 So we have one left. 14873 11:38:51,720 --> 11:38:53,720 Hit enter. 14874 11:38:53,720 --> 11:38:55,720 Hit control K. 14875 11:38:55,720 --> 11:38:57,720 Open source dot org. 
14876 11:38:57,720 --> 11:38:58,720 So we have zero left. 14877 11:38:58,720 --> 11:38:59,720 That worked. 14878 11:38:59,720 --> 11:39:01,720 Now let's see what happens. 14879 11:39:01,720 --> 11:39:04,720 I don't know what happens next. 14880 11:39:04,720 --> 11:39:06,720 Oh, we blew up. 14881 11:39:06,720 --> 11:39:07,720 Too many requests. 14882 11:39:07,720 --> 11:39:10,720 Oh, we got an HTTP error 429. 14883 11:39:10,720 --> 11:39:17,720 So that means that, going for Mark Cuban, that was in line 48. 14884 11:39:17,720 --> 11:39:24,720 So the right thing to do would be in line 48. 14885 11:39:24,720 --> 11:39:27,720 We should really put this in a try/except block. 14886 11:39:27,720 --> 11:39:39,720 Try/except block because it gives us an error. 14887 11:39:39,720 --> 11:39:40,720 Print. 14888 11:39:40,720 --> 11:39:42,720 Oh, fiddlesticks. 14889 11:39:42,720 --> 11:39:44,720 How do I print the exception message? 14890 11:39:44,720 --> 11:39:53,720 I always am forgetting. Print failed to retrieve. 14891 11:39:53,720 --> 11:39:56,720 So we'll put that in. 14892 11:39:56,720 --> 11:40:04,720 Now if I run it. 14893 11:40:04,720 --> 11:40:13,720 And then I have to put a break here because that's not a good break. 14894 11:40:13,720 --> 11:40:14,720 Failed to retrieve. 14895 11:40:14,720 --> 11:40:15,720 Now I've got to figure out. 14896 11:40:15,720 --> 11:40:20,720 Oh, see, I never know how to print out the error message. 14897 11:40:20,720 --> 11:40:22,720 Yeah. 14898 11:40:22,720 --> 11:40:24,720 So I have to... 14899 11:40:24,720 --> 11:40:28,720 See, that's the weird thing about stuff is that I don't ever remember enough. 14900 11:40:28,720 --> 11:40:34,720 I don't remember the syntax, what I say here, to print the error message out. 14901 11:40:34,720 --> 11:40:37,720 So I'm going to go to Google. 14902 11:40:37,720 --> 11:40:48,720 And I'm going to say, print out the exception message in Python.
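
For reference, one common way to wrap a retrieval in try/except and print the exception message in Python 3 is sketched below. The function name retrieve, the opener parameter and the stand-in out_of_requests are assumptions so that the sketch runs without touching the network; the real program calls urllib.request.urlopen directly inside its loop and breaks on failure.

```python
import urllib.request
import urllib.error

def retrieve(url, opener=urllib.request.urlopen):
    """Return the decoded response body, or None if the request fails."""
    try:
        connection = opener(url)
    except Exception as err:
        # "as err" binds the exception object; printing it shows its message
        print("Failed to Retrieve", err)
        return None
    return connection.read().decode()

def out_of_requests(url):
    """Hypothetical stand-in for urlopen that simulates an exhausted rate limit."""
    raise urllib.error.HTTPError(url, 429, "Too Many Requests", {}, None)

result = retrieve("http://example.com/friends", opener=out_of_requests)
print("Result:", result)
```

Inside a loop, the return None would typically be a break, since a 429 means further requests will keep failing until the rate limit resets.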
14903 11:40:48,720 --> 11:40:50,720 Print out the exception message in Python. 14904 11:40:50,720 --> 11:40:57,720 Oh, Python 3, hello. 14905 11:40:57,720 --> 11:41:03,720 Okay, so let's go find it here in the documentation. 14906 11:41:03,720 --> 11:41:09,720 Except, except. 14907 11:41:09,720 --> 11:41:11,720 Is this it? 14908 11:41:11,720 --> 11:41:28,720 Is this what I say? 14909 11:41:28,720 --> 11:41:34,720 I just want to print out the message. 14910 11:41:34,720 --> 11:41:36,720 Ah, that's it. 14911 11:41:36,720 --> 11:41:39,720 Except. 14912 11:41:39,720 --> 11:41:55,720 Let's try this. 14913 11:41:55,720 --> 11:41:59,720 So this is part of Python programming, is like, for me at least. 14914 11:41:59,720 --> 11:42:06,720 Because I'm just not like a genius expert at this stuff. 14915 11:42:06,720 --> 11:42:09,720 This is one thing I like about Python, is you can guess stuff. 14916 11:42:09,720 --> 11:42:11,720 And sometimes you guess right. 14917 11:42:11,720 --> 11:42:12,720 So there we go. 14918 11:42:12,720 --> 11:42:13,720 We got the error. 14919 11:42:13,720 --> 11:42:14,720 We got the nice little error message. 14920 11:42:14,720 --> 11:42:16,720 And we see error 429, too many requests. 14921 11:42:16,720 --> 11:42:19,720 So that cleans that up nicely. 14922 11:42:19,720 --> 11:42:22,720 So we have run out of requests. 14923 11:42:22,720 --> 11:42:28,720 And on that, it is a good time to say thanks for listening. 14924 11:42:28,720 --> 11:42:35,720 And I hope that you found this valuable. 14925 11:42:35,720 --> 11:42:39,720 Hello, and welcome to our final chapter, retrieving and visualizing data. 14926 11:42:39,720 --> 11:42:44,720 In this chapter, we are going to basically bring this all together. 14927 11:42:44,720 --> 11:42:49,720 Databases, web services, code loops, logic. 14928 11:42:49,720 --> 11:42:55,720 And we're going to solve a problem that is a multi-step data analysis.
14929 11:42:55,720 --> 11:42:58,720 We're going to find some data on the internet. 14930 11:42:58,720 --> 11:43:01,720 Might be HTML, might be an API or whatever. 14931 11:43:01,720 --> 11:43:06,720 And we're going to write a relatively slow process that's going to pull data slowly. 14932 11:43:06,720 --> 11:43:08,720 Because these are all rate limited. 14933 11:43:08,720 --> 11:43:11,720 This is a slow and restartable process. 14934 11:43:11,720 --> 11:43:13,720 So you can start this. 14935 11:43:13,720 --> 11:43:17,720 And what we're going to do is we're going to have a database that's going to hold the data that we're pulling. 14936 11:43:17,720 --> 11:43:22,720 And so this might take several days, actually, if you really have to do it. 14937 11:43:22,720 --> 11:43:24,720 And then you'll build up your data in your database. 14938 11:43:24,720 --> 11:43:28,720 And then what you tend to do is you tend to produce two databases. 14939 11:43:28,720 --> 11:43:38,720 One is kind of a raw database that, you know, all of its data columns are aimed at helping you figure out what you've got to retrieve yet. 14940 11:43:38,720 --> 11:43:39,720 And what you haven't retrieved yet. 14941 11:43:39,720 --> 11:43:42,720 So that's kind of a crawling spidering process. 14942 11:43:42,720 --> 11:43:46,720 And then you find that the data is kind of nasty and ugly. 14943 11:43:46,720 --> 11:43:51,720 And you find that before you're going to do any analysis, you probably want to clean and process it. 14944 11:43:51,720 --> 11:43:55,720 So in a lot of these, you're going to go from a raw database to a clean one. 14945 11:43:55,720 --> 11:43:57,720 And this is going to be really large. 14946 11:43:57,720 --> 11:43:59,720 And this is going to be really small. 14947 11:43:59,720 --> 11:44:02,720 And you're going to do this sort of once, but slowly. 
14948 11:44:02,720 --> 11:44:08,720 And you'll do this as many times as you need, changing this program, cleaning the data up over and over and over again. 14949 11:44:08,720 --> 11:44:10,720 And then you'll end up with really clean data. 14950 11:44:10,720 --> 11:44:11,720 And it's relatively small. 14951 11:44:11,720 --> 11:44:17,720 And you might run programs that will loop through this to do visualizations or analysis or some things or whatever. 14952 11:44:17,720 --> 11:44:23,720 And so you'll actually sort of use this database as a source of information. 14953 11:44:23,720 --> 11:44:24,720 OK. 14954 11:44:24,720 --> 11:44:28,720 So that's the basic pattern of what we're going to work with. 14955 11:44:28,720 --> 11:44:31,720 Now, this is what I call personal data mining. 14956 11:44:31,720 --> 11:44:36,720 And if you're going to do this seriously, Python is used in lots of data mining activities. 14957 11:44:36,720 --> 11:44:39,720 But if you're going to do data mining seriously with really, really large data sets, 14958 11:44:39,720 --> 11:44:46,720 we're doing small to medium-sized data sets as you might do sort of for individual personal research 14959 11:44:46,720 --> 11:44:51,720 versus like an organization research where you're processing the logs of a web server or something like that. 14960 11:44:51,720 --> 11:44:54,720 And there's lots and lots of wonderful technology. 14961 11:44:54,720 --> 11:44:58,720 And what's really cool is this technology just keeps getting better and better 14962 11:44:58,720 --> 11:45:04,720 because the whole data and mining data analysis natural language processing field is just so hot right now. 14963 11:45:04,720 --> 11:45:05,720 It's so awesome. 14964 11:45:05,720 --> 11:45:09,720 We're going to keep it simple and do stuff for ourselves for now. 
14965 11:45:09,720 --> 11:45:15,720 And I gave you a bunch of sample code that's going to make it so that you can adapt this sample code 14966 11:45:15,720 --> 11:45:18,720 to solve the problems that you need to solve. 14967 11:45:18,720 --> 11:45:21,720 So like I said, this is more of a programming exercise. 14968 11:45:21,720 --> 11:45:23,720 Data mining might be a lot more complex. 14969 11:45:23,720 --> 11:45:27,720 If you're doing simple research, this might actually model what you do pretty well. 14970 11:45:27,720 --> 11:45:34,720 So the first thing that we're going to do is use what's called Google's JSON API for geocoding. 14971 11:45:34,720 --> 11:45:36,720 And there are two versions of this. 14972 11:45:36,720 --> 11:45:41,720 One version requires a key and one version doesn't require a key. 14973 11:45:41,720 --> 11:45:45,720 Google used to make all this data available for free but with just a rate limit 14974 11:45:45,720 --> 11:45:48,720 but now they're increasingly requiring a key. 14975 11:45:48,720 --> 11:45:52,720 So I give you code in this zip file that kind of does both. 14976 11:45:52,720 --> 11:45:57,720 If you really wanted to do something in production of taking user entered places and names 14977 11:45:57,720 --> 11:46:03,720 and getting precise latitude longitude coordinates so you can produce a nice little Google map like this. 14978 11:46:03,720 --> 11:46:08,720 But since Google has made a rate limited API, 14979 11:46:08,720 --> 11:46:13,720 I've actually pre-spidered a copy of the Google data and I have my own sort of fake Google API 14980 11:46:13,720 --> 11:46:19,720 and so you can do your assignments and test all your code using my fake API 14981 11:46:19,720 --> 11:46:22,720 which has no rate limits and has no problems. 14982 11:46:22,720 --> 11:46:25,720 But it's only a limited set of the data.
14983 11:46:25,720 --> 11:46:32,720 And so this is the basic process and it's one of those things that it follows that basic personal data modeling. 14984 11:46:32,720 --> 11:46:34,720 Personal data mining pattern. 14985 11:46:34,720 --> 11:46:37,720 And so here's this API which is either Google or me. 14986 11:46:37,720 --> 11:46:41,720 I've got my own Dr. Chuck version of this, dr-chuck.net version of this. 14987 11:46:41,720 --> 11:46:45,720 And there is an input queue of the location. 14988 11:46:45,720 --> 11:46:49,720 So this is the user data where they just put in the name of where they think they live. 14989 11:46:49,720 --> 11:46:52,720 University of Michigan or something. 14990 11:46:52,720 --> 11:46:56,720 And so this is the queue of the things that are to be retrieved. 14991 11:46:56,720 --> 11:47:01,720 And in my case when I built this map for the first time, there was like 15,000. 14992 11:47:01,720 --> 11:47:04,720 And it took me days to get this. 14993 11:47:04,720 --> 11:47:05,720 And so it would stop. 14994 11:47:05,720 --> 11:47:11,720 And so what I would do is I would read the first one into this geoload.py, 14995 11:47:11,720 --> 11:47:13,720 check to see if I already had it in my database. 14996 11:47:13,720 --> 11:47:18,720 If I didn't already have it in the database, I would go into the API, pull the data down and I would put it in the database. 14997 11:47:18,720 --> 11:47:19,720 And then I would go to the next one. 14998 11:47:19,720 --> 11:47:20,720 The next one, the next one. 14999 11:47:20,720 --> 11:47:26,720 And so I might get a thousand in my database and then it blows up or I'm told I can't go any further. 15000 11:47:26,720 --> 11:47:27,720 So I wait 24 hours. 15001 11:47:27,720 --> 11:47:31,720 I start it up and it reads the first thousand and says, oh, they're all in the database already. 15002 11:47:31,720 --> 11:47:34,720 And then it starts at one thousand and one.
15003 11:47:34,720 --> 11:47:36,720 And then it adds that and adds that. 15004 11:47:36,720 --> 11:47:37,720 And then until it stops. 15005 11:47:37,720 --> 11:47:41,720 And so it took me several days of processing to get this data right. 15006 11:47:41,720 --> 11:47:45,720 Now, I didn't have a separate cleaning process because this data is pretty simple. 15007 11:47:45,720 --> 11:47:50,720 I was pulling out the JSON and latitude and longitude, etc. 15008 11:47:50,720 --> 11:47:54,720 And so I didn't have to do two separate processes to clean this data up. 15009 11:47:54,720 --> 11:47:56,720 It was clean enough right as I pulled it. 15010 11:47:56,720 --> 11:47:58,720 Because I was talking to an API. 15011 11:47:58,720 --> 11:48:02,720 If you're talking to the HTML, sometimes it gets nasty and ugly. 15012 11:48:02,720 --> 11:48:05,720 And so then I wrote this program that just reads through it. 15013 11:48:05,720 --> 11:48:12,720 It just does a select and, you know, reads through the stuff and it prints out some summary information and tells you what to do. 15014 11:48:12,720 --> 11:48:19,720 It also prints out and you'll see this pattern because, you know, I'm visualizing using browsers, HTML, 15015 11:48:19,720 --> 11:48:25,720 and this happens to be used in the Google Maps API and putting all the data in a little JavaScript file. 15016 11:48:25,720 --> 11:48:28,720 So these end up being assignment statements in JavaScript. 15017 11:48:28,720 --> 11:48:34,720 You can take a look at that file and all the data shows up as assignment statements in the JavaScript. 15018 11:48:34,720 --> 11:48:44,720 And then when this HTML loads, it reads this file and puts up all those pins as long as you have access to the in browser JavaScript API. 15019 11:48:44,720 --> 11:48:49,720 So the next thing we're going to talk about is page rank, which is spidering now HTML. 15020 11:48:49,720 --> 11:48:52,720 We talked a lot about this spider HTML, get some links. 
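
The idea of handing the data to the browser as a JavaScript assignment statement can be sketched as below. The variable name myData, the row layout (latitude, longitude, place name) and the sample row are assumptions for illustration; json.dumps is used here to get safe quoting, which is a choice made for this sketch rather than something taken from the course code.

```python
import json

def to_js(rows, varname="myData"):
    """Render rows as one JavaScript assignment statement holding an array."""
    lines = [json.dumps(list(row)) for row in rows]   # one JS array literal per row
    return varname + " = [\n" + ",\n".join(lines) + "\n];\n"

# Hypothetical row: latitude, longitude, place name
rows = [(42.28, -83.74, "Ann Arbor, MI")]
js_text = to_js(rows)
print(js_text)

# Writing it out so an HTML page can pull it in with a <script> tag:
# with open("where.js", "w", encoding="utf-8") as fhand:
#     fhand.write(js_text)
```

An HTML page that includes the generated file then finds the data sitting in a global variable, ready for the map code to loop over.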
15021 11:48:52,720 --> 11:48:59,720 And so up next, we're going to actually build a real database full featured search engine using page rank. 15022 11:49:04,720 --> 11:49:06,720 This is another worked code example. 15023 11:49:06,720 --> 11:49:11,720 You can download the sample code zip file if you want to follow along. 15024 11:49:11,720 --> 11:49:16,720 And the code that we're working on today is what I call the geodata code. 15025 11:49:16,720 --> 11:49:25,720 And that is code that is going to pull some locations from this file. 15026 11:49:25,720 --> 11:49:33,720 We're simulating or using the Google Places API to look places up and so we can visualize them on a map. 15027 11:49:33,720 --> 11:49:35,720 And so this is the basic picture. 15028 11:49:35,720 --> 11:49:41,720 If we take a look at this where.data file, it's just a flat file that has a list of organizations. 15029 11:49:41,720 --> 11:49:46,720 And this actually was pulled from one of my MOOC surveys. 15030 11:49:46,720 --> 11:49:52,720 We just let people type in where they went to school and this is just a sample of them. 15031 11:49:52,720 --> 11:49:56,720 So this data is read in by this program geoload.py. 15032 11:49:56,720 --> 11:50:00,720 And if you recall, this Google geodata has rate limits. 15033 11:50:00,720 --> 11:50:03,720 It also has API keys, which we'll talk about in a bit too. 15034 11:50:03,720 --> 11:50:08,720 And so the idea is this is a restartable spider-like process. 15035 11:50:08,720 --> 11:50:14,720 And so we want to be able to run this and have it blow up and run it and start it and not lose what we've got. 15036 11:50:14,720 --> 11:50:19,720 So we're now using a database as well as an API. 15037 11:50:19,720 --> 11:50:25,720 But in order to work around the rate limits of this API, we're going to use the database with a restartable process. 15038 11:50:25,720 --> 11:50:29,720 And then we'll make some sense of this and then we'll visualize this.
15039 11:50:29,720 --> 11:50:34,720 But in the short term, let's start with geoload.py code. 15040 11:50:34,720 --> 11:50:37,720 Geoload.py, take a look here. 15041 11:50:37,720 --> 11:50:42,720 So a lot of this hopefully by now is somewhat familiar to you. 15042 11:50:42,720 --> 11:50:47,720 urllib, JSON, SQLite. 15043 11:50:47,720 --> 11:50:53,720 And so I mentioned that the Google APIs, these used to be free and did not require an API key, 15044 11:50:53,720 --> 11:50:58,720 but increasingly they're making you do API keys for especially new ones. 15045 11:50:58,720 --> 11:51:05,720 And so what happens, you can go to your Google Places, go to Google APIs and get an API key. 15046 11:51:05,720 --> 11:51:09,720 And you can put it in here, it'll be this long, big long thing that looks like that. 15047 11:51:09,720 --> 11:51:13,720 And then if you have an API key, you can use the Places API. 15048 11:51:13,720 --> 11:51:19,720 And I've got a copy of a subset, not all of it, a subset of it here at this URL. 15049 11:51:19,720 --> 11:51:26,720 As a matter of fact, you can just go to this URL in a browser. 15050 11:51:26,720 --> 11:51:30,720 And it will tell you a list of the data that it knows about. 15051 11:51:30,720 --> 11:51:41,720 And I made it so that that does the same basic protocol with the address equals as the Google Places API. 15052 11:51:41,720 --> 11:51:46,720 So this will just change how we retrieve the data, either retrieve it from my server. 15053 11:51:46,720 --> 11:51:49,720 Nice thing about my server, it's got no rate limit. 15054 11:51:49,720 --> 11:51:53,720 It's really fast and you're not fighting with Google all the time. 15055 11:51:53,720 --> 11:51:59,720 And it means that perhaps if you're in a country where Google is not well supported, you can use my API. 15056 11:51:59,720 --> 11:52:04,720 And that's really strange that somehow my API is more reliable and available than the Google one. 15057 11:52:04,720 --> 11:52:06,720 But it's true.
15058 11:52:06,720 --> 11:52:08,720 So we're going to make a database. 15059 11:52:08,720 --> 11:52:12,720 We're going to do a create table if not exists, and we'll have some address. 15060 11:52:12,720 --> 11:52:15,720 And we're really just caching the geographical data. 15061 11:52:15,720 --> 11:52:17,720 We're going to cache the JSON. 15062 11:52:17,720 --> 11:52:21,720 One of the things we do when we build these processes is we tend to simplify these things 15063 11:52:21,720 --> 11:52:25,720 and not do all the calculation and parsing the JSON. 15064 11:52:25,720 --> 11:52:30,720 Just load it and get it in and load it and get it in and fill the data up in this database. 15065 11:52:30,720 --> 11:52:33,720 And so that's what we're going to do. 15066 11:52:33,720 --> 11:52:39,720 Because Python doesn't ship with any legitimate certificates, we have to sort of ignore certificate errors. 15067 11:52:39,720 --> 11:52:42,720 We're going to open the file. 15068 11:52:42,720 --> 11:52:49,720 And we're going to loop through it and pull out the address from the file. 15069 11:52:49,720 --> 11:52:56,720 And we're going to select from the geodata where that address is the address. 15070 11:52:56,720 --> 11:52:58,720 Let's move this in a bit. 15071 11:52:58,720 --> 11:53:03,720 And so we're going to do a select and pull out that address. 15072 11:53:03,720 --> 11:53:07,720 And the idea is if it's already in the database, we don't want to do it. 15073 11:53:07,720 --> 11:53:13,720 So we do a fetchone and pull out that first thing, which is the, that will be the JSON right there. 15074 11:53:13,720 --> 11:53:15,720 If we get that, we'll continue up. 15075 11:53:15,720 --> 11:53:18,720 Otherwise, we'll keep going. 15076 11:53:18,720 --> 11:53:20,720 Pass just means don't blow up. 15077 11:53:20,720 --> 11:53:22,720 So we except and we just do a pass. 15078 11:53:22,720 --> 11:53:24,720 That's like a no op.
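The check-the-cache-then-skip pattern described above can be sketched roughly as follows. This is a minimal illustration, not the course's geoload.py: the function names, the `fetch` callback, and the exact column names are assumptions made for the example, and the network call is left to the caller so the sketch stays self-contained.

```python
import sqlite3


def open_cache(path=":memory:"):
    """Open (or create) the cache database; safe to call on every restart."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS Locations (address TEXT, geodata TEXT)"
    )
    return conn


def cached_geodata(conn, address):
    """Return the cached JSON text for an address, or None if not cached."""
    row = conn.execute(
        "SELECT geodata FROM Locations WHERE address = ?", (address,)
    ).fetchone()
    return row[0] if row is not None else None


def load_addresses(conn, addresses, fetch):
    """Retrieve only the addresses that are not yet in the database.

    `fetch` is a hypothetical callable that takes an address and returns
    the raw JSON text; in a real run it would call the web service.
    """
    retrieved = 0
    for address in addresses:
        address = address.strip()
        if not address:
            continue
        if cached_geodata(conn, address) is not None:
            continue  # already in the database, so skip the network call
        data = fetch(address)
        conn.execute(
            "INSERT INTO Locations (address, geodata) VALUES (?, ?)",
            (address, data),
        )
        conn.commit()  # commit each row so an interruption loses nothing
        retrieved += 1
    return retrieved
```

Because every row is committed as soon as it arrives, stopping the program and running it again only costs the quick SELECT for each address already stored.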
15079 11:53:24,720 --> 11:53:31,720 And we're going to make a dictionary because that's what we do for the key value pairs. 15080 11:53:31,720 --> 11:53:34,720 Everything you've seen so far, I've used constants here. 15081 11:53:34,720 --> 11:53:39,720 But because we may or may not have an API key, query equals and then that's the address. 15082 11:53:39,720 --> 11:53:42,720 And then the key equals and then the API key. 15083 11:53:42,720 --> 11:53:49,720 If you recall, urlencode adds the pluses and question marks and all that nice stuff. 15084 11:53:49,720 --> 11:53:50,720 We're going to retrieve it. 15085 11:53:50,720 --> 11:53:52,720 We're going to read it and decode it. 15086 11:53:52,720 --> 11:53:55,720 Print out how much data we've got. 15087 11:53:55,720 --> 11:53:57,720 And add a count. 15088 11:53:57,720 --> 11:54:02,720 And then we're going to try to parse that JSON data and print it if something goes wrong. 15089 11:54:02,720 --> 11:54:09,720 And as we've seen, at this top level of this JSON data from this geocoding API is an object, 15090 11:54:09,720 --> 11:54:13,720 which we'll see a little bit of in a bit. 15091 11:54:13,720 --> 11:54:15,720 And it has a status field in it. 15092 11:54:15,720 --> 11:54:19,720 And the status is OK if things went well. 15093 11:54:19,720 --> 11:54:25,720 So if the status is not there, that means our JSON is not well formed or not how we expect it. 15094 11:54:25,720 --> 11:54:32,720 If the status is neither OK nor zero results, then print out failure to retrieve and then quit. 15095 11:54:32,720 --> 11:54:36,720 And then we're simply going to insert this new data that we just put in. 15096 11:54:36,720 --> 11:54:38,720 And then we're going to commit it. 15097 11:54:38,720 --> 11:54:41,720 And every tenth one, this is count mod 10. 15098 11:54:41,720 --> 11:54:43,720 We're going to pause for five seconds. 15099 11:54:43,720 --> 11:54:45,720 And we can hit control C here.
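As a rough sketch of the URL building and status check just described: the service URL, the `address` parameter name, and the helper names below are assumptions for illustration (the transcript mentions both "query equals" and "address equals"), and the status strings `OK` and `ZERO_RESULTS` are taken to be what the spoken "okay" and "zero results" refer to.

```python
import json
import urllib.parse

# Hypothetical endpoint, used only for this example.
SERVICE_URL = "https://example.com/geojson?"


def build_url(address, api_key=None, service_url=SERVICE_URL):
    """Build the request URL; the key is added only when one is supplied."""
    params = {"address": address}
    if api_key:
        params["key"] = api_key
    # urlencode turns spaces into pluses and escapes other special characters
    return service_url + urllib.parse.urlencode(params)


def status_is_acceptable(text):
    """Return True when the response is a JSON object whose status field
    is present and is either OK or ZERO_RESULTS."""
    try:
        js = json.loads(text)
    except json.JSONDecodeError:
        return False
    if not isinstance(js, dict) or "status" not in js:
        return False
    return js["status"] in ("OK", "ZERO_RESULTS")
```

Keeping the key optional means the same code can talk to a keyless mirror or to a service that requires a key, which is the situation the lecture describes.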
15100 11:54:45,720 --> 11:54:48,720 And then we're going to play the, do the geodump. 15101 11:54:48,720 --> 11:54:49,720 Okay. 15102 11:54:49,720 --> 11:54:51,720 So let's just run this. 15103 11:54:51,720 --> 11:54:54,720 Geodata. 15104 11:54:54,720 --> 11:54:56,720 Python. 15105 11:54:56,720 --> 11:54:58,720 So let's do an LS. 15106 11:54:58,720 --> 11:55:04,720 So we don't have, we do have, let's get rid of from a previous test, geodata.sqlite. 15107 11:55:04,720 --> 11:55:13,720 So we'll start with a fresh, fresh set of data and run python geoload.py. 15108 11:55:13,720 --> 11:55:17,720 Of course, I'm always forever making the mistake of forgetting Python 3. 15109 11:55:17,720 --> 11:55:19,720 So you can see that it's running. 15110 11:55:19,720 --> 11:55:21,720 And it's adding the query. 15111 11:55:21,720 --> 11:55:24,720 And in this case, I don't have the API key. 15112 11:55:24,720 --> 11:55:25,720 And it's putting the pluses in. 15113 11:55:25,720 --> 11:55:28,720 And that's this part here with all the pluses. 15114 11:55:28,720 --> 11:55:30,720 That's the urlencode. 15115 11:55:30,720 --> 11:55:31,720 And you notice it's pausing a bit. 15116 11:55:31,720 --> 11:55:35,720 Now it depends on how fast your net connection, this may or may not go so fast. 15117 11:55:35,720 --> 11:55:36,720 But this is not that much data. 15118 11:55:36,720 --> 11:55:40,720 So it should, it's like only 2,000, 3,000 characters. 15119 11:55:40,720 --> 11:55:45,720 And so it's working and talking to my server. 15120 11:55:45,720 --> 11:55:47,720 And the interesting thing here is I can blow this up. 15121 11:55:47,720 --> 11:55:49,720 I'm going to hit control C. 15122 11:55:49,720 --> 11:55:51,720 In Windows you'd hit control. 15123 11:55:51,720 --> 11:55:53,720 In Linux you'd hit control C. 15124 11:55:53,720 --> 11:55:55,720 And in Windows I think you'd hit control Z. 15125 11:55:55,720 --> 11:55:57,720 Depending on what shell you're working in.
15126 11:55:57,720 --> 11:55:58,720 But I'm going to hit control C. 15127 11:55:58,720 --> 11:56:00,720 And you see I sort of blew it up, right? 15128 11:56:00,720 --> 11:56:04,720 And that causes a traceback, a keyboard interrupt traceback. 15129 11:56:04,720 --> 11:56:09,720 If I do an LS minus L, you can see that now this geodata is there. 15130 11:56:09,720 --> 11:56:13,720 Now in the name of restarting, I will restart this. 15131 11:56:13,720 --> 11:56:16,720 And you will see that it checks and skips. 15132 11:56:16,720 --> 11:56:23,720 And so it runs this code here where it's right here. 15133 11:56:23,720 --> 11:56:24,720 It grabs it and finds it in the database. 15134 11:56:24,720 --> 11:56:26,720 So you'll see it say found in the database really quick. 15135 11:56:26,720 --> 11:56:27,720 Chop, chop, chop. 15136 11:56:27,720 --> 11:56:28,720 And go really fast. 15137 11:56:28,720 --> 11:56:32,720 And then it'll go back to catching up where it left off. 15138 11:56:32,720 --> 11:56:36,720 And so all those up there, they did not actually re-retrieve it, 15139 11:56:36,720 --> 11:56:38,720 because it knew about those things. 15140 11:56:38,720 --> 11:56:40,720 And so now it's catching up and doing some more, 15141 11:56:40,720 --> 11:56:43,720 and doing some more, and doing some more. 15142 11:56:43,720 --> 11:56:45,720 And then I'll hit control C. 15143 11:56:45,720 --> 11:56:47,720 It has a little counter in here that basically, 15144 11:56:47,720 --> 11:56:50,720 if it hits 200 it stops and you have to restart it. 15145 11:56:50,720 --> 11:56:52,720 You could obviously change this code. 15146 11:56:52,720 --> 11:56:54,720 You could make it so it didn't sleep. 15147 11:56:54,720 --> 11:56:57,720 It doesn't hurt to sleep for like a second after every 100 or so 15148 11:56:57,720 --> 11:56:58,720 if you want. 15149 11:56:58,720 --> 11:57:00,720 You could change that code. 15150 11:57:00,720 --> 11:57:04,720 And now let's just hit control C. 
15151 11:57:04,720 --> 11:57:05,720 And blow it up. 15152 11:57:05,720 --> 11:57:07,720 LS minus L. 15153 11:57:07,720 --> 11:57:08,720 And there is another bit of code. 15154 11:57:08,720 --> 11:57:12,720 And this code, it's always good to write these really simple things. 15155 11:57:12,720 --> 11:57:16,720 And so now we're going to import SQLite and JSON. 15156 11:57:16,720 --> 11:57:18,720 We're going to connect ourselves up. 15157 11:57:18,720 --> 11:57:22,720 We're going to open, except this is a UTF-8, 15158 11:57:22,720 --> 11:57:26,720 because we're going to open this with UTF-8. 15159 11:57:26,720 --> 11:57:29,720 And we're going to read through. 15160 11:57:29,720 --> 11:57:35,720 And in this case, we are going to decode. 15161 11:57:35,720 --> 11:57:37,720 We did select star from locations. 15162 11:57:37,720 --> 11:57:43,720 And if you recall, locations has a location and a geodata. 15163 11:57:43,720 --> 11:57:45,720 And so the sub-zero will be the location, 15164 11:57:45,720 --> 11:57:49,720 and the sub-one will be the geodata. 15165 11:57:49,720 --> 11:57:54,720 And we're going to parse it, convert it to a string, and then parse it. 15166 11:57:54,720 --> 11:57:57,720 If something goes wrong with the JSON, we'll just keep skipping it. 15167 11:57:57,720 --> 11:58:03,720 We'll check to see if we have the status in our JSON. 15168 11:58:03,720 --> 11:58:10,720 Let me run the SQLite browser here. 15169 11:58:10,720 --> 11:58:12,720 File, open database. 15170 11:58:12,720 --> 11:58:16,720 Let's take a look at what's in this database. 15171 11:58:16,720 --> 11:58:17,720 Oh, where are we? 15172 11:58:17,720 --> 11:58:19,720 Code three. 15173 11:58:19,720 --> 11:58:20,720 Geodata. 15174 11:58:20,720 --> 11:58:22,720 Geodata SQLite. 15175 11:58:22,720 --> 11:58:25,720 So this is the data we've got. 15176 11:58:25,720 --> 11:58:28,720 So if you make this a little bigger, if I can, can I make that bigger? 
15177 11:58:28,720 --> 11:58:30,720 Yeah, it's not going to show us much. 15178 11:58:30,720 --> 11:58:33,720 So you can see that these are the addresses in the geodata. 15179 11:58:33,720 --> 11:58:35,720 That's just the JSON. 15180 11:58:35,720 --> 11:58:38,720 So that's the JSON that we've got, and it retrieves it. 15181 11:58:38,720 --> 11:58:40,720 And so this is a really simple database. 15182 11:58:40,720 --> 11:58:42,720 It's just a sort of spidering process. 15183 11:58:42,720 --> 11:58:44,720 Run, run, run. 15184 11:58:44,720 --> 11:58:46,720 But now we're going to run the geodump code, 15185 11:58:46,720 --> 11:58:50,720 which is going to read this and dump this stuff out and print where.js, 15186 11:58:50,720 --> 11:58:52,720 so it's going to actually parse this stuff. 15187 11:58:52,720 --> 11:58:56,720 And that's code we've seen before. 15188 11:58:56,720 --> 11:58:57,720 So we're actually reading it. 15189 11:58:57,720 --> 11:58:59,720 And this line goes into the results. 15190 11:58:59,720 --> 11:59:01,720 The results is an array. 15191 11:59:01,720 --> 11:59:04,720 So if we go into results, results is an array. 15192 11:59:04,720 --> 11:59:07,720 We're going to go grab the zeroth item in that array. 15193 11:59:07,720 --> 11:59:11,720 And then we're going to go find geometry. 15194 11:59:11,720 --> 11:59:13,720 And then location. 15195 11:59:13,720 --> 11:59:16,720 And then lat and long for the latitude and longitude. 15196 11:59:16,720 --> 11:59:22,720 And then we're also going to take the actual address out of the formatted address right here. 15197 11:59:22,720 --> 11:59:28,720 So in this bit of code, we're actually parsing the JSON. 15198 11:59:28,720 --> 11:59:33,720 And we're going to clean things up, get rid of some single quotes. 15199 11:59:33,720 --> 11:59:36,720 This kind of data cleaning is just stuff after you play with it for a while. 15200 11:59:36,720 --> 11:59:39,720 You realize, oh, my data is ugly or does this. 
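The path walked through above (results, the zeroth item, geometry, location, then the latitude and longitude, plus the formatted address with single quotes removed) might look like this in code. It is a sketch under assumptions: the function name is made up for the example, and the key names `lat`, `lng`, and `formatted_address` follow the usual shape of geocoding responses rather than anything shown verbatim in the transcript.

```python
import json


def extract_place(geodata_text):
    """Pull (lat, lng, address) out of one cached JSON response.

    Returns None when the text is not valid JSON, the status is not OK,
    or the expected keys are missing, so bad rows can simply be skipped.
    """
    try:
        js = json.loads(geodata_text)
        if js["status"] != "OK":
            return None
        first = js["results"][0]          # results is an array; take item zero
        location = first["geometry"]["location"]
        lat, lng = location["lat"], location["lng"]
        where = first["formatted_address"]
    except (ValueError, KeyError, IndexError, TypeError):
        return None
    # Light cleanup: drop single quotes so they cannot break the JS output
    return lat, lng, where.replace("'", "")
```

Returning None for anything unexpected keeps the dump loop simple: each row is either usable or skipped, and one malformed response cannot stop the run.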
15201 11:59:39,720 --> 11:59:40,720 And I print it out. 15202 11:59:40,720 --> 11:59:42,720 And then I'm going to write this out. 15203 11:59:42,720 --> 11:59:45,720 And I'm going to write it into a JavaScript file. 15204 11:59:45,720 --> 11:59:50,720 And so the JavaScript file is this where.js. 15205 11:59:50,720 --> 11:59:53,720 And I'll show you what it looks like. 15206 11:59:53,720 --> 11:59:55,720 It's going to be overwritten. 15207 11:59:55,720 --> 11:59:57,720 This is the one that came out of the zip file. 15208 11:59:57,720 --> 11:59:59,720 It'll have the latitude, the longitude. 15209 11:59:59,720 --> 12:00:04,720 And we're going to use JavaScript to read this in this where.html file. 15210 12:00:04,720 --> 12:00:08,720 It's going to actually read this right there and pull that data in. 15211 12:00:08,720 --> 12:00:10,720 And that's how we're going to visualize. 15212 12:00:10,720 --> 12:00:16,720 I'm not going to go into great detail on how the visualization happens. 15213 12:00:16,720 --> 12:00:17,720 But that's what's happening. 15214 12:00:17,720 --> 12:00:18,720 And so we're going to write that. 15215 12:00:18,720 --> 12:00:20,720 So we're going to actually write this to a file. 15216 12:00:20,720 --> 12:00:30,720 So let's go ahead and run this code and say python3 geodump. 15217 12:00:30,720 --> 12:00:33,720 OK, so it wrote 120 records to where.js. 15218 12:00:33,720 --> 12:00:36,720 So if we look at where.js, this is now the new data 15219 12:00:36,720 --> 12:00:39,720 that I just downloaded moments ago. 15220 12:00:39,720 --> 12:00:51,720 And it says open where.html in a browser. 15221 12:00:51,720 --> 12:00:54,720 Now, this way you'll need the Google Maps API. 15222 12:00:54,720 --> 12:00:57,720 And you might not be able to see this depending on where you're at. 15223 12:00:57,720 --> 12:01:00,720 But here you go with Google Maps locations. 15224 12:01:00,720 --> 12:01:03,720 And I think if you hover over this, you can see. 
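Writing the parsed rows out as a JavaScript assignment statement, as described above, could be sketched like this. The variable name `myData`, the function name, and the row layout are assumptions for illustration; whatever name is used has to match what the HTML page expects.

```python
import json


def write_where_js(places, path="where.js"):
    """Write (lat, lng, address) rows as one JavaScript array assignment.

    The file is opened as UTF-8 so addresses with non-ASCII characters
    are written without errors.
    """
    rows = [
        json.dumps([lat, lng, where], ensure_ascii=False)
        for lat, lng, where in places
    ]
    with open(path, "w", encoding="utf-8") as fhand:
        fhand.write("myData = [\n")
        fhand.write(",\n".join(rows))
        fhand.write("\n];\n")
    return len(rows)
```

Using `json.dumps` for each row is a design choice made here, not something the lecture states: a JSON array is also a valid JavaScript array literal, so quoting and escaping are handled by the library rather than by hand.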
15225 12:01:03,720 --> 12:01:06,720 And you see the UTF, why we there in that particular thing, 15226 12:01:06,720 --> 12:01:11,720 why we had to use the UTF-8 when we wrote the file 15227 12:01:11,720 --> 12:01:13,720 so that we didn't end up with trouble writing the file out. 15228 12:01:13,720 --> 12:01:14,720 And so there you go. 15229 12:01:14,720 --> 12:01:19,720 And so that is a simple visualization. 15230 12:01:19,720 --> 12:01:22,720 And just a simple visualization. 15231 12:01:22,720 --> 12:01:24,720 It wrote this where.js. 15232 12:01:24,720 --> 12:01:27,720 If you are smart with HTML and JavaScript, 15233 12:01:27,720 --> 12:01:31,720 you can look at this where.html file. 15234 12:01:31,720 --> 12:01:35,720 It's really just reading through a bunch of data and putting the points. 15235 12:01:35,720 --> 12:01:36,720 That's all there is. 15236 12:01:36,720 --> 12:01:39,720 But I'm not going to go through that. 15237 12:01:39,720 --> 12:01:42,720 So at least not in this. 15238 12:01:42,720 --> 12:01:45,720 And so I hope that this was useful to you. 15239 12:01:45,720 --> 12:01:51,720 And thanks for watching. 15240 12:01:51,720 --> 12:01:53,720 So now we're going to write a search engine. 15241 12:01:53,720 --> 12:01:54,720 Do some of the things. 15242 12:01:54,720 --> 12:01:55,720 We're going to do page rank. 15243 12:01:55,720 --> 12:01:59,720 And we're going to visualize it in a web browser and show the weights. 15244 12:01:59,720 --> 12:02:02,720 We're really only going to do page rank on one page 15245 12:02:02,720 --> 12:02:05,720 because you want to have links that more than one page 15246 12:02:05,720 --> 12:02:07,720 that points to a page so that you can figure out 15247 12:02:07,720 --> 12:02:09,720 which pages are more or less important. 15248 12:02:09,720 --> 12:02:10,720 And then visualize it. 15249 12:02:10,720 --> 12:02:13,720 We'll run the page rank algorithm and we'll separately do all this. 
15250 12:02:13,720 --> 12:02:16,720 So at this point we're going to do pretty much the web crawling, 15251 12:02:16,720 --> 12:02:18,720 the index building, and the searching. 15252 12:02:18,720 --> 12:02:19,720 We're not going to really search it. 15253 12:02:19,720 --> 12:02:21,720 We're going to visualize the index. 15254 12:02:21,720 --> 12:02:24,720 But you could write a simple program to do searches for keywords 15255 12:02:24,720 --> 12:02:27,720 and figure out which page was the most likely page for a keyword. 15256 12:02:27,720 --> 12:02:30,720 And that would be a fun additional thing to do. 15257 12:02:30,720 --> 12:02:34,720 So the web crawler is this program that hits a page, 15258 12:02:34,720 --> 12:02:37,720 pulls down the HTML, parses the page, looks for links, 15259 12:02:37,720 --> 12:02:41,720 makes a queue of incoming links that are as yet unretrieved. 15260 12:02:41,720 --> 12:02:44,720 And I'm going to do this in a simple SQLite database. 15261 12:02:44,720 --> 12:02:48,720 It starts out with the database basically starts with one link as the starting point 15262 12:02:48,720 --> 12:02:50,720 and then it retrieves that page. 15263 12:02:50,720 --> 12:02:53,720 And then you see the database end up with lots of unretrieved pages. 15264 12:02:53,720 --> 12:02:56,720 And then it goes back in and picks a random page and retrieves that one. 15265 12:02:56,720 --> 12:02:58,720 And then it just expands and expands. 15266 12:02:58,720 --> 12:03:01,720 This code that I've built that you're going to play with 15267 12:03:01,720 --> 12:03:04,720 only stays on one website, otherwise it would go crazy. 15268 12:03:04,720 --> 12:03:08,720 And of course, Google doesn't use an SQLite database running on your hard drive. 15269 12:03:08,720 --> 12:03:10,720 But you'll get the idea. 15270 12:03:10,720 --> 12:03:13,720 You'll see this thing exponentially gain links. 
15271 12:03:13,720 --> 12:03:17,720 And you'll run it for a while, pull down 1,000 web pages or whatever. 15272 12:03:17,720 --> 12:03:23,720 But of course, make sure that you don't violate any terms and conditions. 15273 12:03:23,720 --> 12:03:27,720 And again, I've got some data sources that you can use. 15274 12:03:27,720 --> 12:03:29,720 And they're not rate limited. 15275 12:03:29,720 --> 12:03:31,720 But you can also use things like Wikipedia, 15276 12:03:31,720 --> 12:03:33,720 which I think they sort of discourage you. 15277 12:03:33,720 --> 12:03:36,720 Or DrChuck.com, which has no rate limit. 15278 12:03:36,720 --> 12:03:38,720 Or who knows what, right? 15279 12:03:38,720 --> 12:03:39,720 So just be careful. 15280 12:03:39,720 --> 12:03:40,720 Don't do this on Facebook. 15281 12:03:40,720 --> 12:03:42,720 And don't do it on Google. 15282 12:03:42,720 --> 12:03:43,720 Don't get yourself in trouble. 15283 12:03:43,720 --> 12:03:49,720 And if you're using, you know, an internet connection 15284 12:03:49,720 --> 12:03:51,720 where you're paying for bandwidth, be careful. 15285 12:03:51,720 --> 12:03:53,720 So this is the idea of the web crawler. 15286 12:03:53,720 --> 12:03:54,720 And this isn't my picture. 15287 12:03:54,720 --> 12:03:56,720 This is the classic picture of a web crawler. 15288 12:03:56,720 --> 12:04:02,720 Read a page, parse it, take all the URLs and stick them in a queue, 15289 12:04:02,720 --> 12:04:04,720 grab again and again. 15290 12:04:04,720 --> 12:04:07,720 So for us, the scheduler is going to do it as long as you'd say, 15291 12:04:07,720 --> 12:04:10,720 oh, do 100 pages or it runs until it blows up. 15292 12:04:10,720 --> 12:04:14,720 And again, these processes that have the network in the loop, 15293 12:04:14,720 --> 12:04:17,720 it's really important that they behave well when they blow up. 15294 12:04:17,720 --> 12:04:19,720 And that's why databases are so useful.
15295 12:04:19,720 --> 12:04:21,720 Because you can be writing along to the database. 15296 12:04:21,720 --> 12:04:25,720 And some random thing happens and blows your data up and you start over. 15297 12:04:25,720 --> 12:04:28,720 So you're reading these things, you're storing each page, 15298 12:04:28,720 --> 12:04:30,720 building up your storage, et cetera, et cetera. 15299 12:04:30,720 --> 12:04:32,720 So you just keep on doing that. 15300 12:04:32,720 --> 12:04:35,720 And with this program, you'll be able to retrieve some stuff, 15301 12:04:35,720 --> 12:04:37,720 then run the page rank, then you can retrieve them more, 15302 12:04:37,720 --> 12:04:39,720 and then you can run some more page rank. 15303 12:04:39,720 --> 12:04:44,720 And you can kind of see how Google sort of evolves its index over time. 15304 12:04:44,720 --> 12:04:46,720 Of course, we're so much simpler. 15305 12:04:46,720 --> 12:04:49,720 And like I said, be careful when you crawl. 15306 12:04:49,720 --> 12:04:52,720 You're going to run a crawler that just goes as fast as it can. 15307 12:04:52,720 --> 12:04:54,720 But Google doesn't do that. 15308 12:04:54,720 --> 12:04:57,720 It's careful not to overwhelm any websites. 15309 12:04:57,720 --> 12:05:02,720 It's trying to be smart about the use of your bandwidth on your website. 15310 12:05:02,720 --> 12:05:04,720 There is a file. 15311 12:05:04,720 --> 12:05:09,720 Our code won't bother looking at this. 15312 12:05:09,720 --> 12:05:12,720 But there's a file called robots.txt that real web crawlers look at, 15313 12:05:12,720 --> 12:05:15,720 and it gives a list of the things you are allowed to look at 15314 12:05:15,720 --> 12:05:17,720 and not allowed to look at. 15315 12:05:17,720 --> 12:05:19,720 And so if you go to Google and you see a search that says, 15316 12:05:19,720 --> 12:05:23,720 we are not allowed to show you the summary text of this page 15317 12:05:23,720 --> 12:05:26,720 because of the robots.txt, it's there. 
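The course code does not look at robots.txt, but for readers who want to see what honoring it involves, here is a small sketch using Python's standard `urllib.robotparser` module. The sample rules, the user agent name, and the helper function are made up for the example; the rules are parsed from a string so nothing is fetched from the network.

```python
from urllib.robotparser import RobotFileParser

# Made-up rules for illustration: everything is allowed except /private/.
SAMPLE_ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""


def allowed(robots_txt, user_agent, url):
    """Return True if the given robots.txt text lets user_agent fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)
```

A crawler would typically download the site's robots.txt once, then call a check like this before requesting each page.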
15318 12:05:26,720 --> 12:05:31,720 And you can go and you can actually see a robots.txt. 15319 12:05:31,720 --> 12:05:33,720 Just go to any website. 15320 12:05:33,720 --> 12:05:37,720 It's at the top root, blah, blah, blah, blah, blah, slash robots.txt. 15321 12:05:37,720 --> 12:05:38,720 It's not a path. 15322 12:05:38,720 --> 12:05:41,720 It's not slash this, slash that, slash something else, robots. 15323 12:05:41,720 --> 12:05:45,720 It's at the very, very top of a website. 15324 12:05:45,720 --> 12:05:47,720 The index building uses the page rank algorithm. 15325 12:05:47,720 --> 12:05:49,720 And the whole goal of the page rank algorithm 15326 12:05:49,720 --> 12:05:57,720 is to figure out which pages have the most best links. 15327 12:05:57,720 --> 12:05:59,720 So having the most links is really easy. 15328 12:05:59,720 --> 12:06:01,720 You can just say, how many links go to this? 15329 12:06:01,720 --> 12:06:04,720 But the problem is you've got to figure out the value of those links. 15330 12:06:04,720 --> 12:06:07,720 And then you have to, how do you figure the value of those links? 15331 12:06:07,720 --> 12:06:11,720 By looking at how many good links come to it. 15332 12:06:11,720 --> 12:06:14,720 So it turns out that it's an infinite problem. 15333 12:06:14,720 --> 12:06:18,720 It's an infinitely difficult problem to use page rank. 15334 12:06:18,720 --> 12:06:20,720 But you can approximate it. 15335 12:06:20,720 --> 12:06:25,720 And what happens is, after a while, it converges to a reasonable value. 15336 12:06:25,720 --> 12:06:27,720 And so we're going to run the search index. 15337 12:06:27,720 --> 12:06:30,720 And each time it runs, you're going to see that it says, 15338 12:06:30,720 --> 12:06:32,720 how much did these numbers change? 15339 12:06:32,720 --> 12:06:35,720 And what happens is, in the beginning, they change very wildly. 15340 12:06:35,720 --> 12:06:37,720 But quickly, they flatten out. 
15341 12:06:37,720 --> 12:06:43,720 And the best way to think about the page rank 15342 12:06:43,720 --> 12:06:52,720 is think about how water runs, where you have a small little stream going by a house. 15343 12:06:52,720 --> 12:06:56,720 And sometimes it rains. Sometimes it's dry. 15344 12:06:56,720 --> 12:07:02,720 And sometimes there's like a little lake. 15345 12:07:02,720 --> 12:07:04,720 And the stream is always running. 15346 12:07:04,720 --> 12:07:06,720 And it doesn't go up and it doesn't go down. 15347 12:07:06,720 --> 12:07:08,720 It might go up a little bit if it rains a lot. 15348 12:07:08,720 --> 12:07:10,720 But in general, there's sort of a steady state, 15349 12:07:10,720 --> 12:07:14,720 meaning that whatever water's coming in is about the same as the water going out. 15350 12:07:14,720 --> 12:07:17,720 So we think about this in terms of web pages. 15351 12:07:17,720 --> 12:07:22,720 The value of the links coming in is roughly the same as the value of links going out. 15352 12:07:22,720 --> 12:07:27,720 So when that starts to balance the in and the out value from each of the nodes, 15353 12:07:27,720 --> 12:07:30,720 then you've got pretty stable. 15354 12:07:30,720 --> 12:07:34,720 And so what Google does is they have a really relatively stable assessment 15355 12:07:34,720 --> 12:07:36,720 of goodness and value of pages. 15356 12:07:36,720 --> 12:07:38,720 And they use that to compute page rank. 15357 12:07:38,720 --> 12:07:41,720 And then they throw a few more pages in and it kind of has to adjust for a while, 15358 12:07:41,720 --> 12:07:42,720 but it reconverges. 15359 12:07:42,720 --> 12:07:49,720 And so this is a calculation that generally converges and it doesn't vary wildly. 15360 12:07:49,720 --> 12:07:54,720 And that's why Google's pretty good at kind of arriving at the true value of something. 15361 12:07:54,720 --> 12:07:58,720 So let's take a look at what we're going to do in this application.
15362 12:07:58,720 --> 12:08:04,720 Again, we have a file that is going to spider the web. 15363 12:08:04,720 --> 12:08:06,720 And we only have one database. 15364 12:08:06,720 --> 12:08:09,720 Again, in this one we'll have two databases in the next one. 15365 12:08:09,720 --> 12:08:13,720 And so this is spider is the restartable part. 15366 12:08:13,720 --> 12:08:17,720 And what we actually do is we put one URL in, the starting URL. 15367 12:08:17,720 --> 12:08:22,720 And then spider walks in and asks, are there any unretrieved pages? 15368 12:08:22,720 --> 12:08:24,720 And it does that randomly. 15369 12:08:24,720 --> 12:08:27,720 It sort of picks among the unretrieved pages and says, okay, great. 15370 12:08:27,720 --> 12:08:28,720 I'll go retrieve that page. 15371 12:08:28,720 --> 12:08:30,720 And then I'll parse that page. 15372 12:08:30,720 --> 12:08:33,720 And then I'll put in a bunch of new unretrieved pages. 15373 12:08:33,720 --> 12:08:37,720 Okay, as well as the text of that page and then a bunch of unretrieved pages. 15374 12:08:37,720 --> 12:08:42,720 And then it'll go back up and it'll say, oh, give me one of the randomly non-retrieved pages. 15375 12:08:42,720 --> 12:08:45,720 And it'll grab the next page and pull that page down and then add to it. 15376 12:08:45,720 --> 12:08:49,720 And so this is like there's a page and then a to-do list. 15377 12:08:49,720 --> 12:08:53,720 And then this one becomes a page and then adds a few more things to the to-do list. 15378 12:08:53,720 --> 12:08:59,720 And so the to-do list or the unretrieved URLs grows very rapidly. 15379 12:08:59,720 --> 12:09:02,720 And the retrieved ones grow sort of as you retrieve them one at a time. 15380 12:09:02,720 --> 12:09:04,720 But you've always got this long list. 
15381 12:09:04,720 --> 12:09:07,720 If you have a really short site that only has like two links, 15382 12:09:07,720 --> 12:09:11,720 if you start at drchuck.com slash page1.htm, 15383 12:09:11,720 --> 12:09:14,720 it'll go to page two and then go back to page one and it'll be out of things. 15384 12:09:14,720 --> 12:09:16,720 It'll have retrieved all of the pages. 15385 12:09:16,720 --> 12:09:20,720 And so if you have a website that has no external links or has very few pages 15386 12:09:20,720 --> 12:09:23,720 and they point to each other, this will run out of things to do. 15387 12:09:23,720 --> 12:09:29,720 But if you go to a page like my blog or the sample stuff that I have up 15388 12:09:29,720 --> 12:09:33,720 for you to spider for testing on drchuck.net, 15389 12:09:33,720 --> 12:09:35,720 it'll run for a very long time. 15390 12:09:35,720 --> 12:09:38,720 And you'll have far more pages to retrieve than pages that you retrieve. 15391 12:09:38,720 --> 12:09:39,720 But that's okay. 15392 12:09:39,720 --> 12:09:41,720 At some point, you can stop this. 15393 12:09:41,720 --> 12:09:43,720 Maybe it stops because you ran out of bandwidth 15394 12:09:43,720 --> 12:09:46,720 or maybe your computer went down or who knows what, right? 15395 12:09:46,720 --> 12:09:47,720 But it's okay. 15396 12:09:47,720 --> 12:09:51,720 This is a restartable process because it always has some pages that are retrieved 15397 12:09:51,720 --> 12:09:52,720 and some unretrieved pages. 15398 12:09:52,720 --> 12:09:53,720 You start it back up. 15399 12:09:53,720 --> 12:09:55,720 It picks randomly from the unretrieved pages. 15400 12:09:55,720 --> 12:10:00,720 The database is the sort of persistent state of your spider 15401 12:10:00,720 --> 12:10:03,720 rather than a bunch of dictionaries or lists inside the Python 15402 12:10:03,720 --> 12:10:06,720 which go away when the program dies. 
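The idea of keeping the spider's to-do list in the database, so that it survives a crash, can be sketched as follows. The table layout and function names are assumptions for this example rather than the course's spider code: a page whose `html` column is NULL counts as unretrieved, and the next page is picked at random from those.

```python
import sqlite3


def open_spider_db(path=":memory:"):
    """Create the Pages table if needed; html IS NULL means 'not retrieved'."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS Pages "
        "(id INTEGER PRIMARY KEY, url TEXT UNIQUE, html TEXT)"
    )
    return conn


def add_url(conn, url):
    """Queue a URL; a URL that is already known is silently ignored."""
    conn.execute(
        "INSERT OR IGNORE INTO Pages (url, html) VALUES (?, NULL)", (url,)
    )


def next_unretrieved(conn):
    """Pick one unretrieved page at random, or None when nothing is left."""
    return conn.execute(
        "SELECT id, url FROM Pages WHERE html IS NULL "
        "ORDER BY RANDOM() LIMIT 1"
    ).fetchone()


def store_page(conn, page_id, html, links):
    """Save the page text and queue every link found on it."""
    conn.execute("UPDATE Pages SET html = ? WHERE id = ?", (html, page_id))
    for link in links:
        add_url(conn, link)
    conn.commit()  # committed state is what a restart picks up from
```

On a two-page site whose pages link only to each other, the loop retrieves both pages and then `next_unretrieved` returns None, which matches the "out of things to do" case described above.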
15403 12:10:06,720 --> 12:10:10,720 And so at some point you have, let's just say, a few hundred pages in here 15404 12:10:10,720 --> 12:10:12,720 and a few thousand unretrieved pages. 15405 12:10:12,720 --> 12:10:14,720 You can run the page rank algorithm. 15406 12:10:14,720 --> 12:10:17,720 And what the page rank algorithm does is it loops through all the pages 15407 12:10:17,720 --> 12:10:19,720 and figure out which pages are linked to which pages 15408 12:10:19,720 --> 12:10:22,720 and then reads the numbers and then updates the numbers 15409 12:10:22,720 --> 12:10:25,720 and then does that some number of times. 15410 12:10:25,720 --> 12:10:27,720 And so this is where the numbers, all the pages, 15411 12:10:27,720 --> 12:10:29,720 sort of start out with goodness of one. 15412 12:10:29,720 --> 12:10:32,720 I think this printout is showing that goodness of one. 15413 12:10:32,720 --> 12:10:34,720 And then it changes. 15414 12:10:34,720 --> 12:10:37,720 And then the goodness goes to, some of the goodness goes up to two. 15415 12:10:37,720 --> 12:10:40,720 Some of the goodness goes to seven and whatever. 15416 12:10:40,720 --> 12:10:43,720 But then it does this over and over and then it uses these numbers 15417 12:10:43,720 --> 12:10:44,720 and then they change again. 15418 12:10:44,720 --> 12:10:47,720 And so there's a number of time steps that this page rank runs. 15419 12:10:47,720 --> 12:10:51,720 And you will see as the page rank runs, when I show you the code, 15420 12:10:51,720 --> 12:10:56,720 you'll see the average sort of change in these numbers 15421 12:10:56,720 --> 12:10:57,720 across all these things. 15422 12:10:57,720 --> 12:11:01,720 And you'll see that the average goes down very rapidly as you get through. 
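The loop described above (start every page at 1.0, pass each page's rank along its outbound links, repeat, and watch the average change shrink) can be sketched in a few lines. This is a simplified illustration, not the course's page rank program: the function names and the in-memory dictionary of links are assumptions, and rank from pages with no outbound links is simply spread evenly so the total stays constant.

```python
def pagerank_step(links, ranks):
    """Run one time step; return (new_ranks, average_change_per_page)."""
    next_ranks = {page: 0.0 for page in ranks}
    for page, old_rank in ranks.items():
        targets = [t for t in links.get(page, []) if t in ranks and t != page]
        if not targets:
            continue  # nothing to pass on; handled by the spreading step below
        share = old_rank / len(targets)
        for target in targets:
            next_ranks[target] += share
    # Spread any rank that was not passed on evenly across all pages
    lost = sum(ranks.values()) - sum(next_ranks.values())
    bonus = lost / len(next_ranks)
    for page in next_ranks:
        next_ranks[page] += bonus
    change = sum(abs(next_ranks[p] - ranks[p]) for p in ranks) / len(ranks)
    return next_ranks, change


def run_pagerank(links, iterations=100):
    """Start every page at 1.0 and iterate, recording the average change."""
    pages = set(links) | {t for targets in links.values() for t in targets}
    ranks = {page: 1.0 for page in pages}
    history = []
    for _ in range(iterations):
        ranks, change = pagerank_step(links, ranks)
        history.append(change)
    return ranks, history
```

Printing `history` for a small link graph shows the pattern the lecture describes: large changes in the first few steps, then values that settle toward zero as the ranks converge.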
15423 12:11:01,720 --> 12:11:04,720 And so usually with a few hundred or even a thousand pages, 15424 12:11:04,720 --> 12:11:07,720 you run like a hundred plus times through this algorithm 15425 12:11:07,720 --> 12:11:09,720 and these numbers have converged. 15426 12:11:09,720 --> 12:11:12,720 And that's when you sort of can begin to trust the numbers. 15427 12:11:12,720 --> 12:11:15,720 Now there's this one program called spreset, 15428 12:11:15,720 --> 12:11:17,720 which sets all the pages back to one. 15429 12:11:17,720 --> 12:11:19,720 So you can start this over. 15430 12:11:19,720 --> 12:11:23,720 So if you were to spider for a while, run sprank for a while, play around, 15431 12:11:23,720 --> 12:11:26,720 and then you wanted to spider some more and start it over, 15432 12:11:26,720 --> 12:11:29,720 you could say, oh, let's start the page rank completely over. 15433 12:11:29,720 --> 12:11:34,720 Or you could simply take the new pages and watch it adapt. 15434 12:11:34,720 --> 12:11:37,720 Either way, this is just a way to reset all the pages 15435 12:11:37,720 --> 12:11:41,720 to have sort of their initial value of a goodness of 1.0. 15436 12:11:41,720 --> 12:11:43,720 So at some point you run this. 15437 12:11:43,720 --> 12:11:46,720 This part here runs really slow. 15438 12:11:46,720 --> 12:11:49,720 This part runs super fast, like in the blink of an eye. 15439 12:11:49,720 --> 12:11:52,720 This one is pretty fast. 15440 12:11:52,720 --> 12:11:57,720 And then at some point you've got these pages that have, you know, numbers on them. 15441 12:11:57,720 --> 12:11:59,720 They have values on the pages. 15442 12:11:59,720 --> 12:12:03,720 And there's a couple of programs that allow us to visualize that. 15443 12:12:03,720 --> 12:12:06,720 One is the dump which just reads it and checks to see.
15444 12:12:06,720 --> 12:12:09,720 It shows the new page rank, the old page rank, 15445 12:12:09,720 --> 12:12:13,720 and various other things and shows just a way to dump it. 15446 12:12:13,720 --> 12:12:16,720 And then there's this thing that reads the whole thing. 15447 12:12:16,720 --> 12:12:19,720 You say, I'd like to do 25 at the top, the best. 15448 12:12:19,720 --> 12:12:22,720 It sorts it by page rank and then produces a JavaScript file. 15449 12:12:22,720 --> 12:12:24,720 It has just the numbers in it. 15450 12:12:24,720 --> 12:12:29,720 And then there is some HTML and a visualization library called D3.js, 15451 12:12:29,720 --> 12:12:33,720 which you can read about, that when the HTML starts it reads this 15452 12:12:33,720 --> 12:12:36,720 and has this nice force-directed layout of the page rank. 15453 12:12:36,720 --> 12:12:41,720 And you can hover over things and you can see what page rank you've got. 15454 12:12:41,720 --> 12:12:47,720 And so that is the page rank algorithm that we're going to do. 15455 12:12:47,720 --> 12:12:50,720 And up next we'll do the largest and most complex of these things, 15456 12:12:50,720 --> 12:12:53,720 and that is the email. 15457 12:12:53,720 --> 12:12:56,720 We're going to spider some email, which is about a gigabyte of data. 15458 12:12:56,720 --> 12:13:02,720 Okay? 15459 12:13:02,720 --> 12:13:04,720 We're doing a bit of code walkthrough, and if you want to, 15460 12:13:04,720 --> 12:13:07,720 you can get to the sample code and download it all 15461 12:13:07,720 --> 12:13:09,720 so that you can walk through the code yourself. 15462 12:13:09,720 --> 12:13:12,720 What we're walking through today is the page rank code. 15463 12:13:12,720 --> 12:13:20,720 And so the page rank code, let me get the picture of the page rank code up here. 15464 12:13:20,720 --> 12:13:22,720 Here's the picture of the page rank code. 15465 12:13:22,720 --> 12:13:29,720 And so the page rank code has five chunks of code that are going to run. 
15466 12:13:29,720 --> 12:13:32,720 The first one we're going to look at is the spidering code. 15467 12:13:32,720 --> 12:13:36,720 And then we'll do a separate look at these other guys later. 15468 12:13:36,720 --> 12:13:38,720 So the first one we'll look at is spidering. 15469 12:13:38,720 --> 12:13:41,720 And again, it's sort of the same pattern of we've got some stuff on the web, 15470 12:13:41,720 --> 12:13:43,720 in this case web pages. 15471 12:13:43,720 --> 12:13:47,720 We're going to have a database that sort of just captures the stuff. 15472 12:13:47,720 --> 12:13:50,720 It's not really trying to be particularly intelligent, 15473 12:13:50,720 --> 12:13:52,720 but it is going to parse these with Beautiful Soup 15474 12:13:52,720 --> 12:13:55,720 and add things to the database. 15475 12:13:55,720 --> 12:13:56,720 Okay? 15476 12:13:56,720 --> 12:13:59,720 And so then we'll talk about how we run the page rank algorithm 15477 12:13:59,720 --> 12:14:02,720 and then how we visualize the page rank algorithm in a bit. 15478 12:14:02,720 --> 12:14:06,720 Now, the first thing to notice is that I've got to put, 15479 12:14:06,720 --> 12:14:09,720 I put the Beautiful Soup code in right here. 15480 12:14:09,720 --> 12:14:10,720 Okay? 15481 12:14:10,720 --> 12:14:13,720 So this is, you can get this from the bs4.zip file. 15482 12:14:13,720 --> 12:14:15,720 There might need to be a readme. 15483 12:14:15,720 --> 12:14:17,720 No, but there's a readme somewhere. 15484 12:14:17,720 --> 12:14:20,720 But to get to use Beautiful Soup, you've got to put this bs4.zip, 15485 12:14:20,720 --> 12:14:22,720 or you have to install Beautiful Soup for your stuff. 15486 12:14:22,720 --> 12:14:26,720 So I provide this bs4.zip as a quick and dirty way 15487 12:14:26,720 --> 12:14:34,720 if you can't install something for all of the Python users on your system. 15488 12:14:34,720 --> 12:14:35,720 So that's what it's supposed to look like. 
15489 12:14:35,720 --> 12:14:37,720 You're supposed to have it unzipped right here in these files. 15490 12:14:37,720 --> 12:14:39,720 And I don't know what dammit.py means. 15491 12:14:39,720 --> 12:14:41,720 That came from Beautiful Soup. 15492 12:14:41,720 --> 12:14:43,720 If you look, it's in their source code. 15493 12:14:43,720 --> 12:14:45,720 So I'm not swearing. 15494 12:14:45,720 --> 12:14:46,720 It's the Beautiful Soup 15495 12:14:46,720 --> 12:14:47,720 people who are swearing. 15496 12:14:47,720 --> 12:14:48,720 I'm sorry. 15497 12:14:48,720 --> 12:14:49,720 I apologize. 15498 12:14:49,720 --> 12:14:50,720 Okay. 15499 12:14:50,720 --> 12:14:52,720 So the code we're going to play with the most, 15500 12:14:52,720 --> 12:14:55,720 and this first one, is called spider.py. 15501 12:14:55,720 --> 12:14:57,720 And, you know, we're going to do databases. 15502 12:14:57,720 --> 12:14:59,720 We're going to read URLs. 15503 12:14:59,720 --> 12:15:02,720 And we're going to parse them with Beautiful Soup. 15504 12:15:02,720 --> 12:15:04,720 Okay? 15505 12:15:04,720 --> 12:15:08,720 And so what we're going to do is we're going to make a file. 15506 12:15:08,720 --> 12:15:10,720 Again, this will make spider.sqlite. 15507 12:15:10,720 --> 12:15:13,720 And here we are in pagerank, and ls -l: 15508 12:15:13,720 --> 12:15:16,720 spider.sqlite is not there. 15509 12:15:16,720 --> 12:15:17,720 So it's going to create the database. 15510 12:15:17,720 --> 12:15:19,720 We do create table if not exists. 15511 12:15:19,720 --> 12:15:21,720 We're going to have an integer primary key 15512 12:15:21,720 --> 12:15:23,720 because we're going to do foreign keys here. 15513 12:15:23,720 --> 12:15:24,720 We're going to have a URL. 15514 12:15:24,720 --> 12:15:29,720 And the URL, which is unique, the HTML, which is unique, 15515 12:15:29,720 --> 12:15:30,720 whether we got an error.
15516 12:15:30,720 --> 12:15:33,720 And then for the second half, when we start doing PageRank, 15517 12:15:33,720 --> 12:15:34,720 we're going to have old rank and new rank. 15518 12:15:34,720 --> 12:15:37,720 Because the way PageRank works is it takes the old rank, 15519 12:15:37,720 --> 12:15:39,720 computes the new rank, and then replaces the old rank 15520 12:15:39,720 --> 12:15:42,720 with the new rank, and then does it over and over again. 15521 12:15:42,720 --> 12:15:46,720 And then we're going to have a many-to-many table, 15522 12:15:46,720 --> 12:15:48,720 which really points back to pages. 15523 12:15:48,720 --> 12:15:50,720 So I call this from ID and to ID. 15524 12:15:50,720 --> 12:15:53,720 We did this with some of the Twitter stuff. 15525 12:15:53,720 --> 12:15:56,720 And then this webs is just in case I have more than one web, 15526 12:15:56,720 --> 12:15:58,720 but that really doesn't make much difference. 15527 12:15:58,720 --> 12:16:05,720 Okay, so what we're going to do is we're going to select 15528 12:16:05,720 --> 12:16:08,720 ID, URL from pages where HTML is null. 15529 12:16:08,720 --> 12:16:11,720 This is our indicator that a page has not yet been retrieved. 15530 12:16:11,720 --> 12:16:14,720 And error is null, ordered by random. 15531 12:16:14,720 --> 12:16:17,720 And so this is our way, this long bit of stuff. 15532 12:16:17,720 --> 12:16:20,720 And not all this SQL is completely standard, 15533 12:16:20,720 --> 12:16:23,720 but this order by random is really quite nice in SQLite. 15534 12:16:23,720 --> 12:16:28,720 Limit one is just: randomly pick a record in this database 15535 12:16:28,720 --> 12:16:32,720 where this clause is true, and then pick it randomly. 15536 12:16:32,720 --> 12:16:34,720 And then we're going to fetch a row.
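To make the table layout and the "pick a random unretrieved page" query concrete, here is a minimal sketch. It is not the course's spider.py: the column types, the example.com URLs, the in-memory database, and the helper name pick_unretrieved are assumptions made for illustration, based only on the columns described above.

```python
import sqlite3

# Assumption: an in-memory database keeps the sketch self-contained;
# the lecture writes to a file called spider.sqlite instead.
conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Three tables as described in the walkthrough (column types are assumed).
cur.executescript("""
CREATE TABLE IF NOT EXISTS Pages (
    id INTEGER PRIMARY KEY,
    url TEXT UNIQUE,
    html TEXT,
    error INTEGER,
    old_rank REAL,
    new_rank REAL
);
CREATE TABLE IF NOT EXISTS Links (from_id INTEGER, to_id INTEGER);
CREATE TABLE IF NOT EXISTS Webs (url TEXT UNIQUE);
""")

# One page that has been retrieved and two that are still waiting (html IS NULL).
cur.execute("INSERT INTO Pages (url, html, new_rank) VALUES (?, ?, 1.0)",
            ("http://example.com/a", "<html></html>"))
cur.execute("INSERT INTO Pages (url, html, new_rank) VALUES (?, NULL, 1.0)",
            ("http://example.com/b",))
cur.execute("INSERT INTO Pages (url, html, new_rank) VALUES (?, NULL, 1.0)",
            ("http://example.com/c",))
conn.commit()


def pick_unretrieved(cur):
    """Return (id, url) of one random page with no HTML and no error, or None."""
    cur.execute("""SELECT id, url FROM Pages
                   WHERE html IS NULL AND error IS NULL
                   ORDER BY RANDOM() LIMIT 1""")
    return cur.fetchone()


row = pick_unretrieved(cur)
print(row)  # one of the two unretrieved pages, chosen at random
```

Because the "to do" list lives in the Pages table rather than in a Python list, stopping and restarting the program loses nothing: the next run simply asks the same question again.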
15537 12:16:34,720 --> 12:16:41,720 And if that row is none, right, we're going to ask for a new web, 15538 12:16:41,720 --> 12:16:45,720 a starting URL, and this is going to fire things up, 15539 12:16:45,720 --> 12:16:47,720 and we're going to insert this new URL. 15540 12:16:47,720 --> 12:16:49,720 Otherwise, we're going to restart. 15541 12:16:49,720 --> 12:16:51,720 We have a row to start with. 15542 12:16:51,720 --> 12:16:53,720 And otherwise, we're going to sort of prime this 15543 12:16:53,720 --> 12:16:58,720 by inserting the URL we start with, insert into it. 15544 12:16:58,720 --> 12:17:00,720 If you enter it, it just goes to drchuck.com, 15545 12:17:00,720 --> 12:17:02,720 which is a fine place to start. 15546 12:17:02,720 --> 12:17:07,720 And then what we do is we, what this does is its page rank, 15547 12:17:07,720 --> 12:17:12,720 is it uses this web's table to limit the links. 15548 12:17:12,720 --> 12:17:16,720 It only does links to the sites that you tell it to do links. 15549 12:17:16,720 --> 12:17:20,720 And probably the best for your page rank is to stick with one site. 15550 12:17:20,720 --> 12:17:23,720 Otherwise, you will just never find the same site again 15551 12:17:23,720 --> 12:17:26,720 if you let this wander the web aimlessly. 15552 12:17:26,720 --> 12:17:29,720 And so I generally run with one web, 15553 12:17:29,720 --> 12:17:32,720 which this should be probably called web sites. 15554 12:17:32,720 --> 12:17:36,720 And I pull in all the data, and I read this in, 15555 12:17:36,720 --> 12:17:39,720 and I just make myself a list of the legit URLs, 15556 12:17:39,720 --> 12:17:41,720 and you'll see how we use that. 15557 12:17:41,720 --> 12:17:45,720 And the web is what are the legit places we're going to go, 15558 12:17:45,720 --> 12:17:48,720 because we're going to go through a loop, 15559 12:17:48,720 --> 12:17:51,720 ask for how many pages, 15560 12:17:51,720 --> 12:17:53,720 and we're going to look for a null page. 
15561 12:17:53,720 --> 12:17:57,720 Again, we're using that order by random, limit one. 15562 12:17:57,720 --> 12:18:04,720 And then we're going to grab one. 15563 12:18:04,720 --> 12:18:08,720 We're going to get the from ID, which is the page we're linking from, 15564 12:18:08,720 --> 12:18:11,720 and then the URL. 15565 12:18:11,720 --> 12:18:13,720 Otherwise, there are no unretrieved pages. 15566 12:18:13,720 --> 12:18:17,720 And so the from ID is: when we start adding links 15567 12:18:17,720 --> 12:18:21,720 to our page links, we've got to know the page we started with. 15568 12:18:21,720 --> 12:18:23,720 And that's the primary key. 15569 12:18:23,720 --> 12:18:25,720 We'll see how that primary key is set in a second. 15570 12:18:25,720 --> 12:18:27,720 So otherwise, we have none. 15571 12:18:27,720 --> 12:18:30,720 And we're going to print this from ID, 15572 12:18:30,720 --> 12:18:33,720 the from ID and the URL that we're working with, 15573 12:18:33,720 --> 12:18:37,720 just to make sure. We're going to wipe out all of the links. 15574 12:18:37,720 --> 12:18:40,720 Because it's unretrieved, we're going to wipe it out from the links. 15575 12:18:40,720 --> 12:18:45,720 The links is the connection table that connects from pages back to pages. 15576 12:18:45,720 --> 12:18:47,720 And so we're going to wipe out. 15577 12:18:47,720 --> 12:18:50,720 So we're going to go grab this URL. 15578 12:18:50,720 --> 12:18:52,720 We're going to read it. 15579 12:18:52,720 --> 12:18:56,720 We're not decoding it because we're using Beautiful Soup, 15580 12:18:56,720 --> 12:19:02,720 which compensates for the UTF-8. 15581 12:19:02,720 --> 12:19:06,720 And so we can ask, this is the HTTP status code. 15582 12:19:06,720 --> 12:19:08,720 And we check: 200 is a good status. 15583 12:19:08,720 --> 12:19:12,720 And if we get a bad status, we're going to say this error on page. 15584 12:19:12,720 --> 12:19:14,720 We're going to set that error.
15585 12:19:14,720 --> 12:19:15,720 We're going to update pages. 15586 12:19:15,720 --> 12:19:18,720 That way, we don't retrieve it ever again. 15587 12:19:18,720 --> 12:19:24,720 We basically check to see if the content type is text/html. 15588 12:19:24,720 --> 12:19:27,720 Remember, in HTTP, you get the content type. 15589 12:19:27,720 --> 12:19:28,720 We only want to retrieve it. 15590 12:19:28,720 --> 12:19:31,720 We only want to look for the links on HTML pages. 15591 12:19:31,720 --> 12:19:33,720 And so we wipe that guy out. 15592 12:19:33,720 --> 12:19:38,720 If we get a JPEG or something like that, we're not going to retrieve JPEG. 15593 12:19:38,720 --> 12:19:40,720 And then we commit and continue. 15594 12:19:40,720 --> 12:19:43,720 So these are kind of like, oh, those are pages we didn't want to mess with. 15595 12:19:43,720 --> 12:19:47,720 And then we print out how many characters we got and parse it. 15596 12:19:47,720 --> 12:19:50,720 And we do this whole thing in a try-except block, 15597 12:19:50,720 --> 12:19:52,720 because a lot of things can go wrong here. 15598 12:19:52,720 --> 12:19:54,720 It's a bit of a long try-except block. 15599 12:19:54,720 --> 12:19:59,720 Keyboard interrupt, that's what happens if I hit CTRL-C at my keyboard 15600 12:19:59,720 --> 12:20:01,720 or CTRL-Z on Windows. 15601 12:20:01,720 --> 12:20:05,720 Some other exception probably means Beautiful Soup blew up 15602 12:20:05,720 --> 12:20:06,720 or something else blew up. 15603 12:20:06,720 --> 12:20:13,720 And so we indicate with the error equals negative 1 for that URL 15604 12:20:13,720 --> 12:20:15,720 so we don't retrieve it again. 15605 12:20:15,720 --> 12:20:21,720 At this point, at line 103, we have got the HTML for that URL. 15606 12:20:21,720 --> 12:20:23,720 And so we're going to insert it in. 15607 12:20:23,720 --> 12:20:25,720 And we're going to set the page rank to 1.
15608 12:20:25,720 --> 12:20:30,720 So the way page rank works is it gives all the pages some normal value. 15609 12:20:30,720 --> 12:20:32,720 And then it alters that. 15610 12:20:32,720 --> 12:20:33,720 We'll see that in a bit. 15611 12:20:33,720 --> 12:20:36,720 So it sets it in with 1. 15612 12:20:36,720 --> 12:20:40,720 We're going to insert or ignore. 15613 12:20:40,720 --> 12:20:44,720 That's just in case the page is not there. 15614 12:20:44,720 --> 12:20:46,720 And then we're going to do an update. 15615 12:20:46,720 --> 12:20:48,720 And that's kind of doing the same thing twice, 15616 12:20:48,720 --> 12:20:51,720 just sort of doubly making sure: if it's already there, 15617 12:20:51,720 --> 12:20:54,720 this or ignore will cause this to do nothing. 15618 12:20:54,720 --> 12:20:56,720 And the update will cause us to retain it. 15619 12:20:56,720 --> 12:20:59,720 And then we commit it so that if we do selects later, 15620 12:20:59,720 --> 12:21:01,720 we get that information. 15621 12:21:01,720 --> 12:21:04,720 Now this code is similar. 15622 12:21:04,720 --> 12:21:07,720 Remember, we used Beautiful Soup to pull out all the anchor tags. 15623 12:21:07,720 --> 12:21:08,720 We have a for loop. 15624 12:21:08,720 --> 12:21:10,720 We pull out the href. 15625 12:21:10,720 --> 12:21:13,720 And you'll see this code's a little more complex 15626 12:21:13,720 --> 12:21:15,720 than some of the earlier stuff, 15627 12:21:15,720 --> 12:21:17,720 because it has to deal with the real nastiness 15628 12:21:17,720 --> 12:21:19,720 or imperfection of the web. 15629 12:21:19,720 --> 12:21:21,720 And so we're going to use urlparse, 15630 12:21:21,720 --> 12:21:26,720 which is actually part of the urllib code. 15631 12:21:26,720 --> 12:21:29,720 And that's going to break the URL into pieces. 15632 12:21:29,720 --> 12:21:30,720 Come back. 15633 12:21:30,720 --> 12:21:32,720 We use urlparse. 15634 12:21:32,720 --> 12:21:36,720 We have the scheme, which is HTTP or HTTPS.
15635 12:21:36,720 --> 12:21:39,720 And this resolves relative references. 15636 12:21:39,720 --> 12:21:41,720 It resolves relative references 15637 12:21:41,720 --> 12:21:44,720 by taking the current URL and hooking it up. 15638 12:21:44,720 --> 12:21:47,720 urljoin knows about slashes and all those other things. 15639 12:21:47,720 --> 12:21:49,720 We check to see if there's an anchor, 15640 12:21:49,720 --> 12:21:51,720 the pound sign at the end of a URL, 15641 12:21:51,720 --> 12:21:56,720 and we throw away everything from the anchor onward. 15642 12:21:56,720 --> 12:22:00,720 If we have a JPEG or a PNG or a GIF, 15643 12:22:00,720 --> 12:22:01,720 we are going to skip it. 15644 12:22:01,720 --> 12:22:03,720 We don't want to bother with that. 15645 12:22:03,720 --> 12:22:04,720 We're looking through links now. 15646 12:22:04,720 --> 12:22:06,720 We're looking at all the links. 15647 12:22:06,720 --> 12:22:09,720 And if we have a slash at the end, 15648 12:22:09,720 --> 12:22:12,720 we're going to chop off the slash by saying minus one. 15649 12:22:12,720 --> 12:22:15,720 And so this is just kind of nasty choppage 15650 12:22:15,720 --> 12:22:18,720 and throwing away of URLs: we're going through a page, 15651 12:22:18,720 --> 12:22:20,720 and we have a bunch that we don't like 15652 12:22:20,720 --> 12:22:23,720 or we have to clean them up or whatever. 15653 12:22:23,720 --> 12:22:25,720 And now we've made them absolute by doing this; 15654 12:22:25,720 --> 12:22:27,720 it's an absolute URL. 15655 12:22:27,720 --> 12:22:30,720 This is just, you write this slowly but surely 15656 12:22:30,720 --> 12:22:32,720 when your code blows up and you start it over 15657 12:22:32,720 --> 12:22:35,720 and start it over and start it over. 15658 12:22:35,720 --> 12:22:38,720 Then what we do is we check to see through all the webs.
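The link cleanup steps just described (resolve relative references, drop the anchor, skip images, chop a trailing slash) can be sketched as one small function. This is an illustration under assumptions, not the lecture's exact code: the function name clean_link and the example.com URLs are made up here, while urlparse and urljoin are the real functions from urllib.parse.

```python
from urllib.parse import urljoin, urlparse


def clean_link(href, base_url):
    """Return an absolute, tidied URL, or None if the link should be skipped.

    Hypothetical helper: it mirrors the steps described in the walkthrough.
    """
    if href is None:
        return None
    # A link without a scheme is relative, so hook it onto the current page's URL.
    if len(urlparse(href).scheme) < 1:
        href = urljoin(base_url, href)
    # Throw away the anchor (the pound sign) and everything after it.
    pos = href.find("#")
    if pos != -1:
        href = href[:pos]
    # Skip images; we only want pages that can contain more links.
    if href.endswith((".png", ".jpg", ".gif")):
        return None
    # Chop off a trailing slash so the same page is not stored twice.
    if href.endswith("/"):
        href = href[:-1]
    if len(href) < 1:
        return None
    return href


base = "http://example.com/dir/page1.htm"
print(clean_link("page2.htm#top", base))
print(clean_link("/blog/", base))
print(clean_link("logo.png", base))
```

Returning None for a link we do not want keeps the calling loop simple: it can just continue to the next anchor tag.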
15659 12:22:38,720 --> 12:22:41,720 Remember, those were the URLs that we're willing to stay with 15660 12:22:41,720 --> 12:22:43,720 and usually it's just one. 15661 12:22:43,720 --> 12:22:46,720 If this would link off of the sites, 15662 12:22:46,720 --> 12:22:48,720 the sites we're interested in, we're going to skip it. 15663 12:22:48,720 --> 12:22:51,720 We are not interested in links that leave the site. 15664 12:22:51,720 --> 12:22:54,720 So this is like: a link that left the site, skip it. 15665 12:22:54,720 --> 12:22:58,720 But now, finally, here at line 132, 15666 12:22:58,720 --> 12:23:01,720 we are ready to put this into pages, 15667 12:23:01,720 --> 12:23:05,720 URL and the HTML, and it's all good, right? 15668 12:23:05,720 --> 12:23:11,720 And that one's going to be null right there 15669 12:23:11,720 --> 12:23:14,720 because we haven't retrieved the HTML. 15670 12:23:14,720 --> 12:23:18,720 This is null because this is a page we're going to retrieve, 15671 12:23:18,720 --> 12:23:20,720 we're giving it the page rank of one, 15672 12:23:20,720 --> 12:23:23,720 and we're giving it no HTML and that way it'll be retrieved. 15673 12:23:23,720 --> 12:23:27,720 And then we commit that, okay? 15674 12:23:27,720 --> 12:23:29,720 And then we want to get the ID. 15675 12:23:29,720 --> 12:23:33,720 So we could have done this one way or another, 15676 12:23:33,720 --> 12:23:35,720 but we're going to do a select to say, 15677 12:23:35,720 --> 12:23:38,720 hey, what was the ID that either was already there 15678 12:23:38,720 --> 12:23:41,720 or was just created? 15679 12:23:41,720 --> 12:23:44,720 And we grab that with a fetch one and say, 15680 12:23:44,720 --> 12:23:47,720 retrieve the to ID, and now we're going to put a link in, 15681 12:23:47,720 --> 12:23:51,720 insert or ignore into links from ID, to ID, 15682 12:23:51,720 --> 12:23:54,720 where from ID is the primary key of the page 15683 12:23:54,720 --> 12:23:56,720 that we're going through and looking for links.
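The "queue the page, look up its ID, then record the link" sequence can be sketched as follows. The table layout is assumed from the description earlier in the walkthrough, and the helper name record_link and the example.com URLs are hypothetical; this is a sketch, not the course's spider.py.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # assumption: in-memory for the sketch
cur = conn.cursor()
cur.execute("""CREATE TABLE Pages (id INTEGER PRIMARY KEY, url TEXT UNIQUE,
               html TEXT, error INTEGER, old_rank REAL, new_rank REAL)""")
cur.execute("CREATE TABLE Links (from_id INTEGER, to_id INTEGER)")


def record_link(cur, from_id, href):
    """Queue href for later retrieval and record a from_id -> to_id link."""
    # html is NULL, so a later pass will pick this page up; rank starts at 1.0.
    # OR IGNORE makes this a no-op when the URL is already in Pages.
    cur.execute("INSERT OR IGNORE INTO Pages (url, html, new_rank) "
                "VALUES (?, NULL, 1.0)", (href,))
    # Ask for the ID that was either already there or was just created.
    cur.execute("SELECT id FROM Pages WHERE url = ? LIMIT 1", (href,))
    row = cur.fetchone()
    if row is None:
        return None
    to_id = row[0]
    cur.execute("INSERT OR IGNORE INTO Links (from_id, to_id) VALUES (?, ?)",
                (from_id, to_id))
    return to_id


# The page we are currently scanning for links.
cur.execute("INSERT INTO Pages (url, html, new_rank) VALUES (?, ?, 1.0)",
            ("http://example.com", "<html>...</html>"))
from_id = cur.lastrowid

first = record_link(cur, from_id, "http://example.com/about")
again = record_link(cur, from_id, "http://example.com/about")
conn.commit()
print(first, again)  # the same to_id both times, because url is UNIQUE
```

Doing the SELECT after the INSERT OR IGNORE means the code does not have to care whether the page was new or already known.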
15684 12:23:56,720 --> 12:24:00,720 To ID is the link that we just created and away we run. 15685 12:24:00,720 --> 12:24:03,720 So it's going to go and go and go and go. 15686 12:24:03,720 --> 12:24:10,720 Let's go look at the create statement up here, 15687 12:24:10,720 --> 12:24:13,720 from ID and to ID right there, okay. 15688 12:24:13,720 --> 12:24:19,720 So let's run it. 15689 12:24:19,720 --> 12:24:23,720 python3, oops. 15690 12:24:23,720 --> 12:24:32,720 python3 spider.py. 15691 12:24:32,720 --> 12:24:36,720 So it's fresh and so it wants a URL with which to start, 15692 12:24:36,720 --> 12:24:40,720 and I'll just start with my favorite website, 15693 12:24:40,720 --> 12:24:42,720 www.drchuck.com. 15694 12:24:42,720 --> 12:24:45,720 Now this basically, this first one you put in, 15695 12:24:45,720 --> 12:24:49,720 it's going to stay on this website for a while, okay. 15696 12:24:49,720 --> 12:24:52,720 So I'll hit enter and let's just grab like, 15697 12:24:52,720 --> 12:24:55,720 let's grab one page just for yucks. 15698 12:24:55,720 --> 12:25:00,720 Okay, so it grabbed that and it printed out that it got 15699 12:25:00,720 --> 12:25:11,720 8545 characters and it printed out that it got six links. 15700 12:25:11,720 --> 12:25:21,720 So if I go to this and open database, 15701 12:25:21,720 --> 12:25:28,720 and I go to code 3 and I go to page rank and I look at this, 15702 12:25:28,720 --> 12:25:31,720 oh, let me get out so it closes. 15703 12:25:31,720 --> 12:25:34,720 So notice this SQLite journal, 15704 12:25:34,720 --> 12:25:38,720 that means it's not done closing, so I'm going to get out of this 15705 12:25:38,720 --> 12:25:42,720 by pressing enter, and so you'll notice now that that journal file went away; 15706 12:25:42,720 --> 12:25:45,720 otherwise we would not be getting the final data. 15707 12:25:45,720 --> 12:25:46,720 There we go. 15708 12:25:46,720 --> 12:25:50,720 Okay, so webs, let's take a look at the data.
15709 12:25:50,720 --> 12:25:54,720 Webs has just one URL, 15710 12:25:54,720 --> 12:25:57,720 that's the URLs that we're allowing ourselves to look at. 15711 12:25:57,720 --> 12:25:59,720 You can put more than one in here if you want 15712 12:25:59,720 --> 12:26:01,720 but most people will just leave this as one. 15713 12:26:01,720 --> 12:26:07,720 Pages, so we got this first one and we retrieved this 15714 12:26:07,720 --> 12:26:12,720 as the HTML of it and we found six other URLs in there 15715 12:26:12,720 --> 12:26:14,720 that are drchuck.com URLs. 15716 12:26:14,720 --> 12:26:16,720 There was lots of other URLs in there 15717 12:26:16,720 --> 12:26:21,720 but there were only five other ones that we found. 15718 12:26:21,720 --> 12:26:24,720 And what we'll find is if we go to links, 15719 12:26:24,720 --> 12:26:27,720 we'll see that page one, links to two, links to three, 15720 12:26:27,720 --> 12:26:29,720 links to four, links to five, links to six 15721 12:26:29,720 --> 12:26:31,720 because the links is just a many to many table. 15722 12:26:31,720 --> 12:26:34,720 So page one points to page two, 15723 12:26:34,720 --> 12:26:37,720 page one to three, page one to five, okay? 15724 12:26:37,720 --> 12:26:42,720 So that's what happens when we have the first page. 15725 12:26:42,720 --> 12:26:47,720 So let's retrieve one more page. 15726 12:26:47,720 --> 12:26:51,720 Now it's, we could have started a new crawl 15727 12:26:51,720 --> 12:26:54,720 but we're just gonna, it's gonna stay on drchuck.com 15728 12:26:54,720 --> 12:26:56,720 and I'll just ask for one more page. 15729 12:26:56,720 --> 12:26:58,720 And so now it went and grabbed. 15730 12:26:58,720 --> 12:27:00,720 It randomly picked among these null guys 15731 12:27:00,720 --> 12:27:02,720 and I'm gonna hit enter to close it 15732 12:27:02,720 --> 12:27:04,720 and then I'll refresh this. 
15733 12:27:04,720 --> 12:27:09,720 And oh, so it looks like we retrieved OBI sample 15734 12:27:09,720 --> 12:27:11,720 and we didn't get any new links. 15735 12:27:11,720 --> 12:27:14,720 And so the links page, no, we didn't get any new links. 15736 12:27:14,720 --> 12:27:18,720 So that page, whatever that was, OBI sample 15737 12:27:18,720 --> 12:27:19,720 had no external links. 15738 12:27:19,720 --> 12:27:25,720 So let's do another one. 15739 12:27:25,720 --> 12:27:29,720 Oh, one more page. 15740 12:27:29,720 --> 12:27:32,720 So that one had 15 links, so let's take a look now. 15741 12:27:32,720 --> 12:27:35,720 So now we have 15 pages. 15742 12:27:35,720 --> 12:27:38,720 It picked this one to do, right? 15743 12:27:38,720 --> 12:27:40,720 And now it added 15 more pages 15744 12:27:40,720 --> 12:27:44,720 and then if you look at links you will see that page four, 15745 12:27:44,720 --> 12:27:47,720 which is the one it just retrieved, links back to page one. 15746 12:27:47,720 --> 12:27:49,720 So now we're seeing this is where the page rank 15747 12:27:49,720 --> 12:27:50,720 is gonna be cool. 15748 12:27:50,720 --> 12:27:56,720 Four links to one, four links to whatever, away we go, right? 15749 12:27:56,720 --> 12:27:59,720 One goes to four, four goes to one. 15750 12:27:59,720 --> 12:28:02,720 I should have probably put a uniqueness constraint on that. 15751 12:28:02,720 --> 12:28:06,720 It's not supposed to duplicate that. 15752 12:28:06,720 --> 12:28:10,720 Okay, so let's run this a bunch of times now. 15753 12:28:10,720 --> 12:28:18,720 So let's just run it 100 times for 100 pages. 15754 12:28:18,720 --> 12:28:21,720 It'll take a minute. 15755 12:28:21,720 --> 12:28:24,720 So you'll see it's like freaking out on certain pages 15756 12:28:24,720 --> 12:28:27,720 and not parsing them. 15757 12:28:27,720 --> 12:28:32,720 It's finding its way into my blog. 15758 12:28:32,720 --> 12:28:34,720 It's finding like 27 links.
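On the duplicated link rows and the remark about a uniqueness constraint: one possible way to add it, sketched here as an assumption rather than as the course's actual schema, is a UNIQUE constraint across the pair of columns. With that in place, INSERT OR IGNORE quietly drops a repeated (from_id, to_id) pair.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # assumption: in-memory for the sketch
cur = conn.cursor()

# Assumed variant of the Links table: the pair of columns must be unique.
cur.execute("""CREATE TABLE Links (
    from_id INTEGER,
    to_id INTEGER,
    UNIQUE (from_id, to_id)
)""")

# Page 1 links to page 4 twice; the second copy is ignored.
for pair in [(1, 4), (4, 1), (1, 4)]:
    cur.execute("INSERT OR IGNORE INTO Links (from_id, to_id) VALUES (?, ?)", pair)
conn.commit()

rows = cur.execute(
    "SELECT from_id, to_id FROM Links ORDER BY from_id").fetchall()
print(rows)  # [(1, 4), (4, 1)]
```

Without the constraint, INSERT OR IGNORE has nothing to conflict with, so every insert succeeds and duplicates pile up.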
15759 12:28:34,720 --> 12:28:39,720 This table is growing wildly at this point. 15760 12:28:39,720 --> 12:28:41,720 It's gonna take us a while before we get to 100. 15761 12:28:41,720 --> 12:28:42,720 It's kind of slow. 15762 12:28:42,720 --> 12:28:45,720 Now the interesting thing is I can hit Ctrl-C 15763 12:28:45,720 --> 12:28:48,720 at any point in time. 15764 12:28:48,720 --> 12:28:49,720 Right? 15765 12:28:49,720 --> 12:28:51,720 And so that blew up. 15766 12:28:51,720 --> 12:28:53,720 But it's okay because the data is still there 15767 12:28:53,720 --> 12:28:55,720 and if we go back to pages, for example, 15768 12:28:55,720 --> 12:28:58,720 and we refresh our data, we see we got a ton of stuff. 15769 12:28:58,720 --> 12:29:01,720 And this will restart and all the things, 15770 12:29:01,720 --> 12:29:04,720 so if I sort this by HTML, 15771 12:29:04,720 --> 12:29:06,720 you see that there's lots of files that we've got 15772 12:29:06,720 --> 12:29:08,720 and it's never gonna retrieve those again 15773 12:29:08,720 --> 12:29:10,720 because those have HTML. 15774 12:29:10,720 --> 12:29:14,720 So then I can run this thing again and start it up. 15775 12:29:14,720 --> 12:29:17,720 And when I say Ctrl-C, your computer might go down, 15776 12:29:17,720 --> 12:29:18,720 your network might go down. 15777 12:29:18,720 --> 12:29:21,720 There's all kinds of things that might happen 15778 12:29:21,720 --> 12:29:22,720 and you just pick up where it leaves off. 15779 12:29:22,720 --> 12:29:24,720 It just picks up where it leaves off 15780 12:29:24,720 --> 12:29:26,720 and that's what's nice about this. 15781 12:29:26,720 --> 12:29:27,720 Okay? 15782 12:29:27,720 --> 12:29:32,720 So that's pretty much how this works. 15783 12:29:32,720 --> 12:29:35,720 We've got this part running. 15784 12:29:35,720 --> 12:29:37,720 We're seeing it flow into spider.sqlite. 15785 12:29:37,720 --> 12:29:40,720 We're seeing that we can start this and restart this.
15786 12:29:40,720 --> 12:29:44,720 And so what I'll do is I will come back in the next video 15787 12:29:44,720 --> 12:29:47,720 and show you how all these things work together 15788 12:29:47,720 --> 12:29:51,720 and then how we actually do the page rank. 15789 12:29:51,720 --> 12:29:55,720 So thanks again for listening and see you in the next video. 15790 12:30:03,720 --> 12:30:04,720 We're picking up in the middle here 15791 12:30:04,720 --> 12:30:09,720 where we are running a simple spider that's retrieving data 15792 12:30:09,720 --> 12:30:14,720 and putting it into running this spider.py file 15793 12:30:14,720 --> 12:30:17,720 and it's cruising around and doing things. 15794 12:30:17,720 --> 12:30:19,720 And the beauty of any of these spider processes 15795 12:30:19,720 --> 12:30:23,720 is I can stop any time and just hit control C. 15796 12:30:23,720 --> 12:30:29,720 And so we take a look at the spider.sqlite file 15797 12:30:29,720 --> 12:30:30,720 and retrieve it. 15798 12:30:30,720 --> 12:30:33,720 And it looks like we've got 302 pages. 15799 12:30:33,720 --> 12:30:36,720 I don't know how many we've got retrieved. 15800 12:30:36,720 --> 12:30:37,720 70. 15801 12:30:37,720 --> 12:30:39,720 Okay, there we go. 15802 12:30:39,720 --> 12:30:43,720 We've got about 100. 15803 12:30:43,720 --> 12:30:45,720 Oh wait, I'm looking for the wrong thing. 15804 12:30:45,720 --> 12:30:48,720 No, no, no, no, no. 15805 12:30:48,720 --> 12:30:51,720 Yeah, we've got about 107 pages. 15806 12:30:51,720 --> 12:30:54,720 So what we're going to do now with 107 pages 15807 12:30:54,720 --> 12:30:59,720 is we are going to run the page rank algorithm. 15808 12:30:59,720 --> 12:31:01,720 Okay, so let's take a look at that code. 15809 12:31:01,720 --> 12:31:06,720 So the idea of page rank, 15810 12:31:06,720 --> 12:31:09,720 we're going to run this page rank algorithm. 
15811 12:31:09,720 --> 12:31:12,720 The spreset just resets the page rank 15812 12:31:12,720 --> 12:31:15,720 and sprank runs as many iterations of page rank. 15813 12:31:15,720 --> 12:31:20,720 So the basic idea is that if you were to look at the links here, 15814 12:31:20,720 --> 12:31:25,720 we think of page 1 pointing to page 2 15815 12:31:25,720 --> 12:31:28,720 gives some of page 1's love to page 2. 15816 12:31:28,720 --> 12:31:33,720 Page 4 has some value that it gives to page 1. 15817 12:31:33,720 --> 12:31:37,720 You go on and page 2 gives love to page 46 15818 12:31:37,720 --> 12:31:39,720 over and over and over again. 15819 12:31:39,720 --> 12:31:42,720 But the problem is that how good is page 1 15820 12:31:42,720 --> 12:31:47,720 and how much positive karma does it give to page 2? 15821 12:31:47,720 --> 12:31:54,720 And so what happens is we start by giving every page a rank of 1. 15822 12:31:54,720 --> 12:31:57,720 We say, look, everybody starts out equal. 15823 12:31:57,720 --> 12:31:59,720 But then what we do is we divide up 15824 12:31:59,720 --> 12:32:02,720 in one iteration of the page rank algorithm, 15825 12:32:02,720 --> 12:32:06,720 we divide up the goodness of a page across its outbound links 15826 12:32:06,720 --> 12:32:11,720 and then accumulate that and that becomes the next rank. 15827 12:32:11,720 --> 12:32:19,720 So let's take a look at the code for the page rank algorithm. 15828 12:32:19,720 --> 12:32:21,720 So this is pretty simple. 15829 12:32:21,720 --> 12:32:25,720 It only imports SQLite 3 because it's really doing everything in the database. 15830 12:32:25,720 --> 12:32:30,720 It's going to be updating these columns right here in the database. 15831 12:32:30,720 --> 12:32:36,720 So we're going to do some things here to speed this up. 
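The idea of one iteration, dividing each page's goodness across its outbound links and accumulating what arrives, can be sketched in plain Python. This is a simplified illustration, not sprank.py: it works on in-memory data instead of the database, the function names pagerank_step and average_change are made up here, and the three-page graph is invented.

```python
def pagerank_step(prev_ranks, links):
    """Run one simplified iteration: split each rank across outbound links."""
    next_ranks = {node: 0.0 for node in prev_ranks}
    for node, old_rank in prev_ranks.items():
        # Outbound links of this page that point at pages we know about.
        targets = [to_id for (from_id, to_id) in links
                   if from_id == node and to_id in prev_ranks]
        if not targets:
            continue
        share = old_rank / len(targets)
        for to_id in targets:
            next_ranks[to_id] += share  # accumulate the incoming goodness
    return next_ranks


def average_change(prev_ranks, next_ranks):
    """Average absolute change per page between two iterations."""
    total = sum(abs(prev_ranks[n] - next_ranks[n]) for n in prev_ranks)
    return total / len(prev_ranks)


# Invented example: page 1 links to 2 and 3, page 2 links to 1 and 3,
# page 3 links back to 1.
links = [(1, 2), (1, 3), (2, 1), (2, 3), (3, 1)]
start = {1: 1.0, 2: 1.0, 3: 1.0}

first = pagerank_step(start, links)
print(first)  # {1: 1.5, 2: 0.5, 3: 1.0}

ranks = start
for i in range(10):
    new_ranks = pagerank_step(ranks, links)
    print(i + 1, average_change(ranks, new_ranks))
    ranks = new_ranks
```

Page 1 ends up ahead after the first step because two pages give it a share, which is the "love" flowing along links that the lecture describes. Printing the average change per iteration is how you would watch the numbers settle.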
15832 12:32:36,720 --> 12:32:41,720 This rank runs, if you're thinking of Google, this rank runs slowly 15833 12:32:41,720 --> 12:32:45,720 and is going to run continuously to keep updating these things. 15834 12:32:45,720 --> 12:32:50,720 So the first thing I do is I read in all of the from IDs from the links. 15835 12:32:50,720 --> 12:32:54,720 Select distinct throws out any duplicates. 15836 12:32:54,720 --> 12:32:58,720 And so I have all the from IDs, 15837 12:32:58,720 --> 12:33:02,720 which are all the pages that have links to other pages 15838 12:33:02,720 --> 12:33:05,720 because all the pages are in pages, 15839 12:33:05,720 --> 12:33:09,720 but in links to have a from ID, you have to also have a to ID. 15840 12:33:09,720 --> 12:33:15,720 And so we're also going to look at the pages that receive page rank 15841 12:33:15,720 --> 12:33:17,720 and we're kind of precaching this stuff. 15842 12:33:17,720 --> 12:33:21,720 So we're going to do a select distinct of from ID and to ID 15843 12:33:21,720 --> 12:33:23,720 and loop through that group of things. 15844 12:33:23,720 --> 12:33:27,720 And we're making a links list here. 15845 12:33:27,720 --> 12:33:31,720 And so we're saying if the from ID is the same as the to ID, 15846 12:33:31,720 --> 12:33:36,720 we're not interested if the from ID is not already in my from IDs 15847 12:33:36,720 --> 12:33:37,720 that I've got. 15848 12:33:37,720 --> 12:33:38,720 I'm going to skip it. 15849 12:33:38,720 --> 12:33:40,720 If the to ID is not in the from ID, 15850 12:33:40,720 --> 12:33:44,720 meaning that this is a to ID that's not also, 15851 12:33:44,720 --> 12:33:47,720 we don't want links that point off to nowhere 15852 12:33:47,720 --> 12:33:49,720 or point to pages that we haven't retrieved yet. 15853 12:33:49,720 --> 12:33:51,720 And that's what this is saying. 
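The filtering just described can be sketched in a few lines. This is a minimal sketch, not the course's sprank.py itself: it assumes a Links table with from_id and to_id columns, the function name load_links is made up for illustration, and it uses Python sets where the lecture mentions lists (a substitution, to keep the membership checks cheap).

```python
import sqlite3


def load_links(conn):
    """Return (from_ids, to_ids, links), keeping only links between pages that link onward."""
    cur = conn.cursor()

    # Every page that has at least one outbound link; DISTINCT drops duplicates.
    cur.execute("SELECT DISTINCT from_id FROM Links")
    from_ids = {row[0] for row in cur.fetchall()}

    links = []
    to_ids = set()
    cur.execute("SELECT DISTINCT from_id, to_id FROM Links")
    for from_id, to_id in cur.fetchall():
        if from_id == to_id:
            continue  # a page linking to itself is not interesting
        if to_id not in from_ids:
            continue  # the link points at a page with no outbound links of its own
        links.append((from_id, to_id))
        to_ids.add(to_id)
    return from_ids, to_ids, links


# Tiny in-memory example (table layout is an assumption for this sketch).
demo_conn = sqlite3.connect(":memory:")
demo_conn.execute("CREATE TABLE Links (from_id INTEGER, to_id INTEGER)")
demo_conn.executemany(
    "INSERT INTO Links VALUES (?, ?)", [(1, 2), (2, 1), (1, 1), (2, 3)]
)
demo_from, demo_to, demo_links = load_links(demo_conn)
print(sorted(demo_links))
```

Here the self-link (1, 1) and the link to page 3, which never links anywhere, are both dropped before any ranking happens.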
15854 12:33:51,720 --> 12:33:53,720 So this is really going to give us, 15855 12:33:53,720 --> 12:33:58,720 it's a filter on the from IDs and the to IDs from the links table 15856 12:33:58,720 --> 12:34:02,720 so that they are only the links that point to another page we've already retrieved. 15857 12:34:02,720 --> 12:34:07,720 And then we're going to keep track of the entire superset of to IDs, 15858 12:34:07,720 --> 12:34:09,720 the destination IDs. 15859 12:34:09,720 --> 12:34:10,720 And I'm just putting these all in lists 15860 12:34:10,720 --> 12:34:13,720 so that I don't have to hit the database so hard. 15861 12:34:13,720 --> 12:34:16,720 Okay, so this is getting what's called the strongly connected component, 15862 12:34:16,720 --> 12:34:18,720 meaning that any of these IDs, 15863 12:34:18,720 --> 12:34:23,720 there is a path from every ID to every other ID eventually. 15864 12:34:23,720 --> 12:34:27,720 So that's called the strongly connected component in graph theory. 15865 12:34:27,720 --> 12:34:30,720 Then what we're going to do is we're going to grab the, 15866 12:34:30,720 --> 12:34:34,720 we're going to select new rank from pages 15867 12:34:34,720 --> 12:34:40,720 where for all the from IDs, right? 15868 12:34:40,720 --> 12:34:45,720 And so we're going to have a dictionary that's based on the ID, 15869 12:34:45,720 --> 12:34:49,720 the primary key, that's what node is, equals the rank. 15870 12:34:49,720 --> 12:34:52,720 And so if we look at our database, 15871 12:34:52,720 --> 12:34:57,720 that means that for the part of the strongly connected component in links, 15872 12:34:57,720 --> 12:35:00,720 we're going to grab this number and stick it into a dictionary 15873 12:35:00,720 --> 12:35:06,720 based on the primary key of this, 15874 12:35:06,720 --> 12:35:09,720 based on the primary key, this number right here. 15875 12:35:09,720 --> 12:35:12,720 So we're going to have a dictionary that's this map to that.
15876 12:35:12,720 --> 12:35:15,720 Again, we want to do this as fast as possible. 15877 12:35:15,720 --> 12:35:17,720 Now we're only doing one iteration at the beginning, 15878 12:35:17,720 --> 12:35:21,720 so it asks how many times you want to run it, okay? 15879 12:35:21,720 --> 12:35:25,720 And so we just make an integer of that. 15880 12:35:25,720 --> 12:35:28,720 We check to see if there's any values in there. 15881 12:35:28,720 --> 12:35:31,720 If there are no values, we are bad. 15882 12:35:31,720 --> 12:35:34,720 And now we're going to go I equals one to range many. 15883 12:35:34,720 --> 12:35:38,720 This is going to be one to one, so it might run however many times. 15884 12:35:38,720 --> 12:35:44,720 And then what it's going to do is it's going to compute the new page ranks. 15885 12:35:44,720 --> 12:35:49,720 And so what it's really going to do is it's going to take the previous ranks 15886 12:35:49,720 --> 12:35:56,720 and loop through them, and the previous ranks 15887 12:35:56,720 --> 12:36:02,720 is the mapping of primary key to old page rank, okay? 15888 12:36:02,720 --> 12:36:07,720 And for each node, we're going to have total equals total plus old rank, 15889 12:36:07,720 --> 12:36:15,720 and then we're going to set the next ranks to be zero, okay? 15890 12:36:15,720 --> 12:36:18,720 And then what we're going to do is figure out the number of outbound links 15891 12:36:18,720 --> 12:36:25,720 for each page rank item, so node and old rank in the list of the previous ranks. 15892 12:36:25,720 --> 12:36:27,720 These are the IDs we're going to give it to, 15893 12:36:27,720 --> 12:36:33,720 and so for this particular node, we're going to have the outbound links, 15894 12:36:33,720 --> 12:36:40,720 and we're going to go through the links and not link to itself, 15895 12:36:40,720 --> 12:36:42,720 although we made sure that doesn't happen. 
15896 12:36:42,720 --> 12:36:46,720 We make sure that this, but then we're going to make a list called give IDs, 15897 12:36:46,720 --> 12:36:52,720 which are the IDs that node is going to share its goodness. 15898 12:36:52,720 --> 12:36:55,720 And now what we're going to do is we're going to say how much goodness 15899 12:36:55,720 --> 12:37:00,720 are we going to flow outbound based on our previous rank of this particular node 15900 12:37:00,720 --> 12:37:03,720 and the number of outbound links we have. 15901 12:37:03,720 --> 12:37:09,720 So that's how much we're going to give in our outbound links. 15902 12:37:09,720 --> 12:37:13,720 And then what we're doing is all the IDs we're giving it to, 15903 12:37:13,720 --> 12:37:17,720 we started with the next ranks being zero for these folks. 15904 12:37:17,720 --> 12:37:22,720 These are the receiving end, and we're going to add the amount of page rank 15905 12:37:22,720 --> 12:37:24,720 to each one, so whatever this is. 15906 12:37:24,720 --> 12:37:28,720 So we'll go through all of the links, 15907 12:37:28,720 --> 12:37:31,720 give out fractional bits of our current goodness, 15908 12:37:31,720 --> 12:37:34,720 and it's accumulated in each one, 15909 12:37:34,720 --> 12:37:42,720 and so eventually all the incoming links will have granted each new link value. 15910 12:37:42,720 --> 12:37:46,720 Now I'm just going to run through and calculate the new total, 15911 12:37:46,720 --> 12:37:56,720 and this evaporation, the idea is that it has to do with the page rank algorithm 15912 12:37:56,720 --> 12:38:00,720 that there are dysfunctional shapes in which page rank can be trapped, 15913 12:38:00,720 --> 12:38:05,720 and this evaporation is taking a fraction away from everyone 15914 12:38:05,720 --> 12:38:07,720 and giving it back to everybody else. 
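One pass of the loop described above can be written as a standalone function. This is a sketch under assumptions: the name pagerank_step is made up for illustration, and the "evaporation" here simply spreads any rank that was not handed out evenly across all nodes, which is one reading of the description rather than a definitive implementation.

```python
def pagerank_step(prev_ranks, links):
    """Run one iteration; return (next_ranks, average change per node)."""
    total = sum(prev_ranks.values())
    next_ranks = {node: 0.0 for node in prev_ranks}  # receivers start at zero

    for node, old_rank in prev_ranks.items():
        # IDs this node shares its goodness with (no self-links, known pages only).
        give_ids = [
            to_id
            for from_id, to_id in links
            if from_id == node and to_id != node and to_id in prev_ranks
        ]
        if not give_ids:
            continue
        amount = old_rank / len(give_ids)  # split evenly across outbound links
        for to_id in give_ids:
            next_ranks[to_id] += amount

    # Evaporation: whatever rank was not passed along is shared by everyone,
    # so the overall total stays the same from one iteration to the next.
    evap = (total - sum(next_ranks.values())) / len(next_ranks)
    for node in next_ranks:
        next_ranks[node] += evap

    avg_diff = sum(
        abs(prev_ranks[node] - next_ranks[node]) for node in prev_ranks
    ) / len(prev_ranks)
    return next_ranks, avg_diff


# Everybody starts out equal with a rank of 1.
step_ranks, step_diff = pagerank_step(
    {1: 1.0, 2: 1.0, 3: 1.0}, [(1, 2), (1, 3), (2, 3), (3, 1)]
)
print(step_ranks, step_diff)
```

Calling the function repeatedly, feeding each result back in as prev_ranks, gives the in-memory loop; the average change is the number to watch for convergence.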
15915 12:38:07,720 --> 12:38:12,720 And so we add this evaporative factor, 15916 12:38:12,720 --> 12:38:17,720 and then we're going to do some computations just to show some stuff, 15917 12:38:17,720 --> 12:38:24,720 and that is we're calculating the average difference between the page ranks, 15918 12:38:24,720 --> 12:38:26,720 and you'll see this when I start running it, 15919 12:38:26,720 --> 12:38:32,720 and this is going to tell us the stability of the page rank. 15920 12:38:32,720 --> 12:38:36,720 So from one iteration to the next, the more it changes, the less stable it is, 15921 12:38:36,720 --> 12:38:39,720 and you'll see in a sec that these things stabilize, 15922 12:38:39,720 --> 12:38:43,720 and we say what's the average difference in the page ranks per node, 15923 12:38:43,720 --> 12:38:46,720 which is what this is, and that's what we're going to print, 15924 12:38:46,720 --> 12:38:51,720 and now we're going to take the new ranks and make them the old ranks 15925 12:38:51,720 --> 12:38:53,720 and then run the loop again. 15926 12:38:53,720 --> 12:38:58,720 So I'm not actually updating the database each time through the page rank iteration, 15927 12:38:58,720 --> 12:39:03,720 but then at the very end I am going to do the update for all of these things 15928 12:39:03,720 --> 12:39:07,720 and update all of the rankings with a new rank. 15929 12:39:07,720 --> 12:39:13,720 So I'm doing an in-memory calculation so that this loop here runs screamingly fast. 15930 12:39:13,720 --> 12:39:17,720 Even if I want to do this loop 100 times or 1000 times, 15931 12:39:17,720 --> 12:39:21,720 it's really all just in-memory data structures. 15932 12:39:21,720 --> 12:39:24,720 Okay, so it's probably easier just for me to show you this. 15933 12:39:24,720 --> 12:39:28,720 The code runs quite simply. 15934 12:39:28,720 --> 12:39:34,720 python3 15935 12:39:34,720 --> 12:39:39,720 sprank.py.
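The "update only at the very end" step might look something like this. It is a hedged sketch: the Pages table with id, old_rank and new_rank columns is assumed from the description, and save_ranks is a hypothetical helper name, not something taken from the course code.

```python
import sqlite3


def save_ranks(conn, next_ranks):
    """Copy new_rank into old_rank, then store the freshly computed ranks."""
    cur = conn.cursor()
    cur.execute("UPDATE Pages SET old_rank = new_rank")
    cur.executemany(
        "UPDATE Pages SET new_rank = ? WHERE id = ?",
        [(rank, node) for node, rank in next_ranks.items()],
    )
    conn.commit()  # one commit after the whole in-memory calculation


# In-memory example: two pages that both start with a new_rank of 1.
save_conn = sqlite3.connect(":memory:")
save_conn.execute(
    "CREATE TABLE Pages (id INTEGER PRIMARY KEY, old_rank REAL, new_rank REAL)"
)
save_conn.executemany(
    "INSERT INTO Pages VALUES (?, ?, ?)", [(1, 0.0, 1.0), (2, 0.0, 1.0)]
)
save_ranks(save_conn, {1: 1.5, 2: 0.5})
saved_rows = save_conn.execute(
    "SELECT id, old_rank, new_rank FROM Pages ORDER BY id"
).fetchall()
print(saved_rows)
```

Touching the database once per run, rather than once per iteration, is what keeps the ranking loop itself fast.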
15936 12:39:39,720 --> 12:39:41,720 And so I'm only going to run it for one iteration, 15937 12:39:41,720 --> 12:39:46,720 and that means that this loop here is just going to run one time. 15938 12:39:46,720 --> 12:39:54,720 And so it's going to start with the page ranks of the new rank of one, 15939 12:39:54,720 --> 12:39:58,720 and it's going to just run one iteration and put the rank there. 15940 12:39:58,720 --> 12:40:00,720 Okay, and then update this as well. 15941 12:40:00,720 --> 12:40:05,720 So let's go ahead and run that once for one iteration. 15942 12:40:05,720 --> 12:40:08,720 Okay, and so it ran one iteration, 15943 12:40:08,720 --> 12:40:14,720 and the average change between the previous rank and the new rank is one. 15944 12:40:14,720 --> 12:40:16,720 So it's actually quite crazy. 15945 12:40:16,720 --> 12:40:18,720 So I'm going to refresh here, 15946 12:40:18,720 --> 12:40:21,720 and you'll see that the old rank was one, 15947 12:40:21,720 --> 12:40:26,720 and the new rank went way down, way down, way down, way down, 15948 12:40:26,720 --> 12:40:32,720 down a little bit, down some, up a whole bunch. 15949 12:40:32,720 --> 12:40:33,720 Down, down, up. 15950 12:40:33,720 --> 12:40:35,720 So you see that they went down and up. 15951 12:40:35,720 --> 12:40:39,720 Now the sum of all of these numbers is going to be the same, right? 15952 12:40:39,720 --> 12:40:44,720 Because all it did was like float it out and recalculate it. 15953 12:40:44,720 --> 12:40:47,720 And so that's what happens with PageRank. 15954 12:40:47,720 --> 12:40:50,720 And so what will happen is if I run one more PageRank iteration, 15955 12:40:50,720 --> 12:40:55,720 this number will, these numbers will be used to compute the new new rank, 15956 12:40:55,720 --> 12:40:57,720 and then these will be calculated to the old rank. 15957 12:40:57,720 --> 12:41:00,720 And so you'll see that these will get, they will change again. 
15958 12:41:00,720 --> 12:41:05,720 So I'll just run it one more time. 15959 12:41:05,720 --> 12:41:09,720 So I'm going to run one iteration, and then I'm going to hit refresh. 15960 12:41:09,720 --> 12:41:12,720 So you see all these numbers got copied over, 15961 12:41:12,720 --> 12:41:17,720 but now there's a new rank that's computed based on these guys. 15962 12:41:17,720 --> 12:41:19,720 And so they're getting, this one went up. 15963 12:41:19,720 --> 12:41:20,720 This was 0.13. 15964 12:41:20,720 --> 12:41:21,720 That's gone up a little bit. 15965 12:41:21,720 --> 12:41:23,720 This one's gone up some more. 15966 12:41:23,720 --> 12:41:24,720 This one's gone up. 15967 12:41:24,720 --> 12:41:26,720 This one went down, right? 15968 12:41:26,720 --> 12:41:28,720 So this one went down from 6 to 8. 15969 12:41:28,720 --> 12:41:33,720 And you can see that the difference is now the average difference between 15970 12:41:33,720 --> 12:41:39,720 this number and this number across all of them went from 1 point something to 0.41. 15971 12:41:39,720 --> 12:41:41,720 And you'll see that with these very few pages, 15972 12:41:41,720 --> 12:41:47,720 this PageRank converges really quickly, okay? 15973 12:41:47,720 --> 12:41:49,720 So let's run it again. 15974 12:41:49,720 --> 12:41:53,720 And I'll just run 10, and you will watch how this converges, okay? 15975 12:41:53,720 --> 12:41:54,720 So there you go. 15976 12:41:54,720 --> 12:41:56,720 It converges. 15977 12:41:56,720 --> 12:41:59,720 And you're seeing now after like 12 iterations 15978 12:41:59,720 --> 12:42:06,720 that the difference between the old rank and the new rank, 15979 12:42:06,720 --> 12:42:08,720 well, that's because it's that old rank. 15980 12:42:08,720 --> 12:42:11,720 I'll run one more iteration so that you can see. 15981 12:42:11,720 --> 12:42:15,720 So this old rank is less than 0.005. 15982 12:42:15,720 --> 12:42:19,720 And so now you can see that these numbers are sort of stabilizing. 
15983 12:42:19,720 --> 12:42:20,720 This is the average. 15984 12:42:20,720 --> 12:42:24,720 That 0.005 number is the average difference between these two things. 15985 12:42:24,720 --> 12:42:27,720 Now, if we're going to pretend to be Google for a moment, 15986 12:42:27,720 --> 12:42:32,720 we can say python3 spider.py. 15987 12:42:36,720 --> 12:42:38,720 So let's just do 10 more pages. 15988 12:42:38,720 --> 12:42:41,720 Now what's going to happen here is these new pages 15989 12:42:41,720 --> 12:42:44,720 are going to have PageRanks of 1, okay? 15990 12:42:44,720 --> 12:42:48,720 So let's get out. 15991 12:42:48,720 --> 12:42:52,720 So if I do a refresh now, and I look at new rank. 15992 12:42:52,720 --> 12:42:55,720 So there's these guys that have high rank. 15993 12:42:55,720 --> 12:42:58,720 What you'll see, I hope, if we, yeah, okay. 15994 12:42:58,720 --> 12:43:00,720 So you see new pages, right? 15995 12:43:00,720 --> 12:43:02,720 These are the new ones that we just retrieved. 15996 12:43:02,720 --> 12:43:05,720 I don't know if they're linked or not, and they all got one. 15997 12:43:05,720 --> 12:43:08,720 So some old pages are way up, 14. 15998 12:43:08,720 --> 12:43:11,720 Some pages, if we go downwards, are way down, right? 15999 12:43:11,720 --> 12:43:13,720 So these are like useless pages. 16000 12:43:13,720 --> 12:43:16,720 They, you know, they point to somewhere, but nobody points to them. 16001 12:43:16,720 --> 12:43:18,720 That's what happens with these PageRanks, okay? 16002 12:43:18,720 --> 12:43:23,720 So what happens is the new records get this 1.0. 16003 12:43:23,720 --> 12:43:29,720 And so if I run the ranking code again, and I run, let's just run five iterations, 16004 12:43:29,720 --> 12:43:33,720 you'll see that the average delta goes up just briefly 16005 12:43:33,720 --> 12:43:36,720 as it sort of assimilates these new pages, 16006 12:43:36,720 --> 12:43:38,720 and then it goes right back down again.
16007 12:43:38,720 --> 12:43:39,720 And so that's what's happening with Google. 16008 12:43:39,720 --> 12:43:42,720 It's sort of running the spider to get more pages, 16009 12:43:42,720 --> 12:43:45,720 then running the PageRank, which gets disturbed a little bit, 16010 12:43:45,720 --> 12:43:47,720 but then it reconverges very rapidly. 16011 12:43:47,720 --> 12:43:50,720 And of course, they've got billions of pages, and we've got hundreds of pages, 16012 12:43:50,720 --> 12:43:52,720 but you get the idea, okay? 16013 12:43:52,720 --> 12:43:56,720 And so I can run PageRank like 100 times, 16014 12:43:56,720 --> 12:43:59,720 and after a while, it just sort of hardly is changing. 16015 12:43:59,720 --> 12:44:03,720 So that's 2.7 to the negative 10th power. 16016 12:44:03,720 --> 12:44:08,720 So now, you know, let me run it one more time to update the stuff. 16017 12:44:08,720 --> 12:44:13,720 And if I refresh this, you're going to see, look at how stable these numbers are. 16018 12:44:13,720 --> 12:44:19,720 14, 9, 4, 3, 5, 9, 1, 5, 6, 7. 16019 12:44:19,720 --> 12:44:22,720 The difference is they're in the seventh one. 16020 12:44:22,720 --> 12:44:24,720 So that's why this whole PageRank is really cool. 16021 12:44:24,720 --> 12:44:27,720 It seems like it's really chaotic when it first starts out, 16022 12:44:27,720 --> 12:44:30,720 and away you go, okay? 16023 12:44:30,720 --> 12:44:36,720 So that was just this, SPRank, right? 16024 12:44:36,720 --> 12:44:40,720 SPRank, and SPReset, we can look at that code. 16025 12:44:40,720 --> 12:44:42,720 I won't bother running it. 16026 12:44:42,720 --> 12:44:45,720 It just sets the old rank to 1. 16027 12:44:45,720 --> 12:44:46,720 That's it. 16028 12:44:46,720 --> 12:44:47,720 That's as much code as you've got. 16029 12:44:47,720 --> 12:44:50,720 It just starts it and lets it rerun. 
16030 12:44:50,720 --> 12:44:54,720 So I'm going to stop now, and I'm going to start a new video, 16031 12:44:54,720 --> 12:44:56,720 where I should talk about this phase here, 16032 12:44:56,720 --> 12:45:06,720 where we're actually going to visualize the PageRank data. 16033 12:45:06,720 --> 12:45:11,720 And what we are in the middle of is we're in the middle of the PageRank code, 16034 12:45:11,720 --> 12:45:14,720 and we just got done running the PageRank, 16035 12:45:14,720 --> 12:45:17,720 and so we have spidered the code. 16036 12:45:17,720 --> 12:45:19,720 We've run PageRank a bunch of times. 16037 12:45:19,720 --> 12:45:22,720 spreset allows us to restart the PageRank algorithm if we want, 16038 12:45:22,720 --> 12:45:24,720 but we're not going to play with that. 16039 12:45:24,720 --> 12:45:27,720 We're just going to play with spdump and spjson and do the visualization, 16040 12:45:27,720 --> 12:45:29,720 which is the fun part. 16041 12:45:29,720 --> 12:45:32,720 So I'll go into spdump. 16042 12:45:32,720 --> 12:45:34,720 So this is simple code, 16043 12:45:34,720 --> 12:45:38,720 because it's really just running a SQL query and then printing stuff out, right? 16044 12:45:38,720 --> 12:45:41,720 So we connect to our database, create a cursor, 16045 12:45:41,720 --> 12:45:44,720 and then just do a select count, 16046 12:45:44,720 --> 12:45:48,720 and we're going to just show the number of links. 16047 12:45:48,720 --> 12:45:51,720 We're going to order by the number of inbound links descending 16048 12:45:51,720 --> 12:45:55,720 so we see the most linked things, and we'll see the top 50 of that. 16049 12:45:55,720 --> 12:45:56,720 So this is just a sample. 16050 12:45:56,720 --> 12:46:00,720 You'll tend to write little helpers like this that make your life easier 16051 12:46:00,720 --> 12:46:05,720 just to show you the kinds of things that you want, spdump.py.
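A little helper of the kind described here could look roughly like this. It is a sketch, assuming Pages(id, url) and Links(from_id, to_id) tables; the function name top_inbound and the exact column list are illustrative, not copied from spdump.py.

```python
import sqlite3


def top_inbound(conn, limit=50):
    """Return (inbound_count, id, url) rows, most-linked pages first."""
    cur = conn.cursor()
    cur.execute(
        """SELECT COUNT(Links.from_id) AS inbound, Pages.id, Pages.url
           FROM Pages JOIN Links ON Pages.id = Links.to_id
           GROUP BY Pages.id
           ORDER BY inbound DESC, Pages.id
           LIMIT ?""",
        (limit,),
    )
    return cur.fetchall()


# In-memory example: page 2 has two inbound links, page 1 has one, page 3 none.
dump_conn = sqlite3.connect(":memory:")
dump_conn.execute("CREATE TABLE Pages (id INTEGER PRIMARY KEY, url TEXT)")
dump_conn.execute("CREATE TABLE Links (from_id INTEGER, to_id INTEGER)")
dump_conn.executemany("INSERT INTO Pages VALUES (?, ?)", [(1, "a"), (2, "b"), (3, "c")])
dump_conn.executemany("INSERT INTO Links VALUES (?, ?)", [(1, 2), (3, 2), (2, 1)])
dump_rows = top_inbound(dump_conn)
for row in dump_rows:
    print(row)
```

Because the join only keeps pages that appear as a to_id, pages with no inbound links never show up in the listing.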
16052 12:46:05,720 --> 12:46:07,720 And you just kind of test to make sure that it's like, 16053 12:46:07,720 --> 12:46:09,720 oh, this looks right to me. 16054 12:46:09,720 --> 12:46:12,720 And so here is the number of inbound links. 16055 12:46:12,720 --> 12:46:15,720 So that's my blog that has the most inbound links, 16056 12:46:15,720 --> 12:46:18,720 followed by my uncategorized, whatever that is. 16057 12:46:18,720 --> 12:46:23,720 And these are the number of inbound links within my own blog somehow. 16058 12:46:23,720 --> 12:46:29,720 I don't know, because this is not looking at the whole internet at all. 16059 12:46:29,720 --> 12:46:31,720 So there we go. 16060 12:46:31,720 --> 12:46:32,720 So that's spdump. 16061 12:46:32,720 --> 12:46:34,720 Pretty straightforward. 16062 12:46:34,720 --> 12:46:37,720 And now we're going to go through the visualization process. 16063 12:46:37,720 --> 12:46:40,720 And so this is going to look at all that data 16064 12:46:40,720 --> 12:46:43,720 and produce a JavaScript file. 16065 12:46:43,720 --> 12:46:45,720 It's going to write a JavaScript file 16066 12:46:45,720 --> 12:46:49,720 that will then be fed into my visualization using D3. 16067 12:46:49,720 --> 12:46:57,720 And spjson is going to do a big, long join. 16068 12:46:57,720 --> 12:46:59,720 It joins the links with the thing. 16069 12:46:59,720 --> 12:47:01,720 And HTML is not null. 16070 12:47:01,720 --> 12:47:03,720 And error is not null. 16071 12:47:03,720 --> 12:47:05,720 You know, order by the number of inbound links. 16072 12:47:05,720 --> 12:47:10,720 So we're looking at the things that have the highest number of inbound links. 16073 12:47:10,720 --> 12:47:14,720 We're going to read all this stuff. 16074 12:47:14,720 --> 12:47:18,720 We're going to read through all those rows 16075 12:47:18,720 --> 12:47:21,720 and pull out the page rank for each one. 
16076 12:47:21,720 --> 12:47:24,720 We are looking for the highest and lowest rank 16077 12:47:24,720 --> 12:47:27,720 because these numbers can vary quite widely. 16078 12:47:27,720 --> 12:47:31,720 They go all the way from 0.000 to 20 or 30. 16079 12:47:31,720 --> 12:47:35,720 And so it asks, how many do you want to do? 16080 12:47:35,720 --> 12:47:38,720 So it only does the top, like 20 or something. 16081 12:47:38,720 --> 12:47:41,720 And you'll see why we need that in the visualization. 16082 12:47:41,720 --> 12:47:44,720 And so this is just checking. 16083 12:47:44,720 --> 12:47:46,720 And so we're going to write out a file. 16084 12:47:46,720 --> 12:47:48,720 We'll see what the format of this is. 16085 12:47:48,720 --> 12:47:51,720 It's just a little, it's just a JavaScript file. 16086 12:47:51,720 --> 12:47:54,720 And we're going to write out, 16087 12:47:54,720 --> 12:47:57,720 we're basically normalizing the rank. 16088 12:47:57,720 --> 12:47:59,720 We're subtracting the minimum rank. 16089 12:47:59,720 --> 12:48:03,720 And because we're going to turn this into line weight, 16090 12:48:03,720 --> 12:48:04,720 the thickness of the line, 16091 12:48:04,720 --> 12:48:07,720 and so we're dividing by, you know, 16092 12:48:07,720 --> 12:48:11,720 we're normalizing the rank to be the thickness of the line 16093 12:48:11,720 --> 12:48:16,720 and the size of the ball. 16094 12:48:16,720 --> 12:48:17,720 You'll see all this. 16095 12:48:17,720 --> 12:48:19,720 And so this is really just writing some JavaScript 16096 12:48:19,720 --> 12:48:22,720 with the little strings and stuff like that. 16097 12:48:22,720 --> 12:48:25,720 And then we're going to finish the JavaScript. 16098 12:48:25,720 --> 12:48:27,720 And then we're going to write all the links out. 16099 12:48:27,720 --> 12:48:29,720 So these are the balls that you'll see. 16100 12:48:29,720 --> 12:48:32,720 And this is showing what, this is drawing all the lines. 
16101 12:48:32,720 --> 12:48:35,720 And this is again normalizing things for thickness 16102 12:48:35,720 --> 12:48:36,720 and printing these things out. 16103 12:48:36,720 --> 12:48:40,720 Now I don't want to go through this in tremendous detail, 16104 12:48:40,720 --> 12:48:47,720 but so I'll do python spjson.py. 16105 12:48:47,720 --> 12:48:50,720 Let's do the top 20 nodes. 16106 12:48:50,720 --> 12:48:55,720 And if I take a look at this file spider.js, 16107 12:48:55,720 --> 12:48:58,720 you can see that it's some objects that basically 16108 12:48:58,720 --> 12:49:01,720 put the page rank in, which ID it is, 16109 12:49:01,720 --> 12:49:04,720 and that's a way for me to be able to link back and forth. 16110 12:49:04,720 --> 12:49:07,720 Weight is how big the little circle is. 16111 12:49:07,720 --> 12:49:08,720 And then I have the links. 16112 12:49:08,720 --> 12:49:11,720 And I only asked for the top 20. 16113 12:49:11,720 --> 12:49:15,720 And then this is the thickness of the line, 16114 12:49:15,720 --> 12:49:18,720 where the line starts, where the line ends. 16115 12:49:18,720 --> 12:49:24,720 So this is read by this HTML file. 16116 12:49:24,720 --> 12:49:31,720 And it's going to read somewhere this force.js file. 16117 12:49:31,720 --> 12:49:36,720 And my own spider.js code, this is some JavaScript. 16118 12:49:36,720 --> 12:49:40,720 I mean, no, the force.js is the visualization code. 16119 12:49:40,720 --> 12:49:43,720 And this is D3, the visualization library. 16120 12:49:43,720 --> 12:49:46,720 So I'm using this D3.js, 16121 12:49:46,720 --> 12:49:49,720 which is a really great visualization library. 16122 12:49:49,720 --> 12:49:51,720 And this is just drawing the circles 16123 12:49:51,720 --> 12:49:53,720 and making the circles of colors 16124 12:49:53,720 --> 12:49:55,720 and making the circles bigger and smaller 16125 12:49:55,720 --> 12:49:57,720 and then connecting all the lines in between it. 
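The normalizing and the "writing some JavaScript with little strings" can be sketched as follows. Everything named here is an assumption for illustration: the scale of 19, the variable name spiderJson, and the exact keys in each object are stand-ins, and the real spjson.py output may be laid out differently.

```python
import json


def scale_rank(rank, min_rank, max_rank, top=19.0):
    """Min-max scale a rank into 0..top so it can drive circle size or line thickness."""
    if max_rank == min_rank:
        return 0.0  # every rank is the same; avoid dividing by zero
    return top * (rank - min_rank) / (max_rank - min_rank)


def to_javascript(pages, links, var_name="spiderJson"):
    """Render (id, rank) pages and (from_id, to_id) links as one JavaScript assignment."""
    ranks = [rank for _, rank in pages]
    lo, hi = min(ranks), max(ranks)
    # Links refer to nodes by their position in the nodes list.
    position = {page_id: i for i, (page_id, _) in enumerate(pages)}
    data = {
        "nodes": [
            {"id": page_id, "rank": scale_rank(rank, lo, hi)}
            for page_id, rank in pages
        ],
        "links": [
            {"source": position[a], "target": position[b]}
            for a, b in links
            if a in position and b in position  # skip links to pages we left out
        ],
    }
    return var_name + " = " + json.dumps(data) + ";"


js_text = to_javascript([(1, 14.0), (2, 0.5), (3, 7.25)], [(1, 2), (2, 3), (3, 9)])
print(js_text)
```

Writing js_text to a file such as spider.js is then a single open-and-write call; the scaling matters because raw ranks can range from nearly zero up to 20 or 30.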
16126 12:49:57,720 --> 12:49:59,720 So this is just there. 16127 12:49:59,720 --> 12:50:01,720 This data feeds that thing. 16128 12:50:01,720 --> 12:50:04,720 And so when we're all done, you simply say open. 16129 12:50:04,720 --> 12:50:05,720 You don't have to do anything. 16130 12:50:05,720 --> 12:50:10,720 Open force.html. 16131 12:50:10,720 --> 12:50:13,720 And so all this beautiful JavaScript stuff is like, 16132 12:50:13,720 --> 12:50:14,720 oh, wow, that's really cool, 16133 12:50:14,720 --> 12:50:16,720 because you can move these things around. 16134 12:50:16,720 --> 12:50:17,720 Whoa. 16135 12:50:17,720 --> 12:50:20,720 You can see the circles are bigger. 16136 12:50:20,720 --> 12:50:21,720 If you hover over it for a while, 16137 12:50:21,720 --> 12:50:24,720 it shows you the big ones. 16138 12:50:24,720 --> 12:50:26,720 You know, you can see these things, and it's kind of cool. 16139 12:50:26,720 --> 12:50:28,720 So I gave you all this force.js 16140 12:50:28,720 --> 12:50:30,720 and force.html. 16141 12:50:30,720 --> 12:50:33,720 And so that kind of visualizes the page rank. 16142 12:50:33,720 --> 12:50:37,720 And you could use this to visualize quite a bit of stuff. 16143 12:50:37,720 --> 12:50:42,720 You know, it'll take you a while to pull down enough data 16144 12:50:42,720 --> 12:50:45,720 from a real website. 16145 12:50:45,720 --> 12:50:47,720 But after you pull down 400 or 500 pages 16146 12:50:47,720 --> 12:50:48,720 if you have some time, 16147 12:50:48,720 --> 12:50:51,720 then the visualization is quite interesting. 16148 12:50:51,720 --> 12:50:54,720 But you can see why we had to pull down several hundred pages 16149 12:50:54,720 --> 12:50:57,720 just to get this much page rank information. 16150 12:50:57,720 --> 12:51:03,720 Okay, so that gives you a sense 16151 12:51:03,720 --> 12:51:08,720 of how to run the page rank code in Python for everybody. 16152 12:51:08,720 --> 12:51:11,720 So thanks for listening. 
16153 12:51:11,720 --> 12:51:16,720 The last visualization application 16154 12:51:16,720 --> 12:51:18,720 that we're going to take a look at is mailing lists, 16155 12:51:18,720 --> 12:51:19,720 and that's kind of ironic. 16156 12:51:19,720 --> 12:51:20,720 We started with the mailing lists, 16157 12:51:20,720 --> 12:51:22,720 and we're going to end with the mailing lists. 16158 12:51:22,720 --> 12:51:23,720 The mailing lists, of course, 16159 12:51:23,720 --> 12:51:25,720 are from my open source Sakai project, 16160 12:51:25,720 --> 12:51:28,720 which I love and am very proud of. 16161 12:51:28,720 --> 12:51:30,720 And so what we're going to do 16162 12:51:30,720 --> 12:51:32,720 is we're going to crawl the archive of a mailing list, 16163 12:51:32,720 --> 12:51:34,720 and then we're going to do two visualizations. 16164 12:51:34,720 --> 12:51:36,720 One is an activity visualization, 16165 12:51:36,720 --> 12:51:38,720 and another is a word cloud. 16166 12:51:38,720 --> 12:51:41,720 So probably the more important thing 16167 12:51:41,720 --> 12:51:44,720 is when I do the demonstration of how the software works. 16168 12:51:44,720 --> 12:51:47,720 So this is a large data set, so you've got to be careful. 16169 12:51:47,720 --> 12:51:50,720 This could spider gmane.org, 16170 12:51:50,720 --> 12:51:52,720 which is a very free and friendly archive. 16171 12:51:52,720 --> 12:51:56,720 This data originally came from gmane.org, 16172 12:51:56,720 --> 12:51:58,720 but I've got a copy of it. 16173 12:51:58,720 --> 12:52:01,720 And so gmane.org is not rate limited, 16174 12:52:01,720 --> 12:52:04,720 but if everyone who is watching this 16175 12:52:04,720 --> 12:52:06,720 starts spidering gmane.org at the same time, 16176 12:52:06,720 --> 12:52:07,720 you will crash it. 16177 12:52:07,720 --> 12:52:09,720 It just doesn't have the horsepower 16178 12:52:09,720 --> 12:52:11,720 to give you this data as fast.
16179 12:52:11,720 --> 12:52:14,720 And so I've got something that can give you the data super fast 16180 12:52:14,720 --> 12:52:17,720 and has no rate limit on a really good server, 16181 12:52:17,720 --> 12:52:19,720 and it's cached all around the world 16182 12:52:19,720 --> 12:52:21,720 using a technology called Cloudflare. 16183 12:52:21,720 --> 12:52:25,720 So please, please, please don't point this at gmane.org. 16184 12:52:25,720 --> 12:52:27,720 Point this at the URL here, 16185 12:52:27,720 --> 12:52:30,720 mbox.dr-chuck.net, et cetera, et cetera. 16186 12:52:30,720 --> 12:52:33,720 And then you can run this as fast as you like. 16187 12:52:33,720 --> 12:52:35,720 Now, another thing to worry about is 16188 12:52:35,720 --> 12:52:39,720 if you have a metered connection. 16189 12:52:39,720 --> 12:52:41,720 So don't do this on a cell phone connection 16190 12:52:41,720 --> 12:52:44,720 because you'll pay thousands of dollars perhaps. 16191 12:52:44,720 --> 12:52:47,720 Make sure you are on a no-cost connection 16192 12:52:47,720 --> 12:52:48,720 before you start running this 16193 12:52:48,720 --> 12:52:50,720 because this is going to pull a lot of data down. 16194 12:52:50,720 --> 12:52:53,720 If you just start this from scratch and you let it run, 16195 12:52:53,720 --> 12:52:56,720 on a super fast connection, 16196 12:52:56,720 --> 12:53:00,720 downloading the whole thing is probably about four hours. 16197 12:53:00,720 --> 12:53:04,720 On my home connection, 16198 12:53:04,720 --> 12:53:07,720 when I had like about a 10 megabit connection, 16199 12:53:07,720 --> 12:53:09,720 it took several days. 16200 12:53:09,720 --> 12:53:12,720 And so just understand that in this one, 16201 12:53:12,720 --> 12:53:15,720 it's both fun to deal with a ton of data, 16202 12:53:15,720 --> 12:53:17,720 and it's scary to deal with a ton of data. 16203 12:53:17,720 --> 12:53:18,720 So this one is big.
16204 12:53:18,720 --> 12:53:22,720 This one is, you'll see the process in action 16205 12:53:22,720 --> 12:53:24,720 because it'll run for a while. 16206 12:53:24,720 --> 12:53:27,720 Everything will take a long time. 16207 12:53:27,720 --> 12:53:30,720 So here's basically the flow of the data 16208 12:53:30,720 --> 12:53:32,720 in this particular one. 16209 12:53:32,720 --> 12:53:34,720 You are going to have the restartable spider 16210 12:53:34,720 --> 12:53:38,720 that talks to the API, mbox.dr-chuck.net, 16211 12:53:38,720 --> 12:53:42,720 which has a scalable copy of all this information. 16212 12:53:42,720 --> 12:53:46,720 And again, it's going to do kind of a raw database, 16213 12:53:46,720 --> 12:53:47,720 not a very clean database. 16214 12:53:47,720 --> 12:53:48,720 It's sort of a mess. 16215 12:53:48,720 --> 12:53:51,720 It's just enough columns to keep track of whether or not 16216 12:53:51,720 --> 12:53:53,720 we've got this page or not. 16217 12:53:53,720 --> 12:53:57,720 And so this has the ones we've retrieved so far. 16218 12:53:57,720 --> 12:54:00,720 And so what gmane.py does is it sort of scans down 16219 12:54:00,720 --> 12:54:02,720 to see where to retrieve next, gets that, 16220 12:54:02,720 --> 12:54:05,720 and then starts scanning and then adding things here. 16221 12:54:05,720 --> 12:54:07,720 So it just adds it and then it blows up 16222 12:54:07,720 --> 12:54:09,720 and then it comes in again and says, 16223 12:54:09,720 --> 12:54:11,720 okay, I'll start here and then it starts retrieving stuff 16224 12:54:11,720 --> 12:54:14,720 and fills this in, fills this in, fills this in. 16225 12:54:14,720 --> 12:54:16,720 And sometimes you put like a delay in this 16226 12:54:16,720 --> 12:54:19,720 so you don't overwhelm networks, you don't overwhelm servers. 16227 12:54:19,720 --> 12:54:22,720 But basically this is pretty much a raw retrieval 16228 12:54:22,720 --> 12:54:24,720 of the email messages.
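The "scan down to see where to retrieve next" idea is what makes the spider restartable. A minimal sketch of just that piece, assuming a Messages table keyed by a sequential integer id; the table layout and the helper name next_message_id are hypothetical, and the actual network retrieval is left out.

```python
import sqlite3


def next_message_id(conn):
    """Return the id to fetch next: one past the highest id already stored."""
    row = conn.execute("SELECT MAX(id) FROM Messages").fetchone()
    return 1 if row[0] is None else row[0] + 1


# In-memory example: an empty database starts at 1, a partly filled one resumes.
resume_conn = sqlite3.connect(":memory:")
resume_conn.execute("CREATE TABLE Messages (id INTEGER PRIMARY KEY, email TEXT)")
first_id = next_message_id(resume_conn)
resume_conn.executemany(
    "INSERT INTO Messages VALUES (?, ?)", [(1, "a"), (2, "b"), (3, "c")]
)
resume_conn.commit()
resumed_id = next_message_id(resume_conn)
print(first_id, resumed_id)
```

Since the position is recovered from the database itself, the program can blow up or be stopped at any point and simply pick up where it left off on the next run.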
16229 12:54:24,720 --> 12:54:26,720 And this file can get rather large. 16230 12:54:26,720 --> 12:54:28,720 This is the one that's greater than a gigabyte. 16231 12:54:28,720 --> 12:54:31,720 Now this data is actually really nasty. 16232 12:54:31,720 --> 12:54:33,720 It's email data. 16233 12:54:33,720 --> 12:54:34,720 The date format's changed. 16234 12:54:34,720 --> 12:54:41,720 This is data that lasted from 2004 to like 2012 or 2013. 16235 12:54:41,720 --> 12:54:45,720 And so this data has got a lot of things wrong with it. 16236 12:54:45,720 --> 12:54:48,720 It even has things where people's email addresses have changed. 16237 12:54:48,720 --> 12:54:50,720 And so it has this mapping file. 16238 12:54:50,720 --> 12:54:53,720 This comes along with it, this mapping file that says, 16239 12:54:53,720 --> 12:54:56,720 here's this one person and here are the six email addresses 16240 12:54:56,720 --> 12:55:00,720 that they used throughout the life of the project. 16241 12:55:00,720 --> 12:55:03,720 And so there is a relatively complex, 16242 12:55:03,720 --> 12:55:09,720 and so this part here is super slow, very slow. 16243 12:55:09,720 --> 12:55:12,720 This part here is slow. 16244 12:55:12,720 --> 12:55:15,720 But it'll take like, depending on how fast your computer is, 16245 12:55:15,720 --> 12:55:17,720 somewhere between two minutes and ten minutes. 16246 12:55:17,720 --> 12:55:20,720 This first part will take days, perhaps, 16247 12:55:20,720 --> 12:55:22,720 depending on the speed of your network connection. 16248 12:55:22,720 --> 12:55:25,720 And so what gmodel does is it reads through this. 16249 12:55:25,720 --> 12:55:27,720 It actually re-creates, it wipes this out 16250 12:55:27,720 --> 12:55:30,720 and re-creates index.sqlite every time it runs 16251 12:55:30,720 --> 12:55:32,720 so that you can change any number of things, 16252 12:55:32,720 --> 12:55:35,720 you can re-spider things, you can do whatever.
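The address mapping described above can be pictured as a simple lookup from every alias to one canonical address. This is only a sketch of the idea: the function name, the dictionary, and the addresses are made up for illustration, and the real mapping file's format is not shown in the lecture.

```python
def canonical_address(address, aliases):
    """Map a known alias to its canonical address; unknown addresses pass through."""
    cleaned = address.strip().lower()
    return aliases.get(cleaned, cleaned)


# Hypothetical mapping: each alias on the left points at one address on the right.
aliases = {
    "jane@old-campus.example.edu": "jane@example.edu",
    "jane.doe@example.com": "jane@example.edu",
}

senders = ["Jane@Old-Campus.example.edu", "jane.doe@example.com", "sam@example.org"]
print([canonical_address(s, aliases) for s in senders])
```

Lower-casing before the lookup means the same person is counted once even when the capitalization of the address drifted over the years.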
16253 12:55:35,720 --> 12:55:38,720 And often the cleanup, this is one of those cleanup processes, 16254 12:55:38,720 --> 12:55:40,720 and you have to tweak the cleanup process. 16255 12:55:40,720 --> 12:55:43,720 You're like, look at your data, like, oh, the cleanup missed something, 16256 12:55:43,720 --> 12:55:44,720 so I've got to run it again. 16257 12:55:44,720 --> 12:55:48,720 So this produces index.sqlite every time it runs. 16258 12:55:48,720 --> 12:55:50,720 So this is like two to ten minutes. 16259 12:55:50,720 --> 12:55:52,720 gmodel is two to ten minutes. 16260 12:55:52,720 --> 12:55:56,720 And it maps names, and when it's all said and done, 16261 12:55:56,720 --> 12:56:01,720 this is a very small, highly normalized, it's a nice data model. 16262 12:56:01,720 --> 12:56:04,720 This one here, the content.sqlite has an ugly data model. 16263 12:56:04,720 --> 12:56:06,720 Index.sqlite has a pretty data model. 16264 12:56:06,720 --> 12:56:09,720 It's got foreign keys, it's got all this stuff. 16265 12:56:09,720 --> 12:56:12,720 And all those things we talked about in the database where it's efficient. 16266 12:56:12,720 --> 12:56:16,720 And so in your mind, keep track of how fast it is to scan all the data 16267 12:56:16,720 --> 12:56:18,720 in a database with a bad model, 16268 12:56:18,720 --> 12:56:21,720 and then watch when you run like gbasic, which is a scanner, 16269 12:56:21,720 --> 12:56:23,720 or gline, which produces line data, or gword, 16270 12:56:23,720 --> 12:56:25,720 and watch how fast they run. 16271 12:56:25,720 --> 12:56:28,720 They run in like a couple of seconds at the most, 16272 12:56:28,720 --> 12:56:30,720 and this runs in two to ten minutes. 16273 12:56:30,720 --> 12:56:34,720 And the difference is that's because the data is efficiently modeled 16274 12:56:34,720 --> 12:56:35,720 in index.sqlite. 
16275 12:56:35,720 --> 12:56:38,720 So you can take a look at that using SQLite browser 16276 12:56:38,720 --> 12:56:40,720 and take a look at the data model. 16277 12:56:40,720 --> 12:56:42,720 And you'll see it looks just like the stuff we talked about 16278 12:56:42,720 --> 12:56:43,720 in the database chapter. 16279 12:56:43,720 --> 12:56:46,720 It's got foreign keys and all those things. 16280 12:56:46,720 --> 12:56:48,720 And so that runs, and you've got this. 16281 12:56:48,720 --> 12:56:51,720 And then we do our visualizations and our analysis 16282 12:56:51,720 --> 12:56:53,720 from this clean version of all the data. 16283 12:56:53,720 --> 12:56:56,720 And so gbasic just loops through and prints some stuff out. 16284 12:56:56,720 --> 12:56:58,720 It's a great way to test things. 16285 12:56:58,720 --> 12:57:00,720 It's a pretty easy to understand program, 16286 12:57:00,720 --> 12:57:01,720 and you could take a look at it. 16287 12:57:01,720 --> 12:57:04,720 Gline does some bucketing and makes some histograms 16288 12:57:04,720 --> 12:57:06,720 to produce a line graph. 16289 12:57:06,720 --> 12:57:09,720 And then gword does a different histogram. 16290 12:57:09,720 --> 12:57:11,720 It does a histogram of word frequency 16291 12:57:11,720 --> 12:57:14,720 and then produces that as the word frequency ends up in gword.js. 16292 12:57:14,720 --> 12:57:21,720 And then we have two HTML files that use the d3.js visualization 16293 12:57:21,720 --> 12:57:24,720 to produce a line and a word chart. 16294 12:57:24,720 --> 12:57:28,720 And so in another video, I will show you how this code works, 16295 12:57:28,720 --> 12:57:31,720 which is probably more useful than this picture. 16296 12:57:31,720 --> 12:57:37,720 But this is a whole bunch of good stuff 16297 12:57:37,720 --> 12:57:39,720 in this particular application. 
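The word-frequency histogram that gword is described as producing can be sketched in a few lines. This is not the course's gword code: counting words in subject lines, the `min_length` cutoff, and the sample subjects are all assumptions made for the example.

```python
import re
from collections import Counter


def word_histogram(subjects, min_length=4):
    """Count how often each word appears across a list of subject lines."""
    counts = Counter()
    for subject in subjects:
        # Lower-case and keep alphabetic runs so "Sakai," and "sakai" match.
        for word in re.findall(r"[a-z]+", subject.lower()):
            if len(word) >= min_length:   # skip short words like "re" and "of"
                counts[word] += 1
    return counts


subjects = [
    "Sakai build failing",
    "Re: Sakai build failing",
    "New release of Sakai",
]
histogram = word_histogram(subjects)
print(histogram.most_common(3))
```

A dictionary of counts like this is easy to dump into a small JavaScript or JSON file for a visualization library to read.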
16298 12:57:39,720 --> 12:57:42,720 And if you really understand everything in here, 16299 12:57:42,720 --> 12:57:44,720 you can build a pretty sophisticated 16300 12:57:44,720 --> 12:57:47,720 data retrieval and analysis pipeline. 16301 12:57:47,720 --> 12:57:49,720 And so that's it. 16302 12:57:49,720 --> 12:57:51,720 Thank you for watching all these lectures, 16303 12:57:51,720 --> 12:57:54,720 and I look forward to seeing you on the net. 16304 12:57:58,720 --> 12:58:00,720 We're doing some code walkthroughs. 16305 12:58:00,720 --> 12:58:02,720 If you want to get the source code, 16306 12:58:02,720 --> 12:58:04,720 you can take a look at the sample code 16307 12:58:04,720 --> 12:58:07,720 and download it and work through it. 16308 12:58:07,720 --> 12:58:12,720 And so what we're working on now is doing some retrieval 16309 12:58:12,720 --> 12:58:15,720 and visualization of email data. 16310 12:58:15,720 --> 12:58:16,720 It's kind of ironic. 16311 12:58:16,720 --> 12:58:24,720 We are going to now look at the email data that we started with. 16312 12:58:24,720 --> 12:58:28,720 It's the same Sakai developer list email data. 16313 12:58:28,720 --> 12:58:32,720 And so there's this service called gmain. 16314 12:58:32,720 --> 12:58:36,720 And gmain archives developer lists and various email lists. 16315 12:58:36,720 --> 12:58:39,720 And I've made a copy of their data because all the students 16316 12:58:39,720 --> 12:58:43,720 in my class hitting their server with their API would crush it. 16317 12:58:43,720 --> 12:58:47,720 So in order to be a nice guy, I put up a much more powerful server 16318 12:58:47,720 --> 12:58:51,720 with just the data from this one list. 16319 12:58:51,720 --> 12:58:53,720 And it's about a gigabyte of data, 16320 12:58:53,720 --> 12:58:56,720 so be real careful if you're paying for network. 
16321 12:58:56,720 --> 12:58:59,720 So the basic process we're going to go through is we're going to have 16322 12:58:59,720 --> 12:59:03,720 a spidering process that's simple, restartable, 16323 12:59:03,720 --> 12:59:07,720 focused on the network problems, data pulling, 16324 12:59:07,720 --> 12:59:11,720 to pull content.sqlite, and there's going to be a database there. 16325 12:59:11,720 --> 12:59:13,720 And then we're going to have a cleanup process. 16326 12:59:13,720 --> 12:59:16,720 This database is going to get large, about a gigabyte. 16327 12:59:16,720 --> 12:59:19,720 And then we're going to have a process that kind of grinds through this data. 16328 12:59:19,720 --> 12:59:22,720 It takes a while. 16329 12:59:22,720 --> 12:59:24,720 And so then it's going to read this mapping, 16330 12:59:24,720 --> 12:59:26,720 and I'll show you that when it comes, 16331 12:59:26,720 --> 12:59:29,720 because things like people's names have changed over all these years. 16332 12:59:29,720 --> 12:59:31,720 And it does a cleanup and makes a really nice, 16333 12:59:31,720 --> 12:59:34,720 highly relational version of this data. 16334 12:59:34,720 --> 12:59:36,720 And then we visualize from here. 16335 12:59:36,720 --> 12:59:40,720 And so this could take you several days to finish this. 16336 12:59:40,720 --> 12:59:42,720 This will take like a few minutes to run, 16337 12:59:42,720 --> 12:59:45,720 and then this will just take seconds to run. 16338 12:59:45,720 --> 12:59:50,720 And so this is a multi-step process where if you were doing something 16339 12:59:50,720 --> 12:59:53,720 like running something for two days to produce a visualization, 16340 12:59:53,720 --> 12:59:56,720 and it blew up three quarters of the way through, it would do you no good. 16341 12:59:56,720 --> 12:59:59,720 And so that's why we break this into simple parts.
16342 12:59:59,720 --> 13:00:03,720 But right now we're just going to focus on this part right here, 16343 13:00:03,720 --> 13:00:10,720 and take a look at the mail bit, and retrieve the mail, 16344 13:00:10,720 --> 13:00:15,720 and then we'll have another video to talk about the rest of this stuff. 16345 13:00:15,720 --> 13:00:21,720 So let's take a look at the code. 16346 13:00:21,720 --> 13:00:24,720 So here is gmain.py. 16347 13:00:24,720 --> 13:00:26,720 That is the basic code. 16348 13:00:26,720 --> 13:00:29,720 And hopefully this stuff is starting to look familiar. 16349 13:00:29,720 --> 13:00:32,720 The thing that's weird here is we've got to do some date-time parsing. 16350 13:00:32,720 --> 13:00:35,720 And there is code that's out there, but you may have to install it. 16351 13:00:35,720 --> 13:00:40,720 And I had to write my code in a way that didn't assume 16352 13:00:40,720 --> 13:00:42,720 that you could install the date-time parser. 16353 13:00:42,720 --> 13:00:45,720 And so it has it, even if that's not there, 16354 13:00:45,720 --> 13:00:47,720 it uses my own date-time parser, and that's what this code is. 16355 13:00:47,720 --> 13:00:50,720 Don't worry too much about that. 16356 13:00:50,720 --> 13:00:55,720 And of course we have to deal with the lack of certificates inside of Python. 16357 13:00:55,720 --> 13:00:59,720 And so we start things out. 16358 13:00:59,720 --> 13:01:02,720 And this is really a simple table. 16359 13:01:02,720 --> 13:01:05,720 We've got a messages table that's got a primary key, 16360 13:01:05,720 --> 13:01:09,720 the email itself, when it was sent, what the subject, 16361 13:01:09,720 --> 13:01:13,720 and the headers, and the body. 16362 13:01:13,720 --> 13:01:17,720 And so what we're going to do is, because we have to pick up where we left off, 16363 13:01:17,720 --> 13:01:25,720 we're going to select the largest primary key from the messages table 16364 13:01:25,720 --> 13:01:27,720 and retrieve that. 
16365 13:01:27,720 --> 13:01:31,720 And then we're going to go to the one after that. 16366 13:01:31,720 --> 13:01:35,720 And so we know what the ID is, 16367 13:01:35,720 --> 13:01:38,720 and we're going to pick up where we left off. 16368 13:01:38,720 --> 13:01:43,720 And so we have a starting point that starts either 0 or 1. 16369 13:01:43,720 --> 13:01:47,720 And we're going to ask how many messages to retrieve. 16370 13:01:47,720 --> 13:01:49,720 We've got some counters. 16371 13:01:49,720 --> 13:01:51,720 And so we're going to say, okay, 16372 13:01:51,720 --> 13:01:56,720 see if select ID for messages where ID equals whatever that starting is, 16373 13:01:56,720 --> 13:01:58,720 that's the highest number we've seen so far. 16374 13:01:58,720 --> 13:02:03,720 And if row is not none, 16375 13:02:03,720 --> 13:02:07,720 that means we've already retrieved this particular email message. 16376 13:02:07,720 --> 13:02:10,720 Otherwise we're going to keep on going, and we're in good shape. 16377 13:02:10,720 --> 13:02:12,720 And this is one that we want to retrieve. 16378 13:02:12,720 --> 13:02:14,720 And we're subtracting that so we don't. 16379 13:02:14,720 --> 13:02:16,720 And so this is the base URL. 16380 13:02:16,720 --> 13:02:21,720 This is the URL of our API, 16381 13:02:21,720 --> 13:02:26,720 the one that I have a nice copy of all this data on a server 16382 13:02:26,720 --> 13:02:29,720 that's accessible worldwide and won't crash. 16383 13:02:29,720 --> 13:02:33,720 So the format of this is you can say I would like the email address 16384 13:02:33,720 --> 13:02:48,720 from 1 to 2 or from 100, oops, from 102, 101, message 101 to 102. 16385 13:02:48,720 --> 13:02:51,720 And we can just kind of walk through these things. 16386 13:02:51,720 --> 13:02:53,720 So that's the message ID. 
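The "pick up where we left off" step can be sketched with the standard sqlite3 module. This is a minimal illustration, not the lecture's exact code: the column names follow the table as described (id, email, sent time, subject, headers, body), the function name `next_start` is invented, and starting at 1 on an empty table is an assumption.

```python
import sqlite3


def next_start(conn):
    """Return the id to fetch next: one past the largest id stored so far."""
    cur = conn.cursor()
    cur.execute("SELECT max(id) FROM Messages")
    row = cur.fetchone()
    # max() over an empty table still returns one row, holding NULL (None).
    if row is None or row[0] is None:
        return 1
    return row[0] + 1


conn = sqlite3.connect(":memory:")  # in-memory database just for the demo
conn.execute(
    """CREATE TABLE IF NOT EXISTS Messages
       (id INTEGER PRIMARY KEY, email TEXT, sent_at TEXT,
        subject TEXT, headers TEXT, body TEXT)"""
)

first = next_start(conn)            # nothing stored yet
conn.execute("INSERT INTO Messages (id, email) VALUES (?, ?)", (41, "a@example.com"))
resumed = next_start(conn)          # continues after the last stored message
print(first, resumed)
```

Because the resume point comes from the database rather than from a variable in memory, the program can be stopped or crash at any time and simply be run again.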
16387 13:02:53,720 --> 13:02:59,720 And so if we're going to make the URL, 16388 13:02:59,720 --> 13:03:01,720 we're going to take the base URL, 16389 13:03:01,720 --> 13:03:03,720 add the starting address, and then add plus one. 16390 13:03:03,720 --> 13:03:06,720 So we got the slash at the end of this starting address. 16391 13:03:06,720 --> 13:03:10,720 And so that's how we form those. 16392 13:03:10,720 --> 13:03:14,720 And we're going to retrieve that, and we're going to decode it. 16393 13:03:14,720 --> 13:03:16,720 We've seen this in some other ones. 16394 13:03:16,720 --> 13:03:19,720 We're going to check to see if we got legit data. 16395 13:03:19,720 --> 13:03:24,720 If not, if I got a 404 not found or something else, we're going to quit. 16396 13:03:24,720 --> 13:03:28,720 If someone hits Ctrl-C, or Ctrl-Z, 16397 13:03:28,720 --> 13:03:30,720 we'll get the program interrupt and we'll stop. 16398 13:03:30,720 --> 13:03:38,720 If there's some other problem, we're going to complain and keep going. 16399 13:03:38,720 --> 13:03:40,720 And if we have five failures in a row, we're going to quit, 16400 13:03:40,720 --> 13:03:44,720 but we'll just keep on going because these things do have glitchy bits here. 16401 13:03:44,720 --> 13:03:48,720 And so at this point, if we made it this far, we've retrieved the URL 16402 13:03:48,720 --> 13:03:50,720 and we've got the number of characters we've retrieved. 16403 13:03:50,720 --> 13:03:55,720 And if we get bad data, if it doesn't start with From, 16404 13:03:55,720 --> 13:03:57,720 because this is a mail message, 16405 13:03:57,720 --> 13:04:05,720 and they all start with From space, if it's right, it starts with From space. 16406 13:04:05,720 --> 13:04:09,720 Then we're going to tolerate up to five failures there 16407 13:04:09,720 --> 13:04:12,720 for bad data because it could be bad.
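The URL building and the "five failures in a row" rule can be sketched as below. This is a simplified illustration rather than the lecture's code: `crawl`, `fake_fetch`, and the example.com base URL are invented, the fetch function is passed in so the sketch runs without a network, and every bad result (including what would be a 404) is simply counted as a failure.

```python
def message_url(base_url, start):
    """Build the URL for one message: base + start + '/' + (start + 1)."""
    return base_url + str(start) + "/" + str(start + 1)


def crawl(fetch, base_url, start, how_many, max_failures=5):
    """Fetch messages in order; stop after max_failures bad results in a row."""
    fetched, failures = [], 0
    while how_many > 0 and failures < max_failures:
        try:
            text = fetch(message_url(base_url, start))
        except KeyboardInterrupt:
            break                       # Ctrl-C: stop cleanly
        except Exception as err:
            print("Unable to retrieve", start, err)
            text = None
        if text is None or not text.startswith("From "):
            failures += 1               # glitch or bad data: count it, move on
        else:
            failures = 0                # a good message resets the streak
            fetched.append((start, text))
            how_many -= 1
        start += 1
    return fetched


def fake_fetch(url):
    """Stand-in for a real HTTP request so the sketch needs no network."""
    number = int(url.split("/")[-2])
    if number == 3:
        raise IOError("temporary glitch")
    if number > 5:
        return "<html>not found</html>"   # not a mail message
    return "From someone@example.org Sat Jan  5 09:14:16 2008\n\nbody\n"


got = crawl(fake_fetch, "http://example.com/messages/", 1, 100)
print([number for number, _ in got])
```

In the demo, message 3 glitches once and is skipped, and the run ends by itself once five consecutive requests past the last message come back as bad data.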
16408 13:04:12,720 --> 13:04:14,720 And then we're going to find a blank line because that's the new line 16409 13:04:14,720 --> 13:04:17,720 at the end of one line and then a blank line. 16410 13:04:17,720 --> 13:04:20,720 And then we're going to take and break this into the headers, 16411 13:04:20,720 --> 13:04:25,720 the mail headers, which is that mail headers is this stuff right here, 16412 13:04:25,720 --> 13:04:27,720 up to but not including the blank line, 16413 13:04:27,720 --> 13:04:32,720 and then the body is everything after that, okay? 16414 13:04:32,720 --> 13:04:36,720 And so we'll just break that into pieces. 16415 13:04:36,720 --> 13:04:40,720 Otherwise we'll complain and tolerate up to five failures. 16416 13:04:40,720 --> 13:04:42,720 And then we're going to use a regular expression, 16417 13:04:42,720 --> 13:04:44,720 kind of from the regular expressions chapter, 16418 13:04:44,720 --> 13:04:50,720 to pull out an email address from the From: line 16419 13:04:50,720 --> 13:04:54,720 somewhere in these headers, From: right there. 16420 13:04:54,720 --> 13:04:58,720 It's going to go find a less than and then pull, oops, come on, 16421 13:04:58,720 --> 13:05:00,720 pull this stuff out up to it. 16422 13:05:00,720 --> 13:05:03,720 So you got the less than, you got the parenthesis, 16423 13:05:03,720 --> 13:05:06,720 you got one or more non-blank characters followed by an at sign, 16424 13:05:06,720 --> 13:05:08,720 followed by one or more non-blank characters. 16425 13:05:08,720 --> 13:05:11,720 And we'll get back a list of those. 16426 13:05:11,720 --> 13:05:13,720 We should only get one. 16427 13:05:13,720 --> 13:05:16,720 If we find one, we're going to grab the email. 16428 13:05:16,720 --> 13:05:18,720 We're going to strip it and lower-case it. 16429 13:05:18,720 --> 13:05:22,720 And if we got some little nasty less than sign in there, 16430 13:05:22,720 --> 13:05:24,720 we'll tolerate that as well.
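The header/body split and the address extraction just described can be sketched as follows. The regular expressions are written from the spoken description (a less-than sign, then non-blank characters around an at sign), so treat them as an approximation; the function names and the sample message are invented for the example.

```python
import re


def split_message(text):
    """Split raw mail into (headers, body) at the first blank line."""
    pos = text.find("\n\n")
    if pos < 0:
        return None                      # no blank line: treat as bad data
    return text[:pos], text[pos + 2:]


def sender_address(headers):
    """Pull the sender out of the From: header, lower-cased and cleaned."""
    # Form 1: From: Some Name <user@host>
    found = re.findall(r"\nFrom: .* <(\S+@\S+)>\n", headers)
    if not found:
        # Form 2: From: user@host (no angle brackets)
        found = re.findall(r"\nFrom: (\S+@\S+)\n", headers)
    if not found:
        return None
    return found[0].strip().lower().replace("<", "")


raw = (
    "From stephen@example.org Sat Jan  5 09:14:16 2008\n"
    "From: Stephen M <Stephen@Example.org>\n"
    "Subject: hello\n"
    "\n"
    "Body text\n"
)
headers, body = split_message(raw)
print(sender_address(headers))
```

Both patterns expect a newline after the From: line, so they rely on From: not being the very last header before the blank line.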
16431 13:05:24,720 --> 13:05:26,720 So this is kind of clean up, and you get used to this 16432 13:05:26,720 --> 13:05:28,720 where you're like, oh, how come all those email addresses 16433 13:05:28,720 --> 13:05:32,720 have this other stuff in them? 16434 13:05:32,720 --> 13:05:36,720 And then we also look for it if there are no less than signs. 16435 13:05:36,720 --> 13:05:40,720 And we do this way, this is, and that's different. 16436 13:05:40,720 --> 13:05:43,720 Some mail messages have it this way, and others, 16437 13:05:43,720 --> 13:05:45,720 again, you write this code after you watch it for a while, 16438 13:05:45,720 --> 13:05:48,720 and you're like, oh, it's crapped out and giving me bad stuff. 16439 13:05:48,720 --> 13:05:51,720 And I make them all lower case so they match better 16440 13:05:51,720 --> 13:05:53,720 and I get rid of bad characters. 16441 13:05:53,720 --> 13:05:55,720 Now I got an email address. 16442 13:05:55,720 --> 13:05:58,720 Then what I do is I look for the date of this. 16443 13:05:58,720 --> 13:06:00,720 So I'm going to graph these by date, 16444 13:06:00,720 --> 13:06:03,720 so I look for this line and use a regular expression 16445 13:06:03,720 --> 13:06:05,720 to pull that out. 16446 13:06:05,720 --> 13:06:08,720 So I'm looking for a date, followed by a blank, 16447 13:06:08,720 --> 13:06:13,720 followed by any number of characters, followed by a comma. 16448 13:06:13,720 --> 13:06:16,720 So I'm not interested in this Wednesday bit, 16449 13:06:16,720 --> 13:06:18,720 so I'm skipping that bit right there, 16450 13:06:18,720 --> 13:06:22,720 and going and grabbing everything after that comma space. 16451 13:06:22,720 --> 13:06:26,720 And so it's really here to the end of the line. 16452 13:06:26,720 --> 13:06:27,720 So that's the new line. 16453 13:06:27,720 --> 13:06:29,720 So it's going all the way. 16454 13:06:29,720 --> 13:06:31,720 It's going to pull this bit right here. 16455 13:06:31,720 --> 13:06:33,720 That's the text. 
16456 13:06:33,720 --> 13:06:34,720 And this is where we're going to say, 16457 13:06:34,720 --> 13:06:36,720 oh, that's kind of a funky-looking date 16458 13:06:36,720 --> 13:06:38,720 and we want to standardize that date. 16459 13:06:38,720 --> 13:06:43,720 So we're going to, let's see. 16460 13:06:43,720 --> 13:06:46,720 Yeah, we're going to chop it off at the 26th character. 16461 13:06:46,720 --> 13:06:48,720 Apparently, I don't know what the 26th, 16462 13:06:48,720 --> 13:06:50,720 why we care about the 26th character, 16463 13:06:50,720 --> 13:06:52,720 but we chop that off at the 26th character. 16464 13:06:52,720 --> 13:06:54,720 And then we're going to parse it, 16465 13:06:54,720 --> 13:06:57,720 and that's going to give us back a nice clean date, 16466 13:06:57,720 --> 13:06:59,720 sent at date. 16467 13:06:59,720 --> 13:07:01,720 Otherwise, we're going to complain. 16468 13:07:01,720 --> 13:07:02,720 We're going to quit. 16469 13:07:02,720 --> 13:07:03,720 And if we can't parse it, 16470 13:07:03,720 --> 13:07:07,720 then we're going to tolerate five bad email addresses in a row. 16471 13:07:07,720 --> 13:07:09,720 Then we're looking for the subject line 16472 13:07:09,720 --> 13:07:12,720 using another regular expression. 16473 13:07:12,720 --> 13:07:14,720 Subject line, regular expression. 16474 13:07:14,720 --> 13:07:15,720 That's pretty easy. 16475 13:07:15,720 --> 13:07:17,720 Up to, but not including, right? 16476 13:07:17,720 --> 13:07:19,720 There's a blank there. 16477 13:07:19,720 --> 13:07:27,720 It's the subject. 16478 13:07:27,720 --> 13:07:28,720 Let me pull that out. 16479 13:07:28,720 --> 13:07:29,720 We get the subject.
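The date and subject extraction can be sketched like this. The lecture uses its own date parser (or an installable one); here the standard library's `email.utils.parsedate_to_datetime` is swapped in as an assumption, and the function names and sample headers are invented. One plausible reason for the 26-character cut, offered as a guess, is that a date such as `15 Jan 2008 09:14:16 -0500` is exactly 26 characters long, so the cut drops trailing text like a time zone name in parentheses.

```python
import re
from email.utils import parsedate_to_datetime


def sent_at(headers):
    """Return the Date: header as an ISO 8601 string, or None if unusable."""
    # Skip the day name ("Sat,") and capture everything after the comma.
    found = re.findall(r"\nDate: .*, (.*)\n", headers)
    if not found:
        return None
    text = found[0][:26]                 # drop anything after the UTC offset
    try:
        return parsedate_to_datetime(text).isoformat()
    except (TypeError, ValueError):      # raised for unparseable dates
        return None


def subject_line(headers):
    """Return the text after 'Subject: ', or None if there is no such header."""
    found = re.findall(r"\nSubject: (.*)\n", headers)
    return found[0].strip() if found else None


headers = (
    "From stephen@example.org Sat Jan  5 09:14:16 2008\n"
    "Date: Sat, 5 Jan 2008 09:14:16 -0500 (EST)\n"
    "Subject: [sakai] build question\n"
    "X-Mailer: example\n"
)
print(sent_at(headers), subject_line(headers))
```

Storing the date in one standard form is what later makes it possible to bucket messages by month or year for a line chart.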
16480 13:07:29,720 --> 13:07:31,720 Now, at this point, we've parsed it 16481 13:07:31,720 --> 13:07:32,720 and we've got good stuff, 16482 13:07:32,720 --> 13:07:33,720 so we reset the fail counter 16483 13:07:33,720 --> 13:07:34,720 because I kept saying, 16484 13:07:34,720 --> 13:07:37,720 if you fail five straight times, you quit. 16485 13:07:37,720 --> 13:07:39,720 And we're going to print it out, 16486 13:07:39,720 --> 13:07:41,720 and then we're just going to insert that stuff. 16487 13:07:41,720 --> 13:07:44,720 We've got the ID of the message, 16488 13:07:44,720 --> 13:07:48,720 which we've got the email address that it came from, 16489 13:07:48,720 --> 13:07:50,720 the time it was sent, the subject, 16490 13:07:50,720 --> 13:07:52,720 and then basically the headers in the body, 16491 13:07:52,720 --> 13:07:53,720 and we're just inserting it. 16492 13:07:53,720 --> 13:07:54,720 And now we're going to say, 16493 13:07:54,720 --> 13:07:56,720 every 50th we're going to commit it, 16494 13:07:56,720 --> 13:07:57,720 so that speeds things up, 16495 13:07:57,720 --> 13:07:59,720 and every 100th we're going to wait a second. 16496 13:07:59,720 --> 13:08:02,720 So that's, you know, count is going up, up, up, up, up, 16497 13:08:02,720 --> 13:08:05,720 and every 50th you'll see it pause, 16498 13:08:05,720 --> 13:08:07,720 and then it will, every 100th, 16499 13:08:07,720 --> 13:08:09,720 it'll pause for a second. 16500 13:08:09,720 --> 13:08:11,720 Mostly that's to let me hit control C 16501 13:08:11,720 --> 13:08:15,720 or to not overload any server. 16502 13:08:15,720 --> 13:08:17,720 Okay, so that's the simple one. 16503 13:08:17,720 --> 13:08:19,720 The problem is, is this data just gets ugly, 16504 13:08:19,720 --> 13:08:22,720 and so you'll find yourself wanting to reset this 16505 13:08:22,720 --> 13:08:23,720 and start it over. 
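The "commit every 50th, wait every 100th" pattern can be sketched as below. The table layout follows the one described earlier; the function name, the parameter names, and the final commit at the end of the batch are choices made for this illustration, not a copy of the course code.

```python
import sqlite3
import time


def store_all(conn, messages, commit_every=50, pause_every=100, pause_seconds=1.0):
    """Insert rows, committing in batches and pausing now and then."""
    cur = conn.cursor()
    count = 0
    for message in messages:
        cur.execute(
            "INSERT OR IGNORE INTO Messages "
            "(id, email, sent_at, subject, headers, body) VALUES (?, ?, ?, ?, ?, ?)",
            message,
        )
        count += 1
        if count % commit_every == 0:
            conn.commit()                # one commit per batch is faster
        if count % pause_every == 0:
            time.sleep(pause_seconds)    # be gentle on the server, allow Ctrl-C
    conn.commit()                        # make sure the last partial batch is saved
    return count


conn = sqlite3.connect(":memory:")
conn.execute(
    """CREATE TABLE Messages
       (id INTEGER PRIMARY KEY, email TEXT, sent_at TEXT,
        subject TEXT, headers TEXT, body TEXT)"""
)
rows = [(i, "a@example.com", "2008-01-05", "hi", "hdr", "body") for i in range(1, 121)]
stored = store_all(conn, rows, pause_seconds=0)   # no real pause in the demo
print(stored)
```

Committing after every single insert would be safer but much slower, so batching trades a little risk of redoing work for a lot of speed.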
16506 13:08:23,720 --> 13:08:26,720 This one's going to work, of course, 16507 13:08:26,720 --> 13:08:29,720 but it's, these are hard to build, 16508 13:08:29,720 --> 13:08:32,720 and that's why it's a good idea. 16509 13:08:32,720 --> 13:08:33,720 Oops. 16510 13:08:33,720 --> 13:08:34,720 python3 16511 13:08:36,720 --> 13:08:39,720 gmain.py. 16512 13:08:39,720 --> 13:08:41,720 How many messages? 16513 13:08:41,720 --> 13:08:43,720 Well, let's just do one. 16514 13:08:43,720 --> 13:08:44,720 Choo! 16515 13:08:44,720 --> 13:08:45,720 Okay, so it went and grabbed, 16516 13:08:45,720 --> 13:08:48,720 oh, do I have this already running? 16517 13:08:48,720 --> 13:08:50,720 51 through 52. 16518 13:08:50,720 --> 13:08:52,720 Let me start over. 16519 13:08:52,720 --> 13:08:56,720 ls -l *sqlite. 16520 13:08:56,720 --> 13:08:57,720 Okay, rm content. 16521 13:08:57,720 --> 13:09:00,720 I must have run it to test it. 16522 13:09:00,720 --> 13:09:02,720 So, let's run it again. 16523 13:09:02,720 --> 13:09:06,720 Python gmain.py and ask for one message. 16524 13:09:06,720 --> 13:09:08,720 Okay, so there we went and got message one 16525 13:09:08,720 --> 13:09:09,720 from one to two. 16526 13:09:09,720 --> 13:09:11,720 We got 226 two characters, 16527 13:09:11,720 --> 13:09:13,720 and we printed out the email address, 16528 13:09:13,720 --> 13:09:15,720 the time we got it after all that hacking, 16529 13:09:15,720 --> 13:09:18,720 and the subject line, and that's what we got.
16530 13:09:18,720 --> 13:09:21,720 So, if we take a look at the database 16531 13:09:21,720 --> 13:09:24,720 and we go into the gmain, 16532 13:09:24,720 --> 13:09:28,720 oh, every time you see the content SQLite journal, 16533 13:09:28,720 --> 13:09:31,720 that means it needed to run a commit, 16534 13:09:31,720 --> 13:09:32,720 and it hasn't run a commit, 16535 13:09:32,720 --> 13:09:34,720 but I'll hit enter and that will do the commit, 16536 13:09:34,720 --> 13:09:36,720 and you see that vanish. 16537 13:09:36,720 --> 13:09:40,720 So, now I can open it and I take a look at, 16538 13:09:43,720 --> 13:09:45,720 how come there's no messages? 16539 13:09:45,720 --> 13:09:48,720 Did that one not get stored in there for some reason? 16540 13:09:48,720 --> 13:09:50,720 Used refresh. 16541 13:09:54,720 --> 13:09:56,720 Huh, let's run it again. 16542 13:09:59,720 --> 13:10:01,720 Maybe it didn't commit. 16543 13:10:01,720 --> 13:10:03,720 Maybe it got a bug in it. 16544 13:10:03,720 --> 13:10:06,720 Let's make a change to the code. 16545 13:10:13,720 --> 13:10:17,720 I'm going to see this connection.commit. 16546 13:10:18,720 --> 13:10:19,720 See that? 16547 13:10:20,720 --> 13:10:22,720 Connection.commit. 16548 13:10:24,720 --> 13:10:26,720 Gonna commit there, 16549 13:10:26,720 --> 13:10:28,720 and the other thing I'm gonna do is, 16550 13:10:28,720 --> 13:10:30,720 every time I stop to read, 16551 13:10:30,720 --> 13:10:33,720 I'm gonna commit right before I read it. 16552 13:10:33,720 --> 13:10:35,720 So, I think we should, I hope that doesn't blow up. 16553 13:10:35,720 --> 13:10:36,720 We'll see. 16554 13:10:36,720 --> 13:10:39,720 So, the idea is, if I wanna stop, 16555 13:10:39,720 --> 13:10:40,720 I wanna commit it. 16556 13:10:40,720 --> 13:10:41,720 So, let's do this. 16557 13:10:41,720 --> 13:10:43,720 Let's do one message, 16558 13:10:44,720 --> 13:10:47,720 and now I should hit, is it committed? 
16559 13:10:47,720 --> 13:10:48,720 Now that I've put the commits in, 16560 13:10:48,720 --> 13:10:51,720 I think that it will look better. 16561 13:10:54,720 --> 13:10:55,720 I can refresh, 16562 13:10:55,720 --> 13:10:57,720 and so there it is because I committed it, 16563 13:10:57,720 --> 13:11:00,720 and I don't have, yeah, I don't have the journal file, 16564 13:11:00,720 --> 13:11:01,720 so that's good. 16565 13:11:01,720 --> 13:11:03,720 So, that's a good idea to put those commits there. 16566 13:11:03,720 --> 13:11:05,720 So, I'll just leave those commits in. 16567 13:11:05,720 --> 13:11:08,720 When you download it, it'll have those commits in there. 16568 13:11:08,720 --> 13:11:10,720 So, again, I put a commit here, 16569 13:11:10,720 --> 13:11:14,720 and a commit at the very, very end, 16570 13:11:15,720 --> 13:11:18,720 to make sure, and then I, so, I missed that. 16571 13:11:19,720 --> 13:11:21,720 But now we get one, right? 16572 13:11:21,720 --> 13:11:22,720 And so, let's just run it again, 16573 13:11:22,720 --> 13:11:25,720 and you'll see how by selecting the max of the ID, 16574 13:11:25,720 --> 13:11:28,720 it's gonna select the max of this and then add one to it, 16575 13:11:28,720 --> 13:11:30,720 so it does the next one. 16576 13:11:30,720 --> 13:11:33,720 So, if I run it again, 16577 13:11:33,720 --> 13:11:36,720 I say, give me one message, so it goes two to three, 16578 13:11:36,720 --> 13:11:39,720 and give me two messages, right? 16579 13:11:39,720 --> 13:11:42,720 So, I hit enter, and I can do refresh, 16580 13:11:42,720 --> 13:11:45,720 and now you see we've got four messages, okay? 16581 13:11:45,720 --> 13:11:48,720 And so, let's just fire this baby up. 16582 13:11:48,720 --> 13:11:49,720 Tell it to get 100. 16583 13:11:50,720 --> 13:11:53,720 Er, run, run, run, run, run, run, run.
16584 13:11:53,720 --> 13:11:56,720 All right, it just goes and goes, 16585 13:11:56,720 --> 13:11:58,720 and it pauses once in a while to do a commit, 16586 13:11:58,720 --> 13:12:00,720 and if I made a commit every time, 16587 13:12:00,720 --> 13:12:03,720 oop, it just paused there, now it finished. 16588 13:12:03,720 --> 13:12:07,720 So, this'll run, and we will get a bunch of data. 16589 13:12:10,720 --> 13:12:12,720 The problem is, is if I just run this, 16590 13:12:12,720 --> 13:12:15,720 it'll take about five hours, okay, 16591 13:12:15,720 --> 13:12:17,720 to run this and get this all, 16592 13:12:17,720 --> 13:12:18,720 and I've got a really fast connection. 16593 13:12:18,720 --> 13:12:22,720 So, I have got a file that you can download, 16594 13:12:22,720 --> 13:12:25,720 let's go find it, let's see if I can, 16595 13:12:25,720 --> 13:12:28,720 let's see how long it'll take me to download this. 16596 13:12:28,720 --> 13:12:31,720 I've got a file that you can download and save. 16597 13:12:31,720 --> 13:12:36,720 Now, I'm gonna use the command line, curl, 16598 13:12:36,720 --> 13:12:39,720 or wget is another command that we Linux 16599 13:12:39,720 --> 13:12:41,720 and Mac people can use. 16600 13:12:41,720 --> 13:12:43,720 I don't know, you might have to use your browser do it, 16601 13:12:43,720 --> 13:12:45,720 let's see how long this is gonna take. 16602 13:12:45,720 --> 13:12:49,720 Yum, it's retrieving, minute 30. 16603 13:12:49,720 --> 13:12:54,720 Okay, well, I'll just wait when this come back. 16604 13:13:14,720 --> 13:13:16,720 Okay, so now that's done. 16605 13:13:16,720 --> 13:13:19,720 I was averaging 10 megabits a second. 16606 13:13:19,720 --> 13:13:23,720 I downloaded about 600 megabytes, 10 megabits a second. 16607 13:13:23,720 --> 13:13:26,720 That will probably be slower for you. 
16608 13:13:26,720 --> 13:13:29,720 But, so now if I take a look, 16609 13:13:29,720 --> 13:13:34,720 you're gonna find that that content.sqlite 16610 13:13:34,720 --> 13:13:37,720 is 624 megabytes. 16611 13:13:37,720 --> 13:13:39,720 Now, what happens is I've pre-spidered this, 16612 13:13:39,720 --> 13:13:42,720 and so now if you run gmain.py 16613 13:13:42,720 --> 13:13:45,720 and ask for five more messages, 16614 13:13:45,720 --> 13:13:48,720 it will pick up where I left that one off. 16615 13:13:48,720 --> 13:13:51,720 So it's up to message 59,000. 16616 13:13:51,720 --> 13:13:53,720 And I think that, oh, you saw an error. 16617 13:13:53,720 --> 13:13:54,720 You saw a bug in that one. 16618 13:13:54,720 --> 13:13:56,720 I don't know what's wrong with that one. 16619 13:13:56,720 --> 13:13:58,720 So let's see if, so at this point, 16620 13:13:58,720 --> 13:14:00,720 we're gonna have most of the data. 16621 13:14:00,720 --> 13:14:04,720 It might find its way to the very end. 16622 13:14:04,720 --> 13:14:07,720 Once you get this, it should be not too much more. 16623 13:14:07,720 --> 13:14:13,720 I don't know, maybe it's like 63,000 or something. 16624 13:14:13,720 --> 13:14:16,720 So what we'll do is we will let that run, 16625 13:14:16,720 --> 13:14:20,720 and we will come back when that one's finished 16626 13:14:20,720 --> 13:14:25,720 and run the next phase after it's got all of its data, okay? 16627 13:14:25,720 --> 13:14:27,720 So thanks for listening. 16628 13:14:34,720 --> 13:14:36,720 The work that we're doing right now 16629 13:14:36,720 --> 13:14:38,720 is we are in the process of building 16630 13:14:38,720 --> 13:14:43,720 a writer and visualization tool for email data 16631 13:14:43,720 --> 13:14:46,720 that came originally from this website gmain, 16632 13:14:46,720 --> 13:14:48,720 but I've got my own copy of it.
16633 13:14:48,720 --> 13:14:51,720 And so what we've done before is we ran gmain.py, 16634 13:14:51,720 --> 13:14:55,720 and I grabbed a URL. 16635 13:14:55,720 --> 13:14:59,720 I have a URL that has all this data, 16636 13:14:59,720 --> 13:15:03,720 and I downloaded that, and then I ran gmain again 16637 13:15:03,720 --> 13:15:07,720 to catch up, and so it took quite a bit of catching up. 16638 13:15:07,720 --> 13:15:09,720 But by the time I get to, remember how I said 16639 13:15:09,720 --> 13:15:12,720 it tries to fail five times? 16640 13:15:12,720 --> 13:15:16,720 Well, it ran out of data at 60,421, 16641 13:15:16,720 --> 13:15:19,720 and then it started failing, and then it quit. 16642 13:15:19,720 --> 13:15:22,720 So we pretty much have all of our data now. 16643 13:15:22,720 --> 13:15:28,720 We have finished this process in content SQLite, okay? 16644 13:15:28,720 --> 13:15:34,720 And if I take a look in the database browser, 16645 13:15:34,720 --> 13:15:38,720 we can see we've got 59,823 email messages. 16646 13:15:38,720 --> 13:15:40,720 And so if I look at any of these things, 16647 13:15:40,720 --> 13:15:44,720 you see the headers, you see the subject line, 16648 13:15:44,720 --> 13:15:47,720 you see the email address, you see the body of it. 16649 13:15:47,720 --> 13:15:52,720 So remember I split the body in half and the headers. 16650 13:15:52,720 --> 13:15:56,720 And so I made this as raw as I possibly could 16651 13:15:56,720 --> 13:15:59,720 because as you saw, I had to spend so much time in the gmain 16652 13:15:59,720 --> 13:16:03,720 just getting the data successfully retrieved. 16653 13:16:03,720 --> 13:16:05,720 And so I don't like cleaning the data up too much. 16654 13:16:05,720 --> 13:16:07,720 And so what we're gonna look at next 16655 13:16:07,720 --> 13:16:11,720 is the data cleaning process, okay? 16656 13:16:11,720 --> 13:16:14,720 And so this is gmodel.py is the code 16657 13:16:14,720 --> 13:16:17,720 we're gonna take a look at now. 
16658 13:16:17,720 --> 13:16:22,720 So let's get rid of those guys and look at gmodel.py. 16659 13:16:24,720 --> 13:16:27,720 I don't think I need urllib in this code. 16660 13:16:27,720 --> 13:16:33,720 Do I have any urllib? 16661 13:16:33,720 --> 13:16:36,720 No, so I don't need that, sorry. 16662 13:16:36,720 --> 13:16:38,720 Fix that. 16663 13:16:38,720 --> 13:16:41,720 Okay, so it's gonna read from the database, 16664 13:16:41,720 --> 13:16:43,720 it's gonna use regular expressions, 16665 13:16:43,720 --> 13:16:45,720 and zlib is a way to do some compression. 16666 13:16:45,720 --> 13:16:46,720 And so I'm gonna do, in this one, 16667 13:16:46,720 --> 13:16:48,720 I'm gonna compress some of the data 16668 13:16:48,720 --> 13:16:50,720 to make it so that I have less data to, 16669 13:16:50,720 --> 13:16:52,720 some of the text fields are gonna be compressed. 16670 13:16:52,720 --> 13:16:54,720 I wanted to keep these fields uncompressed 16671 13:16:54,720 --> 13:16:56,720 inside of messages. 16672 13:16:56,720 --> 13:17:01,720 And so we have some cleanup code 16673 13:17:01,720 --> 13:17:03,720 that cleans things up. 16674 13:17:03,720 --> 13:17:06,720 And it turns out that the way email addresses 16675 13:17:06,720 --> 13:17:11,720 in this particular mail corpus, they changed over time 16676 13:17:11,720 --> 13:17:14,720 and there's certain kinds of things. 16677 13:17:14,720 --> 13:17:16,720 Sometimes gmane.org is the email address 16678 13:17:16,720 --> 13:17:19,720 when people wanna hide their address. 16679 13:17:19,720 --> 13:17:22,720 And I made all kinds of stuff and I split it 16680 13:17:22,720 --> 13:17:24,720 and checked to see if it ended with this. 16681 13:17:24,720 --> 13:17:28,720 And I cleaned up things, just that kind of thing. 16682 13:17:28,720 --> 13:17:32,720 And so I have all kinds of cleanup stuff going on in here.
16683 13:17:32,720 --> 13:17:34,720 And I have this mapping and DNS mapping 16684 13:17:34,720 --> 13:17:36,720 that I'll talk about in a bit 16685 13:17:36,720 --> 13:17:39,720 where organizations sometimes sent email 16686 13:17:39,720 --> 13:17:41,720 with different addresses over time 16687 13:17:41,720 --> 13:17:44,720 and people sent email from different time. 16688 13:17:44,720 --> 13:17:47,720 And we're gonna do the parsing of the date 16689 13:17:47,720 --> 13:17:50,720 and that is the code for that. 16690 13:17:50,720 --> 13:17:53,720 I'm gonna pull out the header information. 16691 13:17:53,720 --> 13:17:58,720 This is sort of borrowed from the other code. 16692 13:17:59,720 --> 13:18:04,720 We'll clean up the email addresses and the domain names. 16693 13:18:04,720 --> 13:18:07,720 And we'll pull the date out, pull the subject out, 16694 13:18:07,720 --> 13:18:10,720 pull out the message ID, various things. 16695 13:18:10,720 --> 13:18:13,720 So here's the main body of the code. 16696 13:18:13,720 --> 13:18:18,720 We're going to go from content.sqlite to index.sqlite. 16697 13:18:18,720 --> 13:18:20,720 And what I'm gonna do every time 16698 13:18:20,720 --> 13:18:22,720 is I'm gonna wipe out index.sqlite 16699 13:18:22,720 --> 13:18:25,720 and drop the messages, senders, subjects, and replies. 16700 13:18:25,720 --> 13:18:27,720 So this is a normalized database 16701 13:18:27,720 --> 13:18:29,720 in that it has foreign keys. 16702 13:18:29,720 --> 13:18:31,720 So there's a messages table here 16703 13:18:31,720 --> 13:18:34,720 with an integer primary key, the GUID for it. 16704 13:18:34,720 --> 13:18:37,720 The GUID stands for global unique ID, 16705 13:18:37,720 --> 13:18:41,720 sender ID, and it's gonna have a blob. 
16706 13:18:41,720 --> 13:18:43,720 These are blobs, binary large objects 16707 13:18:43,720 --> 13:18:44,720 for the headers and the body 16708 13:18:44,720 --> 13:18:46,720 because I'm gonna compress them in this database 16709 13:18:46,720 --> 13:18:48,720 to make them smaller. 16710 13:18:48,720 --> 13:18:53,720 And then the senders, each sender has a key 16711 13:18:53,720 --> 13:18:57,720 and then each subject line is gonna have a key 16712 13:18:57,720 --> 13:19:02,720 and then replies are connections from one message to another. 16713 13:19:02,720 --> 13:19:04,720 And so this is like a many to many. 16714 13:19:04,720 --> 13:19:07,720 Now, I also have this file called mapping.sqlite 16715 13:19:07,720 --> 13:19:12,720 and so we can take a look at that one, mapping.sqlite. 16716 13:19:12,720 --> 13:19:17,720 And so what happened is this has two tables 16717 13:19:17,720 --> 13:19:19,720 that I deal with by hand. 16718 13:19:19,720 --> 13:19:21,720 And so sometimes in the end, 16719 13:19:21,720 --> 13:19:24,720 this was an email address that mapped to that. 16720 13:19:24,720 --> 13:19:27,720 So Indiana.edu, that's a way to take 16721 13:19:27,720 --> 13:19:29,720 an at's the email address. 16722 13:19:29,720 --> 13:19:31,720 And then these were a bunch of people 16723 13:19:31,720 --> 13:19:35,720 that had email addresses changing throughout the project 16724 13:19:35,720 --> 13:19:38,720 and I sort of kind of mapped them in a way. 16725 13:19:38,720 --> 13:19:40,720 And so this is just sort of like, 16726 13:19:40,720 --> 13:19:42,720 I pull this in really quick 16727 13:19:42,720 --> 13:19:46,720 and I read all this stuff from the DNS mapping 16728 13:19:46,720 --> 13:19:50,720 and I, other than stripping and making this lowercase, 16729 13:19:50,720 --> 13:19:54,720 et cetera, I just am gonna make a dictionary.
16730 13:19:54,720 --> 13:19:57,720 DNS mapping, which is the old name to the new name 16731 13:19:57,720 --> 13:20:00,720 and the email address mapping 16732 13:20:00,720 --> 13:20:02,720 from the old name to the new name 16733 13:20:02,720 --> 13:20:03,720 and I'm using fixsender. 16734 13:20:03,720 --> 13:20:05,720 Fixsender is because the email addresses 16735 13:20:05,720 --> 13:20:08,720 even within gmane were kind of funky. 16736 13:20:08,720 --> 13:20:12,720 So don't worry so much about this. 16737 13:20:12,720 --> 13:20:15,720 Okay, and so now what I'm gonna do is 16738 13:20:15,720 --> 13:20:18,720 I opened up a connection just to read all that stuff in 16739 13:20:18,720 --> 13:20:21,720 and now I'm going to actually open the main content 16740 13:20:21,720 --> 13:20:23,720 and this is a little trickier. 16741 13:20:23,720 --> 13:20:25,720 I open that read only. 16742 13:20:25,720 --> 13:20:28,720 That was so that I could potentially be running the spider 16743 13:20:28,720 --> 13:20:30,720 and running this at the same time. 16744 13:20:30,720 --> 13:20:32,720 I get a cursor. 16745 13:20:32,720 --> 13:20:35,720 And so I'm gonna read through, 16746 13:20:35,720 --> 13:20:38,720 so in the content file, this is the big one, 16747 13:20:38,720 --> 13:20:40,720 I'm gonna read through and go through every one 16748 13:20:40,720 --> 13:20:43,720 and write all of these things in. 16749 13:20:43,720 --> 13:20:47,720 And I'm gonna take all the email addresses 16750 13:20:47,720 --> 13:20:51,720 and I'm going to put those in a list. 16751 13:20:52,720 --> 13:20:55,720 So I loaded that, I've got the mappings loaded 16752 13:20:55,720 --> 13:20:59,720 and so now I'm going to go through every single message. 16753 13:20:59,720 --> 13:21:01,720 I got all the senders, all the subjects 16754 13:21:01,720 --> 13:21:04,720 and all the global unique IDs. 16755 13:21:04,720 --> 13:21:06,720 So I read in each message.
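
The read-only open mentioned above can be sketched with the URI form of `sqlite3.connect`. This is a minimal illustration, not the code from gmodel.py: the file name `content_demo.sqlite`, the `Messages` table and the sample row are all stand-ins invented for the example.

```python
import os
import pathlib
import sqlite3
import tempfile

# Build a tiny stand-in for content.sqlite in a temporary directory
# (hypothetical file name, table and data, only for this sketch).
tmp_dir = tempfile.mkdtemp()
db_path = os.path.join(tmp_dir, "content_demo.sqlite")

writer = sqlite3.connect(db_path)
writer.execute("CREATE TABLE Messages (id INTEGER PRIMARY KEY, email TEXT)")
writer.execute("INSERT INTO Messages (email) VALUES (?)", ("someone@example.com",))
writer.commit()
writer.close()

# Reopen the same file read-only; mode=ro makes SQLite refuse any write.
read_only_uri = pathlib.Path(db_path).as_uri() + "?mode=ro"
reader = sqlite3.connect(read_only_uri, uri=True)
rows = reader.execute("SELECT id, email FROM Messages").fetchall()

# A write through the read-only connection raises OperationalError.
try:
    reader.execute("INSERT INTO Messages (email) VALUES ('x@example.com')")
    write_blocked = False
except sqlite3.OperationalError:
    write_blocked = True
reader.close()
```

Opening the big file this way means the cleanup pass cannot modify the spidered data by accident, even if the spider is still adding to it.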
16756 13:21:06,720 --> 13:21:10,720 So now I'm going through content one at a time. 16757 13:21:10,720 --> 13:21:14,720 I parse the headers. 16758 13:21:16,720 --> 13:21:21,720 I check to see if the sender's name, email address, 16759 13:21:21,720 --> 13:21:25,720 after it's been cleaned up, is in my mapping. 16760 13:21:25,720 --> 13:21:28,720 Mapping.getSender and the default is I get backSender. 16761 13:21:28,720 --> 13:21:30,720 That's what that's saying. 16762 13:21:30,720 --> 13:21:32,720 Lookup Sender, if it's in there, 16763 13:21:32,720 --> 13:21:34,720 give me the entry of that key, 16764 13:21:34,720 --> 13:21:37,720 otherwise give me sender back. 16765 13:21:37,720 --> 13:21:41,720 We're gonna print every 250 things we do. 16766 13:21:41,720 --> 13:21:44,720 We'll complain if this is true. 16767 13:21:44,720 --> 13:21:47,720 We're gonna go get the mapping between the senders 16768 13:21:47,720 --> 13:21:49,720 which is a way to look up the primary key. 16769 13:21:49,720 --> 13:21:51,720 I could have done this with a database thing 16770 13:21:51,720 --> 13:21:53,720 but I wanted it to be fast. 16771 13:21:53,720 --> 13:21:55,720 So that's part of the reason I read all these things in 16772 13:21:55,720 --> 13:21:58,720 so I could have those mappings to be really fast. 16773 13:21:58,720 --> 13:22:00,720 You'll see this takes a little while 16774 13:22:00,720 --> 13:22:04,720 even though I got all this stuff cached. 
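
The lookup-with-default described above is plain `dict.get`. A small sketch, assuming a made-up mapping and a hypothetical helper name `canonical_sender`:

```python
# Hypothetical mapping from an old address to the current one.
mapping = {"old.name@example.edu": "new.name@example.org"}


def canonical_sender(sender, mapping):
    """Return the mapped address if there is one, otherwise the sender unchanged."""
    return mapping.get(sender, sender)


mapped = canonical_sender("old.name@example.edu", mapping)
unmapped = canonical_sender("stranger@example.com", mapping)
```

Because the default is the sender itself, addresses that were never remapped pass through untouched.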
16775 13:22:04,720 --> 13:22:08,720 And so then if I don't have a sender ID, 16776 13:22:08,720 --> 13:22:10,720 meaning that I haven't seen it yet, 16777 13:22:10,720 --> 13:22:13,720 then I'm gonna do an insert or ignore into senders 16778 13:22:13,720 --> 13:22:15,720 and then I'm gonna do a select 16779 13:22:15,720 --> 13:22:18,720 and then you've seen this where I grab the row back 16780 13:22:18,720 --> 13:22:22,720 and I'm really just trying to look at the recently assigned ID 16781 13:22:22,720 --> 13:22:26,720 and then I'm going to not only set the sender ID 16782 13:22:26,720 --> 13:22:29,720 for this iteration loop but I'm also gonna store it 16783 13:22:29,720 --> 13:22:33,720 in the dictionary and so that builds this dictionary up. 16784 13:22:33,720 --> 13:22:37,720 And you'll see the same thing is true for subject ID. 16785 13:22:37,720 --> 13:22:39,720 I'm gonna insert it into the subjects table 16786 13:22:39,720 --> 13:22:42,720 and get a primary key if I don't know what it is 16787 13:22:42,720 --> 13:22:44,720 and then I'm gonna put it into, 16788 13:22:44,720 --> 13:22:46,720 not only am I going to put it into the database 16789 13:22:46,720 --> 13:22:50,720 but I'm also gonna put it into my dictionary. 16790 13:22:50,720 --> 13:22:54,720 And the same thing, I guess I didn't do it for the GUID. 16791 13:22:54,720 --> 13:22:56,720 Okay. 16792 13:22:56,720 --> 13:23:00,720 So now what I have is the sender ID and the subject ID 16793 13:23:00,720 --> 13:23:03,720 which are foreign keys into the sender table and the subject table 16794 13:23:03,720 --> 13:23:07,720 and I'm gonna insert the message with the sender ID, subject ID, 16795 13:23:07,720 --> 13:23:09,720 the sent at, headers, and body. 16796 13:23:09,720 --> 13:23:16,720 And the values here are the GUID, sender ID, subject ID, sent at. 16797 13:23:16,720 --> 13:23:19,720 Now this here is Zlib compress. 
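
The insert-or-ignore, then select, then cache pattern described above might look roughly like this. It is a sketch under assumptions: the `Senders` schema, the in-memory database and the helper name `sender_id_for` are illustrative, not copied from the course code.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
# Assumed schema: one row per distinct sender address.
cur.execute("CREATE TABLE Senders (id INTEGER PRIMARY KEY, sender TEXT UNIQUE)")

senders = {}  # in-memory cache: address -> primary key


def sender_id_for(address):
    """Return the primary key for an address, inserting it if it is new."""
    sender_id = senders.get(address)
    if sender_id is None:
        # INSERT OR IGNORE does nothing if the address already exists.
        cur.execute("INSERT OR IGNORE INTO Senders (sender) VALUES (?)", (address,))
        cur.execute("SELECT id FROM Senders WHERE sender = ? LIMIT 1", (address,))
        sender_id = cur.fetchone()[0]
        senders[address] = sender_id  # remember it for later iterations
    return sender_id


first = sender_id_for("a@example.com")
second = sender_id_for("b@example.com")
again = sender_id_for("a@example.com")  # served from the dictionary
row_count = cur.execute("SELECT COUNT(*) FROM Senders").fetchone()[0]
```

The dictionary is what keeps the loop fast: after the first sighting of an address, no further database round trip is needed for it.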
16798 13:23:19,720 --> 13:23:27,720 So what I'm taking is the message, the header, and the body 16799 13:23:27,720 --> 13:23:31,720 and this little bit ends up with a compressed version of this stuff 16800 13:23:31,720 --> 13:23:33,720 and you'll see it in a second. 16801 13:23:33,720 --> 13:23:35,720 And this keeps the size of these text things 16802 13:23:35,720 --> 13:23:38,720 down at the cost of the computation of, 16803 13:23:38,720 --> 13:23:42,720 we have to, at the cost of the computation 16804 13:23:42,720 --> 13:23:44,720 to compress and decompress when we want to read it. 16805 13:23:44,720 --> 13:23:46,720 Okay. 16806 13:23:46,720 --> 13:23:52,720 And then I pull the GUIDs out, the ID which is the GUID 16807 13:23:52,720 --> 13:23:56,720 and I pull out the primary key for this thing based on the GUID. 16808 13:23:56,720 --> 13:23:59,720 And I update this dictionary. 16809 13:23:59,720 --> 13:24:00,720 Okay. 16810 13:24:00,720 --> 13:24:04,720 So let me run that code. 16811 13:24:04,720 --> 13:24:06,720 It is doing a lot of cleanup 16812 13:24:06,720 --> 13:24:08,720 and I'll tell you it took me a long time to make this work. 16813 13:24:08,720 --> 13:24:13,720 So just, so this code that I'm running now, 16814 13:24:13,720 --> 13:24:17,720 oh, don't forget to take a Python 3, Chuck. 16815 13:24:17,720 --> 13:24:21,720 So this is gonna run every 250. 16816 13:24:21,720 --> 13:24:23,720 So it did all this precaching. 16817 13:24:23,720 --> 13:24:25,720 So that's how long it takes to do 250. 16818 13:24:25,720 --> 13:24:27,720 Now there's 60,000 in here. 16819 13:24:27,720 --> 13:24:30,720 And so this is really busy. 16820 13:24:30,720 --> 13:24:32,720 The reason it's bouncing back and forth is that 16821 13:24:32,720 --> 13:24:34,720 every time it makes this journal file, that's, 16822 13:24:34,720 --> 13:24:35,720 and then does a commit. 
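
The compression trade-off described above can be seen in a short round trip with the standard `zlib` module. The sample text is invented; real message bodies will compress by different amounts.

```python
import zlib

# Invented sample body; repeated text compresses well.
body = "Hello list,\n" + "this line repeats so it compresses well\n" * 50

# zlib works on bytes, so encode before compressing and decode after.
compressed = zlib.compress(body.encode("utf-8"))
restored = zlib.decompress(compressed).decode("utf-8")

original_size = len(body.encode("utf-8"))
compressed_size = len(compressed)
```

The compressed value is a `bytes` object, which is why the columns holding it are declared as blobs and why the text is no longer readable in the database browser.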
16823 13:24:35,720 --> 13:24:38,720 So you can kind of see that it's, 16824 13:24:38,720 --> 13:24:40,720 it's busy making journal files and committing 16825 13:24:40,720 --> 13:24:44,720 and there's a lot of activity going on here. 16826 13:24:44,720 --> 13:24:48,720 It just so happens that Adam shows me these files. 16827 13:24:48,720 --> 13:24:57,720 Okay, so it finished. 16828 13:24:57,720 --> 13:25:00,720 It took about three minutes to finish that, right? 16829 13:25:00,720 --> 13:25:06,720 And so if we take a look at the size of the files, 16830 13:25:06,720 --> 13:25:09,720 we will see that the index is much smaller. 16831 13:25:09,720 --> 13:25:11,720 It's fully normalized. 16832 13:25:11,720 --> 13:25:13,720 It's still 263 megabytes. 16833 13:25:13,720 --> 13:25:14,720 It's all compressed. 16834 13:25:14,720 --> 13:25:21,720 So let's take a look at that in the browser. 16835 13:25:21,720 --> 13:25:23,720 So it's 200 megabytes. 16836 13:25:23,720 --> 13:25:28,720 But it loads up a lot faster. 16837 13:25:28,720 --> 13:25:30,720 There we go. 16838 13:25:30,720 --> 13:25:33,720 So we have a senders table, right? 16839 13:25:33,720 --> 13:25:36,720 Which is just kind of a many to one table. 16840 13:25:36,720 --> 13:25:40,720 We have a subjects to table, which is a many to one table. 16841 13:25:40,720 --> 13:25:43,720 And we have messages, which has foreign keys. 16842 13:25:43,720 --> 13:25:46,720 It takes a little bit to load that up. 16843 13:25:46,720 --> 13:25:51,720 Okay, and so we see the foreign keys for sender and subject 16844 13:25:51,720 --> 13:25:53,720 and that saves us. 16845 13:25:53,720 --> 13:25:55,720 All those foreign keys save us. 16846 13:25:55,720 --> 13:25:58,720 And so we have, you can kind of see that I can't see the headers in the body 16847 13:25:58,720 --> 13:26:00,720 because now they're compressed. 16848 13:26:00,720 --> 13:26:03,720 That saves me a whole bunch of stuff, right? 
16849 13:26:03,720 --> 13:26:06,720 It saved me a whole bunch of stuff. 16850 13:26:06,720 --> 13:26:10,720 And so that's what's in that file. 16851 13:26:10,720 --> 13:26:15,720 And that, we've finished this process, okay? 16852 13:26:15,720 --> 13:26:19,720 And we've finished modeling the data and making it really clean. 16853 13:26:19,720 --> 13:26:22,720 And we'll pick back up and the rest of the stuff we will do 16854 13:26:22,720 --> 13:26:26,720 is actually visualizing, pulling data out of index.sqlite. 16855 13:26:26,720 --> 13:26:28,720 The idea is this can be restarted. 16856 13:26:28,720 --> 13:26:30,720 This can be run over and over and over. 16857 13:26:30,720 --> 13:26:32,720 Even though it takes like three minutes to run this, 16858 13:26:32,720 --> 13:26:35,720 that's way better than five hours to run this. 16859 13:26:35,720 --> 13:26:37,720 So three minutes, five hours. 16860 13:26:37,720 --> 13:26:41,720 And then you'll see, and we'll see now reading this is in seconds 16861 13:26:41,720 --> 13:26:45,720 because we got it all nice and normalized in a quite pretty way. 16862 13:26:45,720 --> 13:26:48,720 So I hope this has been useful. 16863 13:26:48,720 --> 13:26:51,720 In the next one, we'll actually do the visualization. 16864 13:26:56,720 --> 13:27:00,720 We are in the process of retrieving data from this gmane server, 16865 13:27:00,720 --> 13:27:03,720 one that I've made a copy of. 16866 13:27:03,720 --> 13:27:08,720 And we have, so far, spidered it all, ended up with 600 megabytes 16867 13:27:08,720 --> 13:27:09,720 of spidered information. 16868 13:27:09,720 --> 13:27:12,720 We have run a rather complex cleanup process 16869 13:27:12,720 --> 13:27:15,720 that you probably don't need to fully understand. 16870 13:27:15,720 --> 13:27:17,720 You can look at it for patterns. 16871 13:27:17,720 --> 13:27:22,720 But in general, the cleanup process will be very sensitive to the data.
16872 13:27:22,720 --> 13:27:27,720 And then we have this index.sqlite, which is 260 megabytes right now. 16873 13:27:27,720 --> 13:27:32,720 And we are going to now do the easy, the fun, easy bits here 16874 13:27:32,720 --> 13:27:35,720 where we're going to run little queries that just pull data out. 16875 13:27:35,720 --> 13:27:37,720 And so these are much simpler. 16876 13:27:37,720 --> 13:27:40,720 So part of what I wrote when I was doing this 16877 13:27:40,720 --> 13:27:45,720 is I wanted to do some simple, basic calculations on the data 16878 13:27:45,720 --> 13:27:48,720 to make sure I really was sort of looking for anomalies, right? 16879 13:27:48,720 --> 13:27:50,720 What was working, what wasn't working. 16880 13:27:50,720 --> 13:27:55,720 So I wrote a series of really simple things like this gbasic, 16881 13:27:55,720 --> 13:27:59,720 the gbasic code, just to give me some basic data, right? 16882 13:27:59,720 --> 13:28:02,720 So I wrote things down and I counted things. 16883 13:28:02,720 --> 13:28:06,720 And so, do I need URL librequest in this one? 16884 13:28:06,720 --> 13:28:07,720 I don't think so. 16885 13:28:07,720 --> 13:28:09,720 Let's fix that bug. 16886 13:28:09,720 --> 13:28:10,720 It's not there. 16887 13:28:10,720 --> 13:28:12,720 No reason to put any of that stuff in there. 16888 13:28:12,720 --> 13:28:17,720 So it just, it reads that index.sqlite, which is our cleaned up data. 16889 13:28:17,720 --> 13:28:20,720 It reads through and makes a dictionary of this pattern. 16890 13:28:20,720 --> 13:28:26,720 You're going to see a lot where I'm going to make a dictionary of ID to senders, 16891 13:28:26,720 --> 13:28:29,720 save myself repeatedly looking at things. 16892 13:28:29,720 --> 13:28:30,720 I'm going to grab the subjects. 16893 13:28:30,720 --> 13:28:31,720 I've cached them all. 16894 13:28:31,720 --> 13:28:37,720 I could have done this all with SQL, but I just wanted to do things faster. 
16895 13:28:37,720 --> 13:28:41,720 And now I'm going to go through each of these messages 16896 13:28:41,720 --> 13:28:43,720 and make a dictionary of them. 16897 13:28:43,720 --> 13:28:45,720 I'm going to put a lot of stuff in memory. 16898 13:28:45,720 --> 13:28:47,720 And then I'm going to do some counts. 16899 13:28:47,720 --> 13:28:50,720 I'm going to see who is sent the most, right? 16900 13:28:50,720 --> 13:28:52,720 The organizations. 16901 13:28:52,720 --> 13:28:58,720 And so now I've got to go through all the messages. 16902 13:28:58,720 --> 13:29:04,720 I am not actually, so you'll notice that I'm not selecting the body or the headers here. 16903 13:29:04,720 --> 13:29:08,720 I am just getting sender ID, subject ID. 16904 13:29:08,720 --> 13:29:10,720 I probably could have done this with a join. 16905 13:29:10,720 --> 13:29:11,720 It would have been cleaner. 16906 13:29:11,720 --> 13:29:12,720 You can do that. 16907 13:29:12,720 --> 13:29:14,720 You can make that change. 16908 13:29:14,720 --> 13:29:17,720 Do that with a join so it's cleaner. 16909 13:29:17,720 --> 13:29:22,720 And so I'm going through all the messages except not the body. 16910 13:29:22,720 --> 13:29:24,720 So this is going to be really quick. 16911 13:29:24,720 --> 13:29:27,720 And I'm pulling out the senders ID. 16912 13:29:27,720 --> 13:29:29,720 I'm breaking the sender into pieces. 16913 13:29:29,720 --> 13:29:30,720 See, my data is clean now. 16914 13:29:30,720 --> 13:29:33,720 I cleaned it all up in the previous processes. 16915 13:29:33,720 --> 13:29:37,720 And if I don't have two pieces, I continue and I get the domain name. 16916 13:29:37,720 --> 13:29:38,720 So I have the person. 16917 13:29:38,720 --> 13:29:45,720 I'm doing a basic dictionary histogram for the people and the domains. 16918 13:29:45,720 --> 13:29:53,720 And then I'm going to sort them with a sorted. 16919 13:29:53,720 --> 13:29:55,720 And we're going to grab the key. 
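
The dictionary histogram and the sort by count can be sketched as follows. The addresses are invented sample data, and the variable names are illustrative rather than taken from gbasic.py.

```python
# Invented sample senders; the last one is malformed on purpose.
sender_addresses = [
    "ann@umich.edu",
    "bob@umich.edu",
    "ann@umich.edu",
    "cat@example.com",
    "broken-address",
]

people = {}   # address -> number of messages
domains = {}  # domain  -> number of messages

for address in sender_addresses:
    pieces = address.split("@")
    if len(pieces) != 2:
        continue  # skip anything that is not name@domain
    domain = pieces[1]
    people[address] = people.get(address, 0) + 1
    domains[domain] = domains.get(domain, 0) + 1

# Sort the keys by their counts, largest first, and keep the top few.
top_people = sorted(people, key=people.get, reverse=True)[:2]
top_domains = sorted(domains, key=domains.get, reverse=True)[:2]
```

Passing `key=people.get` tells `sorted` to order the keys by their counts rather than alphabetically.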
16920 13:29:55,720 --> 13:29:59,720 We're going to sort it by the how many there are reverse. 16921 13:29:59,720 --> 13:30:05,720 And then print out the top few of the organizations and the top few of the people. 16922 13:30:05,720 --> 13:30:06,720 OK? 16923 13:30:06,720 --> 13:30:08,720 So we'll just run that code. 16924 13:30:08,720 --> 13:30:12,720 Python gbasic.py. 16925 13:30:12,720 --> 13:30:14,720 Let's type the dump out the top 10. 16926 13:30:14,720 --> 13:30:20,720 So we loaded 59,000 messages, 29,000 subjects, and 1,800 senders, 16927 13:30:20,720 --> 13:30:25,720 and figured out the top 10 people and the top 10 organizations. 16928 13:30:25,720 --> 13:30:31,720 And you can write various things like that that just sort of scream through your data 16929 13:30:31,720 --> 13:30:35,720 and it's good to get sanity checking on your data. 16930 13:30:35,720 --> 13:30:36,720 OK? 16931 13:30:36,720 --> 13:30:38,720 So that's gbasic. 16932 13:30:38,720 --> 13:30:43,720 Now I want to do gword.py because that's kind of fun. 16933 13:30:43,720 --> 13:30:45,720 gword.py. 16934 13:30:45,720 --> 13:30:47,720 I don't need URLib. 16935 13:30:47,720 --> 13:30:49,720 Why do I keep putting URLib in all these things? 16936 13:30:49,720 --> 13:30:51,720 So we'll get rid of that. 16937 13:30:51,720 --> 13:30:56,720 So this is really simple because I'm just going to go for the words in the subject line. 16938 13:30:56,720 --> 13:30:59,720 And so I go through index.sqlite. 16939 13:30:59,720 --> 13:31:03,720 I read in all of the subjects. 16940 13:31:03,720 --> 13:31:06,720 And I make a dictionary of those. 16941 13:31:06,720 --> 13:31:09,720 And then I go and find all the subjects. 16942 13:31:09,720 --> 13:31:12,720 And then I'm doing this code right here. 16943 13:31:12,720 --> 13:31:19,720 I'm pulling out the subject based on the message. 
16944 13:31:19,720 --> 13:31:22,720 And I'm doing this so that when the subjects are used more than once, 16945 13:31:22,720 --> 13:31:25,720 I count the words more than once. 16946 13:31:25,720 --> 13:31:30,720 str.maketrans, I talked about that in an earlier chapter. 16947 13:31:30,720 --> 13:31:34,720 This basically throws away punctuation and numbers 16948 13:31:34,720 --> 13:31:38,720 so that when I make my words, I don't end up with words that are like dashes. 16949 13:31:38,720 --> 13:31:40,720 It compresses them down. 16950 13:31:40,720 --> 13:31:42,720 Then I strip it. 16951 13:31:42,720 --> 13:31:44,720 I convert everything to lowercase. 16952 13:31:44,720 --> 13:31:47,720 This is basically just to keep too many words from showing up. 16953 13:31:47,720 --> 13:31:48,720 Then I do a split. 16954 13:31:48,720 --> 13:31:51,720 And then I've got counts, a dictionary. 16955 13:31:51,720 --> 13:31:59,720 So this is a no punctuation, no numbers dictionary count. 16956 13:31:59,720 --> 13:32:04,720 And then I just take the and do a dictionary. 16957 13:32:04,720 --> 13:32:08,720 And then I sort them in reverse order. 16958 13:32:08,720 --> 13:32:13,720 And I figure out what the highest and lowest is by running through a, 16959 13:32:13,720 --> 13:32:20,720 I could have probably done this with a max and a min if I felt like it. 16960 13:32:20,720 --> 13:32:24,720 And so now I have the highest and the lowest. 16961 13:32:24,720 --> 13:32:27,720 I should have done a max and a min on that one. 16962 13:32:27,720 --> 13:32:28,720 Why did I do that? 16963 13:32:28,720 --> 13:32:31,720 But oh well. 16964 13:32:31,720 --> 13:32:34,720 And now I've got to spread out the size. 16965 13:32:34,720 --> 13:32:39,720 And so I'm going to produce this file gword.js, which is needed by the visualization 16966 13:32:39,720 --> 13:32:45,720 because it's going to use d3.js, a word visualizer, and gword.js. 16967 13:32:45,720 --> 13:32:47,720 I have to tell it how big the text is.
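
The word-counting steps described above (throw away punctuation and digits, strip, lowercase, split, count) can be sketched like this. The subject lines are invented, and this is an approximation of the idea rather than the exact code in gword.py.

```python
import string

# Invented subject lines standing in for the Subjects table.
subjects = [
    "Re: Sakai 2.9 release -- testing!",
    "Testing the release (part 2)",
]

# A translation table whose third argument lists characters to delete.
strip_table = str.maketrans("", "", string.punctuation + string.digits)

counts = {}
for subject in subjects:
    text = subject.translate(strip_table).strip().lower()
    for word in text.split():
        counts[word] = counts.get(word, 0) + 1
```

Lowercasing before the split is what makes "Testing" and "testing" land in the same dictionary entry.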
16968 13:32:47,720 --> 13:32:50,720 And so I'm doing some text normalization. 16969 13:32:50,720 --> 13:32:52,720 Took me a little experimentation. 16970 13:32:52,720 --> 13:33:02,720 So if I run this now, and I say python gword.py, 16971 13:33:02,720 --> 13:33:07,720 and I say python3 gword.py, which is a lot better. 16972 13:33:07,720 --> 13:33:13,720 Oh, not python. 16973 13:33:13,720 --> 13:33:19,720 Okay, so now I can go look at the gword.js, wherever that is, gword.js. 16974 13:33:19,720 --> 13:33:20,720 Yep. 16975 13:33:20,720 --> 13:33:25,720 And so this is basically, it normalized all the frequencies 16976 13:33:25,720 --> 13:33:27,720 and made them font sizes. 16977 13:33:27,720 --> 13:33:29,720 These are font sizes now. 16978 13:33:29,720 --> 13:33:34,720 And so this is just the data that's needed by this gword.htm, 16979 13:33:34,720 --> 13:33:40,720 which uses this d3 visualization word cloud code. 16980 13:33:40,720 --> 13:33:44,720 So this pulls in all my data, and then this is just some JavaScript 16981 13:33:44,720 --> 13:33:49,720 that draws the picture on the page. 16982 13:33:49,720 --> 13:33:57,720 And so the easy part now is to just open gword.htm in a browser. 16983 13:33:57,720 --> 13:33:59,720 It just so happens on a Mac I can do this. 16984 13:33:59,720 --> 13:34:05,720 And so that gives me a word cloud based on that data. 16985 13:34:05,720 --> 13:34:07,720 It kind of randomizes it. 16986 13:34:07,720 --> 13:34:08,720 It shows different stuff. 16987 13:34:08,720 --> 13:34:17,720 But it's using this data to generate how big those things are, 16988 13:34:17,720 --> 13:34:21,720 and then using a bit of randomness and simulated annealing to lay it out. 16989 13:34:21,720 --> 13:34:24,720 That's not stuff that we actually have to worry about, okay? 16990 13:34:24,720 --> 13:34:30,720 So that's how we get to the point where we're seeing a word cloud from this. 16991 13:34:30,720 --> 13:34:33,720 Now we're going to do another visualization.
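
Turning word frequencies into font sizes, as described above, is a linear rescale between the lowest and highest count. In this sketch the font-size range (20 to 100), the sample counts and the exact shape of the JavaScript text are assumptions for illustration; it also uses `max` and `min`, as the walk-through suggests would have been simpler.

```python
import json

# Invented word counts standing in for the real histogram.
counts = {"sakai": 40, "release": 10, "testing": 25}

SMALL_SIZE = 20   # assumed font size for the rarest word
BIG_SIZE = 100    # assumed font size for the most common word

highest = max(counts.values())
lowest = min(counts.values())
spread = (highest - lowest) or 1  # avoid dividing by zero if all counts match

words = []
for word in sorted(counts, key=counts.get, reverse=True):
    scale = (counts[word] - lowest) / spread  # 0.0 for rarest, 1.0 for most common
    size = int(SMALL_SIZE + scale * (BIG_SIZE - SMALL_SIZE))
    words.append({"text": word, "size": size})

# Text that could be written to a .js file for the page to load.
js_source = "gword = " + json.dumps(words) + ";\n"
```

The page only needs a list of words with sizes, so the Python side does all the counting and scaling and hands over plain data.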
16992 13:34:33,720 --> 13:34:36,720 And this time we're going to do a line visualization. 16993 13:34:36,720 --> 13:34:39,720 And we're going to create a thing called gline.js 16994 13:34:39,720 --> 13:34:42,720 and produce, with another HTML file, we're going to use d3 16995 13:34:42,720 --> 13:34:45,720 and produce that output. 16996 13:34:45,720 --> 13:34:51,720 So let's say goodbye here, goodbye, goodbye, goodbye, goodbye. 16997 13:34:51,720 --> 13:34:57,720 So gline.py, get rid of that file. 16998 13:34:57,720 --> 13:35:02,720 So again, I'm going to preload all of the senders in this case. 16999 13:35:02,720 --> 13:35:04,720 And again, I could have done this with a join. 17000 13:35:04,720 --> 13:35:06,720 Probably should have done this with a join. 17001 13:35:06,720 --> 13:35:13,720 I'm going to preload all the messages, the sender ID, subject ID, etc. 17002 13:35:13,720 --> 13:35:15,720 I'll load those up. 17003 13:35:15,720 --> 13:35:18,720 And now I'm going to read through. 17004 13:35:18,720 --> 13:35:23,720 I'm going to have the sending organizations and the senders. 17005 13:35:23,720 --> 13:35:27,720 And I'm going to accumulate and split the senders. 17006 13:35:27,720 --> 13:35:30,720 And I'm going to have the sending organizations. 17007 13:35:30,720 --> 13:35:32,720 And then I'm going to do a simple dictionary 17008 13:35:32,720 --> 13:35:35,720 as I accumulate the sending organizations 17009 13:35:35,720 --> 13:35:38,720 by splitting the person's name into add signs. 17010 13:35:38,720 --> 13:35:42,720 And then based on the organization, I accumulate it. 17011 13:35:42,720 --> 13:35:44,720 And then I sort them. 17012 13:35:44,720 --> 13:35:47,720 And I pull out the top ten organizations. 17013 13:35:47,720 --> 13:35:49,720 I print those out. 17014 13:35:49,720 --> 13:35:57,720 And now I'm going to produce, break this down into months. 17015 13:35:57,720 --> 13:36:00,720 And I'll show you what this looks like in a second. 
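
The join mentioned above as the cleaner alternative to preloading senders into a dictionary could look something like this. The table and column names are assumptions modeled on the description of index.sqlite, and the rows are invented.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Assumed, simplified versions of the normalized tables.
cur.execute("CREATE TABLE Senders (id INTEGER PRIMARY KEY, sender TEXT UNIQUE)")
cur.execute(
    "CREATE TABLE Messages (id INTEGER PRIMARY KEY, sender_id INTEGER, sent_at TEXT)"
)
cur.executemany(
    "INSERT INTO Senders (id, sender) VALUES (?, ?)",
    [(1, "ann@umich.edu"), (2, "bob@example.com")],
)
cur.executemany(
    "INSERT INTO Messages (sender_id, sent_at) VALUES (?, ?)",
    [(1, "2006-01-15T10:00:00"), (2, "2006-01-16T11:00:00"), (1, "2006-02-03T09:30:00")],
)

# Let the database follow the foreign key instead of a Python dictionary.
cur.execute(
    """
    SELECT Senders.sender, Messages.sent_at
    FROM Messages JOIN Senders ON Messages.sender_id = Senders.id
    ORDER BY Messages.id
    """
)
joined = cur.fetchall()
```

The result already pairs each message with its sender address, so no separate ID-to-sender cache is needed.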
17016 13:36:00,720 --> 13:36:03,720 Let's go to the gline.js. 17017 13:36:03,720 --> 13:36:06,720 So the month looks like this, okay? 17018 13:36:06,720 --> 13:36:07,720 So the month looks like that. 17019 13:36:07,720 --> 13:36:10,720 So that's the first seven characters of the date. 17020 13:36:10,720 --> 13:36:15,720 So if we look at the date, date looks like that. 17021 13:36:15,720 --> 13:36:18,720 The month is the first seven characters. 17022 13:36:18,720 --> 13:36:22,720 And this is the data that I've got to give it. 17023 13:36:22,720 --> 13:36:24,720 We'll clean that up in a second. 17024 13:36:24,720 --> 13:36:28,720 That data will look better in a moment. 17025 13:36:28,720 --> 13:36:30,720 Go back to gline.py. 17026 13:36:30,720 --> 13:36:34,720 And so this is... 17027 13:36:34,720 --> 13:36:37,720 We're doing a... 17028 13:36:37,720 --> 13:36:40,720 The key is a tuple, which is the month, 17029 13:36:40,720 --> 13:36:45,720 and which organization it is that did it. 17030 13:36:45,720 --> 13:36:47,720 And it's only in the top ten organizations. 17031 13:36:47,720 --> 13:36:53,720 And then we're going to do a... 17032 13:36:53,720 --> 13:36:56,720 We're going to basically do a dictionary 17033 13:36:56,720 --> 13:36:58,720 where the key is a tuple. 17034 13:36:58,720 --> 13:37:00,720 And then we're going to sort it. 17035 13:37:00,720 --> 13:37:04,720 Sort by key in this case, not by value. 17036 13:37:04,720 --> 13:37:05,720 That's... 17037 13:37:05,720 --> 13:37:07,720 And the months is going to sort that. 17038 13:37:07,720 --> 13:37:10,720 And then we're going to write all this data out 17039 13:37:10,720 --> 13:37:12,720 into gline.js. 17040 13:37:12,720 --> 13:37:14,720 So let's go ahead and run this. 17041 13:37:14,720 --> 13:37:17,720 And again, this is just the data that has to be written 17042 13:37:17,720 --> 13:37:21,720 in a way that the JavaScript can understand it. 17043 13:37:21,720 --> 13:37:28,720 Python, gline, python3, gline.py. 
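
The month-by-organization counting described above uses a tuple as the dictionary key and takes the month from the first seven characters of the date. A sketch with invented messages and an assumed list of top organizations:

```python
# Invented (sent_at, sender) pairs standing in for rows from index.sqlite.
messages = [
    ("2006-02-03T09:30:00", "ann@umich.edu"),
    ("2006-01-15T10:00:00", "bob@example.com"),
    ("2006-02-20T12:00:00", "cat@umich.edu"),
    ("2006-02-21T08:00:00", "dan@other.org"),
]
top_orgs = ["umich.edu", "example.com"]  # assumed result of the earlier ranking

counts = {}   # (month, organization) -> number of messages
months = []

for sent_at, sender in messages:
    pieces = sender.split("@")
    if len(pieces) != 2:
        continue
    org = pieces[1]
    if org not in top_orgs:
        continue  # only chart the top organizations
    month = sent_at[:7]  # "YYYY-MM"
    if month not in months:
        months.append(month)
    key = (month, org)
    counts[key] = counts.get(key, 0) + 1

months.sort()
sorted_keys = sorted(counts)  # tuples sort by month first, then organization

# One row per month, one column per organization, zero where there is no data.
rows = [[month] + [counts.get((month, org), 0) for org in top_orgs] for month in months]
```

Each row has the shape a line chart wants: an x value followed by one y value per line.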
17044 13:37:28,720 --> 13:37:30,720 Okay, so top ten organizations. 17045 13:37:30,720 --> 13:37:32,720 So let's take a look at that JavaScript. 17046 13:37:32,720 --> 13:37:34,720 So this is what it looks like. 17047 13:37:34,720 --> 13:37:39,720 So it just so happens that you got to tell it the... 17048 13:37:39,720 --> 13:37:42,720 These are the data points, these are the lines. 17049 13:37:42,720 --> 13:37:45,720 So this is the year, the line for University of Michigan, 17050 13:37:45,720 --> 13:37:48,720 gmail.com, swinsburg.com. 17051 13:37:48,720 --> 13:37:51,720 So this first column is that line points 17052 13:37:51,720 --> 13:37:53,720 and the next line points. 17053 13:37:53,720 --> 13:37:57,720 So all this code was to get the data in such a way 17054 13:37:57,720 --> 13:38:00,720 that I could produce this JavaScript file. 17055 13:38:00,720 --> 13:38:04,720 Because if I look at gline.htm, 17056 13:38:04,720 --> 13:38:07,720 I need that data in that particular format. 17057 13:38:07,720 --> 13:38:10,720 And I've got all this stuff. 17058 13:38:10,720 --> 13:38:11,720 I make a line chart. 17059 13:38:11,720 --> 13:38:14,720 And I draw it with this data, that data. 17060 13:38:14,720 --> 13:38:16,720 I had to go read all the documentation 17061 13:38:16,720 --> 13:38:19,720 on how to figure this stuff out. 17062 13:38:19,720 --> 13:38:21,720 And that's the data that I'm going to use. 17063 13:38:21,720 --> 13:38:22,720 And I had to figure this out. 17064 13:38:22,720 --> 13:38:24,720 And I had to transform it and make it pretty. 17065 13:38:24,720 --> 13:38:26,720 It took me quite a while to get this to work. 17066 13:38:26,720 --> 13:38:28,720 And this is not a JavaScript class 17067 13:38:28,720 --> 13:38:31,720 nor a how to visualize in D3. 17068 13:38:31,720 --> 13:38:35,720 But basically, we pulled all that stuff in. 17069 13:38:35,720 --> 13:38:40,720 And here's the gline that came from the JavaScript. 
17070 13:38:40,720 --> 13:38:42,720 And then it makes an array to data table. 17071 13:38:42,720 --> 13:38:45,720 And then that data table is what gline draws. 17072 13:38:45,720 --> 13:38:57,720 So with no further ado, let's open gline.htm to show that data. 17073 13:38:57,720 --> 13:38:58,720 So there you go. 17074 13:38:58,720 --> 13:39:02,720 That's the Sakai developer participation from 2015 17075 13:39:02,720 --> 13:39:09,720 through 2005 through 2015, based on which organizations did 17076 13:39:09,720 --> 13:39:12,720 the most commits in Sakai. 17077 13:39:12,720 --> 13:39:15,720 And so I know that I haven't done all this code full justice. 17078 13:39:15,720 --> 13:39:17,720 There's a lot of code here. 17079 13:39:17,720 --> 13:39:20,720 The fun is just to kind of run it and see it. 17080 13:39:20,720 --> 13:39:23,720 And then when the time comes to come back and see 17081 13:39:23,720 --> 13:39:26,720 the techniques that are used when you're trying to build 17082 13:39:26,720 --> 13:39:29,720 your own visualization pipeline. 17083 13:39:29,720 --> 13:39:33,720 So I hope that you found this useful. 17084 13:39:33,720 --> 13:39:35,720 You know, this is a lot of code. 17085 13:39:35,720 --> 13:39:37,720 Hard to explain in 15, 20 minutes. 17086 13:39:37,720 --> 13:39:40,720 But I hope you take some time and look it over. 17087 13:39:40,720 --> 13:39:43,720 And I hope you found all these videos. 17088 13:39:43,720 --> 13:39:45,720 This is kind of the last walk-through video 17089 13:39:45,720 --> 13:39:47,720 for chapter 16 of the book. 17090 13:39:47,720 --> 13:39:50,720 And so I hope that I will see you on the net. 17091 13:39:50,720 --> 13:39:57,720 Thank you.
