All language subtitles for cdn.cs50.net_2022_fall_lectures_7_lang_en_lecture7

af Afrikaans
ak Akan
sq Albanian
am Amharic
ar Arabic
hy Armenian
az Azerbaijani
eu Basque
be Belarusian
bem Bemba
bn Bengali
bh Bihari
bs Bosnian
br Breton
bg Bulgarian
km Cambodian
ca Catalan
ceb Cebuano
chr Cherokee
ny Chichewa
zh-CN Chinese (Simplified)
zh-TW Chinese (Traditional)
co Corsican
hr Croatian
cs Czech
da Danish
nl Dutch
en English
eo Esperanto
et Estonian
ee Ewe
fo Faroese
tl Filipino
fi Finnish
fr French
fy Frisian
gaa Ga
gl Galician
ka Georgian
de German
el Greek
gn Guarani
gu Gujarati
ht Haitian Creole
ha Hausa
haw Hawaiian
iw Hebrew
hi Hindi
hmn Hmong
hu Hungarian
is Icelandic
ig Igbo
id Indonesian
ia Interlingua
ga Irish
it Italian
ja Japanese
jw Javanese
kn Kannada
kk Kazakh
rw Kinyarwanda
rn Kirundi
kg Kongo
ko Korean
kri Krio (Sierra Leone)
ku Kurdish
ckb Kurdish (SoranĂ®)
ky Kyrgyz
lo Laothian
la Latin
lv Latvian
ln Lingala
lt Lithuanian
loz Lozi
lg Luganda
ach Luo
lb Luxembourgish
mk Macedonian
mg Malagasy
ms Malay
ml Malayalam
mt Maltese
mi Maori
mr Marathi
mfe Mauritian Creole
mo Moldavian
mn Mongolian
my Myanmar (Burmese)
sr-ME Montenegrin
ne Nepali
pcm Nigerian Pidgin
nso Northern Sotho
no Norwegian
nn Norwegian (Nynorsk)
oc Occitan
or Oriya
om Oromo
ps Pashto
fa Persian
pl Polish
pt-BR Portuguese (Brazil)
pt Portuguese (Portugal)
pa Punjabi
qu Quechua
ro Romanian
rm Romansh
nyn Runyakitara
ru Russian
sm Samoan
gd Scots Gaelic
sr Serbian
sh Serbo-Croatian
st Sesotho
tn Setswana
crs Seychellois Creole
sn Shona
sd Sindhi
si Sinhalese
sk Slovak
sl Slovenian
so Somali
es Spanish
es-419 Spanish (Latin American)
su Sundanese
sw Swahili
sv Swedish
tg Tajik
ta Tamil
tt Tatar
te Telugu
th Thai
ti Tigrinya
to Tonga
lua Tshiluba
tum Tumbuka
tr Turkish
tk Turkmen
tw Twi
ug Uighur
uk Ukrainian
ur Urdu
uz Uzbek
vi Vietnamese
cy Welsh
wo Wolof
xh Xhosa
yi Yiddish
yo Yoruba
zu Zulu
Would you like to inspect the original subtitles? These are the user uploaded subtitles that are being translated: 1 00:00:00,000 --> 00:00:02,988 [MUSIC PLAYING] 2 00:00:02,988 --> 00:01:12,582 3 00:01:12,582 --> 00:01:13,540 DAVID MALAN: All right. 4 00:01:13,540 --> 00:01:17,150 This is CS50 and this is already week seven. 5 00:01:17,150 --> 00:01:20,240 And this is the week where we'll continue where we left off 6 00:01:20,240 --> 00:01:22,430 with Python, introducing you to a bit more syntax 7 00:01:22,430 --> 00:01:25,580 and capabilities of the language so you can solve interesting problems. 8 00:01:25,580 --> 00:01:27,580 But a lot of those problems increasingly are now 9 00:01:27,580 --> 00:01:29,180 going to involve data in some form. 10 00:01:29,180 --> 00:01:31,250 After all, if you think of most any website 11 00:01:31,250 --> 00:01:36,320 or mobile app or process nowadays that involves solving problems, 12 00:01:36,320 --> 00:01:39,530 it almost always involves some amount of data and often data at scale. 13 00:01:39,530 --> 00:01:40,680 Lots and lots of data. 14 00:01:40,680 --> 00:01:42,430 And so what we're going to see first today 15 00:01:42,430 --> 00:01:45,710 is that, yes, you can use Python to solve all the problems past that we've 16 00:01:45,710 --> 00:01:48,327 seen and also some data specific ones, but sometimes it's 17 00:01:48,327 --> 00:01:49,410 just going to be annoying. 18 00:01:49,410 --> 00:01:50,420 It's going to be a little painful. 19 00:01:50,420 --> 00:01:54,060 It's just going to be more work than you might like to just get to some answer. 20 00:01:54,060 --> 00:01:56,870 And so today we'll too introduce you to a new language 21 00:01:56,870 --> 00:01:59,240 called SQL, Structured Query Language. 22 00:01:59,240 --> 00:02:02,780 And this is a language that rest assured is actually much smaller, 23 00:02:02,780 --> 00:02:05,240 relatively speaking, than C and Python. 24 00:02:05,240 --> 00:02:07,860 It sort of does less, but it doesn't really well. 25 00:02:07,860 --> 00:02:11,240 And it's a language for querying databases, storing data in it, 26 00:02:11,240 --> 00:02:13,950 updating it, inserting it, deleting it, and so much more. 27 00:02:13,950 --> 00:02:16,160 And it's the kind of technology that's used nowadays 28 00:02:16,160 --> 00:02:19,700 in, indeed, web apps and mobile apps, data science, analytics, 29 00:02:19,700 --> 00:02:20,940 and so much more. 30 00:02:20,940 --> 00:02:23,600 It's really good at storing lots and lots of data. 31 00:02:23,600 --> 00:02:25,430 Now, this is yet another language. 32 00:02:25,430 --> 00:02:27,800 And believe it or not, next week we'll introduce you 33 00:02:27,800 --> 00:02:31,880 to three more languages, HTML and CSS, which are not technically 34 00:02:31,880 --> 00:02:32,900 programming languages. 35 00:02:32,900 --> 00:02:35,630 They're all about aesthetics and markup of information. 36 00:02:35,630 --> 00:02:38,660 But also JavaScript, which is, in fact, a programming language. 37 00:02:38,660 --> 00:02:41,630 But the goals here in CS50 really are going 38 00:02:41,630 --> 00:02:44,000 to be to empower you to program more generally. 39 00:02:44,000 --> 00:02:47,000 And indeed, when you're out there in the real world some years from now, 40 00:02:47,000 --> 00:02:50,330 invariably there's going to be some new other popular language out there. 41 00:02:50,330 --> 00:02:53,880 And hopefully in this week and next week and beyond, among the goals 42 00:02:53,880 --> 00:02:56,630 is not just to teach you these languages specifically but again, 43 00:02:56,630 --> 00:02:59,330 how to teach yourself the future languages that we've not even 44 00:02:59,330 --> 00:03:00,480 heard about just yet. 45 00:03:00,480 --> 00:03:04,220 So with that said, let's begin with a survey of sorts. 46 00:03:04,220 --> 00:03:10,190 If you go to this URL on your phone or laptop, cs50.ly/favorites, 47 00:03:10,190 --> 00:03:14,390 a very simple Google Form awaits you that's just going to ask you a couple 48 00:03:14,390 --> 00:03:16,530 of multiple choice questions. 49 00:03:16,530 --> 00:03:24,800 So go to cs50.ly/favorites, and that should lead you to a Google Form that 50 00:03:24,800 --> 00:03:27,590 looks a little something like this asking you first, 51 00:03:27,590 --> 00:03:31,700 as of now in week seven, what is your favorite language among those options 52 00:03:31,700 --> 00:03:32,250 here. 53 00:03:32,250 --> 00:03:34,220 And then further down, one more question. 54 00:03:34,220 --> 00:03:39,170 If you think back on problem sets 0 through 6, what was, if any, 55 00:03:39,170 --> 00:03:42,080 your favorite problem set problem? 56 00:03:42,080 --> 00:03:45,110 Be it in Scratch or C or Python. 57 00:03:45,110 --> 00:03:48,650 So answer those two questions. 58 00:03:48,650 --> 00:03:52,820 And in a moment, I'll flip over to my screen here where you'll see, 59 00:03:52,820 --> 00:03:56,570 and anyone who's used Google Forms knows, the spreadsheet that's 60 00:03:56,570 --> 00:03:58,130 collecting now this data. 61 00:03:58,130 --> 00:04:01,060 Microsoft Office 365 can do the same if you use one of those forms. 62 00:04:01,060 --> 00:04:02,810 And what you see here now is a spreadsheet 63 00:04:02,810 --> 00:04:06,770 in Google Sheets enumerating all of the audience's questions. 64 00:04:06,770 --> 00:04:09,980 Language is in column B, problem is in column C, 65 00:04:09,980 --> 00:04:13,190 and each row represents one student who has responded. 66 00:04:13,190 --> 00:04:17,360 A few of you were super eager for class today at 8:33 AM Eastern time. 67 00:04:17,360 --> 00:04:19,040 10:32, 11:10. 68 00:04:19,040 --> 00:04:21,779 OK, so now we're getting into the actual class time here. 69 00:04:21,779 --> 00:04:24,830 And if I scroll down, we'll probably see a few dozen, 70 00:04:24,830 --> 00:04:26,900 a couple of hundred answers by now. 71 00:04:26,900 --> 00:04:29,225 And yeah, so we're getting a whole lot of answers here. 72 00:04:29,225 --> 00:04:31,100 And I'm seeing some patterns emerge, but it's 73 00:04:31,100 --> 00:04:34,190 not necessarily obvious to the human eyes what those patterns are. 74 00:04:34,190 --> 00:04:36,320 Now of course, you can use Google Spreadsheets. 75 00:04:36,320 --> 00:04:40,143 You can highlight the data and you can create charts magically out of it. 76 00:04:40,143 --> 00:04:42,560 But you can only do what Google lets you do with the data. 77 00:04:42,560 --> 00:04:45,350 And same thing for Microsoft Excel or Apple Numbers. 78 00:04:45,350 --> 00:04:48,230 But wouldn't it be nice to just be able to manipulate the raw data, 79 00:04:48,230 --> 00:04:51,740 relatively simple though it is, to just answer questions about the data? 80 00:04:51,740 --> 00:04:54,260 Maybe long term create your own charts, customize it 81 00:04:54,260 --> 00:04:56,240 just the way you want rather than beholden 82 00:04:56,240 --> 00:04:59,250 to software that's off the shelf like this. 83 00:04:59,250 --> 00:05:02,130 Well, how could we go about doing this? 84 00:05:02,130 --> 00:05:05,330 Well, let me propose that we treat this data set now 85 00:05:05,330 --> 00:05:08,790 as what we're going to call for now a flat file database. 86 00:05:08,790 --> 00:05:10,850 We'll see today that there's fancier databases, 87 00:05:10,850 --> 00:05:15,200 but the simplest database in the world is really just like a .csv file. 88 00:05:15,200 --> 00:05:18,500 And we saw that a couple of weeks ago in C. We wrote a bit of C code 89 00:05:18,500 --> 00:05:23,660 that used fprintf to write data to a file using commas as the separator. 90 00:05:23,660 --> 00:05:25,910 We didn't really do much more with CSVs at the time 91 00:05:25,910 --> 00:05:28,610 though, because it's really annoying, painful, time consuming, 92 00:05:28,610 --> 00:05:32,840 not fun to use C for something like that because of malloc and memory and all 93 00:05:32,840 --> 00:05:33,413 that stuff. 94 00:05:33,413 --> 00:05:35,330 But with Python, it's going to be much easier. 95 00:05:35,330 --> 00:05:39,530 And so any time you have access to some data set where you can just download it 96 00:05:39,530 --> 00:05:42,140 to your own Mac or PC or your cloud environment, 97 00:05:42,140 --> 00:05:46,460 it's sort of a candidate for now writing code to do something with the data. 98 00:05:46,460 --> 00:05:48,120 Maybe analyze it right away. 99 00:05:48,120 --> 00:05:50,150 If it's been human imported manually, maybe you 100 00:05:50,150 --> 00:05:52,550 have to clean it up by doing a lot of find and replace 101 00:05:52,550 --> 00:05:54,852 but not with your keyboard but rather with code. 102 00:05:54,852 --> 00:05:56,310 And so let me go ahead and do this. 103 00:05:56,310 --> 00:05:59,780 Let me go back to my Google Sheet here that has 104 00:05:59,780 --> 00:06:02,010 all of the data that's come in now. 105 00:06:02,010 --> 00:06:08,430 And let me go ahead and download this via the File menu here. 106 00:06:08,430 --> 00:06:09,590 And let's see. 107 00:06:09,590 --> 00:06:10,310 Download. 108 00:06:10,310 --> 00:06:12,170 And you can see a whole bunch of options. 109 00:06:12,170 --> 00:06:14,210 Most formats might be familiar, but today we'll 110 00:06:14,210 --> 00:06:17,602 focus just on this one, Comma Separated Values or CSV. 111 00:06:17,602 --> 00:06:20,810 That's going to go ahead and download it on my Mac here into my own Downloads 112 00:06:20,810 --> 00:06:21,140 folder. 113 00:06:21,140 --> 00:06:22,890 And now I'm going to go ahead and do this. 114 00:06:22,890 --> 00:06:25,140 Let me go ahead and pull up VS Code in the cloud here. 115 00:06:25,140 --> 00:06:28,140 And if you've never done this before, there's a couple of ways to do it. 116 00:06:28,140 --> 00:06:31,320 But the simplest way to upload a file to your code space, so to speak, 117 00:06:31,320 --> 00:06:32,990 is just a sort of drag and drop. 118 00:06:32,990 --> 00:06:36,020 That's going to magically upload it to the server there. 119 00:06:36,020 --> 00:06:39,688 And we'll see that, one, it has a very long file name, which I'm actually 120 00:06:39,688 --> 00:06:41,480 going to clean this up because that's going 121 00:06:41,480 --> 00:06:43,880 to be very tedious to type in my code. 122 00:06:43,880 --> 00:06:46,927 So I could either right click, of course, up here, 123 00:06:46,927 --> 00:06:48,510 but I'm going to use my Linux command. 124 00:06:48,510 --> 00:06:52,550 So let's move this file called CSV 50 2022 something or other and let's 125 00:06:52,550 --> 00:06:54,740 just name it more simply favorites.csv. 126 00:06:54,740 --> 00:06:58,370 So all lowercase, no spaces, sort of good basics. 127 00:06:58,370 --> 00:07:02,960 And let me go ahead now and open up this file with code favorites.csv. 128 00:07:02,960 --> 00:07:06,500 I'll close my File Explorer and we'll see exactly the same data 129 00:07:06,500 --> 00:07:09,890 as before but not quite as pretty as Google Sheets makes it be. 130 00:07:09,890 --> 00:07:13,640 Rather we see here that I still have three columns, timestamp, language, 131 00:07:13,640 --> 00:07:17,013 problem, and then all of the values down below, including the timestamps 132 00:07:17,013 --> 00:07:18,180 and the answers they're for. 133 00:07:18,180 --> 00:07:20,360 But it doesn't have proper columns. 134 00:07:20,360 --> 00:07:22,250 It just has commas separating them. 135 00:07:22,250 --> 00:07:24,860 Now, we could very easily write Python code 136 00:07:24,860 --> 00:07:28,820 just like we wrote C code to manipulate files like this 137 00:07:28,820 --> 00:07:30,140 either to write or read. 138 00:07:30,140 --> 00:07:32,450 But instead let's do something that's a little more 139 00:07:32,450 --> 00:07:34,640 pleasant, which is indeed in the form of Python. 140 00:07:34,640 --> 00:07:38,300 So Python actually comes with native support for CSVs. 141 00:07:38,300 --> 00:07:41,200 It has indeed a package called CSV that just 142 00:07:41,200 --> 00:07:43,700 lets you read and write and do a whole bunch of useful stuff 143 00:07:43,700 --> 00:07:45,007 when it comes to CSV files. 144 00:07:45,007 --> 00:07:47,090 So let's go ahead and do something with this file. 145 00:07:47,090 --> 00:07:49,250 Let me go back here to VS Code. 146 00:07:49,250 --> 00:07:51,860 I'm going to close favorites.csv for now. 147 00:07:51,860 --> 00:07:55,430 But just remember in your mind that timestamp was the first column, 148 00:07:55,430 --> 00:07:58,640 language was the second column, and problem was the third. 149 00:07:58,640 --> 00:08:01,790 And notice because we're using commas, they don't again line up perfectly, 150 00:08:01,790 --> 00:08:02,832 but that's not a problem. 151 00:08:02,832 --> 00:08:05,270 There are two commas in every line presumably. 152 00:08:05,270 --> 00:08:09,892 And I'm going to go ahead and now create a file called how about favorites.py so 153 00:08:09,892 --> 00:08:12,350 that I can start writing some code to manipulate this data. 154 00:08:12,350 --> 00:08:13,600 And let's do something simple. 155 00:08:13,600 --> 00:08:16,910 Let's just write a simple program in Python that opens this file, 156 00:08:16,910 --> 00:08:19,997 reads it, and print something out just as a safety check 157 00:08:19,997 --> 00:08:22,830 that I know what I'm doing, even though it's not going to be useful. 158 00:08:22,830 --> 00:08:26,780 So in Python, if you want CSV support, you import CSV. 159 00:08:26,780 --> 00:08:30,350 And that gives you access to all the magical capabilities thereof. 160 00:08:30,350 --> 00:08:32,450 Let me now go ahead and use this technique 161 00:08:32,450 --> 00:08:35,960 to open a file in Python, which is similar in C. 162 00:08:35,960 --> 00:08:37,789 But with Python, we're going to do this. 163 00:08:37,789 --> 00:08:39,020 The keyword with. 164 00:08:39,020 --> 00:08:42,049 I'm going to open a file called favorites.csv, 165 00:08:42,049 --> 00:08:43,880 which was the shorter name I gave it. 166 00:08:43,880 --> 00:08:46,280 This is optional, but just for explicitness, I'm 167 00:08:46,280 --> 00:08:50,090 going to open it in read mode explicitly, just like f open 168 00:08:50,090 --> 00:08:51,720 took a second argument as well. 169 00:08:51,720 --> 00:08:54,500 And I'm going to name this file once open quite simply file, 170 00:08:54,500 --> 00:08:56,240 though I could call it anything I want. 171 00:08:56,240 --> 00:08:59,060 And now it's just an open file. 172 00:08:59,060 --> 00:09:02,870 So far as Python knows at this moment, it's just text, or better yet, 173 00:09:02,870 --> 00:09:04,250 it's just zeros and ones. 174 00:09:04,250 --> 00:09:08,360 If you want this Python package called CSV to actually do 175 00:09:08,360 --> 00:09:12,500 something useful with it, you have to load this file now into the library. 176 00:09:12,500 --> 00:09:16,165 And the simplest way to do this is to give myself a variable called reader 177 00:09:16,165 --> 00:09:17,540 because I want to read this file. 178 00:09:17,540 --> 00:09:19,430 Though this too I could call anything else. 179 00:09:19,430 --> 00:09:22,610 I'm going to then set that equal to the return value of a function called 180 00:09:22,610 --> 00:09:24,710 csv.reader. 181 00:09:24,710 --> 00:09:28,220 And I pass to that per the documentation the open file. 182 00:09:28,220 --> 00:09:29,780 So step one, I open the file. 183 00:09:29,780 --> 00:09:32,030 And this just gives me access to the bytes therein. 184 00:09:32,030 --> 00:09:36,530 Step two now with csv.reader tells the Python package 185 00:09:36,530 --> 00:09:40,940 called CSV to do something useful with it and start analyzing the commas 186 00:09:40,940 --> 00:09:43,770 and allow me to parse it further. 187 00:09:43,770 --> 00:09:45,180 So let's go ahead and do this. 188 00:09:45,180 --> 00:09:49,730 Let me go ahead now and within this loop let's say this. 189 00:09:49,730 --> 00:09:52,340 Sorry, within this open file, let's do this. 190 00:09:52,340 --> 00:09:58,040 For every row, if you will, or line in the file, a.k.a. 191 00:09:58,040 --> 00:10:04,910 reader, let's go ahead and print out just how about row bracket 1. 192 00:10:04,910 --> 00:10:06,198 Now what's going on here? 193 00:10:06,198 --> 00:10:08,990 Well, it turns out if you read the documentation for the CSV reader 194 00:10:08,990 --> 00:10:13,610 function, what it hands you back is essentially this special object, so 195 00:10:13,610 --> 00:10:16,850 to speak, that allows you to treat it as though it's just 196 00:10:16,850 --> 00:10:20,660 a really big list of lines from the file, a.k.a. reader. 197 00:10:20,660 --> 00:10:25,310 So by saying for row in reader, this is a way more succinct way 198 00:10:25,310 --> 00:10:28,763 of saying give me the first line in the file plus plus, 199 00:10:28,763 --> 00:10:30,680 give me the second line in the file plus plus, 200 00:10:30,680 --> 00:10:33,597 and so forth that we would have done what much more mechanically in C. 201 00:10:33,597 --> 00:10:37,350 This is just much more Pythonic and English friendly, if you will. 202 00:10:37,350 --> 00:10:42,225 So in every iteration of this loop, row is going to contain all of the data 203 00:10:42,225 --> 00:10:43,100 from the current row. 204 00:10:43,100 --> 00:10:45,590 But better yet, what the reader function does 205 00:10:45,590 --> 00:10:50,240 for me is it hands me each row not just as a big string or STR 206 00:10:50,240 --> 00:10:51,590 of text in Python. 207 00:10:51,590 --> 00:10:56,590 It gives me what apparently based on the syntax on line six. 208 00:10:56,590 --> 00:10:57,460 Any instinct? 209 00:10:57,460 --> 00:10:58,030 Yeah. 210 00:10:58,030 --> 00:10:58,740 AUDIENCE: A list. 211 00:10:58,740 --> 00:11:00,698 DAVID MALAN: It's giving me back indeed a list. 212 00:11:00,698 --> 00:11:02,880 And I presume the visual clue for you was the fact 213 00:11:02,880 --> 00:11:04,463 that we're using square brackets here. 214 00:11:04,463 --> 00:11:07,780 And indeed, row bracket 1 is going to be not the first 215 00:11:07,780 --> 00:11:10,060 but the second element in that list. 216 00:11:10,060 --> 00:11:11,280 And so just take a guess. 217 00:11:11,280 --> 00:11:14,400 When I run this code in a moment, what's going to get printed? 218 00:11:14,400 --> 00:11:16,950 The timestamp, the language, or the problem? 219 00:11:16,950 --> 00:11:20,892 220 00:11:20,892 --> 00:11:21,392 Yeah? 221 00:11:21,392 --> 00:11:22,350 AUDIENCE: The language. 222 00:11:22,350 --> 00:11:25,260 DAVID MALAN: The language because it's the second column that is 223 00:11:25,260 --> 00:11:27,848 in the file delimited by those commas. 224 00:11:27,848 --> 00:11:29,140 So let me go ahead and do this. 225 00:11:29,140 --> 00:11:30,630 Let me clear my terminal down here. 226 00:11:30,630 --> 00:11:33,570 Let me run Python of favorites.py and Enter. 227 00:11:33,570 --> 00:11:34,710 And there's everything. 228 00:11:34,710 --> 00:11:35,490 It was super fast. 229 00:11:35,490 --> 00:11:37,090 But there's a really long list here. 230 00:11:37,090 --> 00:11:41,130 And in fact, if I increase the size of my terminal and start scrolling up, 231 00:11:41,130 --> 00:11:43,090 you'll just see all of the raw data. 232 00:11:43,090 --> 00:11:44,490 Now, this isn't that useful yet. 233 00:11:44,490 --> 00:11:46,032 I could have just glanced at the CSV. 234 00:11:46,032 --> 00:11:49,320 But clearly now I have the ability to open the file, 235 00:11:49,320 --> 00:11:52,830 parse it, so to speak, that is break it up into its constituent parts, 236 00:11:52,830 --> 00:11:56,380 and do something with specific parts therein. 237 00:11:56,380 --> 00:11:56,880 All right. 238 00:11:56,880 --> 00:11:59,640 So if I want to do this a little more pleasantly though, 239 00:11:59,640 --> 00:12:02,460 let me at least make this semantically a little cleaner. 240 00:12:02,460 --> 00:12:05,650 And you know what, just for clarity, let me just give myself a variable. 241 00:12:05,650 --> 00:12:09,000 It's not strictly necessary, but I know that this is 242 00:12:09,000 --> 00:12:12,958 the favorite, for instance, language. 243 00:12:12,958 --> 00:12:14,250 So let's just call it favorite. 244 00:12:14,250 --> 00:12:16,410 Set it equal to row bracket 1. 245 00:12:16,410 --> 00:12:18,600 And now just to be more explicit in my code, 246 00:12:18,600 --> 00:12:21,030 even though again, we don't need the variable per se, 247 00:12:21,030 --> 00:12:22,840 this code is, of course, going to do the same thing. 248 00:12:22,840 --> 00:12:25,132 It's just using an additional variable called favorite. 249 00:12:25,132 --> 00:12:28,450 If I go down here, scroll up, run the program again, 250 00:12:28,450 --> 00:12:30,280 I get back to the exact same data. 251 00:12:30,280 --> 00:12:32,280 But this is a stepping stone to something 252 00:12:32,280 --> 00:12:34,800 that's even more powerful about Python support 253 00:12:34,800 --> 00:12:38,610 for CSV files is that you don't have to just treat the return 254 00:12:38,610 --> 00:12:43,500 value as a list with 0 and 1 and 2. 255 00:12:43,500 --> 00:12:48,540 So just thinking intuitively here, why is this maybe not the best design 256 00:12:48,540 --> 00:12:51,960 to hand you, the programmer, back the data in a list 257 00:12:51,960 --> 00:12:55,640 that's numerically indexed with 0, 1, 2? 258 00:12:55,640 --> 00:12:59,760 It clearly works, but critique this. 259 00:12:59,760 --> 00:13:00,890 What could go wrong? 260 00:13:00,890 --> 00:13:03,282 What's a little poorly designed? 261 00:13:03,282 --> 00:13:04,169 Yeah? 262 00:13:04,169 --> 00:13:06,586 AUDIENCE: You have to always remember what the things are, 263 00:13:06,586 --> 00:13:10,945 what the order is [INAUDIBLE] 264 00:13:10,945 --> 00:13:11,820 DAVID MALAN: Exactly. 265 00:13:11,820 --> 00:13:14,880 So yeah, so it's up to you to repeat-- 266 00:13:14,880 --> 00:13:18,360 it's up to you to remember what column the data is actually in. 267 00:13:18,360 --> 00:13:22,020 And God forbid you're collaborating with someone else on the spreadsheet. 268 00:13:22,020 --> 00:13:24,803 If you've used Google Spreadsheets, you can move columns around 269 00:13:24,803 --> 00:13:27,220 maybe just because you want to visually reorganize things. 270 00:13:27,220 --> 00:13:30,150 And if you do this and then someone else downloads that same data, 271 00:13:30,150 --> 00:13:31,840 all of their code is going to break. 272 00:13:31,840 --> 00:13:33,360 So that's just really bad design. 273 00:13:33,360 --> 00:13:36,390 It's fragile just because you're sort of on the honor system 274 00:13:36,390 --> 00:13:38,588 that one means the data that you want. 275 00:13:38,588 --> 00:13:41,130 So wouldn't it be nice if it could be a little more explicit? 276 00:13:41,130 --> 00:13:47,130 Well, recall that the very first line in this file is actually this. 277 00:13:47,130 --> 00:13:50,730 And I paused the output this time so that we can see more optionally. 278 00:13:50,730 --> 00:13:51,960 I just reran favorites.py. 279 00:13:51,960 --> 00:13:55,050 And notice one of these things is not like the other. 280 00:13:55,050 --> 00:13:59,070 Every output was either scratch or C or Python except for this first one. 281 00:13:59,070 --> 00:14:00,855 Why am I seeing the word language here? 282 00:14:00,855 --> 00:14:03,610 283 00:14:03,610 --> 00:14:05,860 Where did language come from? 284 00:14:05,860 --> 00:14:08,230 You didn't have the ability to manually input. 285 00:14:08,230 --> 00:14:09,372 No. 286 00:14:09,372 --> 00:14:10,330 Where did it come from? 287 00:14:10,330 --> 00:14:11,030 Yeah. 288 00:14:11,030 --> 00:14:11,850 AUDIENCE: That would be the header. 289 00:14:11,850 --> 00:14:12,630 DAVID MALAN: Yeah, the header. 290 00:14:12,630 --> 00:14:15,300 The very first row in the file, which by human convention 291 00:14:15,300 --> 00:14:18,210 generally just defines what the columns represent so that there's 292 00:14:18,210 --> 00:14:20,470 some human useful information there. 293 00:14:20,470 --> 00:14:24,400 Now, that's not really intended to be part of my output at the moment, 294 00:14:24,400 --> 00:14:25,710 so there is a way to skip this. 295 00:14:25,710 --> 00:14:29,380 If you want to skip the first row, you can actually do something like this. 296 00:14:29,380 --> 00:14:32,560 You can say next row, and that will just ignore that row. 297 00:14:32,560 --> 00:14:35,520 So then I'm starting really with the every row thereafter. 298 00:14:35,520 --> 00:14:37,995 But there's a better way to handle this than that. 299 00:14:37,995 --> 00:14:40,620 That will get rid of the row in the output, but let me go ahead 300 00:14:40,620 --> 00:14:42,888 and use a different feature of the CSV package 301 00:14:42,888 --> 00:14:45,430 that's just going to make this a little cleaner all together. 302 00:14:45,430 --> 00:14:47,140 So let me clear my terminal window here. 303 00:14:47,140 --> 00:14:49,830 Let me undo this next thing that I just added. 304 00:14:49,830 --> 00:14:54,690 And instead of using a reader, let me go ahead and use a dictionary reader, 305 00:14:54,690 --> 00:15:00,720 abbreviated dict reader, that's going to now return me the equivalent of all 306 00:15:00,720 --> 00:15:02,260 of the rows one at a time. 307 00:15:02,260 --> 00:15:04,720 So I can still call it reader just as before. 308 00:15:04,720 --> 00:15:08,340 But as the name implies, what this reader is going to return 309 00:15:08,340 --> 00:15:11,560 is not a list after list after list but a dictionary, 310 00:15:11,560 --> 00:15:13,140 a dictionary, a dictionary. 311 00:15:13,140 --> 00:15:16,210 And remember, a dictionary is just a collection of key value pairs. 312 00:15:16,210 --> 00:15:17,640 So what does that mean? 313 00:15:17,640 --> 00:15:18,390 What are the keys? 314 00:15:18,390 --> 00:15:19,480 What are the values? 315 00:15:19,480 --> 00:15:24,210 Well, now that I'm using a dictionary reader, I can actually do this. 316 00:15:24,210 --> 00:15:27,270 Instead of on the honor system remembering that I want column one, 317 00:15:27,270 --> 00:15:30,630 I can treat row now not as a list but as a dictionary. 318 00:15:30,630 --> 00:15:34,410 And that means I can go in here and say quote unquote "language." 319 00:15:34,410 --> 00:15:36,150 And we saw that back in week six. 320 00:15:36,150 --> 00:15:38,820 Python allows you to index into dictionaries 321 00:15:38,820 --> 00:15:42,480 using square bracket notation in strings or STRs on the inside, 322 00:15:42,480 --> 00:15:44,790 just like lists allow for numbers. 323 00:15:44,790 --> 00:15:48,180 But this now I think is going to be a little more robust. 324 00:15:48,180 --> 00:15:52,830 If I run this again, Python of favorites.py, of that worked out fine. 325 00:15:52,830 --> 00:15:56,580 And let me pause the output too by using this program called more. 326 00:15:56,580 --> 00:15:58,600 Now I don't even see the header. 327 00:15:58,600 --> 00:16:03,660 So now whoever works with Python wrote the code for this package 328 00:16:03,660 --> 00:16:06,600 to just analyze that first line of code, use the header 329 00:16:06,600 --> 00:16:10,230 as you just called it as the keys, and then every time you 330 00:16:10,230 --> 00:16:13,380 iterate through this loop, it updates the values, the values, the values, 331 00:16:13,380 --> 00:16:15,810 but the keys stay the same. 332 00:16:15,810 --> 00:16:20,320 Any questions then on this technique? 333 00:16:20,320 --> 00:16:24,206 Suffice it to say this would be painful in C. Yes? 334 00:16:24,206 --> 00:16:28,670 AUDIENCE: [INAUDIBLE] 335 00:16:28,670 --> 00:16:33,395 336 00:16:33,395 --> 00:16:34,270 DAVID MALAN: Exactly. 337 00:16:34,270 --> 00:16:39,130 So the keys are always going to be quote unquote "timestamp, language, 338 00:16:39,130 --> 00:16:40,130 and problem." 339 00:16:40,130 --> 00:16:43,330 But on each iteration of this loop here, the row 340 00:16:43,330 --> 00:16:46,600 is going to contain a different row of values, different row of values, 341 00:16:46,600 --> 00:16:47,600 different row of values. 342 00:16:47,600 --> 00:16:50,830 So you're going to get back one dictionary for every student who 343 00:16:50,830 --> 00:16:54,140 submitted the Google Form, if you will, while iterating through it there. 344 00:16:54,140 --> 00:16:54,640 All right. 345 00:16:54,640 --> 00:17:01,270 So once we have this ability here, why don't we go ahead and transition to 346 00:17:01,270 --> 00:17:03,850 how about not just using that dictionary reader, which 347 00:17:03,850 --> 00:17:05,890 it makes the code a little more robust. 348 00:17:05,890 --> 00:17:08,599 Because now if you move the columns around, no big deal. 349 00:17:08,599 --> 00:17:11,170 It doesn't matter if the numeric indices change. 350 00:17:11,170 --> 00:17:13,660 You can still use those keywords instead. 351 00:17:13,660 --> 00:17:15,460 But let's actually analyze the data now. 352 00:17:15,460 --> 00:17:18,619 I'm just spitting it out, which is not solving any problems for anyone. 353 00:17:18,619 --> 00:17:22,750 So let's go ahead and count the popularity of Scratch, C, and Python 354 00:17:22,750 --> 00:17:25,790 and see what everyone's been thinking here. 355 00:17:25,790 --> 00:17:26,290 All right. 356 00:17:26,290 --> 00:17:27,470 So how might I do this? 357 00:17:27,470 --> 00:17:29,890 Well, let me go ahead and do this up here. 358 00:17:29,890 --> 00:17:33,880 Before I start iterating, let me give myself, let's say, three variables. 359 00:17:33,880 --> 00:17:37,120 And to keep things simple, I'll say one variable called Scratch. 360 00:17:37,120 --> 00:17:40,270 Set it equal to 0 for 0 students so far. 361 00:17:40,270 --> 00:17:44,200 C is going to equal 0 and Python is going to equal 0. 362 00:17:44,200 --> 00:17:46,157 There's a slightly prettier way of doing this, 363 00:17:46,157 --> 00:17:49,240 just because this is like three lines of code to do something very simple. 364 00:17:49,240 --> 00:17:53,080 You could alternatively in Python but not C 365 00:17:53,080 --> 00:17:58,720 do Scratch comma C comma Python equals 0 comma 0. 366 00:17:58,720 --> 00:18:02,168 So kind of slightly more elegant just to fit it all into one line. 367 00:18:02,168 --> 00:18:04,210 But now let's just do something more interesting. 368 00:18:04,210 --> 00:18:06,160 On line seven, I'm still going to figure out 369 00:18:06,160 --> 00:18:08,320 what the current favorite language is. 370 00:18:08,320 --> 00:18:11,740 And now I'm just going to do some conditional checks. 371 00:18:11,740 --> 00:18:16,610 How about if that favorite equals equals quote unquote "Scratch," 372 00:18:16,610 --> 00:18:19,990 Let's go ahead and increment Scratch by 1. 373 00:18:19,990 --> 00:18:23,560 We can't do plus plus in Python, but we can do plus equals 1. 374 00:18:23,560 --> 00:18:30,700 How about elif if favorite equals equals C, then let's do C plus equals 1. 375 00:18:30,700 --> 00:18:32,063 We could do else. 376 00:18:32,063 --> 00:18:33,730 This is actually a good design question. 377 00:18:33,730 --> 00:18:34,840 Should I do else? 378 00:18:34,840 --> 00:18:36,880 Should I do elif? 379 00:18:36,880 --> 00:18:39,250 Any instincts here? 380 00:18:39,250 --> 00:18:40,824 Yeah. 381 00:18:40,824 --> 00:18:43,259 AUDIENCE: [INAUDIBLE] 382 00:18:43,259 --> 00:18:46,322 383 00:18:46,322 --> 00:18:48,030 DAVID MALAN: Yeah, really good instincts. 384 00:18:48,030 --> 00:18:50,370 Just in case someone goes and adds another language 385 00:18:50,370 --> 00:18:51,960 to the form next week because we're obviously 386 00:18:51,960 --> 00:18:53,370 going to introduce another language today, 387 00:18:53,370 --> 00:18:55,620 you don't want your code to now artificially inflate 388 00:18:55,620 --> 00:18:58,380 the scores for Python just because you're 389 00:18:58,380 --> 00:19:00,340 conflating multiple languages together. 390 00:19:00,340 --> 00:19:03,810 So the more defensive sort of better way to write this code, I agree, 391 00:19:03,810 --> 00:19:07,680 would be elif favorite equals equals Python. 392 00:19:07,680 --> 00:19:10,590 Then let's go ahead and increment Python plus equals 1. 393 00:19:10,590 --> 00:19:12,840 And if there's a new language next week, we're 394 00:19:12,840 --> 00:19:14,260 obviously going to have to update the code, 395 00:19:14,260 --> 00:19:15,630 but at least we're not miscounting. 396 00:19:15,630 --> 00:19:17,130 We're just missing the new language. 397 00:19:17,130 --> 00:19:19,210 So I think that's slightly more robust. 398 00:19:19,210 --> 00:19:22,290 All right, now at the very bottom of this program and outside of the loop 399 00:19:22,290 --> 00:19:25,680 when I'm all done counting, let me go ahead and print out 400 00:19:25,680 --> 00:19:27,060 using some f strings. 401 00:19:27,060 --> 00:19:31,450 How about the total number of people whose favorite is Scratch? 402 00:19:31,450 --> 00:19:34,890 So this is just week six f string syntax. 403 00:19:34,890 --> 00:19:38,940 Let me go ahead and print out another f string for C. And I'm, of course, 404 00:19:38,940 --> 00:19:41,880 putting the variables in curly braces, all lowercase, 405 00:19:41,880 --> 00:19:45,060 but the English words I'm doing capitalization for. 406 00:19:45,060 --> 00:19:49,470 Let's do a final one with f Python colon and then in curly braces 407 00:19:49,470 --> 00:19:51,000 Python close quote. 408 00:19:51,000 --> 00:19:52,105 And I think I'm done. 409 00:19:52,105 --> 00:19:53,980 So let me just hide my terminal for a second. 410 00:19:53,980 --> 00:19:55,290 Here's the total program. 411 00:19:55,290 --> 00:19:56,730 Same stuff as before. 412 00:19:56,730 --> 00:19:58,980 Open favorites.csv. 413 00:19:58,980 --> 00:20:02,880 Open it further with the dictionary reader to do that processing for us. 414 00:20:02,880 --> 00:20:06,330 Initialize three variables to 0 just so we have something to count with. 415 00:20:06,330 --> 00:20:09,300 And then iterate over the file row by row. 416 00:20:09,300 --> 00:20:12,120 And this is just some sort of week one style conditional logic, 417 00:20:12,120 --> 00:20:14,410 albeit in Python, counting things. 418 00:20:14,410 --> 00:20:14,910 All right. 419 00:20:14,910 --> 00:20:17,080 So how can we now execute this? 420 00:20:17,080 --> 00:20:18,450 Let me go back to my terminal. 421 00:20:18,450 --> 00:20:20,610 Python of favorites.py. 422 00:20:20,610 --> 00:20:22,050 And here we go. 423 00:20:22,050 --> 00:20:26,400 As of today, everyone who's reporting in live via the Google Form, 424 00:20:26,400 --> 00:20:30,190 their favorite languages are-- 425 00:20:30,190 --> 00:20:31,330 interesting. 426 00:20:31,330 --> 00:20:34,510 That's pretty interesting too after just one week of Python no less. 427 00:20:34,510 --> 00:20:36,310 But Scratch is a healthy contender there. 428 00:20:36,310 --> 00:20:38,210 A lot of C. So a pretty good mix here. 429 00:20:38,210 --> 00:20:43,127 So is this going to be the best way to write this program long term? 430 00:20:43,127 --> 00:20:45,460 Well, as you noted, if there's a new language next week, 431 00:20:45,460 --> 00:20:47,830 this week we're going to have to constantly update this. 432 00:20:47,830 --> 00:20:51,610 And here's where you should let your mind wander to the future. 433 00:20:51,610 --> 00:20:55,270 If we have a fourth language, fifth language, sixth, seventh, eighth, 434 00:20:55,270 --> 00:20:59,410 which aspects here might kind of have some code smell to it? 435 00:20:59,410 --> 00:21:04,310 This probably isn't the best design to set us up for the future. 436 00:21:04,310 --> 00:21:07,480 What might be better than this? 437 00:21:07,480 --> 00:21:08,417 Yeah. 438 00:21:08,417 --> 00:21:09,880 AUDIENCE: We need to add a language to line five. 439 00:21:09,880 --> 00:21:10,330 DAVID MALAN: Yeah. 440 00:21:10,330 --> 00:21:12,288 We have to keep adding a language to line five. 441 00:21:12,288 --> 00:21:13,340 And OK, not a big deal. 442 00:21:13,340 --> 00:21:16,690 We could add SQL today and maybe JavaScript next week. 443 00:21:16,690 --> 00:21:19,510 But any time a line of code, a line of logic, 444 00:21:19,510 --> 00:21:21,450 it's just going to grow out of control. 445 00:21:21,450 --> 00:21:23,950 We've had this chat a couple of times with different syntax. 446 00:21:23,950 --> 00:21:25,700 There's probably a better way than that. 447 00:21:25,700 --> 00:21:26,440 So let's do that. 448 00:21:26,440 --> 00:21:30,550 Instead of using these individual variables, we could maybe use a list, 449 00:21:30,550 --> 00:21:35,560 but a list would be a little confusing because what does bracket 0 mean? 450 00:21:35,560 --> 00:21:37,090 What is bracket 1, bracket 2? 451 00:21:37,090 --> 00:21:40,930 But a dictionary, recall, is this Swiss army knife of data structures 452 00:21:40,930 --> 00:21:44,630 whereby you can associate anything with anything else, keys with values. 453 00:21:44,630 --> 00:21:48,370 So I dare say a cleaner way to solve this problem that sets us up 454 00:21:48,370 --> 00:21:52,570 for less work or confusion later would be to create a new variable called 455 00:21:52,570 --> 00:21:55,000 counts, if that's what we're doing, counting things up, 456 00:21:55,000 --> 00:21:58,340 and just set it equal to an empty dictionary. 457 00:21:58,340 --> 00:22:01,675 And you can literally say dict with the open parenthesis, close parenthesis 458 00:22:01,675 --> 00:22:05,020 and nothing or the more Pythonic just use open 459 00:22:05,020 --> 00:22:07,000 and close curly braces with nothing inside. 460 00:22:07,000 --> 00:22:11,320 That gives me an empty dictionary just like square brackets gives me a list. 461 00:22:11,320 --> 00:22:13,990 Now, my logic down here has to change a little bit. 462 00:22:13,990 --> 00:22:19,150 But what's nice is I don't need one conditional for every language. 463 00:22:19,150 --> 00:22:22,240 Because again, if we have a fourth, a fifth, a sixth, that chunk of code 464 00:22:22,240 --> 00:22:24,580 is also going to grow a bit out of control too. 465 00:22:24,580 --> 00:22:26,570 So I can get rid of this here. 466 00:22:26,570 --> 00:22:29,180 And what I think I'm going to do is say this. 467 00:22:29,180 --> 00:22:32,380 Whatever the current favorite is from the current row in the file, 468 00:22:32,380 --> 00:22:38,420 why don't we go into our counts variable at that key? 469 00:22:38,420 --> 00:22:40,303 And again, favorite is a variable. 470 00:22:40,303 --> 00:22:41,720 It's not quote unquote "favorite." 471 00:22:41,720 --> 00:22:44,560 It's going to be Scratch or C or Python. 472 00:22:44,560 --> 00:22:47,350 And then why don't we go ahead and just increment whatever 473 00:22:47,350 --> 00:22:51,740 the value of that count is at that key? 474 00:22:51,740 --> 00:22:53,500 Now, this is technically buggy. 475 00:22:53,500 --> 00:22:57,140 We're really close, but there is a bug. 476 00:22:57,140 --> 00:23:00,350 Does anyone want to conjecture what the bug is? 477 00:23:00,350 --> 00:23:00,890 Yeah. 478 00:23:00,890 --> 00:23:04,215 AUDIENCE: [INAUDIBLE] 479 00:23:04,215 --> 00:23:08,490 480 00:23:08,490 --> 00:23:11,640 DAVID MALAN: A good question that answers my question nonetheless. 481 00:23:11,640 --> 00:23:12,685 So no. 482 00:23:12,685 --> 00:23:14,310 The magic you describe will not happen. 483 00:23:14,310 --> 00:23:17,490 And to repeat the hypothesis, will this automatically 484 00:23:17,490 --> 00:23:22,590 create a key for every language that we try 485 00:23:22,590 --> 00:23:24,090 plugging into those square brackets? 486 00:23:24,090 --> 00:23:25,140 Short answer no. 487 00:23:25,140 --> 00:23:27,750 Odds are this is going to create a key error, one of those 488 00:23:27,750 --> 00:23:29,640 traceback error messages that you've probably 489 00:23:29,640 --> 00:23:31,890 seen by now either in class or in problem sets 490 00:23:31,890 --> 00:23:37,410 whereby if Scratch hasn't appeared in the dictionary before or C or Python, 491 00:23:37,410 --> 00:23:40,330 then the dictionary has no clue what you're talking about. 492 00:23:40,330 --> 00:23:43,200 So I think we actually still need some conditional logic but not 493 00:23:43,200 --> 00:23:45,900 that's going to grow longer and longer with each language. 494 00:23:45,900 --> 00:23:47,860 What I think we probably want to do is this. 495 00:23:47,860 --> 00:23:52,530 If the current favorite is in the count dictionary, 496 00:23:52,530 --> 00:23:56,730 and this is the Pythonic way of just saying is this key in this dictionary, 497 00:23:56,730 --> 00:24:02,160 then go ahead and safely do counts favorite plus equals 1. 498 00:24:02,160 --> 00:24:05,250 Else, to your conjecture now, else what I want to do. 499 00:24:05,250 --> 00:24:10,330 Counts favorites equals, yeah, 1. 500 00:24:10,330 --> 00:24:14,380 So initialize a brand new key to a brand new value of 1 501 00:24:14,380 --> 00:24:16,630 because I'm obviously just seeing this language. 502 00:24:16,630 --> 00:24:19,190 Otherwise increment again and again. 503 00:24:19,190 --> 00:24:22,390 And now down here I just need to tweak my syntax a little bit. 504 00:24:22,390 --> 00:24:26,540 I don't need to print out all of these things one at a time manually. 505 00:24:26,540 --> 00:24:30,470 I can actually get away I think with another loop at the very bottom here. 506 00:24:30,470 --> 00:24:31,480 So how about I do this? 507 00:24:31,480 --> 00:24:34,790 For each favorite in those counts, and this is, 508 00:24:34,790 --> 00:24:39,340 again, the Pythonic way to iterate over all of the keys in a dictionary, 509 00:24:39,340 --> 00:24:45,220 go ahead and print out using an f string whatever the current favorite is, 510 00:24:45,220 --> 00:24:52,180 Scratch or C or Python, and then a colon and then figure out what its count is. 511 00:24:52,180 --> 00:24:54,610 And you can do that by going into the counts dictionary, 512 00:24:54,610 --> 00:24:58,910 looking at the favorite key, and get back its value. 513 00:24:58,910 --> 00:25:00,190 So I close my curly braces. 514 00:25:00,190 --> 00:25:01,060 I close my quotes. 515 00:25:01,060 --> 00:25:04,690 And even though this looks ugly at the moment, now this is much more dynamic. 516 00:25:04,690 --> 00:25:07,690 Because if we go and add SQL to the CSV file tomorrow 517 00:25:07,690 --> 00:25:10,720 or we add JavaScript next week, this will just work. 518 00:25:10,720 --> 00:25:12,760 It will keep working now automatically. 519 00:25:12,760 --> 00:25:15,140 All I change is the Google Form, not my actual code. 520 00:25:15,140 --> 00:25:15,640 All right. 521 00:25:15,640 --> 00:25:17,620 Let's try Python of favorites.py. 522 00:25:17,620 --> 00:25:19,150 Cross my fingers as always. 523 00:25:19,150 --> 00:25:23,020 And there now is the data as of now. 524 00:25:23,020 --> 00:25:25,270 Questions on this code here? 525 00:25:25,270 --> 00:25:26,195 Yeah. 526 00:25:26,195 --> 00:25:28,093 AUDIENCE: [INAUDIBLE] 527 00:25:28,093 --> 00:25:29,510 DAVID MALAN: Really good question. 528 00:25:29,510 --> 00:25:33,040 What if you wanted to print it in a particular order? 529 00:25:33,040 --> 00:25:35,530 Well, I could give you a couple of solutions. 530 00:25:35,530 --> 00:25:37,390 If you want to print it out in-- 531 00:25:37,390 --> 00:25:40,850 it's already coincidentally in alphabetical order. 532 00:25:40,850 --> 00:25:43,630 So you got that for free although that's just by chance here. 533 00:25:43,630 --> 00:25:45,010 But there is a way to do this. 534 00:25:45,010 --> 00:25:47,320 And let me propose that we go down here to my loop. 535 00:25:47,320 --> 00:25:50,320 And I explicitly use a function you might not have seen in Python yet, 536 00:25:50,320 --> 00:25:52,240 but it's literally called sorted, which is 537 00:25:52,240 --> 00:25:54,970 going to take either a list or, in this case, a dictionary 538 00:25:54,970 --> 00:25:58,270 and by default sort it by key alphabetically. 539 00:25:58,270 --> 00:26:00,907 Now, if my intuition is correct, this is not 540 00:26:00,907 --> 00:26:03,490 going to change the output, because it's already alphabetical. 541 00:26:03,490 --> 00:26:05,980 But if you read the documentation for the sorted function, 542 00:26:05,980 --> 00:26:08,800 it takes multiple parameters potentially, some of which 543 00:26:08,800 --> 00:26:10,000 are named parameters. 544 00:26:10,000 --> 00:26:11,870 And so you can actually do this. 545 00:26:11,870 --> 00:26:14,200 If you want to the counts but you want to reverse 546 00:26:14,200 --> 00:26:18,018 the order for whatever reason here so that it's reverse alphabetical order. 547 00:26:18,018 --> 00:26:19,810 Now let me go ahead and rerun this and I'll 548 00:26:19,810 --> 00:26:21,470 keep the previous output on the screen. 549 00:26:21,470 --> 00:26:27,000 Enter and now it's backwards alphabetically, if you will. 550 00:26:27,000 --> 00:26:29,130 Other questions on this here? 551 00:26:29,130 --> 00:26:31,770 552 00:26:31,770 --> 00:26:33,830 No? 553 00:26:33,830 --> 00:26:39,930 How about then we transition to changing sorting by value. 554 00:26:39,930 --> 00:26:42,770 And this is going to escalate a little quickly briefly 555 00:26:42,770 --> 00:26:44,540 but then we'll tone it down again. 556 00:26:44,540 --> 00:26:47,600 Notice that right now this is indeed sorting by key. 557 00:26:47,600 --> 00:26:49,400 What if, especially if I have lots of data, 558 00:26:49,400 --> 00:26:52,670 it'd be nice to make a top 10 list or, in this case, a top three list 559 00:26:52,670 --> 00:26:58,890 and actually see in order of the counts, the values what these popular ones are. 560 00:26:58,890 --> 00:27:00,620 So it's not C, Python, Scratch. 561 00:27:00,620 --> 00:27:03,770 It should ideally be Python, then C, then 562 00:27:03,770 --> 00:27:06,710 Scratch because of the values and the magnitude thereof. 563 00:27:06,710 --> 00:27:08,010 So how can I do this? 564 00:27:08,010 --> 00:27:12,170 Well, it turns out there is another key, another parameter that you 565 00:27:12,170 --> 00:27:16,550 can pass to the sorted function that is typically 566 00:27:16,550 --> 00:27:17,925 implemented as a function itself. 567 00:27:17,925 --> 00:27:19,633 And so I'm going to go ahead and do this. 568 00:27:19,633 --> 00:27:21,950 I'm going to temporarily define a function called 569 00:27:21,950 --> 00:27:24,230 get value just to make my life easier. 570 00:27:24,230 --> 00:27:29,390 And this get value function is going to take I'll say a language parameter. 571 00:27:29,390 --> 00:27:35,400 And then all I'm going to do is return whatever the count is of that language. 572 00:27:35,400 --> 00:27:37,850 So out of context, this is just a super simple function 573 00:27:37,850 --> 00:27:41,252 that you hand it a language like Scratch or C or Python, 574 00:27:41,252 --> 00:27:43,460 it's just going to tell you what the count is thereof 575 00:27:43,460 --> 00:27:46,400 in that dictionary called counts. 576 00:27:46,400 --> 00:27:51,440 But what I can do now down here in my newly introduced call to sort it is I 577 00:27:51,440 --> 00:27:54,380 can tell it what to use as its key. 578 00:27:54,380 --> 00:27:57,170 Instead of using literally the key, Scratch, C, 579 00:27:57,170 --> 00:28:01,040 Python, I can sort of override that behavior and say, you know what? 580 00:28:01,040 --> 00:28:05,570 To figure out what to sort by, go ahead and call this function 581 00:28:05,570 --> 00:28:07,890 called get value. 582 00:28:07,890 --> 00:28:11,270 Notice that I have not put parentheses after get value 583 00:28:11,270 --> 00:28:14,420 because I don't want to call get value right then and there. 584 00:28:14,420 --> 00:28:20,737 I want to pass the get value function as itself in argument to the sorted 585 00:28:20,737 --> 00:28:23,570 function so that the sorted function written years ago by the people 586 00:28:23,570 --> 00:28:27,320 at Python can call my version of get value again and again 587 00:28:27,320 --> 00:28:30,300 and again when they try to sort this actual data. 588 00:28:30,300 --> 00:28:36,350 So now if I add that and I leave reverse equals true, let's see what happens. 589 00:28:36,350 --> 00:28:38,240 Python of favorites.py. 590 00:28:38,240 --> 00:28:38,810 Enter. 591 00:28:38,810 --> 00:28:42,410 And now I get my top 10, or in this case, top three list. 592 00:28:42,410 --> 00:28:45,290 And if I had more sophisticated data with more columns 593 00:28:45,290 --> 00:28:47,150 all together that I actually care about, I 594 00:28:47,150 --> 00:28:50,758 could even sort this more powerfully as well. 595 00:28:50,758 --> 00:28:52,550 But let me clean this up a little bit, just 596 00:28:52,550 --> 00:28:54,592 so you've seen it, even though we won't use these 597 00:28:54,592 --> 00:28:59,010 that often in CS50 until the end of the class will they come up again. 598 00:28:59,010 --> 00:29:00,740 Technically this is a little bit-- 599 00:29:00,740 --> 00:29:03,620 this isn't necessarily the best design to spend all this time 600 00:29:03,620 --> 00:29:06,680 implementing a function and then only use it in one place. 601 00:29:06,680 --> 00:29:09,430 In general, we've argued that you don't necessarily 602 00:29:09,430 --> 00:29:11,930 need a variable if you're only going to use it in one place. 603 00:29:11,930 --> 00:29:13,430 You don't really need a function if you're only 604 00:29:13,430 --> 00:29:14,638 going to use it in one place. 605 00:29:14,638 --> 00:29:17,160 And here we kind of have a good candidate for that. 606 00:29:17,160 --> 00:29:19,400 And so it turns out in Python if you don't 607 00:29:19,400 --> 00:29:23,240 want to bother creating a function just to use it once, 608 00:29:23,240 --> 00:29:26,810 you can create what's called an anonymous function, a.k.a. 609 00:29:26,810 --> 00:29:29,900 a lambda function, like the lambda symbol familiar. 610 00:29:29,900 --> 00:29:33,380 And a lambda function, the syntax is a little strange looking, 611 00:29:33,380 --> 00:29:34,490 but you say this. 612 00:29:34,490 --> 00:29:36,080 You literally say lambda. 613 00:29:36,080 --> 00:29:38,660 You literally then say the name of the argument 614 00:29:38,660 --> 00:29:42,470 that you want this anonymous function with no name to take. 615 00:29:42,470 --> 00:29:44,430 Then you have a colon. 616 00:29:44,430 --> 00:29:47,930 And then quite simply, you write what you want the return 617 00:29:47,930 --> 00:29:49,850 value of this function to be. 618 00:29:49,850 --> 00:29:51,800 You don't even say return literally. 619 00:29:51,800 --> 00:29:53,720 These lambda functions are meant to be used 620 00:29:53,720 --> 00:30:00,680 super tersely so that you can in one line express something like this. 621 00:30:00,680 --> 00:30:03,720 And I admit this looks more cryptic, I think, than the previous version. 622 00:30:03,720 --> 00:30:06,512 But as you get more comfortable with Python or other languages that 623 00:30:06,512 --> 00:30:10,370 support this feature, it allows you to not bother with lines of code like that 624 00:30:10,370 --> 00:30:12,690 and just tighten up your code a little bit. 625 00:30:12,690 --> 00:30:16,670 So this line here, lambda language colon counts language, 626 00:30:16,670 --> 00:30:18,927 is the one line version of this. 627 00:30:18,927 --> 00:30:21,260 And you don't even need to bother picking a name for it. 628 00:30:21,260 --> 00:30:23,840 Lambda tells Python I didn't waste any time 629 00:30:23,840 --> 00:30:26,440 thinking of a name for this function. 630 00:30:26,440 --> 00:30:35,310 So questions then on this technique of using Python to analyze data like this? 631 00:30:35,310 --> 00:30:37,900 Any questions? 632 00:30:37,900 --> 00:30:39,860 We're almost done with Python. 633 00:30:39,860 --> 00:30:42,070 Questions? 634 00:30:42,070 --> 00:30:42,640 No? 635 00:30:42,640 --> 00:30:43,390 OK. 636 00:30:43,390 --> 00:30:45,790 So why don't we make things a little more interesting? 637 00:30:45,790 --> 00:30:48,620 Because we had a much juicier data set with the problems 638 00:30:48,620 --> 00:30:50,620 that we've assigned over the past several weeks. 639 00:30:50,620 --> 00:30:54,970 Why don't we go ahead and, quite simply, I think 640 00:30:54,970 --> 00:30:57,290 we wrote pretty darn good code here. 641 00:30:57,290 --> 00:31:02,590 So I think we can pretty much just change a bit of it to say, 642 00:31:02,590 --> 00:31:05,720 let's see, if I don't want language, I want problem. 643 00:31:05,720 --> 00:31:10,510 And if I want to sort by not language but problem, I think that's it. 644 00:31:10,510 --> 00:31:12,580 I think if I didn't overlook something here 645 00:31:12,580 --> 00:31:16,210 just by changing what column I'm reading the data from and then 646 00:31:16,210 --> 00:31:18,550 just to be consistent renaming my variables 647 00:31:18,550 --> 00:31:21,310 just so I know what I'm looking at, what will this program now 648 00:31:21,310 --> 00:31:25,470 do after those minor changes? 649 00:31:25,470 --> 00:31:26,970 What will I see when I run this? 650 00:31:26,970 --> 00:31:29,693 651 00:31:29,693 --> 00:31:31,860 What would be the first thing I see when I run this? 652 00:31:31,860 --> 00:31:34,950 653 00:31:34,950 --> 00:31:35,970 Tough crowd today. 654 00:31:35,970 --> 00:31:36,720 Yes. 655 00:31:36,720 --> 00:31:37,637 AUDIENCE: The problem. 656 00:31:37,637 --> 00:31:39,110 DAVID MALAN: Yeah. 657 00:31:39,110 --> 00:31:40,330 The top problem. 658 00:31:40,330 --> 00:31:42,627 So the most popular problem, which I'm a little worried 659 00:31:42,627 --> 00:31:45,210 it might be hello or just Scratch, but let's go ahead and see. 660 00:31:45,210 --> 00:31:47,700 So let me go ahead and open my terminal window. 661 00:31:47,700 --> 00:31:50,670 I'll even maximize my terminal window so we can see a lot. 662 00:31:50,670 --> 00:31:54,167 Let me go ahead and run Python of favorites.py. 663 00:31:54,167 --> 00:31:57,000 I'm going to go ahead now and cross my fingers that I didn't mess up 664 00:31:57,000 --> 00:31:57,660 and hit Enter. 665 00:31:57,660 --> 00:31:59,550 And OK, great. 666 00:31:59,550 --> 00:32:00,480 We peaked early. 667 00:32:00,480 --> 00:32:03,840 So Scratch was the most popular program according 668 00:32:03,840 --> 00:32:06,460 to the data at the time I downloaded it. 669 00:32:06,460 --> 00:32:08,520 I'm sure other votes have come in since. 670 00:32:08,520 --> 00:32:13,080 Filter in week four was tied then with Tideman as well. 671 00:32:13,080 --> 00:32:16,480 Mario is a close third there and so forth. 672 00:32:16,480 --> 00:32:19,540 So this is helpful for us on staff that not so much love 673 00:32:19,540 --> 00:32:21,040 down here at the bottom of the list. 674 00:32:21,040 --> 00:32:23,400 So it was a bunch of code to write, but now 675 00:32:23,400 --> 00:32:26,250 that we've written it in this very versatile dynamic way, 676 00:32:26,250 --> 00:32:29,280 it's pretty good for just like crunching data and doing some analytics. 677 00:32:29,280 --> 00:32:33,480 But it's still a decent number of lines to have had to write manually. 678 00:32:33,480 --> 00:32:41,820 And this is where sometimes it isn't necessarily the right tool for the job, 679 00:32:41,820 --> 00:32:44,520 but rather a candidate for using some other language altogether, 680 00:32:44,520 --> 00:32:47,428 especially when it's not just a one time program that you run 681 00:32:47,428 --> 00:32:48,720 and you want to see the answer. 682 00:32:48,720 --> 00:32:53,130 What if you want to take input from the user and answer questions dynamically 683 00:32:53,130 --> 00:32:55,050 like a mobile app would, like a website would, 684 00:32:55,050 --> 00:32:58,740 like Microsoft Excel or Apple Numbers or Google Sheets would for you. 685 00:32:58,740 --> 00:33:02,430 Well, let's make one final change for now to this version of the program 686 00:33:02,430 --> 00:33:05,020 and actually take in some user input. 687 00:33:05,020 --> 00:33:08,170 So besides just loading all of the data into memory, 688 00:33:08,170 --> 00:33:13,290 let's go ahead and down below here not just print out the top 10 689 00:33:13,290 --> 00:33:16,660 list, if you will, but prompt the user for their favorites. 690 00:33:16,660 --> 00:33:18,660 I'm going to use Python's input function and I'm 691 00:33:18,660 --> 00:33:21,840 going to prompt them with "favorite," quote unquote, like tell me what 692 00:33:21,840 --> 00:33:23,370 your favorite problem-- 693 00:33:23,370 --> 00:33:27,660 what problem rather you are interested in. 694 00:33:27,660 --> 00:33:33,267 And now let me go ahead and say if that favorite is in the count's variable, 695 00:33:33,267 --> 00:33:35,850 so you didn't type in something random that we didn't actually 696 00:33:35,850 --> 00:33:37,890 assign as a problem, then let me go ahead 697 00:33:37,890 --> 00:33:44,130 and print with a format string whatever that favorite is of yours and show you 698 00:33:44,130 --> 00:33:47,040 the actual popularity thereof by indexing 699 00:33:47,040 --> 00:33:51,370 into counts using that favorite as the key and printing this. 700 00:33:51,370 --> 00:33:52,620 So now it's a dynamic program. 701 00:33:52,620 --> 00:33:55,170 It doesn't dump all of the data and all of the summations. 702 00:33:55,170 --> 00:33:59,100 Rather it's going to allow me to see what my choice of favorite is. 703 00:33:59,100 --> 00:34:03,940 And I'm going to go ahead and say, let's see, I'm a fan of Mario here. 704 00:34:03,940 --> 00:34:04,950 So Enter. 705 00:34:04,950 --> 00:34:08,880 And indeed, we see the same value we saw a moment ago but just for Mario. 706 00:34:08,880 --> 00:34:11,880 But the point now is that, one, all of this is possible. 707 00:34:11,880 --> 00:34:15,630 Two, it's way easier and more pleasant than this would have been in C. 708 00:34:15,630 --> 00:34:17,280 This is still only 15 lines of code. 709 00:34:17,280 --> 00:34:19,238 And in C, again, there's the memory management. 710 00:34:19,238 --> 00:34:21,947 There's the iterating over the strings trying to find the commas. 711 00:34:21,947 --> 00:34:23,190 There's just a lot more work. 712 00:34:23,190 --> 00:34:26,489 But honestly even when you just want to answer a question like this 713 00:34:26,489 --> 00:34:29,067 in Excel and Apple Numbers, Google Sheets, generally, you 714 00:34:29,067 --> 00:34:30,150 can just highlight things. 715 00:34:30,150 --> 00:34:34,500 You can click a button and boom, you get your answer for summation or max or min 716 00:34:34,500 --> 00:34:35,909 or any of those sort of basics. 717 00:34:35,909 --> 00:34:39,270 Wouldn't it be nice if we weren't taking a step backwards as programmers 718 00:34:39,270 --> 00:34:42,880 and being more powerful and yet we now have to do more of the work? 719 00:34:42,880 --> 00:34:46,230 So maybe sometimes Python's not or any language 720 00:34:46,230 --> 00:34:48,370 is not the best tool for the job. 721 00:34:48,370 --> 00:34:51,960 And that's going to now allow us to introduce more generally something 722 00:34:51,960 --> 00:34:53,670 called a relational database. 723 00:34:53,670 --> 00:34:57,137 Graduating from mere flat file databases like text files 724 00:34:57,137 --> 00:34:58,845 or binary files in which all of your data 725 00:34:58,845 --> 00:35:00,570 is stored to something more proper. 726 00:35:00,570 --> 00:35:02,566 But first, questions. 727 00:35:02,566 --> 00:35:08,518 AUDIENCE: [INAUDIBLE] 728 00:35:08,518 --> 00:35:12,943 729 00:35:12,943 --> 00:35:14,360 DAVID MALAN: Really good question. 730 00:35:14,360 --> 00:35:17,670 To reiterate, if I were-- is this case sensitive? 731 00:35:17,670 --> 00:35:21,470 So if I were to type in Mario in all lowercase and hit Enter, 732 00:35:21,470 --> 00:35:23,450 I actually get no such response. 733 00:35:23,450 --> 00:35:26,720 Now that might be acceptable, because the problem technically is a capital M. 734 00:35:26,720 --> 00:35:30,840 But that's a little ridiculous to be that pedantic about the input. 735 00:35:30,840 --> 00:35:34,250 So how could we solve this? 736 00:35:34,250 --> 00:35:37,010 Any tips for how we can make this a little more robust? 737 00:35:37,010 --> 00:35:38,378 Yeah. 738 00:35:38,378 --> 00:35:42,290 AUDIENCE: [INAUDIBLE] 739 00:35:42,290 --> 00:35:45,848 740 00:35:45,848 --> 00:35:46,765 DAVID MALAN: OK, yeah. 741 00:35:46,765 --> 00:35:49,400 742 00:35:49,400 --> 00:35:49,900 Yeah. 743 00:35:49,900 --> 00:35:51,700 So we could use a few different functions, one of which 744 00:35:51,700 --> 00:35:54,970 is called title, which will change it to title case where it capitalizes, 745 00:35:54,970 --> 00:35:58,390 like in most English sentences, the first letter of that sentence. 746 00:35:58,390 --> 00:35:59,740 We could use capitalize. 747 00:35:59,740 --> 00:36:00,940 We could use upper. 748 00:36:00,940 --> 00:36:01,990 We could use lower. 749 00:36:01,990 --> 00:36:06,435 But indeed, we could just decide how we want to standardize the capitalization. 750 00:36:06,435 --> 00:36:08,810 Either uppercase, lowercase, or some combination thereof. 751 00:36:08,810 --> 00:36:11,770 And just make sure that you change the counts themselves. 752 00:36:11,770 --> 00:36:13,930 Make sure that you do the same to favorite 753 00:36:13,930 --> 00:36:16,420 and make sure that maybe you keep a backup of the data 754 00:36:16,420 --> 00:36:19,780 if you want to show the original version that came from the CSV 755 00:36:19,780 --> 00:36:23,205 without presuming to just capitalize everything for the user. 756 00:36:23,205 --> 00:36:25,330 But indeed, that would be the most common scenario. 757 00:36:25,330 --> 00:36:28,540 You just make things case insensitive when doing those matches. 758 00:36:28,540 --> 00:36:36,910 Other questions now on Python before we leave it behind for the coming week? 759 00:36:36,910 --> 00:36:39,880 All right, well then let's introduce these relational databases. 760 00:36:39,880 --> 00:36:43,060 So relational database is what every-- 761 00:36:43,060 --> 00:36:45,572 it's a super popular way of storing lots of data. 762 00:36:45,572 --> 00:36:48,280 This is what the Twitters of the world, the Googles of the world, 763 00:36:48,280 --> 00:36:52,390 the Metas of the world use to store some of their data at scale. 764 00:36:52,390 --> 00:36:55,492 There are alternatives to relational databases. 765 00:36:55,492 --> 00:36:57,700 Indeed, today we'll talk about a language called SQL. 766 00:36:57,700 --> 00:37:00,408 There's also a movement, if you will, or an alternative generally 767 00:37:00,408 --> 00:37:02,330 called NoSQL, which is just the opposite. 768 00:37:02,330 --> 00:37:03,280 You don't use SQL. 769 00:37:03,280 --> 00:37:06,020 There are things called object oriented databases and the like. 770 00:37:06,020 --> 00:37:11,110 But if you've ever heard of MySQL or PostgreSQL or Microsoft SQL 771 00:37:11,110 --> 00:37:15,580 Server or Oracle or MariaDB or a bunch of other products, 772 00:37:15,580 --> 00:37:18,850 both free and commercial, this is what they're talking about. 773 00:37:18,850 --> 00:37:21,560 Databases that are designed to store lots of data. 774 00:37:21,560 --> 00:37:23,710 And what's nice about relational databases 775 00:37:23,710 --> 00:37:26,530 is that they're really similar to the spreadsheets with which you 776 00:37:26,530 --> 00:37:29,800 were presumably familiar long before today's class. 777 00:37:29,800 --> 00:37:32,530 So a relational database is going to store, 778 00:37:32,530 --> 00:37:35,452 as you'll see, all of the data in rows and columns. 779 00:37:35,452 --> 00:37:37,910 Now, the terminology will thereafter be a little different. 780 00:37:37,910 --> 00:37:40,570 Instead of having sheets, you're going to have tables. 781 00:37:40,570 --> 00:37:43,240 But those tables are still going to have rows and columns. 782 00:37:43,240 --> 00:37:46,330 And you're going to have even more control over the performance 783 00:37:46,330 --> 00:37:51,970 of your data when you start to access it using this Structured Query Language 784 00:37:51,970 --> 00:37:53,050 or SQL. 785 00:37:53,050 --> 00:37:56,440 This is a language you can use for web apps, mobile apps. 786 00:37:56,440 --> 00:37:58,990 A lot of analysts would sit down at their Mac or PC 787 00:37:58,990 --> 00:38:01,810 and actually ask questions of data to get back the answer. 788 00:38:01,810 --> 00:38:05,050 And wonderfully, even though there will be some new syntax today, 789 00:38:05,050 --> 00:38:07,570 SQL really just does four basic things. 790 00:38:07,570 --> 00:38:10,030 CRUD is the sort of crude acronym here. 791 00:38:10,030 --> 00:38:14,470 CRUD is a way of remembering that a relational database supports 792 00:38:14,470 --> 00:38:20,390 ultimately creating data, reading data, updating data, and deleting data. 793 00:38:20,390 --> 00:38:22,240 So even if you're feeling like, wow, this 794 00:38:22,240 --> 00:38:25,510 is a lot of new syntax, which it isn't relative to our past languages, 795 00:38:25,510 --> 00:38:29,650 the only things you're doing really are creating data, reading data, updating, 796 00:38:29,650 --> 00:38:30,910 and deleting the same. 797 00:38:30,910 --> 00:38:36,940 Now a little confusingly in SQL, the corresponding functions or commands 798 00:38:36,940 --> 00:38:41,150 that exist that map to CRUD are actually this. 799 00:38:41,150 --> 00:38:44,290 So it's still create, but there's another one called insert. 800 00:38:44,290 --> 00:38:48,250 It's not read, which is more of the computer scientist way of saying it, 801 00:38:48,250 --> 00:38:50,380 but select, which is a little more explicit. 802 00:38:50,380 --> 00:38:52,030 Like select data you care about. 803 00:38:52,030 --> 00:38:53,320 Update is still update. 804 00:38:53,320 --> 00:38:54,310 Delete is still delete. 805 00:38:54,310 --> 00:38:57,910 But there's another command called drop, which lets you drop, that is delete, 806 00:38:57,910 --> 00:38:59,930 entire tables as well. 807 00:38:59,930 --> 00:39:03,280 So you can create tables using syntax that's 808 00:39:03,280 --> 00:39:05,480 generally going to look like this. 809 00:39:05,480 --> 00:39:06,910 You'll say create table. 810 00:39:06,910 --> 00:39:10,210 You'll give the name of the table, which you can call most 811 00:39:10,210 --> 00:39:14,260 anything you want, but generally all lowercase, no spaces is best. 812 00:39:14,260 --> 00:39:18,550 Then in parentheses, you can specify a comma separated list of the columns 813 00:39:18,550 --> 00:39:20,660 that you might want in this table. 814 00:39:20,660 --> 00:39:24,070 So this is the code equivalent in the SQL language 815 00:39:24,070 --> 00:39:27,220 of manually opening Google Sheets or Excel or Numbers 816 00:39:27,220 --> 00:39:30,490 and clicking in the top left cell and like typing timestamp 817 00:39:30,490 --> 00:39:34,180 and then in the next typing language and then the third typing problem. 818 00:39:34,180 --> 00:39:37,750 This is the way to define what your headers are, if you will, 819 00:39:37,750 --> 00:39:38,500 in a spreadsheet. 820 00:39:38,500 --> 00:39:39,790 But now it's called a table. 821 00:39:39,790 --> 00:39:42,100 Now, we won't use this command manually first. 822 00:39:42,100 --> 00:39:44,050 Let's do something a little simpler. 823 00:39:44,050 --> 00:39:47,590 We're going to start off by just importing this data ourselves. 824 00:39:47,590 --> 00:39:49,340 And I'm going to go ahead and do this. 825 00:39:49,340 --> 00:39:51,160 Let me go back to VS Code here. 826 00:39:51,160 --> 00:39:54,040 I'm going to leave behind favorites.py for now, 827 00:39:54,040 --> 00:39:57,100 because now we're going to transition to this other language called SQL. 828 00:39:57,100 --> 00:40:02,680 And to do this, I am going to create a new database file. 829 00:40:02,680 --> 00:40:06,790 And I'm going to do so using a command called sqlite3, 830 00:40:06,790 --> 00:40:09,160 which is just the third version thereof, and I'm 831 00:40:09,160 --> 00:40:12,110 going to give the database a name of favorites.db. 832 00:40:12,110 --> 00:40:14,860 There's different conventions, but this is one of the most common. 833 00:40:14,860 --> 00:40:18,610 When I hit Enter, this is going to create for me a new empty database just 834 00:40:18,610 --> 00:40:23,080 like opening an untitled spreadsheet in Excel, Google Sheets, or Apple Numbers. 835 00:40:23,080 --> 00:40:25,720 I'm being prompted do I want to create favorites.db. 836 00:40:25,720 --> 00:40:26,890 I'll hit Y for Yes. 837 00:40:26,890 --> 00:40:28,562 OK, we're up and running. 838 00:40:28,562 --> 00:40:30,520 Now, you're going to notice a different prompt. 839 00:40:30,520 --> 00:40:34,360 I'm not in my Linux prompt per se, which is always the dollar sign. 840 00:40:34,360 --> 00:40:37,090 I'm now inside of the program called SQLite. 841 00:40:37,090 --> 00:40:41,500 And we're going to use SQLite, SQLite3, as just an interactive way for now 842 00:40:41,500 --> 00:40:43,180 of playing with SQL code. 843 00:40:43,180 --> 00:40:48,550 At the end of today, we'll show you how you can use SQL in Python code 844 00:40:48,550 --> 00:40:51,050 so that you still write Python code to do whatever you want, 845 00:40:51,050 --> 00:40:53,170 but you can talk to databases using Python. 846 00:40:53,170 --> 00:40:55,630 And this is exactly how web apps, mobile apps work. 847 00:40:55,630 --> 00:40:59,140 For instance, on iOS, on an iPhone, an iPad, or the like, 848 00:40:59,140 --> 00:41:03,010 if you want to store data, it's very often stored in a SQL database, 849 00:41:03,010 --> 00:41:04,580 as we're about to do. 850 00:41:04,580 --> 00:41:07,398 But you might use a language called SWIFT or Objective C. 851 00:41:07,398 --> 00:41:09,190 And the same exists in the world of Android 852 00:41:09,190 --> 00:41:12,830 using Java or Kotlin or something else to query the database. 853 00:41:12,830 --> 00:41:16,270 So we're going to see SQL in isolation for now like an analyst might just 854 00:41:16,270 --> 00:41:19,880 use with their Mac or PC, but we're going to tie it together by day's end. 855 00:41:19,880 --> 00:41:28,180 So at this terminal SQLite, let me go ahead and execute this command first. 856 00:41:28,180 --> 00:41:30,200 I'm going to first put SQLite into CSV mode, 857 00:41:30,200 --> 00:41:32,200 because I'm going to cut some corners initially. 858 00:41:32,200 --> 00:41:34,600 And I'm just going to automatically import 859 00:41:34,600 --> 00:41:38,560 all of the data that was submitted via the Google Form, which I exported 860 00:41:38,560 --> 00:41:41,410 as a CSV and uploaded to my code space. 861 00:41:41,410 --> 00:41:45,790 And I'm just going to automatically say turn this CSV file into a SQL 862 00:41:45,790 --> 00:41:48,040 database for me just so I don't have to figure out 863 00:41:48,040 --> 00:41:49,760 what those create table commands are. 864 00:41:49,760 --> 00:41:53,200 So to do this, I'm going to say mode csv so 865 00:41:53,200 --> 00:41:57,850 that SQLite knows that this is the command, knows that this is a CSV file. 866 00:41:57,850 --> 00:41:59,290 It's literally .mode. 867 00:41:59,290 --> 00:42:01,480 So the dot comes before the keyword there. 868 00:42:01,480 --> 00:42:05,770 And now I'm going to say .import and then the name of the file I want 869 00:42:05,770 --> 00:42:08,020 to import, which is favorites.csv. 870 00:42:08,020 --> 00:42:12,635 And now the name of the table that I want to create with that data. 871 00:42:12,635 --> 00:42:15,010 And just for consistency, I'm going to call it favorites. 872 00:42:15,010 --> 00:42:17,270 I could change these things to be anything I want, 873 00:42:17,270 --> 00:42:18,530 but I'm going to do that. 874 00:42:18,530 --> 00:42:20,860 And voila, nothing seems to have happened. 875 00:42:20,860 --> 00:42:24,503 But just like in C and in Python and Linux when nothing seems to happen, 876 00:42:24,503 --> 00:42:25,670 that's usually a good thing. 877 00:42:25,670 --> 00:42:27,590 It means I didn't mess up. 878 00:42:27,590 --> 00:42:31,120 So if I want to see what just happened, there's this other command. 879 00:42:31,120 --> 00:42:33,430 And these commands that start with dots, these 880 00:42:33,430 --> 00:42:37,690 are SQLite specific, which is indeed a lightweight version of SQL. 881 00:42:37,690 --> 00:42:39,010 They're not SQL, per se. 882 00:42:39,010 --> 00:42:41,260 So if you're using Oracle or something else like that, 883 00:42:41,260 --> 00:42:43,135 you're not going to use these exact commands. 884 00:42:43,135 --> 00:42:45,160 You'll see the ones we use in just a moment. 885 00:42:45,160 --> 00:42:46,660 And here's the first. 886 00:42:46,660 --> 00:42:51,230 When I type .schema, the schema of a database is the design of the database. 887 00:42:51,230 --> 00:42:52,120 What are the tables? 888 00:42:52,120 --> 00:42:53,900 What are the columns and all of that? 889 00:42:53,900 --> 00:42:58,390 So when I type .schema, this actually in this case shows me the create table 890 00:42:58,390 --> 00:43:03,700 command that was automatically drawn for me by just doing this import line. 891 00:43:03,700 --> 00:43:07,450 Once I get more comfortable with SQL, I could literally type this out myself 892 00:43:07,450 --> 00:43:09,560 or use some program to generate that as well. 893 00:43:09,560 --> 00:43:11,290 But what it's creating for me is this. 894 00:43:11,290 --> 00:43:15,580 Create table if it doesn't exist, even though it's more terse than that. 895 00:43:15,580 --> 00:43:18,130 I want to create a table called favorites. 896 00:43:18,130 --> 00:43:20,590 And then the columns for that table are going 897 00:43:20,590 --> 00:43:24,880 to be timestamp, which is going to be text, comma, language, 898 00:43:24,880 --> 00:43:27,248 which is also going to be text, comma, problem, 899 00:43:27,248 --> 00:43:28,540 which is also going to be text. 900 00:43:28,540 --> 00:43:34,330 That was just inferred very trivially by the .import command to just figure out 901 00:43:34,330 --> 00:43:38,680 that, yes, just give me a three column database table based on the Google 902 00:43:38,680 --> 00:43:40,390 Form. 903 00:43:40,390 --> 00:43:41,740 Questions on this? 904 00:43:41,740 --> 00:43:44,140 These are commands you run once to get up and running. 905 00:43:44,140 --> 00:43:47,260 You don't run these commands frequently, but we have them 906 00:43:47,260 --> 00:43:49,280 on the slide just for reference. 907 00:43:49,280 --> 00:43:49,780 All right. 908 00:43:49,780 --> 00:43:52,010 So now let's do something a little more interesting. 909 00:43:52,010 --> 00:43:55,330 I'm going to clear my SQLite terminal here, but I'm still in SQLite. 910 00:43:55,330 --> 00:43:58,570 I'm going to now use some of my first SQL commands, which 911 00:43:58,570 --> 00:44:02,440 recall were among them select. 912 00:44:02,440 --> 00:44:05,522 So CRUD, C-R-U-D. The R was select. 913 00:44:05,522 --> 00:44:07,480 This is maybe the most common, the most useful, 914 00:44:07,480 --> 00:44:10,330 the most powerful thing to use with a SQL database 915 00:44:10,330 --> 00:44:13,000 selecting data to answer questions akin to the ones we 916 00:44:13,000 --> 00:44:14,980 were trying to answer with Python. 917 00:44:14,980 --> 00:44:16,270 This is the general syntax. 918 00:44:16,270 --> 00:44:19,900 Any time you want to select data from a SQL database, you literally say select. 919 00:44:19,900 --> 00:44:24,730 You then specify the column or columns that you want to select data from. 920 00:44:24,730 --> 00:44:28,450 You literally write the word from and then you specify the name of the table. 921 00:44:28,450 --> 00:44:32,030 You want to get that data from semicolon, in this case. 922 00:44:32,030 --> 00:44:35,200 Everything that's in capitals here is a SQL keyword. 923 00:44:35,200 --> 00:44:37,750 Strictly speaking, you don't have to capitalize things, 924 00:44:37,750 --> 00:44:40,150 but we would encourage you to do so stylistically. 925 00:44:40,150 --> 00:44:43,030 And especially as you're learning and even as you're writing it, 926 00:44:43,030 --> 00:44:47,530 it just helps to distinguish SQL from words you chose, like the names 927 00:44:47,530 --> 00:44:49,460 of the columns and the data therein. 928 00:44:49,460 --> 00:44:52,700 So do adopt early on this convention. 929 00:44:52,700 --> 00:44:56,240 So let me go back now to my code space here. 930 00:44:56,240 --> 00:44:59,012 I'm running my terminal window with SQLite3 inside of it. 931 00:44:59,012 --> 00:45:00,970 Suppose that I just want to get all of the data 932 00:45:00,970 --> 00:45:04,720 from the favorites table, which was automatically imported. 933 00:45:04,720 --> 00:45:05,480 Let's do this. 934 00:45:05,480 --> 00:45:06,220 Select. 935 00:45:06,220 --> 00:45:07,630 I want everything. 936 00:45:07,630 --> 00:45:13,090 Well, I can do timestamp comma language comma problem. 937 00:45:13,090 --> 00:45:13,840 But you know what? 938 00:45:13,840 --> 00:45:15,850 Here's a convenience already. 939 00:45:15,850 --> 00:45:19,120 If you want everything, there's what's called a wild card character 940 00:45:19,120 --> 00:45:21,460 in SQL, which is just a star, an asterisk, which 941 00:45:21,460 --> 00:45:25,120 means give me every column without my knowing even what they're called. 942 00:45:25,120 --> 00:45:28,400 Let me go ahead now and say from favorites semicolon. 943 00:45:28,400 --> 00:45:33,010 And this is the SQL way of opening the database, iterating over 944 00:45:33,010 --> 00:45:36,400 every row therein, printing out every row therein, done. 945 00:45:36,400 --> 00:45:39,640 So those three steps, which was like nine lines of Python code 946 00:45:39,640 --> 00:45:42,700 give or take earlier, is now one line of SQL. 947 00:45:42,700 --> 00:45:43,990 I hit Enter. 948 00:45:43,990 --> 00:45:46,370 There is all of the data. 949 00:45:46,370 --> 00:45:48,130 So I see now all of the data. 950 00:45:48,130 --> 00:45:50,230 Just output it as a CSV here. 951 00:45:50,230 --> 00:45:51,940 But it's not the CSV file. 952 00:45:51,940 --> 00:45:53,210 It's now actually the table. 953 00:45:53,210 --> 00:45:54,580 And in fact, just for good measure, let me 954 00:45:54,580 --> 00:45:58,000 do this, because you'll see the behavior a little differently the next time we 955 00:45:58,000 --> 00:45:58,690 open the file. 956 00:45:58,690 --> 00:46:00,730 I've just exited out of SQLite3. 957 00:46:00,730 --> 00:46:03,490 I'm going to rerun it, but I'm not going to reimport the data 958 00:46:03,490 --> 00:46:05,920 or do anything like that, because my file now exists. 959 00:46:05,920 --> 00:46:07,790 In fact, let me take one step back. 960 00:46:07,790 --> 00:46:11,830 If I type ls at my Linux prompt, there's my favorites.py from before. 961 00:46:11,830 --> 00:46:13,870 There's my favorites.csv from before. 962 00:46:13,870 --> 00:46:18,580 And here's a third file that I did create a moment ago when I first 963 00:46:18,580 --> 00:46:19,690 ran SQLite3. 964 00:46:19,690 --> 00:46:21,130 So the data is persistent. 965 00:46:21,130 --> 00:46:23,050 It's not using RAM or memory. 966 00:46:23,050 --> 00:46:24,950 Anything I do now is saved there. 967 00:46:24,950 --> 00:46:27,653 So let's go ahead and rerun SQLite3 with the same file. 968 00:46:27,653 --> 00:46:28,570 But I'm not going to-- 969 00:46:28,570 --> 00:46:31,130 I don't have to reimport everything, because the file already exists. 970 00:46:31,130 --> 00:46:32,800 Let me now do that same thing again. 971 00:46:32,800 --> 00:46:36,460 Select star from favorites to get all of the data. 972 00:46:36,460 --> 00:46:39,990 And what you'll see now is the same data, but it's a little prettier now. 973 00:46:39,990 --> 00:46:44,400 Because I reran it, I effectively disabled CSV mode this time. 974 00:46:44,400 --> 00:46:49,590 And what I'm now seeing is the entire contents of this database 975 00:46:49,590 --> 00:46:51,420 table called favorites. 976 00:46:51,420 --> 00:46:53,370 Now, there's nothing new here, but you're just 977 00:46:53,370 --> 00:46:55,980 seeing now like an ASCII or Unicode version of all 978 00:46:55,980 --> 00:46:57,988 of the same data from that database. 979 00:46:57,988 --> 00:47:00,030 Well, suppose I want to get a subset of the data. 980 00:47:00,030 --> 00:47:01,238 Well, let me clear my screen. 981 00:47:01,238 --> 00:47:04,530 And just like in Linux, I can Control L just to clean things up aesthetically. 982 00:47:04,530 --> 00:47:06,580 Suppose I want to get just the languages. 983 00:47:06,580 --> 00:47:09,540 So I could do select language from favorites. 984 00:47:09,540 --> 00:47:12,930 And this will now select not all three columns, a.k.a. 985 00:47:12,930 --> 00:47:17,280 star, this will only select the language column and all of the data therein. 986 00:47:17,280 --> 00:47:19,020 If I hit Enter, voila. 987 00:47:19,020 --> 00:47:21,150 Now I just see those there. 988 00:47:21,150 --> 00:47:22,990 No timestamps, no problems. 989 00:47:22,990 --> 00:47:25,470 It's just a slice of the table, if you will. 990 00:47:25,470 --> 00:47:30,010 All right, not that interesting still because it's just a big column of data. 991 00:47:30,010 --> 00:47:31,900 But now things get more interesting. 992 00:47:31,900 --> 00:47:35,850 It turns out in SQL that there are functions that come with this language, 993 00:47:35,850 --> 00:47:37,920 just like C, just like Python. 994 00:47:37,920 --> 00:47:41,790 In SQL, some of the more useful ones, some of the simpler ones, 995 00:47:41,790 --> 00:47:42,630 are these here. 996 00:47:42,630 --> 00:47:44,880 Average, count, distinct, lower, max, min, 997 00:47:44,880 --> 00:47:48,150 upper, which pretty much do what they say. 998 00:47:48,150 --> 00:47:50,533 And count is a particularly useful one. 999 00:47:50,533 --> 00:47:51,450 Let's start with that. 1000 00:47:51,450 --> 00:47:55,740 It's a reasonable question to be asked how many people submitted the Google 1001 00:47:55,740 --> 00:47:58,523 Form by the time I actually downloaded the CSV. 1002 00:47:58,523 --> 00:48:00,190 Well, why don't we go ahead and do this? 1003 00:48:00,190 --> 00:48:02,850 Let me go back to VS Code here in my terminal window. 1004 00:48:02,850 --> 00:48:06,900 Let me select not star but the count of star. 1005 00:48:06,900 --> 00:48:10,020 So give me the count of the rows that are being 1006 00:48:10,020 --> 00:48:12,660 returned from the database called-- 1007 00:48:12,660 --> 00:48:14,190 the database table called favorites. 1008 00:48:14,190 --> 00:48:16,648 Now when I hit Enter, I'm not going to get all of the data. 1009 00:48:16,648 --> 00:48:18,630 I'm just going to get simply a number. 1010 00:48:18,630 --> 00:48:21,480 430 rows came back. 1011 00:48:21,480 --> 00:48:22,740 So that's pretty good. 1012 00:48:22,740 --> 00:48:24,580 I now know how much data is in there. 1013 00:48:24,580 --> 00:48:26,350 Well, what languages were in there? 1014 00:48:26,350 --> 00:48:30,520 Well, I could do select language from favorites just as before, 1015 00:48:30,520 --> 00:48:33,270 but that's not that useful, especially if I'm inheriting the data. 1016 00:48:33,270 --> 00:48:35,770 Like I'm the analyst who's been handed a data set by my boss 1017 00:48:35,770 --> 00:48:37,440 and they want me to crunch some numbers. 1018 00:48:37,440 --> 00:48:39,810 OK, I could load this into Excel. 1019 00:48:39,810 --> 00:48:40,770 I could sort it. 1020 00:48:40,770 --> 00:48:44,400 But you can use SQL now to answer pretty basic questions too. 1021 00:48:44,400 --> 00:48:48,540 If you want to select the distinct languages in the data set because you 1022 00:48:48,540 --> 00:48:51,000 weren't privy to the Google Form, let me go ahead 1023 00:48:51,000 --> 00:48:56,820 and select only the distinct languages from the favorites table. 1024 00:48:56,820 --> 00:49:00,330 And now I hit Enter and I get back a much more succinct answer. 1025 00:49:00,330 --> 00:49:02,220 Just the three languages in question. 1026 00:49:02,220 --> 00:49:04,963 Not really that useful since I created the Google Form, 1027 00:49:04,963 --> 00:49:07,380 but certainly if you're inheriting data from someone else, 1028 00:49:07,380 --> 00:49:09,390 you've just downloaded a data set, at least now 1029 00:49:09,390 --> 00:49:12,190 I'm arguably wrapping my mind around what's going on. 1030 00:49:12,190 --> 00:49:15,178 Now, this is not necessary for such a small data set, 1031 00:49:15,178 --> 00:49:16,470 but I can combine these things. 1032 00:49:16,470 --> 00:49:21,180 Select the count of the distinct languages in this data set 1033 00:49:21,180 --> 00:49:22,140 called favorites. 1034 00:49:22,140 --> 00:49:25,220 And now I should get back what answer? 1035 00:49:25,220 --> 00:49:27,378 So hopefully indeed an answer called three. 1036 00:49:27,378 --> 00:49:29,420 And what you're getting back notice aesthetically 1037 00:49:29,420 --> 00:49:32,180 too is like a mini temporary table. 1038 00:49:32,180 --> 00:49:36,890 When I asked for just the distinct languages, what SQL hands 1039 00:49:36,890 --> 00:49:39,410 me back is this temporary table in memory 1040 00:49:39,410 --> 00:49:43,340 that has one column called language and then three rows. 1041 00:49:43,340 --> 00:49:45,210 Now, this is not saved anywhere. 1042 00:49:45,210 --> 00:49:47,450 It's just executed ephemerally like this. 1043 00:49:47,450 --> 00:49:49,580 But that's why it's depicted in this way. 1044 00:49:49,580 --> 00:49:52,610 What you're getting is subsets of your data, smaller 1045 00:49:52,610 --> 00:49:54,410 tables containing some of your data. 1046 00:49:54,410 --> 00:49:55,550 And same thing down here. 1047 00:49:55,550 --> 00:49:58,400 This is a crazy long column name. 1048 00:49:58,400 --> 00:50:00,480 You can rename it if you really want. 1049 00:50:00,480 --> 00:50:02,900 But that's all we're seeing there. 1050 00:50:02,900 --> 00:50:07,100 And in fact, if that's a little ugly, we can actually alias these things. 1051 00:50:07,100 --> 00:50:12,090 N is a common name for a variable, a number in any programming language. 1052 00:50:12,090 --> 00:50:15,200 So I can actually alias this to be a column called n. 1053 00:50:15,200 --> 00:50:16,160 Hit Enter. 1054 00:50:16,160 --> 00:50:19,490 And now I'm getting a tiny, tiny table whose column is called 1055 00:50:19,490 --> 00:50:23,280 n that just has the one value there. 1056 00:50:23,280 --> 00:50:28,500 All right, questions on these application of these functions here? 1057 00:50:28,500 --> 00:50:31,950 1058 00:50:31,950 --> 00:50:33,152 Questions, yeah. 1059 00:50:33,152 --> 00:50:35,000 AUDIENCE: [INAUDIBLE] 1060 00:50:35,000 --> 00:50:36,400 DAVID MALAN: Say a little louder. 1061 00:50:36,400 --> 00:50:37,280 AUDIENCE: A-S. 1062 00:50:37,280 --> 00:50:38,475 DAVID MALAN: Oh, A-S. As. 1063 00:50:38,475 --> 00:50:39,350 Literally in English. 1064 00:50:39,350 --> 00:50:42,650 So rename this column as this. 1065 00:50:42,650 --> 00:50:45,317 Technically it creates an alias for the column. 1066 00:50:45,317 --> 00:50:45,900 So that's all. 1067 00:50:45,900 --> 00:50:46,818 Yeah? 1068 00:50:46,818 --> 00:50:49,955 AUDIENCE: [INAUDIBLE] 1069 00:50:49,955 --> 00:50:50,830 DAVID MALAN: Exactly. 1070 00:50:50,830 --> 00:50:53,530 Distinct will operate on whatever you hand it in parentheses 1071 00:50:53,530 --> 00:50:57,460 and get rid of all of the duplicates, giving you back just the uniques. 1072 00:50:57,460 --> 00:50:58,360 Correct. 1073 00:50:58,360 --> 00:50:59,780 Other questions here? 1074 00:50:59,780 --> 00:51:00,524 Yeah. 1075 00:51:00,524 --> 00:51:02,944 AUDIENCE: [INAUDIBLE] 1076 00:51:02,944 --> 00:51:06,935 1077 00:51:06,935 --> 00:51:08,060 DAVID MALAN: Good question. 1078 00:51:08,060 --> 00:51:10,355 When you define an alias like n, which I just did, 1079 00:51:10,355 --> 00:51:12,230 does it become like a variable you can reuse? 1080 00:51:12,230 --> 00:51:16,580 Short answer, no in this case, but you can reuse it within your same query. 1081 00:51:16,580 --> 00:51:20,630 Even though these queries are getting a little longer, admittedly, 1082 00:51:20,630 --> 00:51:23,270 statements that they are, you can actually 1083 00:51:23,270 --> 00:51:26,090 reuse n in an even longer query. 1084 00:51:26,090 --> 00:51:27,440 So later in your query. 1085 00:51:27,440 --> 00:51:30,380 And we'll see a few that are going to start to grow in length. 1086 00:51:30,380 --> 00:51:33,050 So it's a nice way of nicknaming things just to be 1087 00:51:33,050 --> 00:51:35,190 a little more terse in your query. 1088 00:51:35,190 --> 00:51:38,480 So we can transition to some of these more sophisticated queries 1089 00:51:38,480 --> 00:51:41,360 because it turns out there are some other techniques we 1090 00:51:41,360 --> 00:51:42,630 can introduce as well. 1091 00:51:42,630 --> 00:51:44,660 Here are some other keywords in SQL. 1092 00:51:44,660 --> 00:51:47,005 And again, even though this is another list of things, 1093 00:51:47,005 --> 00:51:49,130 there's only four things fundamentally we're doing. 1094 00:51:49,130 --> 00:51:51,600 Creating, reading, updating, and deleting data. 1095 00:51:51,600 --> 00:51:55,580 These are just allowing us to fine tune how we do it exactly. 1096 00:51:55,580 --> 00:51:59,210 So where is going to allow us to filter data, as we'll do in just a moment. 1097 00:51:59,210 --> 00:52:03,440 Like select data where this conditional is true. 1098 00:52:03,440 --> 00:52:06,260 Like is going to be an alternative to an equal sign. 1099 00:52:06,260 --> 00:52:11,210 So instead of looking for exactly Scratch or exactly Python or exactly C, 1100 00:52:11,210 --> 00:52:13,790 you can look for something like dot dot dot 1101 00:52:13,790 --> 00:52:16,770 and it can be a little bit of a fuzzier match, if you will, 1102 00:52:16,770 --> 00:52:18,020 with other characters as well. 1103 00:52:18,020 --> 00:52:19,790 Order by is going to deal with sorting. 1104 00:52:19,790 --> 00:52:23,360 Limit is going to just let me limit the total number of rows that come back 1105 00:52:23,360 --> 00:52:26,330 to 1 or 10 or finite if I don't want to see 1106 00:52:26,330 --> 00:52:28,375 all 400 plus rows all at once, because I'm just 1107 00:52:28,375 --> 00:52:29,750 trying to wrap my mind around it. 1108 00:52:29,750 --> 00:52:32,207 And group by is best shown by example. 1109 00:52:32,207 --> 00:52:34,290 So let's play with just a couple of these as well. 1110 00:52:34,290 --> 00:52:36,080 Let me go back to VS Code here. 1111 00:52:36,080 --> 00:52:37,170 I'll clear my screen. 1112 00:52:37,170 --> 00:52:39,530 I'm still in the same SQLite instance. 1113 00:52:39,530 --> 00:52:44,040 And let's count how many of you liked C without writing Python code as before. 1114 00:52:44,040 --> 00:52:47,960 So let me go ahead and select the count of the rows 1115 00:52:47,960 --> 00:52:54,530 from favorites where the language in each row equals C. 1116 00:52:54,530 --> 00:52:58,100 And the convention in SQLite is to use single quotes any time 1117 00:52:58,100 --> 00:53:02,810 you're surrounding a string that's meant to represent a literal piece of text 1118 00:53:02,810 --> 00:53:06,720 as opposed to C, which was double quotes, or Python, which was either. 1119 00:53:06,720 --> 00:53:11,090 So this is selecting the count of rows from favorites table where 1120 00:53:11,090 --> 00:53:13,730 the language in question is C. Enter. 1121 00:53:13,730 --> 00:53:15,410 And this gives me 98. 1122 00:53:15,410 --> 00:53:19,190 Notice, though, if I omit that predicate like we did before, 1123 00:53:19,190 --> 00:53:23,850 you'll get back the total number of rows that were in the table. 1124 00:53:23,850 --> 00:53:26,030 So where is what's called a predicate that just 1125 00:53:26,030 --> 00:53:30,020 allows me to filter things just like an if condition or the like in a language 1126 00:53:30,020 --> 00:53:31,130 that we've seen before. 1127 00:53:31,130 --> 00:53:34,670 You can be a little more specific like how many people really liked 1128 00:53:34,670 --> 00:53:38,870 C and the Mario problem specifically? 1129 00:53:38,870 --> 00:53:40,440 Well, let's do this. 1130 00:53:40,440 --> 00:53:45,830 Let's go ahead and do select the number of rows from the favorites table 1131 00:53:45,830 --> 00:53:49,640 where the language is C and. 1132 00:53:49,640 --> 00:53:52,880 So it's still literally the word ands and or, just 1133 00:53:52,880 --> 00:53:57,830 like in Python but not like in C. And problem equals Mario. 1134 00:53:57,830 --> 00:54:01,520 So let's see if there's any fans of both C and the Mario problem. 1135 00:54:01,520 --> 00:54:05,638 And three of us really like those two things together in this case. 1136 00:54:05,638 --> 00:54:06,930 All right, what else can we do? 1137 00:54:06,930 --> 00:54:10,280 Well, more compelling might be to see, kind of like in Python, 1138 00:54:10,280 --> 00:54:13,580 for each language, what was the popularity thereof? 1139 00:54:13,580 --> 00:54:15,560 And at the moment, we don't really have a way 1140 00:54:15,560 --> 00:54:17,960 of doing that except in Python where we had the loop 1141 00:54:17,960 --> 00:54:20,710 and we had those variables and the dictionary that did all of that 1142 00:54:20,710 --> 00:54:21,410 counting for us. 1143 00:54:21,410 --> 00:54:24,710 Totally doable but tedious, especially if your job is to analyze data. 1144 00:54:24,710 --> 00:54:28,190 My God, even writing 15 lines of code to answer simple questions 1145 00:54:28,190 --> 00:54:29,390 is kind of ridiculous. 1146 00:54:29,390 --> 00:54:31,560 SQL can do better for us. 1147 00:54:31,560 --> 00:54:33,630 So let me go ahead and do this. 1148 00:54:33,630 --> 00:54:38,540 Let me go ahead and select every language and the count 1149 00:54:38,540 --> 00:54:45,650 thereof from the favorites table but this time group by language. 1150 00:54:45,650 --> 00:54:48,440 So this was another one of the keywords that we 1151 00:54:48,440 --> 00:54:52,460 can use in this abbreviated list of extra features of SQL. 1152 00:54:52,460 --> 00:54:55,760 And this one takes a moment to wrap your mind around, 1153 00:54:55,760 --> 00:54:59,510 but this is going to give me a two column temporary table where 1154 00:54:59,510 --> 00:55:02,030 the first column is a language and the second column 1155 00:55:02,030 --> 00:55:05,330 is the count thereof from this data set. 1156 00:55:05,330 --> 00:55:10,010 And group by language just means that only show me Scratch once, 1157 00:55:10,010 --> 00:55:12,350 only show me C once, only show me Python once. 1158 00:55:12,350 --> 00:55:16,130 That is group all of the identical values together, 1159 00:55:16,130 --> 00:55:19,260 but keep track of how many of them there are. 1160 00:55:19,260 --> 00:55:27,200 And so now if I go over to SQLite and I hit Enter, now I have in SQL version 1161 00:55:27,200 --> 00:55:29,720 the exact same output that I had from Python 1162 00:55:29,720 --> 00:55:32,120 that took me, what, 15 plus lines before. 1163 00:55:32,120 --> 00:55:35,810 Now we're down to just one because SQL, Structured Query Language, 1164 00:55:35,810 --> 00:55:39,830 is all about constructing queries like this to answer questions and get back 1165 00:55:39,830 --> 00:55:41,152 answers quickly. 1166 00:55:41,152 --> 00:55:42,860 If we want to clean this up a little bit, 1167 00:55:42,860 --> 00:55:44,960 you asked earlier about sorting order. 1168 00:55:44,960 --> 00:55:46,200 Well, we can do that too. 1169 00:55:46,200 --> 00:55:48,930 There's another key phrase we can use here. 1170 00:55:48,930 --> 00:55:55,080 We can order by the count of those rows and then run that query here. 1171 00:55:55,080 --> 00:55:57,480 So now unfortunately they're from smallest to biggest, 1172 00:55:57,480 --> 00:55:58,675 but we can reverse that. 1173 00:55:58,675 --> 00:56:00,800 It turns out, and my query's starting to wrap here. 1174 00:56:00,800 --> 00:56:02,360 I'll zoom out for a moment. 1175 00:56:02,360 --> 00:56:07,050 If you want to order by count, the default is in ascending order, 1176 00:56:07,050 --> 00:56:10,980 abbreviated A-S-C. If you want to reverse the sort in SQL, 1177 00:56:10,980 --> 00:56:13,920 instead of using reverse equals true like we did in Python, 1178 00:56:13,920 --> 00:56:16,380 you say D-E-S-C for descending order. 1179 00:56:16,380 --> 00:56:19,557 And now we get almost the same output but flipped in reverse. 1180 00:56:19,557 --> 00:56:22,140 So it's just a lot faster to answer questions once, of course, 1181 00:56:22,140 --> 00:56:24,660 you get some muscle memory and some comfort with it. 1182 00:56:24,660 --> 00:56:26,280 Well, what else can I do? 1183 00:56:26,280 --> 00:56:28,890 What if I just care about the most popular language? 1184 00:56:28,890 --> 00:56:31,140 I don't care about the second place or the third place 1185 00:56:31,140 --> 00:56:32,460 languages or anything else. 1186 00:56:32,460 --> 00:56:34,830 Well, let me add one more clause here. 1187 00:56:34,830 --> 00:56:36,750 Limit the answer to one. 1188 00:56:36,750 --> 00:56:39,330 And no matter how many rows should come back now, 1189 00:56:39,330 --> 00:56:41,880 I just get the number one language as of the data 1190 00:56:41,880 --> 00:56:47,870 set we collected with 270 votes for it. 1191 00:56:47,870 --> 00:56:52,060 Questions on this? 1192 00:56:52,060 --> 00:56:55,260 Any questions here? 1193 00:56:55,260 --> 00:56:55,800 No? 1194 00:56:55,800 --> 00:56:58,620 Well, what if we're starting to introduce SQL 1195 00:56:58,620 --> 00:57:02,250 and it was kind of too late to make it into the Google Form? 1196 00:57:02,250 --> 00:57:04,260 So it turns out there's syntax for this too. 1197 00:57:04,260 --> 00:57:05,670 You can create data, of course. 1198 00:57:05,670 --> 00:57:09,240 Not just the tables, but the data therein and here's the typical syntax 1199 00:57:09,240 --> 00:57:11,640 for inserting data into a SQL database. 1200 00:57:11,640 --> 00:57:14,850 You literally say insert into the name of the table. 1201 00:57:14,850 --> 00:57:19,020 And then in parentheses, you specify one or more columns for which you 1202 00:57:19,020 --> 00:57:20,940 have values that you want to insert. 1203 00:57:20,940 --> 00:57:22,950 This is to say you don't have to give values 1204 00:57:22,950 --> 00:57:25,650 for every column in the given row. 1205 00:57:25,650 --> 00:57:27,990 If you only have answers to some of those questions, 1206 00:57:27,990 --> 00:57:29,940 you can enumerate them here like this. 1207 00:57:29,940 --> 00:57:32,010 But the values you insert are going to be these. 1208 00:57:32,010 --> 00:57:35,500 So you literally say after the close parenthesis values. 1209 00:57:35,500 --> 00:57:38,430 And then in a second set of parentheses with a same length 1210 00:57:38,430 --> 00:57:42,370 comma separated list, you specify what values do you want to insert. 1211 00:57:42,370 --> 00:57:43,710 So it's a little verbose. 1212 00:57:43,710 --> 00:57:47,520 And frankly longer term, you're going to use Python code to automatically do 1213 00:57:47,520 --> 00:57:50,010 these kinds of insertions, but let's go ahead and try this. 1214 00:57:50,010 --> 00:57:57,990 Right now if I do select distinct language from favorites, again, 1215 00:57:57,990 --> 00:57:58,833 we see this. 1216 00:57:58,833 --> 00:58:00,000 Just these three candidates. 1217 00:58:00,000 --> 00:58:01,870 But we've now taught you a bit of SQL. 1218 00:58:01,870 --> 00:58:09,120 So let's do insert into favorites the column called language. 1219 00:58:09,120 --> 00:58:10,620 And you know what? 1220 00:58:10,620 --> 00:58:12,150 I'm going to give a problem here. 1221 00:58:12,150 --> 00:58:14,430 The values for which, and let me zoom back out, 1222 00:58:14,430 --> 00:58:17,910 are going to be quote unquote "SQL" and quote, unquote "fiftyville." 1223 00:58:17,910 --> 00:58:20,130 You'll soon see what that's all about. 1224 00:58:20,130 --> 00:58:21,240 Semicolon. 1225 00:58:21,240 --> 00:58:24,580 Nothing seems to happen, but that's usually a good thing. 1226 00:58:24,580 --> 00:58:26,580 And now if I scroll back up in my queries, 1227 00:58:26,580 --> 00:58:30,990 in SQLite3 you can scroll back and forth in time to avoid retyping things, 1228 00:58:30,990 --> 00:58:35,400 now I should see indeed four candidate languages here. 1229 00:58:35,400 --> 00:58:39,150 Now, suppose that you were never really a fan of C and maybe you programmed 1230 00:58:39,150 --> 00:58:42,990 a little bit in high school or in the real world and you liked C++. 1231 00:58:42,990 --> 00:58:50,010 Well, there's a whole lot of answers for C. So select star from favorites 1232 00:58:50,010 --> 00:58:53,430 where language equals quote unquote "C." 1233 00:58:53,430 --> 00:58:58,020 So here's everyone who submitted the answer for C. Let's presume that, no, 1234 00:58:58,020 --> 00:58:59,850 they didn't really want C, they wanted C++, 1235 00:58:59,850 --> 00:59:01,767 which is not a language we teach in the class. 1236 00:59:01,767 --> 00:59:03,910 But I could also now do this. 1237 00:59:03,910 --> 00:59:08,670 You can use the update command to set a column or columns to different values 1238 00:59:08,670 --> 00:59:10,210 where some condition is met. 1239 00:59:10,210 --> 00:59:15,360 So if I do update table name set column name equal to some value 1240 00:59:15,360 --> 00:59:19,330 filtering it perhaps by where some condition is true. 1241 00:59:19,330 --> 00:59:22,890 So suppose I've changed my mind, or you know what, 1242 00:59:22,890 --> 00:59:31,320 let's go ahead and do update favorites set language equal to maybe C++ where 1243 00:59:31,320 --> 00:59:33,870 language equals C. 1244 00:59:33,870 --> 00:59:35,910 Now, this is destructive, so you generally 1245 00:59:35,910 --> 00:59:38,368 don't want to do this unless you have a backup of your data 1246 00:59:38,368 --> 00:59:40,170 too, overriding what people's answers are. 1247 00:59:40,170 --> 00:59:42,753 This seems to have been successful, because no error messages. 1248 00:59:42,753 --> 00:59:45,270 And if I rerun the previous select that gives me 1249 00:59:45,270 --> 00:59:49,410 all of the favorites where language equals C, now indeed I get none. 1250 00:59:49,410 --> 00:59:53,127 But if I search for C++, now I get a lot. 1251 00:59:53,127 --> 00:59:55,710 And if I get rid of that where clause altogether and just look 1252 00:59:55,710 --> 01:00:00,330 at the contents of my database, now you see that indeed C++ is comingled with 1253 01:00:00,330 --> 01:00:01,140 all the other data. 1254 01:00:01,140 --> 01:00:04,270 This is not what you all intended, of course, so I can undo this. 1255 01:00:04,270 --> 01:00:06,150 Let me go ahead and undo what I just did. 1256 01:00:06,150 --> 01:00:12,570 Let me set my favorite language to C where language equals C++. 1257 01:00:12,570 --> 01:00:14,460 But the predicate is important. 1258 01:00:14,460 --> 01:00:16,230 This I'm not going to do. 1259 01:00:16,230 --> 01:00:20,190 What if I accidentally omitted this predicate, the where clause? 1260 01:00:20,190 --> 01:00:22,545 How would that screw things up might you think? 1261 01:00:22,545 --> 01:00:26,160 1262 01:00:26,160 --> 01:00:26,970 Yeah, in the back. 1263 01:00:26,970 --> 01:00:29,110 AUDIENCE: [INAUDIBLE] 1264 01:00:29,110 --> 01:00:32,590 DAVID MALAN: It would set every row's language to indeed C. 1265 01:00:32,590 --> 01:00:34,060 And this is dangerous. 1266 01:00:34,060 --> 01:00:37,720 And if you start googling around for SQL mistakes or the like, 1267 01:00:37,720 --> 01:00:41,590 people in the real world have accidentally run commands like this. 1268 01:00:41,590 --> 01:00:43,510 And without naming names, a former member 1269 01:00:43,510 --> 01:00:47,200 of our teaching staff at one point accidentally ran a command like this 1270 01:00:47,200 --> 01:00:51,610 and changed every student's name in our database to Bobby I think it was. 1271 01:00:51,610 --> 01:00:55,120 The same name for every row because they simply forgot a predicate. 1272 01:00:55,120 --> 01:00:58,240 So here too there's dangers in code, and you 1273 01:00:58,240 --> 01:01:00,730 should adopt the habit quite quickly of always, 1274 01:01:00,730 --> 01:01:03,730 one, backing up your data like with CP, for instance, 1275 01:01:03,730 --> 01:01:09,010 in Linux or any other technique or just making sure before you hit Enter that, 1276 01:01:09,010 --> 01:01:11,322 yes, this is indeed the query I want to execute. 1277 01:01:11,322 --> 01:01:13,030 And generally speaking in the real world, 1278 01:01:13,030 --> 01:01:14,890 there should be process controls in place. 1279 01:01:14,890 --> 01:01:17,770 Like the intern should not have access to the production 1280 01:01:17,770 --> 01:01:19,570 database, the live database, and the like. 1281 01:01:19,570 --> 01:01:21,860 But you have a lot of power now with these queries. 1282 01:01:21,860 --> 01:01:25,520 So just be all the more careful, because very easily can you do bad things. 1283 01:01:25,520 --> 01:01:26,660 So let me undo this. 1284 01:01:26,660 --> 01:01:30,790 Where language equals quote unquote "C++." 1285 01:01:30,790 --> 01:01:31,840 And I'll zoom back out. 1286 01:01:31,840 --> 01:01:32,470 Enter. 1287 01:01:32,470 --> 01:01:34,480 And now I think we're back in business. 1288 01:01:34,480 --> 01:01:36,100 C is among the answers. 1289 01:01:36,100 --> 01:01:37,520 Yeah? 1290 01:01:37,520 --> 01:01:40,340 AUDIENCE: [INAUDIBLE] 1291 01:01:40,340 --> 01:01:42,293 1292 01:01:42,293 --> 01:01:44,460 DAVID MALAN: It's essentially doing what at the end? 1293 01:01:44,460 --> 01:01:45,270 AUDIENCE: [INAUDIBLE] 1294 01:01:45,270 --> 01:01:46,680 DAVID MALAN: It's essentially find and replace. 1295 01:01:46,680 --> 01:01:47,180 Yes. 1296 01:01:47,180 --> 01:01:51,060 In layperson's terms, this is find and replace implemented with SQL. 1297 01:01:51,060 --> 01:01:55,020 And in fact, the authors of Microsoft Word or Google Docs 1298 01:01:55,020 --> 01:01:58,290 might very well be using language like this SQL 1299 01:01:58,290 --> 01:02:01,710 when you go to the nice graphical user friendly find and replace box. 1300 01:02:01,710 --> 01:02:05,010 This may very well be what they're doing underneath the hood or, of course, 1301 01:02:05,010 --> 01:02:07,770 they could be using some other language altogether. 1302 01:02:07,770 --> 01:02:10,530 There's one last syntax that's worth knowing, 1303 01:02:10,530 --> 01:02:14,280 delete, which for better or for worse is even more destructive whereby it 1304 01:02:14,280 --> 01:02:16,590 allows you to delete rows from tables. 1305 01:02:16,590 --> 01:02:20,130 It's distinct from drop, which lets you delete tables themselves. 1306 01:02:20,130 --> 01:02:21,630 This focuses on rows. 1307 01:02:21,630 --> 01:02:26,250 So suppose that you really, really didn't like, 1308 01:02:26,250 --> 01:02:30,000 let's say, Tideman was a little challenging if you 1309 01:02:30,000 --> 01:02:31,840 tackled that more comfortable problem. 1310 01:02:31,840 --> 01:02:34,800 So if you really don't want to even think about Tideman anymore, 1311 01:02:34,800 --> 01:02:40,080 so why don't we do delete from favorites where problem equals, and I 1312 01:02:40,080 --> 01:02:42,240 won't execute it for real, Tideman. 1313 01:02:42,240 --> 01:02:45,360 This would have the effect of deleting every row, including 1314 01:02:45,360 --> 01:02:50,010 the language therein and the timestamp, where the student answered Tideman. 1315 01:02:50,010 --> 01:02:53,910 Worse than this would be this. 1316 01:02:53,910 --> 01:02:55,125 Why might this be bad? 1317 01:02:55,125 --> 01:02:58,560 1318 01:02:58,560 --> 01:03:00,930 OK, chuckling because there's no predicate. 1319 01:03:00,930 --> 01:03:04,960 There's no filter, which means literally this would delete all of the data. 1320 01:03:04,960 --> 01:03:07,740 So again, with great power here comes great responsibility. 1321 01:03:07,740 --> 01:03:12,390 Now, this has just been a data set of 430 rows by us dynamically created. 1322 01:03:12,390 --> 01:03:15,240 There's, of course, some really juicy data sets in the real world. 1323 01:03:15,240 --> 01:03:17,990 And one website you might have heard or an app you might have used 1324 01:03:17,990 --> 01:03:20,460 is IMDb, the Internet Movie Database, which wonderfully 1325 01:03:20,460 --> 01:03:24,330 makes some of their data available for download as CSV files 1326 01:03:24,330 --> 01:03:27,420 or technically TSV files, Tab Separated Values. 1327 01:03:27,420 --> 01:03:31,830 But what we did in advance of class was download some of that data for both TV 1328 01:03:31,830 --> 01:03:34,170 shows in the real world and movies in the real world. 1329 01:03:34,170 --> 01:03:35,962 And what's wonderful about this data set is 1330 01:03:35,962 --> 01:03:39,630 it's not just dozens or hundreds or even thousands of lines. 1331 01:03:39,630 --> 01:03:44,190 There are millions of rows of juicy data, TV shows and movies 1332 01:03:44,190 --> 01:03:47,190 with which most folks are probably familiar, at least with a subset. 1333 01:03:47,190 --> 01:03:50,580 And we'll see in just a little bit that this data 1334 01:03:50,580 --> 01:03:54,600 comes in the form of now six different tables that we've given you. 1335 01:03:54,600 --> 01:03:56,610 And the tables in question for today are going 1336 01:03:56,610 --> 01:04:00,735 to be the people in the TV business, the stars therein, the shows that people 1337 01:04:00,735 --> 01:04:01,860 are producing and the like. 1338 01:04:01,860 --> 01:04:04,860 This is a picture we'll revisit to enable 1339 01:04:04,860 --> 01:04:07,110 you to wrap your minds around what the actual data is. 1340 01:04:07,110 --> 01:04:09,948 This feels like a good opportunity though for a snack. 1341 01:04:09,948 --> 01:04:12,990 In fact, in just a moment, we have a whole lot of Rice Krispie treats out 1342 01:04:12,990 --> 01:04:13,690 in the lobby. 1343 01:04:13,690 --> 01:04:18,630 But if folks could perhaps acknowledge this mini wedding cake here. 1344 01:04:18,630 --> 01:04:21,640 CS50 zone Carter Zenke is getting married this week. 1345 01:04:21,640 --> 01:04:28,880 So congratulations to Carter as well. 1346 01:04:28,880 --> 01:04:30,550 Congrats. 1347 01:04:30,550 --> 01:04:31,940 All right. 1348 01:04:31,940 --> 01:04:34,440 [APPLAUSE] 1349 01:04:34,440 --> 01:04:38,115 1350 01:04:38,115 --> 01:04:40,952 There's only one piece of cake in that box but a lot of Rice Krispie 1351 01:04:40,952 --> 01:04:41,910 treats in the transept. 1352 01:04:41,910 --> 01:04:45,420 Let's take 10 minutes and we'll be back with Internet Movie Database in 10. 1353 01:04:45,420 --> 01:04:46,470 All right. 1354 01:04:46,470 --> 01:04:47,320 We are back. 1355 01:04:47,320 --> 01:04:51,000 So if you've never been, you can actually go to imdb.com right now 1356 01:04:51,000 --> 01:04:53,020 and play around or download the mobile app. 1357 01:04:53,020 --> 01:04:56,580 And it's just big database of a lot of TV shows and movies 1358 01:04:56,580 --> 01:04:58,090 and actors and the like. 1359 01:04:58,090 --> 01:05:00,828 But what indeed is nice is you can download some of that data. 1360 01:05:00,828 --> 01:05:02,370 And that's what I've done in advance. 1361 01:05:02,370 --> 01:05:05,160 And what we've done is we wrote some Python code 1362 01:05:05,160 --> 01:05:09,540 to convert some of the flat file databases that they let you download 1363 01:05:09,540 --> 01:05:13,890 and we converted it into a SQL database with six tables. 1364 01:05:13,890 --> 01:05:17,940 So not just one but six that ultimately are these here. 1365 01:05:17,940 --> 01:05:21,090 And let me just help you wrap your minds around what this picture is, which 1366 01:05:21,090 --> 01:05:25,410 is an entity relationship diagram, which is just to say each of these boxes 1367 01:05:25,410 --> 01:05:27,090 on the screen represents a table. 1368 01:05:27,090 --> 01:05:31,230 And each of the arrows or edges represents some kind of relationship 1369 01:05:31,230 --> 01:05:32,100 across the tables. 1370 01:05:32,100 --> 01:05:34,200 Because up until now, the only data we had 1371 01:05:34,200 --> 01:05:36,270 were those three columns in the favorites table. 1372 01:05:36,270 --> 01:05:39,130 But what gets really useful about SQL databases, 1373 01:05:39,130 --> 01:05:41,700 just like a Google Spreadsheet or an Excel file, 1374 01:05:41,700 --> 01:05:45,300 is you can have multiple sheets or in a database multiple tables. 1375 01:05:45,300 --> 01:05:50,010 And so what we're about to see is that in the IMDb database for TV shows, 1376 01:05:50,010 --> 01:05:53,820 there's going to be a dedicated table for all the people in the TV business. 1377 01:05:53,820 --> 01:05:57,210 There's going to be a dedicated table for all of the TV shows 1378 01:05:57,210 --> 01:05:59,130 that are in their database as of right now. 1379 01:05:59,130 --> 01:06:03,180 There's going to be a dedicated table for writers in that industry, 1380 01:06:03,180 --> 01:06:08,580 for the ratings of shows, for the genres to which shows belong, 1381 01:06:08,580 --> 01:06:09,700 comedy and the like. 1382 01:06:09,700 --> 01:06:12,310 And then lastly, there's going to be this table, 1383 01:06:12,310 --> 01:06:17,700 which somehow associates people with the TV shows that they star in 1384 01:06:17,700 --> 01:06:19,030 and vice versa. 1385 01:06:19,030 --> 01:06:22,290 And so let's consider first what this looks like in code. 1386 01:06:22,290 --> 01:06:25,110 And we'll see that it's going to overwhelm intentionally at first, 1387 01:06:25,110 --> 01:06:26,200 but I'm going to do this. 1388 01:06:26,200 --> 01:06:27,690 I'm going to go back to my terminal window. 1389 01:06:27,690 --> 01:06:30,240 And during the break, I downloaded from the course's website 1390 01:06:30,240 --> 01:06:33,960 a file called shows.db, which we made in advance for you. 1391 01:06:33,960 --> 01:06:37,440 And if I type ls, I'll see all of my favorites files from before. 1392 01:06:37,440 --> 01:06:40,020 The CSV, the DB, and the Python file. 1393 01:06:40,020 --> 01:06:42,390 But now they're shows.db. 1394 01:06:42,390 --> 01:06:45,540 So I'm going to go ahead in my full screen terminal window here. 1395 01:06:45,540 --> 01:06:47,850 I'm not using actual tabs or code files. 1396 01:06:47,850 --> 01:06:52,380 Now I'm going to run sqlite3 on the file called shows.db. 1397 01:06:52,380 --> 01:06:54,900 And I'm just going to see this version information here. 1398 01:06:54,900 --> 01:06:57,120 Let me clear my screen and run the one command 1399 01:06:57,120 --> 01:07:01,170 I ran earlier to show us the schema of the favorites database. 1400 01:07:01,170 --> 01:07:03,947 Now we'll see the schema for the shows database. 1401 01:07:03,947 --> 01:07:06,030 And there's a lot going on here, but let me scroll 1402 01:07:06,030 --> 01:07:08,070 back up to the very top, the beginning. 1403 01:07:08,070 --> 01:07:10,570 And we see this here. 1404 01:07:10,570 --> 01:07:15,690 So when I run .schema, we see a dump, really, 1405 01:07:15,690 --> 01:07:20,310 of all of the SQL create table commands that were run in order to create this 1406 01:07:20,310 --> 01:07:21,330 database for you. 1407 01:07:21,330 --> 01:07:24,420 And one of those tables is called genres and another people, 1408 01:07:24,420 --> 01:07:27,370 ratings, shows, stars, and so forth. 1409 01:07:27,370 --> 01:07:29,340 And the columns therein, even though it's 1410 01:07:29,340 --> 01:07:32,670 formatted a little more prettily than the automatically generated create 1411 01:07:32,670 --> 01:07:37,680 table statement for favorites whereby we have one column per line of output 1412 01:07:37,680 --> 01:07:41,520 here in the, for instance, people table, there's 1413 01:07:41,520 --> 01:07:45,900 going to be an ID column, like a unique identifier like a Harvard ID, a Yale 1414 01:07:45,900 --> 01:07:50,730 ID or the like, a name column, a birth year, and then some other stuff. 1415 01:07:50,730 --> 01:07:53,760 If I scroll down to shows, every show in the world 1416 01:07:53,760 --> 01:07:58,410 is going to have a unique ID as well, a title of course, the year in which it 1417 01:07:58,410 --> 01:08:00,390 debuted, and the total number of episodes 1418 01:08:00,390 --> 01:08:02,370 as of the time we downloaded the data. 1419 01:08:02,370 --> 01:08:04,080 And then what else is there? 1420 01:08:04,080 --> 01:08:08,170 Some of these are a little less obvious like ratings here. 1421 01:08:08,170 --> 01:08:10,830 So ratings don't have an ID column, but they 1422 01:08:10,830 --> 01:08:15,038 have a show ID column and a rating like on a five point scale or a 10 point 1423 01:08:15,038 --> 01:08:16,830 scale or the like and then the total number 1424 01:08:16,830 --> 01:08:19,319 of votes that were collected to contribute to that rating. 1425 01:08:19,319 --> 01:08:23,680 IMDb allows people to upvote and downvote shows and movies and the like. 1426 01:08:23,680 --> 01:08:26,380 And then similarly is genre structured. 1427 01:08:26,380 --> 01:08:29,290 There's a show ID and then there's a genre, 1428 01:08:29,290 --> 01:08:33,370 which is going to be an English word like comedy or drama or something else. 1429 01:08:33,370 --> 01:08:34,200 And then what else? 1430 01:08:34,200 --> 01:08:38,609 Let's go a little further at the bottom here for stars and writers. 1431 01:08:38,609 --> 01:08:41,729 If we go to the very bottom here, stars and writers 1432 01:08:41,729 --> 01:08:43,470 are similarly structured too. 1433 01:08:43,470 --> 01:08:46,890 They have a show ID and a person ID. 1434 01:08:46,890 --> 01:08:48,540 So show and person. 1435 01:08:48,540 --> 01:08:52,470 And then this writers table has a show ID and a person ID. 1436 01:08:52,470 --> 01:08:56,340 And there's a whole lot of other words that we'll come to in just a moment. 1437 01:08:56,340 --> 01:08:58,330 But what is this code hinting at? 1438 01:08:58,330 --> 01:09:01,000 Well, if I go back to the picture from earlier here, 1439 01:09:01,000 --> 01:09:05,160 you'll see that this picture captures the relationships 1440 01:09:05,160 --> 01:09:07,020 among these various tables. 1441 01:09:07,020 --> 01:09:11,670 So for instance, if we focus on shows for just a moment, a show, 1442 01:09:11,670 --> 01:09:15,359 again, has a unique ID, a title, a year in which it debuted, 1443 01:09:15,359 --> 01:09:16,859 and a total number of episodes. 1444 01:09:16,859 --> 01:09:20,550 If you want to figure out what genre or genres 1445 01:09:20,550 --> 01:09:24,300 a show belongs to, because some shows are just comedies, 1446 01:09:24,300 --> 01:09:29,010 some shows are just dramas, but some shows are arguably comedies and dramas 1447 01:09:29,010 --> 01:09:30,700 depending on the episode or the like. 1448 01:09:30,700 --> 01:09:34,740 So you can imagine wanting to associate two or three or even more genres 1449 01:09:34,740 --> 01:09:35,640 with a show. 1450 01:09:35,640 --> 01:09:39,830 This line here in this second table allows us to do that. 1451 01:09:39,830 --> 01:09:46,310 Every row in the genres table we'll see has two items, a show ID 1452 01:09:46,310 --> 01:09:49,250 which relates to the ID of a show. 1453 01:09:49,250 --> 01:09:54,350 And that's why these lines literally line up with that specific column name. 1454 01:09:54,350 --> 01:09:57,050 And a genre, which is going to be like, quote unquote, "comedy," 1455 01:09:57,050 --> 01:09:59,250 quote unquote "drama," or something else. 1456 01:09:59,250 --> 01:10:02,180 Now with that said, design question. 1457 01:10:02,180 --> 01:10:07,250 Why have we deliberately not just gotten rid of this genres table 1458 01:10:07,250 --> 01:10:12,950 and made our lives simpler by just adding a genre column to this show's 1459 01:10:12,950 --> 01:10:13,640 table? 1460 01:10:13,640 --> 01:10:17,150 And again, a table is just like a sheet with rows and columns. 1461 01:10:17,150 --> 01:10:21,140 At the moment, shows only have four columns, ID, title, year, episodes. 1462 01:10:21,140 --> 01:10:28,980 Why not just add a fifth column called genre and put the show's genre there? 1463 01:10:28,980 --> 01:10:32,200 Any intuition here? 1464 01:10:32,200 --> 01:10:34,960 Why not just keep things simple? 1465 01:10:34,960 --> 01:10:36,286 Yeah, in back. 1466 01:10:36,286 --> 01:10:42,035 AUDIENCE: [INAUDIBLE] 1467 01:10:42,035 --> 01:10:42,910 DAVID MALAN: Exactly. 1468 01:10:42,910 --> 01:10:46,670 If you add a fifth column here and call it genre, 1469 01:10:46,670 --> 01:10:49,900 then you have to pick a genre specifically. 1470 01:10:49,900 --> 01:10:55,840 You have to put in that cell presumably comedy or drama or musical or something 1471 01:10:55,840 --> 01:10:56,480 else. 1472 01:10:56,480 --> 01:10:58,270 Now, you could write multiple words in the cell, 1473 01:10:58,270 --> 01:11:00,645 but generally speaking, that would be sloppy, bad design. 1474 01:11:00,645 --> 01:11:02,492 Like every cell just like in a spreadsheet 1475 01:11:02,492 --> 01:11:03,700 should really have one value. 1476 01:11:03,700 --> 01:11:07,270 It might have multiple words, but it shouldn't be a weirdly comma separated 1477 01:11:07,270 --> 01:11:08,530 list of multiple things. 1478 01:11:08,530 --> 01:11:10,655 It should just be in a different cell in that case. 1479 01:11:10,655 --> 01:11:14,320 So if you instead were to design this with just a single column called genre, 1480 01:11:14,320 --> 01:11:16,510 you're imposing what a computer scientist 1481 01:11:16,510 --> 01:11:19,480 would call a one to one relationship. 1482 01:11:19,480 --> 01:11:21,580 Every show has one genre. 1483 01:11:21,580 --> 01:11:23,800 And that's not necessarily a good thing. 1484 01:11:23,800 --> 01:11:26,170 Or strictly speaking, it would be a many to one, 1485 01:11:26,170 --> 01:11:28,510 because the same genre could belong to multiple shows, 1486 01:11:28,510 --> 01:11:31,760 but each show could only have one genre in that case. 1487 01:11:31,760 --> 01:11:33,760 What a relational database allows you to do, 1488 01:11:33,760 --> 01:11:35,890 and relational is indeed the operative word, 1489 01:11:35,890 --> 01:11:39,100 it allows you to factor out some of your information 1490 01:11:39,100 --> 01:11:44,590 and then have maybe one show here in one row but then in this genres table, 1491 01:11:44,590 --> 01:11:47,320 you could have one row for that one show genre, 1492 01:11:47,320 --> 01:11:52,450 or you could have two rows in the genres table for comedy and for drama. 1493 01:11:52,450 --> 01:11:55,490 Or if it has a third genre, you could just add another row here. 1494 01:11:55,490 --> 01:11:59,380 So you still have one row for the show itself 1495 01:11:59,380 --> 01:12:02,890 with all the juiciest details but a variable number of rows 1496 01:12:02,890 --> 01:12:05,950 by having this relationship with another table. 1497 01:12:05,950 --> 01:12:10,000 Meanwhile, ratings work the same way, at least in this case. 1498 01:12:10,000 --> 01:12:12,882 A show has ID, title, year, and episodes. 1499 01:12:12,882 --> 01:12:14,590 But if you want to figure out its rating, 1500 01:12:14,590 --> 01:12:17,710 you have to follow the arrow here, so to speak, and look up 1501 01:12:17,710 --> 01:12:20,020 the corresponding show ID in this table. 1502 01:12:20,020 --> 01:12:23,750 Find the rating of that show and the total number of ratings. 1503 01:12:23,750 --> 01:12:26,470 So that's been factored out too, for better or for worse. 1504 01:12:26,470 --> 01:12:28,600 Now let's consider people. 1505 01:12:28,600 --> 01:12:32,050 People have just three columns, ID, name, and birth. 1506 01:12:32,050 --> 01:12:35,350 But there's no mention of the TV show in which people have starred 1507 01:12:35,350 --> 01:12:38,140 or the TV shows that a person has written. 1508 01:12:38,140 --> 01:12:39,200 Well, why is that? 1509 01:12:39,200 --> 01:12:42,730 Well, if you just had a fourth column here called show, 1510 01:12:42,730 --> 01:12:45,560 well, you would have to decide what show is that person in. 1511 01:12:45,560 --> 01:12:47,560 And no one could ever act again in another show, 1512 01:12:47,560 --> 01:12:49,310 because there's no room to store the data. 1513 01:12:49,310 --> 01:12:53,560 But if someone, of course, a popular actor can star in multiple shows, 1514 01:12:53,560 --> 01:12:57,790 well, we could have one ID for that person, one name, one birth year, 1515 01:12:57,790 --> 01:12:58,360 obviously. 1516 01:12:58,360 --> 01:13:02,230 Like there's only one Steve Carell as an actor in the world of people. 1517 01:13:02,230 --> 01:13:06,610 But Steve Carell in this example could have his person ID, 1518 01:13:06,610 --> 01:13:09,220 whatever his Harvard ID equivalent, Yale ID equivalent is, 1519 01:13:09,220 --> 01:13:12,130 appear in multiple rows in this table so that it 1520 01:13:12,130 --> 01:13:14,470 can be associated with multiple shows. 1521 01:13:14,470 --> 01:13:18,830 And this allows you to create what's called a one to many relationship, 1522 01:13:18,830 --> 01:13:20,500 or technically it's bidirectional. 1523 01:13:20,500 --> 01:13:22,450 It's a many to many relationship. 1524 01:13:22,450 --> 01:13:23,120 Why? 1525 01:13:23,120 --> 01:13:26,470 Well, one show can certainly have multiple people in it 1526 01:13:26,470 --> 01:13:29,090 and multiple people writing for it, just in the real world. 1527 01:13:29,090 --> 01:13:33,910 But conversely, one person could certainly act in multiple shows 1528 01:13:33,910 --> 01:13:35,330 or write multiple shows. 1529 01:13:35,330 --> 01:13:38,380 So this is what you get with relational databases. 1530 01:13:38,380 --> 01:13:41,440 You put your sort of canonical data for people in one place, 1531 01:13:41,440 --> 01:13:46,150 for shows in another place, and then you use these additional tables 1532 01:13:46,150 --> 01:13:48,993 to relate one thing to another. 1533 01:13:48,993 --> 01:13:50,410 So we won't dwell on the pictures. 1534 01:13:50,410 --> 01:13:53,350 That's just if you sort of can wrap your mind around the data 1535 01:13:53,350 --> 01:13:55,870 set better that way, that's one way of thinking about it. 1536 01:13:55,870 --> 01:14:00,177 But recall that the code we just saw for the schema, again, escalated quickly. 1537 01:14:00,177 --> 01:14:02,260 There's a lot of keywords I haven't mentioned yet. 1538 01:14:02,260 --> 01:14:04,360 But some of these are perhaps familiar. 1539 01:14:04,360 --> 01:14:06,130 They're capitalized differently here. 1540 01:14:06,130 --> 01:14:08,140 But integer is on the list here. 1541 01:14:08,140 --> 01:14:10,790 Null is on the list, albeit technically not null. 1542 01:14:10,790 --> 01:14:12,760 So let's tease apart some of these keywords 1543 01:14:12,760 --> 01:14:15,218 and consider what they're actually doing for your database, 1544 01:14:15,218 --> 01:14:18,640 because now we're exploring features that do not exist 1545 01:14:18,640 --> 01:14:20,980 in the world of spreadsheets alone. 1546 01:14:20,980 --> 01:14:24,190 So it turns out in a SQL database, specifically 1547 01:14:24,190 --> 01:14:27,265 SQLite which is the version of SQL we use in CS50 1548 01:14:27,265 --> 01:14:30,910 and which is commonly used for things like mobile applications nowadays. 1549 01:14:30,910 --> 01:14:32,560 It's like a lightweight version of SQL. 1550 01:14:32,560 --> 01:14:35,590 It's when you aren't trying to run Twitter and have billions 1551 01:14:35,590 --> 01:14:37,150 and billions of rows necessarily. 1552 01:14:37,150 --> 01:14:40,730 You've got hundreds, thousands, tens of thousands, maybe even a few million, 1553 01:14:40,730 --> 01:14:44,590 but not crazy numbers, crazy amounts of data. 1554 01:14:44,590 --> 01:14:49,250 In the world of SQLite specifically, there's these five data types. 1555 01:14:49,250 --> 01:14:52,390 So just like in C, we had int and char and the like. 1556 01:14:52,390 --> 01:14:54,190 In SQL, we have these. 1557 01:14:54,190 --> 01:14:57,447 Blob, which is kind of funny, but it just means binary large objects. 1558 01:14:57,447 --> 01:14:58,780 So it's like a binary data type. 1559 01:14:58,780 --> 01:15:02,830 Zeros and ones that aren't necessarily fitting into the other categories. 1560 01:15:02,830 --> 01:15:05,290 Integer, which of course, is an integer as we know it. 1561 01:15:05,290 --> 01:15:09,940 Numeric, which is kind of a catchall for numbers that are formatted specially. 1562 01:15:09,940 --> 01:15:14,710 So like a date would be like year, year, year, year, dash month, month, 1563 01:15:14,710 --> 01:15:16,390 dash day, day. 1564 01:15:16,390 --> 01:15:18,015 And this is actually a wonderful thing. 1565 01:15:18,015 --> 01:15:19,723 Depending on the country you're from, you 1566 01:15:19,723 --> 01:15:22,720 might think your date system in your country is great or it's horrible. 1567 01:15:22,720 --> 01:15:27,040 The US system is horrible because we have month, day, and then 1568 01:15:27,040 --> 01:15:29,350 year, which is impossible to sort. 1569 01:15:29,350 --> 01:15:31,930 It is the wrong way objectively to store data. 1570 01:15:31,930 --> 01:15:33,803 And yet here we are using this at scale. 1571 01:15:33,803 --> 01:15:35,470 Other countries have gotten this better. 1572 01:15:35,470 --> 01:15:38,383 Numeric in SQL itself standardizes that stuff. 1573 01:15:38,383 --> 01:15:40,300 So it doesn't matter what country you're from. 1574 01:15:40,300 --> 01:15:43,200 You're storing your data in this particular way for instance. 1575 01:15:43,200 --> 01:15:46,560 Times are standardized and other types of numeric data as well. 1576 01:15:46,560 --> 01:15:48,630 Real is synonymous with flow. 1577 01:15:48,630 --> 01:15:52,410 So something with a decimal point and some number of digits thereafter. 1578 01:15:52,410 --> 01:15:55,530 And then text is just for strings and the like. 1579 01:15:55,530 --> 01:16:00,240 With other even fancier databases like MySQL, PostgreSQL, Oracle, 1580 01:16:00,240 --> 01:16:02,370 and other products you might have heard of, 1581 01:16:02,370 --> 01:16:04,170 there's even more data types where you have 1582 01:16:04,170 --> 01:16:05,712 to make even finer grained decisions. 1583 01:16:05,712 --> 01:16:07,950 But for SQLite, it's indeed pretty lightweight 1584 01:16:07,950 --> 01:16:12,120 and you or we just have to decide the data types for each column in a table. 1585 01:16:12,120 --> 01:16:15,000 But there's these additional constraints in the world of SQL. 1586 01:16:15,000 --> 01:16:21,400 You can additionally say that cells in this column may or may not be null. 1587 01:16:21,400 --> 01:16:24,660 So if you want to protect yourself from yourself so you don't screw up 1588 01:16:24,660 --> 01:16:26,880 and insert a null, that is a blank value, 1589 01:16:26,880 --> 01:16:31,090 you can explicitly design a table to have a column that cannot be null. 1590 01:16:31,090 --> 01:16:33,090 And so in fact, someone came up during the break 1591 01:16:33,090 --> 01:16:36,450 to ask me about my having manually inserted SQL, 1592 01:16:36,450 --> 01:16:38,923 quote unquote "SQL," into our favorites database. 1593 01:16:38,923 --> 01:16:40,590 You might recall that I kind of cheated. 1594 01:16:40,590 --> 01:16:44,340 I just inserted "SQL" quote unquote and "fiftyville," 1595 01:16:44,340 --> 01:16:46,050 the name of a new problem, quote unquote. 1596 01:16:46,050 --> 01:16:49,170 But what did I not insert into the database? 1597 01:16:49,170 --> 01:16:50,100 A timestamp. 1598 01:16:50,100 --> 01:16:50,880 And I could have. 1599 01:16:50,880 --> 01:16:54,240 I could have put the current day and time a few minutes ago, but I didn't. 1600 01:16:54,240 --> 01:16:58,590 And that's fine if it's acceptable to you and the product you're building. 1601 01:16:58,590 --> 01:17:00,300 But I could have prevented that. 1602 01:17:00,300 --> 01:17:05,100 If we had defined the table to have a timestamp column that isn't just text 1603 01:17:05,100 --> 01:17:08,490 but it's text that's not null, SQL would have complained 1604 01:17:08,490 --> 01:17:10,650 and would not have let me complete that insertion. 1605 01:17:10,650 --> 01:17:13,567 So there's these kinds of built in defenses that you don't necessarily 1606 01:17:13,567 --> 01:17:15,180 get with a spreadsheet alone. 1607 01:17:15,180 --> 01:17:17,170 And unique means exactly that. 1608 01:17:17,170 --> 01:17:20,190 If you want to make sure that every row in that column 1609 01:17:20,190 --> 01:17:24,480 is unique, maybe for email addresses or in the US Social Security 1610 01:17:24,480 --> 01:17:26,190 numbers or anything that you want to make 1611 01:17:26,190 --> 01:17:30,330 sure you don't have two versions of, you can specify that the column is unique. 1612 01:17:30,330 --> 01:17:32,170 And there's other such constraints as well. 1613 01:17:32,170 --> 01:17:34,140 But again, this is just a list of features 1614 01:17:34,140 --> 01:17:37,170 that you get from a proper relational database. 1615 01:17:37,170 --> 01:17:39,990 But perhaps the most intellectually interesting one 1616 01:17:39,990 --> 01:17:42,420 and the most powerful one is what's called 1617 01:17:42,420 --> 01:17:45,000 here a primary key and a foreign key. 1618 01:17:45,000 --> 01:17:47,640 And let me go back now to this output. 1619 01:17:47,640 --> 01:17:51,970 If we look at shows, you'll see that a show, again, 1620 01:17:51,970 --> 01:17:55,320 has an ID, a title, a year, and a number of episodes. 1621 01:17:55,320 --> 01:17:57,120 And now the data types might make sense. 1622 01:17:57,120 --> 01:17:59,820 The ID it turns out, just like a Harvard ID, a Yale ID, 1623 01:17:59,820 --> 01:18:01,150 is going to be an integer. 1624 01:18:01,150 --> 01:18:02,220 So a simple number. 1625 01:18:02,220 --> 01:18:04,680 The title, of course, is going to be text but not null. 1626 01:18:04,680 --> 01:18:07,470 It would be weird if a TV show had no name. 1627 01:18:07,470 --> 01:18:08,615 That can't be. 1628 01:18:08,615 --> 01:18:10,990 The whole world would break or your TV Guide and whatnot. 1629 01:18:10,990 --> 01:18:13,590 So that makes sense there to say not null. 1630 01:18:13,590 --> 01:18:14,670 Year is numeric. 1631 01:18:14,670 --> 01:18:15,990 So it's a standardized form. 1632 01:18:15,990 --> 01:18:19,140 Episodes is an integer, like how many episodes have been produced. 1633 01:18:19,140 --> 01:18:20,790 And then lastly, notice this. 1634 01:18:20,790 --> 01:18:24,090 The primary key of the show's table is apparently 1635 01:18:24,090 --> 01:18:27,880 the column called ID mentioned a few lines earlier. 1636 01:18:27,880 --> 01:18:32,400 This just means that the database will use the ID 1637 01:18:32,400 --> 01:18:34,740 column as the unique identifier. 1638 01:18:34,740 --> 01:18:37,770 So it's similar to the unique keyword, but primary key just 1639 01:18:37,770 --> 01:18:40,260 means the database is going to treat it as special too 1640 01:18:40,260 --> 01:18:43,380 and make sure that it is uniquely identifying your data. 1641 01:18:43,380 --> 01:18:45,300 But what's interesting is this. 1642 01:18:45,300 --> 01:18:48,480 Notice if I scroll back up to people, people 1643 01:18:48,480 --> 01:18:51,480 were sort of similarly structured but with different attributes. 1644 01:18:51,480 --> 01:18:56,310 Like up here we had a person has an ID, a name, a birth 1645 01:18:56,310 --> 01:18:58,800 year, and a primary key of ID. 1646 01:18:58,800 --> 01:19:01,080 So a ID is, again, integer. 1647 01:19:01,080 --> 01:19:03,600 Name is text but not null, because it'd be 1648 01:19:03,600 --> 01:19:06,420 weird to have a human with absolutely no name textually. 1649 01:19:06,420 --> 01:19:07,860 Birth is going to be numeric. 1650 01:19:07,860 --> 01:19:11,590 But the primary key of people is ID as well. 1651 01:19:11,590 --> 01:19:15,150 So those are the unique columns that the database will just treat special. 1652 01:19:15,150 --> 01:19:15,810 Why? 1653 01:19:15,810 --> 01:19:17,310 Well, we just looked at shows. 1654 01:19:17,310 --> 01:19:18,900 We just looked at people. 1655 01:19:18,900 --> 01:19:22,380 Let's focus now on this one down here, stars. 1656 01:19:22,380 --> 01:19:25,260 How do you determine who stars in a TV show? 1657 01:19:25,260 --> 01:19:26,370 Well, we had two columns. 1658 01:19:26,370 --> 01:19:28,320 The show ID and the person ID. 1659 01:19:28,320 --> 01:19:31,080 This is the incarnation of a many to many relationship. 1660 01:19:31,080 --> 01:19:32,550 One person could be in many shows. 1661 01:19:32,550 --> 01:19:35,710 One show could certainly have many people in it or writing for it. 1662 01:19:35,710 --> 01:19:37,250 But notice this. 1663 01:19:37,250 --> 01:19:41,220 Within this table of two columns, show ID and person ID, 1664 01:19:41,220 --> 01:19:44,070 there's what's going to be called a foreign key called 1665 01:19:44,070 --> 01:19:49,350 show ID that references the show's table's ID column 1666 01:19:49,350 --> 01:19:51,367 and then another foreign key called person ID, 1667 01:19:51,367 --> 01:19:53,700 though I could call these things in parentheses anything 1668 01:19:53,700 --> 01:19:57,495 I want, that references the people table's ID column. 1669 01:19:57,495 --> 01:20:00,120 Now, you're not going to often have to type commands like this. 1670 01:20:00,120 --> 01:20:02,700 Again, you set the database up once in the beginning 1671 01:20:02,700 --> 01:20:04,590 typically, maybe with some help from a TF, 1672 01:20:04,590 --> 01:20:06,210 maybe with help of Google or the like. 1673 01:20:06,210 --> 01:20:09,203 But once your database is designed, it's back to the CRUD. 1674 01:20:09,203 --> 01:20:12,120 Create, read, update, delete, the selects, the inserts, the deletions, 1675 01:20:12,120 --> 01:20:12,690 and the like. 1676 01:20:12,690 --> 01:20:14,310 But what's this implying? 1677 01:20:14,310 --> 01:20:17,520 These keywords like primary key and foreign key 1678 01:20:17,520 --> 01:20:23,100 are what are doing in code what this picture was painting a moment ago. 1679 01:20:23,100 --> 01:20:27,570 These lines here are drawn literally to line up with the corresponding things. 1680 01:20:27,570 --> 01:20:30,660 People's ID lines up with person ID. 1681 01:20:30,660 --> 01:20:34,410 Show's ID lines up with show ID. 1682 01:20:34,410 --> 01:20:37,950 And so you're just seeing graphical version, code version, graphical, 1683 01:20:37,950 --> 01:20:40,860 code that creates these relationships. 1684 01:20:40,860 --> 01:20:45,190 Now, given that, let's actually see what these things look like. 1685 01:20:45,190 --> 01:20:47,160 So let me go back to VS Code here. 1686 01:20:47,160 --> 01:20:48,400 Let me clear my screen. 1687 01:20:48,400 --> 01:20:51,000 I'm still within SQLite with shows.db. 1688 01:20:51,000 --> 01:20:53,480 Let me go ahead and do what I do with any new database. 1689 01:20:53,480 --> 01:20:55,230 If I ever download something or I'm trying 1690 01:20:55,230 --> 01:20:57,600 to wrap my mind around a problem, usually it 1691 01:20:57,600 --> 01:21:01,230 doesn't come with a pretty picture or a three hour lecture 1692 01:21:01,230 --> 01:21:02,670 to explain what the data set is. 1693 01:21:02,670 --> 01:21:06,190 Rather you just have the data set in your own knowledge of SQL. 1694 01:21:06,190 --> 01:21:07,440 So let me play around. 1695 01:21:07,440 --> 01:21:09,608 So .schema shows me all of the tables. 1696 01:21:09,608 --> 01:21:10,900 That might be a starting point. 1697 01:21:10,900 --> 01:21:11,900 OK, this is interesting. 1698 01:21:11,900 --> 01:21:13,560 I know what people are. 1699 01:21:13,560 --> 01:21:16,530 Let's go ahead and show me all the people. 1700 01:21:16,530 --> 01:21:18,595 So select star from people. 1701 01:21:18,595 --> 01:21:20,970 I'm just trying to wrap my mind around what this data set 1702 01:21:20,970 --> 01:21:24,330 looks like in a more user friendly way. 1703 01:21:24,330 --> 01:21:26,607 That's already a lot of people. 1704 01:21:26,607 --> 01:21:28,440 As you see the years flying by, there's been 1705 01:21:28,440 --> 01:21:31,170 a lot of people in the TV business. 1706 01:21:31,170 --> 01:21:33,540 So this was maybe not the best query to run. 1707 01:21:33,540 --> 01:21:39,300 But this is indicative of just how large this data set is from IMDb. 1708 01:21:39,300 --> 01:21:42,960 When in doubt and whenever you lose control over your computer, 1709 01:21:42,960 --> 01:21:44,940 Control C is your friend to interrupt. 1710 01:21:44,940 --> 01:21:46,860 What would have been better, because I don't 1711 01:21:46,860 --> 01:21:49,980 think I need to know all of the million people in the world, 1712 01:21:49,980 --> 01:21:52,470 I could do limit me to 10 people. 1713 01:21:52,470 --> 01:21:55,860 And that's enough now to get a sense of Fred Astaire 1714 01:21:55,860 --> 01:21:58,920 has an ID of one, the first person ever. 1715 01:21:58,920 --> 01:22:00,660 Birth year of 1899. 1716 01:22:00,660 --> 01:22:04,740 Lauren Bacall and all of these other people from yesteryear. 1717 01:22:04,740 --> 01:22:07,410 You see that they are the first 10 people in the database. 1718 01:22:07,410 --> 01:22:09,160 So there's an example of some of the data. 1719 01:22:09,160 --> 01:22:12,540 Now if I want to wrap my mind around what a show is, I know it technically. 1720 01:22:12,540 --> 01:22:13,800 I know it from the picture. 1721 01:22:13,800 --> 01:22:15,400 But let's just look at some raw data. 1722 01:22:15,400 --> 01:22:18,210 So instead of saying select star from people, let me go ahead 1723 01:22:18,210 --> 01:22:21,480 and select star from shows limit 10. 1724 01:22:21,480 --> 01:22:26,250 And OK, I've only heard of or seen a couple of these, 1725 01:22:26,250 --> 01:22:27,970 but these are older shows at that. 1726 01:22:27,970 --> 01:22:31,530 But I see that every show has an ID, a title, a year in which it debuted, 1727 01:22:31,530 --> 01:22:32,910 and a number of episodes. 1728 01:22:32,910 --> 01:22:35,340 But perhaps most opaque is going to be this. 1729 01:22:35,340 --> 01:22:41,520 Select star from stars where this is the table that associates people 1730 01:22:41,520 --> 01:22:42,330 with shows. 1731 01:22:42,330 --> 01:22:46,950 Am I going to see any names or show titles here? 1732 01:22:46,950 --> 01:22:49,127 Not according to the definition we saw earlier. 1733 01:22:49,127 --> 01:22:50,460 Oh, I should have done my limit. 1734 01:22:50,460 --> 01:22:51,660 Let me interrupt that. 1735 01:22:51,660 --> 01:22:52,620 Let me do that again. 1736 01:22:52,620 --> 01:22:54,180 Limit 10. 1737 01:22:54,180 --> 01:22:55,050 No. 1738 01:22:55,050 --> 01:22:58,380 And this is where now you're definitely in the programmer world, 1739 01:22:58,380 --> 01:23:01,780 because this would be the most annoying spreadsheet to use on your Mac or PC 1740 01:23:01,780 --> 01:23:02,280 ever. 1741 01:23:02,280 --> 01:23:04,528 If you just had a sheet with all of these numbers 1742 01:23:04,528 --> 01:23:06,570 that associates one thing with the other, my God, 1743 01:23:06,570 --> 01:23:08,910 how do you figure out who this is or what this is? 1744 01:23:08,910 --> 01:23:12,780 You have to manually Control F or Command F looking for the data. 1745 01:23:12,780 --> 01:23:14,310 But a database doesn't care. 1746 01:23:14,310 --> 01:23:18,160 Once you know SQL, you can stitch these things back together. 1747 01:23:18,160 --> 01:23:21,360 So what you're seeing here are foreign keys. 1748 01:23:21,360 --> 01:23:22,150 Foreign keys. 1749 01:23:22,150 --> 01:23:22,650 Why? 1750 01:23:22,650 --> 01:23:27,600 Because show ID corresponds to the same numbers from that other table 1751 01:23:27,600 --> 01:23:30,990 called shows that has a proper primary key called ID. 1752 01:23:30,990 --> 01:23:33,360 Person ID is a foreign key in this context, 1753 01:23:33,360 --> 01:23:38,620 because it refers to numbers that belong to really the people table and its ID 1754 01:23:38,620 --> 01:23:39,120 column. 1755 01:23:39,120 --> 01:23:41,037 So this is just a way of somehow linking them. 1756 01:23:41,037 --> 01:23:44,100 And so if you think of I always think of this in my mind's eye as this. 1757 01:23:44,100 --> 01:23:47,760 If this is the people table, this is the shows table, 1758 01:23:47,760 --> 01:23:51,690 and there's this middle table in between, the stars table. 1759 01:23:51,690 --> 01:23:56,490 There's some way of stitching those two together by lining up the IDs of one 1760 01:23:56,490 --> 01:23:58,870 with the other and getting back some more data. 1761 01:23:58,870 --> 01:24:01,690 So let's actually play with some of this data. 1762 01:24:01,690 --> 01:24:05,148 How about we start where we emphasized earlier, genres. 1763 01:24:05,148 --> 01:24:08,440 So let me go ahead and take a quick look at all of the genres in this database. 1764 01:24:08,440 --> 01:24:10,500 So select star from genres. 1765 01:24:10,500 --> 01:24:13,030 Star is usually going to be a little overwhelming, 1766 01:24:13,030 --> 01:24:15,240 but it just gives me a sense of what the data is. 1767 01:24:15,240 --> 01:24:17,850 But let's actually look at-- 1768 01:24:17,850 --> 01:24:19,440 let's go look at all of them there. 1769 01:24:19,440 --> 01:24:20,850 OK, that's a lot. 1770 01:24:20,850 --> 01:24:24,210 These are all official genres from IMDb. 1771 01:24:24,210 --> 01:24:27,030 OK, it wasn't terribly long. 1772 01:24:27,030 --> 01:24:28,350 Let me filter that down. 1773 01:24:28,350 --> 01:24:34,800 So from genres where genre equals Comedy, capital C just based 1774 01:24:34,800 --> 01:24:36,000 on the data I'm seeing. 1775 01:24:36,000 --> 01:24:38,220 OK, so what am I seeing now? 1776 01:24:38,220 --> 01:24:40,838 And in fact, let me limit this arbitrarily to 10, 1777 01:24:40,838 --> 01:24:42,630 though I could limit it to anything I want. 1778 01:24:42,630 --> 01:24:45,060 Here are 10 comedies. 1779 01:24:45,060 --> 01:24:46,080 What are they? 1780 01:24:46,080 --> 01:24:47,490 Well, who the heck knows? 1781 01:24:47,490 --> 01:24:49,590 All I know are the 10 show IDs. 1782 01:24:49,590 --> 01:24:51,180 Now, I could do something like this. 1783 01:24:51,180 --> 01:24:54,037 As we've seen before with SQL, I could do, all right, well 1784 01:24:54,037 --> 01:24:55,620 let's figure out what this show ID is. 1785 01:24:55,620 --> 01:25:03,750 Select star from shows where the ID of the show I'm looking for equals what? 1786 01:25:03,750 --> 01:25:06,910 62614 semicolon. 1787 01:25:06,910 --> 01:25:11,200 So I could manually look it up by cross referencing the other table. 1788 01:25:11,200 --> 01:25:14,860 So that was the show in question there, the first comedy in the data set. 1789 01:25:14,860 --> 01:25:16,200 Let me look up the second one. 1790 01:25:16,200 --> 01:25:20,610 So instead of that, let's do 63881 Enter. 1791 01:25:20,610 --> 01:25:21,150 OK. 1792 01:25:21,150 --> 01:25:22,410 So that's that show. 1793 01:25:22,410 --> 01:25:23,640 And let's do one more. 1794 01:25:23,640 --> 01:25:27,210 And suffice it to say, this is just getting tedious and vulnerable 1795 01:25:27,210 --> 01:25:29,520 to mistakes quickly. 1796 01:25:29,520 --> 01:25:31,710 This surely can't be the way to do this. 1797 01:25:31,710 --> 01:25:35,280 And indeed, SQL is going to let us do this a little more powerfully instead. 1798 01:25:35,280 --> 01:25:36,790 Let's do this. 1799 01:25:36,790 --> 01:25:39,210 Instead of getting this table temporarily 1800 01:25:39,210 --> 01:25:42,640 with all these show IDs and all these genres, let's refine the query. 1801 01:25:42,640 --> 01:25:49,920 So let's just select the show ID from the genres table where the genre equals 1802 01:25:49,920 --> 01:25:51,810 quote unquote "comedy." 1803 01:25:51,810 --> 01:25:56,670 Now I have a big list of show IDs, all of which are comedy. 1804 01:25:56,670 --> 01:25:57,205 How many? 1805 01:25:57,205 --> 01:25:58,830 Well, I can combine ideas from earlier. 1806 01:25:58,830 --> 01:26:03,582 I can just count all of those show IDs or star if I want to just do that too. 1807 01:26:03,582 --> 01:26:05,040 But I can count all those show IDs. 1808 01:26:05,040 --> 01:26:09,880 48,706 comedies on IMDb's database for TV shows. 1809 01:26:09,880 --> 01:26:11,140 So feels like a lot. 1810 01:26:11,140 --> 01:26:14,010 But how can I now use that information and get back 1811 01:26:14,010 --> 01:26:18,100 the titles of comedies in the database without doing it manually? 1812 01:26:18,100 --> 01:26:19,720 Well, let's do this. 1813 01:26:19,720 --> 01:26:21,270 I have a moment ago this query. 1814 01:26:21,270 --> 01:26:25,710 Select the show ID from genres where the current genre 1815 01:26:25,710 --> 01:26:29,295 is quote unquote "comedy." 1816 01:26:29,295 --> 01:26:31,920 What if I kind of nest these queries, kind of like grade school 1817 01:26:31,920 --> 01:26:33,030 math in parentheses? 1818 01:26:33,030 --> 01:26:35,940 What if I combine this whole thing in parentheses? 1819 01:26:35,940 --> 01:26:38,040 And now let me select what I really want. 1820 01:26:38,040 --> 01:26:43,950 Let me go ahead and select how about the title of all 1821 01:26:43,950 --> 01:26:52,420 shows where the idea of the show is in this list of show IDs. 1822 01:26:52,420 --> 01:26:59,440 So if you agree that the shows table has an ID column, which is otherwise 1823 01:26:59,440 --> 01:27:02,170 known as its primary key, the unique ID that identifies it, just 1824 01:27:02,170 --> 01:27:06,490 like our Harvard IDs, our Yale IDs, and you agree that per a moment ago 1825 01:27:06,490 --> 01:27:09,580 this shorter query will give me back just the show 1826 01:27:09,580 --> 01:27:13,840 IDs of all of the comedies in the database, 1827 01:27:13,840 --> 01:27:17,710 you can actually combine or nest these queries together. 1828 01:27:17,710 --> 01:27:19,843 It's going to respect SQLite order of operations 1829 01:27:19,843 --> 01:27:21,760 with parentheses, just like grade school math. 1830 01:27:21,760 --> 01:27:24,070 So the thing in parentheses will be executed first. 1831 01:27:24,070 --> 01:27:27,340 That gives it back a list of IDs, like 48,000 IDs. 1832 01:27:27,340 --> 01:27:30,160 And then this query, the outer query, is going 1833 01:27:30,160 --> 01:27:33,790 to get the title from all of the shows where the ID of the show 1834 01:27:33,790 --> 01:27:36,280 is in that big list of 48,000. 1835 01:27:36,280 --> 01:27:39,233 So if I now execute these together, I think 1836 01:27:39,233 --> 01:27:42,400 the list is still going to be a little long, but let me execute it together. 1837 01:27:42,400 --> 01:27:45,880 Now I see this long list of outputs. 1838 01:27:45,880 --> 01:27:46,900 A little overwhelming. 1839 01:27:46,900 --> 01:27:50,650 Let's go ahead and maybe limit it to just 10 1840 01:27:50,650 --> 01:27:52,460 as before for discussion's sake. 1841 01:27:52,460 --> 01:27:56,110 And now I see 10 comedies ordered arbitrarily 1842 01:27:56,110 --> 01:27:58,330 from however they're in the database that happen 1843 01:27:58,330 --> 01:28:01,420 to indeed have comedy as their genre. 1844 01:28:01,420 --> 01:28:04,310 If I want to do this a little more cleanly, I could do this. 1845 01:28:04,310 --> 01:28:05,260 Let's see. 1846 01:28:05,260 --> 01:28:09,640 Why don't I order by title ascending order, which is alphabetically, 1847 01:28:09,640 --> 01:28:12,190 or the default is also an ascending. 1848 01:28:12,190 --> 01:28:13,240 Limit 10. 1849 01:28:13,240 --> 01:28:18,250 Now I see the top 10, I mean, weirdly named things with hash symbols 1850 01:28:18,250 --> 01:28:22,360 presumably to get their titles up to the beginning or maybe these are hashtags. 1851 01:28:22,360 --> 01:28:27,550 Here now we have alphabetically the first 10 shows that are comedies. 1852 01:28:27,550 --> 01:28:31,705 Any questions on these kinds of queries? 1853 01:28:31,705 --> 01:28:33,580 It's kind of a lot, but at the same time it's 1854 01:28:33,580 --> 01:28:37,540 just like composing the smaller ideas from before into slightly more 1855 01:28:37,540 --> 01:28:39,820 useful queries. 1856 01:28:39,820 --> 01:28:40,360 Yeah. 1857 01:28:40,360 --> 01:28:43,348 AUDIENCE: [INAUDIBLE] 1858 01:28:43,348 --> 01:28:46,343 1859 01:28:46,343 --> 01:28:48,760 DAVID MALAN: Do foreign keys have to set the relationship? 1860 01:28:48,760 --> 01:28:51,850 When you create the table, the programmer or the database 1861 01:28:51,850 --> 01:28:54,280 administrator would create that relationship 1862 01:28:54,280 --> 01:28:57,700 by using those keywords primary key and foreign key that 1863 01:28:57,700 --> 01:29:01,330 teaches the database what is related to what per the picture. 1864 01:29:01,330 --> 01:29:02,530 So you do that once. 1865 01:29:02,530 --> 01:29:07,390 And now I being the sort of programmer who's familiar with the database, 1866 01:29:07,390 --> 01:29:13,750 I am just using these foreign keys in a manner consistent with their design. 1867 01:29:13,750 --> 01:29:15,680 And this is where it's useful at some point, 1868 01:29:15,680 --> 01:29:18,940 even if no one hands you a picture, to make sure you understand the database, 1869 01:29:18,940 --> 01:29:22,180 because that's going to inform literally what you type in SQL 1870 01:29:22,180 --> 01:29:23,740 to get the data you care about. 1871 01:29:23,740 --> 01:29:26,060 Well, let's do something a little more precise. 1872 01:29:26,060 --> 01:29:27,860 How about-- very reasonable question. 1873 01:29:27,860 --> 01:29:30,970 And honestly, this is exactly what imdb.com and the app or for. 1874 01:29:30,970 --> 01:29:34,600 What if you want to find all of the shows that Steve Carell is in? 1875 01:29:34,600 --> 01:29:36,160 Kind of a reasonable query. 1876 01:29:36,160 --> 01:29:40,150 Literally something someone might type into Google or more specifically IMDb. 1877 01:29:40,150 --> 01:29:42,130 It's not really obvious at first glance how 1878 01:29:42,130 --> 01:29:44,795 to do that, though, because from my database, 1879 01:29:44,795 --> 01:29:47,170 if these are my six tables, well, I can pretty easily get 1880 01:29:47,170 --> 01:29:48,460 Steve Carell from here. 1881 01:29:48,460 --> 01:29:51,105 But I can really only get his ID number, whatever that is, 1882 01:29:51,105 --> 01:29:53,230 his name, which I know already, and his birth year. 1883 01:29:53,230 --> 01:29:56,200 OK, interesting but has nothing to do with the shows that he's in. 1884 01:29:56,200 --> 01:30:00,850 I can look at shows over here, but there's no mention of Steve Carell 1885 01:30:00,850 --> 01:30:02,710 because there's no person ID here. 1886 01:30:02,710 --> 01:30:04,660 Where is that relationship implemented? 1887 01:30:04,660 --> 01:30:06,590 Well, it's implemented down here. 1888 01:30:06,590 --> 01:30:07,910 So how do we do this? 1889 01:30:07,910 --> 01:30:10,210 Well, here's the perfect example of a lesson 1890 01:30:10,210 --> 01:30:13,450 we've been trying to emphasize for weeks of taking these baby steps. 1891 01:30:13,450 --> 01:30:17,750 Break larger problems down into smaller ones and let's do something like this. 1892 01:30:17,750 --> 01:30:20,740 Let's just get everything I know about Steve Carell from the database. 1893 01:30:20,740 --> 01:30:24,190 Let's select star from people where the name of the person 1894 01:30:24,190 --> 01:30:26,890 is quote unquote "Steve Carell." 1895 01:30:26,890 --> 01:30:28,630 I just want to see what data we've got. 1896 01:30:28,630 --> 01:30:30,640 And here's what we have. 1897 01:30:30,640 --> 01:30:33,130 There's only one Steve Carell born in 1962 1898 01:30:33,130 --> 01:30:37,150 and his unique ID is 136797 according to IMDb. 1899 01:30:37,150 --> 01:30:40,930 This isn't some global actor identifier, per se. 1900 01:30:40,930 --> 01:30:45,010 All right, well how do I get now all of the shows that Steve Carell is in? 1901 01:30:45,010 --> 01:30:46,130 Well, I could do this. 1902 01:30:46,130 --> 01:30:51,190 Select star from stars, not to confuse the two. 1903 01:30:51,190 --> 01:30:53,500 One's the symbol, one's the table name. 1904 01:30:53,500 --> 01:30:58,690 Where person ID equals 136797. 1905 01:30:58,690 --> 01:31:03,070 So I think this will now give me everything from the stars table 1906 01:31:03,070 --> 01:31:04,943 that relates to Steve Carell. 1907 01:31:04,943 --> 01:31:07,360 And you'll see person ID is the same because I'm literally 1908 01:31:07,360 --> 01:31:08,693 searching for just Steve Carell. 1909 01:31:08,693 --> 01:31:12,200 But there are like 20 or so shows that he's been in. 1910 01:31:12,200 --> 01:31:14,450 All right, well here's where things would get tedious. 1911 01:31:14,450 --> 01:31:15,340 What are those shows? 1912 01:31:15,340 --> 01:31:21,460 Well, I could do select title from shows where the ID of the show equals. 1913 01:31:21,460 --> 01:31:25,670 And here's whenever you copy paste, you're probably doing something wrong. 1914 01:31:25,670 --> 01:31:27,890 OK, he was in The Dana Carvey Show. 1915 01:31:27,890 --> 01:31:28,715 Familiar with that. 1916 01:31:28,715 --> 01:31:29,590 Let's do another one. 1917 01:31:29,590 --> 01:31:31,460 We'll copy paste this. 1918 01:31:31,460 --> 01:31:34,090 Where ID equals this. 1919 01:31:34,090 --> 01:31:34,890 Over The Top. 1920 01:31:34,890 --> 01:31:35,390 Another. 1921 01:31:35,390 --> 01:31:37,000 And if we keep digging, we'll probably find The Office. 1922 01:31:37,000 --> 01:31:40,240 But my God, that's going to take forever to do 20 queries manually. 1923 01:31:40,240 --> 01:31:41,252 It's not very dynamic. 1924 01:31:41,252 --> 01:31:43,960 But what if we just nest these queries a little more dynamically? 1925 01:31:43,960 --> 01:31:45,830 So let me start from the beginning again. 1926 01:31:45,830 --> 01:31:48,890 What if we go ahead and select everything 1927 01:31:48,890 --> 01:31:56,410 we know about people whose name equals Steve Carell. 1928 01:31:56,410 --> 01:31:58,220 That gave us earlier this data. 1929 01:31:58,220 --> 01:31:59,470 I don't need all of that data. 1930 01:31:59,470 --> 01:32:00,100 I know his name. 1931 01:32:00,100 --> 01:32:01,517 I don't care about his birth year. 1932 01:32:01,517 --> 01:32:05,140 So let's change this to just be give me the ID of Steve Carell. 1933 01:32:05,140 --> 01:32:08,390 And that gives me back now this smaller temporary data set. 1934 01:32:08,390 --> 01:32:08,890 All right. 1935 01:32:08,890 --> 01:32:12,195 Can I now use this inside of another query? 1936 01:32:12,195 --> 01:32:14,320 Well, let me wrap the whole thing with parentheses. 1937 01:32:14,320 --> 01:32:18,850 And now let me say select star from the stars table 1938 01:32:18,850 --> 01:32:22,477 where the person ID equals this. 1939 01:32:22,477 --> 01:32:24,310 So I'm deliberately not using in because I'm 1940 01:32:24,310 --> 01:32:26,780 assuming there's indeed only one Steve Carell in the world. 1941 01:32:26,780 --> 01:32:28,480 So I'm not getting back a list of Steve Carells. 1942 01:32:28,480 --> 01:32:30,640 I'm getting back the one and only in this case. 1943 01:32:30,640 --> 01:32:31,810 So equal is fine. 1944 01:32:31,810 --> 01:32:33,310 In is when you have multiple. 1945 01:32:33,310 --> 01:32:34,840 Equal is when you have one. 1946 01:32:34,840 --> 01:32:36,375 Let me go ahead and hit Enter now. 1947 01:32:36,375 --> 01:32:37,750 OK, that's more data than I need. 1948 01:32:37,750 --> 01:32:40,430 I don't need like 20 copies of Steve Carell's person ID. 1949 01:32:40,430 --> 01:32:41,380 So let me hit up. 1950 01:32:41,380 --> 01:32:46,420 Let me go back and let me just get show ID from Steve Carell. 1951 01:32:46,420 --> 01:32:52,070 And now I have a list of just the 20 or so show IDs that he has been in. 1952 01:32:52,070 --> 01:32:52,570 All right. 1953 01:32:52,570 --> 01:32:53,597 How can I now use this? 1954 01:32:53,597 --> 01:32:54,430 Well, let me hit up. 1955 01:32:54,430 --> 01:32:57,040 Let me put the whole thing in parentheses. 1956 01:32:57,040 --> 01:32:58,960 And now let me select what I really want. 1957 01:32:58,960 --> 01:33:03,340 Select title from shows where. 1958 01:33:03,340 --> 01:33:05,020 And here's the final flourish. 1959 01:33:05,020 --> 01:33:11,350 The shows table has an ID, has a title, has a year, and has an episode. 1960 01:33:11,350 --> 01:33:21,610 And what I really want, though, is to check which shows have ID that is what? 1961 01:33:21,610 --> 01:33:24,510 Anyone want to finish the thought? 1962 01:33:24,510 --> 01:33:25,860 I just want to-- yeah. 1963 01:33:25,860 --> 01:33:28,435 AUDIENCE: [INAUDIBLE] 1964 01:33:28,435 --> 01:33:29,310 DAVID MALAN: Exactly. 1965 01:33:29,310 --> 01:33:30,570 ID in this. 1966 01:33:30,570 --> 01:33:31,690 And this is getting ugly. 1967 01:33:31,690 --> 01:33:33,810 And when you actually write your queries in a text file, 1968 01:33:33,810 --> 01:33:35,730 you can format them nicely and indent them. 1969 01:33:35,730 --> 01:33:36,772 My font is just getting-- 1970 01:33:36,772 --> 01:33:38,938 I don't want to make it too small to fit everything. 1971 01:33:38,938 --> 01:33:40,230 But now we have three queries. 1972 01:33:40,230 --> 01:33:43,380 One is in doubly nested parentheses, then there's the middle one, 1973 01:33:43,380 --> 01:33:44,550 then there's the outer one. 1974 01:33:44,550 --> 01:33:47,310 So this last query is going to get me the title from shows 1975 01:33:47,310 --> 01:33:50,460 where the ID of the show is in this big list of 20 1976 01:33:50,460 --> 01:33:52,410 or so show IDs that Steve Carell is in. 1977 01:33:52,410 --> 01:33:54,812 And I knew that because I looked up his name here. 1978 01:33:54,812 --> 01:33:57,270 And notice what I did not do this time is I didn't manually 1979 01:33:57,270 --> 01:33:58,620 hardcode his ID number. 1980 01:33:58,620 --> 01:33:59,430 There's no need. 1981 01:33:59,430 --> 01:34:02,508 That would be kind of a bad way to implement a website if you're 1982 01:34:02,508 --> 01:34:04,050 using a database underneath the hood. 1983 01:34:04,050 --> 01:34:07,470 You want the IMDb for real to search for whatever 1984 01:34:07,470 --> 01:34:11,010 the human typed in and no one's going to know Steve Carell's person ID 1985 01:34:11,010 --> 01:34:12,070 or anything else. 1986 01:34:12,070 --> 01:34:14,580 So here we've done this all dynamically. 1987 01:34:14,580 --> 01:34:18,690 And now if I hit Enter, I think I get all of his shows. 1988 01:34:18,690 --> 01:34:21,840 Let's go ahead and order this by title just to make it tidy. 1989 01:34:21,840 --> 01:34:24,810 And you probably will see at least one or more shows that. 1990 01:34:24,810 --> 01:34:29,520 And probably the most popular is, dot dot dot, The Office. 1991 01:34:29,520 --> 01:34:32,280 So this is literally the kind of query that's 1992 01:34:32,280 --> 01:34:37,200 being executed underneath the hood when you go to websites or apps like IMDb. 1993 01:34:37,200 --> 01:34:42,030 Your textual query is probably being plugged into a longer SQL query 1994 01:34:42,030 --> 01:34:44,850 like this where some programmer at IMDb probably 1995 01:34:44,850 --> 01:34:48,420 wrote this whole query in advance weeks, months, years ago 1996 01:34:48,420 --> 01:34:52,470 and they're just somehow plugging in the value that you the human 1997 01:34:52,470 --> 01:34:55,550 typed into the search box or the like. 1998 01:34:55,550 --> 01:35:03,010 Questions now on finding this data or any other? 1999 01:35:03,010 --> 01:35:03,890 No? 2000 01:35:03,890 --> 01:35:04,390 OK. 2001 01:35:04,390 --> 01:35:06,980 So where else could we go with this? 2002 01:35:06,980 --> 01:35:12,700 Well, let's consider how else we might combine data. 2003 01:35:12,700 --> 01:35:16,060 Suppose that the next question actually perhaps appropriately 2004 01:35:16,060 --> 01:35:19,780 would be focusing in on not just people and shows and these stars, 2005 01:35:19,780 --> 01:35:23,680 but how do we gather more information about the shows themselves, 2006 01:35:23,680 --> 01:35:25,430 like the genres, the ratings, or the like. 2007 01:35:25,430 --> 01:35:28,030 So indeed, let's focus on just these two tables here. 2008 01:35:28,030 --> 01:35:31,640 Recall that every show has an ID, a title, a year, and episodes. 2009 01:35:31,640 --> 01:35:34,660 But it also might have one or more relationships with rows 2010 01:35:34,660 --> 01:35:36,280 and this other table called genres. 2011 01:35:36,280 --> 01:35:39,250 And this is so that a show can be a comedy, can be a drama, 2012 01:35:39,250 --> 01:35:40,960 can be any number of other things. 2013 01:35:40,960 --> 01:35:42,400 One row per. 2014 01:35:42,400 --> 01:35:45,040 So you would see the same show ID again and again 2015 01:35:45,040 --> 01:35:48,760 and again with a different genre written in English 2016 01:35:48,760 --> 01:35:51,070 like comedy, drama, or the like. 2017 01:35:51,070 --> 01:35:53,890 Well, how do I kind of reconstitute that data? 2018 01:35:53,890 --> 01:35:56,710 Well, turns out there's a few different ways to do this. 2019 01:35:56,710 --> 01:36:00,610 And let me propose that we introduce this keyword here, join. 2020 01:36:00,610 --> 01:36:05,525 And this is really the most powerful of the keywords in SQL itself. 2021 01:36:05,525 --> 01:36:06,650 It doesn't have to be used. 2022 01:36:06,650 --> 01:36:09,310 We've seen with nested queries that you can still 2023 01:36:09,310 --> 01:36:13,490 select data across multiple tables, but here is another way. 2024 01:36:13,490 --> 01:36:14,660 So let me do this. 2025 01:36:14,660 --> 01:36:17,710 Let me go back to my SQLite database. 2026 01:36:17,710 --> 01:36:23,680 And let me select sort of in one breath exactly the data I want. 2027 01:36:23,680 --> 01:36:26,980 Select star from shows. 2028 01:36:26,980 --> 01:36:30,500 And let's just limit this initially to 10 to see what it looks like. 2029 01:36:30,500 --> 01:36:31,000 All right. 2030 01:36:31,000 --> 01:36:32,250 That's, again, the shows data. 2031 01:36:32,250 --> 01:36:34,690 Select star from genres. 2032 01:36:34,690 --> 01:36:37,760 Let's limit that to 10 too, just to wrap our minds around it. 2033 01:36:37,760 --> 01:36:39,730 And now this is not that useful. 2034 01:36:39,730 --> 01:36:43,510 However, the data in the leftmost column here 2035 01:36:43,510 --> 01:36:45,490 is the primary key in the shows table. 2036 01:36:45,490 --> 01:36:46,870 These are just unique IDs. 2037 01:36:46,870 --> 01:36:51,530 The data here in the genres table, recall, show ID is the foreign key. 2038 01:36:51,530 --> 01:36:54,310 So it's the same numbers but just copied into another table 2039 01:36:54,310 --> 01:36:56,650 so that we can have this relationship across them. 2040 01:36:56,650 --> 01:37:00,070 How do I kind of line up these numbers with these numbers 2041 01:37:00,070 --> 01:37:06,430 to get back a wider table that has title and year and episodes and genre and, 2042 01:37:06,430 --> 01:37:08,950 heck, ratings and all of that too if we want? 2043 01:37:08,950 --> 01:37:11,230 Well, you can join these tables by just telling 2044 01:37:11,230 --> 01:37:12,890 the database what to join on what. 2045 01:37:12,890 --> 01:37:13,940 So let me do this. 2046 01:37:13,940 --> 01:37:17,110 Select star from shows. 2047 01:37:17,110 --> 01:37:21,760 Join that table though on the genres table. 2048 01:37:21,760 --> 01:37:24,160 Well, how do you want to join those two tables? 2049 01:37:24,160 --> 01:37:26,860 And again, the two tables from the picture looked like this. 2050 01:37:26,860 --> 01:37:30,400 How do you tell SQL programmatically to put one of them 2051 01:37:30,400 --> 01:37:32,420 right next to the other, line up all of the ID 2052 01:37:32,420 --> 01:37:34,820 so that you just get one larger data set? 2053 01:37:34,820 --> 01:37:39,160 Well, we can use indeed this syntax called join. 2054 01:37:39,160 --> 01:37:41,140 So back to VS Code here. 2055 01:37:41,140 --> 01:37:44,380 And let me join these two tables. 2056 01:37:44,380 --> 01:37:45,970 Sorry, typo here. 2057 01:37:45,970 --> 01:37:51,940 Join genres on the shows table's ID column, a.k.a. 2058 01:37:51,940 --> 01:37:58,240 its primary key, equaling the genres table's show ID column, a.k.a. 2059 01:37:58,240 --> 01:37:59,210 the foreign key. 2060 01:37:59,210 --> 01:38:01,090 So in other words, it looks a little cryptic, 2061 01:38:01,090 --> 01:38:05,080 but I'm just telling SQL how to line up these two tables and what column 2062 01:38:05,080 --> 01:38:08,230 to match with the other so that the numbers line up 2063 01:38:08,230 --> 01:38:10,610 and I get essentially a wider table. 2064 01:38:10,610 --> 01:38:14,930 Let me go ahead and hit semicolon and Enter. 2065 01:38:14,930 --> 01:38:17,330 And this is now going to give me a lot of data. 2066 01:38:17,330 --> 01:38:18,600 We might have to interrupt it. 2067 01:38:18,600 --> 01:38:22,650 But notice even at a glance, we're getting the ID, the title, the year, 2068 01:38:22,650 --> 01:38:25,130 the number of episodes, the ID again redundantly, 2069 01:38:25,130 --> 01:38:27,140 but that's to be expected if I'm joining them, 2070 01:38:27,140 --> 01:38:29,360 and the genre all the way on the right. 2071 01:38:29,360 --> 01:38:31,740 Let me hit Control C to interrupt. 2072 01:38:31,740 --> 01:38:33,890 Let me just limit this to The Office. 2073 01:38:33,890 --> 01:38:36,500 So where title equals quote unquote "The Office" 2074 01:38:36,500 --> 01:38:39,680 so we can focus on just one sample data. 2075 01:38:39,680 --> 01:38:43,490 And here, fun fact, there's been more than one Office. 2076 01:38:43,490 --> 01:38:46,100 The one that you all probably like is this one 2077 01:38:46,100 --> 01:38:49,250 that started in 2005 with 188 episodes. 2078 01:38:49,250 --> 01:38:53,060 Its ID in the shows table is 386676. 2079 01:38:53,060 --> 01:38:54,710 That's confirmed over here too. 2080 01:38:54,710 --> 01:38:56,850 So again, we've just joined the two tables. 2081 01:38:56,850 --> 01:38:57,350 How? 2082 01:38:57,350 --> 01:38:58,730 By lining up those fields. 2083 01:38:58,730 --> 01:39:01,490 But now that we can see that almost all of The Offices 2084 01:39:01,490 --> 01:39:04,700 produced over the decades are comedies except for this one. 2085 01:39:04,700 --> 01:39:07,460 There was a version of The Office produced in 2001 2086 01:39:07,460 --> 01:39:10,160 that was considered more of a drama. 2087 01:39:10,160 --> 01:39:11,940 Unsure if it's related to the other. 2088 01:39:11,940 --> 01:39:13,640 How can we link in other data? 2089 01:39:13,640 --> 01:39:16,920 Well, let's go ahead and link in ratings too or instead. 2090 01:39:16,920 --> 01:39:21,290 So instead of joining this with genres, let me go ahead and rewind here 2091 01:39:21,290 --> 01:39:28,550 and join shows on ratings on shows.id equals ratings.show_ID. 2092 01:39:28,550 --> 01:39:32,630 And let's limit it to The Office too for discussion's sake where title equals 2093 01:39:32,630 --> 01:39:35,820 quote unquote "The Office" semicolon. 2094 01:39:35,820 --> 01:39:39,650 And now you can see that among the various Offices, 2095 01:39:39,650 --> 01:39:43,100 it looks like the one that most of us probably know and love 2096 01:39:43,100 --> 01:39:49,070 is the highest rated also with a 9.0 with like 585,000 people having 2097 01:39:49,070 --> 01:39:52,280 cast votes for whereas this other shows seem to have been less popular. 2098 01:39:52,280 --> 01:39:56,370 And perhaps that's why indeed you see fewer episodes for them as well. 2099 01:39:56,370 --> 01:39:58,680 So even though we've put the data in multiple places, 2100 01:39:58,680 --> 01:40:02,670 you can still kind of reconstitute it by lining things up in this way 2101 01:40:02,670 --> 01:40:04,970 and rejoining the tables. 2102 01:40:04,970 --> 01:40:09,140 Questions now on this? 2103 01:40:09,140 --> 01:40:16,540 This is the heart of what SQL does and what relational databases do for you. 2104 01:40:16,540 --> 01:40:18,100 Questions? 2105 01:40:18,100 --> 01:40:18,610 All right. 2106 01:40:18,610 --> 01:40:20,470 A few final features. 2107 01:40:20,470 --> 01:40:22,660 There's not all that much that-- 2108 01:40:22,660 --> 01:40:24,320 SQL takes practice like anything else. 2109 01:40:24,320 --> 01:40:26,237 But in terms of syntax and capabilities, let's 2110 01:40:26,237 --> 01:40:30,610 just introduce you to a couple of final features here and problems that arise 2111 01:40:30,610 --> 01:40:32,210 and how we might solve them. 2112 01:40:32,210 --> 01:40:34,640 Let's do this as well. 2113 01:40:34,640 --> 01:40:36,980 So let me go back into VS Code here. 2114 01:40:36,980 --> 01:40:40,900 And let's just find out Steve Carell's information again. 2115 01:40:40,900 --> 01:40:43,180 Last time we did it with this nested query 2116 01:40:43,180 --> 01:40:47,503 by getting his ID and then the show IDs and then the titles for those show IDs. 2117 01:40:47,503 --> 01:40:49,420 With join, you can do it a little differently. 2118 01:40:49,420 --> 01:40:50,950 And any of these ways are fine. 2119 01:40:50,950 --> 01:40:53,740 One might become easier to mentally than another. 2120 01:40:53,740 --> 01:40:57,040 Let's go ahead and select the titles from what. 2121 01:40:57,040 --> 01:41:02,095 Let's select the title from the people table. 2122 01:41:02,095 --> 01:41:03,220 And I'm going to hit Enter. 2123 01:41:03,220 --> 01:41:05,260 And when you're using SQLite3 interactively, 2124 01:41:05,260 --> 01:41:09,400 if you ever find yourself with a prompt that says dot dot dot angle bracket, 2125 01:41:09,400 --> 01:41:12,067 it means you're continuing your thought onto the next line. 2126 01:41:12,067 --> 01:41:13,900 If you didn't intend that, you can sometimes 2127 01:41:13,900 --> 01:41:16,510 hit semicolon to just end the thought and hit 2128 01:41:16,510 --> 01:41:18,460 Enter even if it triggers an error. 2129 01:41:18,460 --> 01:41:21,430 But this is one way of formatting my queries now a little more nicely. 2130 01:41:21,430 --> 01:41:24,597 I'm just going to add some white space so that it's a little easier to read. 2131 01:41:24,597 --> 01:41:25,720 What do I want to select? 2132 01:41:25,720 --> 01:41:28,000 Well, I want to select the title of shows 2133 01:41:28,000 --> 01:41:35,320 from the people table joined with the stars table on the people table's ID 2134 01:41:35,320 --> 01:41:40,520 column equaling the stars table's person ID column. 2135 01:41:40,520 --> 01:41:44,230 So in other words, if you think back to what people are and what stars are, 2136 01:41:44,230 --> 01:41:48,250 one has an ID, one has a person ID, I'm just now connecting those two tables. 2137 01:41:48,250 --> 01:41:49,450 I'm joining those two. 2138 01:41:49,450 --> 01:41:53,350 But I want to do this as well with another table. 2139 01:41:53,350 --> 01:41:54,938 Let me additionally join in. 2140 01:41:54,938 --> 01:41:56,980 So now I only have two hands, but now I'm putting 2141 01:41:56,980 --> 01:41:59,320 a third table joined in together here. 2142 01:41:59,320 --> 01:42:05,960 Join shows on stars.show_id equals shows.id. 2143 01:42:05,960 --> 01:42:08,500 So this is now linking three tables together. 2144 01:42:08,500 --> 01:42:12,100 But I only care about this for one person, so where the name of the person 2145 01:42:12,100 --> 01:42:14,650 equals quote unquote "Steve Carell." 2146 01:42:14,650 --> 01:42:18,850 So more cryptic, to be sure, but what we're doing with this query 2147 01:42:18,850 --> 01:42:21,370 is just taking all three tables that we care about 2148 01:42:21,370 --> 01:42:23,440 and we're joining them all together at once 2149 01:42:23,440 --> 01:42:27,250 using this new join syntax literally telling the database what 2150 01:42:27,250 --> 01:42:29,020 columns to line up with what. 2151 01:42:29,020 --> 01:42:32,350 And then we filter at the very end just like before to get back, 2152 01:42:32,350 --> 01:42:35,650 if I hit Enter, the answer we want, which in this case 2153 01:42:35,650 --> 01:42:40,030 is a little slower at the moment, but that same list of 20 or so 2154 01:42:40,030 --> 01:42:41,350 shows that he's been in. 2155 01:42:41,350 --> 01:42:42,920 There's one other way to do this. 2156 01:42:42,920 --> 01:42:45,250 And again, these are all in the slides online. 2157 01:42:45,250 --> 01:42:47,890 So you can repeat them without having to jot down everything and we'll put them 2158 01:42:47,890 --> 01:42:48,820 in the notes too. 2159 01:42:48,820 --> 01:42:50,450 But there's another way to do this. 2160 01:42:50,450 --> 01:42:52,780 I could also use an implicit join. 2161 01:42:52,780 --> 01:42:55,030 So that was an explicit join because I literally typed 2162 01:42:55,030 --> 01:42:57,220 the word join multiple times at that. 2163 01:42:57,220 --> 01:43:00,880 But let me go ahead and select the title from these three tables. 2164 01:43:00,880 --> 01:43:02,628 People, stars, and shows. 2165 01:43:02,628 --> 01:43:04,420 And this might just be nicer because if you 2166 01:43:04,420 --> 01:43:06,337 know what tables you want to select data from, 2167 01:43:06,337 --> 01:43:08,980 just enumerate them separated by commas, which you might prefer 2168 01:43:08,980 --> 01:43:18,970 in your mind, where the people ID equals the stars person ID and the stars 2169 01:43:18,970 --> 01:43:25,450 show ID equals the shows ID and the name of the person equals Steve Carell. 2170 01:43:25,450 --> 01:43:26,788 So this is an implicit join. 2171 01:43:26,788 --> 01:43:29,830 And honestly, I constantly reference my notes for some of this stuff too. 2172 01:43:29,830 --> 01:43:31,720 It's not the kind of thing that's going to come like this to you 2173 01:43:31,720 --> 01:43:32,740 after just one day. 2174 01:43:32,740 --> 01:43:35,560 But it's just a different way of expressing the same thing. 2175 01:43:35,560 --> 01:43:37,720 I want to select data from three different tables. 2176 01:43:37,720 --> 01:43:43,120 And hey SQL, here is how I want you to line those tables up so that I can 2177 01:43:43,120 --> 01:43:45,640 get like related data for Steve Carell. 2178 01:43:45,640 --> 01:43:49,450 And this now will achieve the same results ultimately. 2179 01:43:49,450 --> 01:43:50,260 Let me hit Enter. 2180 01:43:50,260 --> 01:43:54,040 2181 01:43:54,040 --> 01:43:54,920 And there we go. 2182 01:43:54,920 --> 01:43:55,750 So a little slower. 2183 01:43:55,750 --> 01:43:57,850 And performance might vary based on computer, 2184 01:43:57,850 --> 01:44:01,270 based on implementation of SQL, but I think I still have the same answers. 2185 01:44:01,270 --> 01:44:04,790 Now suppose, as I often do, and I had to look it up again last time, 2186 01:44:04,790 --> 01:44:08,350 suppose you forget how to spell Steve Carell's name. 2187 01:44:08,350 --> 01:44:10,240 Is it two R's, two L's, or the like? 2188 01:44:10,240 --> 01:44:12,903 Well, I could also do something like this. 2189 01:44:12,903 --> 01:44:14,320 Well, let's just keep this simple. 2190 01:44:14,320 --> 01:44:17,103 Select star from people where name equals. 2191 01:44:17,103 --> 01:44:20,020 I've been deliberately getting it right so as to not embarrass myself. 2192 01:44:20,020 --> 01:44:22,000 That's the Steve Carell I keep querying. 2193 01:44:22,000 --> 01:44:25,960 If you forget, well you could try searching for just Steves, 2194 01:44:25,960 --> 01:44:28,450 but interestingly, there's a bunch of Steves. 2195 01:44:28,450 --> 01:44:30,070 We don't know when they were born. 2196 01:44:30,070 --> 01:44:35,360 But that's probably not the Steve Carell we want if we don't have his last name. 2197 01:44:35,360 --> 01:44:38,770 So I could alternatively do, well it's Steve and then it starts with a C, 2198 01:44:38,770 --> 01:44:39,460 I think. 2199 01:44:39,460 --> 01:44:42,700 Well, it turns out there's another wild card you can use in SQL. 2200 01:44:42,700 --> 01:44:46,060 We used the asterisk to select all of the columns. 2201 01:44:46,060 --> 01:44:50,800 You can in quotes use a percent sign to say C something. 2202 01:44:50,800 --> 01:44:54,370 So there's 0 or more characters after the letter C. 2203 01:44:54,370 --> 01:44:58,630 And now this doesn't work because now I would be literally looking 2204 01:44:58,630 --> 01:45:01,690 for Steve space C something. 2205 01:45:01,690 --> 01:45:04,910 But recall earlier I mentioned that one other keyword, 2206 01:45:04,910 --> 01:45:06,868 which is for fuzzier matching, so to speak, 2207 01:45:06,868 --> 01:45:09,160 where it's not exactly what you're looking for but it's 2208 01:45:09,160 --> 01:45:10,630 like what you're looking for. 2209 01:45:10,630 --> 01:45:14,350 If you instead say where his name is like Steve space C something, 2210 01:45:14,350 --> 01:45:17,060 now we'll get back a whole bunch of Steves. 2211 01:45:17,060 --> 01:45:19,810 But I think now I could probably find the one I'm actually looking 2212 01:45:19,810 --> 01:45:20,920 for if I don't remember his name. 2213 01:45:20,920 --> 01:45:22,510 You can use multiple percent signs. 2214 01:45:22,510 --> 01:45:25,260 If you forget what his first name is, you could reverse the order. 2215 01:45:25,260 --> 01:45:29,940 But that too is a very powerful SQL feature at that. 2216 01:45:29,940 --> 01:45:32,670 Questions on these queries here? 2217 01:45:32,670 --> 01:45:33,708 Yeah. 2218 01:45:33,708 --> 01:45:35,448 AUDIENCE: [INAUDIBLE] 2219 01:45:35,448 --> 01:45:36,240 DAVID MALAN: Sorry? 2220 01:45:36,240 --> 01:45:38,730 AUDIENCE: [INAUDIBLE] 2221 01:45:38,730 --> 01:45:40,003 DAVID MALAN: What about it? 2222 01:45:40,003 --> 01:45:40,990 AUDIENCE: [INAUDIBLE] 2223 01:45:40,990 --> 01:45:42,400 DAVID MALAN: Oh yeah, sure. 2224 01:45:42,400 --> 01:45:44,500 So the query I used here. 2225 01:45:44,500 --> 01:45:49,810 There's a lot of Steves whose last name starts with C. Oops, too far. 2226 01:45:49,810 --> 01:45:52,760 The last query I executed was this one here. 2227 01:45:52,760 --> 01:45:57,110 So where the name is like quote unquote "Steve C%." 2228 01:45:57,110 --> 01:45:59,950 So that's just another tool for your toolkit here. 2229 01:45:59,950 --> 01:46:02,830 But you'll perhaps have notice that those two-- 2230 01:46:02,830 --> 01:46:05,380 prior to that query, the joins I did were sort of slow. 2231 01:46:05,380 --> 01:46:07,480 And honestly, this database isn't even that big. 2232 01:46:07,480 --> 01:46:10,120 Like yes, it has tens of thousands of rows in it. 2233 01:46:10,120 --> 01:46:13,630 But in the real world and most of the apps you and I use a lot every day 2234 01:46:13,630 --> 01:46:17,080 or websites, there's millions, even billions of rows of data. 2235 01:46:17,080 --> 01:46:22,210 And if I had to wait on my computer here or my code space a second or two 2236 01:46:22,210 --> 01:46:25,990 to get the data, that's not going to work for millions of users or customers 2237 01:46:25,990 --> 01:46:26,680 certainly. 2238 01:46:26,680 --> 01:46:29,020 So how can we actually improve things? 2239 01:46:29,020 --> 01:46:34,330 Well, it turns out another upside of a proper relational database 2240 01:46:34,330 --> 01:46:38,830 is that it's not just a spreadsheet where the onus is on you to find 2241 01:46:38,830 --> 01:46:40,120 the data you're looking for. 2242 01:46:40,120 --> 01:46:43,780 You can also tell the database to index the data for you. 2243 01:46:43,780 --> 01:46:48,670 An index is an efficient cheat sheet for finding data fast. 2244 01:46:48,670 --> 01:46:52,908 Like books in the real world often have indices at the end of the book 2245 01:46:52,908 --> 01:46:54,700 where you can look things up alphabetically 2246 01:46:54,700 --> 01:46:57,790 and then you can cross reference it for the pages that topic appears on. 2247 01:46:57,790 --> 01:46:59,110 Same idea in a database. 2248 01:46:59,110 --> 01:47:03,190 If you tell the database in advance that you want to search on a certain column 2249 01:47:03,190 --> 01:47:06,700 frequently, you can tell it to build a fancy index that will just 2250 01:47:06,700 --> 01:47:08,770 allow you to search that column faster. 2251 01:47:08,770 --> 01:47:11,560 By default, these columns are going to be searched 2252 01:47:11,560 --> 01:47:13,197 most likely by a linear search. 2253 01:47:13,197 --> 01:47:15,280 Not even binary search, because the data might not 2254 01:47:15,280 --> 01:47:17,390 be sorted because it came in any order. 2255 01:47:17,390 --> 01:47:20,170 But if you create an index, you're probably 2256 01:47:20,170 --> 01:47:22,780 going to get something closer to logarithmic than linear, 2257 01:47:22,780 --> 01:47:24,892 and that's going to be a big plus overall. 2258 01:47:24,892 --> 01:47:26,350 So let me do something simple here. 2259 01:47:26,350 --> 01:47:30,670 First let me turn on a SQLite specific feature that just is going to time all 2260 01:47:30,670 --> 01:47:33,190 of my queries by writing .timer on. 2261 01:47:33,190 --> 01:47:37,515 I just want to keep track of how long each of these commands takes. 2262 01:47:37,515 --> 01:47:40,390 This one is not a slow command, so this is just going to be relative. 2263 01:47:40,390 --> 01:47:43,300 But let's just select everything from the shows table 2264 01:47:43,300 --> 01:47:45,970 where the title thereof is The Office. 2265 01:47:45,970 --> 01:47:48,940 Let's see how long this relatively simple query takes. 2266 01:47:48,940 --> 01:47:50,980 All right, not very long at all. 2267 01:47:50,980 --> 01:47:54,850 In real terms less than a second, 0.035 seconds. 2268 01:47:54,850 --> 01:47:57,070 So not slow by any means. 2269 01:47:57,070 --> 01:48:00,160 But if you've got hundreds, thousands, millions of users, 2270 01:48:00,160 --> 01:48:03,040 every one of those milliseconds could very well add up. 2271 01:48:03,040 --> 01:48:04,640 So can we do better? 2272 01:48:04,640 --> 01:48:06,440 Well, we can if I do this. 2273 01:48:06,440 --> 01:48:11,740 If I use syntax like this once in the beginning of the design of my database, 2274 01:48:11,740 --> 01:48:17,980 I create not a table but an index with some name on a specific table on one 2275 01:48:17,980 --> 01:48:18,820 or more columns. 2276 01:48:18,820 --> 01:48:21,370 I can give a clue, a hint to the database in advance 2277 01:48:21,370 --> 01:48:26,260 saying please optimize with some secret sauce searching or selecting 2278 01:48:26,260 --> 01:48:29,740 on this column in this table so that my searches are faster. 2279 01:48:29,740 --> 01:48:30,860 So let me do this. 2280 01:48:30,860 --> 01:48:32,590 Let me go back to VS Code here. 2281 01:48:32,590 --> 01:48:36,460 Let me create an index called how about title index. 2282 01:48:36,460 --> 01:48:40,240 I could call it anything I want, but I want to search faster on titles. 2283 01:48:40,240 --> 01:48:45,940 So I'm going to call this a title index where rather title index on the table 2284 01:48:45,940 --> 01:48:46,900 called shows. 2285 01:48:46,900 --> 01:48:49,150 And then in parentheses is the syntax. 2286 01:48:49,150 --> 01:48:50,620 The column called title. 2287 01:48:50,620 --> 01:48:53,350 So again, I've just borrowed this canonical syntax 2288 01:48:53,350 --> 01:48:57,590 and I've just translated it into something that's TV show specific. 2289 01:48:57,590 --> 01:48:58,090 All right. 2290 01:48:58,090 --> 01:49:00,040 What is this going to do for me? 2291 01:49:00,040 --> 01:49:03,700 Once I hit Enter, this is going to create in the computer's memory, 2292 01:49:03,700 --> 01:49:05,860 the database's memory something called a B-tree. 2293 01:49:05,860 --> 01:49:07,330 It's not a binary tree. 2294 01:49:07,330 --> 01:49:11,627 A B-tree is actually a potentially more efficient data structure 2295 01:49:11,627 --> 01:49:13,960 that we didn't talk about a few weeks back in week five, 2296 01:49:13,960 --> 01:49:15,710 but it looks a little something like this, 2297 01:49:15,710 --> 01:49:17,680 which looks similar to a binary tree. 2298 01:49:17,680 --> 01:49:21,847 But does anyone notice what makes this not a binary tree? 2299 01:49:21,847 --> 01:49:26,405 AUDIENCE: [INAUDIBLE] 2300 01:49:26,405 --> 01:49:27,280 DAVID MALAN: Exactly. 2301 01:49:27,280 --> 01:49:31,570 Binary tree, bi implying two, has no more than two children per node, 2302 01:49:31,570 --> 01:49:33,580 but here's a perfect example, one, two, three. 2303 01:49:33,580 --> 01:49:35,990 And there could be four children, five children or more. 2304 01:49:35,990 --> 01:49:38,530 But the effect of that, if you have a very wide tree, 2305 01:49:38,530 --> 01:49:40,780 the upside is that it's very short. 2306 01:49:40,780 --> 01:49:44,560 It pulls the data higher up closer to the node, to the root node. 2307 01:49:44,560 --> 01:49:48,080 And recall that the root node is where we began our searches in the past, 2308 01:49:48,080 --> 01:49:51,620 whether it was a BST, a Binary Search Tree, even a tri or other data 2309 01:49:51,620 --> 01:49:52,120 structures. 2310 01:49:52,120 --> 01:49:53,380 We always began at the top. 2311 01:49:53,380 --> 01:49:55,780 So the higher up you can pull the data, even 2312 01:49:55,780 --> 01:49:57,617 if it makes the data structure very wide, 2313 01:49:57,617 --> 01:50:00,700 you're going to be able to do boom, boom, boom, look up queries or look up 2314 01:50:00,700 --> 01:50:03,340 data probably much faster certainly than if it's just 2315 01:50:03,340 --> 01:50:06,590 a very long list like a column by default. 2316 01:50:06,590 --> 01:50:09,970 So with that said, let me go back to VS Code. 2317 01:50:09,970 --> 01:50:11,380 I didn't create the index yet. 2318 01:50:11,380 --> 01:50:13,277 Let me go ahead and hit Enter and create it. 2319 01:50:13,277 --> 01:50:14,860 All right, it took a minute, a moment. 2320 01:50:14,860 --> 01:50:17,920 It took like half a second, which obviously is not that slow. 2321 01:50:17,920 --> 01:50:20,380 But with more data, that could have been even slower. 2322 01:50:20,380 --> 01:50:22,600 But it's a one time operation as of now. 2323 01:50:22,600 --> 01:50:27,910 And now let me hit up and let me select the same data from shows 2324 01:50:27,910 --> 01:50:29,350 where title equals The Office. 2325 01:50:29,350 --> 01:50:34,360 Last time just a moment ago it took 0.035 seconds. 2326 01:50:34,360 --> 01:50:38,920 Not slow but also that's going to add up if I have lots of users of IMDb. 2327 01:50:38,920 --> 01:50:42,040 Let's go ahead now and execute the same query again. 2328 01:50:42,040 --> 01:50:44,700 How long did that take? 2329 01:50:44,700 --> 01:50:47,010 0.001 seconds now. 2330 01:50:47,010 --> 01:50:48,720 I mean, practically nothing. 2331 01:50:48,720 --> 01:50:50,970 And so that's the sort of opportunity now. 2332 01:50:50,970 --> 01:50:53,470 When you've got lots of data and you want to really speed up 2333 01:50:53,470 --> 01:50:56,130 these searches, these indexes, these indices that just create 2334 01:50:56,130 --> 01:50:59,130 for you these magical data structures in the databases memory, 2335 01:50:59,130 --> 01:51:02,430 it allows you to search on columns that you are pretty sure you 2336 01:51:02,430 --> 01:51:04,150 want to search on more effectively. 2337 01:51:04,150 --> 01:51:07,380 Now, by contrast, if you've ever used Google or Bing or some search 2338 01:51:07,380 --> 01:51:10,590 engine that has advanced search, some of those text boxes 2339 01:51:10,590 --> 01:51:13,470 that you can search more precisely in might very well be slower. 2340 01:51:13,470 --> 01:51:14,190 Why? 2341 01:51:14,190 --> 01:51:16,950 Well, probably you don't want to go crazy and just index 2342 01:51:16,950 --> 01:51:19,380 every column on every table. 2343 01:51:19,380 --> 01:51:21,020 Why? 2344 01:51:21,020 --> 01:51:25,070 What might be the intuition? 2345 01:51:25,070 --> 01:51:28,490 If logically indexes speed things up, why not index everything? 2346 01:51:28,490 --> 01:51:30,710 There's always going to be a trade off here. 2347 01:51:30,710 --> 01:51:33,040 What might that be? 2348 01:51:33,040 --> 01:51:34,000 Yeah. 2349 01:51:34,000 --> 01:51:34,875 AUDIENCE: [INAUDIBLE] 2350 01:51:34,875 --> 01:51:37,167 DAVID MALAN: Yeah, it's going to take a lot of storage. 2351 01:51:37,167 --> 01:51:38,960 This is just a slide on the screen. 2352 01:51:38,960 --> 01:51:41,020 But this has to go somewhere. 2353 01:51:41,020 --> 01:51:43,360 This needs space in the computer's memory 2354 01:51:43,360 --> 01:51:44,740 or on the hard drive or the like. 2355 01:51:44,740 --> 01:51:48,777 And that's fine if you have unlimited space, but odds are you don't. 2356 01:51:48,777 --> 01:51:51,110 And that's going to get expensive for different reasons. 2357 01:51:51,110 --> 01:51:55,810 So maybe you only want to index certain columns and certain tables 2358 01:51:55,810 --> 01:51:56,860 and not all of them. 2359 01:51:56,860 --> 01:51:57,777 Because you know what? 2360 01:51:57,777 --> 01:51:59,590 What even if a user really wants to search 2361 01:51:59,590 --> 01:52:05,020 maybe via advanced search on some other column or table altogether, fine. 2362 01:52:05,020 --> 01:52:07,240 If once in a while a query is slow, we're 2363 01:52:07,240 --> 01:52:11,110 probably getting the bigger bang for our buck by optimizing the common cases, 2364 01:52:11,110 --> 01:52:16,030 the more popular queries that people actually care about too. 2365 01:52:16,030 --> 01:52:16,640 All right. 2366 01:52:16,640 --> 01:52:22,450 So let's come full circle and bring this now back to how we actually 2367 01:52:22,450 --> 01:52:25,030 began, which was with some Python code. 2368 01:52:25,030 --> 01:52:27,550 So it turns out these are not either or decisions. 2369 01:52:27,550 --> 01:52:29,740 It turns out in the real world, developers 2370 01:52:29,740 --> 01:52:33,320 are constantly using one, two, three languages at once. 2371 01:52:33,320 --> 01:52:36,940 And in fact, next week I rattled off HTML, CSS, and JavaScript, one of which 2372 01:52:36,940 --> 01:52:39,310 is a proper programming language, but those languages 2373 01:52:39,310 --> 01:52:40,510 are often used together. 2374 01:52:40,510 --> 01:52:46,300 Totally normal and common to use Python and SQL or Java and SQL 2375 01:52:46,300 --> 01:52:49,300 or SWIFT and SQL or any number of different combinations 2376 01:52:49,300 --> 01:52:50,710 with a database language. 2377 01:52:50,710 --> 01:52:53,920 You might use your preferred programming language, Java, Python, 2378 01:52:53,920 --> 01:53:00,760 C++ to create the user interface and the logic that implements the program 2379 01:53:00,760 --> 01:53:01,330 itself. 2380 01:53:01,330 --> 01:53:04,092 But for your data, SQL's a really good candidate. 2381 01:53:04,092 --> 01:53:07,300 And indeed, we've seen already that SQL can just speed up certain operations. 2382 01:53:07,300 --> 01:53:08,140 You can change. 2383 01:53:08,140 --> 01:53:11,200 You can collapse 15 lines of code into just one 2384 01:53:11,200 --> 01:53:13,430 and you can use these things together. 2385 01:53:13,430 --> 01:53:14,740 So let me come back to-- 2386 01:53:14,740 --> 01:53:16,720 I'm going to quit out of SQLite. 2387 01:53:16,720 --> 01:53:18,580 I'm going to minimize my terminal window. 2388 01:53:18,580 --> 01:53:22,030 And here's where we left off before with favorites.py. 2389 01:53:22,030 --> 01:53:27,430 With favorites.py, everything was being stored in favorites.csv. 2390 01:53:27,430 --> 01:53:32,260 And recall that we eventually imported that CSV file into favorites.db 2391 01:53:32,260 --> 01:53:35,920 automatically with .import just so we could start playing around with SQL. 2392 01:53:35,920 --> 01:53:37,990 But we can now tie these two together. 2393 01:53:37,990 --> 01:53:40,690 And a way to do that is as follows. 2394 01:53:40,690 --> 01:53:42,700 CS50 has a library for Python. 2395 01:53:42,700 --> 01:53:47,350 You might recall having available get string, get int, get float. 2396 01:53:47,350 --> 01:53:49,360 You don't strictly need to use them in Python 2397 01:53:49,360 --> 01:53:51,970 because it's much easier to just use the input function 2398 01:53:51,970 --> 01:53:55,570 and then try, accept, and convert things to int or float or the like. 2399 01:53:55,570 --> 01:54:00,520 But it's a lot more work to use SQL in Python without a third party library. 2400 01:54:00,520 --> 01:54:03,370 A lot of the commercial options or popular open source options 2401 01:54:03,370 --> 01:54:05,200 are actually just complicated to use. 2402 01:54:05,200 --> 01:54:09,790 So CS50 does have a very useful function inside of its library for Python 2403 01:54:09,790 --> 01:54:12,640 that you should use and must use for the problem set that 2404 01:54:12,640 --> 01:54:18,100 just makes it easy to execute Python, execute SQL inside of your Python code. 2405 01:54:18,100 --> 01:54:22,360 But it's built on top of a very popular open source alternative. 2406 01:54:22,360 --> 01:54:24,350 So you can use that too in the real world. 2407 01:54:24,350 --> 01:54:26,800 So the documentation for that is at this URL here, 2408 01:54:26,800 --> 01:54:30,940 but I'll show you what we need to know here by focusing back on favorites.py. 2409 01:54:30,940 --> 01:54:35,540 So what I'm going to do here is follows is this. 2410 01:54:35,540 --> 01:54:42,550 Let me delete everything from favorites.py except for let's say this. 2411 01:54:42,550 --> 01:54:45,970 From CS50 import SQL in all caps. 2412 01:54:45,970 --> 01:54:49,060 So that's importing a SQL feature from CS50's library 2413 01:54:49,060 --> 01:54:53,200 that's going to allow me to open a DB file in code. 2414 01:54:53,200 --> 01:54:54,380 How do I do that? 2415 01:54:54,380 --> 01:54:56,650 Well, let me create a variable called DB for database, 2416 01:54:56,650 --> 01:54:58,275 though I could call it anything I want. 2417 01:54:58,275 --> 01:55:02,470 Let me call this SQL function and pass in using special syntax that's 2418 01:55:02,470 --> 01:55:03,640 not CS50 specific. 2419 01:55:03,640 --> 01:55:09,640 It's an industry thing. sqlite:///. 2420 01:55:09,640 --> 01:55:14,640 Unlike every other URL you type, this one literally has three in this context 2421 01:55:14,640 --> 01:55:15,140 here. 2422 01:55:15,140 --> 01:55:18,340 And then the name of the database, which in this case is favorites.db. 2423 01:55:18,340 --> 01:55:22,450 So this is just a way of telling this SQL library that we wrote 2424 01:55:22,450 --> 01:55:27,640 but that works exactly like third party alternatives open favorites.db using 2425 01:55:27,640 --> 01:55:30,380 the SQLite technology, if you will. 2426 01:55:30,380 --> 01:55:30,880 All right. 2427 01:55:30,880 --> 01:55:32,338 Let's just ask the user a question. 2428 01:55:32,338 --> 01:55:36,057 Give me your favorite problem. 2429 01:55:36,057 --> 01:55:38,140 So we're going to use input instead of get string, 2430 01:55:38,140 --> 01:55:41,450 but we could use get string, but they're pretty much the same for our purposes. 2431 01:55:41,450 --> 01:55:43,150 Let's ask the user for their favorite. 2432 01:55:43,150 --> 01:55:49,960 And now in Python code, let us select from favorites.db 2433 01:55:49,960 --> 01:55:55,450 all of the rows where students specify that problem as their favorite. 2434 01:55:55,450 --> 01:55:57,610 So in SQL alone, it would be this. 2435 01:55:57,610 --> 01:56:04,210 Select star from favorites where problem equals 2436 01:56:04,210 --> 01:56:08,170 and I'll do, well, whatever my favorite's going to be. 2437 01:56:08,170 --> 01:56:10,288 Like problem equals Mario, for instance. 2438 01:56:10,288 --> 01:56:13,330 So if I were just using SQL, I would literally write something like that. 2439 01:56:13,330 --> 01:56:15,130 But I'm in a .py file now. 2440 01:56:15,130 --> 01:56:16,870 I have to use Python syntax. 2441 01:56:16,870 --> 01:56:18,550 But Python supports strings. 2442 01:56:18,550 --> 01:56:20,090 SQL is just text. 2443 01:56:20,090 --> 01:56:20,990 It's just a string. 2444 01:56:20,990 --> 01:56:24,220 So I could certainly just put my SQL code in a string 2445 01:56:24,220 --> 01:56:26,800 perhaps and then pass it to a Python function. 2446 01:56:26,800 --> 01:56:28,780 And here's the bridge between the two. 2447 01:56:28,780 --> 01:56:31,960 If you just treat SQL as any old text, we can put it in a string 2448 01:56:31,960 --> 01:56:32,840 and execute it. 2449 01:56:32,840 --> 01:56:34,837 So let me actually do this. 2450 01:56:34,837 --> 01:56:36,670 Let me go ahead and create a variable called 2451 01:56:36,670 --> 01:56:40,130 rows, which is eventually going to contain all the rows from the database. 2452 01:56:40,130 --> 01:56:45,790 Let me go ahead and select db.execute. 2453 01:56:45,790 --> 01:56:49,030 This is the one function you need to know about inside of CS50's library, 2454 01:56:49,030 --> 01:56:51,310 and it literally executes a SQL statement. 2455 01:56:51,310 --> 01:56:56,140 And then in quotes, you pass it literally what you want to execute. 2456 01:56:56,140 --> 01:56:58,990 And let me go ahead and close the parenthesis at the end there. 2457 01:56:58,990 --> 01:57:00,740 And now let me just try this. 2458 01:57:00,740 --> 01:57:05,260 So for row in rows, let's iterate over all of the rows, let me go ahead 2459 01:57:05,260 --> 01:57:12,370 and print out how about row, quote unquote. 2460 01:57:12,370 --> 01:57:15,700 And what do I want here? 2461 01:57:15,700 --> 01:57:20,020 Let's print out the timestamp of that person for kicks. 2462 01:57:20,020 --> 01:57:22,120 All right, let me open my terminal window. 2463 01:57:22,120 --> 01:57:23,890 Python of favorites.py. 2464 01:57:23,890 --> 01:57:27,490 Crossing my fingers here for sure. 2465 01:57:27,490 --> 01:57:27,990 Enter. 2466 01:57:27,990 --> 01:57:31,200 2467 01:57:31,200 --> 01:57:31,770 There we go. 2468 01:57:31,770 --> 01:57:32,280 Favorites. 2469 01:57:32,280 --> 01:57:34,500 I'll type in Mario. 2470 01:57:34,500 --> 01:57:35,070 OK. 2471 01:57:35,070 --> 01:57:36,240 So I got back-- 2472 01:57:36,240 --> 01:57:37,980 it's not very interesting, but I got back 2473 01:57:37,980 --> 01:57:40,830 all of the timestamps of students who typed in Mario that we 2474 01:57:40,830 --> 01:57:42,330 imported into this database. 2475 01:57:42,330 --> 01:57:45,360 Well, what I really care about is how popular Mario is. 2476 01:57:45,360 --> 01:57:46,830 So let me change this a little bit. 2477 01:57:46,830 --> 01:57:50,490 Let me change this to count the number of rows. 2478 01:57:50,490 --> 01:57:52,080 And let me keep it simple. 2479 01:57:52,080 --> 01:57:56,260 Let me give an alias like I proposed earlier like as n, where n is a number. 2480 01:57:56,260 --> 01:58:00,090 So that now down here, I can actually just do this. 2481 01:58:00,090 --> 01:58:02,230 Print out the value of n. 2482 01:58:02,230 --> 01:58:02,730 All right. 2483 01:58:02,730 --> 01:58:04,450 Let me go back to my terminal window. 2484 01:58:04,450 --> 01:58:05,910 Run Python to favorites.py. 2485 01:58:05,910 --> 01:58:07,140 Let me type in Mario. 2486 01:58:07,140 --> 01:58:08,070 Enter. 2487 01:58:08,070 --> 01:58:09,660 OK, 39. 2488 01:58:09,660 --> 01:58:11,790 Now, technically I'm cheating. 2489 01:58:11,790 --> 01:58:14,430 Honestly if I'm executing select count, we've 2490 01:58:14,430 --> 01:58:17,752 seen before it only ever returns one row, not multiple. 2491 01:58:17,752 --> 01:58:20,460 So there's really nothing to iterate over, but it's working fine. 2492 01:58:20,460 --> 01:58:22,560 It's just iterating once, but I'm getting lucky. 2493 01:58:22,560 --> 01:58:26,100 So technically what I should probably just do is this. 2494 01:58:26,100 --> 01:58:29,220 I should probably give myself a variable called row, 2495 01:58:29,220 --> 01:58:34,770 set it equal to the very first row and only row that came back, 2496 01:58:34,770 --> 01:58:39,510 and now print out that rows and column. 2497 01:58:39,510 --> 01:58:40,890 Let me rerun the program. 2498 01:58:40,890 --> 01:58:42,130 I'll type in Mario again. 2499 01:58:42,130 --> 01:58:42,660 Enter. 2500 01:58:42,660 --> 01:58:44,980 And I still see 39. 2501 01:58:44,980 --> 01:58:48,328 So of course, I don't strictly need to do this. 2502 01:58:48,328 --> 01:58:49,620 I don't really need a variable. 2503 01:58:49,620 --> 01:58:52,590 I can do rows bracket 0 instead. 2504 01:58:52,590 --> 01:58:54,840 But let me focus on what this library is now doing. 2505 01:58:54,840 --> 01:59:01,620 So per the documentation, what the CS50 execute function always does for you is 2506 01:59:01,620 --> 01:59:05,620 it returns a list of dictionaries. 2507 01:59:05,620 --> 01:59:10,558 So if your query returns nothing, like no matches, you get back an empty list. 2508 01:59:10,558 --> 01:59:12,600 Like open bracket, closed bracket, nothing in it. 2509 01:59:12,600 --> 01:59:14,520 Any loop is not going to execute anything useful, 2510 01:59:14,520 --> 01:59:15,770 because there's nothing in it. 2511 01:59:15,770 --> 01:59:18,660 If, though, you get back one row, you're going 2512 01:59:18,660 --> 01:59:22,740 to get back a list of size one inside of which is a single dictionary. 2513 01:59:22,740 --> 01:59:24,930 That dictionary is going to have keys that 2514 01:59:24,930 --> 01:59:29,200 correspond to whatever you selected, be it the columns or the count. 2515 01:59:29,200 --> 01:59:32,880 So when I selected star before, I would have gotten all of the columns. 2516 01:59:32,880 --> 01:59:34,950 That's how I was able to access timestamp. 2517 01:59:34,950 --> 01:59:37,230 Here I'm just selecting count and I don't 2518 01:59:37,230 --> 01:59:38,910 want to have to type this down here. 2519 01:59:38,910 --> 01:59:40,230 That would just look kind of atrocious. 2520 01:59:40,230 --> 01:59:42,355 It would work, but it would look weird to just keep 2521 01:59:42,355 --> 01:59:44,640 retyping count paren star close paren. 2522 01:59:44,640 --> 01:59:49,530 So I just created an alias called n just to make my life easier or cleaner down 2523 01:59:49,530 --> 01:59:50,380 here. 2524 01:59:50,380 --> 01:59:55,680 So to be clear, the CS50 execute function returns a list of dictionaries 2525 01:59:55,680 --> 01:59:57,360 when you're using select. 2526 01:59:57,360 --> 02:00:01,500 And that is how I can now get back the first and only row 2527 02:00:01,500 --> 02:00:04,860 and then print out that row's end value. 2528 02:00:04,860 --> 02:00:07,380 It is identical to-- 2529 02:00:07,380 --> 02:00:08,650 let me do this. 2530 02:00:08,650 --> 02:00:11,610 Let me highlight this whole line of text. 2531 02:00:11,610 --> 02:00:15,120 Let me in my terminal window run SQLite3 of favorites.db 2532 02:00:15,120 --> 02:00:16,470 like we did before break. 2533 02:00:16,470 --> 02:00:19,170 Let me just copy paste this query. 2534 02:00:19,170 --> 02:00:20,040 Enter. 2535 02:00:20,040 --> 02:00:24,550 That's the table I got back earlier when we played with SQL manually. 2536 02:00:24,550 --> 02:00:28,170 And so when I get back this table, here's the key, here's the value, 2537 02:00:28,170 --> 02:00:31,860 and I only have one row, which is why I'm just blindly indexing 2538 02:00:31,860 --> 02:00:34,080 into rows bracket 0, because I know there's always 2539 02:00:34,080 --> 02:00:35,247 going to be an answer there. 2540 02:00:35,247 --> 02:00:36,690 It's going to be 0 or 1 or more. 2541 02:00:36,690 --> 02:00:40,360 But I know now it's going to be called n because of this here. 2542 02:00:40,360 --> 02:00:41,580 So what have I just done? 2543 02:00:41,580 --> 02:00:43,113 Well, this is SQL down here. 2544 02:00:43,113 --> 02:00:45,780 And this is just me being like a data scientist asking questions 2545 02:00:45,780 --> 02:00:48,640 about my data just using black and white SQL queries. 2546 02:00:48,640 --> 02:00:52,410 This is me now being a Python programmer who wants to talk to a SQL database 2547 02:00:52,410 --> 02:00:53,700 using Python. 2548 02:00:53,700 --> 02:00:56,820 And the bridge we're using happens to be the CS50 library. 2549 02:00:56,820 --> 02:00:59,940 But again, there's third party free libraries you can also use as well. 2550 02:00:59,940 --> 02:01:01,710 Ours is just very simple. 2551 02:01:01,710 --> 02:01:03,660 And indeed, the documentation will explain 2552 02:01:03,660 --> 02:01:07,113 how execute behaves a little differently for inserts, updates, and deletes. 2553 02:01:07,113 --> 02:01:09,780 You don't get back a list because you're not selecting anything, 2554 02:01:09,780 --> 02:01:13,050 but you do get back some return values. 2555 02:01:13,050 --> 02:01:15,990 Questions on this? 2556 02:01:15,990 --> 02:01:18,030 That's the last of our Python code. 2557 02:01:18,030 --> 02:01:23,320 That ties everything together in spirit. 2558 02:01:23,320 --> 02:01:24,768 Yeah? 2559 02:01:24,768 --> 02:01:29,460 AUDIENCE: [INAUDIBLE] 2560 02:01:29,460 --> 02:01:31,810 DAVID MALAN: This one here? 2561 02:01:31,810 --> 02:01:32,320 Yes. 2562 02:01:32,320 --> 02:01:37,558 So db.execute by definition returns a list of rows. 2563 02:01:37,558 --> 02:01:40,600 And each of those rows happens to be a dictionary because its convenient. 2564 02:01:40,600 --> 02:01:42,100 Key value pairs. 2565 02:01:42,100 --> 02:01:45,040 If I'm selecting the count of rows, I just 2566 02:01:45,040 --> 02:01:48,250 know from having learned SQL an hour ago that this is always 2567 02:01:48,250 --> 02:01:52,540 going to give me a single row whose column in this case is called n. 2568 02:01:52,540 --> 02:01:56,470 So if I know it's a single row, I can just blindly, just like in C, 2569 02:01:56,470 --> 02:02:00,160 go into that list or an array in C and go to the first location 2570 02:02:00,160 --> 02:02:02,590 and then treat that as the single row. 2571 02:02:02,590 --> 02:02:04,330 What you don't want to do is this. 2572 02:02:04,330 --> 02:02:08,500 Even if you the human know the query returns one row, 2573 02:02:08,500 --> 02:02:10,780 you can't just magically change the variable name 2574 02:02:10,780 --> 02:02:13,270 to be singular and expect to have only one value. 2575 02:02:13,270 --> 02:02:14,960 You will always have a list. 2576 02:02:14,960 --> 02:02:18,190 So even if there is only one value in it, it's up to you to do something 2577 02:02:18,190 --> 02:02:19,780 like this to get at it. 2578 02:02:19,780 --> 02:02:24,070 Or if you prefer more succinctness, you can do rows bracket I bracket n. 2579 02:02:24,070 --> 02:02:27,220 That'll achieve the same thing without a variable. 2580 02:02:27,220 --> 02:02:29,095 Yeah? 2581 02:02:29,095 --> 02:02:33,970 AUDIENCE: [INAUDIBLE] 2582 02:02:33,970 --> 02:02:34,720 DAVID MALAN: Good. 2583 02:02:34,720 --> 02:02:37,150 So I have been misleading this whole time 2584 02:02:37,150 --> 02:02:40,690 and cheating because this is only ever going to return Mario. 2585 02:02:40,690 --> 02:02:44,530 I'm ignoring the favorite that the human typed in here on line five. 2586 02:02:44,530 --> 02:02:45,770 So let me fix that. 2587 02:02:45,770 --> 02:02:49,000 And that's going to lead us to some of the problems that arise ultimately 2588 02:02:49,000 --> 02:02:49,780 with SQL. 2589 02:02:49,780 --> 02:02:53,320 The right way to solve that problem-- let me get rid of my terminal window 2590 02:02:53,320 --> 02:02:53,890 here. 2591 02:02:53,890 --> 02:02:56,170 The right way to solve this problem is not 2592 02:02:56,170 --> 02:02:59,170 to use an fstring like we did in Python generally, 2593 02:02:59,170 --> 02:03:02,680 because SQL queries, as we'll see in a moment, can be dangerous. 2594 02:03:02,680 --> 02:03:06,940 When you want to plug in users' data into a query 2595 02:03:06,940 --> 02:03:10,540 that you've written most of in advance, you should, you 2596 02:03:10,540 --> 02:03:15,640 must, you had better use a placeholder, namely a question mark in this case. 2597 02:03:15,640 --> 02:03:18,340 This is somewhat specific to CS50's library, 2598 02:03:18,340 --> 02:03:21,910 but we just borrowed the convention that every other library uses too. 2599 02:03:21,910 --> 02:03:25,660 In the world of SQL, single question marks are used as placeholders. 2600 02:03:25,660 --> 02:03:28,190 And the way you do this is as follows. 2601 02:03:28,190 --> 02:03:32,020 If you want to plug-in a value for that question mark, 2602 02:03:32,020 --> 02:03:37,180 just like in printf in C, you specify as a second or a third or fourth argument 2603 02:03:37,180 --> 02:03:39,320 all of the values you want plugged into this. 2604 02:03:39,320 --> 02:03:42,280 So in C weeks ago, we were using %s. 2605 02:03:42,280 --> 02:03:43,810 Same exact idea. 2606 02:03:43,810 --> 02:03:46,810 In SQL it's a question mark that you use instead. 2607 02:03:46,810 --> 02:03:49,480 This now, if I open back my terminal window 2608 02:03:49,480 --> 02:03:55,120 and I run Python of favorites.py, type in Mario, I should still get 39. 2609 02:03:55,120 --> 02:03:58,750 But now I can also type in Scratch perhaps and get 44 2610 02:03:58,750 --> 02:04:00,520 for that very first piece at 0. 2611 02:04:00,520 --> 02:04:02,900 And that one is even more popular here. 2612 02:04:02,900 --> 02:04:04,150 So this now is correct. 2613 02:04:04,150 --> 02:04:09,560 It would work to use an fstring here and then plug in a value like favorite 2614 02:04:09,560 --> 02:04:10,060 here. 2615 02:04:10,060 --> 02:04:11,977 But you'll see in just a moment don't do that. 2616 02:04:11,977 --> 02:04:15,640 You will expose yourself to potential hack or attacks 2617 02:04:15,640 --> 02:04:17,910 by trusting the user's input. 2618 02:04:17,910 --> 02:04:19,660 And so in fact, let's transition from that 2619 02:04:19,660 --> 02:04:21,610 to exactly some of these kinds of challenges, 2620 02:04:21,610 --> 02:04:24,020 namely two before we wrap up. 2621 02:04:24,020 --> 02:04:27,820 So in the world of SQL, especially when it's used at scale with the Twitters 2622 02:04:27,820 --> 02:04:30,610 and the Googles of the world, a lot of data 2623 02:04:30,610 --> 02:04:32,610 is probably coming into the database all at once 2624 02:04:32,610 --> 02:04:34,610 because multiple people are opening their phones 2625 02:04:34,610 --> 02:04:36,070 at the same time around the world. 2626 02:04:36,070 --> 02:04:37,690 They're clicking on the same links roughly 2627 02:04:37,690 --> 02:04:39,107 at the same time around the world. 2628 02:04:39,107 --> 02:04:42,235 When you have thousands of people all using your site at once, 2629 02:04:42,235 --> 02:04:44,110 order of operations is going to be important. 2630 02:04:44,110 --> 02:04:47,620 But unfortunately in SQL and in other contexts of computing, 2631 02:04:47,620 --> 02:04:50,150 there's this risk of what's known as a race condition. 2632 02:04:50,150 --> 02:04:53,710 So for instance, has anyone ever seen or liked this? 2633 02:04:53,710 --> 02:04:55,780 This is the world record egg. 2634 02:04:55,780 --> 02:04:58,467 Or it's this thing that was very popular a while back. 2635 02:04:58,467 --> 02:04:59,800 It's still kind of going strong. 2636 02:04:59,800 --> 02:05:03,340 But if you go to the Instagram profile for World Record Egg, 2637 02:05:03,340 --> 02:05:06,610 the goal was to make the most liked Instagram post ever. 2638 02:05:06,610 --> 02:05:08,000 And they did pretty well. 2639 02:05:08,000 --> 02:05:09,170 It's just this. 2640 02:05:09,170 --> 02:05:10,790 It's just a picture of an egg. 2641 02:05:10,790 --> 02:05:12,570 Now, at the height of the popularity, like 2642 02:05:12,570 --> 02:05:15,070 there might have been hundreds, thousands, tens of thousands 2643 02:05:15,070 --> 02:05:18,520 of people clicking pretty much at the same time on this egg. 2644 02:05:18,520 --> 02:05:20,830 So it actually creates a potential problem 2645 02:05:20,830 --> 02:05:23,470 with the integrity of Instagram's data. 2646 02:05:23,470 --> 02:05:24,010 Why? 2647 02:05:24,010 --> 02:05:26,302 Well, if you have all these requests coming in at once, 2648 02:05:26,302 --> 02:05:28,250 how do you possibly keep track of all of them 2649 02:05:28,250 --> 02:05:31,850 and update your counter in a way that can keep up with all of that traffic? 2650 02:05:31,850 --> 02:05:32,350 Why? 2651 02:05:32,350 --> 02:05:35,680 Well, let's just hypothesize what Meta, formerly Facebook, 2652 02:05:35,680 --> 02:05:38,860 was doing underneath the hood with Instagram if this were their code. 2653 02:05:38,860 --> 02:05:41,890 So suppose for the sake of discussion that Instagram servers 2654 02:05:41,890 --> 02:05:44,617 are using a mix of Python and SQL. 2655 02:05:44,617 --> 02:05:47,200 Probably not using the CS50 library, but they could absolutely 2656 02:05:47,200 --> 02:05:50,680 be using those two languages or two others together. 2657 02:05:50,680 --> 02:05:55,270 Suppose they do this in order to update the number of likes for that post. 2658 02:05:55,270 --> 02:06:00,160 They first execute a SQL query like select the current number of likes 2659 02:06:00,160 --> 02:06:03,580 from a table called posts where the idea of the post 2660 02:06:03,580 --> 02:06:08,530 equals whatever the unique identifier is for that specific egg in the table. 2661 02:06:08,530 --> 02:06:12,460 And then they store the result in this rows variable, just like I did. 2662 02:06:12,460 --> 02:06:14,830 And then they do this. 2663 02:06:14,830 --> 02:06:16,420 They create a variable called likes. 2664 02:06:16,420 --> 02:06:18,700 They set it equal to rows bracket 0. 2665 02:06:18,700 --> 02:06:21,280 So the very first row in the result set. 2666 02:06:21,280 --> 02:06:22,450 And they get the likes key. 2667 02:06:22,450 --> 02:06:25,030 So this is literally what I just did with the count. 2668 02:06:25,030 --> 02:06:27,880 Let me hypothesize that Instagram does something similar 2669 02:06:27,880 --> 02:06:29,330 with the total number of likes. 2670 02:06:29,330 --> 02:06:30,370 Why are they doing this? 2671 02:06:30,370 --> 02:06:34,600 Because they then want to execute a third line of code that executes 2672 02:06:34,600 --> 02:06:36,430 update the posts table. 2673 02:06:36,430 --> 02:06:41,920 Set the new number of likes equal to something where the idea of the post 2674 02:06:41,920 --> 02:06:43,330 equals this other thing. 2675 02:06:43,330 --> 02:06:46,870 Now, notice just like in printf there's the comma separated list of values. 2676 02:06:46,870 --> 02:06:50,560 They want to update the current number of likes from the current value 2677 02:06:50,560 --> 02:06:51,970 to the current value plus 1. 2678 02:06:51,970 --> 02:06:53,350 So it's likes plus 1. 2679 02:06:53,350 --> 02:06:55,640 And then we plug in the ID for this. 2680 02:06:55,640 --> 02:06:57,760 So suppose this is what Instagram is doing. 2681 02:06:57,760 --> 02:07:02,860 Unfortunately, whenever you execute multiple lines of code independently 2682 02:07:02,860 --> 02:07:06,260 and you're so popular like Instagram that you have thousands, 2683 02:07:06,260 --> 02:07:08,530 hundreds of thousands of servers potentially, 2684 02:07:08,530 --> 02:07:11,470 it is quite possible that if you and I and everyone else in the room 2685 02:07:11,470 --> 02:07:13,630 clicks that egg at the same time, it's not 2686 02:07:13,630 --> 02:07:17,093 going to be the case statistically that three lines of code are executed for me 2687 02:07:17,093 --> 02:07:19,510 and then three lines for you and then three lines for you. 2688 02:07:19,510 --> 02:07:21,490 They're probably going to get interspersed. 2689 02:07:21,490 --> 02:07:24,430 This gets executed for me and then this gets executed for you 2690 02:07:24,430 --> 02:07:27,520 and then they get back to doing work for me and so forth just to kind 2691 02:07:27,520 --> 02:07:30,820 of multitask, just like a human might, but at a super speed here. 2692 02:07:30,820 --> 02:07:33,590 The problem, though, is if these lines of code 2693 02:07:33,590 --> 02:07:35,838 get interrupted, what could go wrong? 2694 02:07:35,838 --> 02:07:38,630 Well, suppose that Carter and I both click the egg at the same time 2695 02:07:38,630 --> 02:07:41,780 and suppose the current number of likes back in the day is 100. 2696 02:07:41,780 --> 02:07:45,150 That stores in this variable the value 100. 2697 02:07:45,150 --> 02:07:49,010 But if we click so close in time, we might get back the same answer 2698 02:07:49,010 --> 02:07:50,300 to this select query. 2699 02:07:50,300 --> 02:07:54,380 As of that moment in time when David and Carter clicked, it had 100 likes. 2700 02:07:54,380 --> 02:07:58,220 But then this last line of code is executed for me and then maybe Carter. 2701 02:07:58,220 --> 02:08:03,380 Because that answer, the state of the database, was stored in this variable, 2702 02:08:03,380 --> 02:08:08,870 then both Carter and I will result in this line of code 2703 02:08:08,870 --> 02:08:10,610 being executed with the same value. 2704 02:08:10,610 --> 02:08:17,480 Update the post table setting the likes equal to 101 for that post's ID. 2705 02:08:17,480 --> 02:08:18,080 Why? 2706 02:08:18,080 --> 02:08:21,950 Because again, if each of these lines of code running on different servers 2707 02:08:21,950 --> 02:08:25,213 are checking the value of the current number of likes 2708 02:08:25,213 --> 02:08:28,130 but then getting interrupted because Carter clicked the darn thing too 2709 02:08:28,130 --> 02:08:30,350 and then resuming their work on my behalf, 2710 02:08:30,350 --> 02:08:33,890 we might have a race condition where the code is sort of racing to finish 2711 02:08:33,890 --> 02:08:36,320 but getting interrupted by other users' clicks. 2712 02:08:36,320 --> 02:08:38,720 And the problem with that is that if you are 2713 02:08:38,720 --> 02:08:42,710 inspecting the value of some variable, or in this case a database cell, 2714 02:08:42,710 --> 02:08:45,650 and making a decision based on it, like how to update it, 2715 02:08:45,650 --> 02:08:46,995 you might now lose data. 2716 02:08:46,995 --> 02:08:50,120 And Instagram is probably not good for advertising if they're losing likes. 2717 02:08:50,120 --> 02:08:54,620 And so that's probably a problem not to retain the value 102 2718 02:08:54,620 --> 02:08:57,230 and instead insert the number 101 twice. 2719 02:08:57,230 --> 02:09:00,500 It's actually similar in spirit to a story that 2720 02:09:00,500 --> 02:09:05,960 was told in a databases course I took myself years ago whereby-- 2721 02:09:05,960 --> 02:09:08,900 it's somewhat analogous to kind of a contrived scenario 2722 02:09:08,900 --> 02:09:10,070 involving a refrigerator. 2723 02:09:10,070 --> 02:09:12,380 And this is the closest thing to a refrigerator we could get on stage. 2724 02:09:12,380 --> 02:09:14,420 But imagine you've got one of these little dorm fridges 2725 02:09:14,420 --> 02:09:16,040 in your dorm too and your roommate. 2726 02:09:16,040 --> 02:09:19,102 And maybe both of you, as the story was told to me, really like milk. 2727 02:09:19,102 --> 02:09:21,560 And one of you is at class, but the other of you comes home 2728 02:09:21,560 --> 02:09:25,500 and you open your dorm fridge and you're like, oh darn it, we're out of milk. 2729 02:09:25,500 --> 02:09:27,020 And so you close the fridge. 2730 02:09:27,020 --> 02:09:29,720 You walk across the street to CVS or some other store 2731 02:09:29,720 --> 02:09:31,490 and you get in line to buy some milk. 2732 02:09:31,490 --> 02:09:33,020 Meanwhile, your roommate gets out of class. 2733 02:09:33,020 --> 02:09:34,160 They come back to your dorm room. 2734 02:09:34,160 --> 02:09:35,702 They're really thirsty for some milk. 2735 02:09:35,702 --> 02:09:37,010 They open up the fridge. 2736 02:09:37,010 --> 02:09:38,510 They say, oh, we're out of milk. 2737 02:09:38,510 --> 02:09:41,630 And then they take a different route perhaps to CVS 2738 02:09:41,630 --> 02:09:44,300 or some other store nearby, get in line to buy some milk. 2739 02:09:44,300 --> 02:09:48,260 Fast forward some amount of time in this very contrived story and what happens? 2740 02:09:48,260 --> 02:09:50,570 Oh damn it, we now ended up with two gallons of milk 2741 02:09:50,570 --> 02:09:54,480 and there's no way we can fit gallons of milk in there, let alone two of them. 2742 02:09:54,480 --> 02:09:56,000 So that's a problem. 2743 02:09:56,000 --> 02:10:00,000 But what's the relationship to this here? 2744 02:10:00,000 --> 02:10:02,693 Well both of us, yeah, did what? 2745 02:10:02,693 --> 02:10:07,523 AUDIENCE: [INAUDIBLE] 2746 02:10:07,523 --> 02:10:16,685 2747 02:10:16,685 --> 02:10:17,560 DAVID MALAN: Exactly. 2748 02:10:17,560 --> 02:10:19,925 AUDIENCE: [INAUDIBLE] 2749 02:10:19,925 --> 02:10:22,965 2750 02:10:22,965 --> 02:10:23,840 DAVID MALAN: Exactly. 2751 02:10:23,840 --> 02:10:26,480 So to summarize, both of us had a very similar thought process, 2752 02:10:26,480 --> 02:10:28,820 made a similar decision based on the same information, 2753 02:10:28,820 --> 02:10:31,760 not realizing that the information, the fridge, 2754 02:10:31,760 --> 02:10:34,600 was in the process of being updated. 2755 02:10:34,600 --> 02:10:37,100 And of course, in the Instagram world, it happens like this. 2756 02:10:37,100 --> 02:10:39,300 In the fridge world, it might take a few minutes. 2757 02:10:39,300 --> 02:10:41,840 But the problem is ultimately the result of our 2758 02:10:41,840 --> 02:10:46,070 having made a decision about the state of the world and the state of the world 2759 02:10:46,070 --> 02:10:48,600 was in the middle of being updated. 2760 02:10:48,600 --> 02:10:50,333 The queries got comingled with others. 2761 02:10:50,333 --> 02:10:53,000 Or, in this case, someone was already on their way to the store. 2762 02:10:53,000 --> 02:10:54,960 So what's the solution in the real world? 2763 02:10:54,960 --> 02:10:59,000 Well, you could very simply take a post it note and put like gone for milk 2764 02:10:59,000 --> 02:11:02,060 so as to communicate to your roommate that they should not 2765 02:11:02,060 --> 02:11:04,620 inspect the value of that variable and make a decision on it. 2766 02:11:04,620 --> 02:11:05,120 Why? 2767 02:11:05,120 --> 02:11:08,958 Because it's not yet consistent with the outcome that's about to happen. 2768 02:11:08,958 --> 02:11:11,000 You could be more dramatic and you could actually 2769 02:11:11,000 --> 02:11:13,580 lock the fridge somehow, put a padlock around it 2770 02:11:13,580 --> 02:11:15,470 or the like so they can't even get in there. 2771 02:11:15,470 --> 02:11:18,510 And that would achieve the same effect too. 2772 02:11:18,510 --> 02:11:22,400 And that is actually pretty much the solution to this problem in code too. 2773 02:11:22,400 --> 02:11:23,370 It's not safe. 2774 02:11:23,370 --> 02:11:27,110 It's not sufficient to only execute three lines of code like this. 2775 02:11:27,110 --> 02:11:32,000 Rather, what you probably want to do is use additional SQL keywords 2776 02:11:32,000 --> 02:11:35,270 that we won't spend much time on in the class itself, but these. 2777 02:11:35,270 --> 02:11:37,160 There are solutions to this problem. 2778 02:11:37,160 --> 02:11:39,320 You can begin what's called a transaction 2779 02:11:39,320 --> 02:11:43,460 and you can more explicitly commit to making a decision, 2780 02:11:43,460 --> 02:11:46,100 like updating the database to 101 or 102. 2781 02:11:46,100 --> 02:11:50,120 Or if you realize, wait a minute, Carter's query is interrupting mine. 2782 02:11:50,120 --> 02:11:53,540 Let me roll back to the previous state and just rewind. 2783 02:11:53,540 --> 02:11:54,290 Let me undo. 2784 02:11:54,290 --> 02:11:55,840 Control Z, if you will. 2785 02:11:55,840 --> 02:11:58,340 There's also another keyword that's not so much used anymore 2786 02:11:58,340 --> 02:11:59,600 in SQL which is locking. 2787 02:11:59,600 --> 02:12:01,170 You could literally back in the day. 2788 02:12:01,170 --> 02:12:05,690 Lock the entire database table, preventing anyone from updating it 2789 02:12:05,690 --> 02:12:09,320 or making changes or even reading it while someone else was accessing it. 2790 02:12:09,320 --> 02:12:12,320 That was a very heavy handed solution because it slowed everything down. 2791 02:12:12,320 --> 02:12:15,710 But in short, transactions are now a feature of SQL 2792 02:12:15,710 --> 02:12:17,840 that you won't necessarily need to use yourselves 2793 02:12:17,840 --> 02:12:22,490 that do solve this problem by doing the equivalent of saying while David's 2794 02:12:22,490 --> 02:12:26,360 like counter is in the process of being updated, keep Carter at bay, 2795 02:12:26,360 --> 02:12:29,510 ideally briefly, and then let his data go through too. 2796 02:12:29,510 --> 02:12:32,240 It's equivalent too to putting a note or a lock on the fridge. 2797 02:12:32,240 --> 02:12:33,920 And indeed, I mean lock literally. 2798 02:12:33,920 --> 02:12:36,380 They were once upon a time called and still 2799 02:12:36,380 --> 02:12:41,480 are in some contexts called locks on databases too. 2800 02:12:41,480 --> 02:12:44,730 And the code for which you might do this is almost the same. 2801 02:12:44,730 --> 02:12:51,860 You simply wrap the three queries with a transaction statement and a commit. 2802 02:12:51,860 --> 02:12:57,800 And the term of art here is that this makes your statements atomic. 2803 02:12:57,800 --> 02:13:01,230 So atomic means they're either all executed or not at all. 2804 02:13:01,230 --> 02:13:04,340 That is they're all very tightly coupled together without interruption. 2805 02:13:04,340 --> 02:13:08,480 Transactions solves that problem and avoid having two gallons of milk. 2806 02:13:08,480 --> 02:13:11,360 And the last problem that arises that is tragically 2807 02:13:11,360 --> 02:13:14,510 so darn common in the real world today is 2808 02:13:14,510 --> 02:13:16,173 what's called a SQL injection attack. 2809 02:13:16,173 --> 02:13:18,590 And it's what I alluded to earlier with the question mark. 2810 02:13:18,590 --> 02:13:20,298 So suppose you're in the habit of logging 2811 02:13:20,298 --> 02:13:23,240 into Yale websites with your net ID or password 2812 02:13:23,240 --> 02:13:25,950 or at Harvard, your Harvard key and password as well. 2813 02:13:25,950 --> 02:13:28,370 Suppose for the sake of discussion that the people that 2814 02:13:28,370 --> 02:13:32,360 implemented Harvard key log in allow you to type in your email address, 2815 02:13:32,360 --> 02:13:33,680 of course, and your password. 2816 02:13:33,680 --> 02:13:37,160 But suppose that they are using SQL underneath the hood 2817 02:13:37,160 --> 02:13:39,770 to check your username and password to make sure 2818 02:13:39,770 --> 02:13:43,400 that you are David Malan or Carter Zenke or whoever you claim to be. 2819 02:13:43,400 --> 02:13:48,770 I haven't shown you the syntax yet, but it turns out that in SQL, -- 2820 02:13:48,770 --> 02:13:51,410 is a special way of indicating a comment. 2821 02:13:51,410 --> 02:13:53,100 It means ignore everything to the right. 2822 02:13:53,100 --> 02:13:57,080 So it's just like // in C or the hash symbol in Python. 2823 02:13:57,080 --> 02:13:59,430 -- just means ignore everything to the right. 2824 02:13:59,430 --> 02:14:01,440 And we've, of course, seen single quotes. 2825 02:14:01,440 --> 02:14:04,730 So one way to wage a SQL injection attack 2826 02:14:04,730 --> 02:14:08,870 is to try to inject malicious SQL code into someone else's database 2827 02:14:08,870 --> 02:14:10,220 without them realizing it. 2828 02:14:10,220 --> 02:14:11,220 How do you do this? 2829 02:14:11,220 --> 02:14:15,492 Well, suppose I log in as malan@harvard.edu single quote dash, 2830 02:14:15,492 --> 02:14:16,316 dash. 2831 02:14:16,316 --> 02:14:20,510 I'm not double quoting anything clearly and there's nothing to the right 2832 02:14:20,510 --> 02:14:21,770 of the -- 2833 02:14:21,770 --> 02:14:23,150 anyway. 2834 02:14:23,150 --> 02:14:25,500 But this imbalance is going to be useful. 2835 02:14:25,500 --> 02:14:26,000 Why? 2836 02:14:26,000 --> 02:14:29,630 Because if I'm a hacker and I'm presuming someone at Harvard probably 2837 02:14:29,630 --> 02:14:33,680 is using single quotes to wrap the user's email address 2838 02:14:33,680 --> 02:14:38,360 and wrap the user's password, what if I try to complete their thought for them 2839 02:14:38,360 --> 02:14:40,370 and close one of those quotes for them? 2840 02:14:40,370 --> 02:14:41,850 What might happen? 2841 02:14:41,850 --> 02:14:43,220 Well, we could do this. 2842 02:14:43,220 --> 02:14:46,250 Here for instance, let me hypothesize is the code that Harvard wrote, 2843 02:14:46,250 --> 02:14:48,050 hopefully not, underneath the hood. 2844 02:14:48,050 --> 02:14:51,500 So they're using CS50's library in Python and they're using SQL inside. 2845 02:14:51,500 --> 02:14:54,350 Suppose that they have a query like this. 2846 02:14:54,350 --> 02:15:00,380 Select star from users where username equals question mark and password 2847 02:15:00,380 --> 02:15:01,315 equals question mark. 2848 02:15:01,315 --> 02:15:04,190 And then suppose they just plug in whatever username and password was 2849 02:15:04,190 --> 02:15:05,070 typed in. 2850 02:15:05,070 --> 02:15:08,090 And then if they get back some number of rows dot dot dot, 2851 02:15:08,090 --> 02:15:09,410 they assume I am David. 2852 02:15:09,410 --> 02:15:12,200 They assume Carter is Carter if both the username and password are 2853 02:15:12,200 --> 02:15:12,890 in the database. 2854 02:15:12,890 --> 02:15:14,480 Just end of story there. 2855 02:15:14,480 --> 02:15:15,680 This is good. 2856 02:15:15,680 --> 02:15:18,750 This has the question mark placeholder, as we discussed earlier. 2857 02:15:18,750 --> 02:15:21,140 But what if you don't quite remember that? 2858 02:15:21,140 --> 02:15:24,670 You don't quite take that to heart and you use your more familiar last week 2859 02:15:24,670 --> 02:15:28,690 fstrings whereby we use these curly braces to plug in values. 2860 02:15:28,690 --> 02:15:30,280 What if you do this instead? 2861 02:15:30,280 --> 02:15:31,900 So it's almost the same idea. 2862 02:15:31,900 --> 02:15:34,950 It's still db execute but now it's select star from users 2863 02:15:34,950 --> 02:15:36,120 where username equals. 2864 02:15:36,120 --> 02:15:39,570 And now notice I'm doing the single quotes, which is required by SQL, 2865 02:15:39,570 --> 02:15:41,850 but I'm using fstrings with the curly braces. 2866 02:15:41,850 --> 02:15:46,530 And the password equals single quote password and then close single quote. 2867 02:15:46,530 --> 02:15:50,910 The problem is if you're just blindly pasting effectively 2868 02:15:50,910 --> 02:15:55,530 the user's input into that web form into the username field and the password 2869 02:15:55,530 --> 02:15:59,890 field, there's nothing stopping a malicious user, student, faculty, 2870 02:15:59,890 --> 02:16:03,090 staff from including a single quote in their name. 2871 02:16:03,090 --> 02:16:07,470 Or maybe even benevolently if their name happens 2872 02:16:07,470 --> 02:16:10,440 to have a single quote, as some last names in particular do. 2873 02:16:10,440 --> 02:16:12,230 So this is very fragile. 2874 02:16:12,230 --> 02:16:12,730 Why? 2875 02:16:12,730 --> 02:16:15,090 Well, suppose that if we plug in my malicious, 2876 02:16:15,090 --> 02:16:18,030 malan@harvard.edu single quote -- 2877 02:16:18,030 --> 02:16:20,430 notice what happens to username here. 2878 02:16:20,430 --> 02:16:23,940 The username variable inside of the curly quotes 2879 02:16:23,940 --> 02:16:26,340 will get replaced with this. 2880 02:16:26,340 --> 02:16:29,880 And notice single quote, which the Harvard programmer wrote, 2881 02:16:29,880 --> 02:16:35,969 malan@harvard.edu single quote which I wrote -- 2882 02:16:35,969 --> 02:16:40,170 which I wrote single quote which Harvard wrote and whatever else 2883 02:16:40,170 --> 02:16:41,160 they want after that. 2884 02:16:41,160 --> 02:16:43,524 What's the implication, though, of the dash, dash? 2885 02:16:43,524 --> 02:16:46,722 2886 02:16:46,722 --> 02:16:48,680 Everything to the right is going to be ignored. 2887 02:16:48,680 --> 02:16:51,730 So the password is never even checked in this scenario. 2888 02:16:51,730 --> 02:16:55,855 I'm tricking the server into ignoring everything after the -- 2889 02:16:55,855 --> 02:17:00,650 but I have constructed very cleverly, very maliciously a syntactically valid 2890 02:17:00,650 --> 02:17:01,150 query. 2891 02:17:01,150 --> 02:17:01,650 Why? 2892 02:17:01,650 --> 02:17:03,580 Because I provided the single quote that's 2893 02:17:03,580 --> 02:17:06,670 going to finish the thought of that first single quote. 2894 02:17:06,670 --> 02:17:09,219 And now I would only know how to do this if I saw the code 2895 02:17:09,219 --> 02:17:12,209 or if I just randomly try putting apostrophes into web forms 2896 02:17:12,209 --> 02:17:13,209 and see if things break. 2897 02:17:13,209 --> 02:17:15,215 That's often how adversaries attack systems. 2898 02:17:15,215 --> 02:17:17,590 They type in potentially dangerous characters, hit Enter. 2899 02:17:17,590 --> 02:17:20,510 If something breaks, they're not necessarily into the system, 2900 02:17:20,510 --> 02:17:22,757 but they know that there might be a vulnerability. 2901 02:17:22,757 --> 02:17:25,340 And then they start trying more methodically things like this. 2902 02:17:25,340 --> 02:17:27,850 So this then is going to be bad, because it effectively 2903 02:17:27,850 --> 02:17:29,500 grays out the rest of the query. 2904 02:17:29,500 --> 02:17:32,889 And this query is surely going to return some rows 2905 02:17:32,889 --> 02:17:34,570 without even knowing my password. 2906 02:17:34,570 --> 02:17:37,450 And so this logic here dot dot dot means, well, 2907 02:17:37,450 --> 02:17:39,850 if a data came back from this query, Harvard 2908 02:17:39,850 --> 02:17:43,480 is presumably going to assume that Malan logged in. 2909 02:17:43,480 --> 02:17:47,840 Show him his account or whatever is being protected here. 2910 02:17:47,840 --> 02:17:50,559 So in short, using fstrings bad. 2911 02:17:50,559 --> 02:17:54,280 Using any equivalent like %s in C, bad. 2912 02:17:54,280 --> 02:17:56,559 When it comes to SQL, using question marks 2913 02:17:56,559 --> 02:18:00,280 or whatever a third party library like CS50 prescribes 2914 02:18:00,280 --> 02:18:02,110 is the way to solve this. 2915 02:18:02,110 --> 02:18:02,620 Why? 2916 02:18:02,620 --> 02:18:07,540 Because libraries like ours are designed to at least be smart and be paranoid. 2917 02:18:07,540 --> 02:18:09,740 And what we will do is this. 2918 02:18:09,740 --> 02:18:12,580 When you use the question marks and the values are plugged in, 2919 02:18:12,580 --> 02:18:16,660 we will escape any potentially dangerous characters 2920 02:18:16,660 --> 02:18:18,520 inside of those placeholders. 2921 02:18:18,520 --> 02:18:21,580 And so effectively, the single quote will no longer 2922 02:18:21,580 --> 02:18:23,469 be considered a grammatical single quote. 2923 02:18:23,469 --> 02:18:26,870 It will just be literally a character in the username or password. 2924 02:18:26,870 --> 02:18:30,040 So the library takes care of this for you 2925 02:18:30,040 --> 02:18:33,500 because you're plugging in the username and password as separate arguments. 2926 02:18:33,500 --> 02:18:37,360 And then we or the third party you're using actually sanitize. 2927 02:18:37,360 --> 02:18:41,510 That is clean up the data and prevent those bad characters. 2928 02:18:41,510 --> 02:18:44,650 Now, this is kind of an internet meme that went around for a while. 2929 02:18:44,650 --> 02:18:47,320 If you've ever driven a car or been in a car 2930 02:18:47,320 --> 02:18:49,930 where there's the automatic readers for tolls. 2931 02:18:49,930 --> 02:18:52,969 This person thought it might be funny to try doing something like this. 2932 02:18:52,969 --> 02:18:54,309 What are they presumably doing? 2933 02:18:54,309 --> 02:18:57,520 The presumption here is, whether or not it worked is unclear, 2934 02:18:57,520 --> 02:19:01,570 is that here's the end of actual license plate number, 2935 02:19:01,570 --> 02:19:03,903 but here's an interesting single quote and a semicolon. 2936 02:19:03,903 --> 02:19:06,070 That's especially bad because it means you can maybe 2937 02:19:06,070 --> 02:19:08,059 execute a second query on the database. 2938 02:19:08,059 --> 02:19:11,170 This is someone having fun trying to drop the entire database 2939 02:19:11,170 --> 02:19:16,090 table for whatever municipality is scanning through cameras their license 2940 02:19:16,090 --> 02:19:16,750 plate code. 2941 02:19:16,750 --> 02:19:19,150 And I would be remiss if we didn't end on this note. 2942 02:19:19,150 --> 02:19:22,240 At least in computer science circles, there 2943 02:19:22,240 --> 02:19:26,920 is someone named, no relation to the TF name we put in the database earlier, 2944 02:19:26,920 --> 02:19:31,660 little Bobby Tables, which ends with this XKCD comic. 2945 02:19:31,660 --> 02:19:36,340 And if you chuckle, if you laugh, you're now legit SQL programmers. 2946 02:19:36,340 --> 02:19:39,049 2947 02:19:39,049 --> 02:19:42,030 Nice, nice. 2948 02:19:42,030 --> 02:19:44,740 Every CS student out there knows about little Bobby Table. 2949 02:19:44,740 --> 02:19:47,657 So if you name drop little Bobby Tables now, you're in. 2950 02:19:47,657 --> 02:19:49,240 All right, that's it though for today. 2951 02:19:49,240 --> 02:19:51,780 We will see you next time. 2952 02:19:51,780 --> 02:19:55,130 [MUSIC PLAYING] 2953 02:19:55,130 --> 02:20:28,000245154

Can't find what you're looking for?
Get subtitles in any language from opensubtitles.com, and translate them here.