All language subtitles for cdn.cs50.net_2022_fall_lectures_7_lang_en_lecture7
Afrikaans
Akan
Albanian
Amharic
Arabic
Armenian
Azerbaijani
Basque
Belarusian
Bemba
Bengali
Bihari
Bosnian
Breton
Bulgarian
Cambodian
Catalan
Cebuano
Cherokee
Chichewa
Chinese (Simplified)
Chinese (Traditional)
Corsican
Croatian
Czech
Danish
Dutch
English
Esperanto
Estonian
Ewe
Faroese
Filipino
Finnish
French
Frisian
Ga
Galician
Georgian
German
Greek
Guarani
Gujarati
Haitian Creole
Hausa
Hawaiian
Hebrew
Hindi
Hmong
Hungarian
Icelandic
Igbo
Indonesian
Interlingua
Irish
Italian
Japanese
Javanese
Kannada
Kazakh
Kinyarwanda
Kirundi
Kongo
Korean
Krio (Sierra Leone)
Kurdish
Kurdish (SoranĂ®)
Kyrgyz
Laothian
Latin
Latvian
Lingala
Lithuanian
Lozi
Luganda
Luo
Luxembourgish
Macedonian
Malagasy
Malay
Malayalam
Maltese
Maori
Marathi
Mauritian Creole
Moldavian
Mongolian
Myanmar (Burmese)
Montenegrin
Nepali
Nigerian Pidgin
Northern Sotho
Norwegian
Norwegian (Nynorsk)
Occitan
Oriya
Oromo
Pashto
Persian
Polish
Portuguese (Brazil)
Portuguese (Portugal)
Punjabi
Quechua
Romanian
Romansh
Runyakitara
Russian
Samoan
Scots Gaelic
Serbian
Serbo-Croatian
Sesotho
Setswana
Seychellois Creole
Shona
Sindhi
Sinhalese
Slovak
Slovenian
Somali
Spanish
Spanish (Latin American)
Sundanese
Swahili
Swedish
Tajik
Tamil
Tatar
Telugu
Thai
Tigrinya
Tonga
Tshiluba
Tumbuka
Turkish
Turkmen
Twi
Uighur
Ukrainian
Urdu
Uzbek
Vietnamese
Welsh
Wolof
Xhosa
Yiddish
Yoruba
Zulu
Would you like to inspect the original subtitles? These are the user uploaded subtitles that are being translated:
1
00:00:00,000 --> 00:00:02,988
[MUSIC PLAYING]
2
00:00:02,988 --> 00:01:12,582
3
00:01:12,582 --> 00:01:13,540
DAVID MALAN: All right.
4
00:01:13,540 --> 00:01:17,150
This is CS50 and this
is already week seven.
5
00:01:17,150 --> 00:01:20,240
And this is the week where
we'll continue where we left off
6
00:01:20,240 --> 00:01:22,430
with Python, introducing
you to a bit more syntax
7
00:01:22,430 --> 00:01:25,580
and capabilities of the language so
you can solve interesting problems.
8
00:01:25,580 --> 00:01:27,580
But a lot of those problems
increasingly are now
9
00:01:27,580 --> 00:01:29,180
going to involve data in some form.
10
00:01:29,180 --> 00:01:31,250
After all, if you think
of most any website
11
00:01:31,250 --> 00:01:36,320
or mobile app or process nowadays
that involves solving problems,
12
00:01:36,320 --> 00:01:39,530
it almost always involves some amount
of data and often data at scale.
13
00:01:39,530 --> 00:01:40,680
Lots and lots of data.
14
00:01:40,680 --> 00:01:42,430
And so what we're going
to see first today
15
00:01:42,430 --> 00:01:45,710
is that, yes, you can use Python to
solve all the problems past that we've
16
00:01:45,710 --> 00:01:48,327
seen and also some data specific
ones, but sometimes it's
17
00:01:48,327 --> 00:01:49,410
just going to be annoying.
18
00:01:49,410 --> 00:01:50,420
It's going to be a little painful.
19
00:01:50,420 --> 00:01:54,060
It's just going to be more work than you
might like to just get to some answer.
20
00:01:54,060 --> 00:01:56,870
And so today we'll too
introduce you to a new language
21
00:01:56,870 --> 00:01:59,240
called SQL, Structured Query Language.
22
00:01:59,240 --> 00:02:02,780
And this is a language that rest
assured is actually much smaller,
23
00:02:02,780 --> 00:02:05,240
relatively speaking, than C and Python.
24
00:02:05,240 --> 00:02:07,860
It sort of does less, but
it doesn't really well.
25
00:02:07,860 --> 00:02:11,240
And it's a language for querying
databases, storing data in it,
26
00:02:11,240 --> 00:02:13,950
updating it, inserting it,
deleting it, and so much more.
27
00:02:13,950 --> 00:02:16,160
And it's the kind of
technology that's used nowadays
28
00:02:16,160 --> 00:02:19,700
in, indeed, web apps and mobile
apps, data science, analytics,
29
00:02:19,700 --> 00:02:20,940
and so much more.
30
00:02:20,940 --> 00:02:23,600
It's really good at storing
lots and lots of data.
31
00:02:23,600 --> 00:02:25,430
Now, this is yet another language.
32
00:02:25,430 --> 00:02:27,800
And believe it or not, next
week we'll introduce you
33
00:02:27,800 --> 00:02:31,880
to three more languages, HTML and
CSS, which are not technically
34
00:02:31,880 --> 00:02:32,900
programming languages.
35
00:02:32,900 --> 00:02:35,630
They're all about aesthetics
and markup of information.
36
00:02:35,630 --> 00:02:38,660
But also JavaScript, which is,
in fact, a programming language.
37
00:02:38,660 --> 00:02:41,630
But the goals here in
CS50 really are going
38
00:02:41,630 --> 00:02:44,000
to be to empower you to
program more generally.
39
00:02:44,000 --> 00:02:47,000
And indeed, when you're out there in
the real world some years from now,
40
00:02:47,000 --> 00:02:50,330
invariably there's going to be some
new other popular language out there.
41
00:02:50,330 --> 00:02:53,880
And hopefully in this week and next
week and beyond, among the goals
42
00:02:53,880 --> 00:02:56,630
is not just to teach you these
languages specifically but again,
43
00:02:56,630 --> 00:02:59,330
how to teach yourself the future
languages that we've not even
44
00:02:59,330 --> 00:03:00,480
heard about just yet.
45
00:03:00,480 --> 00:03:04,220
So with that said, let's
begin with a survey of sorts.
46
00:03:04,220 --> 00:03:10,190
If you go to this URL on your
phone or laptop, cs50.ly/favorites,
47
00:03:10,190 --> 00:03:14,390
a very simple Google Form awaits you
that's just going to ask you a couple
48
00:03:14,390 --> 00:03:16,530
of multiple choice questions.
49
00:03:16,530 --> 00:03:24,800
So go to cs50.ly/favorites, and that
should lead you to a Google Form that
50
00:03:24,800 --> 00:03:27,590
looks a little something
like this asking you first,
51
00:03:27,590 --> 00:03:31,700
as of now in week seven, what is your
favorite language among those options
52
00:03:31,700 --> 00:03:32,250
here.
53
00:03:32,250 --> 00:03:34,220
And then further down,
one more question.
54
00:03:34,220 --> 00:03:39,170
If you think back on problem sets
0 through 6, what was, if any,
55
00:03:39,170 --> 00:03:42,080
your favorite problem set problem?
56
00:03:42,080 --> 00:03:45,110
Be it in Scratch or C or Python.
57
00:03:45,110 --> 00:03:48,650
So answer those two questions.
58
00:03:48,650 --> 00:03:52,820
And in a moment, I'll flip over to
my screen here where you'll see,
59
00:03:52,820 --> 00:03:56,570
and anyone who's used Google Forms
knows, the spreadsheet that's
60
00:03:56,570 --> 00:03:58,130
collecting now this data.
61
00:03:58,130 --> 00:04:01,060
Microsoft Office 365 can do the
same if you use one of those forms.
62
00:04:01,060 --> 00:04:02,810
And what you see here
now is a spreadsheet
63
00:04:02,810 --> 00:04:06,770
in Google Sheets enumerating
all of the audience's questions.
64
00:04:06,770 --> 00:04:09,980
Language is in column B,
problem is in column C,
65
00:04:09,980 --> 00:04:13,190
and each row represents one
student who has responded.
66
00:04:13,190 --> 00:04:17,360
A few of you were super eager for
class today at 8:33 AM Eastern time.
67
00:04:17,360 --> 00:04:19,040
10:32, 11:10.
68
00:04:19,040 --> 00:04:21,779
OK, so now we're getting into
the actual class time here.
69
00:04:21,779 --> 00:04:24,830
And if I scroll down, we'll
probably see a few dozen,
70
00:04:24,830 --> 00:04:26,900
a couple of hundred answers by now.
71
00:04:26,900 --> 00:04:29,225
And yeah, so we're getting
a whole lot of answers here.
72
00:04:29,225 --> 00:04:31,100
And I'm seeing some
patterns emerge, but it's
73
00:04:31,100 --> 00:04:34,190
not necessarily obvious to the
human eyes what those patterns are.
74
00:04:34,190 --> 00:04:36,320
Now of course, you can
use Google Spreadsheets.
75
00:04:36,320 --> 00:04:40,143
You can highlight the data and you
can create charts magically out of it.
76
00:04:40,143 --> 00:04:42,560
But you can only do what Google
lets you do with the data.
77
00:04:42,560 --> 00:04:45,350
And same thing for Microsoft
Excel or Apple Numbers.
78
00:04:45,350 --> 00:04:48,230
But wouldn't it be nice to just be
able to manipulate the raw data,
79
00:04:48,230 --> 00:04:51,740
relatively simple though it is, to
just answer questions about the data?
80
00:04:51,740 --> 00:04:54,260
Maybe long term create your
own charts, customize it
81
00:04:54,260 --> 00:04:56,240
just the way you want
rather than beholden
82
00:04:56,240 --> 00:04:59,250
to software that's off
the shelf like this.
83
00:04:59,250 --> 00:05:02,130
Well, how could we go about doing this?
84
00:05:02,130 --> 00:05:05,330
Well, let me propose that
we treat this data set now
85
00:05:05,330 --> 00:05:08,790
as what we're going to call
for now a flat file database.
86
00:05:08,790 --> 00:05:10,850
We'll see today that
there's fancier databases,
87
00:05:10,850 --> 00:05:15,200
but the simplest database in the
world is really just like a .csv file.
88
00:05:15,200 --> 00:05:18,500
And we saw that a couple of weeks
ago in C. We wrote a bit of C code
89
00:05:18,500 --> 00:05:23,660
that used fprintf to write data to a
file using commas as the separator.
90
00:05:23,660 --> 00:05:25,910
We didn't really do much
more with CSVs at the time
91
00:05:25,910 --> 00:05:28,610
though, because it's really
annoying, painful, time consuming,
92
00:05:28,610 --> 00:05:32,840
not fun to use C for something like that
because of malloc and memory and all
93
00:05:32,840 --> 00:05:33,413
that stuff.
94
00:05:33,413 --> 00:05:35,330
But with Python, it's
going to be much easier.
95
00:05:35,330 --> 00:05:39,530
And so any time you have access to some
data set where you can just download it
96
00:05:39,530 --> 00:05:42,140
to your own Mac or PC or
your cloud environment,
97
00:05:42,140 --> 00:05:46,460
it's sort of a candidate for now writing
code to do something with the data.
98
00:05:46,460 --> 00:05:48,120
Maybe analyze it right away.
99
00:05:48,120 --> 00:05:50,150
If it's been human imported
manually, maybe you
100
00:05:50,150 --> 00:05:52,550
have to clean it up by doing
a lot of find and replace
101
00:05:52,550 --> 00:05:54,852
but not with your keyboard
but rather with code.
102
00:05:54,852 --> 00:05:56,310
And so let me go ahead and do this.
103
00:05:56,310 --> 00:05:59,780
Let me go back to my
Google Sheet here that has
104
00:05:59,780 --> 00:06:02,010
all of the data that's come in now.
105
00:06:02,010 --> 00:06:08,430
And let me go ahead and download
this via the File menu here.
106
00:06:08,430 --> 00:06:09,590
And let's see.
107
00:06:09,590 --> 00:06:10,310
Download.
108
00:06:10,310 --> 00:06:12,170
And you can see a
whole bunch of options.
109
00:06:12,170 --> 00:06:14,210
Most formats might be
familiar, but today we'll
110
00:06:14,210 --> 00:06:17,602
focus just on this one, Comma
Separated Values or CSV.
111
00:06:17,602 --> 00:06:20,810
That's going to go ahead and download
it on my Mac here into my own Downloads
112
00:06:20,810 --> 00:06:21,140
folder.
113
00:06:21,140 --> 00:06:22,890
And now I'm going to
go ahead and do this.
114
00:06:22,890 --> 00:06:25,140
Let me go ahead and pull up
VS Code in the cloud here.
115
00:06:25,140 --> 00:06:28,140
And if you've never done this before,
there's a couple of ways to do it.
116
00:06:28,140 --> 00:06:31,320
But the simplest way to upload a
file to your code space, so to speak,
117
00:06:31,320 --> 00:06:32,990
is just a sort of drag and drop.
118
00:06:32,990 --> 00:06:36,020
That's going to magically
upload it to the server there.
119
00:06:36,020 --> 00:06:39,688
And we'll see that, one, it has a very
long file name, which I'm actually
120
00:06:39,688 --> 00:06:41,480
going to clean this up
because that's going
121
00:06:41,480 --> 00:06:43,880
to be very tedious to type in my code.
122
00:06:43,880 --> 00:06:46,927
So I could either right
click, of course, up here,
123
00:06:46,927 --> 00:06:48,510
but I'm going to use my Linux command.
124
00:06:48,510 --> 00:06:52,550
So let's move this file called CSV
50 2022 something or other and let's
125
00:06:52,550 --> 00:06:54,740
just name it more simply favorites.csv.
126
00:06:54,740 --> 00:06:58,370
So all lowercase, no
spaces, sort of good basics.
127
00:06:58,370 --> 00:07:02,960
And let me go ahead now and open up
this file with code favorites.csv.
128
00:07:02,960 --> 00:07:06,500
I'll close my File Explorer and
we'll see exactly the same data
129
00:07:06,500 --> 00:07:09,890
as before but not quite as pretty
as Google Sheets makes it be.
130
00:07:09,890 --> 00:07:13,640
Rather we see here that I still have
three columns, timestamp, language,
131
00:07:13,640 --> 00:07:17,013
problem, and then all of the values
down below, including the timestamps
132
00:07:17,013 --> 00:07:18,180
and the answers they're for.
133
00:07:18,180 --> 00:07:20,360
But it doesn't have proper columns.
134
00:07:20,360 --> 00:07:22,250
It just has commas separating them.
135
00:07:22,250 --> 00:07:24,860
Now, we could very
easily write Python code
136
00:07:24,860 --> 00:07:28,820
just like we wrote C code to
manipulate files like this
137
00:07:28,820 --> 00:07:30,140
either to write or read.
138
00:07:30,140 --> 00:07:32,450
But instead let's do
something that's a little more
139
00:07:32,450 --> 00:07:34,640
pleasant, which is indeed
in the form of Python.
140
00:07:34,640 --> 00:07:38,300
So Python actually comes
with native support for CSVs.
141
00:07:38,300 --> 00:07:41,200
It has indeed a package
called CSV that just
142
00:07:41,200 --> 00:07:43,700
lets you read and write and do
a whole bunch of useful stuff
143
00:07:43,700 --> 00:07:45,007
when it comes to CSV files.
144
00:07:45,007 --> 00:07:47,090
So let's go ahead and do
something with this file.
145
00:07:47,090 --> 00:07:49,250
Let me go back here to VS Code.
146
00:07:49,250 --> 00:07:51,860
I'm going to close
favorites.csv for now.
147
00:07:51,860 --> 00:07:55,430
But just remember in your mind that
timestamp was the first column,
148
00:07:55,430 --> 00:07:58,640
language was the second column,
and problem was the third.
149
00:07:58,640 --> 00:08:01,790
And notice because we're using commas,
they don't again line up perfectly,
150
00:08:01,790 --> 00:08:02,832
but that's not a problem.
151
00:08:02,832 --> 00:08:05,270
There are two commas in
every line presumably.
152
00:08:05,270 --> 00:08:09,892
And I'm going to go ahead and now create
a file called how about favorites.py so
153
00:08:09,892 --> 00:08:12,350
that I can start writing some
code to manipulate this data.
154
00:08:12,350 --> 00:08:13,600
And let's do something simple.
155
00:08:13,600 --> 00:08:16,910
Let's just write a simple program
in Python that opens this file,
156
00:08:16,910 --> 00:08:19,997
reads it, and print something
out just as a safety check
157
00:08:19,997 --> 00:08:22,830
that I know what I'm doing, even
though it's not going to be useful.
158
00:08:22,830 --> 00:08:26,780
So in Python, if you want
CSV support, you import CSV.
159
00:08:26,780 --> 00:08:30,350
And that gives you access to all
the magical capabilities thereof.
160
00:08:30,350 --> 00:08:32,450
Let me now go ahead
and use this technique
161
00:08:32,450 --> 00:08:35,960
to open a file in Python,
which is similar in C.
162
00:08:35,960 --> 00:08:37,789
But with Python, we're going to do this.
163
00:08:37,789 --> 00:08:39,020
The keyword with.
164
00:08:39,020 --> 00:08:42,049
I'm going to open a file
called favorites.csv,
165
00:08:42,049 --> 00:08:43,880
which was the shorter name I gave it.
166
00:08:43,880 --> 00:08:46,280
This is optional, but
just for explicitness, I'm
167
00:08:46,280 --> 00:08:50,090
going to open it in read mode
explicitly, just like f open
168
00:08:50,090 --> 00:08:51,720
took a second argument as well.
169
00:08:51,720 --> 00:08:54,500
And I'm going to name this file
once open quite simply file,
170
00:08:54,500 --> 00:08:56,240
though I could call it anything I want.
171
00:08:56,240 --> 00:08:59,060
And now it's just an open file.
172
00:08:59,060 --> 00:09:02,870
So far as Python knows at this
moment, it's just text, or better yet,
173
00:09:02,870 --> 00:09:04,250
it's just zeros and ones.
174
00:09:04,250 --> 00:09:08,360
If you want this Python package
called CSV to actually do
175
00:09:08,360 --> 00:09:12,500
something useful with it, you have to
load this file now into the library.
176
00:09:12,500 --> 00:09:16,165
And the simplest way to do this is to
give myself a variable called reader
177
00:09:16,165 --> 00:09:17,540
because I want to read this file.
178
00:09:17,540 --> 00:09:19,430
Though this too I could
call anything else.
179
00:09:19,430 --> 00:09:22,610
I'm going to then set that equal to
the return value of a function called
180
00:09:22,610 --> 00:09:24,710
csv.reader.
181
00:09:24,710 --> 00:09:28,220
And I pass to that per the
documentation the open file.
182
00:09:28,220 --> 00:09:29,780
So step one, I open the file.
183
00:09:29,780 --> 00:09:32,030
And this just gives me
access to the bytes therein.
184
00:09:32,030 --> 00:09:36,530
Step two now with csv.reader
tells the Python package
185
00:09:36,530 --> 00:09:40,940
called CSV to do something useful
with it and start analyzing the commas
186
00:09:40,940 --> 00:09:43,770
and allow me to parse it further.
187
00:09:43,770 --> 00:09:45,180
So let's go ahead and do this.
188
00:09:45,180 --> 00:09:49,730
Let me go ahead now and within
this loop let's say this.
189
00:09:49,730 --> 00:09:52,340
Sorry, within this open
file, let's do this.
190
00:09:52,340 --> 00:09:58,040
For every row, if you will,
or line in the file, a.k.a.
191
00:09:58,040 --> 00:10:04,910
reader, let's go ahead and print
out just how about row bracket 1.
192
00:10:04,910 --> 00:10:06,198
Now what's going on here?
193
00:10:06,198 --> 00:10:08,990
Well, it turns out if you read the
documentation for the CSV reader
194
00:10:08,990 --> 00:10:13,610
function, what it hands you back is
essentially this special object, so
195
00:10:13,610 --> 00:10:16,850
to speak, that allows you to
treat it as though it's just
196
00:10:16,850 --> 00:10:20,660
a really big list of lines
from the file, a.k.a. reader.
197
00:10:20,660 --> 00:10:25,310
So by saying for row in reader,
this is a way more succinct way
198
00:10:25,310 --> 00:10:28,763
of saying give me the first
line in the file plus plus,
199
00:10:28,763 --> 00:10:30,680
give me the second line
in the file plus plus,
200
00:10:30,680 --> 00:10:33,597
and so forth that we would have done
what much more mechanically in C.
201
00:10:33,597 --> 00:10:37,350
This is just much more Pythonic
and English friendly, if you will.
202
00:10:37,350 --> 00:10:42,225
So in every iteration of this loop,
row is going to contain all of the data
203
00:10:42,225 --> 00:10:43,100
from the current row.
204
00:10:43,100 --> 00:10:45,590
But better yet, what
the reader function does
205
00:10:45,590 --> 00:10:50,240
for me is it hands me each row
not just as a big string or STR
206
00:10:50,240 --> 00:10:51,590
of text in Python.
207
00:10:51,590 --> 00:10:56,590
It gives me what apparently
based on the syntax on line six.
208
00:10:56,590 --> 00:10:57,460
Any instinct?
209
00:10:57,460 --> 00:10:58,030
Yeah.
210
00:10:58,030 --> 00:10:58,740
AUDIENCE: A list.
211
00:10:58,740 --> 00:11:00,698
DAVID MALAN: It's giving
me back indeed a list.
212
00:11:00,698 --> 00:11:02,880
And I presume the visual
clue for you was the fact
213
00:11:02,880 --> 00:11:04,463
that we're using square brackets here.
214
00:11:04,463 --> 00:11:07,780
And indeed, row bracket 1
is going to be not the first
215
00:11:07,780 --> 00:11:10,060
but the second element in that list.
216
00:11:10,060 --> 00:11:11,280
And so just take a guess.
217
00:11:11,280 --> 00:11:14,400
When I run this code in a moment,
what's going to get printed?
218
00:11:14,400 --> 00:11:16,950
The timestamp, the
language, or the problem?
219
00:11:16,950 --> 00:11:20,892
220
00:11:20,892 --> 00:11:21,392
Yeah?
221
00:11:21,392 --> 00:11:22,350
AUDIENCE: The language.
222
00:11:22,350 --> 00:11:25,260
DAVID MALAN: The language because
it's the second column that is
223
00:11:25,260 --> 00:11:27,848
in the file delimited by those commas.
224
00:11:27,848 --> 00:11:29,140
So let me go ahead and do this.
225
00:11:29,140 --> 00:11:30,630
Let me clear my terminal down here.
226
00:11:30,630 --> 00:11:33,570
Let me run Python of
favorites.py and Enter.
227
00:11:33,570 --> 00:11:34,710
And there's everything.
228
00:11:34,710 --> 00:11:35,490
It was super fast.
229
00:11:35,490 --> 00:11:37,090
But there's a really long list here.
230
00:11:37,090 --> 00:11:41,130
And in fact, if I increase the size
of my terminal and start scrolling up,
231
00:11:41,130 --> 00:11:43,090
you'll just see all of the raw data.
232
00:11:43,090 --> 00:11:44,490
Now, this isn't that useful yet.
233
00:11:44,490 --> 00:11:46,032
I could have just glanced at the CSV.
234
00:11:46,032 --> 00:11:49,320
But clearly now I have the
ability to open the file,
235
00:11:49,320 --> 00:11:52,830
parse it, so to speak, that is break
it up into its constituent parts,
236
00:11:52,830 --> 00:11:56,380
and do something with
specific parts therein.
237
00:11:56,380 --> 00:11:56,880
All right.
238
00:11:56,880 --> 00:11:59,640
So if I want to do this a
little more pleasantly though,
239
00:11:59,640 --> 00:12:02,460
let me at least make this
semantically a little cleaner.
240
00:12:02,460 --> 00:12:05,650
And you know what, just for clarity,
let me just give myself a variable.
241
00:12:05,650 --> 00:12:09,000
It's not strictly necessary,
but I know that this is
242
00:12:09,000 --> 00:12:12,958
the favorite, for instance, language.
243
00:12:12,958 --> 00:12:14,250
So let's just call it favorite.
244
00:12:14,250 --> 00:12:16,410
Set it equal to row bracket 1.
245
00:12:16,410 --> 00:12:18,600
And now just to be more
explicit in my code,
246
00:12:18,600 --> 00:12:21,030
even though again, we don't
need the variable per se,
247
00:12:21,030 --> 00:12:22,840
this code is, of course,
going to do the same thing.
248
00:12:22,840 --> 00:12:25,132
It's just using an additional
variable called favorite.
249
00:12:25,132 --> 00:12:28,450
If I go down here, scroll
up, run the program again,
250
00:12:28,450 --> 00:12:30,280
I get back to the exact same data.
251
00:12:30,280 --> 00:12:32,280
But this is a stepping
stone to something
252
00:12:32,280 --> 00:12:34,800
that's even more powerful
about Python support
253
00:12:34,800 --> 00:12:38,610
for CSV files is that you don't
have to just treat the return
254
00:12:38,610 --> 00:12:43,500
value as a list with 0 and 1 and 2.
255
00:12:43,500 --> 00:12:48,540
So just thinking intuitively here,
why is this maybe not the best design
256
00:12:48,540 --> 00:12:51,960
to hand you, the programmer,
back the data in a list
257
00:12:51,960 --> 00:12:55,640
that's numerically indexed with 0, 1, 2?
258
00:12:55,640 --> 00:12:59,760
It clearly works, but critique this.
259
00:12:59,760 --> 00:13:00,890
What could go wrong?
260
00:13:00,890 --> 00:13:03,282
What's a little poorly designed?
261
00:13:03,282 --> 00:13:04,169
Yeah?
262
00:13:04,169 --> 00:13:06,586
AUDIENCE: You have to always
remember what the things are,
263
00:13:06,586 --> 00:13:10,945
what the order is [INAUDIBLE]
264
00:13:10,945 --> 00:13:11,820
DAVID MALAN: Exactly.
265
00:13:11,820 --> 00:13:14,880
So yeah, so it's up to you to repeat--
266
00:13:14,880 --> 00:13:18,360
it's up to you to remember what
column the data is actually in.
267
00:13:18,360 --> 00:13:22,020
And God forbid you're collaborating
with someone else on the spreadsheet.
268
00:13:22,020 --> 00:13:24,803
If you've used Google Spreadsheets,
you can move columns around
269
00:13:24,803 --> 00:13:27,220
maybe just because you want
to visually reorganize things.
270
00:13:27,220 --> 00:13:30,150
And if you do this and then someone
else downloads that same data,
271
00:13:30,150 --> 00:13:31,840
all of their code is going to break.
272
00:13:31,840 --> 00:13:33,360
So that's just really bad design.
273
00:13:33,360 --> 00:13:36,390
It's fragile just because you're
sort of on the honor system
274
00:13:36,390 --> 00:13:38,588
that one means the data that you want.
275
00:13:38,588 --> 00:13:41,130
So wouldn't it be nice if it
could be a little more explicit?
276
00:13:41,130 --> 00:13:47,130
Well, recall that the very first
line in this file is actually this.
277
00:13:47,130 --> 00:13:50,730
And I paused the output this time
so that we can see more optionally.
278
00:13:50,730 --> 00:13:51,960
I just reran favorites.py.
279
00:13:51,960 --> 00:13:55,050
And notice one of these
things is not like the other.
280
00:13:55,050 --> 00:13:59,070
Every output was either scratch or C
or Python except for this first one.
281
00:13:59,070 --> 00:14:00,855
Why am I seeing the word language here?
282
00:14:00,855 --> 00:14:03,610
283
00:14:03,610 --> 00:14:05,860
Where did language come from?
284
00:14:05,860 --> 00:14:08,230
You didn't have the
ability to manually input.
285
00:14:08,230 --> 00:14:09,372
No.
286
00:14:09,372 --> 00:14:10,330
Where did it come from?
287
00:14:10,330 --> 00:14:11,030
Yeah.
288
00:14:11,030 --> 00:14:11,850
AUDIENCE: That would be the header.
289
00:14:11,850 --> 00:14:12,630
DAVID MALAN: Yeah, the header.
290
00:14:12,630 --> 00:14:15,300
The very first row in the
file, which by human convention
291
00:14:15,300 --> 00:14:18,210
generally just defines what the
columns represent so that there's
292
00:14:18,210 --> 00:14:20,470
some human useful information there.
293
00:14:20,470 --> 00:14:24,400
Now, that's not really intended to
be part of my output at the moment,
294
00:14:24,400 --> 00:14:25,710
so there is a way to skip this.
295
00:14:25,710 --> 00:14:29,380
If you want to skip the first row, you
can actually do something like this.
296
00:14:29,380 --> 00:14:32,560
You can say next row, and that
will just ignore that row.
297
00:14:32,560 --> 00:14:35,520
So then I'm starting really
with the every row thereafter.
298
00:14:35,520 --> 00:14:37,995
But there's a better way
to handle this than that.
299
00:14:37,995 --> 00:14:40,620
That will get rid of the row in
the output, but let me go ahead
300
00:14:40,620 --> 00:14:42,888
and use a different
feature of the CSV package
301
00:14:42,888 --> 00:14:45,430
that's just going to make this
a little cleaner all together.
302
00:14:45,430 --> 00:14:47,140
So let me clear my terminal window here.
303
00:14:47,140 --> 00:14:49,830
Let me undo this next
thing that I just added.
304
00:14:49,830 --> 00:14:54,690
And instead of using a reader, let me
go ahead and use a dictionary reader,
305
00:14:54,690 --> 00:15:00,720
abbreviated dict reader, that's going
to now return me the equivalent of all
306
00:15:00,720 --> 00:15:02,260
of the rows one at a time.
307
00:15:02,260 --> 00:15:04,720
So I can still call it
reader just as before.
308
00:15:04,720 --> 00:15:08,340
But as the name implies, what
this reader is going to return
309
00:15:08,340 --> 00:15:11,560
is not a list after list
after list but a dictionary,
310
00:15:11,560 --> 00:15:13,140
a dictionary, a dictionary.
311
00:15:13,140 --> 00:15:16,210
And remember, a dictionary is just
a collection of key value pairs.
312
00:15:16,210 --> 00:15:17,640
So what does that mean?
313
00:15:17,640 --> 00:15:18,390
What are the keys?
314
00:15:18,390 --> 00:15:19,480
What are the values?
315
00:15:19,480 --> 00:15:24,210
Well, now that I'm using a dictionary
reader, I can actually do this.
316
00:15:24,210 --> 00:15:27,270
Instead of on the honor system
remembering that I want column one,
317
00:15:27,270 --> 00:15:30,630
I can treat row now not as
a list but as a dictionary.
318
00:15:30,630 --> 00:15:34,410
And that means I can go in here
and say quote unquote "language."
319
00:15:34,410 --> 00:15:36,150
And we saw that back in week six.
320
00:15:36,150 --> 00:15:38,820
Python allows you to
index into dictionaries
321
00:15:38,820 --> 00:15:42,480
using square bracket notation in
strings or STRs on the inside,
322
00:15:42,480 --> 00:15:44,790
just like lists allow for numbers.
323
00:15:44,790 --> 00:15:48,180
But this now I think is going
to be a little more robust.
324
00:15:48,180 --> 00:15:52,830
If I run this again, Python of
favorites.py, of that worked out fine.
325
00:15:52,830 --> 00:15:56,580
And let me pause the output too
by using this program called more.
326
00:15:56,580 --> 00:15:58,600
Now I don't even see the header.
327
00:15:58,600 --> 00:16:03,660
So now whoever works with Python
wrote the code for this package
328
00:16:03,660 --> 00:16:06,600
to just analyze that first
line of code, use the header
329
00:16:06,600 --> 00:16:10,230
as you just called it as the
keys, and then every time you
330
00:16:10,230 --> 00:16:13,380
iterate through this loop, it updates
the values, the values, the values,
331
00:16:13,380 --> 00:16:15,810
but the keys stay the same.
332
00:16:15,810 --> 00:16:20,320
Any questions then on this technique?
333
00:16:20,320 --> 00:16:24,206
Suffice it to say this
would be painful in C. Yes?
334
00:16:24,206 --> 00:16:28,670
AUDIENCE: [INAUDIBLE]
335
00:16:28,670 --> 00:16:33,395
336
00:16:33,395 --> 00:16:34,270
DAVID MALAN: Exactly.
337
00:16:34,270 --> 00:16:39,130
So the keys are always going to be
quote unquote "timestamp, language,
338
00:16:39,130 --> 00:16:40,130
and problem."
339
00:16:40,130 --> 00:16:43,330
But on each iteration of
this loop here, the row
340
00:16:43,330 --> 00:16:46,600
is going to contain a different row
of values, different row of values,
341
00:16:46,600 --> 00:16:47,600
different row of values.
342
00:16:47,600 --> 00:16:50,830
So you're going to get back one
dictionary for every student who
343
00:16:50,830 --> 00:16:54,140
submitted the Google Form, if you
will, while iterating through it there.
344
00:16:54,140 --> 00:16:54,640
All right.
345
00:16:54,640 --> 00:17:01,270
So once we have this ability here, why
don't we go ahead and transition to
346
00:17:01,270 --> 00:17:03,850
how about not just using
that dictionary reader, which
347
00:17:03,850 --> 00:17:05,890
it makes the code a little more robust.
348
00:17:05,890 --> 00:17:08,599
Because now if you move the
columns around, no big deal.
349
00:17:08,599 --> 00:17:11,170
It doesn't matter if the
numeric indices change.
350
00:17:11,170 --> 00:17:13,660
You can still use
those keywords instead.
351
00:17:13,660 --> 00:17:15,460
But let's actually analyze the data now.
352
00:17:15,460 --> 00:17:18,619
I'm just spitting it out, which is
not solving any problems for anyone.
353
00:17:18,619 --> 00:17:22,750
So let's go ahead and count the
popularity of Scratch, C, and Python
354
00:17:22,750 --> 00:17:25,790
and see what everyone's
been thinking here.
355
00:17:25,790 --> 00:17:26,290
All right.
356
00:17:26,290 --> 00:17:27,470
So how might I do this?
357
00:17:27,470 --> 00:17:29,890
Well, let me go ahead
and do this up here.
358
00:17:29,890 --> 00:17:33,880
Before I start iterating, let me give
myself, let's say, three variables.
359
00:17:33,880 --> 00:17:37,120
And to keep things simple, I'll
say one variable called Scratch.
360
00:17:37,120 --> 00:17:40,270
Set it equal to 0 for 0 students so far.
361
00:17:40,270 --> 00:17:44,200
C is going to equal 0 and
Python is going to equal 0.
362
00:17:44,200 --> 00:17:46,157
There's a slightly
prettier way of doing this,
363
00:17:46,157 --> 00:17:49,240
just because this is like three lines
of code to do something very simple.
364
00:17:49,240 --> 00:17:53,080
You could alternatively
in Python but not C
365
00:17:53,080 --> 00:17:58,720
do Scratch comma C comma
Python equals 0 comma 0.
366
00:17:58,720 --> 00:18:02,168
So kind of slightly more elegant
just to fit it all into one line.
367
00:18:02,168 --> 00:18:04,210
But now let's just do
something more interesting.
368
00:18:04,210 --> 00:18:06,160
On line seven, I'm still
going to figure out
369
00:18:06,160 --> 00:18:08,320
what the current favorite language is.
370
00:18:08,320 --> 00:18:11,740
And now I'm just going to
do some conditional checks.
371
00:18:11,740 --> 00:18:16,610
How about if that favorite equals
equals quote unquote "Scratch,"
372
00:18:16,610 --> 00:18:19,990
Let's go ahead and
increment Scratch by 1.
373
00:18:19,990 --> 00:18:23,560
We can't do plus plus in Python,
but we can do plus equals 1.
374
00:18:23,560 --> 00:18:30,700
How about elif if favorite equals
equals C, then let's do C plus equals 1.
375
00:18:30,700 --> 00:18:32,063
We could do else.
376
00:18:32,063 --> 00:18:33,730
This is actually a good design question.
377
00:18:33,730 --> 00:18:34,840
Should I do else?
378
00:18:34,840 --> 00:18:36,880
Should I do elif?
379
00:18:36,880 --> 00:18:39,250
Any instincts here?
380
00:18:39,250 --> 00:18:40,824
Yeah.
381
00:18:40,824 --> 00:18:43,259
AUDIENCE: [INAUDIBLE]
382
00:18:43,259 --> 00:18:46,322
383
00:18:46,322 --> 00:18:48,030
DAVID MALAN: Yeah,
really good instincts.
384
00:18:48,030 --> 00:18:50,370
Just in case someone goes
and adds another language
385
00:18:50,370 --> 00:18:51,960
to the form next week
because we're obviously
386
00:18:51,960 --> 00:18:53,370
going to introduce
another language today,
387
00:18:53,370 --> 00:18:55,620
you don't want your code
to now artificially inflate
388
00:18:55,620 --> 00:18:58,380
the scores for Python
just because you're
389
00:18:58,380 --> 00:19:00,340
conflating multiple languages together.
390
00:19:00,340 --> 00:19:03,810
So the more defensive sort of better
way to write this code, I agree,
391
00:19:03,810 --> 00:19:07,680
would be elif favorite
equals equals Python.
392
00:19:07,680 --> 00:19:10,590
Then let's go ahead and
increment Python plus equals 1.
393
00:19:10,590 --> 00:19:12,840
And if there's a new
language next week, we're
394
00:19:12,840 --> 00:19:14,260
obviously going to have
to update the code,
395
00:19:14,260 --> 00:19:15,630
but at least we're not miscounting.
396
00:19:15,630 --> 00:19:17,130
We're just missing the new language.
397
00:19:17,130 --> 00:19:19,210
So I think that's slightly more robust.
398
00:19:19,210 --> 00:19:22,290
All right, now at the very bottom of
this program and outside of the loop
399
00:19:22,290 --> 00:19:25,680
when I'm all done counting,
let me go ahead and print out
400
00:19:25,680 --> 00:19:27,060
using some f strings.
401
00:19:27,060 --> 00:19:31,450
How about the total number of
people whose favorite is Scratch?
402
00:19:31,450 --> 00:19:34,890
So this is just week
six f string syntax.
403
00:19:34,890 --> 00:19:38,940
Let me go ahead and print out another
f string for C. And I'm, of course,
404
00:19:38,940 --> 00:19:41,880
putting the variables in
curly braces, all lowercase,
405
00:19:41,880 --> 00:19:45,060
but the English words I'm
doing capitalization for.
406
00:19:45,060 --> 00:19:49,470
Let's do a final one with f Python
colon and then in curly braces
407
00:19:49,470 --> 00:19:51,000
Python close quote.
408
00:19:51,000 --> 00:19:52,105
And I think I'm done.
409
00:19:52,105 --> 00:19:53,980
So let me just hide my
terminal for a second.
410
00:19:53,980 --> 00:19:55,290
Here's the total program.
411
00:19:55,290 --> 00:19:56,730
Same stuff as before.
412
00:19:56,730 --> 00:19:58,980
Open favorites.csv.
413
00:19:58,980 --> 00:20:02,880
Open it further with the dictionary
reader to do that processing for us.
414
00:20:02,880 --> 00:20:06,330
Initialize three variables to 0 just
so we have something to count with.
415
00:20:06,330 --> 00:20:09,300
And then iterate over
the file row by row.
416
00:20:09,300 --> 00:20:12,120
And this is just some sort of
week one style conditional logic,
417
00:20:12,120 --> 00:20:14,410
albeit in Python, counting things.
418
00:20:14,410 --> 00:20:14,910
All right.
419
00:20:14,910 --> 00:20:17,080
So how can we now execute this?
420
00:20:17,080 --> 00:20:18,450
Let me go back to my terminal.
421
00:20:18,450 --> 00:20:20,610
Python of favorites.py.
422
00:20:20,610 --> 00:20:22,050
And here we go.
423
00:20:22,050 --> 00:20:26,400
As of today, everyone who's reporting
in live via the Google Form,
424
00:20:26,400 --> 00:20:30,190
their favorite languages are--
425
00:20:30,190 --> 00:20:31,330
interesting.
426
00:20:31,330 --> 00:20:34,510
That's pretty interesting too after
just one week of Python no less.
427
00:20:34,510 --> 00:20:36,310
But Scratch is a
healthy contender there.
428
00:20:36,310 --> 00:20:38,210
A lot of C. So a pretty good mix here.
429
00:20:38,210 --> 00:20:43,127
So is this going to be the best way
to write this program long term?
430
00:20:43,127 --> 00:20:45,460
Well, as you noted, if there's
a new language next week,
431
00:20:45,460 --> 00:20:47,830
this week we're going to have
to constantly update this.
432
00:20:47,830 --> 00:20:51,610
And here's where you should let
your mind wander to the future.
433
00:20:51,610 --> 00:20:55,270
If we have a fourth language, fifth
language, sixth, seventh, eighth,
434
00:20:55,270 --> 00:20:59,410
which aspects here might kind
of have some code smell to it?
435
00:20:59,410 --> 00:21:04,310
This probably isn't the best
design to set us up for the future.
436
00:21:04,310 --> 00:21:07,480
What might be better than this?
437
00:21:07,480 --> 00:21:08,417
Yeah.
438
00:21:08,417 --> 00:21:09,880
AUDIENCE: We need to add
a language to line five.
439
00:21:09,880 --> 00:21:10,330
DAVID MALAN: Yeah.
440
00:21:10,330 --> 00:21:12,288
We have to keep adding
a language to line five.
441
00:21:12,288 --> 00:21:13,340
And OK, not a big deal.
442
00:21:13,340 --> 00:21:16,690
We could add SQL today and
maybe JavaScript next week.
443
00:21:16,690 --> 00:21:19,510
But any time a line of
code, a line of logic,
444
00:21:19,510 --> 00:21:21,450
it's just going to grow out of control.
445
00:21:21,450 --> 00:21:23,950
We've had this chat a couple
of times with different syntax.
446
00:21:23,950 --> 00:21:25,700
There's probably a better way than that.
447
00:21:25,700 --> 00:21:26,440
So let's do that.
448
00:21:26,440 --> 00:21:30,550
Instead of using these individual
variables, we could maybe use a list,
449
00:21:30,550 --> 00:21:35,560
but a list would be a little confusing
because what does bracket 0 mean?
450
00:21:35,560 --> 00:21:37,090
What is bracket 1, bracket 2?
451
00:21:37,090 --> 00:21:40,930
But a dictionary, recall, is this
Swiss army knife of data structures
452
00:21:40,930 --> 00:21:44,630
whereby you can associate anything
with anything else, keys with values.
453
00:21:44,630 --> 00:21:48,370
So I dare say a cleaner way to
solve this problem that sets us up
454
00:21:48,370 --> 00:21:52,570
for less work or confusion later would
be to create a new variable called
455
00:21:52,570 --> 00:21:55,000
counts, if that's what we're
doing, counting things up,
456
00:21:55,000 --> 00:21:58,340
and just set it equal
to an empty dictionary.
457
00:21:58,340 --> 00:22:01,675
And you can literally say dict with
the open parenthesis, close parenthesis
458
00:22:01,675 --> 00:22:05,020
and nothing or the more
Pythonic just use open
459
00:22:05,020 --> 00:22:07,000
and close curly braces
with nothing inside.
460
00:22:07,000 --> 00:22:11,320
That gives me an empty dictionary just
like square brackets gives me a list.
461
00:22:11,320 --> 00:22:13,990
Now, my logic down here
has to change a little bit.
462
00:22:13,990 --> 00:22:19,150
But what's nice is I don't need
one conditional for every language.
463
00:22:19,150 --> 00:22:22,240
Because again, if we have a fourth,
a fifth, a sixth, that chunk of code
464
00:22:22,240 --> 00:22:24,580
is also going to grow a
bit out of control too.
465
00:22:24,580 --> 00:22:26,570
So I can get rid of this here.
466
00:22:26,570 --> 00:22:29,180
And what I think I'm
going to do is say this.
467
00:22:29,180 --> 00:22:32,380
Whatever the current favorite is
from the current row in the file,
468
00:22:32,380 --> 00:22:38,420
why don't we go into our
counts variable at that key?
469
00:22:38,420 --> 00:22:40,303
And again, favorite is a variable.
470
00:22:40,303 --> 00:22:41,720
It's not quote unquote "favorite."
471
00:22:41,720 --> 00:22:44,560
It's going to be Scratch or C or Python.
472
00:22:44,560 --> 00:22:47,350
And then why don't we go ahead
and just increment whatever
473
00:22:47,350 --> 00:22:51,740
the value of that count is at that key?
474
00:22:51,740 --> 00:22:53,500
Now, this is technically buggy.
475
00:22:53,500 --> 00:22:57,140
We're really close, but there is a bug.
476
00:22:57,140 --> 00:23:00,350
Does anyone want to
conjecture what the bug is?
477
00:23:00,350 --> 00:23:00,890
Yeah.
478
00:23:00,890 --> 00:23:04,215
AUDIENCE: [INAUDIBLE]
479
00:23:04,215 --> 00:23:08,490
480
00:23:08,490 --> 00:23:11,640
DAVID MALAN: A good question that
answers my question nonetheless.
481
00:23:11,640 --> 00:23:12,685
So no.
482
00:23:12,685 --> 00:23:14,310
The magic you describe will not happen.
483
00:23:14,310 --> 00:23:17,490
And to repeat the hypothesis,
will this automatically
484
00:23:17,490 --> 00:23:22,590
create a key for every
language that we try
485
00:23:22,590 --> 00:23:24,090
plugging into those square brackets?
486
00:23:24,090 --> 00:23:25,140
Short answer no.
487
00:23:25,140 --> 00:23:27,750
Odds are this is going to
create a key error, one of those
488
00:23:27,750 --> 00:23:29,640
traceback error messages
that you've probably
489
00:23:29,640 --> 00:23:31,890
seen by now either in
class or in problem sets
490
00:23:31,890 --> 00:23:37,410
whereby if Scratch hasn't appeared in
the dictionary before or C or Python,
491
00:23:37,410 --> 00:23:40,330
then the dictionary has no
clue what you're talking about.
492
00:23:40,330 --> 00:23:43,200
So I think we actually still need
some conditional logic but not
493
00:23:43,200 --> 00:23:45,900
that's going to grow longer
and longer with each language.
494
00:23:45,900 --> 00:23:47,860
What I think we probably
want to do is this.
495
00:23:47,860 --> 00:23:52,530
If the current favorite is
in the count dictionary,
496
00:23:52,530 --> 00:23:56,730
and this is the Pythonic way of just
saying is this key in this dictionary,
497
00:23:56,730 --> 00:24:02,160
then go ahead and safely do
counts favorite plus equals 1.
498
00:24:02,160 --> 00:24:05,250
Else, to your conjecture
now, else what I want to do.
499
00:24:05,250 --> 00:24:10,330
Counts favorites equals, yeah, 1.
500
00:24:10,330 --> 00:24:14,380
So initialize a brand new
key to a brand new value of 1
501
00:24:14,380 --> 00:24:16,630
because I'm obviously
just seeing this language.
502
00:24:16,630 --> 00:24:19,190
Otherwise increment again and again.
503
00:24:19,190 --> 00:24:22,390
And now down here I just need
to tweak my syntax a little bit.
504
00:24:22,390 --> 00:24:26,540
I don't need to print out all of
these things one at a time manually.
505
00:24:26,540 --> 00:24:30,470
I can actually get away I think with
another loop at the very bottom here.
506
00:24:30,470 --> 00:24:31,480
So how about I do this?
507
00:24:31,480 --> 00:24:34,790
For each favorite in
those counts, and this is,
508
00:24:34,790 --> 00:24:39,340
again, the Pythonic way to iterate
over all of the keys in a dictionary,
509
00:24:39,340 --> 00:24:45,220
go ahead and print out using an f
string whatever the current favorite is,
510
00:24:45,220 --> 00:24:52,180
Scratch or C or Python, and then a colon
and then figure out what its count is.
511
00:24:52,180 --> 00:24:54,610
And you can do that by going
into the counts dictionary,
512
00:24:54,610 --> 00:24:58,910
looking at the favorite
key, and get back its value.
513
00:24:58,910 --> 00:25:00,190
So I close my curly braces.
514
00:25:00,190 --> 00:25:01,060
I close my quotes.
515
00:25:01,060 --> 00:25:04,690
And even though this looks ugly at the
moment, now this is much more dynamic.
516
00:25:04,690 --> 00:25:07,690
Because if we go and add
SQL to the CSV file tomorrow
517
00:25:07,690 --> 00:25:10,720
or we add JavaScript next
week, this will just work.
518
00:25:10,720 --> 00:25:12,760
It will keep working now automatically.
519
00:25:12,760 --> 00:25:15,140
All I change is the Google
Form, not my actual code.
520
00:25:15,140 --> 00:25:15,640
All right.
521
00:25:15,640 --> 00:25:17,620
Let's try Python of favorites.py.
522
00:25:17,620 --> 00:25:19,150
Cross my fingers as always.
523
00:25:19,150 --> 00:25:23,020
And there now is the data as of now.
524
00:25:23,020 --> 00:25:25,270
Questions on this code here?
525
00:25:25,270 --> 00:25:26,195
Yeah.
526
00:25:26,195 --> 00:25:28,093
AUDIENCE: [INAUDIBLE]
527
00:25:28,093 --> 00:25:29,510
DAVID MALAN: Really good question.
528
00:25:29,510 --> 00:25:33,040
What if you wanted to print
it in a particular order?
529
00:25:33,040 --> 00:25:35,530
Well, I could give you
a couple of solutions.
530
00:25:35,530 --> 00:25:37,390
If you want to print it out in--
531
00:25:37,390 --> 00:25:40,850
it's already coincidentally
in alphabetical order.
532
00:25:40,850 --> 00:25:43,630
So you got that for free although
that's just by chance here.
533
00:25:43,630 --> 00:25:45,010
But there is a way to do this.
534
00:25:45,010 --> 00:25:47,320
And let me propose that we
go down here to my loop.
535
00:25:47,320 --> 00:25:50,320
And I explicitly use a function you
might not have seen in Python yet,
536
00:25:50,320 --> 00:25:52,240
but it's literally
called sorted, which is
537
00:25:52,240 --> 00:25:54,970
going to take either a list
or, in this case, a dictionary
538
00:25:54,970 --> 00:25:58,270
and by default sort it
by key alphabetically.
539
00:25:58,270 --> 00:26:00,907
Now, if my intuition
is correct, this is not
540
00:26:00,907 --> 00:26:03,490
going to change the output,
because it's already alphabetical.
541
00:26:03,490 --> 00:26:05,980
But if you read the documentation
for the sorted function,
542
00:26:05,980 --> 00:26:08,800
it takes multiple parameters
potentially, some of which
543
00:26:08,800 --> 00:26:10,000
are named parameters.
544
00:26:10,000 --> 00:26:11,870
And so you can actually do this.
545
00:26:11,870 --> 00:26:14,200
If you want to the counts
but you want to reverse
546
00:26:14,200 --> 00:26:18,018
the order for whatever reason here so
that it's reverse alphabetical order.
547
00:26:18,018 --> 00:26:19,810
Now let me go ahead
and rerun this and I'll
548
00:26:19,810 --> 00:26:21,470
keep the previous output on the screen.
549
00:26:21,470 --> 00:26:27,000
Enter and now it's backwards
alphabetically, if you will.
550
00:26:27,000 --> 00:26:29,130
Other questions on this here?
551
00:26:29,130 --> 00:26:31,770
552
00:26:31,770 --> 00:26:33,830
No?
553
00:26:33,830 --> 00:26:39,930
How about then we transition
to changing sorting by value.
554
00:26:39,930 --> 00:26:42,770
And this is going to escalate
a little quickly briefly
555
00:26:42,770 --> 00:26:44,540
but then we'll tone it down again.
556
00:26:44,540 --> 00:26:47,600
Notice that right now this
is indeed sorting by key.
557
00:26:47,600 --> 00:26:49,400
What if, especially if
I have lots of data,
558
00:26:49,400 --> 00:26:52,670
it'd be nice to make a top 10 list
or, in this case, a top three list
559
00:26:52,670 --> 00:26:58,890
and actually see in order of the counts,
the values what these popular ones are.
560
00:26:58,890 --> 00:27:00,620
So it's not C, Python, Scratch.
561
00:27:00,620 --> 00:27:03,770
It should ideally be
Python, then C, then
562
00:27:03,770 --> 00:27:06,710
Scratch because of the values
and the magnitude thereof.
563
00:27:06,710 --> 00:27:08,010
So how can I do this?
564
00:27:08,010 --> 00:27:12,170
Well, it turns out there is another
key, another parameter that you
565
00:27:12,170 --> 00:27:16,550
can pass to the sorted
function that is typically
566
00:27:16,550 --> 00:27:17,925
implemented as a function itself.
567
00:27:17,925 --> 00:27:19,633
And so I'm going to
go ahead and do this.
568
00:27:19,633 --> 00:27:21,950
I'm going to temporarily
define a function called
569
00:27:21,950 --> 00:27:24,230
get value just to make my life easier.
570
00:27:24,230 --> 00:27:29,390
And this get value function is going
to take I'll say a language parameter.
571
00:27:29,390 --> 00:27:35,400
And then all I'm going to do is return
whatever the count is of that language.
572
00:27:35,400 --> 00:27:37,850
So out of context, this is
just a super simple function
573
00:27:37,850 --> 00:27:41,252
that you hand it a language
like Scratch or C or Python,
574
00:27:41,252 --> 00:27:43,460
it's just going to tell you
what the count is thereof
575
00:27:43,460 --> 00:27:46,400
in that dictionary called counts.
576
00:27:46,400 --> 00:27:51,440
But what I can do now down here in my
newly introduced call to sort it is I
577
00:27:51,440 --> 00:27:54,380
can tell it what to use as its key.
578
00:27:54,380 --> 00:27:57,170
Instead of using literally
the key, Scratch, C,
579
00:27:57,170 --> 00:28:01,040
Python, I can sort of override that
behavior and say, you know what?
580
00:28:01,040 --> 00:28:05,570
To figure out what to sort by,
go ahead and call this function
581
00:28:05,570 --> 00:28:07,890
called get value.
582
00:28:07,890 --> 00:28:11,270
Notice that I have not put
parentheses after get value
583
00:28:11,270 --> 00:28:14,420
because I don't want to call
get value right then and there.
584
00:28:14,420 --> 00:28:20,737
I want to pass the get value function
as itself in argument to the sorted
585
00:28:20,737 --> 00:28:23,570
function so that the sorted function
written years ago by the people
586
00:28:23,570 --> 00:28:27,320
at Python can call my version
of get value again and again
587
00:28:27,320 --> 00:28:30,300
and again when they try
to sort this actual data.
588
00:28:30,300 --> 00:28:36,350
So now if I add that and I leave reverse
equals true, let's see what happens.
589
00:28:36,350 --> 00:28:38,240
Python of favorites.py.
590
00:28:38,240 --> 00:28:38,810
Enter.
591
00:28:38,810 --> 00:28:42,410
And now I get my top 10, or
in this case, top three list.
592
00:28:42,410 --> 00:28:45,290
And if I had more sophisticated
data with more columns
593
00:28:45,290 --> 00:28:47,150
all together that I
actually care about, I
594
00:28:47,150 --> 00:28:50,758
could even sort this
more powerfully as well.
595
00:28:50,758 --> 00:28:52,550
But let me clean this
up a little bit, just
596
00:28:52,550 --> 00:28:54,592
so you've seen it, even
though we won't use these
597
00:28:54,592 --> 00:28:59,010
that often in CS50 until the end of
the class will they come up again.
598
00:28:59,010 --> 00:29:00,740
Technically this is a little bit--
599
00:29:00,740 --> 00:29:03,620
this isn't necessarily the best
design to spend all this time
600
00:29:03,620 --> 00:29:06,680
implementing a function and
then only use it in one place.
601
00:29:06,680 --> 00:29:09,430
In general, we've argued
that you don't necessarily
602
00:29:09,430 --> 00:29:11,930
need a variable if you're only
going to use it in one place.
603
00:29:11,930 --> 00:29:13,430
You don't really need a
function if you're only
604
00:29:13,430 --> 00:29:14,638
going to use it in one place.
605
00:29:14,638 --> 00:29:17,160
And here we kind of have
a good candidate for that.
606
00:29:17,160 --> 00:29:19,400
And so it turns out
in Python if you don't
607
00:29:19,400 --> 00:29:23,240
want to bother creating a
function just to use it once,
608
00:29:23,240 --> 00:29:26,810
you can create what's called
an anonymous function, a.k.a.
609
00:29:26,810 --> 00:29:29,900
a lambda function, like
the lambda symbol familiar.
610
00:29:29,900 --> 00:29:33,380
And a lambda function, the syntax
is a little strange looking,
611
00:29:33,380 --> 00:29:34,490
but you say this.
612
00:29:34,490 --> 00:29:36,080
You literally say lambda.
613
00:29:36,080 --> 00:29:38,660
You literally then say
the name of the argument
614
00:29:38,660 --> 00:29:42,470
that you want this anonymous
function with no name to take.
615
00:29:42,470 --> 00:29:44,430
Then you have a colon.
616
00:29:44,430 --> 00:29:47,930
And then quite simply, you
write what you want the return
617
00:29:47,930 --> 00:29:49,850
value of this function to be.
618
00:29:49,850 --> 00:29:51,800
You don't even say return literally.
619
00:29:51,800 --> 00:29:53,720
These lambda functions
are meant to be used
620
00:29:53,720 --> 00:30:00,680
super tersely so that you can in one
line express something like this.
621
00:30:00,680 --> 00:30:03,720
And I admit this looks more cryptic,
I think, than the previous version.
622
00:30:03,720 --> 00:30:06,512
But as you get more comfortable
with Python or other languages that
623
00:30:06,512 --> 00:30:10,370
support this feature, it allows you to
not bother with lines of code like that
624
00:30:10,370 --> 00:30:12,690
and just tighten up
your code a little bit.
625
00:30:12,690 --> 00:30:16,670
So this line here, lambda
language colon counts language,
626
00:30:16,670 --> 00:30:18,927
is the one line version of this.
627
00:30:18,927 --> 00:30:21,260
And you don't even need to
bother picking a name for it.
628
00:30:21,260 --> 00:30:23,840
Lambda tells Python I
didn't waste any time
629
00:30:23,840 --> 00:30:26,440
thinking of a name for this function.
630
00:30:26,440 --> 00:30:35,310
So questions then on this technique of
using Python to analyze data like this?
631
00:30:35,310 --> 00:30:37,900
Any questions?
632
00:30:37,900 --> 00:30:39,860
We're almost done with Python.
633
00:30:39,860 --> 00:30:42,070
Questions?
634
00:30:42,070 --> 00:30:42,640
No?
635
00:30:42,640 --> 00:30:43,390
OK.
636
00:30:43,390 --> 00:30:45,790
So why don't we make things
a little more interesting?
637
00:30:45,790 --> 00:30:48,620
Because we had a much juicier
data set with the problems
638
00:30:48,620 --> 00:30:50,620
that we've assigned over
the past several weeks.
639
00:30:50,620 --> 00:30:54,970
Why don't we go ahead
and, quite simply, I think
640
00:30:54,970 --> 00:30:57,290
we wrote pretty darn good code here.
641
00:30:57,290 --> 00:31:02,590
So I think we can pretty much
just change a bit of it to say,
642
00:31:02,590 --> 00:31:05,720
let's see, if I don't want
language, I want problem.
643
00:31:05,720 --> 00:31:10,510
And if I want to sort by not language
but problem, I think that's it.
644
00:31:10,510 --> 00:31:12,580
I think if I didn't
overlook something here
645
00:31:12,580 --> 00:31:16,210
just by changing what column I'm
reading the data from and then
646
00:31:16,210 --> 00:31:18,550
just to be consistent
renaming my variables
647
00:31:18,550 --> 00:31:21,310
just so I know what I'm looking
at, what will this program now
648
00:31:21,310 --> 00:31:25,470
do after those minor changes?
649
00:31:25,470 --> 00:31:26,970
What will I see when I run this?
650
00:31:26,970 --> 00:31:29,693
651
00:31:29,693 --> 00:31:31,860
What would be the first
thing I see when I run this?
652
00:31:31,860 --> 00:31:34,950
653
00:31:34,950 --> 00:31:35,970
Tough crowd today.
654
00:31:35,970 --> 00:31:36,720
Yes.
655
00:31:36,720 --> 00:31:37,637
AUDIENCE: The problem.
656
00:31:37,637 --> 00:31:39,110
DAVID MALAN: Yeah.
657
00:31:39,110 --> 00:31:40,330
The top problem.
658
00:31:40,330 --> 00:31:42,627
So the most popular problem,
which I'm a little worried
659
00:31:42,627 --> 00:31:45,210
it might be hello or just Scratch,
but let's go ahead and see.
660
00:31:45,210 --> 00:31:47,700
So let me go ahead and
open my terminal window.
661
00:31:47,700 --> 00:31:50,670
I'll even maximize my terminal
window so we can see a lot.
662
00:31:50,670 --> 00:31:54,167
Let me go ahead and run
Python of favorites.py.
663
00:31:54,167 --> 00:31:57,000
I'm going to go ahead now and cross
my fingers that I didn't mess up
664
00:31:57,000 --> 00:31:57,660
and hit Enter.
665
00:31:57,660 --> 00:31:59,550
And OK, great.
666
00:31:59,550 --> 00:32:00,480
We peaked early.
667
00:32:00,480 --> 00:32:03,840
So Scratch was the most
popular program according
668
00:32:03,840 --> 00:32:06,460
to the data at the time I downloaded it.
669
00:32:06,460 --> 00:32:08,520
I'm sure other votes have come in since.
670
00:32:08,520 --> 00:32:13,080
Filter in week four was tied
then with Tideman as well.
671
00:32:13,080 --> 00:32:16,480
Mario is a close third
there and so forth.
672
00:32:16,480 --> 00:32:19,540
So this is helpful for us on
staff that not so much love
673
00:32:19,540 --> 00:32:21,040
down here at the bottom of the list.
674
00:32:21,040 --> 00:32:23,400
So it was a bunch of
code to write, but now
675
00:32:23,400 --> 00:32:26,250
that we've written it in this
very versatile dynamic way,
676
00:32:26,250 --> 00:32:29,280
it's pretty good for just like
crunching data and doing some analytics.
677
00:32:29,280 --> 00:32:33,480
But it's still a decent number of
lines to have had to write manually.
678
00:32:33,480 --> 00:32:41,820
And this is where sometimes it isn't
necessarily the right tool for the job,
679
00:32:41,820 --> 00:32:44,520
but rather a candidate for using
some other language altogether,
680
00:32:44,520 --> 00:32:47,428
especially when it's not just
a one time program that you run
681
00:32:47,428 --> 00:32:48,720
and you want to see the answer.
682
00:32:48,720 --> 00:32:53,130
What if you want to take input from the
user and answer questions dynamically
683
00:32:53,130 --> 00:32:55,050
like a mobile app would,
like a website would,
684
00:32:55,050 --> 00:32:58,740
like Microsoft Excel or Apple Numbers
or Google Sheets would for you.
685
00:32:58,740 --> 00:33:02,430
Well, let's make one final change for
now to this version of the program
686
00:33:02,430 --> 00:33:05,020
and actually take in some user input.
687
00:33:05,020 --> 00:33:08,170
So besides just loading all
of the data into memory,
688
00:33:08,170 --> 00:33:13,290
let's go ahead and down below
here not just print out the top 10
689
00:33:13,290 --> 00:33:16,660
list, if you will, but prompt
the user for their favorites.
690
00:33:16,660 --> 00:33:18,660
I'm going to use Python's
input function and I'm
691
00:33:18,660 --> 00:33:21,840
going to prompt them with "favorite,"
quote unquote, like tell me what
692
00:33:21,840 --> 00:33:23,370
your favorite problem--
693
00:33:23,370 --> 00:33:27,660
what problem rather
you are interested in.
694
00:33:27,660 --> 00:33:33,267
And now let me go ahead and say if that
favorite is in the count's variable,
695
00:33:33,267 --> 00:33:35,850
so you didn't type in something
random that we didn't actually
696
00:33:35,850 --> 00:33:37,890
assign as a problem,
then let me go ahead
697
00:33:37,890 --> 00:33:44,130
and print with a format string whatever
that favorite is of yours and show you
698
00:33:44,130 --> 00:33:47,040
the actual popularity
thereof by indexing
699
00:33:47,040 --> 00:33:51,370
into counts using that favorite
as the key and printing this.
700
00:33:51,370 --> 00:33:52,620
So now it's a dynamic program.
701
00:33:52,620 --> 00:33:55,170
It doesn't dump all of the
data and all of the summations.
702
00:33:55,170 --> 00:33:59,100
Rather it's going to allow me to
see what my choice of favorite is.
703
00:33:59,100 --> 00:34:03,940
And I'm going to go ahead and say,
let's see, I'm a fan of Mario here.
704
00:34:03,940 --> 00:34:04,950
So Enter.
705
00:34:04,950 --> 00:34:08,880
And indeed, we see the same value we
saw a moment ago but just for Mario.
706
00:34:08,880 --> 00:34:11,880
But the point now is that,
one, all of this is possible.
707
00:34:11,880 --> 00:34:15,630
Two, it's way easier and more pleasant
than this would have been in C.
708
00:34:15,630 --> 00:34:17,280
This is still only 15 lines of code.
709
00:34:17,280 --> 00:34:19,238
And in C, again, there's
the memory management.
710
00:34:19,238 --> 00:34:21,947
There's the iterating over the
strings trying to find the commas.
711
00:34:21,947 --> 00:34:23,190
There's just a lot more work.
712
00:34:23,190 --> 00:34:26,489
But honestly even when you just
want to answer a question like this
713
00:34:26,489 --> 00:34:29,067
in Excel and Apple Numbers,
Google Sheets, generally, you
714
00:34:29,067 --> 00:34:30,150
can just highlight things.
715
00:34:30,150 --> 00:34:34,500
You can click a button and boom, you get
your answer for summation or max or min
716
00:34:34,500 --> 00:34:35,909
or any of those sort of basics.
717
00:34:35,909 --> 00:34:39,270
Wouldn't it be nice if we weren't
taking a step backwards as programmers
718
00:34:39,270 --> 00:34:42,880
and being more powerful and yet we
now have to do more of the work?
719
00:34:42,880 --> 00:34:46,230
So maybe sometimes Python's
not or any language
720
00:34:46,230 --> 00:34:48,370
is not the best tool for the job.
721
00:34:48,370 --> 00:34:51,960
And that's going to now allow us to
introduce more generally something
722
00:34:51,960 --> 00:34:53,670
called a relational database.
723
00:34:53,670 --> 00:34:57,137
Graduating from mere flat
file databases like text files
724
00:34:57,137 --> 00:34:58,845
or binary files in
which all of your data
725
00:34:58,845 --> 00:35:00,570
is stored to something more proper.
726
00:35:00,570 --> 00:35:02,566
But first, questions.
727
00:35:02,566 --> 00:35:08,518
AUDIENCE: [INAUDIBLE]
728
00:35:08,518 --> 00:35:12,943
729
00:35:12,943 --> 00:35:14,360
DAVID MALAN: Really good question.
730
00:35:14,360 --> 00:35:17,670
To reiterate, if I were--
is this case sensitive?
731
00:35:17,670 --> 00:35:21,470
So if I were to type in Mario
in all lowercase and hit Enter,
732
00:35:21,470 --> 00:35:23,450
I actually get no such response.
733
00:35:23,450 --> 00:35:26,720
Now that might be acceptable, because
the problem technically is a capital M.
734
00:35:26,720 --> 00:35:30,840
But that's a little ridiculous to
be that pedantic about the input.
735
00:35:30,840 --> 00:35:34,250
So how could we solve this?
736
00:35:34,250 --> 00:35:37,010
Any tips for how we can make
this a little more robust?
737
00:35:37,010 --> 00:35:38,378
Yeah.
738
00:35:38,378 --> 00:35:42,290
AUDIENCE: [INAUDIBLE]
739
00:35:42,290 --> 00:35:45,848
740
00:35:45,848 --> 00:35:46,765
DAVID MALAN: OK, yeah.
741
00:35:46,765 --> 00:35:49,400
742
00:35:49,400 --> 00:35:49,900
Yeah.
743
00:35:49,900 --> 00:35:51,700
So we could use a few different
functions, one of which
744
00:35:51,700 --> 00:35:54,970
is called title, which will change it
to title case where it capitalizes,
745
00:35:54,970 --> 00:35:58,390
like in most English sentences,
the first letter of that sentence.
746
00:35:58,390 --> 00:35:59,740
We could use capitalize.
747
00:35:59,740 --> 00:36:00,940
We could use upper.
748
00:36:00,940 --> 00:36:01,990
We could use lower.
749
00:36:01,990 --> 00:36:06,435
But indeed, we could just decide how we
want to standardize the capitalization.
750
00:36:06,435 --> 00:36:08,810
Either uppercase, lowercase,
or some combination thereof.
751
00:36:08,810 --> 00:36:11,770
And just make sure that you
change the counts themselves.
752
00:36:11,770 --> 00:36:13,930
Make sure that you do
the same to favorite
753
00:36:13,930 --> 00:36:16,420
and make sure that maybe you
keep a backup of the data
754
00:36:16,420 --> 00:36:19,780
if you want to show the original
version that came from the CSV
755
00:36:19,780 --> 00:36:23,205
without presuming to just
capitalize everything for the user.
756
00:36:23,205 --> 00:36:25,330
But indeed, that would be
the most common scenario.
757
00:36:25,330 --> 00:36:28,540
You just make things case
insensitive when doing those matches.
758
00:36:28,540 --> 00:36:36,910
Other questions now on Python before
we leave it behind for the coming week?
759
00:36:36,910 --> 00:36:39,880
All right, well then let's introduce
these relational databases.
760
00:36:39,880 --> 00:36:43,060
So relational database is what every--
761
00:36:43,060 --> 00:36:45,572
it's a super popular way
of storing lots of data.
762
00:36:45,572 --> 00:36:48,280
This is what the Twitters of the
world, the Googles of the world,
763
00:36:48,280 --> 00:36:52,390
the Metas of the world use to
store some of their data at scale.
764
00:36:52,390 --> 00:36:55,492
There are alternatives
to relational databases.
765
00:36:55,492 --> 00:36:57,700
Indeed, today we'll talk
about a language called SQL.
766
00:36:57,700 --> 00:37:00,408
There's also a movement, if you
will, or an alternative generally
767
00:37:00,408 --> 00:37:02,330
called NoSQL, which
is just the opposite.
768
00:37:02,330 --> 00:37:03,280
You don't use SQL.
769
00:37:03,280 --> 00:37:06,020
There are things called object
oriented databases and the like.
770
00:37:06,020 --> 00:37:11,110
But if you've ever heard of MySQL
or PostgreSQL or Microsoft SQL
771
00:37:11,110 --> 00:37:15,580
Server or Oracle or MariaDB
or a bunch of other products,
772
00:37:15,580 --> 00:37:18,850
both free and commercial, this
is what they're talking about.
773
00:37:18,850 --> 00:37:21,560
Databases that are designed
to store lots of data.
774
00:37:21,560 --> 00:37:23,710
And what's nice about
relational databases
775
00:37:23,710 --> 00:37:26,530
is that they're really similar to
the spreadsheets with which you
776
00:37:26,530 --> 00:37:29,800
were presumably familiar
long before today's class.
777
00:37:29,800 --> 00:37:32,530
So a relational database
is going to store,
778
00:37:32,530 --> 00:37:35,452
as you'll see, all of the
data in rows and columns.
779
00:37:35,452 --> 00:37:37,910
Now, the terminology will
thereafter be a little different.
780
00:37:37,910 --> 00:37:40,570
Instead of having sheets,
you're going to have tables.
781
00:37:40,570 --> 00:37:43,240
But those tables are still
going to have rows and columns.
782
00:37:43,240 --> 00:37:46,330
And you're going to have even
more control over the performance
783
00:37:46,330 --> 00:37:51,970
of your data when you start to access
it using this Structured Query Language
784
00:37:51,970 --> 00:37:53,050
or SQL.
785
00:37:53,050 --> 00:37:56,440
This is a language you can
use for web apps, mobile apps.
786
00:37:56,440 --> 00:37:58,990
A lot of analysts would
sit down at their Mac or PC
787
00:37:58,990 --> 00:38:01,810
and actually ask questions of
data to get back the answer.
788
00:38:01,810 --> 00:38:05,050
And wonderfully, even though there
will be some new syntax today,
789
00:38:05,050 --> 00:38:07,570
SQL really just does four basic things.
790
00:38:07,570 --> 00:38:10,030
CRUD is the sort of crude acronym here.
791
00:38:10,030 --> 00:38:14,470
CRUD is a way of remembering that
a relational database supports
792
00:38:14,470 --> 00:38:20,390
ultimately creating data, reading
data, updating data, and deleting data.
793
00:38:20,390 --> 00:38:22,240
So even if you're
feeling like, wow, this
794
00:38:22,240 --> 00:38:25,510
is a lot of new syntax, which it
isn't relative to our past languages,
795
00:38:25,510 --> 00:38:29,650
the only things you're doing really are
creating data, reading data, updating,
796
00:38:29,650 --> 00:38:30,910
and deleting the same.
797
00:38:30,910 --> 00:38:36,940
Now a little confusingly in SQL, the
corresponding functions or commands
798
00:38:36,940 --> 00:38:41,150
that exist that map to
CRUD are actually this.
799
00:38:41,150 --> 00:38:44,290
So it's still create, but there's
another one called insert.
800
00:38:44,290 --> 00:38:48,250
It's not read, which is more of the
computer scientist way of saying it,
801
00:38:48,250 --> 00:38:50,380
but select, which is a
little more explicit.
802
00:38:50,380 --> 00:38:52,030
Like select data you care about.
803
00:38:52,030 --> 00:38:53,320
Update is still update.
804
00:38:53,320 --> 00:38:54,310
Delete is still delete.
805
00:38:54,310 --> 00:38:57,910
But there's another command called drop,
which lets you drop, that is delete,
806
00:38:57,910 --> 00:38:59,930
entire tables as well.
807
00:38:59,930 --> 00:39:03,280
So you can create tables
using syntax that's
808
00:39:03,280 --> 00:39:05,480
generally going to look like this.
809
00:39:05,480 --> 00:39:06,910
You'll say create table.
810
00:39:06,910 --> 00:39:10,210
You'll give the name of the
table, which you can call most
811
00:39:10,210 --> 00:39:14,260
anything you want, but generally
all lowercase, no spaces is best.
812
00:39:14,260 --> 00:39:18,550
Then in parentheses, you can specify
a comma separated list of the columns
813
00:39:18,550 --> 00:39:20,660
that you might want in this table.
814
00:39:20,660 --> 00:39:24,070
So this is the code
equivalent in the SQL language
815
00:39:24,070 --> 00:39:27,220
of manually opening Google
Sheets or Excel or Numbers
816
00:39:27,220 --> 00:39:30,490
and clicking in the top left
cell and like typing timestamp
817
00:39:30,490 --> 00:39:34,180
and then in the next typing language
and then the third typing problem.
818
00:39:34,180 --> 00:39:37,750
This is the way to define what
your headers are, if you will,
819
00:39:37,750 --> 00:39:38,500
in a spreadsheet.
820
00:39:38,500 --> 00:39:39,790
But now it's called a table.
821
00:39:39,790 --> 00:39:42,100
Now, we won't use this
command manually first.
822
00:39:42,100 --> 00:39:44,050
Let's do something a little simpler.
823
00:39:44,050 --> 00:39:47,590
We're going to start off by just
importing this data ourselves.
824
00:39:47,590 --> 00:39:49,340
And I'm going to go ahead and do this.
825
00:39:49,340 --> 00:39:51,160
Let me go back to VS Code here.
826
00:39:51,160 --> 00:39:54,040
I'm going to leave behind
favorites.py for now,
827
00:39:54,040 --> 00:39:57,100
because now we're going to transition
to this other language called SQL.
828
00:39:57,100 --> 00:40:02,680
And to do this, I am going to
create a new database file.
829
00:40:02,680 --> 00:40:06,790
And I'm going to do so using
a command called sqlite3,
830
00:40:06,790 --> 00:40:09,160
which is just the third
version thereof, and I'm
831
00:40:09,160 --> 00:40:12,110
going to give the database
a name of favorites.db.
832
00:40:12,110 --> 00:40:14,860
There's different conventions, but
this is one of the most common.
833
00:40:14,860 --> 00:40:18,610
When I hit Enter, this is going to
create for me a new empty database just
834
00:40:18,610 --> 00:40:23,080
like opening an untitled spreadsheet in
Excel, Google Sheets, or Apple Numbers.
835
00:40:23,080 --> 00:40:25,720
I'm being prompted do I
want to create favorites.db.
836
00:40:25,720 --> 00:40:26,890
I'll hit Y for Yes.
837
00:40:26,890 --> 00:40:28,562
OK, we're up and running.
838
00:40:28,562 --> 00:40:30,520
Now, you're going to
notice a different prompt.
839
00:40:30,520 --> 00:40:34,360
I'm not in my Linux prompt per se,
which is always the dollar sign.
840
00:40:34,360 --> 00:40:37,090
I'm now inside of the
program called SQLite.
841
00:40:37,090 --> 00:40:41,500
And we're going to use SQLite, SQLite3,
as just an interactive way for now
842
00:40:41,500 --> 00:40:43,180
of playing with SQL code.
843
00:40:43,180 --> 00:40:48,550
At the end of today, we'll show you
how you can use SQL in Python code
844
00:40:48,550 --> 00:40:51,050
so that you still write Python
code to do whatever you want,
845
00:40:51,050 --> 00:40:53,170
but you can talk to
databases using Python.
846
00:40:53,170 --> 00:40:55,630
And this is exactly how
web apps, mobile apps work.
847
00:40:55,630 --> 00:40:59,140
For instance, on iOS, on an
iPhone, an iPad, or the like,
848
00:40:59,140 --> 00:41:03,010
if you want to store data, it's
very often stored in a SQL database,
849
00:41:03,010 --> 00:41:04,580
as we're about to do.
850
00:41:04,580 --> 00:41:07,398
But you might use a language
called SWIFT or Objective C.
851
00:41:07,398 --> 00:41:09,190
And the same exists in
the world of Android
852
00:41:09,190 --> 00:41:12,830
using Java or Kotlin or something
else to query the database.
853
00:41:12,830 --> 00:41:16,270
So we're going to see SQL in isolation
for now like an analyst might just
854
00:41:16,270 --> 00:41:19,880
use with their Mac or PC, but we're
going to tie it together by day's end.
855
00:41:19,880 --> 00:41:28,180
So at this terminal SQLite, let me go
ahead and execute this command first.
856
00:41:28,180 --> 00:41:30,200
I'm going to first put
SQLite into CSV mode,
857
00:41:30,200 --> 00:41:32,200
because I'm going to cut
some corners initially.
858
00:41:32,200 --> 00:41:34,600
And I'm just going to
automatically import
859
00:41:34,600 --> 00:41:38,560
all of the data that was submitted
via the Google Form, which I exported
860
00:41:38,560 --> 00:41:41,410
as a CSV and uploaded to my code space.
861
00:41:41,410 --> 00:41:45,790
And I'm just going to automatically
say turn this CSV file into a SQL
862
00:41:45,790 --> 00:41:48,040
database for me just so I
don't have to figure out
863
00:41:48,040 --> 00:41:49,760
what those create table commands are.
864
00:41:49,760 --> 00:41:53,200
So to do this, I'm
going to say mode csv so
865
00:41:53,200 --> 00:41:57,850
that SQLite knows that this is the
command, knows that this is a CSV file.
866
00:41:57,850 --> 00:41:59,290
It's literally .mode.
867
00:41:59,290 --> 00:42:01,480
So the dot comes before
the keyword there.
868
00:42:01,480 --> 00:42:05,770
And now I'm going to say .import
and then the name of the file I want
869
00:42:05,770 --> 00:42:08,020
to import, which is favorites.csv.
870
00:42:08,020 --> 00:42:12,635
And now the name of the table that
I want to create with that data.
871
00:42:12,635 --> 00:42:15,010
And just for consistency, I'm
going to call it favorites.
872
00:42:15,010 --> 00:42:17,270
I could change these things
to be anything I want,
873
00:42:17,270 --> 00:42:18,530
but I'm going to do that.
874
00:42:18,530 --> 00:42:20,860
And voila, nothing
seems to have happened.
875
00:42:20,860 --> 00:42:24,503
But just like in C and in Python and
Linux when nothing seems to happen,
876
00:42:24,503 --> 00:42:25,670
that's usually a good thing.
877
00:42:25,670 --> 00:42:27,590
It means I didn't mess up.
878
00:42:27,590 --> 00:42:31,120
So if I want to see what just
happened, there's this other command.
879
00:42:31,120 --> 00:42:33,430
And these commands that
start with dots, these
880
00:42:33,430 --> 00:42:37,690
are SQLite specific, which is
indeed a lightweight version of SQL.
881
00:42:37,690 --> 00:42:39,010
They're not SQL, per se.
882
00:42:39,010 --> 00:42:41,260
So if you're using Oracle
or something else like that,
883
00:42:41,260 --> 00:42:43,135
you're not going to use
these exact commands.
884
00:42:43,135 --> 00:42:45,160
You'll see the ones we
use in just a moment.
885
00:42:45,160 --> 00:42:46,660
And here's the first.
886
00:42:46,660 --> 00:42:51,230
When I type .schema, the schema of a
database is the design of the database.
887
00:42:51,230 --> 00:42:52,120
What are the tables?
888
00:42:52,120 --> 00:42:53,900
What are the columns and all of that?
889
00:42:53,900 --> 00:42:58,390
So when I type .schema, this actually
in this case shows me the create table
890
00:42:58,390 --> 00:43:03,700
command that was automatically drawn
for me by just doing this import line.
891
00:43:03,700 --> 00:43:07,450
Once I get more comfortable with SQL,
I could literally type this out myself
892
00:43:07,450 --> 00:43:09,560
or use some program to
generate that as well.
893
00:43:09,560 --> 00:43:11,290
But what it's creating for me is this.
894
00:43:11,290 --> 00:43:15,580
Create table if it doesn't exist,
even though it's more terse than that.
895
00:43:15,580 --> 00:43:18,130
I want to create a
table called favorites.
896
00:43:18,130 --> 00:43:20,590
And then the columns
for that table are going
897
00:43:20,590 --> 00:43:24,880
to be timestamp, which is going
to be text, comma, language,
898
00:43:24,880 --> 00:43:27,248
which is also going to
be text, comma, problem,
899
00:43:27,248 --> 00:43:28,540
which is also going to be text.
900
00:43:28,540 --> 00:43:34,330
That was just inferred very trivially by
the .import command to just figure out
901
00:43:34,330 --> 00:43:38,680
that, yes, just give me a three column
database table based on the Google
902
00:43:38,680 --> 00:43:40,390
Form.
903
00:43:40,390 --> 00:43:41,740
Questions on this?
904
00:43:41,740 --> 00:43:44,140
These are commands you run
once to get up and running.
905
00:43:44,140 --> 00:43:47,260
You don't run these commands
frequently, but we have them
906
00:43:47,260 --> 00:43:49,280
on the slide just for reference.
907
00:43:49,280 --> 00:43:49,780
All right.
908
00:43:49,780 --> 00:43:52,010
So now let's do something
a little more interesting.
909
00:43:52,010 --> 00:43:55,330
I'm going to clear my SQLite terminal
here, but I'm still in SQLite.
910
00:43:55,330 --> 00:43:58,570
I'm going to now use some of
my first SQL commands, which
911
00:43:58,570 --> 00:44:02,440
recall were among them select.
912
00:44:02,440 --> 00:44:05,522
So CRUD, C-R-U-D. The R was select.
913
00:44:05,522 --> 00:44:07,480
This is maybe the most
common, the most useful,
914
00:44:07,480 --> 00:44:10,330
the most powerful thing
to use with a SQL database
915
00:44:10,330 --> 00:44:13,000
selecting data to answer
questions akin to the ones we
916
00:44:13,000 --> 00:44:14,980
were trying to answer with Python.
917
00:44:14,980 --> 00:44:16,270
This is the general syntax.
918
00:44:16,270 --> 00:44:19,900
Any time you want to select data from a
SQL database, you literally say select.
919
00:44:19,900 --> 00:44:24,730
You then specify the column or columns
that you want to select data from.
920
00:44:24,730 --> 00:44:28,450
You literally write the word from and
then you specify the name of the table.
921
00:44:28,450 --> 00:44:32,030
You want to get that data
from semicolon, in this case.
922
00:44:32,030 --> 00:44:35,200
Everything that's in capitals
here is a SQL keyword.
923
00:44:35,200 --> 00:44:37,750
Strictly speaking, you don't
have to capitalize things,
924
00:44:37,750 --> 00:44:40,150
but we would encourage you
to do so stylistically.
925
00:44:40,150 --> 00:44:43,030
And especially as you're learning
and even as you're writing it,
926
00:44:43,030 --> 00:44:47,530
it just helps to distinguish SQL
from words you chose, like the names
927
00:44:47,530 --> 00:44:49,460
of the columns and the data therein.
928
00:44:49,460 --> 00:44:52,700
So do adopt early on this convention.
929
00:44:52,700 --> 00:44:56,240
So let me go back now
to my code space here.
930
00:44:56,240 --> 00:44:59,012
I'm running my terminal window
with SQLite3 inside of it.
931
00:44:59,012 --> 00:45:00,970
Suppose that I just want
to get all of the data
932
00:45:00,970 --> 00:45:04,720
from the favorites table, which
was automatically imported.
933
00:45:04,720 --> 00:45:05,480
Let's do this.
934
00:45:05,480 --> 00:45:06,220
Select.
935
00:45:06,220 --> 00:45:07,630
I want everything.
936
00:45:07,630 --> 00:45:13,090
Well, I can do timestamp
comma language comma problem.
937
00:45:13,090 --> 00:45:13,840
But you know what?
938
00:45:13,840 --> 00:45:15,850
Here's a convenience already.
939
00:45:15,850 --> 00:45:19,120
If you want everything, there's
what's called a wild card character
940
00:45:19,120 --> 00:45:21,460
in SQL, which is just a
star, an asterisk, which
941
00:45:21,460 --> 00:45:25,120
means give me every column without
my knowing even what they're called.
942
00:45:25,120 --> 00:45:28,400
Let me go ahead now and say
from favorites semicolon.
943
00:45:28,400 --> 00:45:33,010
And this is the SQL way of opening
the database, iterating over
944
00:45:33,010 --> 00:45:36,400
every row therein, printing
out every row therein, done.
945
00:45:36,400 --> 00:45:39,640
So those three steps, which was
like nine lines of Python code
946
00:45:39,640 --> 00:45:42,700
give or take earlier,
is now one line of SQL.
947
00:45:42,700 --> 00:45:43,990
I hit Enter.
948
00:45:43,990 --> 00:45:46,370
There is all of the data.
949
00:45:46,370 --> 00:45:48,130
So I see now all of the data.
950
00:45:48,130 --> 00:45:50,230
Just output it as a CSV here.
951
00:45:50,230 --> 00:45:51,940
But it's not the CSV file.
952
00:45:51,940 --> 00:45:53,210
It's now actually the table.
953
00:45:53,210 --> 00:45:54,580
And in fact, just for
good measure, let me
954
00:45:54,580 --> 00:45:58,000
do this, because you'll see the behavior
a little differently the next time we
955
00:45:58,000 --> 00:45:58,690
open the file.
956
00:45:58,690 --> 00:46:00,730
I've just exited out of SQLite3.
957
00:46:00,730 --> 00:46:03,490
I'm going to rerun it, but I'm
not going to reimport the data
958
00:46:03,490 --> 00:46:05,920
or do anything like that,
because my file now exists.
959
00:46:05,920 --> 00:46:07,790
In fact, let me take one step back.
960
00:46:07,790 --> 00:46:11,830
If I type ls at my Linux prompt,
there's my favorites.py from before.
961
00:46:11,830 --> 00:46:13,870
There's my favorites.csv from before.
962
00:46:13,870 --> 00:46:18,580
And here's a third file that I did
create a moment ago when I first
963
00:46:18,580 --> 00:46:19,690
ran SQLite3.
964
00:46:19,690 --> 00:46:21,130
So the data is persistent.
965
00:46:21,130 --> 00:46:23,050
It's not using RAM or memory.
966
00:46:23,050 --> 00:46:24,950
Anything I do now is saved there.
967
00:46:24,950 --> 00:46:27,653
So let's go ahead and rerun
SQLite3 with the same file.
968
00:46:27,653 --> 00:46:28,570
But I'm not going to--
969
00:46:28,570 --> 00:46:31,130
I don't have to reimport everything,
because the file already exists.
970
00:46:31,130 --> 00:46:32,800
Let me now do that same thing again.
971
00:46:32,800 --> 00:46:36,460
Select star from favorites
to get all of the data.
972
00:46:36,460 --> 00:46:39,990
And what you'll see now is the same
data, but it's a little prettier now.
973
00:46:39,990 --> 00:46:44,400
Because I reran it, I effectively
disabled CSV mode this time.
974
00:46:44,400 --> 00:46:49,590
And what I'm now seeing is the
entire contents of this database
975
00:46:49,590 --> 00:46:51,420
table called favorites.
976
00:46:51,420 --> 00:46:53,370
Now, there's nothing new
here, but you're just
977
00:46:53,370 --> 00:46:55,980
seeing now like an ASCII
or Unicode version of all
978
00:46:55,980 --> 00:46:57,988
of the same data from that database.
979
00:46:57,988 --> 00:47:00,030
Well, suppose I want to
get a subset of the data.
980
00:47:00,030 --> 00:47:01,238
Well, let me clear my screen.
981
00:47:01,238 --> 00:47:04,530
And just like in Linux, I can Control L
just to clean things up aesthetically.
982
00:47:04,530 --> 00:47:06,580
Suppose I want to get
just the languages.
983
00:47:06,580 --> 00:47:09,540
So I could do select
language from favorites.
984
00:47:09,540 --> 00:47:12,930
And this will now select not
all three columns, a.k.a.
985
00:47:12,930 --> 00:47:17,280
star, this will only select the language
column and all of the data therein.
986
00:47:17,280 --> 00:47:19,020
If I hit Enter, voila.
987
00:47:19,020 --> 00:47:21,150
Now I just see those there.
988
00:47:21,150 --> 00:47:22,990
No timestamps, no problems.
989
00:47:22,990 --> 00:47:25,470
It's just a slice of
the table, if you will.
990
00:47:25,470 --> 00:47:30,010
All right, not that interesting still
because it's just a big column of data.
991
00:47:30,010 --> 00:47:31,900
But now things get more interesting.
992
00:47:31,900 --> 00:47:35,850
It turns out in SQL that there are
functions that come with this language,
993
00:47:35,850 --> 00:47:37,920
just like C, just like Python.
994
00:47:37,920 --> 00:47:41,790
In SQL, some of the more useful
ones, some of the simpler ones,
995
00:47:41,790 --> 00:47:42,630
are these here.
996
00:47:42,630 --> 00:47:44,880
Average, count, distinct,
lower, max, min,
997
00:47:44,880 --> 00:47:48,150
upper, which pretty
much do what they say.
998
00:47:48,150 --> 00:47:50,533
And count is a particularly useful one.
999
00:47:50,533 --> 00:47:51,450
Let's start with that.
1000
00:47:51,450 --> 00:47:55,740
It's a reasonable question to be asked
how many people submitted the Google
1001
00:47:55,740 --> 00:47:58,523
Form by the time I actually
downloaded the CSV.
1002
00:47:58,523 --> 00:48:00,190
Well, why don't we go ahead and do this?
1003
00:48:00,190 --> 00:48:02,850
Let me go back to VS Code
here in my terminal window.
1004
00:48:02,850 --> 00:48:06,900
Let me select not star
but the count of star.
1005
00:48:06,900 --> 00:48:10,020
So give me the count of
the rows that are being
1006
00:48:10,020 --> 00:48:12,660
returned from the database called--
1007
00:48:12,660 --> 00:48:14,190
the database table called favorites.
1008
00:48:14,190 --> 00:48:16,648
Now when I hit Enter, I'm not
going to get all of the data.
1009
00:48:16,648 --> 00:48:18,630
I'm just going to get simply a number.
1010
00:48:18,630 --> 00:48:21,480
430 rows came back.
1011
00:48:21,480 --> 00:48:22,740
So that's pretty good.
1012
00:48:22,740 --> 00:48:24,580
I now know how much data is in there.
1013
00:48:24,580 --> 00:48:26,350
Well, what languages were in there?
1014
00:48:26,350 --> 00:48:30,520
Well, I could do select language
from favorites just as before,
1015
00:48:30,520 --> 00:48:33,270
but that's not that useful,
especially if I'm inheriting the data.
1016
00:48:33,270 --> 00:48:35,770
Like I'm the analyst who's been
handed a data set by my boss
1017
00:48:35,770 --> 00:48:37,440
and they want me to crunch some numbers.
1018
00:48:37,440 --> 00:48:39,810
OK, I could load this into Excel.
1019
00:48:39,810 --> 00:48:40,770
I could sort it.
1020
00:48:40,770 --> 00:48:44,400
But you can use SQL now to answer
pretty basic questions too.
1021
00:48:44,400 --> 00:48:48,540
If you want to select the distinct
languages in the data set because you
1022
00:48:48,540 --> 00:48:51,000
weren't privy to the Google
Form, let me go ahead
1023
00:48:51,000 --> 00:48:56,820
and select only the distinct
languages from the favorites table.
1024
00:48:56,820 --> 00:49:00,330
And now I hit Enter and I get
back a much more succinct answer.
1025
00:49:00,330 --> 00:49:02,220
Just the three languages in question.
1026
00:49:02,220 --> 00:49:04,963
Not really that useful since
I created the Google Form,
1027
00:49:04,963 --> 00:49:07,380
but certainly if you're
inheriting data from someone else,
1028
00:49:07,380 --> 00:49:09,390
you've just downloaded
a data set, at least now
1029
00:49:09,390 --> 00:49:12,190
I'm arguably wrapping my
mind around what's going on.
1030
00:49:12,190 --> 00:49:15,178
Now, this is not necessary
for such a small data set,
1031
00:49:15,178 --> 00:49:16,470
but I can combine these things.
1032
00:49:16,470 --> 00:49:21,180
Select the count of the distinct
languages in this data set
1033
00:49:21,180 --> 00:49:22,140
called favorites.
1034
00:49:22,140 --> 00:49:25,220
And now I should get back what answer?
1035
00:49:25,220 --> 00:49:27,378
So hopefully indeed an
answer called three.
1036
00:49:27,378 --> 00:49:29,420
And what you're getting
back notice aesthetically
1037
00:49:29,420 --> 00:49:32,180
too is like a mini temporary table.
1038
00:49:32,180 --> 00:49:36,890
When I asked for just the
distinct languages, what SQL hands
1039
00:49:36,890 --> 00:49:39,410
me back is this
temporary table in memory
1040
00:49:39,410 --> 00:49:43,340
that has one column called
language and then three rows.
1041
00:49:43,340 --> 00:49:45,210
Now, this is not saved anywhere.
1042
00:49:45,210 --> 00:49:47,450
It's just executed
ephemerally like this.
1043
00:49:47,450 --> 00:49:49,580
But that's why it's
depicted in this way.
1044
00:49:49,580 --> 00:49:52,610
What you're getting is
subsets of your data, smaller
1045
00:49:52,610 --> 00:49:54,410
tables containing some of your data.
1046
00:49:54,410 --> 00:49:55,550
And same thing down here.
1047
00:49:55,550 --> 00:49:58,400
This is a crazy long column name.
1048
00:49:58,400 --> 00:50:00,480
You can rename it if you really want.
1049
00:50:00,480 --> 00:50:02,900
But that's all we're seeing there.
1050
00:50:02,900 --> 00:50:07,100
And in fact, if that's a little ugly,
we can actually alias these things.
1051
00:50:07,100 --> 00:50:12,090
N is a common name for a variable, a
number in any programming language.
1052
00:50:12,090 --> 00:50:15,200
So I can actually alias this
to be a column called n.
1053
00:50:15,200 --> 00:50:16,160
Hit Enter.
1054
00:50:16,160 --> 00:50:19,490
And now I'm getting a tiny, tiny
table whose column is called
1055
00:50:19,490 --> 00:50:23,280
n that just has the one value there.
1056
00:50:23,280 --> 00:50:28,500
All right, questions on these
application of these functions here?
1057
00:50:28,500 --> 00:50:31,950
1058
00:50:31,950 --> 00:50:33,152
Questions, yeah.
1059
00:50:33,152 --> 00:50:35,000
AUDIENCE: [INAUDIBLE]
1060
00:50:35,000 --> 00:50:36,400
DAVID MALAN: Say a little louder.
1061
00:50:36,400 --> 00:50:37,280
AUDIENCE: A-S.
1062
00:50:37,280 --> 00:50:38,475
DAVID MALAN: Oh, A-S. As.
1063
00:50:38,475 --> 00:50:39,350
Literally in English.
1064
00:50:39,350 --> 00:50:42,650
So rename this column as this.
1065
00:50:42,650 --> 00:50:45,317
Technically it creates
an alias for the column.
1066
00:50:45,317 --> 00:50:45,900
So that's all.
1067
00:50:45,900 --> 00:50:46,818
Yeah?
1068
00:50:46,818 --> 00:50:49,955
AUDIENCE: [INAUDIBLE]
1069
00:50:49,955 --> 00:50:50,830
DAVID MALAN: Exactly.
1070
00:50:50,830 --> 00:50:53,530
Distinct will operate on whatever
you hand it in parentheses
1071
00:50:53,530 --> 00:50:57,460
and get rid of all of the duplicates,
giving you back just the uniques.
1072
00:50:57,460 --> 00:50:58,360
Correct.
1073
00:50:58,360 --> 00:50:59,780
Other questions here?
1074
00:50:59,780 --> 00:51:00,524
Yeah.
1075
00:51:00,524 --> 00:51:02,944
AUDIENCE: [INAUDIBLE]
1076
00:51:02,944 --> 00:51:06,935
1077
00:51:06,935 --> 00:51:08,060
DAVID MALAN: Good question.
1078
00:51:08,060 --> 00:51:10,355
When you define an alias
like n, which I just did,
1079
00:51:10,355 --> 00:51:12,230
does it become like a
variable you can reuse?
1080
00:51:12,230 --> 00:51:16,580
Short answer, no in this case, but you
can reuse it within your same query.
1081
00:51:16,580 --> 00:51:20,630
Even though these queries are
getting a little longer, admittedly,
1082
00:51:20,630 --> 00:51:23,270
statements that they
are, you can actually
1083
00:51:23,270 --> 00:51:26,090
reuse n in an even longer query.
1084
00:51:26,090 --> 00:51:27,440
So later in your query.
1085
00:51:27,440 --> 00:51:30,380
And we'll see a few that are
going to start to grow in length.
1086
00:51:30,380 --> 00:51:33,050
So it's a nice way of
nicknaming things just to be
1087
00:51:33,050 --> 00:51:35,190
a little more terse in your query.
1088
00:51:35,190 --> 00:51:38,480
So we can transition to some of
these more sophisticated queries
1089
00:51:38,480 --> 00:51:41,360
because it turns out there
are some other techniques we
1090
00:51:41,360 --> 00:51:42,630
can introduce as well.
1091
00:51:42,630 --> 00:51:44,660
Here are some other keywords in SQL.
1092
00:51:44,660 --> 00:51:47,005
And again, even though this
is another list of things,
1093
00:51:47,005 --> 00:51:49,130
there's only four things
fundamentally we're doing.
1094
00:51:49,130 --> 00:51:51,600
Creating, reading,
updating, and deleting data.
1095
00:51:51,600 --> 00:51:55,580
These are just allowing us to
fine tune how we do it exactly.
1096
00:51:55,580 --> 00:51:59,210
So where is going to allow us to filter
data, as we'll do in just a moment.
1097
00:51:59,210 --> 00:52:03,440
Like select data where
this conditional is true.
1098
00:52:03,440 --> 00:52:06,260
Like is going to be an
alternative to an equal sign.
1099
00:52:06,260 --> 00:52:11,210
So instead of looking for exactly
Scratch or exactly Python or exactly C,
1100
00:52:11,210 --> 00:52:13,790
you can look for
something like dot dot dot
1101
00:52:13,790 --> 00:52:16,770
and it can be a little bit of
a fuzzier match, if you will,
1102
00:52:16,770 --> 00:52:18,020
with other characters as well.
1103
00:52:18,020 --> 00:52:19,790
Order by is going to deal with sorting.
1104
00:52:19,790 --> 00:52:23,360
Limit is going to just let me limit
the total number of rows that come back
1105
00:52:23,360 --> 00:52:26,330
to 1 or 10 or finite
if I don't want to see
1106
00:52:26,330 --> 00:52:28,375
all 400 plus rows all at
once, because I'm just
1107
00:52:28,375 --> 00:52:29,750
trying to wrap my mind around it.
1108
00:52:29,750 --> 00:52:32,207
And group by is best shown by example.
1109
00:52:32,207 --> 00:52:34,290
So let's play with just a
couple of these as well.
1110
00:52:34,290 --> 00:52:36,080
Let me go back to VS Code here.
1111
00:52:36,080 --> 00:52:37,170
I'll clear my screen.
1112
00:52:37,170 --> 00:52:39,530
I'm still in the same SQLite instance.
1113
00:52:39,530 --> 00:52:44,040
And let's count how many of you liked C
without writing Python code as before.
1114
00:52:44,040 --> 00:52:47,960
So let me go ahead and
select the count of the rows
1115
00:52:47,960 --> 00:52:54,530
from favorites where the
language in each row equals C.
1116
00:52:54,530 --> 00:52:58,100
And the convention in SQLite is
to use single quotes any time
1117
00:52:58,100 --> 00:53:02,810
you're surrounding a string that's meant
to represent a literal piece of text
1118
00:53:02,810 --> 00:53:06,720
as opposed to C, which was double
quotes, or Python, which was either.
1119
00:53:06,720 --> 00:53:11,090
So this is selecting the count of
rows from favorites table where
1120
00:53:11,090 --> 00:53:13,730
the language in question is C. Enter.
1121
00:53:13,730 --> 00:53:15,410
And this gives me 98.
1122
00:53:15,410 --> 00:53:19,190
Notice, though, if I omit that
predicate like we did before,
1123
00:53:19,190 --> 00:53:23,850
you'll get back the total number
of rows that were in the table.
1124
00:53:23,850 --> 00:53:26,030
So where is what's called
a predicate that just
1125
00:53:26,030 --> 00:53:30,020
allows me to filter things just like an
if condition or the like in a language
1126
00:53:30,020 --> 00:53:31,130
that we've seen before.
1127
00:53:31,130 --> 00:53:34,670
You can be a little more specific
like how many people really liked
1128
00:53:34,670 --> 00:53:38,870
C and the Mario problem specifically?
1129
00:53:38,870 --> 00:53:40,440
Well, let's do this.
1130
00:53:40,440 --> 00:53:45,830
Let's go ahead and do select the
number of rows from the favorites table
1131
00:53:45,830 --> 00:53:49,640
where the language is C and.
1132
00:53:49,640 --> 00:53:52,880
So it's still literally
the word ands and or, just
1133
00:53:52,880 --> 00:53:57,830
like in Python but not like in
C. And problem equals Mario.
1134
00:53:57,830 --> 00:54:01,520
So let's see if there's any fans
of both C and the Mario problem.
1135
00:54:01,520 --> 00:54:05,638
And three of us really like those
two things together in this case.
1136
00:54:05,638 --> 00:54:06,930
All right, what else can we do?
1137
00:54:06,930 --> 00:54:10,280
Well, more compelling might be
to see, kind of like in Python,
1138
00:54:10,280 --> 00:54:13,580
for each language, what
was the popularity thereof?
1139
00:54:13,580 --> 00:54:15,560
And at the moment, we
don't really have a way
1140
00:54:15,560 --> 00:54:17,960
of doing that except in
Python where we had the loop
1141
00:54:17,960 --> 00:54:20,710
and we had those variables and the
dictionary that did all of that
1142
00:54:20,710 --> 00:54:21,410
counting for us.
1143
00:54:21,410 --> 00:54:24,710
Totally doable but tedious, especially
if your job is to analyze data.
1144
00:54:24,710 --> 00:54:28,190
My God, even writing 15 lines of
code to answer simple questions
1145
00:54:28,190 --> 00:54:29,390
is kind of ridiculous.
1146
00:54:29,390 --> 00:54:31,560
SQL can do better for us.
1147
00:54:31,560 --> 00:54:33,630
So let me go ahead and do this.
1148
00:54:33,630 --> 00:54:38,540
Let me go ahead and select
every language and the count
1149
00:54:38,540 --> 00:54:45,650
thereof from the favorites table
but this time group by language.
1150
00:54:45,650 --> 00:54:48,440
So this was another one
of the keywords that we
1151
00:54:48,440 --> 00:54:52,460
can use in this abbreviated
list of extra features of SQL.
1152
00:54:52,460 --> 00:54:55,760
And this one takes a moment
to wrap your mind around,
1153
00:54:55,760 --> 00:54:59,510
but this is going to give me a
two column temporary table where
1154
00:54:59,510 --> 00:55:02,030
the first column is a
language and the second column
1155
00:55:02,030 --> 00:55:05,330
is the count thereof from this data set.
1156
00:55:05,330 --> 00:55:10,010
And group by language just means
that only show me Scratch once,
1157
00:55:10,010 --> 00:55:12,350
only show me C once,
only show me Python once.
1158
00:55:12,350 --> 00:55:16,130
That is group all of the
identical values together,
1159
00:55:16,130 --> 00:55:19,260
but keep track of how
many of them there are.
1160
00:55:19,260 --> 00:55:27,200
And so now if I go over to SQLite and
I hit Enter, now I have in SQL version
1161
00:55:27,200 --> 00:55:29,720
the exact same output
that I had from Python
1162
00:55:29,720 --> 00:55:32,120
that took me, what,
15 plus lines before.
1163
00:55:32,120 --> 00:55:35,810
Now we're down to just one because
SQL, Structured Query Language,
1164
00:55:35,810 --> 00:55:39,830
is all about constructing queries like
this to answer questions and get back
1165
00:55:39,830 --> 00:55:41,152
answers quickly.
1166
00:55:41,152 --> 00:55:42,860
If we want to clean
this up a little bit,
1167
00:55:42,860 --> 00:55:44,960
you asked earlier about sorting order.
1168
00:55:44,960 --> 00:55:46,200
Well, we can do that too.
1169
00:55:46,200 --> 00:55:48,930
There's another key
phrase we can use here.
1170
00:55:48,930 --> 00:55:55,080
We can order by the count of those
rows and then run that query here.
1171
00:55:55,080 --> 00:55:57,480
So now unfortunately they're
from smallest to biggest,
1172
00:55:57,480 --> 00:55:58,675
but we can reverse that.
1173
00:55:58,675 --> 00:56:00,800
It turns out, and my query's
starting to wrap here.
1174
00:56:00,800 --> 00:56:02,360
I'll zoom out for a moment.
1175
00:56:02,360 --> 00:56:07,050
If you want to order by count,
the default is in ascending order,
1176
00:56:07,050 --> 00:56:10,980
abbreviated A-S-C. If you want
to reverse the sort in SQL,
1177
00:56:10,980 --> 00:56:13,920
instead of using reverse equals
true like we did in Python,
1178
00:56:13,920 --> 00:56:16,380
you say D-E-S-C for descending order.
1179
00:56:16,380 --> 00:56:19,557
And now we get almost the same
output but flipped in reverse.
1180
00:56:19,557 --> 00:56:22,140
So it's just a lot faster to
answer questions once, of course,
1181
00:56:22,140 --> 00:56:24,660
you get some muscle memory
and some comfort with it.
1182
00:56:24,660 --> 00:56:26,280
Well, what else can I do?
1183
00:56:26,280 --> 00:56:28,890
What if I just care about
the most popular language?
1184
00:56:28,890 --> 00:56:31,140
I don't care about the second
place or the third place
1185
00:56:31,140 --> 00:56:32,460
languages or anything else.
1186
00:56:32,460 --> 00:56:34,830
Well, let me add one more clause here.
1187
00:56:34,830 --> 00:56:36,750
Limit the answer to one.
1188
00:56:36,750 --> 00:56:39,330
And no matter how many
rows should come back now,
1189
00:56:39,330 --> 00:56:41,880
I just get the number one
language as of the data
1190
00:56:41,880 --> 00:56:47,870
set we collected with 270 votes for it.
1191
00:56:47,870 --> 00:56:52,060
Questions on this?
1192
00:56:52,060 --> 00:56:55,260
Any questions here?
1193
00:56:55,260 --> 00:56:55,800
No?
1194
00:56:55,800 --> 00:56:58,620
Well, what if we're
starting to introduce SQL
1195
00:56:58,620 --> 00:57:02,250
and it was kind of too late to
make it into the Google Form?
1196
00:57:02,250 --> 00:57:04,260
So it turns out there's
syntax for this too.
1197
00:57:04,260 --> 00:57:05,670
You can create data, of course.
1198
00:57:05,670 --> 00:57:09,240
Not just the tables, but the data
therein and here's the typical syntax
1199
00:57:09,240 --> 00:57:11,640
for inserting data into a SQL database.
1200
00:57:11,640 --> 00:57:14,850
You literally say insert
into the name of the table.
1201
00:57:14,850 --> 00:57:19,020
And then in parentheses, you specify
one or more columns for which you
1202
00:57:19,020 --> 00:57:20,940
have values that you want to insert.
1203
00:57:20,940 --> 00:57:22,950
This is to say you don't
have to give values
1204
00:57:22,950 --> 00:57:25,650
for every column in the given row.
1205
00:57:25,650 --> 00:57:27,990
If you only have answers
to some of those questions,
1206
00:57:27,990 --> 00:57:29,940
you can enumerate them here like this.
1207
00:57:29,940 --> 00:57:32,010
But the values you insert
are going to be these.
1208
00:57:32,010 --> 00:57:35,500
So you literally say after
the close parenthesis values.
1209
00:57:35,500 --> 00:57:38,430
And then in a second set of
parentheses with a same length
1210
00:57:38,430 --> 00:57:42,370
comma separated list, you specify
what values do you want to insert.
1211
00:57:42,370 --> 00:57:43,710
So it's a little verbose.
1212
00:57:43,710 --> 00:57:47,520
And frankly longer term, you're going
to use Python code to automatically do
1213
00:57:47,520 --> 00:57:50,010
these kinds of insertions, but
let's go ahead and try this.
1214
00:57:50,010 --> 00:57:57,990
Right now if I do select distinct
language from favorites, again,
1215
00:57:57,990 --> 00:57:58,833
we see this.
1216
00:57:58,833 --> 00:58:00,000
Just these three candidates.
1217
00:58:00,000 --> 00:58:01,870
But we've now taught you a bit of SQL.
1218
00:58:01,870 --> 00:58:09,120
So let's do insert into favorites
the column called language.
1219
00:58:09,120 --> 00:58:10,620
And you know what?
1220
00:58:10,620 --> 00:58:12,150
I'm going to give a problem here.
1221
00:58:12,150 --> 00:58:14,430
The values for which,
and let me zoom back out,
1222
00:58:14,430 --> 00:58:17,910
are going to be quote unquote "SQL"
and quote, unquote "fiftyville."
1223
00:58:17,910 --> 00:58:20,130
You'll soon see what that's all about.
1224
00:58:20,130 --> 00:58:21,240
Semicolon.
1225
00:58:21,240 --> 00:58:24,580
Nothing seems to happen, but
that's usually a good thing.
1226
00:58:24,580 --> 00:58:26,580
And now if I scroll
back up in my queries,
1227
00:58:26,580 --> 00:58:30,990
in SQLite3 you can scroll back and
forth in time to avoid retyping things,
1228
00:58:30,990 --> 00:58:35,400
now I should see indeed four
candidate languages here.
1229
00:58:35,400 --> 00:58:39,150
Now, suppose that you were never really
a fan of C and maybe you programmed
1230
00:58:39,150 --> 00:58:42,990
a little bit in high school or in
the real world and you liked C++.
1231
00:58:42,990 --> 00:58:50,010
Well, there's a whole lot of answers
for C. So select star from favorites
1232
00:58:50,010 --> 00:58:53,430
where language equals quote unquote "C."
1233
00:58:53,430 --> 00:58:58,020
So here's everyone who submitted the
answer for C. Let's presume that, no,
1234
00:58:58,020 --> 00:58:59,850
they didn't really want
C, they wanted C++,
1235
00:58:59,850 --> 00:59:01,767
which is not a language
we teach in the class.
1236
00:59:01,767 --> 00:59:03,910
But I could also now do this.
1237
00:59:03,910 --> 00:59:08,670
You can use the update command to set
a column or columns to different values
1238
00:59:08,670 --> 00:59:10,210
where some condition is met.
1239
00:59:10,210 --> 00:59:15,360
So if I do update table name set
column name equal to some value
1240
00:59:15,360 --> 00:59:19,330
filtering it perhaps by
where some condition is true.
1241
00:59:19,330 --> 00:59:22,890
So suppose I've changed
my mind, or you know what,
1242
00:59:22,890 --> 00:59:31,320
let's go ahead and do update favorites
set language equal to maybe C++ where
1243
00:59:31,320 --> 00:59:33,870
language equals C.
1244
00:59:33,870 --> 00:59:35,910
Now, this is destructive,
so you generally
1245
00:59:35,910 --> 00:59:38,368
don't want to do this unless
you have a backup of your data
1246
00:59:38,368 --> 00:59:40,170
too, overriding what
people's answers are.
1247
00:59:40,170 --> 00:59:42,753
This seems to have been successful,
because no error messages.
1248
00:59:42,753 --> 00:59:45,270
And if I rerun the previous
select that gives me
1249
00:59:45,270 --> 00:59:49,410
all of the favorites where language
equals C, now indeed I get none.
1250
00:59:49,410 --> 00:59:53,127
But if I search for
C++, now I get a lot.
1251
00:59:53,127 --> 00:59:55,710
And if I get rid of that where
clause altogether and just look
1252
00:59:55,710 --> 01:00:00,330
at the contents of my database, now you
see that indeed C++ is comingled with
1253
01:00:00,330 --> 01:00:01,140
all the other data.
1254
01:00:01,140 --> 01:00:04,270
This is not what you all intended,
of course, so I can undo this.
1255
01:00:04,270 --> 01:00:06,150
Let me go ahead and
undo what I just did.
1256
01:00:06,150 --> 01:00:12,570
Let me set my favorite language
to C where language equals C++.
1257
01:00:12,570 --> 01:00:14,460
But the predicate is important.
1258
01:00:14,460 --> 01:00:16,230
This I'm not going to do.
1259
01:00:16,230 --> 01:00:20,190
What if I accidentally omitted
this predicate, the where clause?
1260
01:00:20,190 --> 01:00:22,545
How would that screw
things up might you think?
1261
01:00:22,545 --> 01:00:26,160
1262
01:00:26,160 --> 01:00:26,970
Yeah, in the back.
1263
01:00:26,970 --> 01:00:29,110
AUDIENCE: [INAUDIBLE]
1264
01:00:29,110 --> 01:00:32,590
DAVID MALAN: It would set every
row's language to indeed C.
1265
01:00:32,590 --> 01:00:34,060
And this is dangerous.
1266
01:00:34,060 --> 01:00:37,720
And if you start googling around
for SQL mistakes or the like,
1267
01:00:37,720 --> 01:00:41,590
people in the real world have
accidentally run commands like this.
1268
01:00:41,590 --> 01:00:43,510
And without naming
names, a former member
1269
01:00:43,510 --> 01:00:47,200
of our teaching staff at one point
accidentally ran a command like this
1270
01:00:47,200 --> 01:00:51,610
and changed every student's name in
our database to Bobby I think it was.
1271
01:00:51,610 --> 01:00:55,120
The same name for every row because
they simply forgot a predicate.
1272
01:00:55,120 --> 01:00:58,240
So here too there's
dangers in code, and you
1273
01:00:58,240 --> 01:01:00,730
should adopt the habit
quite quickly of always,
1274
01:01:00,730 --> 01:01:03,730
one, backing up your data
like with CP, for instance,
1275
01:01:03,730 --> 01:01:09,010
in Linux or any other technique or just
making sure before you hit Enter that,
1276
01:01:09,010 --> 01:01:11,322
yes, this is indeed the
query I want to execute.
1277
01:01:11,322 --> 01:01:13,030
And generally speaking
in the real world,
1278
01:01:13,030 --> 01:01:14,890
there should be process
controls in place.
1279
01:01:14,890 --> 01:01:17,770
Like the intern should not
have access to the production
1280
01:01:17,770 --> 01:01:19,570
database, the live
database, and the like.
1281
01:01:19,570 --> 01:01:21,860
But you have a lot of power
now with these queries.
1282
01:01:21,860 --> 01:01:25,520
So just be all the more careful, because
very easily can you do bad things.
1283
01:01:25,520 --> 01:01:26,660
So let me undo this.
1284
01:01:26,660 --> 01:01:30,790
Where language equals
quote unquote "C++."
1285
01:01:30,790 --> 01:01:31,840
And I'll zoom back out.
1286
01:01:31,840 --> 01:01:32,470
Enter.
1287
01:01:32,470 --> 01:01:34,480
And now I think we're back in business.
1288
01:01:34,480 --> 01:01:36,100
C is among the answers.
1289
01:01:36,100 --> 01:01:37,520
Yeah?
1290
01:01:37,520 --> 01:01:40,340
AUDIENCE: [INAUDIBLE]
1291
01:01:40,340 --> 01:01:42,293
1292
01:01:42,293 --> 01:01:44,460
DAVID MALAN: It's essentially
doing what at the end?
1293
01:01:44,460 --> 01:01:45,270
AUDIENCE: [INAUDIBLE]
1294
01:01:45,270 --> 01:01:46,680
DAVID MALAN: It's
essentially find and replace.
1295
01:01:46,680 --> 01:01:47,180
Yes.
1296
01:01:47,180 --> 01:01:51,060
In layperson's terms, this is find
and replace implemented with SQL.
1297
01:01:51,060 --> 01:01:55,020
And in fact, the authors of
Microsoft Word or Google Docs
1298
01:01:55,020 --> 01:01:58,290
might very well be using
language like this SQL
1299
01:01:58,290 --> 01:02:01,710
when you go to the nice graphical
user friendly find and replace box.
1300
01:02:01,710 --> 01:02:05,010
This may very well be what they're
doing underneath the hood or, of course,
1301
01:02:05,010 --> 01:02:07,770
they could be using some
other language altogether.
1302
01:02:07,770 --> 01:02:10,530
There's one last syntax
that's worth knowing,
1303
01:02:10,530 --> 01:02:14,280
delete, which for better or for worse
is even more destructive whereby it
1304
01:02:14,280 --> 01:02:16,590
allows you to delete rows from tables.
1305
01:02:16,590 --> 01:02:20,130
It's distinct from drop, which
lets you delete tables themselves.
1306
01:02:20,130 --> 01:02:21,630
This focuses on rows.
1307
01:02:21,630 --> 01:02:26,250
So suppose that you
really, really didn't like,
1308
01:02:26,250 --> 01:02:30,000
let's say, Tideman was a
little challenging if you
1309
01:02:30,000 --> 01:02:31,840
tackled that more comfortable problem.
1310
01:02:31,840 --> 01:02:34,800
So if you really don't want to
even think about Tideman anymore,
1311
01:02:34,800 --> 01:02:40,080
so why don't we do delete from
favorites where problem equals, and I
1312
01:02:40,080 --> 01:02:42,240
won't execute it for real, Tideman.
1313
01:02:42,240 --> 01:02:45,360
This would have the effect of
deleting every row, including
1314
01:02:45,360 --> 01:02:50,010
the language therein and the timestamp,
where the student answered Tideman.
1315
01:02:50,010 --> 01:02:53,910
Worse than this would be this.
1316
01:02:53,910 --> 01:02:55,125
Why might this be bad?
1317
01:02:55,125 --> 01:02:58,560
1318
01:02:58,560 --> 01:03:00,930
OK, chuckling because
there's no predicate.
1319
01:03:00,930 --> 01:03:04,960
There's no filter, which means literally
this would delete all of the data.
1320
01:03:04,960 --> 01:03:07,740
So again, with great power here
comes great responsibility.
1321
01:03:07,740 --> 01:03:12,390
Now, this has just been a data set of
430 rows by us dynamically created.
1322
01:03:12,390 --> 01:03:15,240
There's, of course, some really
juicy data sets in the real world.
1323
01:03:15,240 --> 01:03:17,990
And one website you might have
heard or an app you might have used
1324
01:03:17,990 --> 01:03:20,460
is IMDb, the Internet Movie
Database, which wonderfully
1325
01:03:20,460 --> 01:03:24,330
makes some of their data available
for download as CSV files
1326
01:03:24,330 --> 01:03:27,420
or technically TSV files,
Tab Separated Values.
1327
01:03:27,420 --> 01:03:31,830
But what we did in advance of class was
download some of that data for both TV
1328
01:03:31,830 --> 01:03:34,170
shows in the real world and
movies in the real world.
1329
01:03:34,170 --> 01:03:35,962
And what's wonderful
about this data set is
1330
01:03:35,962 --> 01:03:39,630
it's not just dozens or hundreds
or even thousands of lines.
1331
01:03:39,630 --> 01:03:44,190
There are millions of rows of
juicy data, TV shows and movies
1332
01:03:44,190 --> 01:03:47,190
with which most folks are probably
familiar, at least with a subset.
1333
01:03:47,190 --> 01:03:50,580
And we'll see in just a
little bit that this data
1334
01:03:50,580 --> 01:03:54,600
comes in the form of now six
different tables that we've given you.
1335
01:03:54,600 --> 01:03:56,610
And the tables in question
for today are going
1336
01:03:56,610 --> 01:04:00,735
to be the people in the TV business,
the stars therein, the shows that people
1337
01:04:00,735 --> 01:04:01,860
are producing and the like.
1338
01:04:01,860 --> 01:04:04,860
This is a picture
we'll revisit to enable
1339
01:04:04,860 --> 01:04:07,110
you to wrap your minds around
what the actual data is.
1340
01:04:07,110 --> 01:04:09,948
This feels like a good
opportunity though for a snack.
1341
01:04:09,948 --> 01:04:12,990
In fact, in just a moment, we have a
whole lot of Rice Krispie treats out
1342
01:04:12,990 --> 01:04:13,690
in the lobby.
1343
01:04:13,690 --> 01:04:18,630
But if folks could perhaps acknowledge
this mini wedding cake here.
1344
01:04:18,630 --> 01:04:21,640
CS50 zone Carter Zenke is
getting married this week.
1345
01:04:21,640 --> 01:04:28,880
So congratulations to Carter as well.
1346
01:04:28,880 --> 01:04:30,550
Congrats.
1347
01:04:30,550 --> 01:04:31,940
All right.
1348
01:04:31,940 --> 01:04:34,440
[APPLAUSE]
1349
01:04:34,440 --> 01:04:38,115
1350
01:04:38,115 --> 01:04:40,952
There's only one piece of cake in
that box but a lot of Rice Krispie
1351
01:04:40,952 --> 01:04:41,910
treats in the transept.
1352
01:04:41,910 --> 01:04:45,420
Let's take 10 minutes and we'll be back
with Internet Movie Database in 10.
1353
01:04:45,420 --> 01:04:46,470
All right.
1354
01:04:46,470 --> 01:04:47,320
We are back.
1355
01:04:47,320 --> 01:04:51,000
So if you've never been, you can
actually go to imdb.com right now
1356
01:04:51,000 --> 01:04:53,020
and play around or
download the mobile app.
1357
01:04:53,020 --> 01:04:56,580
And it's just big database of
a lot of TV shows and movies
1358
01:04:56,580 --> 01:04:58,090
and actors and the like.
1359
01:04:58,090 --> 01:05:00,828
But what indeed is nice is you
can download some of that data.
1360
01:05:00,828 --> 01:05:02,370
And that's what I've done in advance.
1361
01:05:02,370 --> 01:05:05,160
And what we've done is
we wrote some Python code
1362
01:05:05,160 --> 01:05:09,540
to convert some of the flat file
databases that they let you download
1363
01:05:09,540 --> 01:05:13,890
and we converted it into a
SQL database with six tables.
1364
01:05:13,890 --> 01:05:17,940
So not just one but six that
ultimately are these here.
1365
01:05:17,940 --> 01:05:21,090
And let me just help you wrap your
minds around what this picture is, which
1366
01:05:21,090 --> 01:05:25,410
is an entity relationship diagram,
which is just to say each of these boxes
1367
01:05:25,410 --> 01:05:27,090
on the screen represents a table.
1368
01:05:27,090 --> 01:05:31,230
And each of the arrows or edges
represents some kind of relationship
1369
01:05:31,230 --> 01:05:32,100
across the tables.
1370
01:05:32,100 --> 01:05:34,200
Because up until now,
the only data we had
1371
01:05:34,200 --> 01:05:36,270
were those three columns
in the favorites table.
1372
01:05:36,270 --> 01:05:39,130
But what gets really
useful about SQL databases,
1373
01:05:39,130 --> 01:05:41,700
just like a Google
Spreadsheet or an Excel file,
1374
01:05:41,700 --> 01:05:45,300
is you can have multiple sheets
or in a database multiple tables.
1375
01:05:45,300 --> 01:05:50,010
And so what we're about to see is that
in the IMDb database for TV shows,
1376
01:05:50,010 --> 01:05:53,820
there's going to be a dedicated table
for all the people in the TV business.
1377
01:05:53,820 --> 01:05:57,210
There's going to be a dedicated
table for all of the TV shows
1378
01:05:57,210 --> 01:05:59,130
that are in their
database as of right now.
1379
01:05:59,130 --> 01:06:03,180
There's going to be a dedicated
table for writers in that industry,
1380
01:06:03,180 --> 01:06:08,580
for the ratings of shows, for
the genres to which shows belong,
1381
01:06:08,580 --> 01:06:09,700
comedy and the like.
1382
01:06:09,700 --> 01:06:12,310
And then lastly, there's
going to be this table,
1383
01:06:12,310 --> 01:06:17,700
which somehow associates people
with the TV shows that they star in
1384
01:06:17,700 --> 01:06:19,030
and vice versa.
1385
01:06:19,030 --> 01:06:22,290
And so let's consider first
what this looks like in code.
1386
01:06:22,290 --> 01:06:25,110
And we'll see that it's going to
overwhelm intentionally at first,
1387
01:06:25,110 --> 01:06:26,200
but I'm going to do this.
1388
01:06:26,200 --> 01:06:27,690
I'm going to go back
to my terminal window.
1389
01:06:27,690 --> 01:06:30,240
And during the break, I downloaded
from the course's website
1390
01:06:30,240 --> 01:06:33,960
a file called shows.db, which
we made in advance for you.
1391
01:06:33,960 --> 01:06:37,440
And if I type ls, I'll see all of
my favorites files from before.
1392
01:06:37,440 --> 01:06:40,020
The CSV, the DB, and the Python file.
1393
01:06:40,020 --> 01:06:42,390
But now they're shows.db.
1394
01:06:42,390 --> 01:06:45,540
So I'm going to go ahead in my
full screen terminal window here.
1395
01:06:45,540 --> 01:06:47,850
I'm not using actual tabs or code files.
1396
01:06:47,850 --> 01:06:52,380
Now I'm going to run sqlite3
on the file called shows.db.
1397
01:06:52,380 --> 01:06:54,900
And I'm just going to see
this version information here.
1398
01:06:54,900 --> 01:06:57,120
Let me clear my screen
and run the one command
1399
01:06:57,120 --> 01:07:01,170
I ran earlier to show us the
schema of the favorites database.
1400
01:07:01,170 --> 01:07:03,947
Now we'll see the schema
for the shows database.
1401
01:07:03,947 --> 01:07:06,030
And there's a lot going
on here, but let me scroll
1402
01:07:06,030 --> 01:07:08,070
back up to the very top, the beginning.
1403
01:07:08,070 --> 01:07:10,570
And we see this here.
1404
01:07:10,570 --> 01:07:15,690
So when I run .schema,
we see a dump, really,
1405
01:07:15,690 --> 01:07:20,310
of all of the SQL create table commands
that were run in order to create this
1406
01:07:20,310 --> 01:07:21,330
database for you.
1407
01:07:21,330 --> 01:07:24,420
And one of those tables is
called genres and another people,
1408
01:07:24,420 --> 01:07:27,370
ratings, shows, stars, and so forth.
1409
01:07:27,370 --> 01:07:29,340
And the columns therein,
even though it's
1410
01:07:29,340 --> 01:07:32,670
formatted a little more prettily than
the automatically generated create
1411
01:07:32,670 --> 01:07:37,680
table statement for favorites whereby
we have one column per line of output
1412
01:07:37,680 --> 01:07:41,520
here in the, for instance,
people table, there's
1413
01:07:41,520 --> 01:07:45,900
going to be an ID column, like a unique
identifier like a Harvard ID, a Yale
1414
01:07:45,900 --> 01:07:50,730
ID or the like, a name column, a
birth year, and then some other stuff.
1415
01:07:50,730 --> 01:07:53,760
If I scroll down to shows,
every show in the world
1416
01:07:53,760 --> 01:07:58,410
is going to have a unique ID as well,
a title of course, the year in which it
1417
01:07:58,410 --> 01:08:00,390
debuted, and the total
number of episodes
1418
01:08:00,390 --> 01:08:02,370
as of the time we downloaded the data.
1419
01:08:02,370 --> 01:08:04,080
And then what else is there?
1420
01:08:04,080 --> 01:08:08,170
Some of these are a little
less obvious like ratings here.
1421
01:08:08,170 --> 01:08:10,830
So ratings don't have
an ID column, but they
1422
01:08:10,830 --> 01:08:15,038
have a show ID column and a rating like
on a five point scale or a 10 point
1423
01:08:15,038 --> 01:08:16,830
scale or the like and
then the total number
1424
01:08:16,830 --> 01:08:19,319
of votes that were collected
to contribute to that rating.
1425
01:08:19,319 --> 01:08:23,680
IMDb allows people to upvote and
downvote shows and movies and the like.
1426
01:08:23,680 --> 01:08:26,380
And then similarly is genre structured.
1427
01:08:26,380 --> 01:08:29,290
There's a show ID and
then there's a genre,
1428
01:08:29,290 --> 01:08:33,370
which is going to be an English word
like comedy or drama or something else.
1429
01:08:33,370 --> 01:08:34,200
And then what else?
1430
01:08:34,200 --> 01:08:38,609
Let's go a little further at the
bottom here for stars and writers.
1431
01:08:38,609 --> 01:08:41,729
If we go to the very bottom
here, stars and writers
1432
01:08:41,729 --> 01:08:43,470
are similarly structured too.
1433
01:08:43,470 --> 01:08:46,890
They have a show ID and a person ID.
1434
01:08:46,890 --> 01:08:48,540
So show and person.
1435
01:08:48,540 --> 01:08:52,470
And then this writers table
has a show ID and a person ID.
1436
01:08:52,470 --> 01:08:56,340
And there's a whole lot of other words
that we'll come to in just a moment.
1437
01:08:56,340 --> 01:08:58,330
But what is this code hinting at?
1438
01:08:58,330 --> 01:09:01,000
Well, if I go back to the
picture from earlier here,
1439
01:09:01,000 --> 01:09:05,160
you'll see that this picture
captures the relationships
1440
01:09:05,160 --> 01:09:07,020
among these various tables.
1441
01:09:07,020 --> 01:09:11,670
So for instance, if we focus on
shows for just a moment, a show,
1442
01:09:11,670 --> 01:09:15,359
again, has a unique ID, a title,
a year in which it debuted,
1443
01:09:15,359 --> 01:09:16,859
and a total number of episodes.
1444
01:09:16,859 --> 01:09:20,550
If you want to figure
out what genre or genres
1445
01:09:20,550 --> 01:09:24,300
a show belongs to, because
some shows are just comedies,
1446
01:09:24,300 --> 01:09:29,010
some shows are just dramas, but some
shows are arguably comedies and dramas
1447
01:09:29,010 --> 01:09:30,700
depending on the episode or the like.
1448
01:09:30,700 --> 01:09:34,740
So you can imagine wanting to associate
two or three or even more genres
1449
01:09:34,740 --> 01:09:35,640
with a show.
1450
01:09:35,640 --> 01:09:39,830
This line here in this second
table allows us to do that.
1451
01:09:39,830 --> 01:09:46,310
Every row in the genres table
we'll see has two items, a show ID
1452
01:09:46,310 --> 01:09:49,250
which relates to the ID of a show.
1453
01:09:49,250 --> 01:09:54,350
And that's why these lines literally
line up with that specific column name.
1454
01:09:54,350 --> 01:09:57,050
And a genre, which is going to
be like, quote unquote, "comedy,"
1455
01:09:57,050 --> 01:09:59,250
quote unquote "drama,"
or something else.
1456
01:09:59,250 --> 01:10:02,180
Now with that said, design question.
1457
01:10:02,180 --> 01:10:07,250
Why have we deliberately not just
gotten rid of this genres table
1458
01:10:07,250 --> 01:10:12,950
and made our lives simpler by just
adding a genre column to this show's
1459
01:10:12,950 --> 01:10:13,640
table?
1460
01:10:13,640 --> 01:10:17,150
And again, a table is just like
a sheet with rows and columns.
1461
01:10:17,150 --> 01:10:21,140
At the moment, shows only have four
columns, ID, title, year, episodes.
1462
01:10:21,140 --> 01:10:28,980
Why not just add a fifth column called
genre and put the show's genre there?
1463
01:10:28,980 --> 01:10:32,200
Any intuition here?
1464
01:10:32,200 --> 01:10:34,960
Why not just keep things simple?
1465
01:10:34,960 --> 01:10:36,286
Yeah, in back.
1466
01:10:36,286 --> 01:10:42,035
AUDIENCE: [INAUDIBLE]
1467
01:10:42,035 --> 01:10:42,910
DAVID MALAN: Exactly.
1468
01:10:42,910 --> 01:10:46,670
If you add a fifth column
here and call it genre,
1469
01:10:46,670 --> 01:10:49,900
then you have to pick
a genre specifically.
1470
01:10:49,900 --> 01:10:55,840
You have to put in that cell presumably
comedy or drama or musical or something
1471
01:10:55,840 --> 01:10:56,480
else.
1472
01:10:56,480 --> 01:10:58,270
Now, you could write
multiple words in the cell,
1473
01:10:58,270 --> 01:11:00,645
but generally speaking, that
would be sloppy, bad design.
1474
01:11:00,645 --> 01:11:02,492
Like every cell just
like in a spreadsheet
1475
01:11:02,492 --> 01:11:03,700
should really have one value.
1476
01:11:03,700 --> 01:11:07,270
It might have multiple words, but it
shouldn't be a weirdly comma separated
1477
01:11:07,270 --> 01:11:08,530
list of multiple things.
1478
01:11:08,530 --> 01:11:10,655
It should just be in a
different cell in that case.
1479
01:11:10,655 --> 01:11:14,320
So if you instead were to design this
with just a single column called genre,
1480
01:11:14,320 --> 01:11:16,510
you're imposing what
a computer scientist
1481
01:11:16,510 --> 01:11:19,480
would call a one to one relationship.
1482
01:11:19,480 --> 01:11:21,580
Every show has one genre.
1483
01:11:21,580 --> 01:11:23,800
And that's not necessarily a good thing.
1484
01:11:23,800 --> 01:11:26,170
Or strictly speaking, it
would be a many to one,
1485
01:11:26,170 --> 01:11:28,510
because the same genre could
belong to multiple shows,
1486
01:11:28,510 --> 01:11:31,760
but each show could only
have one genre in that case.
1487
01:11:31,760 --> 01:11:33,760
What a relational
database allows you to do,
1488
01:11:33,760 --> 01:11:35,890
and relational is indeed
the operative word,
1489
01:11:35,890 --> 01:11:39,100
it allows you to factor out
some of your information
1490
01:11:39,100 --> 01:11:44,590
and then have maybe one show here in
one row but then in this genres table,
1491
01:11:44,590 --> 01:11:47,320
you could have one row
for that one show genre,
1492
01:11:47,320 --> 01:11:52,450
or you could have two rows in the
genres table for comedy and for drama.
1493
01:11:52,450 --> 01:11:55,490
Or if it has a third genre, you
could just add another row here.
1494
01:11:55,490 --> 01:11:59,380
So you still have one
row for the show itself
1495
01:11:59,380 --> 01:12:02,890
with all the juiciest details
but a variable number of rows
1496
01:12:02,890 --> 01:12:05,950
by having this relationship
with another table.
1497
01:12:05,950 --> 01:12:10,000
Meanwhile, ratings work the
same way, at least in this case.
1498
01:12:10,000 --> 01:12:12,882
A show has ID, title,
year, and episodes.
1499
01:12:12,882 --> 01:12:14,590
But if you want to
figure out its rating,
1500
01:12:14,590 --> 01:12:17,710
you have to follow the arrow
here, so to speak, and look up
1501
01:12:17,710 --> 01:12:20,020
the corresponding show ID in this table.
1502
01:12:20,020 --> 01:12:23,750
Find the rating of that show
and the total number of ratings.
1503
01:12:23,750 --> 01:12:26,470
So that's been factored out
too, for better or for worse.
1504
01:12:26,470 --> 01:12:28,600
Now let's consider people.
1505
01:12:28,600 --> 01:12:32,050
People have just three
columns, ID, name, and birth.
1506
01:12:32,050 --> 01:12:35,350
But there's no mention of the TV
show in which people have starred
1507
01:12:35,350 --> 01:12:38,140
or the TV shows that
a person has written.
1508
01:12:38,140 --> 01:12:39,200
Well, why is that?
1509
01:12:39,200 --> 01:12:42,730
Well, if you just had a fourth
column here called show,
1510
01:12:42,730 --> 01:12:45,560
well, you would have to decide
what show is that person in.
1511
01:12:45,560 --> 01:12:47,560
And no one could ever act
again in another show,
1512
01:12:47,560 --> 01:12:49,310
because there's no
room to store the data.
1513
01:12:49,310 --> 01:12:53,560
But if someone, of course, a popular
actor can star in multiple shows,
1514
01:12:53,560 --> 01:12:57,790
well, we could have one ID for that
person, one name, one birth year,
1515
01:12:57,790 --> 01:12:58,360
obviously.
1516
01:12:58,360 --> 01:13:02,230
Like there's only one Steve Carell
as an actor in the world of people.
1517
01:13:02,230 --> 01:13:06,610
But Steve Carell in this example
could have his person ID,
1518
01:13:06,610 --> 01:13:09,220
whatever his Harvard ID
equivalent, Yale ID equivalent is,
1519
01:13:09,220 --> 01:13:12,130
appear in multiple rows
in this table so that it
1520
01:13:12,130 --> 01:13:14,470
can be associated with multiple shows.
1521
01:13:14,470 --> 01:13:18,830
And this allows you to create what's
called a one to many relationship,
1522
01:13:18,830 --> 01:13:20,500
or technically it's bidirectional.
1523
01:13:20,500 --> 01:13:22,450
It's a many to many relationship.
1524
01:13:22,450 --> 01:13:23,120
Why?
1525
01:13:23,120 --> 01:13:26,470
Well, one show can certainly
have multiple people in it
1526
01:13:26,470 --> 01:13:29,090
and multiple people writing
for it, just in the real world.
1527
01:13:29,090 --> 01:13:33,910
But conversely, one person could
certainly act in multiple shows
1528
01:13:33,910 --> 01:13:35,330
or write multiple shows.
1529
01:13:35,330 --> 01:13:38,380
So this is what you get
with relational databases.
1530
01:13:38,380 --> 01:13:41,440
You put your sort of canonical
data for people in one place,
1531
01:13:41,440 --> 01:13:46,150
for shows in another place, and
then you use these additional tables
1532
01:13:46,150 --> 01:13:48,993
to relate one thing to another.
1533
01:13:48,993 --> 01:13:50,410
So we won't dwell on the pictures.
1534
01:13:50,410 --> 01:13:53,350
That's just if you sort of can
wrap your mind around the data
1535
01:13:53,350 --> 01:13:55,870
set better that way, that's
one way of thinking about it.
1536
01:13:55,870 --> 01:14:00,177
But recall that the code we just saw for
the schema, again, escalated quickly.
1537
01:14:00,177 --> 01:14:02,260
There's a lot of keywords
I haven't mentioned yet.
1538
01:14:02,260 --> 01:14:04,360
But some of these are perhaps familiar.
1539
01:14:04,360 --> 01:14:06,130
They're capitalized differently here.
1540
01:14:06,130 --> 01:14:08,140
But integer is on the list here.
1541
01:14:08,140 --> 01:14:10,790
Null is on the list, albeit
technically not null.
1542
01:14:10,790 --> 01:14:12,760
So let's tease apart
some of these keywords
1543
01:14:12,760 --> 01:14:15,218
and consider what they're
actually doing for your database,
1544
01:14:15,218 --> 01:14:18,640
because now we're exploring
features that do not exist
1545
01:14:18,640 --> 01:14:20,980
in the world of spreadsheets alone.
1546
01:14:20,980 --> 01:14:24,190
So it turns out in a SQL
database, specifically
1547
01:14:24,190 --> 01:14:27,265
SQLite which is the version
of SQL we use in CS50
1548
01:14:27,265 --> 01:14:30,910
and which is commonly used for things
like mobile applications nowadays.
1549
01:14:30,910 --> 01:14:32,560
It's like a lightweight version of SQL.
1550
01:14:32,560 --> 01:14:35,590
It's when you aren't trying to
run Twitter and have billions
1551
01:14:35,590 --> 01:14:37,150
and billions of rows necessarily.
1552
01:14:37,150 --> 01:14:40,730
You've got hundreds, thousands, tens
of thousands, maybe even a few million,
1553
01:14:40,730 --> 01:14:44,590
but not crazy numbers,
crazy amounts of data.
1554
01:14:44,590 --> 01:14:49,250
In the world of SQLite specifically,
there's these five data types.
1555
01:14:49,250 --> 01:14:52,390
So just like in C, we had
int and char and the like.
1556
01:14:52,390 --> 01:14:54,190
In SQL, we have these.
1557
01:14:54,190 --> 01:14:57,447
Blob, which is kind of funny, but
it just means binary large objects.
1558
01:14:57,447 --> 01:14:58,780
So it's like a binary data type.
1559
01:14:58,780 --> 01:15:02,830
Zeros and ones that aren't necessarily
fitting into the other categories.
1560
01:15:02,830 --> 01:15:05,290
Integer, which of course,
is an integer as we know it.
1561
01:15:05,290 --> 01:15:09,940
Numeric, which is kind of a catchall for
numbers that are formatted specially.
1562
01:15:09,940 --> 01:15:14,710
So like a date would be like year,
year, year, year, dash month, month,
1563
01:15:14,710 --> 01:15:16,390
dash day, day.
1564
01:15:16,390 --> 01:15:18,015
And this is actually a wonderful thing.
1565
01:15:18,015 --> 01:15:19,723
Depending on the
country you're from, you
1566
01:15:19,723 --> 01:15:22,720
might think your date system in your
country is great or it's horrible.
1567
01:15:22,720 --> 01:15:27,040
The US system is horrible because
we have month, day, and then
1568
01:15:27,040 --> 01:15:29,350
year, which is impossible to sort.
1569
01:15:29,350 --> 01:15:31,930
It is the wrong way
objectively to store data.
1570
01:15:31,930 --> 01:15:33,803
And yet here we are using this at scale.
1571
01:15:33,803 --> 01:15:35,470
Other countries have gotten this better.
1572
01:15:35,470 --> 01:15:38,383
Numeric in SQL itself
standardizes that stuff.
1573
01:15:38,383 --> 01:15:40,300
So it doesn't matter
what country you're from.
1574
01:15:40,300 --> 01:15:43,200
You're storing your data in this
particular way for instance.
1575
01:15:43,200 --> 01:15:46,560
Times are standardized and other
types of numeric data as well.
1576
01:15:46,560 --> 01:15:48,630
Real is synonymous with flow.
1577
01:15:48,630 --> 01:15:52,410
So something with a decimal point
and some number of digits thereafter.
1578
01:15:52,410 --> 01:15:55,530
And then text is just
for strings and the like.
1579
01:15:55,530 --> 01:16:00,240
With other even fancier databases
like MySQL, PostgreSQL, Oracle,
1580
01:16:00,240 --> 01:16:02,370
and other products you
might have heard of,
1581
01:16:02,370 --> 01:16:04,170
there's even more data
types where you have
1582
01:16:04,170 --> 01:16:05,712
to make even finer grained decisions.
1583
01:16:05,712 --> 01:16:07,950
But for SQLite, it's
indeed pretty lightweight
1584
01:16:07,950 --> 01:16:12,120
and you or we just have to decide the
data types for each column in a table.
1585
01:16:12,120 --> 01:16:15,000
But there's these additional
constraints in the world of SQL.
1586
01:16:15,000 --> 01:16:21,400
You can additionally say that cells
in this column may or may not be null.
1587
01:16:21,400 --> 01:16:24,660
So if you want to protect yourself
from yourself so you don't screw up
1588
01:16:24,660 --> 01:16:26,880
and insert a null,
that is a blank value,
1589
01:16:26,880 --> 01:16:31,090
you can explicitly design a table to
have a column that cannot be null.
1590
01:16:31,090 --> 01:16:33,090
And so in fact, someone
came up during the break
1591
01:16:33,090 --> 01:16:36,450
to ask me about my having
manually inserted SQL,
1592
01:16:36,450 --> 01:16:38,923
quote unquote "SQL," into
our favorites database.
1593
01:16:38,923 --> 01:16:40,590
You might recall that I kind of cheated.
1594
01:16:40,590 --> 01:16:44,340
I just inserted "SQL" quote
unquote and "fiftyville,"
1595
01:16:44,340 --> 01:16:46,050
the name of a new
problem, quote unquote.
1596
01:16:46,050 --> 01:16:49,170
But what did I not
insert into the database?
1597
01:16:49,170 --> 01:16:50,100
A timestamp.
1598
01:16:50,100 --> 01:16:50,880
And I could have.
1599
01:16:50,880 --> 01:16:54,240
I could have put the current day and
time a few minutes ago, but I didn't.
1600
01:16:54,240 --> 01:16:58,590
And that's fine if it's acceptable to
you and the product you're building.
1601
01:16:58,590 --> 01:17:00,300
But I could have prevented that.
1602
01:17:00,300 --> 01:17:05,100
If we had defined the table to have a
timestamp column that isn't just text
1603
01:17:05,100 --> 01:17:08,490
but it's text that's not null,
SQL would have complained
1604
01:17:08,490 --> 01:17:10,650
and would not have let me
complete that insertion.
1605
01:17:10,650 --> 01:17:13,567
So there's these kinds of built in
defenses that you don't necessarily
1606
01:17:13,567 --> 01:17:15,180
get with a spreadsheet alone.
1607
01:17:15,180 --> 01:17:17,170
And unique means exactly that.
1608
01:17:17,170 --> 01:17:20,190
If you want to make sure
that every row in that column
1609
01:17:20,190 --> 01:17:24,480
is unique, maybe for email addresses
or in the US Social Security
1610
01:17:24,480 --> 01:17:26,190
numbers or anything
that you want to make
1611
01:17:26,190 --> 01:17:30,330
sure you don't have two versions of, you
can specify that the column is unique.
1612
01:17:30,330 --> 01:17:32,170
And there's other such
constraints as well.
1613
01:17:32,170 --> 01:17:34,140
But again, this is
just a list of features
1614
01:17:34,140 --> 01:17:37,170
that you get from a proper
relational database.
1615
01:17:37,170 --> 01:17:39,990
But perhaps the most
intellectually interesting one
1616
01:17:39,990 --> 01:17:42,420
and the most powerful
one is what's called
1617
01:17:42,420 --> 01:17:45,000
here a primary key and a foreign key.
1618
01:17:45,000 --> 01:17:47,640
And let me go back now to this output.
1619
01:17:47,640 --> 01:17:51,970
If we look at shows, you'll
see that a show, again,
1620
01:17:51,970 --> 01:17:55,320
has an ID, a title, a year,
and a number of episodes.
1621
01:17:55,320 --> 01:17:57,120
And now the data types might make sense.
1622
01:17:57,120 --> 01:17:59,820
The ID it turns out, just
like a Harvard ID, a Yale ID,
1623
01:17:59,820 --> 01:18:01,150
is going to be an integer.
1624
01:18:01,150 --> 01:18:02,220
So a simple number.
1625
01:18:02,220 --> 01:18:04,680
The title, of course, is
going to be text but not null.
1626
01:18:04,680 --> 01:18:07,470
It would be weird if
a TV show had no name.
1627
01:18:07,470 --> 01:18:08,615
That can't be.
1628
01:18:08,615 --> 01:18:10,990
The whole world would break
or your TV Guide and whatnot.
1629
01:18:10,990 --> 01:18:13,590
So that makes sense
there to say not null.
1630
01:18:13,590 --> 01:18:14,670
Year is numeric.
1631
01:18:14,670 --> 01:18:15,990
So it's a standardized form.
1632
01:18:15,990 --> 01:18:19,140
Episodes is an integer, like how
many episodes have been produced.
1633
01:18:19,140 --> 01:18:20,790
And then lastly, notice this.
1634
01:18:20,790 --> 01:18:24,090
The primary key of the
show's table is apparently
1635
01:18:24,090 --> 01:18:27,880
the column called ID
mentioned a few lines earlier.
1636
01:18:27,880 --> 01:18:32,400
This just means that the
database will use the ID
1637
01:18:32,400 --> 01:18:34,740
column as the unique identifier.
1638
01:18:34,740 --> 01:18:37,770
So it's similar to the unique
keyword, but primary key just
1639
01:18:37,770 --> 01:18:40,260
means the database is going
to treat it as special too
1640
01:18:40,260 --> 01:18:43,380
and make sure that it is
uniquely identifying your data.
1641
01:18:43,380 --> 01:18:45,300
But what's interesting is this.
1642
01:18:45,300 --> 01:18:48,480
Notice if I scroll back
up to people, people
1643
01:18:48,480 --> 01:18:51,480
were sort of similarly structured
but with different attributes.
1644
01:18:51,480 --> 01:18:56,310
Like up here we had a person
has an ID, a name, a birth
1645
01:18:56,310 --> 01:18:58,800
year, and a primary key of ID.
1646
01:18:58,800 --> 01:19:01,080
So a ID is, again, integer.
1647
01:19:01,080 --> 01:19:03,600
Name is text but not
null, because it'd be
1648
01:19:03,600 --> 01:19:06,420
weird to have a human with
absolutely no name textually.
1649
01:19:06,420 --> 01:19:07,860
Birth is going to be numeric.
1650
01:19:07,860 --> 01:19:11,590
But the primary key of
people is ID as well.
1651
01:19:11,590 --> 01:19:15,150
So those are the unique columns that
the database will just treat special.
1652
01:19:15,150 --> 01:19:15,810
Why?
1653
01:19:15,810 --> 01:19:17,310
Well, we just looked at shows.
1654
01:19:17,310 --> 01:19:18,900
We just looked at people.
1655
01:19:18,900 --> 01:19:22,380
Let's focus now on this
one down here, stars.
1656
01:19:22,380 --> 01:19:25,260
How do you determine
who stars in a TV show?
1657
01:19:25,260 --> 01:19:26,370
Well, we had two columns.
1658
01:19:26,370 --> 01:19:28,320
The show ID and the person ID.
1659
01:19:28,320 --> 01:19:31,080
This is the incarnation of
a many to many relationship.
1660
01:19:31,080 --> 01:19:32,550
One person could be in many shows.
1661
01:19:32,550 --> 01:19:35,710
One show could certainly have many
people in it or writing for it.
1662
01:19:35,710 --> 01:19:37,250
But notice this.
1663
01:19:37,250 --> 01:19:41,220
Within this table of two
columns, show ID and person ID,
1664
01:19:41,220 --> 01:19:44,070
there's what's going to be
called a foreign key called
1665
01:19:44,070 --> 01:19:49,350
show ID that references the
show's table's ID column
1666
01:19:49,350 --> 01:19:51,367
and then another foreign
key called person ID,
1667
01:19:51,367 --> 01:19:53,700
though I could call these
things in parentheses anything
1668
01:19:53,700 --> 01:19:57,495
I want, that references the
people table's ID column.
1669
01:19:57,495 --> 01:20:00,120
Now, you're not going to often
have to type commands like this.
1670
01:20:00,120 --> 01:20:02,700
Again, you set the database
up once in the beginning
1671
01:20:02,700 --> 01:20:04,590
typically, maybe with
some help from a TF,
1672
01:20:04,590 --> 01:20:06,210
maybe with help of Google or the like.
1673
01:20:06,210 --> 01:20:09,203
But once your database is
designed, it's back to the CRUD.
1674
01:20:09,203 --> 01:20:12,120
Create, read, update, delete, the
selects, the inserts, the deletions,
1675
01:20:12,120 --> 01:20:12,690
and the like.
1676
01:20:12,690 --> 01:20:14,310
But what's this implying?
1677
01:20:14,310 --> 01:20:17,520
These keywords like
primary key and foreign key
1678
01:20:17,520 --> 01:20:23,100
are what are doing in code what this
picture was painting a moment ago.
1679
01:20:23,100 --> 01:20:27,570
These lines here are drawn literally to
line up with the corresponding things.
1680
01:20:27,570 --> 01:20:30,660
People's ID lines up with person ID.
1681
01:20:30,660 --> 01:20:34,410
Show's ID lines up with show ID.
1682
01:20:34,410 --> 01:20:37,950
And so you're just seeing graphical
version, code version, graphical,
1683
01:20:37,950 --> 01:20:40,860
code that creates these relationships.
1684
01:20:40,860 --> 01:20:45,190
Now, given that, let's actually
see what these things look like.
1685
01:20:45,190 --> 01:20:47,160
So let me go back to VS Code here.
1686
01:20:47,160 --> 01:20:48,400
Let me clear my screen.
1687
01:20:48,400 --> 01:20:51,000
I'm still within SQLite with shows.db.
1688
01:20:51,000 --> 01:20:53,480
Let me go ahead and do what
I do with any new database.
1689
01:20:53,480 --> 01:20:55,230
If I ever download
something or I'm trying
1690
01:20:55,230 --> 01:20:57,600
to wrap my mind around
a problem, usually it
1691
01:20:57,600 --> 01:21:01,230
doesn't come with a pretty
picture or a three hour lecture
1692
01:21:01,230 --> 01:21:02,670
to explain what the data set is.
1693
01:21:02,670 --> 01:21:06,190
Rather you just have the data
set in your own knowledge of SQL.
1694
01:21:06,190 --> 01:21:07,440
So let me play around.
1695
01:21:07,440 --> 01:21:09,608
So .schema shows me all of the tables.
1696
01:21:09,608 --> 01:21:10,900
That might be a starting point.
1697
01:21:10,900 --> 01:21:11,900
OK, this is interesting.
1698
01:21:11,900 --> 01:21:13,560
I know what people are.
1699
01:21:13,560 --> 01:21:16,530
Let's go ahead and
show me all the people.
1700
01:21:16,530 --> 01:21:18,595
So select star from people.
1701
01:21:18,595 --> 01:21:20,970
I'm just trying to wrap my
mind around what this data set
1702
01:21:20,970 --> 01:21:24,330
looks like in a more user friendly way.
1703
01:21:24,330 --> 01:21:26,607
That's already a lot of people.
1704
01:21:26,607 --> 01:21:28,440
As you see the years
flying by, there's been
1705
01:21:28,440 --> 01:21:31,170
a lot of people in the TV business.
1706
01:21:31,170 --> 01:21:33,540
So this was maybe not
the best query to run.
1707
01:21:33,540 --> 01:21:39,300
But this is indicative of just how
large this data set is from IMDb.
1708
01:21:39,300 --> 01:21:42,960
When in doubt and whenever you
lose control over your computer,
1709
01:21:42,960 --> 01:21:44,940
Control C is your friend to interrupt.
1710
01:21:44,940 --> 01:21:46,860
What would have been
better, because I don't
1711
01:21:46,860 --> 01:21:49,980
think I need to know all of the
million people in the world,
1712
01:21:49,980 --> 01:21:52,470
I could do limit me to 10 people.
1713
01:21:52,470 --> 01:21:55,860
And that's enough now to
get a sense of Fred Astaire
1714
01:21:55,860 --> 01:21:58,920
has an ID of one, the first person ever.
1715
01:21:58,920 --> 01:22:00,660
Birth year of 1899.
1716
01:22:00,660 --> 01:22:04,740
Lauren Bacall and all of these
other people from yesteryear.
1717
01:22:04,740 --> 01:22:07,410
You see that they are the first
10 people in the database.
1718
01:22:07,410 --> 01:22:09,160
So there's an example
of some of the data.
1719
01:22:09,160 --> 01:22:12,540
Now if I want to wrap my mind around
what a show is, I know it technically.
1720
01:22:12,540 --> 01:22:13,800
I know it from the picture.
1721
01:22:13,800 --> 01:22:15,400
But let's just look at some raw data.
1722
01:22:15,400 --> 01:22:18,210
So instead of saying select star
from people, let me go ahead
1723
01:22:18,210 --> 01:22:21,480
and select star from shows limit 10.
1724
01:22:21,480 --> 01:22:26,250
And OK, I've only heard of
or seen a couple of these,
1725
01:22:26,250 --> 01:22:27,970
but these are older shows at that.
1726
01:22:27,970 --> 01:22:31,530
But I see that every show has an ID,
a title, a year in which it debuted,
1727
01:22:31,530 --> 01:22:32,910
and a number of episodes.
1728
01:22:32,910 --> 01:22:35,340
But perhaps most opaque
is going to be this.
1729
01:22:35,340 --> 01:22:41,520
Select star from stars where this
is the table that associates people
1730
01:22:41,520 --> 01:22:42,330
with shows.
1731
01:22:42,330 --> 01:22:46,950
Am I going to see any
names or show titles here?
1732
01:22:46,950 --> 01:22:49,127
Not according to the
definition we saw earlier.
1733
01:22:49,127 --> 01:22:50,460
Oh, I should have done my limit.
1734
01:22:50,460 --> 01:22:51,660
Let me interrupt that.
1735
01:22:51,660 --> 01:22:52,620
Let me do that again.
1736
01:22:52,620 --> 01:22:54,180
Limit 10.
1737
01:22:54,180 --> 01:22:55,050
No.
1738
01:22:55,050 --> 01:22:58,380
And this is where now you're
definitely in the programmer world,
1739
01:22:58,380 --> 01:23:01,780
because this would be the most annoying
spreadsheet to use on your Mac or PC
1740
01:23:01,780 --> 01:23:02,280
ever.
1741
01:23:02,280 --> 01:23:04,528
If you just had a sheet
with all of these numbers
1742
01:23:04,528 --> 01:23:06,570
that associates one thing
with the other, my God,
1743
01:23:06,570 --> 01:23:08,910
how do you figure out who
this is or what this is?
1744
01:23:08,910 --> 01:23:12,780
You have to manually Control F or
Command F looking for the data.
1745
01:23:12,780 --> 01:23:14,310
But a database doesn't care.
1746
01:23:14,310 --> 01:23:18,160
Once you know SQL, you can stitch
these things back together.
1747
01:23:18,160 --> 01:23:21,360
So what you're seeing
here are foreign keys.
1748
01:23:21,360 --> 01:23:22,150
Foreign keys.
1749
01:23:22,150 --> 01:23:22,650
Why?
1750
01:23:22,650 --> 01:23:27,600
Because show ID corresponds to the
same numbers from that other table
1751
01:23:27,600 --> 01:23:30,990
called shows that has a
proper primary key called ID.
1752
01:23:30,990 --> 01:23:33,360
Person ID is a foreign
key in this context,
1753
01:23:33,360 --> 01:23:38,620
because it refers to numbers that belong
to really the people table and its ID
1754
01:23:38,620 --> 01:23:39,120
column.
1755
01:23:39,120 --> 01:23:41,037
So this is just a way
of somehow linking them.
1756
01:23:41,037 --> 01:23:44,100
And so if you think of I always think
of this in my mind's eye as this.
1757
01:23:44,100 --> 01:23:47,760
If this is the people table,
this is the shows table,
1758
01:23:47,760 --> 01:23:51,690
and there's this middle table
in between, the stars table.
1759
01:23:51,690 --> 01:23:56,490
There's some way of stitching those two
together by lining up the IDs of one
1760
01:23:56,490 --> 01:23:58,870
with the other and getting
back some more data.
1761
01:23:58,870 --> 01:24:01,690
So let's actually play
with some of this data.
1762
01:24:01,690 --> 01:24:05,148
How about we start where we
emphasized earlier, genres.
1763
01:24:05,148 --> 01:24:08,440
So let me go ahead and take a quick look
at all of the genres in this database.
1764
01:24:08,440 --> 01:24:10,500
So select star from genres.
1765
01:24:10,500 --> 01:24:13,030
Star is usually going to
be a little overwhelming,
1766
01:24:13,030 --> 01:24:15,240
but it just gives me a
sense of what the data is.
1767
01:24:15,240 --> 01:24:17,850
But let's actually look at--
1768
01:24:17,850 --> 01:24:19,440
let's go look at all of them there.
1769
01:24:19,440 --> 01:24:20,850
OK, that's a lot.
1770
01:24:20,850 --> 01:24:24,210
These are all official genres from IMDb.
1771
01:24:24,210 --> 01:24:27,030
OK, it wasn't terribly long.
1772
01:24:27,030 --> 01:24:28,350
Let me filter that down.
1773
01:24:28,350 --> 01:24:34,800
So from genres where genre equals
Comedy, capital C just based
1774
01:24:34,800 --> 01:24:36,000
on the data I'm seeing.
1775
01:24:36,000 --> 01:24:38,220
OK, so what am I seeing now?
1776
01:24:38,220 --> 01:24:40,838
And in fact, let me limit
this arbitrarily to 10,
1777
01:24:40,838 --> 01:24:42,630
though I could limit
it to anything I want.
1778
01:24:42,630 --> 01:24:45,060
Here are 10 comedies.
1779
01:24:45,060 --> 01:24:46,080
What are they?
1780
01:24:46,080 --> 01:24:47,490
Well, who the heck knows?
1781
01:24:47,490 --> 01:24:49,590
All I know are the 10 show IDs.
1782
01:24:49,590 --> 01:24:51,180
Now, I could do something like this.
1783
01:24:51,180 --> 01:24:54,037
As we've seen before with SQL,
I could do, all right, well
1784
01:24:54,037 --> 01:24:55,620
let's figure out what this show ID is.
1785
01:24:55,620 --> 01:25:03,750
Select star from shows where the ID of
the show I'm looking for equals what?
1786
01:25:03,750 --> 01:25:06,910
62614 semicolon.
1787
01:25:06,910 --> 01:25:11,200
So I could manually look it up by
cross referencing the other table.
1788
01:25:11,200 --> 01:25:14,860
So that was the show in question there,
the first comedy in the data set.
1789
01:25:14,860 --> 01:25:16,200
Let me look up the second one.
1790
01:25:16,200 --> 01:25:20,610
So instead of that,
let's do 63881 Enter.
1791
01:25:20,610 --> 01:25:21,150
OK.
1792
01:25:21,150 --> 01:25:22,410
So that's that show.
1793
01:25:22,410 --> 01:25:23,640
And let's do one more.
1794
01:25:23,640 --> 01:25:27,210
And suffice it to say, this is
just getting tedious and vulnerable
1795
01:25:27,210 --> 01:25:29,520
to mistakes quickly.
1796
01:25:29,520 --> 01:25:31,710
This surely can't be the way to do this.
1797
01:25:31,710 --> 01:25:35,280
And indeed, SQL is going to let us do
this a little more powerfully instead.
1798
01:25:35,280 --> 01:25:36,790
Let's do this.
1799
01:25:36,790 --> 01:25:39,210
Instead of getting
this table temporarily
1800
01:25:39,210 --> 01:25:42,640
with all these show IDs and all
these genres, let's refine the query.
1801
01:25:42,640 --> 01:25:49,920
So let's just select the show ID from
the genres table where the genre equals
1802
01:25:49,920 --> 01:25:51,810
quote unquote "comedy."
1803
01:25:51,810 --> 01:25:56,670
Now I have a big list of show
IDs, all of which are comedy.
1804
01:25:56,670 --> 01:25:57,205
How many?
1805
01:25:57,205 --> 01:25:58,830
Well, I can combine ideas from earlier.
1806
01:25:58,830 --> 01:26:03,582
I can just count all of those show IDs
or star if I want to just do that too.
1807
01:26:03,582 --> 01:26:05,040
But I can count all those show IDs.
1808
01:26:05,040 --> 01:26:09,880
48,706 comedies on IMDb's
database for TV shows.
1809
01:26:09,880 --> 01:26:11,140
So feels like a lot.
1810
01:26:11,140 --> 01:26:14,010
But how can I now use that
information and get back
1811
01:26:14,010 --> 01:26:18,100
the titles of comedies in the
database without doing it manually?
1812
01:26:18,100 --> 01:26:19,720
Well, let's do this.
1813
01:26:19,720 --> 01:26:21,270
I have a moment ago this query.
1814
01:26:21,270 --> 01:26:25,710
Select the show ID from
genres where the current genre
1815
01:26:25,710 --> 01:26:29,295
is quote unquote "comedy."
1816
01:26:29,295 --> 01:26:31,920
What if I kind of nest these
queries, kind of like grade school
1817
01:26:31,920 --> 01:26:33,030
math in parentheses?
1818
01:26:33,030 --> 01:26:35,940
What if I combine this
whole thing in parentheses?
1819
01:26:35,940 --> 01:26:38,040
And now let me select
what I really want.
1820
01:26:38,040 --> 01:26:43,950
Let me go ahead and select
how about the title of all
1821
01:26:43,950 --> 01:26:52,420
shows where the idea of the show
is in this list of show IDs.
1822
01:26:52,420 --> 01:26:59,440
So if you agree that the shows table
has an ID column, which is otherwise
1823
01:26:59,440 --> 01:27:02,170
known as its primary key, the
unique ID that identifies it, just
1824
01:27:02,170 --> 01:27:06,490
like our Harvard IDs, our Yale IDs,
and you agree that per a moment ago
1825
01:27:06,490 --> 01:27:09,580
this shorter query will
give me back just the show
1826
01:27:09,580 --> 01:27:13,840
IDs of all of the
comedies in the database,
1827
01:27:13,840 --> 01:27:17,710
you can actually combine or
nest these queries together.
1828
01:27:17,710 --> 01:27:19,843
It's going to respect
SQLite order of operations
1829
01:27:19,843 --> 01:27:21,760
with parentheses, just
like grade school math.
1830
01:27:21,760 --> 01:27:24,070
So the thing in parentheses
will be executed first.
1831
01:27:24,070 --> 01:27:27,340
That gives it back a list
of IDs, like 48,000 IDs.
1832
01:27:27,340 --> 01:27:30,160
And then this query, the
outer query, is going
1833
01:27:30,160 --> 01:27:33,790
to get the title from all of the
shows where the ID of the show
1834
01:27:33,790 --> 01:27:36,280
is in that big list of 48,000.
1835
01:27:36,280 --> 01:27:39,233
So if I now execute
these together, I think
1836
01:27:39,233 --> 01:27:42,400
the list is still going to be a little
long, but let me execute it together.
1837
01:27:42,400 --> 01:27:45,880
Now I see this long list of outputs.
1838
01:27:45,880 --> 01:27:46,900
A little overwhelming.
1839
01:27:46,900 --> 01:27:50,650
Let's go ahead and maybe
limit it to just 10
1840
01:27:50,650 --> 01:27:52,460
as before for discussion's sake.
1841
01:27:52,460 --> 01:27:56,110
And now I see 10 comedies
ordered arbitrarily
1842
01:27:56,110 --> 01:27:58,330
from however they're in
the database that happen
1843
01:27:58,330 --> 01:28:01,420
to indeed have comedy as their genre.
1844
01:28:01,420 --> 01:28:04,310
If I want to do this a little
more cleanly, I could do this.
1845
01:28:04,310 --> 01:28:05,260
Let's see.
1846
01:28:05,260 --> 01:28:09,640
Why don't I order by title ascending
order, which is alphabetically,
1847
01:28:09,640 --> 01:28:12,190
or the default is also an ascending.
1848
01:28:12,190 --> 01:28:13,240
Limit 10.
1849
01:28:13,240 --> 01:28:18,250
Now I see the top 10, I mean, weirdly
named things with hash symbols
1850
01:28:18,250 --> 01:28:22,360
presumably to get their titles up to the
beginning or maybe these are hashtags.
1851
01:28:22,360 --> 01:28:27,550
Here now we have alphabetically the
first 10 shows that are comedies.
1852
01:28:27,550 --> 01:28:31,705
Any questions on these kinds of queries?
1853
01:28:31,705 --> 01:28:33,580
It's kind of a lot, but
at the same time it's
1854
01:28:33,580 --> 01:28:37,540
just like composing the smaller
ideas from before into slightly more
1855
01:28:37,540 --> 01:28:39,820
useful queries.
1856
01:28:39,820 --> 01:28:40,360
Yeah.
1857
01:28:40,360 --> 01:28:43,348
AUDIENCE: [INAUDIBLE]
1858
01:28:43,348 --> 01:28:46,343
1859
01:28:46,343 --> 01:28:48,760
DAVID MALAN: Do foreign keys
have to set the relationship?
1860
01:28:48,760 --> 01:28:51,850
When you create the table,
the programmer or the database
1861
01:28:51,850 --> 01:28:54,280
administrator would
create that relationship
1862
01:28:54,280 --> 01:28:57,700
by using those keywords primary
key and foreign key that
1863
01:28:57,700 --> 01:29:01,330
teaches the database what is
related to what per the picture.
1864
01:29:01,330 --> 01:29:02,530
So you do that once.
1865
01:29:02,530 --> 01:29:07,390
And now I being the sort of programmer
who's familiar with the database,
1866
01:29:07,390 --> 01:29:13,750
I am just using these foreign keys in
a manner consistent with their design.
1867
01:29:13,750 --> 01:29:15,680
And this is where it's
useful at some point,
1868
01:29:15,680 --> 01:29:18,940
even if no one hands you a picture, to
make sure you understand the database,
1869
01:29:18,940 --> 01:29:22,180
because that's going to inform
literally what you type in SQL
1870
01:29:22,180 --> 01:29:23,740
to get the data you care about.
1871
01:29:23,740 --> 01:29:26,060
Well, let's do something
a little more precise.
1872
01:29:26,060 --> 01:29:27,860
How about-- very reasonable question.
1873
01:29:27,860 --> 01:29:30,970
And honestly, this is exactly
what imdb.com and the app or for.
1874
01:29:30,970 --> 01:29:34,600
What if you want to find all of
the shows that Steve Carell is in?
1875
01:29:34,600 --> 01:29:36,160
Kind of a reasonable query.
1876
01:29:36,160 --> 01:29:40,150
Literally something someone might type
into Google or more specifically IMDb.
1877
01:29:40,150 --> 01:29:42,130
It's not really obvious
at first glance how
1878
01:29:42,130 --> 01:29:44,795
to do that, though,
because from my database,
1879
01:29:44,795 --> 01:29:47,170
if these are my six tables,
well, I can pretty easily get
1880
01:29:47,170 --> 01:29:48,460
Steve Carell from here.
1881
01:29:48,460 --> 01:29:51,105
But I can really only get his
ID number, whatever that is,
1882
01:29:51,105 --> 01:29:53,230
his name, which I know
already, and his birth year.
1883
01:29:53,230 --> 01:29:56,200
OK, interesting but has nothing
to do with the shows that he's in.
1884
01:29:56,200 --> 01:30:00,850
I can look at shows over here, but
there's no mention of Steve Carell
1885
01:30:00,850 --> 01:30:02,710
because there's no person ID here.
1886
01:30:02,710 --> 01:30:04,660
Where is that relationship implemented?
1887
01:30:04,660 --> 01:30:06,590
Well, it's implemented down here.
1888
01:30:06,590 --> 01:30:07,910
So how do we do this?
1889
01:30:07,910 --> 01:30:10,210
Well, here's the perfect
example of a lesson
1890
01:30:10,210 --> 01:30:13,450
we've been trying to emphasize for
weeks of taking these baby steps.
1891
01:30:13,450 --> 01:30:17,750
Break larger problems down into smaller
ones and let's do something like this.
1892
01:30:17,750 --> 01:30:20,740
Let's just get everything I know
about Steve Carell from the database.
1893
01:30:20,740 --> 01:30:24,190
Let's select star from people
where the name of the person
1894
01:30:24,190 --> 01:30:26,890
is quote unquote "Steve Carell."
1895
01:30:26,890 --> 01:30:28,630
I just want to see what data we've got.
1896
01:30:28,630 --> 01:30:30,640
And here's what we have.
1897
01:30:30,640 --> 01:30:33,130
There's only one Steve
Carell born in 1962
1898
01:30:33,130 --> 01:30:37,150
and his unique ID is
136797 according to IMDb.
1899
01:30:37,150 --> 01:30:40,930
This isn't some global
actor identifier, per se.
1900
01:30:40,930 --> 01:30:45,010
All right, well how do I get now all
of the shows that Steve Carell is in?
1901
01:30:45,010 --> 01:30:46,130
Well, I could do this.
1902
01:30:46,130 --> 01:30:51,190
Select star from stars,
not to confuse the two.
1903
01:30:51,190 --> 01:30:53,500
One's the symbol, one's the table name.
1904
01:30:53,500 --> 01:30:58,690
Where person ID equals 136797.
1905
01:30:58,690 --> 01:31:03,070
So I think this will now give me
everything from the stars table
1906
01:31:03,070 --> 01:31:04,943
that relates to Steve Carell.
1907
01:31:04,943 --> 01:31:07,360
And you'll see person ID is
the same because I'm literally
1908
01:31:07,360 --> 01:31:08,693
searching for just Steve Carell.
1909
01:31:08,693 --> 01:31:12,200
But there are like 20 or
so shows that he's been in.
1910
01:31:12,200 --> 01:31:14,450
All right, well here's where
things would get tedious.
1911
01:31:14,450 --> 01:31:15,340
What are those shows?
1912
01:31:15,340 --> 01:31:21,460
Well, I could do select title from
shows where the ID of the show equals.
1913
01:31:21,460 --> 01:31:25,670
And here's whenever you copy paste,
you're probably doing something wrong.
1914
01:31:25,670 --> 01:31:27,890
OK, he was in The Dana Carvey Show.
1915
01:31:27,890 --> 01:31:28,715
Familiar with that.
1916
01:31:28,715 --> 01:31:29,590
Let's do another one.
1917
01:31:29,590 --> 01:31:31,460
We'll copy paste this.
1918
01:31:31,460 --> 01:31:34,090
Where ID equals this.
1919
01:31:34,090 --> 01:31:34,890
Over The Top.
1920
01:31:34,890 --> 01:31:35,390
Another.
1921
01:31:35,390 --> 01:31:37,000
And if we keep digging, we'll
probably find The Office.
1922
01:31:37,000 --> 01:31:40,240
But my God, that's going to take
forever to do 20 queries manually.
1923
01:31:40,240 --> 01:31:41,252
It's not very dynamic.
1924
01:31:41,252 --> 01:31:43,960
But what if we just nest these
queries a little more dynamically?
1925
01:31:43,960 --> 01:31:45,830
So let me start from
the beginning again.
1926
01:31:45,830 --> 01:31:48,890
What if we go ahead
and select everything
1927
01:31:48,890 --> 01:31:56,410
we know about people whose
name equals Steve Carell.
1928
01:31:56,410 --> 01:31:58,220
That gave us earlier this data.
1929
01:31:58,220 --> 01:31:59,470
I don't need all of that data.
1930
01:31:59,470 --> 01:32:00,100
I know his name.
1931
01:32:00,100 --> 01:32:01,517
I don't care about his birth year.
1932
01:32:01,517 --> 01:32:05,140
So let's change this to just be
give me the ID of Steve Carell.
1933
01:32:05,140 --> 01:32:08,390
And that gives me back now this
smaller temporary data set.
1934
01:32:08,390 --> 01:32:08,890
All right.
1935
01:32:08,890 --> 01:32:12,195
Can I now use this
inside of another query?
1936
01:32:12,195 --> 01:32:14,320
Well, let me wrap the whole
thing with parentheses.
1937
01:32:14,320 --> 01:32:18,850
And now let me say select
star from the stars table
1938
01:32:18,850 --> 01:32:22,477
where the person ID equals this.
1939
01:32:22,477 --> 01:32:24,310
So I'm deliberately not
using in because I'm
1940
01:32:24,310 --> 01:32:26,780
assuming there's indeed only
one Steve Carell in the world.
1941
01:32:26,780 --> 01:32:28,480
So I'm not getting back
a list of Steve Carells.
1942
01:32:28,480 --> 01:32:30,640
I'm getting back the one
and only in this case.
1943
01:32:30,640 --> 01:32:31,810
So equal is fine.
1944
01:32:31,810 --> 01:32:33,310
In is when you have multiple.
1945
01:32:33,310 --> 01:32:34,840
Equal is when you have one.
1946
01:32:34,840 --> 01:32:36,375
Let me go ahead and hit Enter now.
1947
01:32:36,375 --> 01:32:37,750
OK, that's more data than I need.
1948
01:32:37,750 --> 01:32:40,430
I don't need like 20 copies
of Steve Carell's person ID.
1949
01:32:40,430 --> 01:32:41,380
So let me hit up.
1950
01:32:41,380 --> 01:32:46,420
Let me go back and let me just
get show ID from Steve Carell.
1951
01:32:46,420 --> 01:32:52,070
And now I have a list of just the 20
or so show IDs that he has been in.
1952
01:32:52,070 --> 01:32:52,570
All right.
1953
01:32:52,570 --> 01:32:53,597
How can I now use this?
1954
01:32:53,597 --> 01:32:54,430
Well, let me hit up.
1955
01:32:54,430 --> 01:32:57,040
Let me put the whole
thing in parentheses.
1956
01:32:57,040 --> 01:32:58,960
And now let me select
what I really want.
1957
01:32:58,960 --> 01:33:03,340
Select title from shows where.
1958
01:33:03,340 --> 01:33:05,020
And here's the final flourish.
1959
01:33:05,020 --> 01:33:11,350
The shows table has an ID, has a
title, has a year, and has an episode.
1960
01:33:11,350 --> 01:33:21,610
And what I really want, though, is to
check which shows have ID that is what?
1961
01:33:21,610 --> 01:33:24,510
Anyone want to finish the thought?
1962
01:33:24,510 --> 01:33:25,860
I just want to-- yeah.
1963
01:33:25,860 --> 01:33:28,435
AUDIENCE: [INAUDIBLE]
1964
01:33:28,435 --> 01:33:29,310
DAVID MALAN: Exactly.
1965
01:33:29,310 --> 01:33:30,570
ID in this.
1966
01:33:30,570 --> 01:33:31,690
And this is getting ugly.
1967
01:33:31,690 --> 01:33:33,810
And when you actually write
your queries in a text file,
1968
01:33:33,810 --> 01:33:35,730
you can format them
nicely and indent them.
1969
01:33:35,730 --> 01:33:36,772
My font is just getting--
1970
01:33:36,772 --> 01:33:38,938
I don't want to make it too
small to fit everything.
1971
01:33:38,938 --> 01:33:40,230
But now we have three queries.
1972
01:33:40,230 --> 01:33:43,380
One is in doubly nested parentheses,
then there's the middle one,
1973
01:33:43,380 --> 01:33:44,550
then there's the outer one.
1974
01:33:44,550 --> 01:33:47,310
So this last query is going
to get me the title from shows
1975
01:33:47,310 --> 01:33:50,460
where the ID of the show
is in this big list of 20
1976
01:33:50,460 --> 01:33:52,410
or so show IDs that Steve Carell is in.
1977
01:33:52,410 --> 01:33:54,812
And I knew that because I
looked up his name here.
1978
01:33:54,812 --> 01:33:57,270
And notice what I did not do
this time is I didn't manually
1979
01:33:57,270 --> 01:33:58,620
hardcode his ID number.
1980
01:33:58,620 --> 01:33:59,430
There's no need.
1981
01:33:59,430 --> 01:34:02,508
That would be kind of a bad way
to implement a website if you're
1982
01:34:02,508 --> 01:34:04,050
using a database underneath the hood.
1983
01:34:04,050 --> 01:34:07,470
You want the IMDb for real
to search for whatever
1984
01:34:07,470 --> 01:34:11,010
the human typed in and no one's going
to know Steve Carell's person ID
1985
01:34:11,010 --> 01:34:12,070
or anything else.
1986
01:34:12,070 --> 01:34:14,580
So here we've done this all dynamically.
1987
01:34:14,580 --> 01:34:18,690
And now if I hit Enter, I
think I get all of his shows.
1988
01:34:18,690 --> 01:34:21,840
Let's go ahead and order this
by title just to make it tidy.
1989
01:34:21,840 --> 01:34:24,810
And you probably will see at
least one or more shows that.
1990
01:34:24,810 --> 01:34:29,520
And probably the most popular
is, dot dot dot, The Office.
1991
01:34:29,520 --> 01:34:32,280
So this is literally
the kind of query that's
1992
01:34:32,280 --> 01:34:37,200
being executed underneath the hood when
you go to websites or apps like IMDb.
1993
01:34:37,200 --> 01:34:42,030
Your textual query is probably being
plugged into a longer SQL query
1994
01:34:42,030 --> 01:34:44,850
like this where some
programmer at IMDb probably
1995
01:34:44,850 --> 01:34:48,420
wrote this whole query in
advance weeks, months, years ago
1996
01:34:48,420 --> 01:34:52,470
and they're just somehow plugging
in the value that you the human
1997
01:34:52,470 --> 01:34:55,550
typed into the search box or the like.
1998
01:34:55,550 --> 01:35:03,010
Questions now on finding
this data or any other?
1999
01:35:03,010 --> 01:35:03,890
No?
2000
01:35:03,890 --> 01:35:04,390
OK.
2001
01:35:04,390 --> 01:35:06,980
So where else could we go with this?
2002
01:35:06,980 --> 01:35:12,700
Well, let's consider how
else we might combine data.
2003
01:35:12,700 --> 01:35:16,060
Suppose that the next question
actually perhaps appropriately
2004
01:35:16,060 --> 01:35:19,780
would be focusing in on not just
people and shows and these stars,
2005
01:35:19,780 --> 01:35:23,680
but how do we gather more information
about the shows themselves,
2006
01:35:23,680 --> 01:35:25,430
like the genres, the
ratings, or the like.
2007
01:35:25,430 --> 01:35:28,030
So indeed, let's focus on
just these two tables here.
2008
01:35:28,030 --> 01:35:31,640
Recall that every show has an ID,
a title, a year, and episodes.
2009
01:35:31,640 --> 01:35:34,660
But it also might have one or
more relationships with rows
2010
01:35:34,660 --> 01:35:36,280
and this other table called genres.
2011
01:35:36,280 --> 01:35:39,250
And this is so that a show can
be a comedy, can be a drama,
2012
01:35:39,250 --> 01:35:40,960
can be any number of other things.
2013
01:35:40,960 --> 01:35:42,400
One row per.
2014
01:35:42,400 --> 01:35:45,040
So you would see the same
show ID again and again
2015
01:35:45,040 --> 01:35:48,760
and again with a different
genre written in English
2016
01:35:48,760 --> 01:35:51,070
like comedy, drama, or the like.
2017
01:35:51,070 --> 01:35:53,890
Well, how do I kind of
reconstitute that data?
2018
01:35:53,890 --> 01:35:56,710
Well, turns out there's a few
different ways to do this.
2019
01:35:56,710 --> 01:36:00,610
And let me propose that we
introduce this keyword here, join.
2020
01:36:00,610 --> 01:36:05,525
And this is really the most powerful
of the keywords in SQL itself.
2021
01:36:05,525 --> 01:36:06,650
It doesn't have to be used.
2022
01:36:06,650 --> 01:36:09,310
We've seen with nested
queries that you can still
2023
01:36:09,310 --> 01:36:13,490
select data across multiple
tables, but here is another way.
2024
01:36:13,490 --> 01:36:14,660
So let me do this.
2025
01:36:14,660 --> 01:36:17,710
Let me go back to my SQLite database.
2026
01:36:17,710 --> 01:36:23,680
And let me select sort of in one
breath exactly the data I want.
2027
01:36:23,680 --> 01:36:26,980
Select star from shows.
2028
01:36:26,980 --> 01:36:30,500
And let's just limit this initially
to 10 to see what it looks like.
2029
01:36:30,500 --> 01:36:31,000
All right.
2030
01:36:31,000 --> 01:36:32,250
That's, again, the shows data.
2031
01:36:32,250 --> 01:36:34,690
Select star from genres.
2032
01:36:34,690 --> 01:36:37,760
Let's limit that to 10 too, just
to wrap our minds around it.
2033
01:36:37,760 --> 01:36:39,730
And now this is not that useful.
2034
01:36:39,730 --> 01:36:43,510
However, the data in
the leftmost column here
2035
01:36:43,510 --> 01:36:45,490
is the primary key in the shows table.
2036
01:36:45,490 --> 01:36:46,870
These are just unique IDs.
2037
01:36:46,870 --> 01:36:51,530
The data here in the genres table,
recall, show ID is the foreign key.
2038
01:36:51,530 --> 01:36:54,310
So it's the same numbers but
just copied into another table
2039
01:36:54,310 --> 01:36:56,650
so that we can have this
relationship across them.
2040
01:36:56,650 --> 01:37:00,070
How do I kind of line up these
numbers with these numbers
2041
01:37:00,070 --> 01:37:06,430
to get back a wider table that has title
and year and episodes and genre and,
2042
01:37:06,430 --> 01:37:08,950
heck, ratings and all
of that too if we want?
2043
01:37:08,950 --> 01:37:11,230
Well, you can join these
tables by just telling
2044
01:37:11,230 --> 01:37:12,890
the database what to join on what.
2045
01:37:12,890 --> 01:37:13,940
So let me do this.
2046
01:37:13,940 --> 01:37:17,110
Select star from shows.
2047
01:37:17,110 --> 01:37:21,760
Join that table though
on the genres table.
2048
01:37:21,760 --> 01:37:24,160
Well, how do you want to
join those two tables?
2049
01:37:24,160 --> 01:37:26,860
And again, the two tables from
the picture looked like this.
2050
01:37:26,860 --> 01:37:30,400
How do you tell SQL
programmatically to put one of them
2051
01:37:30,400 --> 01:37:32,420
right next to the other,
line up all of the ID
2052
01:37:32,420 --> 01:37:34,820
so that you just get
one larger data set?
2053
01:37:34,820 --> 01:37:39,160
Well, we can use indeed
this syntax called join.
2054
01:37:39,160 --> 01:37:41,140
So back to VS Code here.
2055
01:37:41,140 --> 01:37:44,380
And let me join these two tables.
2056
01:37:44,380 --> 01:37:45,970
Sorry, typo here.
2057
01:37:45,970 --> 01:37:51,940
Join genres on the shows
table's ID column, a.k.a.
2058
01:37:51,940 --> 01:37:58,240
its primary key, equaling the genres
table's show ID column, a.k.a.
2059
01:37:58,240 --> 01:37:59,210
the foreign key.
2060
01:37:59,210 --> 01:38:01,090
So in other words, it
looks a little cryptic,
2061
01:38:01,090 --> 01:38:05,080
but I'm just telling SQL how to line
up these two tables and what column
2062
01:38:05,080 --> 01:38:08,230
to match with the other so
that the numbers line up
2063
01:38:08,230 --> 01:38:10,610
and I get essentially a wider table.
2064
01:38:10,610 --> 01:38:14,930
Let me go ahead and hit
semicolon and Enter.
2065
01:38:14,930 --> 01:38:17,330
And this is now going to
give me a lot of data.
2066
01:38:17,330 --> 01:38:18,600
We might have to interrupt it.
2067
01:38:18,600 --> 01:38:22,650
But notice even at a glance, we're
getting the ID, the title, the year,
2068
01:38:22,650 --> 01:38:25,130
the number of episodes,
the ID again redundantly,
2069
01:38:25,130 --> 01:38:27,140
but that's to be expected
if I'm joining them,
2070
01:38:27,140 --> 01:38:29,360
and the genre all the way on the right.
2071
01:38:29,360 --> 01:38:31,740
Let me hit Control C to interrupt.
2072
01:38:31,740 --> 01:38:33,890
Let me just limit this to The Office.
2073
01:38:33,890 --> 01:38:36,500
So where title equals
quote unquote "The Office"
2074
01:38:36,500 --> 01:38:39,680
so we can focus on just one sample data.
2075
01:38:39,680 --> 01:38:43,490
And here, fun fact, there's
been more than one Office.
2076
01:38:43,490 --> 01:38:46,100
The one that you all
probably like is this one
2077
01:38:46,100 --> 01:38:49,250
that started in 2005 with 188 episodes.
2078
01:38:49,250 --> 01:38:53,060
Its ID in the shows table is 386676.
2079
01:38:53,060 --> 01:38:54,710
That's confirmed over here too.
2080
01:38:54,710 --> 01:38:56,850
So again, we've just
joined the two tables.
2081
01:38:56,850 --> 01:38:57,350
How?
2082
01:38:57,350 --> 01:38:58,730
By lining up those fields.
2083
01:38:58,730 --> 01:39:01,490
But now that we can see that
almost all of The Offices
2084
01:39:01,490 --> 01:39:04,700
produced over the decades are
comedies except for this one.
2085
01:39:04,700 --> 01:39:07,460
There was a version of The
Office produced in 2001
2086
01:39:07,460 --> 01:39:10,160
that was considered more of a drama.
2087
01:39:10,160 --> 01:39:11,940
Unsure if it's related to the other.
2088
01:39:11,940 --> 01:39:13,640
How can we link in other data?
2089
01:39:13,640 --> 01:39:16,920
Well, let's go ahead and link
in ratings too or instead.
2090
01:39:16,920 --> 01:39:21,290
So instead of joining this with
genres, let me go ahead and rewind here
2091
01:39:21,290 --> 01:39:28,550
and join shows on ratings on
shows.id equals ratings.show_ID.
2092
01:39:28,550 --> 01:39:32,630
And let's limit it to The Office too
for discussion's sake where title equals
2093
01:39:32,630 --> 01:39:35,820
quote unquote "The Office" semicolon.
2094
01:39:35,820 --> 01:39:39,650
And now you can see that
among the various Offices,
2095
01:39:39,650 --> 01:39:43,100
it looks like the one that most
of us probably know and love
2096
01:39:43,100 --> 01:39:49,070
is the highest rated also with a
9.0 with like 585,000 people having
2097
01:39:49,070 --> 01:39:52,280
cast votes for whereas this other
shows seem to have been less popular.
2098
01:39:52,280 --> 01:39:56,370
And perhaps that's why indeed you
see fewer episodes for them as well.
2099
01:39:56,370 --> 01:39:58,680
So even though we've put
the data in multiple places,
2100
01:39:58,680 --> 01:40:02,670
you can still kind of reconstitute
it by lining things up in this way
2101
01:40:02,670 --> 01:40:04,970
and rejoining the tables.
2102
01:40:04,970 --> 01:40:09,140
Questions now on this?
2103
01:40:09,140 --> 01:40:16,540
This is the heart of what SQL does and
what relational databases do for you.
2104
01:40:16,540 --> 01:40:18,100
Questions?
2105
01:40:18,100 --> 01:40:18,610
All right.
2106
01:40:18,610 --> 01:40:20,470
A few final features.
2107
01:40:20,470 --> 01:40:22,660
There's not all that much that--
2108
01:40:22,660 --> 01:40:24,320
SQL takes practice like anything else.
2109
01:40:24,320 --> 01:40:26,237
But in terms of syntax
and capabilities, let's
2110
01:40:26,237 --> 01:40:30,610
just introduce you to a couple of final
features here and problems that arise
2111
01:40:30,610 --> 01:40:32,210
and how we might solve them.
2112
01:40:32,210 --> 01:40:34,640
Let's do this as well.
2113
01:40:34,640 --> 01:40:36,980
So let me go back into VS Code here.
2114
01:40:36,980 --> 01:40:40,900
And let's just find out Steve
Carell's information again.
2115
01:40:40,900 --> 01:40:43,180
Last time we did it
with this nested query
2116
01:40:43,180 --> 01:40:47,503
by getting his ID and then the show IDs
and then the titles for those show IDs.
2117
01:40:47,503 --> 01:40:49,420
With join, you can do
it a little differently.
2118
01:40:49,420 --> 01:40:50,950
And any of these ways are fine.
2119
01:40:50,950 --> 01:40:53,740
One might become easier
to mentally than another.
2120
01:40:53,740 --> 01:40:57,040
Let's go ahead and select
the titles from what.
2121
01:40:57,040 --> 01:41:02,095
Let's select the title
from the people table.
2122
01:41:02,095 --> 01:41:03,220
And I'm going to hit Enter.
2123
01:41:03,220 --> 01:41:05,260
And when you're using
SQLite3 interactively,
2124
01:41:05,260 --> 01:41:09,400
if you ever find yourself with a prompt
that says dot dot dot angle bracket,
2125
01:41:09,400 --> 01:41:12,067
it means you're continuing your
thought onto the next line.
2126
01:41:12,067 --> 01:41:13,900
If you didn't intend
that, you can sometimes
2127
01:41:13,900 --> 01:41:16,510
hit semicolon to just
end the thought and hit
2128
01:41:16,510 --> 01:41:18,460
Enter even if it triggers an error.
2129
01:41:18,460 --> 01:41:21,430
But this is one way of formatting
my queries now a little more nicely.
2130
01:41:21,430 --> 01:41:24,597
I'm just going to add some white space
so that it's a little easier to read.
2131
01:41:24,597 --> 01:41:25,720
What do I want to select?
2132
01:41:25,720 --> 01:41:28,000
Well, I want to select
the title of shows
2133
01:41:28,000 --> 01:41:35,320
from the people table joined with the
stars table on the people table's ID
2134
01:41:35,320 --> 01:41:40,520
column equaling the stars
table's person ID column.
2135
01:41:40,520 --> 01:41:44,230
So in other words, if you think back
to what people are and what stars are,
2136
01:41:44,230 --> 01:41:48,250
one has an ID, one has a person ID, I'm
just now connecting those two tables.
2137
01:41:48,250 --> 01:41:49,450
I'm joining those two.
2138
01:41:49,450 --> 01:41:53,350
But I want to do this as
well with another table.
2139
01:41:53,350 --> 01:41:54,938
Let me additionally join in.
2140
01:41:54,938 --> 01:41:56,980
So now I only have two
hands, but now I'm putting
2141
01:41:56,980 --> 01:41:59,320
a third table joined in together here.
2142
01:41:59,320 --> 01:42:05,960
Join shows on stars.show_id
equals shows.id.
2143
01:42:05,960 --> 01:42:08,500
So this is now linking
three tables together.
2144
01:42:08,500 --> 01:42:12,100
But I only care about this for one
person, so where the name of the person
2145
01:42:12,100 --> 01:42:14,650
equals quote unquote "Steve Carell."
2146
01:42:14,650 --> 01:42:18,850
So more cryptic, to be sure, but
what we're doing with this query
2147
01:42:18,850 --> 01:42:21,370
is just taking all three
tables that we care about
2148
01:42:21,370 --> 01:42:23,440
and we're joining them
all together at once
2149
01:42:23,440 --> 01:42:27,250
using this new join syntax
literally telling the database what
2150
01:42:27,250 --> 01:42:29,020
columns to line up with what.
2151
01:42:29,020 --> 01:42:32,350
And then we filter at the very
end just like before to get back,
2152
01:42:32,350 --> 01:42:35,650
if I hit Enter, the answer
we want, which in this case
2153
01:42:35,650 --> 01:42:40,030
is a little slower at the moment,
but that same list of 20 or so
2154
01:42:40,030 --> 01:42:41,350
shows that he's been in.
2155
01:42:41,350 --> 01:42:42,920
There's one other way to do this.
2156
01:42:42,920 --> 01:42:45,250
And again, these are all
in the slides online.
2157
01:42:45,250 --> 01:42:47,890
So you can repeat them without having to
jot down everything and we'll put them
2158
01:42:47,890 --> 01:42:48,820
in the notes too.
2159
01:42:48,820 --> 01:42:50,450
But there's another way to do this.
2160
01:42:50,450 --> 01:42:52,780
I could also use an implicit join.
2161
01:42:52,780 --> 01:42:55,030
So that was an explicit join
because I literally typed
2162
01:42:55,030 --> 01:42:57,220
the word join multiple times at that.
2163
01:42:57,220 --> 01:43:00,880
But let me go ahead and select
the title from these three tables.
2164
01:43:00,880 --> 01:43:02,628
People, stars, and shows.
2165
01:43:02,628 --> 01:43:04,420
And this might just be
nicer because if you
2166
01:43:04,420 --> 01:43:06,337
know what tables you
want to select data from,
2167
01:43:06,337 --> 01:43:08,980
just enumerate them separated by
commas, which you might prefer
2168
01:43:08,980 --> 01:43:18,970
in your mind, where the people ID
equals the stars person ID and the stars
2169
01:43:18,970 --> 01:43:25,450
show ID equals the shows ID and the
name of the person equals Steve Carell.
2170
01:43:25,450 --> 01:43:26,788
So this is an implicit join.
2171
01:43:26,788 --> 01:43:29,830
And honestly, I constantly reference
my notes for some of this stuff too.
2172
01:43:29,830 --> 01:43:31,720
It's not the kind of thing that's
going to come like this to you
2173
01:43:31,720 --> 01:43:32,740
after just one day.
2174
01:43:32,740 --> 01:43:35,560
But it's just a different way
of expressing the same thing.
2175
01:43:35,560 --> 01:43:37,720
I want to select data from
three different tables.
2176
01:43:37,720 --> 01:43:43,120
And hey SQL, here is how I want you
to line those tables up so that I can
2177
01:43:43,120 --> 01:43:45,640
get like related data for Steve Carell.
2178
01:43:45,640 --> 01:43:49,450
And this now will achieve
the same results ultimately.
2179
01:43:49,450 --> 01:43:50,260
Let me hit Enter.
2180
01:43:50,260 --> 01:43:54,040
2181
01:43:54,040 --> 01:43:54,920
And there we go.
2182
01:43:54,920 --> 01:43:55,750
So a little slower.
2183
01:43:55,750 --> 01:43:57,850
And performance might
vary based on computer,
2184
01:43:57,850 --> 01:44:01,270
based on implementation of SQL, but I
think I still have the same answers.
2185
01:44:01,270 --> 01:44:04,790
Now suppose, as I often do, and I
had to look it up again last time,
2186
01:44:04,790 --> 01:44:08,350
suppose you forget how to
spell Steve Carell's name.
2187
01:44:08,350 --> 01:44:10,240
Is it two R's, two L's, or the like?
2188
01:44:10,240 --> 01:44:12,903
Well, I could also do
something like this.
2189
01:44:12,903 --> 01:44:14,320
Well, let's just keep this simple.
2190
01:44:14,320 --> 01:44:17,103
Select star from people
where name equals.
2191
01:44:17,103 --> 01:44:20,020
I've been deliberately getting it
right so as to not embarrass myself.
2192
01:44:20,020 --> 01:44:22,000
That's the Steve Carell I keep querying.
2193
01:44:22,000 --> 01:44:25,960
If you forget, well you could
try searching for just Steves,
2194
01:44:25,960 --> 01:44:28,450
but interestingly,
there's a bunch of Steves.
2195
01:44:28,450 --> 01:44:30,070
We don't know when they were born.
2196
01:44:30,070 --> 01:44:35,360
But that's probably not the Steve Carell
we want if we don't have his last name.
2197
01:44:35,360 --> 01:44:38,770
So I could alternatively do, well it's
Steve and then it starts with a C,
2198
01:44:38,770 --> 01:44:39,460
I think.
2199
01:44:39,460 --> 01:44:42,700
Well, it turns out there's another
wild card you can use in SQL.
2200
01:44:42,700 --> 01:44:46,060
We used the asterisk to
select all of the columns.
2201
01:44:46,060 --> 01:44:50,800
You can in quotes use a percent
sign to say C something.
2202
01:44:50,800 --> 01:44:54,370
So there's 0 or more
characters after the letter C.
2203
01:44:54,370 --> 01:44:58,630
And now this doesn't work because
now I would be literally looking
2204
01:44:58,630 --> 01:45:01,690
for Steve space C something.
2205
01:45:01,690 --> 01:45:04,910
But recall earlier I mentioned
that one other keyword,
2206
01:45:04,910 --> 01:45:06,868
which is for fuzzier
matching, so to speak,
2207
01:45:06,868 --> 01:45:09,160
where it's not exactly what
you're looking for but it's
2208
01:45:09,160 --> 01:45:10,630
like what you're looking for.
2209
01:45:10,630 --> 01:45:14,350
If you instead say where his name
is like Steve space C something,
2210
01:45:14,350 --> 01:45:17,060
now we'll get back a
whole bunch of Steves.
2211
01:45:17,060 --> 01:45:19,810
But I think now I could probably
find the one I'm actually looking
2212
01:45:19,810 --> 01:45:20,920
for if I don't remember his name.
2213
01:45:20,920 --> 01:45:22,510
You can use multiple percent signs.
2214
01:45:22,510 --> 01:45:25,260
If you forget what his first name
is, you could reverse the order.
2215
01:45:25,260 --> 01:45:29,940
But that too is a very
powerful SQL feature at that.
2216
01:45:29,940 --> 01:45:32,670
Questions on these queries here?
2217
01:45:32,670 --> 01:45:33,708
Yeah.
2218
01:45:33,708 --> 01:45:35,448
AUDIENCE: [INAUDIBLE]
2219
01:45:35,448 --> 01:45:36,240
DAVID MALAN: Sorry?
2220
01:45:36,240 --> 01:45:38,730
AUDIENCE: [INAUDIBLE]
2221
01:45:38,730 --> 01:45:40,003
DAVID MALAN: What about it?
2222
01:45:40,003 --> 01:45:40,990
AUDIENCE: [INAUDIBLE]
2223
01:45:40,990 --> 01:45:42,400
DAVID MALAN: Oh yeah, sure.
2224
01:45:42,400 --> 01:45:44,500
So the query I used here.
2225
01:45:44,500 --> 01:45:49,810
There's a lot of Steves whose last
name starts with C. Oops, too far.
2226
01:45:49,810 --> 01:45:52,760
The last query I executed
was this one here.
2227
01:45:52,760 --> 01:45:57,110
So where the name is like
quote unquote "Steve C%."
2228
01:45:57,110 --> 01:45:59,950
So that's just another
tool for your toolkit here.
2229
01:45:59,950 --> 01:46:02,830
But you'll perhaps have
notice that those two--
2230
01:46:02,830 --> 01:46:05,380
prior to that query, the
joins I did were sort of slow.
2231
01:46:05,380 --> 01:46:07,480
And honestly, this database
isn't even that big.
2232
01:46:07,480 --> 01:46:10,120
Like yes, it has tens of
thousands of rows in it.
2233
01:46:10,120 --> 01:46:13,630
But in the real world and most of the
apps you and I use a lot every day
2234
01:46:13,630 --> 01:46:17,080
or websites, there's millions,
even billions of rows of data.
2235
01:46:17,080 --> 01:46:22,210
And if I had to wait on my computer
here or my code space a second or two
2236
01:46:22,210 --> 01:46:25,990
to get the data, that's not going to
work for millions of users or customers
2237
01:46:25,990 --> 01:46:26,680
certainly.
2238
01:46:26,680 --> 01:46:29,020
So how can we actually improve things?
2239
01:46:29,020 --> 01:46:34,330
Well, it turns out another upside
of a proper relational database
2240
01:46:34,330 --> 01:46:38,830
is that it's not just a spreadsheet
where the onus is on you to find
2241
01:46:38,830 --> 01:46:40,120
the data you're looking for.
2242
01:46:40,120 --> 01:46:43,780
You can also tell the database
to index the data for you.
2243
01:46:43,780 --> 01:46:48,670
An index is an efficient cheat
sheet for finding data fast.
2244
01:46:48,670 --> 01:46:52,908
Like books in the real world often
have indices at the end of the book
2245
01:46:52,908 --> 01:46:54,700
where you can look
things up alphabetically
2246
01:46:54,700 --> 01:46:57,790
and then you can cross reference it
for the pages that topic appears on.
2247
01:46:57,790 --> 01:46:59,110
Same idea in a database.
2248
01:46:59,110 --> 01:47:03,190
If you tell the database in advance that
you want to search on a certain column
2249
01:47:03,190 --> 01:47:06,700
frequently, you can tell it to
build a fancy index that will just
2250
01:47:06,700 --> 01:47:08,770
allow you to search that column faster.
2251
01:47:08,770 --> 01:47:11,560
By default, these columns
are going to be searched
2252
01:47:11,560 --> 01:47:13,197
most likely by a linear search.
2253
01:47:13,197 --> 01:47:15,280
Not even binary search,
because the data might not
2254
01:47:15,280 --> 01:47:17,390
be sorted because it came in any order.
2255
01:47:17,390 --> 01:47:20,170
But if you create an
index, you're probably
2256
01:47:20,170 --> 01:47:22,780
going to get something closer
to logarithmic than linear,
2257
01:47:22,780 --> 01:47:24,892
and that's going to
be a big plus overall.
2258
01:47:24,892 --> 01:47:26,350
So let me do something simple here.
2259
01:47:26,350 --> 01:47:30,670
First let me turn on a SQLite specific
feature that just is going to time all
2260
01:47:30,670 --> 01:47:33,190
of my queries by writing .timer on.
2261
01:47:33,190 --> 01:47:37,515
I just want to keep track of how
long each of these commands takes.
2262
01:47:37,515 --> 01:47:40,390
This one is not a slow command, so
this is just going to be relative.
2263
01:47:40,390 --> 01:47:43,300
But let's just select
everything from the shows table
2264
01:47:43,300 --> 01:47:45,970
where the title thereof is The Office.
2265
01:47:45,970 --> 01:47:48,940
Let's see how long this
relatively simple query takes.
2266
01:47:48,940 --> 01:47:50,980
All right, not very long at all.
2267
01:47:50,980 --> 01:47:54,850
In real terms less than
a second, 0.035 seconds.
2268
01:47:54,850 --> 01:47:57,070
So not slow by any means.
2269
01:47:57,070 --> 01:48:00,160
But if you've got hundreds,
thousands, millions of users,
2270
01:48:00,160 --> 01:48:03,040
every one of those milliseconds
could very well add up.
2271
01:48:03,040 --> 01:48:04,640
So can we do better?
2272
01:48:04,640 --> 01:48:06,440
Well, we can if I do this.
2273
01:48:06,440 --> 01:48:11,740
If I use syntax like this once in the
beginning of the design of my database,
2274
01:48:11,740 --> 01:48:17,980
I create not a table but an index with
some name on a specific table on one
2275
01:48:17,980 --> 01:48:18,820
or more columns.
2276
01:48:18,820 --> 01:48:21,370
I can give a clue, a hint
to the database in advance
2277
01:48:21,370 --> 01:48:26,260
saying please optimize with some
secret sauce searching or selecting
2278
01:48:26,260 --> 01:48:29,740
on this column in this table
so that my searches are faster.
2279
01:48:29,740 --> 01:48:30,860
So let me do this.
2280
01:48:30,860 --> 01:48:32,590
Let me go back to VS Code here.
2281
01:48:32,590 --> 01:48:36,460
Let me create an index
called how about title index.
2282
01:48:36,460 --> 01:48:40,240
I could call it anything I want, but
I want to search faster on titles.
2283
01:48:40,240 --> 01:48:45,940
So I'm going to call this a title index
where rather title index on the table
2284
01:48:45,940 --> 01:48:46,900
called shows.
2285
01:48:46,900 --> 01:48:49,150
And then in parentheses is the syntax.
2286
01:48:49,150 --> 01:48:50,620
The column called title.
2287
01:48:50,620 --> 01:48:53,350
So again, I've just borrowed
this canonical syntax
2288
01:48:53,350 --> 01:48:57,590
and I've just translated it into
something that's TV show specific.
2289
01:48:57,590 --> 01:48:58,090
All right.
2290
01:48:58,090 --> 01:49:00,040
What is this going to do for me?
2291
01:49:00,040 --> 01:49:03,700
Once I hit Enter, this is going to
create in the computer's memory,
2292
01:49:03,700 --> 01:49:05,860
the database's memory
something called a B-tree.
2293
01:49:05,860 --> 01:49:07,330
It's not a binary tree.
2294
01:49:07,330 --> 01:49:11,627
A B-tree is actually a potentially
more efficient data structure
2295
01:49:11,627 --> 01:49:13,960
that we didn't talk about a
few weeks back in week five,
2296
01:49:13,960 --> 01:49:15,710
but it looks a little
something like this,
2297
01:49:15,710 --> 01:49:17,680
which looks similar to a binary tree.
2298
01:49:17,680 --> 01:49:21,847
But does anyone notice what
makes this not a binary tree?
2299
01:49:21,847 --> 01:49:26,405
AUDIENCE: [INAUDIBLE]
2300
01:49:26,405 --> 01:49:27,280
DAVID MALAN: Exactly.
2301
01:49:27,280 --> 01:49:31,570
Binary tree, bi implying two, has
no more than two children per node,
2302
01:49:31,570 --> 01:49:33,580
but here's a perfect
example, one, two, three.
2303
01:49:33,580 --> 01:49:35,990
And there could be four
children, five children or more.
2304
01:49:35,990 --> 01:49:38,530
But the effect of that, if
you have a very wide tree,
2305
01:49:38,530 --> 01:49:40,780
the upside is that it's very short.
2306
01:49:40,780 --> 01:49:44,560
It pulls the data higher up closer
to the node, to the root node.
2307
01:49:44,560 --> 01:49:48,080
And recall that the root node is where
we began our searches in the past,
2308
01:49:48,080 --> 01:49:51,620
whether it was a BST, a Binary
Search Tree, even a tri or other data
2309
01:49:51,620 --> 01:49:52,120
structures.
2310
01:49:52,120 --> 01:49:53,380
We always began at the top.
2311
01:49:53,380 --> 01:49:55,780
So the higher up you
can pull the data, even
2312
01:49:55,780 --> 01:49:57,617
if it makes the data
structure very wide,
2313
01:49:57,617 --> 01:50:00,700
you're going to be able to do boom,
boom, boom, look up queries or look up
2314
01:50:00,700 --> 01:50:03,340
data probably much faster
certainly than if it's just
2315
01:50:03,340 --> 01:50:06,590
a very long list like
a column by default.
2316
01:50:06,590 --> 01:50:09,970
So with that said, let
me go back to VS Code.
2317
01:50:09,970 --> 01:50:11,380
I didn't create the index yet.
2318
01:50:11,380 --> 01:50:13,277
Let me go ahead and hit
Enter and create it.
2319
01:50:13,277 --> 01:50:14,860
All right, it took a minute, a moment.
2320
01:50:14,860 --> 01:50:17,920
It took like half a second,
which obviously is not that slow.
2321
01:50:17,920 --> 01:50:20,380
But with more data, that
could have been even slower.
2322
01:50:20,380 --> 01:50:22,600
But it's a one time operation as of now.
2323
01:50:22,600 --> 01:50:27,910
And now let me hit up and let me
select the same data from shows
2324
01:50:27,910 --> 01:50:29,350
where title equals The Office.
2325
01:50:29,350 --> 01:50:34,360
Last time just a moment
ago it took 0.035 seconds.
2326
01:50:34,360 --> 01:50:38,920
Not slow but also that's going to add
up if I have lots of users of IMDb.
2327
01:50:38,920 --> 01:50:42,040
Let's go ahead now and
execute the same query again.
2328
01:50:42,040 --> 01:50:44,700
How long did that take?
2329
01:50:44,700 --> 01:50:47,010
0.001 seconds now.
2330
01:50:47,010 --> 01:50:48,720
I mean, practically nothing.
2331
01:50:48,720 --> 01:50:50,970
And so that's the sort
of opportunity now.
2332
01:50:50,970 --> 01:50:53,470
When you've got lots of data
and you want to really speed up
2333
01:50:53,470 --> 01:50:56,130
these searches, these indexes,
these indices that just create
2334
01:50:56,130 --> 01:50:59,130
for you these magical data
structures in the databases memory,
2335
01:50:59,130 --> 01:51:02,430
it allows you to search on columns
that you are pretty sure you
2336
01:51:02,430 --> 01:51:04,150
want to search on more effectively.
2337
01:51:04,150 --> 01:51:07,380
Now, by contrast, if you've ever
used Google or Bing or some search
2338
01:51:07,380 --> 01:51:10,590
engine that has advanced
search, some of those text boxes
2339
01:51:10,590 --> 01:51:13,470
that you can search more precisely
in might very well be slower.
2340
01:51:13,470 --> 01:51:14,190
Why?
2341
01:51:14,190 --> 01:51:16,950
Well, probably you don't want
to go crazy and just index
2342
01:51:16,950 --> 01:51:19,380
every column on every table.
2343
01:51:19,380 --> 01:51:21,020
Why?
2344
01:51:21,020 --> 01:51:25,070
What might be the intuition?
2345
01:51:25,070 --> 01:51:28,490
If logically indexes speed things
up, why not index everything?
2346
01:51:28,490 --> 01:51:30,710
There's always going
to be a trade off here.
2347
01:51:30,710 --> 01:51:33,040
What might that be?
2348
01:51:33,040 --> 01:51:34,000
Yeah.
2349
01:51:34,000 --> 01:51:34,875
AUDIENCE: [INAUDIBLE]
2350
01:51:34,875 --> 01:51:37,167
DAVID MALAN: Yeah, it's going
to take a lot of storage.
2351
01:51:37,167 --> 01:51:38,960
This is just a slide on the screen.
2352
01:51:38,960 --> 01:51:41,020
But this has to go somewhere.
2353
01:51:41,020 --> 01:51:43,360
This needs space in
the computer's memory
2354
01:51:43,360 --> 01:51:44,740
or on the hard drive or the like.
2355
01:51:44,740 --> 01:51:48,777
And that's fine if you have unlimited
space, but odds are you don't.
2356
01:51:48,777 --> 01:51:51,110
And that's going to get
expensive for different reasons.
2357
01:51:51,110 --> 01:51:55,810
So maybe you only want to index
certain columns and certain tables
2358
01:51:55,810 --> 01:51:56,860
and not all of them.
2359
01:51:56,860 --> 01:51:57,777
Because you know what?
2360
01:51:57,777 --> 01:51:59,590
What even if a user
really wants to search
2361
01:51:59,590 --> 01:52:05,020
maybe via advanced search on some
other column or table altogether, fine.
2362
01:52:05,020 --> 01:52:07,240
If once in a while a
query is slow, we're
2363
01:52:07,240 --> 01:52:11,110
probably getting the bigger bang for
our buck by optimizing the common cases,
2364
01:52:11,110 --> 01:52:16,030
the more popular queries that
people actually care about too.
2365
01:52:16,030 --> 01:52:16,640
All right.
2366
01:52:16,640 --> 01:52:22,450
So let's come full circle and bring
this now back to how we actually
2367
01:52:22,450 --> 01:52:25,030
began, which was with some Python code.
2368
01:52:25,030 --> 01:52:27,550
So it turns out these are
not either or decisions.
2369
01:52:27,550 --> 01:52:29,740
It turns out in the
real world, developers
2370
01:52:29,740 --> 01:52:33,320
are constantly using one,
two, three languages at once.
2371
01:52:33,320 --> 01:52:36,940
And in fact, next week I rattled off
HTML, CSS, and JavaScript, one of which
2372
01:52:36,940 --> 01:52:39,310
is a proper programming
language, but those languages
2373
01:52:39,310 --> 01:52:40,510
are often used together.
2374
01:52:40,510 --> 01:52:46,300
Totally normal and common to use
Python and SQL or Java and SQL
2375
01:52:46,300 --> 01:52:49,300
or SWIFT and SQL or any number
of different combinations
2376
01:52:49,300 --> 01:52:50,710
with a database language.
2377
01:52:50,710 --> 01:52:53,920
You might use your preferred
programming language, Java, Python,
2378
01:52:53,920 --> 01:53:00,760
C++ to create the user interface and
the logic that implements the program
2379
01:53:00,760 --> 01:53:01,330
itself.
2380
01:53:01,330 --> 01:53:04,092
But for your data, SQL's
a really good candidate.
2381
01:53:04,092 --> 01:53:07,300
And indeed, we've seen already that SQL
can just speed up certain operations.
2382
01:53:07,300 --> 01:53:08,140
You can change.
2383
01:53:08,140 --> 01:53:11,200
You can collapse 15 lines
of code into just one
2384
01:53:11,200 --> 01:53:13,430
and you can use these things together.
2385
01:53:13,430 --> 01:53:14,740
So let me come back to--
2386
01:53:14,740 --> 01:53:16,720
I'm going to quit out of SQLite.
2387
01:53:16,720 --> 01:53:18,580
I'm going to minimize
my terminal window.
2388
01:53:18,580 --> 01:53:22,030
And here's where we left off
before with favorites.py.
2389
01:53:22,030 --> 01:53:27,430
With favorites.py, everything was
being stored in favorites.csv.
2390
01:53:27,430 --> 01:53:32,260
And recall that we eventually imported
that CSV file into favorites.db
2391
01:53:32,260 --> 01:53:35,920
automatically with .import just so we
could start playing around with SQL.
2392
01:53:35,920 --> 01:53:37,990
But we can now tie these two together.
2393
01:53:37,990 --> 01:53:40,690
And a way to do that is as follows.
2394
01:53:40,690 --> 01:53:42,700
CS50 has a library for Python.
2395
01:53:42,700 --> 01:53:47,350
You might recall having available
get string, get int, get float.
2396
01:53:47,350 --> 01:53:49,360
You don't strictly need
to use them in Python
2397
01:53:49,360 --> 01:53:51,970
because it's much easier to
just use the input function
2398
01:53:51,970 --> 01:53:55,570
and then try, accept, and convert
things to int or float or the like.
2399
01:53:55,570 --> 01:54:00,520
But it's a lot more work to use SQL in
Python without a third party library.
2400
01:54:00,520 --> 01:54:03,370
A lot of the commercial options
or popular open source options
2401
01:54:03,370 --> 01:54:05,200
are actually just complicated to use.
2402
01:54:05,200 --> 01:54:09,790
So CS50 does have a very useful function
inside of its library for Python
2403
01:54:09,790 --> 01:54:12,640
that you should use and must
use for the problem set that
2404
01:54:12,640 --> 01:54:18,100
just makes it easy to execute Python,
execute SQL inside of your Python code.
2405
01:54:18,100 --> 01:54:22,360
But it's built on top of a very
popular open source alternative.
2406
01:54:22,360 --> 01:54:24,350
So you can use that
too in the real world.
2407
01:54:24,350 --> 01:54:26,800
So the documentation for
that is at this URL here,
2408
01:54:26,800 --> 01:54:30,940
but I'll show you what we need to know
here by focusing back on favorites.py.
2409
01:54:30,940 --> 01:54:35,540
So what I'm going to do
here is follows is this.
2410
01:54:35,540 --> 01:54:42,550
Let me delete everything from
favorites.py except for let's say this.
2411
01:54:42,550 --> 01:54:45,970
From CS50 import SQL in all caps.
2412
01:54:45,970 --> 01:54:49,060
So that's importing a SQL
feature from CS50's library
2413
01:54:49,060 --> 01:54:53,200
that's going to allow me
to open a DB file in code.
2414
01:54:53,200 --> 01:54:54,380
How do I do that?
2415
01:54:54,380 --> 01:54:56,650
Well, let me create a variable
called DB for database,
2416
01:54:56,650 --> 01:54:58,275
though I could call it anything I want.
2417
01:54:58,275 --> 01:55:02,470
Let me call this SQL function and
pass in using special syntax that's
2418
01:55:02,470 --> 01:55:03,640
not CS50 specific.
2419
01:55:03,640 --> 01:55:09,640
It's an industry thing. sqlite:///.
2420
01:55:09,640 --> 01:55:14,640
Unlike every other URL you type, this
one literally has three in this context
2421
01:55:14,640 --> 01:55:15,140
here.
2422
01:55:15,140 --> 01:55:18,340
And then the name of the database,
which in this case is favorites.db.
2423
01:55:18,340 --> 01:55:22,450
So this is just a way of telling
this SQL library that we wrote
2424
01:55:22,450 --> 01:55:27,640
but that works exactly like third party
alternatives open favorites.db using
2425
01:55:27,640 --> 01:55:30,380
the SQLite technology, if you will.
2426
01:55:30,380 --> 01:55:30,880
All right.
2427
01:55:30,880 --> 01:55:32,338
Let's just ask the user a question.
2428
01:55:32,338 --> 01:55:36,057
Give me your favorite problem.
2429
01:55:36,057 --> 01:55:38,140
So we're going to use input
instead of get string,
2430
01:55:38,140 --> 01:55:41,450
but we could use get string, but they're
pretty much the same for our purposes.
2431
01:55:41,450 --> 01:55:43,150
Let's ask the user for their favorite.
2432
01:55:43,150 --> 01:55:49,960
And now in Python code, let
us select from favorites.db
2433
01:55:49,960 --> 01:55:55,450
all of the rows where students specify
that problem as their favorite.
2434
01:55:55,450 --> 01:55:57,610
So in SQL alone, it would be this.
2435
01:55:57,610 --> 01:56:04,210
Select star from favorites
where problem equals
2436
01:56:04,210 --> 01:56:08,170
and I'll do, well, whatever
my favorite's going to be.
2437
01:56:08,170 --> 01:56:10,288
Like problem equals Mario, for instance.
2438
01:56:10,288 --> 01:56:13,330
So if I were just using SQL, I would
literally write something like that.
2439
01:56:13,330 --> 01:56:15,130
But I'm in a .py file now.
2440
01:56:15,130 --> 01:56:16,870
I have to use Python syntax.
2441
01:56:16,870 --> 01:56:18,550
But Python supports strings.
2442
01:56:18,550 --> 01:56:20,090
SQL is just text.
2443
01:56:20,090 --> 01:56:20,990
It's just a string.
2444
01:56:20,990 --> 01:56:24,220
So I could certainly just
put my SQL code in a string
2445
01:56:24,220 --> 01:56:26,800
perhaps and then pass
it to a Python function.
2446
01:56:26,800 --> 01:56:28,780
And here's the bridge between the two.
2447
01:56:28,780 --> 01:56:31,960
If you just treat SQL as any old
text, we can put it in a string
2448
01:56:31,960 --> 01:56:32,840
and execute it.
2449
01:56:32,840 --> 01:56:34,837
So let me actually do this.
2450
01:56:34,837 --> 01:56:36,670
Let me go ahead and
create a variable called
2451
01:56:36,670 --> 01:56:40,130
rows, which is eventually going to
contain all the rows from the database.
2452
01:56:40,130 --> 01:56:45,790
Let me go ahead and select db.execute.
2453
01:56:45,790 --> 01:56:49,030
This is the one function you need to
know about inside of CS50's library,
2454
01:56:49,030 --> 01:56:51,310
and it literally
executes a SQL statement.
2455
01:56:51,310 --> 01:56:56,140
And then in quotes, you pass it
literally what you want to execute.
2456
01:56:56,140 --> 01:56:58,990
And let me go ahead and close
the parenthesis at the end there.
2457
01:56:58,990 --> 01:57:00,740
And now let me just try this.
2458
01:57:00,740 --> 01:57:05,260
So for row in rows, let's iterate
over all of the rows, let me go ahead
2459
01:57:05,260 --> 01:57:12,370
and print out how about
row, quote unquote.
2460
01:57:12,370 --> 01:57:15,700
And what do I want here?
2461
01:57:15,700 --> 01:57:20,020
Let's print out the timestamp
of that person for kicks.
2462
01:57:20,020 --> 01:57:22,120
All right, let me open
my terminal window.
2463
01:57:22,120 --> 01:57:23,890
Python of favorites.py.
2464
01:57:23,890 --> 01:57:27,490
Crossing my fingers here for sure.
2465
01:57:27,490 --> 01:57:27,990
Enter.
2466
01:57:27,990 --> 01:57:31,200
2467
01:57:31,200 --> 01:57:31,770
There we go.
2468
01:57:31,770 --> 01:57:32,280
Favorites.
2469
01:57:32,280 --> 01:57:34,500
I'll type in Mario.
2470
01:57:34,500 --> 01:57:35,070
OK.
2471
01:57:35,070 --> 01:57:36,240
So I got back--
2472
01:57:36,240 --> 01:57:37,980
it's not very interesting,
but I got back
2473
01:57:37,980 --> 01:57:40,830
all of the timestamps of students
who typed in Mario that we
2474
01:57:40,830 --> 01:57:42,330
imported into this database.
2475
01:57:42,330 --> 01:57:45,360
Well, what I really care
about is how popular Mario is.
2476
01:57:45,360 --> 01:57:46,830
So let me change this a little bit.
2477
01:57:46,830 --> 01:57:50,490
Let me change this to
count the number of rows.
2478
01:57:50,490 --> 01:57:52,080
And let me keep it simple.
2479
01:57:52,080 --> 01:57:56,260
Let me give an alias like I proposed
earlier like as n, where n is a number.
2480
01:57:56,260 --> 01:58:00,090
So that now down here, I
can actually just do this.
2481
01:58:00,090 --> 01:58:02,230
Print out the value of n.
2482
01:58:02,230 --> 01:58:02,730
All right.
2483
01:58:02,730 --> 01:58:04,450
Let me go back to my terminal window.
2484
01:58:04,450 --> 01:58:05,910
Run Python to favorites.py.
2485
01:58:05,910 --> 01:58:07,140
Let me type in Mario.
2486
01:58:07,140 --> 01:58:08,070
Enter.
2487
01:58:08,070 --> 01:58:09,660
OK, 39.
2488
01:58:09,660 --> 01:58:11,790
Now, technically I'm cheating.
2489
01:58:11,790 --> 01:58:14,430
Honestly if I'm executing
select count, we've
2490
01:58:14,430 --> 01:58:17,752
seen before it only ever
returns one row, not multiple.
2491
01:58:17,752 --> 01:58:20,460
So there's really nothing to
iterate over, but it's working fine.
2492
01:58:20,460 --> 01:58:22,560
It's just iterating once,
but I'm getting lucky.
2493
01:58:22,560 --> 01:58:26,100
So technically what I should
probably just do is this.
2494
01:58:26,100 --> 01:58:29,220
I should probably give
myself a variable called row,
2495
01:58:29,220 --> 01:58:34,770
set it equal to the very first
row and only row that came back,
2496
01:58:34,770 --> 01:58:39,510
and now print out that rows and column.
2497
01:58:39,510 --> 01:58:40,890
Let me rerun the program.
2498
01:58:40,890 --> 01:58:42,130
I'll type in Mario again.
2499
01:58:42,130 --> 01:58:42,660
Enter.
2500
01:58:42,660 --> 01:58:44,980
And I still see 39.
2501
01:58:44,980 --> 01:58:48,328
So of course, I don't
strictly need to do this.
2502
01:58:48,328 --> 01:58:49,620
I don't really need a variable.
2503
01:58:49,620 --> 01:58:52,590
I can do rows bracket 0 instead.
2504
01:58:52,590 --> 01:58:54,840
But let me focus on what
this library is now doing.
2505
01:58:54,840 --> 01:59:01,620
So per the documentation, what the CS50
execute function always does for you is
2506
01:59:01,620 --> 01:59:05,620
it returns a list of dictionaries.
2507
01:59:05,620 --> 01:59:10,558
So if your query returns nothing, like
no matches, you get back an empty list.
2508
01:59:10,558 --> 01:59:12,600
Like open bracket, closed
bracket, nothing in it.
2509
01:59:12,600 --> 01:59:14,520
Any loop is not going to
execute anything useful,
2510
01:59:14,520 --> 01:59:15,770
because there's nothing in it.
2511
01:59:15,770 --> 01:59:18,660
If, though, you get back
one row, you're going
2512
01:59:18,660 --> 01:59:22,740
to get back a list of size one inside
of which is a single dictionary.
2513
01:59:22,740 --> 01:59:24,930
That dictionary is
going to have keys that
2514
01:59:24,930 --> 01:59:29,200
correspond to whatever you selected,
be it the columns or the count.
2515
01:59:29,200 --> 01:59:32,880
So when I selected star before, I
would have gotten all of the columns.
2516
01:59:32,880 --> 01:59:34,950
That's how I was able
to access timestamp.
2517
01:59:34,950 --> 01:59:37,230
Here I'm just selecting
count and I don't
2518
01:59:37,230 --> 01:59:38,910
want to have to type this down here.
2519
01:59:38,910 --> 01:59:40,230
That would just look kind of atrocious.
2520
01:59:40,230 --> 01:59:42,355
It would work, but it would
look weird to just keep
2521
01:59:42,355 --> 01:59:44,640
retyping count paren star close paren.
2522
01:59:44,640 --> 01:59:49,530
So I just created an alias called n just
to make my life easier or cleaner down
2523
01:59:49,530 --> 01:59:50,380
here.
2524
01:59:50,380 --> 01:59:55,680
So to be clear, the CS50 execute
function returns a list of dictionaries
2525
01:59:55,680 --> 01:59:57,360
when you're using select.
2526
01:59:57,360 --> 02:00:01,500
And that is how I can now get
back the first and only row
2527
02:00:01,500 --> 02:00:04,860
and then print out that row's end value.
2528
02:00:04,860 --> 02:00:07,380
It is identical to--
2529
02:00:07,380 --> 02:00:08,650
let me do this.
2530
02:00:08,650 --> 02:00:11,610
Let me highlight this
whole line of text.
2531
02:00:11,610 --> 02:00:15,120
Let me in my terminal window
run SQLite3 of favorites.db
2532
02:00:15,120 --> 02:00:16,470
like we did before break.
2533
02:00:16,470 --> 02:00:19,170
Let me just copy paste this query.
2534
02:00:19,170 --> 02:00:20,040
Enter.
2535
02:00:20,040 --> 02:00:24,550
That's the table I got back earlier
when we played with SQL manually.
2536
02:00:24,550 --> 02:00:28,170
And so when I get back this table,
here's the key, here's the value,
2537
02:00:28,170 --> 02:00:31,860
and I only have one row, which
is why I'm just blindly indexing
2538
02:00:31,860 --> 02:00:34,080
into rows bracket 0, because
I know there's always
2539
02:00:34,080 --> 02:00:35,247
going to be an answer there.
2540
02:00:35,247 --> 02:00:36,690
It's going to be 0 or 1 or more.
2541
02:00:36,690 --> 02:00:40,360
But I know now it's going to be
called n because of this here.
2542
02:00:40,360 --> 02:00:41,580
So what have I just done?
2543
02:00:41,580 --> 02:00:43,113
Well, this is SQL down here.
2544
02:00:43,113 --> 02:00:45,780
And this is just me being like a
data scientist asking questions
2545
02:00:45,780 --> 02:00:48,640
about my data just using
black and white SQL queries.
2546
02:00:48,640 --> 02:00:52,410
This is me now being a Python programmer
who wants to talk to a SQL database
2547
02:00:52,410 --> 02:00:53,700
using Python.
2548
02:00:53,700 --> 02:00:56,820
And the bridge we're using
happens to be the CS50 library.
2549
02:00:56,820 --> 02:00:59,940
But again, there's third party free
libraries you can also use as well.
2550
02:00:59,940 --> 02:01:01,710
Ours is just very simple.
2551
02:01:01,710 --> 02:01:03,660
And indeed, the
documentation will explain
2552
02:01:03,660 --> 02:01:07,113
how execute behaves a little differently
for inserts, updates, and deletes.
2553
02:01:07,113 --> 02:01:09,780
You don't get back a list because
you're not selecting anything,
2554
02:01:09,780 --> 02:01:13,050
but you do get back some return values.
2555
02:01:13,050 --> 02:01:15,990
Questions on this?
2556
02:01:15,990 --> 02:01:18,030
That's the last of our Python code.
2557
02:01:18,030 --> 02:01:23,320
That ties everything together in spirit.
2558
02:01:23,320 --> 02:01:24,768
Yeah?
2559
02:01:24,768 --> 02:01:29,460
AUDIENCE: [INAUDIBLE]
2560
02:01:29,460 --> 02:01:31,810
DAVID MALAN: This one here?
2561
02:01:31,810 --> 02:01:32,320
Yes.
2562
02:01:32,320 --> 02:01:37,558
So db.execute by definition
returns a list of rows.
2563
02:01:37,558 --> 02:01:40,600
And each of those rows happens to be
a dictionary because its convenient.
2564
02:01:40,600 --> 02:01:42,100
Key value pairs.
2565
02:01:42,100 --> 02:01:45,040
If I'm selecting the
count of rows, I just
2566
02:01:45,040 --> 02:01:48,250
know from having learned SQL
an hour ago that this is always
2567
02:01:48,250 --> 02:01:52,540
going to give me a single row whose
column in this case is called n.
2568
02:01:52,540 --> 02:01:56,470
So if I know it's a single row, I
can just blindly, just like in C,
2569
02:01:56,470 --> 02:02:00,160
go into that list or an array in
C and go to the first location
2570
02:02:00,160 --> 02:02:02,590
and then treat that as the single row.
2571
02:02:02,590 --> 02:02:04,330
What you don't want to do is this.
2572
02:02:04,330 --> 02:02:08,500
Even if you the human know
the query returns one row,
2573
02:02:08,500 --> 02:02:10,780
you can't just magically
change the variable name
2574
02:02:10,780 --> 02:02:13,270
to be singular and expect
to have only one value.
2575
02:02:13,270 --> 02:02:14,960
You will always have a list.
2576
02:02:14,960 --> 02:02:18,190
So even if there is only one value
in it, it's up to you to do something
2577
02:02:18,190 --> 02:02:19,780
like this to get at it.
2578
02:02:19,780 --> 02:02:24,070
Or if you prefer more succinctness,
you can do rows bracket I bracket n.
2579
02:02:24,070 --> 02:02:27,220
That'll achieve the same
thing without a variable.
2580
02:02:27,220 --> 02:02:29,095
Yeah?
2581
02:02:29,095 --> 02:02:33,970
AUDIENCE: [INAUDIBLE]
2582
02:02:33,970 --> 02:02:34,720
DAVID MALAN: Good.
2583
02:02:34,720 --> 02:02:37,150
So I have been misleading
this whole time
2584
02:02:37,150 --> 02:02:40,690
and cheating because this is
only ever going to return Mario.
2585
02:02:40,690 --> 02:02:44,530
I'm ignoring the favorite that the
human typed in here on line five.
2586
02:02:44,530 --> 02:02:45,770
So let me fix that.
2587
02:02:45,770 --> 02:02:49,000
And that's going to lead us to some
of the problems that arise ultimately
2588
02:02:49,000 --> 02:02:49,780
with SQL.
2589
02:02:49,780 --> 02:02:53,320
The right way to solve that problem--
let me get rid of my terminal window
2590
02:02:53,320 --> 02:02:53,890
here.
2591
02:02:53,890 --> 02:02:56,170
The right way to solve
this problem is not
2592
02:02:56,170 --> 02:02:59,170
to use an fstring like we
did in Python generally,
2593
02:02:59,170 --> 02:03:02,680
because SQL queries, as we'll see
in a moment, can be dangerous.
2594
02:03:02,680 --> 02:03:06,940
When you want to plug in
users' data into a query
2595
02:03:06,940 --> 02:03:10,540
that you've written most of
in advance, you should, you
2596
02:03:10,540 --> 02:03:15,640
must, you had better use a placeholder,
namely a question mark in this case.
2597
02:03:15,640 --> 02:03:18,340
This is somewhat specific
to CS50's library,
2598
02:03:18,340 --> 02:03:21,910
but we just borrowed the convention
that every other library uses too.
2599
02:03:21,910 --> 02:03:25,660
In the world of SQL, single question
marks are used as placeholders.
2600
02:03:25,660 --> 02:03:28,190
And the way you do this is as follows.
2601
02:03:28,190 --> 02:03:32,020
If you want to plug-in a
value for that question mark,
2602
02:03:32,020 --> 02:03:37,180
just like in printf in C, you specify as
a second or a third or fourth argument
2603
02:03:37,180 --> 02:03:39,320
all of the values you
want plugged into this.
2604
02:03:39,320 --> 02:03:42,280
So in C weeks ago, we were using %s.
2605
02:03:42,280 --> 02:03:43,810
Same exact idea.
2606
02:03:43,810 --> 02:03:46,810
In SQL it's a question
mark that you use instead.
2607
02:03:46,810 --> 02:03:49,480
This now, if I open
back my terminal window
2608
02:03:49,480 --> 02:03:55,120
and I run Python of favorites.py,
type in Mario, I should still get 39.
2609
02:03:55,120 --> 02:03:58,750
But now I can also type in
Scratch perhaps and get 44
2610
02:03:58,750 --> 02:04:00,520
for that very first piece at 0.
2611
02:04:00,520 --> 02:04:02,900
And that one is even more popular here.
2612
02:04:02,900 --> 02:04:04,150
So this now is correct.
2613
02:04:04,150 --> 02:04:09,560
It would work to use an fstring here
and then plug in a value like favorite
2614
02:04:09,560 --> 02:04:10,060
here.
2615
02:04:10,060 --> 02:04:11,977
But you'll see in just
a moment don't do that.
2616
02:04:11,977 --> 02:04:15,640
You will expose yourself to
potential hack or attacks
2617
02:04:15,640 --> 02:04:17,910
by trusting the user's input.
2618
02:04:17,910 --> 02:04:19,660
And so in fact, let's
transition from that
2619
02:04:19,660 --> 02:04:21,610
to exactly some of these
kinds of challenges,
2620
02:04:21,610 --> 02:04:24,020
namely two before we wrap up.
2621
02:04:24,020 --> 02:04:27,820
So in the world of SQL, especially when
it's used at scale with the Twitters
2622
02:04:27,820 --> 02:04:30,610
and the Googles of the
world, a lot of data
2623
02:04:30,610 --> 02:04:32,610
is probably coming into
the database all at once
2624
02:04:32,610 --> 02:04:34,610
because multiple people
are opening their phones
2625
02:04:34,610 --> 02:04:36,070
at the same time around the world.
2626
02:04:36,070 --> 02:04:37,690
They're clicking on
the same links roughly
2627
02:04:37,690 --> 02:04:39,107
at the same time around the world.
2628
02:04:39,107 --> 02:04:42,235
When you have thousands of people
all using your site at once,
2629
02:04:42,235 --> 02:04:44,110
order of operations is
going to be important.
2630
02:04:44,110 --> 02:04:47,620
But unfortunately in SQL and
in other contexts of computing,
2631
02:04:47,620 --> 02:04:50,150
there's this risk of what's
known as a race condition.
2632
02:04:50,150 --> 02:04:53,710
So for instance, has anyone
ever seen or liked this?
2633
02:04:53,710 --> 02:04:55,780
This is the world record egg.
2634
02:04:55,780 --> 02:04:58,467
Or it's this thing that was
very popular a while back.
2635
02:04:58,467 --> 02:04:59,800
It's still kind of going strong.
2636
02:04:59,800 --> 02:05:03,340
But if you go to the Instagram
profile for World Record Egg,
2637
02:05:03,340 --> 02:05:06,610
the goal was to make the most
liked Instagram post ever.
2638
02:05:06,610 --> 02:05:08,000
And they did pretty well.
2639
02:05:08,000 --> 02:05:09,170
It's just this.
2640
02:05:09,170 --> 02:05:10,790
It's just a picture of an egg.
2641
02:05:10,790 --> 02:05:12,570
Now, at the height of
the popularity, like
2642
02:05:12,570 --> 02:05:15,070
there might have been hundreds,
thousands, tens of thousands
2643
02:05:15,070 --> 02:05:18,520
of people clicking pretty much
at the same time on this egg.
2644
02:05:18,520 --> 02:05:20,830
So it actually creates
a potential problem
2645
02:05:20,830 --> 02:05:23,470
with the integrity of Instagram's data.
2646
02:05:23,470 --> 02:05:24,010
Why?
2647
02:05:24,010 --> 02:05:26,302
Well, if you have all these
requests coming in at once,
2648
02:05:26,302 --> 02:05:28,250
how do you possibly keep
track of all of them
2649
02:05:28,250 --> 02:05:31,850
and update your counter in a way that
can keep up with all of that traffic?
2650
02:05:31,850 --> 02:05:32,350
Why?
2651
02:05:32,350 --> 02:05:35,680
Well, let's just hypothesize
what Meta, formerly Facebook,
2652
02:05:35,680 --> 02:05:38,860
was doing underneath the hood with
Instagram if this were their code.
2653
02:05:38,860 --> 02:05:41,890
So suppose for the sake of
discussion that Instagram servers
2654
02:05:41,890 --> 02:05:44,617
are using a mix of Python and SQL.
2655
02:05:44,617 --> 02:05:47,200
Probably not using the CS50
library, but they could absolutely
2656
02:05:47,200 --> 02:05:50,680
be using those two languages
or two others together.
2657
02:05:50,680 --> 02:05:55,270
Suppose they do this in order to update
the number of likes for that post.
2658
02:05:55,270 --> 02:06:00,160
They first execute a SQL query like
select the current number of likes
2659
02:06:00,160 --> 02:06:03,580
from a table called posts
where the idea of the post
2660
02:06:03,580 --> 02:06:08,530
equals whatever the unique identifier
is for that specific egg in the table.
2661
02:06:08,530 --> 02:06:12,460
And then they store the result in
this rows variable, just like I did.
2662
02:06:12,460 --> 02:06:14,830
And then they do this.
2663
02:06:14,830 --> 02:06:16,420
They create a variable called likes.
2664
02:06:16,420 --> 02:06:18,700
They set it equal to rows bracket 0.
2665
02:06:18,700 --> 02:06:21,280
So the very first row in the result set.
2666
02:06:21,280 --> 02:06:22,450
And they get the likes key.
2667
02:06:22,450 --> 02:06:25,030
So this is literally what
I just did with the count.
2668
02:06:25,030 --> 02:06:27,880
Let me hypothesize that
Instagram does something similar
2669
02:06:27,880 --> 02:06:29,330
with the total number of likes.
2670
02:06:29,330 --> 02:06:30,370
Why are they doing this?
2671
02:06:30,370 --> 02:06:34,600
Because they then want to execute
a third line of code that executes
2672
02:06:34,600 --> 02:06:36,430
update the posts table.
2673
02:06:36,430 --> 02:06:41,920
Set the new number of likes equal to
something where the idea of the post
2674
02:06:41,920 --> 02:06:43,330
equals this other thing.
2675
02:06:43,330 --> 02:06:46,870
Now, notice just like in printf there's
the comma separated list of values.
2676
02:06:46,870 --> 02:06:50,560
They want to update the current
number of likes from the current value
2677
02:06:50,560 --> 02:06:51,970
to the current value plus 1.
2678
02:06:51,970 --> 02:06:53,350
So it's likes plus 1.
2679
02:06:53,350 --> 02:06:55,640
And then we plug in the ID for this.
2680
02:06:55,640 --> 02:06:57,760
So suppose this is what
Instagram is doing.
2681
02:06:57,760 --> 02:07:02,860
Unfortunately, whenever you execute
multiple lines of code independently
2682
02:07:02,860 --> 02:07:06,260
and you're so popular like
Instagram that you have thousands,
2683
02:07:06,260 --> 02:07:08,530
hundreds of thousands
of servers potentially,
2684
02:07:08,530 --> 02:07:11,470
it is quite possible that if you
and I and everyone else in the room
2685
02:07:11,470 --> 02:07:13,630
clicks that egg at the
same time, it's not
2686
02:07:13,630 --> 02:07:17,093
going to be the case statistically that
three lines of code are executed for me
2687
02:07:17,093 --> 02:07:19,510
and then three lines for you
and then three lines for you.
2688
02:07:19,510 --> 02:07:21,490
They're probably going
to get interspersed.
2689
02:07:21,490 --> 02:07:24,430
This gets executed for me and
then this gets executed for you
2690
02:07:24,430 --> 02:07:27,520
and then they get back to doing work
for me and so forth just to kind
2691
02:07:27,520 --> 02:07:30,820
of multitask, just like a human
might, but at a super speed here.
2692
02:07:30,820 --> 02:07:33,590
The problem, though, is
if these lines of code
2693
02:07:33,590 --> 02:07:35,838
get interrupted, what could go wrong?
2694
02:07:35,838 --> 02:07:38,630
Well, suppose that Carter and I
both click the egg at the same time
2695
02:07:38,630 --> 02:07:41,780
and suppose the current number
of likes back in the day is 100.
2696
02:07:41,780 --> 02:07:45,150
That stores in this
variable the value 100.
2697
02:07:45,150 --> 02:07:49,010
But if we click so close in time,
we might get back the same answer
2698
02:07:49,010 --> 02:07:50,300
to this select query.
2699
02:07:50,300 --> 02:07:54,380
As of that moment in time when David
and Carter clicked, it had 100 likes.
2700
02:07:54,380 --> 02:07:58,220
But then this last line of code is
executed for me and then maybe Carter.
2701
02:07:58,220 --> 02:08:03,380
Because that answer, the state of the
database, was stored in this variable,
2702
02:08:03,380 --> 02:08:08,870
then both Carter and I will
result in this line of code
2703
02:08:08,870 --> 02:08:10,610
being executed with the same value.
2704
02:08:10,610 --> 02:08:17,480
Update the post table setting the
likes equal to 101 for that post's ID.
2705
02:08:17,480 --> 02:08:18,080
Why?
2706
02:08:18,080 --> 02:08:21,950
Because again, if each of these lines
of code running on different servers
2707
02:08:21,950 --> 02:08:25,213
are checking the value of
the current number of likes
2708
02:08:25,213 --> 02:08:28,130
but then getting interrupted because
Carter clicked the darn thing too
2709
02:08:28,130 --> 02:08:30,350
and then resuming their
work on my behalf,
2710
02:08:30,350 --> 02:08:33,890
we might have a race condition where
the code is sort of racing to finish
2711
02:08:33,890 --> 02:08:36,320
but getting interrupted
by other users' clicks.
2712
02:08:36,320 --> 02:08:38,720
And the problem with
that is that if you are
2713
02:08:38,720 --> 02:08:42,710
inspecting the value of some variable,
or in this case a database cell,
2714
02:08:42,710 --> 02:08:45,650
and making a decision based
on it, like how to update it,
2715
02:08:45,650 --> 02:08:46,995
you might now lose data.
2716
02:08:46,995 --> 02:08:50,120
And Instagram is probably not good for
advertising if they're losing likes.
2717
02:08:50,120 --> 02:08:54,620
And so that's probably a problem
not to retain the value 102
2718
02:08:54,620 --> 02:08:57,230
and instead insert the number 101 twice.
2719
02:08:57,230 --> 02:09:00,500
It's actually similar in
spirit to a story that
2720
02:09:00,500 --> 02:09:05,960
was told in a databases course I
took myself years ago whereby--
2721
02:09:05,960 --> 02:09:08,900
it's somewhat analogous to
kind of a contrived scenario
2722
02:09:08,900 --> 02:09:10,070
involving a refrigerator.
2723
02:09:10,070 --> 02:09:12,380
And this is the closest thing to a
refrigerator we could get on stage.
2724
02:09:12,380 --> 02:09:14,420
But imagine you've got one
of these little dorm fridges
2725
02:09:14,420 --> 02:09:16,040
in your dorm too and your roommate.
2726
02:09:16,040 --> 02:09:19,102
And maybe both of you, as the story
was told to me, really like milk.
2727
02:09:19,102 --> 02:09:21,560
And one of you is at class,
but the other of you comes home
2728
02:09:21,560 --> 02:09:25,500
and you open your dorm fridge and you're
like, oh darn it, we're out of milk.
2729
02:09:25,500 --> 02:09:27,020
And so you close the fridge.
2730
02:09:27,020 --> 02:09:29,720
You walk across the street
to CVS or some other store
2731
02:09:29,720 --> 02:09:31,490
and you get in line to buy some milk.
2732
02:09:31,490 --> 02:09:33,020
Meanwhile, your roommate
gets out of class.
2733
02:09:33,020 --> 02:09:34,160
They come back to your dorm room.
2734
02:09:34,160 --> 02:09:35,702
They're really thirsty for some milk.
2735
02:09:35,702 --> 02:09:37,010
They open up the fridge.
2736
02:09:37,010 --> 02:09:38,510
They say, oh, we're out of milk.
2737
02:09:38,510 --> 02:09:41,630
And then they take a
different route perhaps to CVS
2738
02:09:41,630 --> 02:09:44,300
or some other store nearby,
get in line to buy some milk.
2739
02:09:44,300 --> 02:09:48,260
Fast forward some amount of time in this
very contrived story and what happens?
2740
02:09:48,260 --> 02:09:50,570
Oh damn it, we now ended
up with two gallons of milk
2741
02:09:50,570 --> 02:09:54,480
and there's no way we can fit gallons
of milk in there, let alone two of them.
2742
02:09:54,480 --> 02:09:56,000
So that's a problem.
2743
02:09:56,000 --> 02:10:00,000
But what's the
relationship to this here?
2744
02:10:00,000 --> 02:10:02,693
Well both of us, yeah, did what?
2745
02:10:02,693 --> 02:10:07,523
AUDIENCE: [INAUDIBLE]
2746
02:10:07,523 --> 02:10:16,685
2747
02:10:16,685 --> 02:10:17,560
DAVID MALAN: Exactly.
2748
02:10:17,560 --> 02:10:19,925
AUDIENCE: [INAUDIBLE]
2749
02:10:19,925 --> 02:10:22,965
2750
02:10:22,965 --> 02:10:23,840
DAVID MALAN: Exactly.
2751
02:10:23,840 --> 02:10:26,480
So to summarize, both of us had
a very similar thought process,
2752
02:10:26,480 --> 02:10:28,820
made a similar decision based
on the same information,
2753
02:10:28,820 --> 02:10:31,760
not realizing that the
information, the fridge,
2754
02:10:31,760 --> 02:10:34,600
was in the process of being updated.
2755
02:10:34,600 --> 02:10:37,100
And of course, in the Instagram
world, it happens like this.
2756
02:10:37,100 --> 02:10:39,300
In the fridge world, it
might take a few minutes.
2757
02:10:39,300 --> 02:10:41,840
But the problem is
ultimately the result of our
2758
02:10:41,840 --> 02:10:46,070
having made a decision about the state
of the world and the state of the world
2759
02:10:46,070 --> 02:10:48,600
was in the middle of being updated.
2760
02:10:48,600 --> 02:10:50,333
The queries got comingled with others.
2761
02:10:50,333 --> 02:10:53,000
Or, in this case, someone was
already on their way to the store.
2762
02:10:53,000 --> 02:10:54,960
So what's the solution
in the real world?
2763
02:10:54,960 --> 02:10:59,000
Well, you could very simply take a
post it note and put like gone for milk
2764
02:10:59,000 --> 02:11:02,060
so as to communicate to your
roommate that they should not
2765
02:11:02,060 --> 02:11:04,620
inspect the value of that variable
and make a decision on it.
2766
02:11:04,620 --> 02:11:05,120
Why?
2767
02:11:05,120 --> 02:11:08,958
Because it's not yet consistent with
the outcome that's about to happen.
2768
02:11:08,958 --> 02:11:11,000
You could be more dramatic
and you could actually
2769
02:11:11,000 --> 02:11:13,580
lock the fridge somehow,
put a padlock around it
2770
02:11:13,580 --> 02:11:15,470
or the like so they
can't even get in there.
2771
02:11:15,470 --> 02:11:18,510
And that would achieve
the same effect too.
2772
02:11:18,510 --> 02:11:22,400
And that is actually pretty much the
solution to this problem in code too.
2773
02:11:22,400 --> 02:11:23,370
It's not safe.
2774
02:11:23,370 --> 02:11:27,110
It's not sufficient to only execute
three lines of code like this.
2775
02:11:27,110 --> 02:11:32,000
Rather, what you probably want to
do is use additional SQL keywords
2776
02:11:32,000 --> 02:11:35,270
that we won't spend much time on
in the class itself, but these.
2777
02:11:35,270 --> 02:11:37,160
There are solutions to this problem.
2778
02:11:37,160 --> 02:11:39,320
You can begin what's
called a transaction
2779
02:11:39,320 --> 02:11:43,460
and you can more explicitly
commit to making a decision,
2780
02:11:43,460 --> 02:11:46,100
like updating the
database to 101 or 102.
2781
02:11:46,100 --> 02:11:50,120
Or if you realize, wait a minute,
Carter's query is interrupting mine.
2782
02:11:50,120 --> 02:11:53,540
Let me roll back to the
previous state and just rewind.
2783
02:11:53,540 --> 02:11:54,290
Let me undo.
2784
02:11:54,290 --> 02:11:55,840
Control Z, if you will.
2785
02:11:55,840 --> 02:11:58,340
There's also another keyword
that's not so much used anymore
2786
02:11:58,340 --> 02:11:59,600
in SQL which is locking.
2787
02:11:59,600 --> 02:12:01,170
You could literally back in the day.
2788
02:12:01,170 --> 02:12:05,690
Lock the entire database table,
preventing anyone from updating it
2789
02:12:05,690 --> 02:12:09,320
or making changes or even reading it
while someone else was accessing it.
2790
02:12:09,320 --> 02:12:12,320
That was a very heavy handed solution
because it slowed everything down.
2791
02:12:12,320 --> 02:12:15,710
But in short, transactions
are now a feature of SQL
2792
02:12:15,710 --> 02:12:17,840
that you won't necessarily
need to use yourselves
2793
02:12:17,840 --> 02:12:22,490
that do solve this problem by doing
the equivalent of saying while David's
2794
02:12:22,490 --> 02:12:26,360
like counter is in the process of
being updated, keep Carter at bay,
2795
02:12:26,360 --> 02:12:29,510
ideally briefly, and then
let his data go through too.
2796
02:12:29,510 --> 02:12:32,240
It's equivalent too to putting
a note or a lock on the fridge.
2797
02:12:32,240 --> 02:12:33,920
And indeed, I mean lock literally.
2798
02:12:33,920 --> 02:12:36,380
They were once upon a
time called and still
2799
02:12:36,380 --> 02:12:41,480
are in some contexts called
locks on databases too.
2800
02:12:41,480 --> 02:12:44,730
And the code for which you might
do this is almost the same.
2801
02:12:44,730 --> 02:12:51,860
You simply wrap the three queries with
a transaction statement and a commit.
2802
02:12:51,860 --> 02:12:57,800
And the term of art here is that
this makes your statements atomic.
2803
02:12:57,800 --> 02:13:01,230
So atomic means they're either
all executed or not at all.
2804
02:13:01,230 --> 02:13:04,340
That is they're all very tightly
coupled together without interruption.
2805
02:13:04,340 --> 02:13:08,480
Transactions solves that problem and
avoid having two gallons of milk.
2806
02:13:08,480 --> 02:13:11,360
And the last problem that
arises that is tragically
2807
02:13:11,360 --> 02:13:14,510
so darn common in the
real world today is
2808
02:13:14,510 --> 02:13:16,173
what's called a SQL injection attack.
2809
02:13:16,173 --> 02:13:18,590
And it's what I alluded to
earlier with the question mark.
2810
02:13:18,590 --> 02:13:20,298
So suppose you're in
the habit of logging
2811
02:13:20,298 --> 02:13:23,240
into Yale websites with
your net ID or password
2812
02:13:23,240 --> 02:13:25,950
or at Harvard, your Harvard
key and password as well.
2813
02:13:25,950 --> 02:13:28,370
Suppose for the sake of
discussion that the people that
2814
02:13:28,370 --> 02:13:32,360
implemented Harvard key log in allow
you to type in your email address,
2815
02:13:32,360 --> 02:13:33,680
of course, and your password.
2816
02:13:33,680 --> 02:13:37,160
But suppose that they are
using SQL underneath the hood
2817
02:13:37,160 --> 02:13:39,770
to check your username
and password to make sure
2818
02:13:39,770 --> 02:13:43,400
that you are David Malan or Carter
Zenke or whoever you claim to be.
2819
02:13:43,400 --> 02:13:48,770
I haven't shown you the syntax yet,
but it turns out that in SQL, --
2820
02:13:48,770 --> 02:13:51,410
is a special way of
indicating a comment.
2821
02:13:51,410 --> 02:13:53,100
It means ignore everything to the right.
2822
02:13:53,100 --> 02:13:57,080
So it's just like // in C or
the hash symbol in Python.
2823
02:13:57,080 --> 02:13:59,430
-- just means ignore
everything to the right.
2824
02:13:59,430 --> 02:14:01,440
And we've, of course,
seen single quotes.
2825
02:14:01,440 --> 02:14:04,730
So one way to wage a
SQL injection attack
2826
02:14:04,730 --> 02:14:08,870
is to try to inject malicious SQL
code into someone else's database
2827
02:14:08,870 --> 02:14:10,220
without them realizing it.
2828
02:14:10,220 --> 02:14:11,220
How do you do this?
2829
02:14:11,220 --> 02:14:15,492
Well, suppose I log in as
malan@harvard.edu single quote dash,
2830
02:14:15,492 --> 02:14:16,316
dash.
2831
02:14:16,316 --> 02:14:20,510
I'm not double quoting anything clearly
and there's nothing to the right
2832
02:14:20,510 --> 02:14:21,770
of the --
2833
02:14:21,770 --> 02:14:23,150
anyway.
2834
02:14:23,150 --> 02:14:25,500
But this imbalance is
going to be useful.
2835
02:14:25,500 --> 02:14:26,000
Why?
2836
02:14:26,000 --> 02:14:29,630
Because if I'm a hacker and I'm
presuming someone at Harvard probably
2837
02:14:29,630 --> 02:14:33,680
is using single quotes to
wrap the user's email address
2838
02:14:33,680 --> 02:14:38,360
and wrap the user's password, what if I
try to complete their thought for them
2839
02:14:38,360 --> 02:14:40,370
and close one of those quotes for them?
2840
02:14:40,370 --> 02:14:41,850
What might happen?
2841
02:14:41,850 --> 02:14:43,220
Well, we could do this.
2842
02:14:43,220 --> 02:14:46,250
Here for instance, let me hypothesize
is the code that Harvard wrote,
2843
02:14:46,250 --> 02:14:48,050
hopefully not, underneath the hood.
2844
02:14:48,050 --> 02:14:51,500
So they're using CS50's library in
Python and they're using SQL inside.
2845
02:14:51,500 --> 02:14:54,350
Suppose that they have
a query like this.
2846
02:14:54,350 --> 02:15:00,380
Select star from users where username
equals question mark and password
2847
02:15:00,380 --> 02:15:01,315
equals question mark.
2848
02:15:01,315 --> 02:15:04,190
And then suppose they just plug in
whatever username and password was
2849
02:15:04,190 --> 02:15:05,070
typed in.
2850
02:15:05,070 --> 02:15:08,090
And then if they get back some
number of rows dot dot dot,
2851
02:15:08,090 --> 02:15:09,410
they assume I am David.
2852
02:15:09,410 --> 02:15:12,200
They assume Carter is Carter if
both the username and password are
2853
02:15:12,200 --> 02:15:12,890
in the database.
2854
02:15:12,890 --> 02:15:14,480
Just end of story there.
2855
02:15:14,480 --> 02:15:15,680
This is good.
2856
02:15:15,680 --> 02:15:18,750
This has the question mark
placeholder, as we discussed earlier.
2857
02:15:18,750 --> 02:15:21,140
But what if you don't
quite remember that?
2858
02:15:21,140 --> 02:15:24,670
You don't quite take that to heart and
you use your more familiar last week
2859
02:15:24,670 --> 02:15:28,690
fstrings whereby we use these
curly braces to plug in values.
2860
02:15:28,690 --> 02:15:30,280
What if you do this instead?
2861
02:15:30,280 --> 02:15:31,900
So it's almost the same idea.
2862
02:15:31,900 --> 02:15:34,950
It's still db execute but now
it's select star from users
2863
02:15:34,950 --> 02:15:36,120
where username equals.
2864
02:15:36,120 --> 02:15:39,570
And now notice I'm doing the single
quotes, which is required by SQL,
2865
02:15:39,570 --> 02:15:41,850
but I'm using fstrings
with the curly braces.
2866
02:15:41,850 --> 02:15:46,530
And the password equals single quote
password and then close single quote.
2867
02:15:46,530 --> 02:15:50,910
The problem is if you're just
blindly pasting effectively
2868
02:15:50,910 --> 02:15:55,530
the user's input into that web form
into the username field and the password
2869
02:15:55,530 --> 02:15:59,890
field, there's nothing stopping a
malicious user, student, faculty,
2870
02:15:59,890 --> 02:16:03,090
staff from including a
single quote in their name.
2871
02:16:03,090 --> 02:16:07,470
Or maybe even benevolently
if their name happens
2872
02:16:07,470 --> 02:16:10,440
to have a single quote, as some
last names in particular do.
2873
02:16:10,440 --> 02:16:12,230
So this is very fragile.
2874
02:16:12,230 --> 02:16:12,730
Why?
2875
02:16:12,730 --> 02:16:15,090
Well, suppose that if
we plug in my malicious,
2876
02:16:15,090 --> 02:16:18,030
malan@harvard.edu single quote --
2877
02:16:18,030 --> 02:16:20,430
notice what happens to username here.
2878
02:16:20,430 --> 02:16:23,940
The username variable
inside of the curly quotes
2879
02:16:23,940 --> 02:16:26,340
will get replaced with this.
2880
02:16:26,340 --> 02:16:29,880
And notice single quote, which
the Harvard programmer wrote,
2881
02:16:29,880 --> 02:16:35,969
malan@harvard.edu single
quote which I wrote --
2882
02:16:35,969 --> 02:16:40,170
which I wrote single quote which
Harvard wrote and whatever else
2883
02:16:40,170 --> 02:16:41,160
they want after that.
2884
02:16:41,160 --> 02:16:43,524
What's the implication,
though, of the dash, dash?
2885
02:16:43,524 --> 02:16:46,722
2886
02:16:46,722 --> 02:16:48,680
Everything to the right
is going to be ignored.
2887
02:16:48,680 --> 02:16:51,730
So the password is never even
checked in this scenario.
2888
02:16:51,730 --> 02:16:55,855
I'm tricking the server into
ignoring everything after the --
2889
02:16:55,855 --> 02:17:00,650
but I have constructed very cleverly,
very maliciously a syntactically valid
2890
02:17:00,650 --> 02:17:01,150
query.
2891
02:17:01,150 --> 02:17:01,650
Why?
2892
02:17:01,650 --> 02:17:03,580
Because I provided the
single quote that's
2893
02:17:03,580 --> 02:17:06,670
going to finish the thought
of that first single quote.
2894
02:17:06,670 --> 02:17:09,219
And now I would only know how
to do this if I saw the code
2895
02:17:09,219 --> 02:17:12,209
or if I just randomly try putting
apostrophes into web forms
2896
02:17:12,209 --> 02:17:13,209
and see if things break.
2897
02:17:13,209 --> 02:17:15,215
That's often how
adversaries attack systems.
2898
02:17:15,215 --> 02:17:17,590
They type in potentially
dangerous characters, hit Enter.
2899
02:17:17,590 --> 02:17:20,510
If something breaks, they're
not necessarily into the system,
2900
02:17:20,510 --> 02:17:22,757
but they know that there
might be a vulnerability.
2901
02:17:22,757 --> 02:17:25,340
And then they start trying more
methodically things like this.
2902
02:17:25,340 --> 02:17:27,850
So this then is going to be
bad, because it effectively
2903
02:17:27,850 --> 02:17:29,500
grays out the rest of the query.
2904
02:17:29,500 --> 02:17:32,889
And this query is surely
going to return some rows
2905
02:17:32,889 --> 02:17:34,570
without even knowing my password.
2906
02:17:34,570 --> 02:17:37,450
And so this logic here
dot dot dot means, well,
2907
02:17:37,450 --> 02:17:39,850
if a data came back
from this query, Harvard
2908
02:17:39,850 --> 02:17:43,480
is presumably going to
assume that Malan logged in.
2909
02:17:43,480 --> 02:17:47,840
Show him his account or whatever
is being protected here.
2910
02:17:47,840 --> 02:17:50,559
So in short, using fstrings bad.
2911
02:17:50,559 --> 02:17:54,280
Using any equivalent like %s in C, bad.
2912
02:17:54,280 --> 02:17:56,559
When it comes to SQL,
using question marks
2913
02:17:56,559 --> 02:18:00,280
or whatever a third party
library like CS50 prescribes
2914
02:18:00,280 --> 02:18:02,110
is the way to solve this.
2915
02:18:02,110 --> 02:18:02,620
Why?
2916
02:18:02,620 --> 02:18:07,540
Because libraries like ours are designed
to at least be smart and be paranoid.
2917
02:18:07,540 --> 02:18:09,740
And what we will do is this.
2918
02:18:09,740 --> 02:18:12,580
When you use the question marks
and the values are plugged in,
2919
02:18:12,580 --> 02:18:16,660
we will escape any potentially
dangerous characters
2920
02:18:16,660 --> 02:18:18,520
inside of those placeholders.
2921
02:18:18,520 --> 02:18:21,580
And so effectively, the
single quote will no longer
2922
02:18:21,580 --> 02:18:23,469
be considered a
grammatical single quote.
2923
02:18:23,469 --> 02:18:26,870
It will just be literally a character
in the username or password.
2924
02:18:26,870 --> 02:18:30,040
So the library takes
care of this for you
2925
02:18:30,040 --> 02:18:33,500
because you're plugging in the username
and password as separate arguments.
2926
02:18:33,500 --> 02:18:37,360
And then we or the third party
you're using actually sanitize.
2927
02:18:37,360 --> 02:18:41,510
That is clean up the data and
prevent those bad characters.
2928
02:18:41,510 --> 02:18:44,650
Now, this is kind of an internet
meme that went around for a while.
2929
02:18:44,650 --> 02:18:47,320
If you've ever driven
a car or been in a car
2930
02:18:47,320 --> 02:18:49,930
where there's the automatic
readers for tolls.
2931
02:18:49,930 --> 02:18:52,969
This person thought it might be funny
to try doing something like this.
2932
02:18:52,969 --> 02:18:54,309
What are they presumably doing?
2933
02:18:54,309 --> 02:18:57,520
The presumption here is, whether
or not it worked is unclear,
2934
02:18:57,520 --> 02:19:01,570
is that here's the end of
actual license plate number,
2935
02:19:01,570 --> 02:19:03,903
but here's an interesting
single quote and a semicolon.
2936
02:19:03,903 --> 02:19:06,070
That's especially bad because
it means you can maybe
2937
02:19:06,070 --> 02:19:08,059
execute a second query on the database.
2938
02:19:08,059 --> 02:19:11,170
This is someone having fun trying
to drop the entire database
2939
02:19:11,170 --> 02:19:16,090
table for whatever municipality is
scanning through cameras their license
2940
02:19:16,090 --> 02:19:16,750
plate code.
2941
02:19:16,750 --> 02:19:19,150
And I would be remiss if
we didn't end on this note.
2942
02:19:19,150 --> 02:19:22,240
At least in computer
science circles, there
2943
02:19:22,240 --> 02:19:26,920
is someone named, no relation to the
TF name we put in the database earlier,
2944
02:19:26,920 --> 02:19:31,660
little Bobby Tables, which
ends with this XKCD comic.
2945
02:19:31,660 --> 02:19:36,340
And if you chuckle, if you laugh,
you're now legit SQL programmers.
2946
02:19:36,340 --> 02:19:39,049
2947
02:19:39,049 --> 02:19:42,030
Nice, nice.
2948
02:19:42,030 --> 02:19:44,740
Every CS student out there
knows about little Bobby Table.
2949
02:19:44,740 --> 02:19:47,657
So if you name drop little
Bobby Tables now, you're in.
2950
02:19:47,657 --> 02:19:49,240
All right, that's it though for today.
2951
02:19:49,240 --> 02:19:51,780
We will see you next time.
2952
02:19:51,780 --> 02:19:55,130
[MUSIC PLAYING]
2953
02:19:55,130 --> 02:20:28,000245154
Can't find what you're looking for?
Get subtitles in any language from opensubtitles.com, and translate them here.