Would you like to inspect the original subtitles? These are the user uploaded subtitles that are being translated:
0
00:00:00,000 --> 00:00:05,259
1
00:00:05,259 --> 00:00:08,300
DOUG LLOYD: So in CS50, we've covered a lot of different data structures,
2
00:00:08,300 --> 00:00:09,180
right?
3
00:00:09,180 --> 00:00:11,420
We've seen arrays, and linked lists, and hash tables,
4
00:00:11,420 --> 00:00:15,210
and tries, stacks and queues.
5
00:00:15,210 --> 00:00:17,579
We'll also learn a little about trees and heaps,
6
00:00:17,579 --> 00:00:20,120
but really these all just end up being variations on a theme.
7
00:00:20,120 --> 00:00:22,840
There really are these kind of four basic ideas
8
00:00:22,840 --> 00:00:25,190
that everything else can boil down to.
9
00:00:25,190 --> 00:00:28,150
Arrays, linked lists, hash tables, and tries.
10
00:00:28,150 --> 00:00:30,720
And like I said, there are variations on them,
11
00:00:30,720 --> 00:00:32,720
but this is pretty much going to summarize
12
00:00:32,720 --> 00:00:38,140
everything we're going to talk about in this class in terms of C.
13
00:00:38,140 --> 00:00:40,140
>> But how do these all measure up, right?
14
00:00:40,140 --> 00:00:44,265
We've talked about the pros and cons of each in separate videos on them,
15
00:00:44,265 --> 00:00:46,390
but there's a lot of numbers getting thrown around.
16
00:00:46,390 --> 00:00:48,723
There's a lot of general thoughts getting thrown around.
17
00:00:48,723 --> 00:00:51,950
Let's try and consolidate it into just one place.
18
00:00:51,950 --> 00:00:55,507
Let's weigh the pros against the cons, and consider
19
00:00:55,507 --> 00:00:57,340
which data structure might be the right data
20
00:00:57,340 --> 00:01:01,440
structure for your particular situation, whatever kind of data you're storing.
21
00:01:01,440 --> 00:01:06,625
You don't necessarily always need to use the super fast insertion, deletion,
22
00:01:06,625 --> 00:01:10,761
and lookup of a trie if you really don't care about inserting and deleting
23
00:01:10,761 --> 00:01:11,260
too much.
24
00:01:11,260 --> 00:01:13,968
If you need just quickly random access, maybe an array is better.
25
00:01:13,968 --> 00:01:15,340
So let's distill that.
26
00:01:15,340 --> 00:01:18,530
Let's talk about each of the four major kinds of data structures
27
00:01:18,530 --> 00:01:21,720
that we've talked about, and just see when they might be good,
28
00:01:21,720 --> 00:01:23,340
and when they might not be so good.
29
00:01:23,340 --> 00:01:24,610
So let's start with arrays.
30
00:01:24,610 --> 00:01:27,300
So insertion, that's kind of bad.
31
00:01:27,300 --> 00:01:31,350
>> Insertion at the end of an array is OK, if we're building an array as we go.
32
00:01:31,350 --> 00:01:33,570
But if we need to insert elements into the middle,
33
00:01:33,570 --> 00:01:35,550
think back to insertion sort, there's a lot
34
00:01:35,550 --> 00:01:37,510
of shifting to fit an element in there.
35
00:01:37,510 --> 00:01:41,170
And so if we're going to insert anywhere but the end of an array,
36
00:01:41,170 --> 00:01:43,590
that's probably not so great.
37
00:01:43,590 --> 00:01:46,710
>> Similarly, deletion, unless we're deleting from the end of an array,
38
00:01:46,710 --> 00:01:49,810
is probably also not so great if we don't want to leave empty gaps,
39
00:01:49,810 --> 00:01:50,790
which usually we don't.
40
00:01:50,790 --> 00:01:54,700
We want to remove an element, and then sort of make it snug again.
41
00:01:54,700 --> 00:01:57,670
And so deleting elements from an array, also not so great.
42
00:01:57,670 --> 00:01:58,820
>> Lookup, though, is great.
43
00:01:58,820 --> 00:02:00,920
We have random access, constant time lookup.
44
00:02:00,920 --> 00:02:03,800
We just say seven, and we go to array relocation seven.
45
00:02:03,800 --> 00:02:05,907
We say 20, with go to array relocation 20.
46
00:02:05,907 --> 00:02:07,240
We don't have to iterate across.
47
00:02:07,240 --> 00:02:08,630
That's pretty good.
48
00:02:08,630 --> 00:02:11,020
>> Arrays are also relatively easy to sort.
49
00:02:11,020 --> 00:02:14,040
Every time we talked about a sorting algorithm, such as selection sort,
50
00:02:14,040 --> 00:02:18,820
insertion sort, bubble sort, merge sort, we always used arrays to do it,
51
00:02:18,820 --> 00:02:21,860
because arrays are pretty easy to sort, relative to the data structures
52
00:02:21,860 --> 00:02:22,970
we've seen so far.
53
00:02:22,970 --> 00:02:24,320
>> They're also relatively small.
54
00:02:24,320 --> 00:02:25,695
There's not a lot of extra space.
55
00:02:25,695 --> 00:02:29,210
You just set aside exactly as much as you need to hold your data,
56
00:02:29,210 --> 00:02:30,320
and that's pretty much it.
57
00:02:30,320 --> 00:02:33,180
So they're pretty small and efficient in that way.
58
00:02:33,180 --> 00:02:36,000
But another downside, though, is that they are fixed in size.
59
00:02:36,000 --> 00:02:38,630
We have to declare exactly how big we want our array to be,
60
00:02:38,630 --> 00:02:39,940
and we only get one shot at it.
61
00:02:39,940 --> 00:02:41,280
We can't grow and shrink it.
62
00:02:41,280 --> 00:02:44,582
>> If we need to grow or shrink it, we need to declare an entirely new array,
63
00:02:44,582 --> 00:02:47,750
copy all of the elements of the first array into the second array.
64
00:02:47,750 --> 00:02:51,410
And if we miscalculated that time, we need to do it again.
65
00:02:51,410 --> 00:02:52,760
Not so great.
66
00:02:52,760 --> 00:02:58,750
So arrays don't give us the flexibility to have variable numbers of elements.
67
00:02:58,750 --> 00:03:01,320
>> With a linked list, insertion is pretty easy.
68
00:03:01,320 --> 00:03:03,290
We just tack onto the front.
69
00:03:03,290 --> 00:03:04,892
Deletion is also pretty easy.
70
00:03:04,892 --> 00:03:06,100
We have to find the elements.
71
00:03:06,100 --> 00:03:07,270
That involve some searching.
72
00:03:07,270 --> 00:03:10,270
>> But once you've found the element you're looking for, all you need to do
73
00:03:10,270 --> 00:03:12,830
is change a pointer, possibly two if you have
74
00:03:12,830 --> 00:03:15,151
a linked list-- a doubly linked list, rather--
75
00:03:15,151 --> 00:03:16,650
and then you can just free the node.
76
00:03:16,650 --> 00:03:18,399
You don't have to shift everything around.
77
00:03:18,399 --> 00:03:22,090
You just change two pointers, so that's pretty quick.
78
00:03:22,090 --> 00:03:23,470
>> Lookup is bad though, right?
79
00:03:23,470 --> 00:03:26,280
In order for us to find an element in a linked list,
80
00:03:26,280 --> 00:03:29,154
whether singly or doubly linked, we have to linear search it.
81
00:03:29,154 --> 00:03:32,320
We have to start at the beginning and move the end, or start at the end move
82
00:03:32,320 --> 00:03:33,860
to the beginning.
83
00:03:33,860 --> 00:03:35,474
We don't have random access anymore.
84
00:03:35,474 --> 00:03:37,265
So if we're doing a lot of searching, maybe
85
00:03:37,265 --> 00:03:39,830
a linked list isn't quite so good for us.
86
00:03:39,830 --> 00:03:43,750
>> They're also really difficult to sort, right?
87
00:03:43,750 --> 00:03:45,666
The only way you can really sort a linked list
88
00:03:45,666 --> 00:03:47,870
is to sort it as you construct it.
89
00:03:47,870 --> 00:03:50,497
But if you sort it as you construct it, you're no longer
90
00:03:50,497 --> 00:03:51,830
making quick insertions anymore.
91
00:03:51,830 --> 00:03:53,746
You're not just tacking things onto the front.
92
00:03:53,746 --> 00:03:55,710
You have to find the right spot to put it,
93
00:03:55,710 --> 00:03:57,820
and then your insertion becomes just about as bad
94
00:03:57,820 --> 00:03:59,390
as inserting into an array.
95
00:03:59,390 --> 00:04:03,130
So linked lists are not so great for sorting data.
96
00:04:03,130 --> 00:04:05,830
>> They're also pretty small, size-wise.
97
00:04:05,830 --> 00:04:08,496
Doubly linked list slightly larger than singly linked lists,
98
00:04:08,496 --> 00:04:10,620
which are slightly larger than arrays, but it's not
99
00:04:10,620 --> 00:04:13,330
a huge amount of wasted space.
100
00:04:13,330 --> 00:04:18,730
So if space is at a premium, but not a really intense premium,
101
00:04:18,730 --> 00:04:22,180
this might be the right way to go.
102
00:04:22,180 --> 00:04:23,330
>> Hash tables.
103
00:04:23,330 --> 00:04:25,850
Insertion into a hash table is fairly straightforward.
104
00:04:25,850 --> 00:04:26,980
It's a two-step process.
105
00:04:26,980 --> 00:04:30,700
First we need to run our data through a hash function to get a hash code,
106
00:04:30,700 --> 00:04:37,550
and then we insert the element into the hash table at that hash code location.
107
00:04:37,550 --> 00:04:40,879
>> Deletion, similar to linked list, is easy once you find the element.
108
00:04:40,879 --> 00:04:43,170
You have to find it first, but then when you delete it,
109
00:04:43,170 --> 00:04:45,128
you just need to exchange a couple of pointers,
110
00:04:45,128 --> 00:04:47,250
if you're using separate chaining.
111
00:04:47,250 --> 00:04:49,942
If you're using probing, or if you're not
112
00:04:49,942 --> 00:04:51,650
using chaining at all in your hash table,
113
00:04:51,650 --> 00:04:53,040
deletion is actually really easy.
114
00:04:53,040 --> 00:04:57,134
All you need to do is hash the data, and then go to that location.
115
00:04:57,134 --> 00:04:58,925
And assuming you don't have any collisions,
116
00:04:58,925 --> 00:05:01,650
you'll be able to delete very quickly.
117
00:05:01,650 --> 00:05:04,930
>> Now, lookup is where things get a little more complicated.
118
00:05:04,930 --> 00:05:06,910
It's on average better than linked lists.
119
00:05:06,910 --> 00:05:09,560
If you're using chaining, you still have a linked list,
120
00:05:09,560 --> 00:05:13,170
which means you still have the search detriment a linked list.
121
00:05:13,170 --> 00:05:18,390
But because you're taking your linked list and splitting it over 100 or 1,000
122
00:05:18,390 --> 00:05:25,380
or n elements in your hash table, you're linked lists are all one nth the size.
123
00:05:25,380 --> 00:05:27,650
They're all substantially smaller.
124
00:05:27,650 --> 00:05:32,080
You have n linked lists instead of one linked list of size n.
125
00:05:32,080 --> 00:05:34,960
>> And so this real-world constant factor, which we generally
126
00:05:34,960 --> 00:05:39,730
don't talk about in time complexity, it does actually make a difference here.
127
00:05:39,730 --> 00:05:43,020
So lookup is still linear search if you're using chaining,
128
00:05:43,020 --> 00:05:46,780
but the length of the list you're searching through
129
00:05:46,780 --> 00:05:50,080
is very, very short by comparison.
130
00:05:50,080 --> 00:05:52,995
Again, if sorting is your goal here, hash table's
131
00:05:52,995 --> 00:05:54,370
probably not the right way to go.
132
00:05:54,370 --> 00:05:56,830
Just use an array if sorting is really important to you.
133
00:05:56,830 --> 00:05:58,590
>> And they can run the gamut of size.
134
00:05:58,590 --> 00:06:01,640
It's hard to say whether a hash table is small or big,
135
00:06:01,640 --> 00:06:04,110
because it really depends on how large your hash table is.
136
00:06:04,110 --> 00:06:07,340
If you're only going to be storing five elements in your hash table,
137
00:06:07,340 --> 00:06:10,620
and you have a hash table with 10,000 elements in it,
138
00:06:10,620 --> 00:06:12,614
you're probably wasting a lot of space.
139
00:06:12,614 --> 00:06:15,030
Contrast being you can also have very compact hash tables,
140
00:06:15,030 --> 00:06:18,720
but the smaller your hash table gets, the longer each of those linked lists
141
00:06:18,720 --> 00:06:19,220
gets.
142
00:06:19,220 --> 00:06:22,607
And so there's really no way to define exactly the size of a hash table,
143
00:06:22,607 --> 00:06:24,440
but it's probably safe to say it's generally
144
00:06:24,440 --> 00:06:27,990
going to be bigger than a linked list storing the same data,
145
00:06:27,990 --> 00:06:30,400
but smaller than a trie.
146
00:06:30,400 --> 00:06:32,720
>> And tries are the fourth of these structures
147
00:06:32,720 --> 00:06:34,070
that we've been talking about.
148
00:06:34,070 --> 00:06:36,450
Inserting into a trie is complex.
149
00:06:36,450 --> 00:06:38,400
There's a lot of dynamic memory allocation,
150
00:06:38,400 --> 00:06:40,780
especially at the beginning, as you're starting to build.
151
00:06:40,780 --> 00:06:43,700
But it's constant time.
152
00:06:43,700 --> 00:06:47,690
It's only the human element here that makes it tricky.
153
00:06:47,690 --> 00:06:53,250
Having to encounter null pointer, malloc space, go there, possibly malloc space
154
00:06:53,250 --> 00:06:54,490
from there again.
155
00:06:54,490 --> 00:06:58,880
The sort of intimidation factor of pointers in dynamic memory allocation
156
00:06:58,880 --> 00:07:00,130
is the hurdle to clear.
157
00:07:00,130 --> 00:07:04,550
But once you've cleared it, insertion actually comes quite simple,
158
00:07:04,550 --> 00:07:06,810
and it certainly is constant time.
159
00:07:06,810 --> 00:07:07,680
>> Deletion is easy.
160
00:07:07,680 --> 00:07:11,330
All you need to do is navigate down a couple of pointers and free the node,
161
00:07:11,330 --> 00:07:12,420
so that's pretty good.
162
00:07:12,420 --> 00:07:13,930
Lookup is also pretty fast.
163
00:07:13,930 --> 00:07:16,780
It's only based on the length of your data.
164
00:07:16,780 --> 00:07:19,924
So if all of your data is five character strings,
165
00:07:19,924 --> 00:07:22,590
for example, you're storing five character strings in your trie,
166
00:07:22,590 --> 00:07:25,439
it only takes five steps to find what you're looking for.
167
00:07:25,439 --> 00:07:28,480
Five is just a constant factor, so again, insertion, deletion, and lookup
168
00:07:28,480 --> 00:07:31,670
here are all constant time, effectively.
169
00:07:31,670 --> 00:07:34,880
>> Another thing is that your trie is actually kind of already sorted, right?
170
00:07:34,880 --> 00:07:36,800
By virtue of how we're inserting elements,
171
00:07:36,800 --> 00:07:40,060
by going letter by letter of the key, or digit by digit of the key,
172
00:07:40,060 --> 00:07:45,084
typically, your trie ends up being kind of sorted as you build it.
173
00:07:45,084 --> 00:07:47,250
It doesn't really makes sense to think about sorting
174
00:07:47,250 --> 00:07:49,874
in the same way we think about it with arrays, or linked lists,
175
00:07:49,874 --> 00:07:51,070
or hash tables.
176
00:07:51,070 --> 00:07:54,780
But in some sense, your trie is sorted as you go.
177
00:07:54,780 --> 00:07:58,630
>> The downside, of course, is that a trie rapidly becomes huge.
178
00:07:58,630 --> 00:08:02,970
From every junction point, you might have-- if your key consists of digits,
179
00:08:02,970 --> 00:08:04,880
you have 10 other places you can go, which
180
00:08:04,880 --> 00:08:07,490
means that every node contains information
181
00:08:07,490 --> 00:08:11,440
about the data you want to store at that node, plus 10 pointers.
182
00:08:11,440 --> 00:08:14,430
Which, on CS50 IDE, is 80 bytes.
183
00:08:14,430 --> 00:08:17,220
So it's at least 80 bytes for every node that you create,
184
00:08:17,220 --> 00:08:19,240
and that's not even counting data.
185
00:08:19,240 --> 00:08:24,950
And if your nodes are letters instead of digits,
186
00:08:24,950 --> 00:08:27,825
now you have 26 pointers from every location.
187
00:08:27,825 --> 00:08:32,007
And 26 times 8 is probably 200 bytes, or something like that.
188
00:08:32,007 --> 00:08:33,840
And you have capital and lowercase-- you can
189
00:08:33,840 --> 00:08:35,381
see where I'm going with this, right?
190
00:08:35,381 --> 00:08:37,500
Your nodes can get really big, and so the trie
191
00:08:37,500 --> 00:08:40,470
itself, overall, can get really big, too.
192
00:08:40,470 --> 00:08:42,630
So if space is at a high premium on your system,
193
00:08:42,630 --> 00:08:45,830
a trie might not be the right way to go, even though its other benefits
194
00:08:45,830 --> 00:08:47,780
come into play.
195
00:08:47,780 --> 00:08:48,710
I'm Doug Lloyd.
196
00:08:48,710 --> 00:08:50,740
This is CS50.
197
00:08:50,740 --> 00:08:52,316
16389
Can't find what you're looking for?
Get subtitles in any language from opensubtitles.com, and translate them here.