1
00:00:00,000 --> 00:00:09,000
Hi everyone, welcome to the HINTS video for Project 2.3, Iterative Machine Learning Training.
2
00:00:09,000 --> 00:00:13,000
In this video, I will walk you through the following topics.
3
00:00:13,000 --> 00:00:17,000
First, let's take a look at the starter code in DABS.
4
00:00:17,000 --> 00:00:23,000
We'll figure out the format of the samples RDD and the rationale behind using sparse format.
5
00:00:23,000 --> 00:00:27,000
Then we'll focus on the scalability problem of the starter code
6
00:00:27,000 --> 00:00:34,000
and think about how we can store the weights array in the worker nodes rather than purely in the driver node.
7
00:00:34,000 --> 00:00:41,000
And I will give the first hint about how to use an inverted index and join-based communication.
8
00:00:41,000 --> 00:00:47,000
Then we will talk about the performance problem of the draft optimization
9
00:00:47,000 --> 00:00:52,000
and how to utilize partitions to optimize join operators.
10
00:00:52,000 --> 00:00:56,000
I will give an introduction to RDD join operators after that.
11
00:00:56,000 --> 00:00:57,000
At last,
12
00:00:57,000 --> 00:01:01,000
I will give some further tips that we want to share with you guys.
13
00:01:01,000 --> 00:01:02,000
Yeah.
14
00:01:02,000 --> 00:01:04,000
OK, so let's get started.
15
00:01:04,000 --> 00:01:07,000
As you have watched in the overview video,
16
00:01:07,000 --> 00:01:11,000
this is the core logic of our machine learning training.
17
00:01:11,000 --> 00:01:15,000
We have data samples and model parameters as inputs.
18
00:01:15,000 --> 00:01:22,000
The data samples are stored as an RDD, partitioned across executors.
19
00:01:22,000 --> 00:01:26,000
The model parameters are stored as a weight array in the driver.
20
00:01:26,000 --> 00:01:29,000
Next is the training iteration.
21
00:01:29,000 --> 00:01:34,000
Specifically, the driver will broadcast the weights to all executors.
22
00:01:34,000 --> 00:01:42,000
Then each executor will compute the gradient updates with respect to the data partition that resides in that executor.
23
00:01:42,000 --> 00:01:50,000
Next, the driver will collect the updates from the executors and update the parameters locally.
24
00:01:50,000 --> 00:01:55,000
Then the iteration will repeat until the model converges.
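
As an illustration, here is a minimal sketch of that loop, assuming PySpark and a logistic-regression-style update; NUM_FEATURES, the toy samples, and compute_gradient are placeholders rather than the actual starter code.

    from pyspark import SparkContext
    import numpy as np

    sc = SparkContext.getOrCreate()

    NUM_FEATURES = 5
    weights = np.zeros(NUM_FEATURES)           # model parameters kept in the driver

    # Toy samples RDD: (label, feature_id_list, feature_value_list), partitioned across executors.
    samples_rdd = sc.parallelize(
        [(1.0, [0, 3], [1.0, 2.0]), (0.0, [1, 4], [0.5, 1.5])], 2).cache()

    def compute_gradient(sample, w):
        # Placeholder per-sample gradient (logistic regression assumed).
        label, ids, vals = sample
        pred = 1.0 / (1.0 + np.exp(-sum(w[i] * v for i, v in zip(ids, vals))))
        g = np.zeros(len(w))
        for i, v in zip(ids, vals):
            g[i] = (pred - label) * v
        return g

    for _ in range(10):                        # training iterations
        w_bc = sc.broadcast(weights)           # 1. driver broadcasts the weights
        grad = (samples_rdd                    # 2. executors compute gradients on their partitions
                .map(lambda s: compute_gradient(s, w_bc.value))
                .reduce(lambda a, b: a + b))   # 3. updates are aggregated back to the driver
        weights = weights - 0.1 * grad         # 4. driver updates the parameters locally
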
25
00:01:55,000 --> 00:02:03,000
Here we focus on the most significant computation, namely the computation of gradient descent.
26
00:02:03,000 --> 00:02:11,000
So for gradient descent, we have the broadcasted weight array and the worker's local samples as inputs.
27
00:02:11,000 --> 00:02:16,000
We need to compute prediction based on the result of the sigmoid function.
28
00:02:16,000 --> 00:02:24,000
But before that, we need to do a dot product between the broadcasted weight array and the feature values for each sample.
29
00:02:24,000 --> 00:02:32,000
Then we can use the prediction result to compute the gradient and of course the loss function.
30
00:02:32,000 --> 00:02:38,000
As we can see here, the dot product combines values from both inputs,
31
00:02:38,000 --> 00:02:45,000
therefore it's the most important computation that we will consider for optimization.
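
Concretely, the per-sample computation could look like this sketch (logistic regression with log loss is assumed here; check the starter code for the exact formulas used in the project):

    import numpy as np

    def per_sample_update(w, x, y):
        """w: dense weight array, x: dense feature values, y: 0/1 label."""
        margin = np.dot(w, x)                    # dot product of weights and feature values
        pred = 1.0 / (1.0 + np.exp(-margin))     # sigmoid prediction
        grad = (pred - y) * x                    # gradient contribution of this sample
        loss = -(y * np.log(pred) + (1 - y) * np.log(1 - pred))   # log loss
        return grad, loss

    g, l = per_sample_update(np.array([0.1, 0.0, 0.2]), np.array([1.0, 0.0, 3.0]), 1.0)
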
32
00:02:45,000 --> 00:02:47,000
Let's go through an example.
33
00:02:47,000 --> 00:02:52,000
Here we have five features.
34
00:02:52,000 --> 00:02:59,000
Therefore the weight array has five elements, and each sample also has five.
35
00:02:59,000 --> 00:03:02,000
By taking their dot product, we can get a result.
36
00:03:02,000 --> 00:03:11,000
However, it's clear that since there are a lot of zeros, as you can see here in each sample,
37
00:03:11,000 --> 00:03:14,000
the final result also contains a lot of zeros.
38
00:03:14,000 --> 00:03:19,000
This means if we use the whole sample array to do the dot product,
39
00:03:19,000 --> 00:03:22,000
then there will be a lot of unnecessary computation.
40
00:03:22,000 --> 00:03:26,000
This is especially true in real-world training sets,
41
00:03:26,000 --> 00:03:30,000
since there are typically a lot of zero
42
00:03:30,000 --> 00:03:33,000
or null values in those samples.
43
00:03:33,000 --> 00:03:36,000
So how can we solve this problem?
44
00:03:36,000 --> 00:03:40,000
The answer is to use a sparse format instead of a dense format.
45
00:03:40,000 --> 00:03:44,000
Rather than storing the whole dense vector for each sample,
46
00:03:44,000 --> 00:03:48,000
we store those feature values that are non-zero.
47
00:03:48,000 --> 00:03:51,000
And here is a logical representation of this.
48
00:03:52,000 --> 00:03:59,000
For sparse format, we have a key-value pair list for each sample,
49
00:03:59,000 --> 00:04:04,000
and the key is the feature ID while the value is the feature's value.
50
00:04:04,000 --> 00:04:09,000
For example, here 0 represents feature 0,
51
00:04:09,000 --> 00:04:15,000
and 1 means feature 0 has a value equal to 1.
52
00:04:15,000 --> 00:04:17,000
Note that for the starter code,
53
00:04:17,000 --> 00:04:22,000
the physical representation is actually a tuple,
54
00:04:22,000 --> 00:04:24,000
with three elements,
55
00:04:24,000 --> 00:04:26,000
a label,
56
00:04:26,000 --> 00:04:29,000
a feature ID list,
57
00:04:29,000 --> 00:04:31,000
and a feature value list.
58
00:04:31,000 --> 00:04:34,000
Then, in order to do the dot product,
59
00:04:34,000 --> 00:04:37,000
we need to use the feature ID list
60
00:04:37,000 --> 00:04:40,000
to retrieve all the corresponding weights
61
00:04:40,000 --> 00:04:42,000
and do the dot product
62
00:04:42,000 --> 00:04:47,000
between them and the feature value list.
63
00:04:47,000 --> 00:04:49,000
In this way,
64
00:04:49,000 --> 00:04:51,000
the computations
65
00:04:51,000 --> 00:04:53,000
for zero products
66
00:04:53,000 --> 00:04:55,000
can be saved.
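
As a small sketch of that sparse dot product, assuming the (label, feature ID list, feature value list) tuple just described (the exact field order in the starter code may differ):

    def sparse_dot(weights, sample):
        label, feature_ids, feature_vals = sample
        # Only the non-zero positions are touched, so zero products are skipped entirely.
        return sum(weights[i] * v for i, v in zip(feature_ids, feature_vals))

    w = [0.5, 0.0, 0.25, 1.0, 0.0]
    sample = (1.0, [0, 3], [1.0, 2.0])     # feature 0 has value 1.0, feature 3 has value 2.0
    print(sparse_dot(w, sample))           # 0.5 * 1.0 + 1.0 * 2.0 = 2.5
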
67
00:04:55,000 --> 00:04:58,000
As a quick summary,
68
00:04:58,000 --> 00:05:02,000
the final samples RDD looks like this.
69
00:05:02,000 --> 00:05:03,000
For each sample,
70
00:05:03,000 --> 00:05:06,000
we will only store useful features
71
00:05:06,000 --> 00:05:09,000
and their corresponding non-zero values.
72
00:05:09,000 --> 00:05:10,000
Next,
73
00:05:10,000 --> 00:05:15,000
we will take a look at the scalability problem in the starter code.
74
00:05:15,000 --> 00:05:16,000
From a high level,
75
00:05:16,000 --> 00:05:19,000
there are two limitations in the starter code.
76
00:05:19,000 --> 00:05:20,000
First,
77
00:05:20,000 --> 00:05:22,000
there is the single-node memory limit.
78
00:05:22,000 --> 00:05:24,000
The training set might have too many features
79
00:05:24,000 --> 00:05:28,000
to fit into a single node's memory.
80
00:05:28,000 --> 00:05:29,000
Secondly,
81
00:05:29,000 --> 00:05:30,000
for communication,
82
00:05:30,000 --> 00:05:33,000
since each partition might only require
83
00:05:33,000 --> 00:05:37,000
a subset of parameters to do the dot product,
84
00:05:37,000 --> 00:05:41,000
broadcasting the whole weight array across the nodes
85
00:05:41,000 --> 00:05:42,000
might be a waste.
86
00:05:42,000 --> 00:05:43,000
So,
87
00:05:43,000 --> 00:05:46,000
how can we improve the implementation?
88
00:05:46,000 --> 00:05:49,000
The first step is to partition the weight array,
89
00:05:49,000 --> 00:05:52,000
like what we did for the sample data,
90
00:05:52,000 --> 00:05:55,000
so that we can store them across the whole cluster
91
00:05:55,000 --> 00:05:59,000
instead of only within one single node.
92
00:05:59,000 --> 00:06:00,000
Then,
93
00:06:00,000 --> 00:06:01,000
we need to figure out a way
94
00:06:01,000 --> 00:06:04,000
to tag each weight with a feature ID.
95
00:06:04,000 --> 00:06:06,000
After this step,
96
00:06:06,000 --> 00:06:12,000
we can get the weights RDD,
97
00:06:12,000 --> 00:06:15,000
in which the key is the feature ID
98
00:06:15,000 --> 00:06:18,000
and the value is a feature weight.
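
One possible way to build such a weights RDD, as a PySpark sketch (the initial values and partition count are placeholders):

    from pyspark import SparkContext

    sc = SparkContext.getOrCreate()

    initial_weights = [0.0] * 5
    # (feature_id, weight) pairs, stored across the cluster instead of only in the driver.
    weights_rdd = sc.parallelize(list(enumerate(initial_weights)), 4)
    # An equivalent route: parallelize the values and use zipWithIndex to attach feature IDs.
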
99
00:06:18,000 --> 00:06:21,000
The next step is that we need to bring the feature weights
100
00:06:21,000 --> 00:06:22,000
to each sample
101
00:06:22,000 --> 00:06:24,000
so that we can do the dot product
102
00:06:24,000 --> 00:06:26,000
and further computation.
103
00:06:26,000 --> 00:06:27,000
In general,
104
00:06:27,000 --> 00:06:31,000
the goal is to combine the samples RDD
105
00:06:31,000 --> 00:06:35,000
and the weights RDD together.
106
00:06:35,000 --> 00:06:37,000
A simple way to do this
107
00:06:37,000 --> 00:06:40,000
is to use the RDD lookup operation
108
00:06:40,000 --> 00:06:42,000
for each sample.
109
00:06:42,000 --> 00:06:45,000
Since we know each sample's relevant feature IDs,
110
00:06:45,000 --> 00:06:47,000
we can use those feature IDs
111
00:06:47,000 --> 00:06:52,000
to fetch each weight from the weights RDD
112
00:06:52,000 --> 00:06:54,000
and then do the computation.
113
00:06:54,000 --> 00:06:55,000
However,
114
00:06:55,000 --> 00:06:58,000
since lookup can only bring the weights
115
00:06:58,000 --> 00:06:59,000
to the driver node,
116
00:06:59,000 --> 00:07:02,000
we still need to either broadcast the weights
117
00:07:02,000 --> 00:07:03,000
to the worker nodes
118
00:07:03,000 --> 00:07:06,000
or fetch each sample to the driver,
119
00:07:06,000 --> 00:07:09,000
then do the computation.
120
00:07:09,000 --> 00:07:11,000
This is very slow.
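
To see why, here is roughly what the lookup-based approach looks like as a sketch; every lookup launches a separate job and returns its result to the driver:

    from pyspark import SparkContext

    sc = SparkContext.getOrCreate()

    weights_rdd = sc.parallelize([(0, 0.5), (1, 0.0), (2, 0.25), (3, 1.0)])
    sample = (1.0, [0, 3], [1.0, 2.0])        # one sample, held on the driver for illustration

    # Anti-pattern: one lookup (one job) per feature, with the result returned to the driver.
    label, fids, fvals = sample
    fetched = [weights_rdd.lookup(fid)[0] for fid in fids]
    margin = sum(w * v for w, v in zip(fetched, fvals))   # dot product done on the driver
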
121
00:07:11,000 --> 00:07:12,000
So,
122
00:07:12,000 --> 00:07:16,000
is there any better way to do the combination?
123
00:07:16,000 --> 00:07:17,000
Actually,
124
00:07:17,000 --> 00:07:19,000
what we want here is
125
00:07:19,000 --> 00:07:21,000
a distributed shuffle operation
126
00:07:21,000 --> 00:07:23,000
to combine the samples dataset
127
00:07:23,000 --> 00:07:25,000
and the weights dataset.
128
00:07:25,000 --> 00:07:29,000
And let's use the join operation.
129
00:07:29,000 --> 00:07:32,000
One potential issue is that
130
00:07:32,000 --> 00:07:35,000
the weights RDD
131
00:07:35,000 --> 00:07:37,000
itself
132
00:07:37,000 --> 00:07:41,000
is a KV pair,
133
00:07:41,000 --> 00:07:44,000
but the samples RDD is a list.
134
00:07:44,000 --> 00:07:45,000
To solve this,
135
00:07:45,000 --> 00:07:51,000
we will consider an inverted index.
136
00:07:51,000 --> 00:07:54,000
Here is an inverted index computation pipeline
137
00:07:54,000 --> 00:07:57,000
for the samples RDD.
138
00:07:57,000 --> 00:07:58,000
As shown here,
139
00:07:58,000 --> 00:08:01,000
we can first do a map operation
140
00:08:01,000 --> 00:08:04,000
to construct multiple key-value pairs
141
00:08:04,000 --> 00:08:06,000
for each sample
142
00:08:06,000 --> 00:08:10,000
as shown in the operation here,
143
00:08:10,000 --> 00:08:12,000
where the key is a feature ID
144
00:08:12,000 --> 00:08:14,000
and the value is a
145
00:08:14,000 --> 00:08:18,000
single-sample sparse vector.
146
00:08:18,000 --> 00:08:19,000
Note that
147
00:08:19,000 --> 00:08:21,000
since different samples might have
148
00:08:21,000 --> 00:08:24,000
the same relevant feature,
149
00:08:24,000 --> 00:08:26,000
for example,
150
00:08:26,000 --> 00:08:29,000
sample 2 will generate a pair for
151
00:08:29,000 --> 00:08:31,000
feature 3,
152
00:08:31,000 --> 00:08:33,000
and sample 3 would also generate
153
00:08:33,000 --> 00:08:35,000
a key-value pair
154
00:08:35,000 --> 00:08:39,000
for the same feature,
155
00:08:39,000 --> 00:08:42,000
that is, feature 3.
156
00:08:42,000 --> 00:08:43,000
So,
157
00:08:43,000 --> 00:08:45,000
to handle this,
158
00:08:45,000 --> 00:08:46,000
we can
159
00:08:46,000 --> 00:08:49,000
use the
160
00:08:49,000 --> 00:08:51,000
groupByKey operation
161
00:08:51,000 --> 00:08:53,000
to group those
162
00:08:53,000 --> 00:08:54,000
values
163
00:08:54,000 --> 00:08:56,000
with the same key together,
164
00:08:56,000 --> 00:08:58,000
so that we don't end up
165
00:08:58,000 --> 00:09:00,000
with duplicate keys.
166
00:09:00,000 --> 00:09:01,000
Here, for example,
167
00:09:01,000 --> 00:09:02,000
for feature 3,
168
00:09:02,000 --> 00:09:04,000
the values would be a list
169
00:09:04,000 --> 00:09:05,000
that contains
170
00:09:05,000 --> 00:09:07,000
sample 2 and sample 3.
171
00:09:11,000 --> 00:09:13,000
And at last,
172
00:09:13,000 --> 00:09:15,000
we'll have a key-value data set
173
00:09:15,000 --> 00:09:17,000
mapping from feature ID
174
00:09:17,000 --> 00:09:19,000
to a list of samples.
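
In PySpark, this inverted index could be built roughly as follows (a sketch with toy samples; flatMap plays the role of the map step described above):

    from pyspark import SparkContext

    sc = SparkContext.getOrCreate()

    # Each sample: (label, feature_id_list, feature_value_list)
    samples_rdd = sc.parallelize([
        (1.0, [0, 3], [1.0, 2.0]),      # sample 1
        (0.0, [3, 4], [0.5, 1.5]),      # sample 2
    ])

    # 1. Emit one (feature_id, sample) pair per non-zero feature of each sample.
    pairs = samples_rdd.flatMap(lambda s: [(fid, s) for fid in s[1]])

    # 2. Group pairs that share the same feature ID.
    feature_to_samples = pairs.groupByKey().mapValues(list)
    # e.g. feature 3 maps to [sample 1, sample 2]
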
175
00:09:19,000 --> 00:09:20,000
Now,
176
00:09:20,000 --> 00:09:22,000
we can do a join operation
177
00:09:22,000 --> 00:09:24,000
between the intermediate data set
178
00:09:24,000 --> 00:09:26,000
and the weights RDD,
179
00:09:26,000 --> 00:09:28,000
because they share the
180
00:09:28,000 --> 00:09:31,000
same keys.
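
The join step itself might then look like this small sketch (the contents are illustrative placeholders):

    from pyspark import SparkContext

    sc = SparkContext.getOrCreate()

    # feature_id -> list of samples (output of the inverted index step)
    feature_to_samples = sc.parallelize([(0, ["s1"]), (3, ["s1", "s2"]), (4, ["s2"])])
    # feature_id -> weight
    weights_rdd = sc.parallelize([(0, 0.5), (3, 1.0), (4, 0.0)])

    # Join on the shared key: feature_id -> (sample_list, weight)
    joined = feature_to_samples.join(weights_rdd)
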
181
00:09:31,000 --> 00:09:32,000
So,
182
00:09:32,000 --> 00:09:34,000
here is the draft data flow
183
00:09:34,000 --> 00:09:36,000
for our training task.
184
00:09:36,000 --> 00:09:38,000
As shown in this whole pipeline
185
00:09:38,000 --> 00:09:41,000
after we do the join operation,
186
00:09:41,000 --> 00:09:42,000
we will get a
187
00:09:42,000 --> 00:09:45,000
new mapping
188
00:09:45,000 --> 00:09:46,000
that is from
189
00:09:46,000 --> 00:09:47,000
feature ID
190
00:09:47,000 --> 00:09:49,000
to its weight value
191
00:09:49,000 --> 00:09:50,000
and the list of
192
00:09:50,000 --> 00:09:53,000
samples
193
00:09:53,000 --> 00:09:54,000
here.
194
00:09:54,000 --> 00:09:56,000
Note that for each key-value pair,
195
00:09:56,000 --> 00:10:00,000
we only have one feature.
196
00:10:00,000 --> 00:10:01,000
So,
197
00:10:01,000 --> 00:10:02,000
we're still not able
198
00:10:02,000 --> 00:10:04,000
to do the dot product here.
199
00:10:04,000 --> 00:10:05,000
Therefore,
200
00:10:05,000 --> 00:10:06,000
in order for
201
00:10:06,000 --> 00:10:07,000
each sample
202
00:10:07,000 --> 00:10:10,000
to get all its relevant parameter weights,
203
00:10:10,000 --> 00:10:11,000
we can do an
204
00:10:11,000 --> 00:10:14,000
inverted index again.
205
00:10:14,000 --> 00:10:15,000
That is,
206
00:10:15,000 --> 00:10:17,000
to construct a new data set
207
00:10:17,000 --> 00:10:20,000
with each sample as the key
208
00:10:20,000 --> 00:10:22,000
and
209
00:10:22,000 --> 00:10:24,000
its whole relevant weight list
210
00:10:24,000 --> 00:10:26,000
as the value.
211
00:10:26,000 --> 00:10:27,000
This way,
212
00:10:27,000 --> 00:10:29,000
each sample can update itself
213
00:10:29,000 --> 00:10:31,000
based on the corresponding
214
00:10:31,000 --> 00:10:32,000
weight list.
215
00:10:32,000 --> 00:10:33,000
Unfortunately,
216
00:10:33,000 --> 00:10:35,000
life is not easy.
217
00:10:35,000 --> 00:10:36,000
You might have noticed
218
00:10:36,000 --> 00:10:39,000
the warning sign here.
219
00:10:39,000 --> 00:10:40,000
There are several problems
220
00:10:40,000 --> 00:10:41,000
for this design.
221
00:10:41,000 --> 00:10:42,000
For one thing,
222
00:10:42,000 --> 00:10:44,000
the first inverted index
223
00:10:44,000 --> 00:10:46,000
is very expensive.
224
00:10:46,000 --> 00:10:48,000
It's possible that
225
00:10:48,000 --> 00:10:50,000
we replicate the whole
226
00:10:50,000 --> 00:10:51,000
sample data set
227
00:10:51,000 --> 00:10:52,000
multiple times.
228
00:10:52,000 --> 00:10:53,000
For example,
229
00:10:53,000 --> 00:10:55,000
if we have a hot feature
230
00:10:55,000 --> 00:10:57,000
that every sample
231
00:10:57,000 --> 00:11:00,000
has a non-zero value for,
232
00:11:00,000 --> 00:11:03,000
then for that feature's
233
00:11:03,000 --> 00:11:04,000
key-value pair,
234
00:11:04,000 --> 00:11:06,000
we have to store the whole data set
235
00:11:06,000 --> 00:11:07,000
in one list,
236
00:11:07,000 --> 00:11:09,000
which is clearly not acceptable.
237
00:11:10,000 --> 00:11:12,000
For another,
238
00:11:12,000 --> 00:11:13,000
the join operation
239
00:11:13,000 --> 00:11:15,000
is also very expensive,
240
00:11:15,000 --> 00:11:17,000
since the number of features
241
00:11:17,000 --> 00:11:19,000
might be huge,
242
00:11:19,000 --> 00:11:20,000
in which case,
243
00:11:20,000 --> 00:11:22,000
we are trying to join
244
00:11:22,000 --> 00:11:24,000
a large number of records
245
00:11:24,000 --> 00:11:27,000
while each weight key-value pair
246
00:11:27,000 --> 00:11:30,000
has a very small size,
247
00:11:30,000 --> 00:11:33,000
namely one single feature weight,
248
00:11:33,000 --> 00:11:36,000
so it is also not desirable.
249
00:11:36,000 --> 00:11:39,000
Before we give
250
00:11:39,000 --> 00:11:41,000
our next hint,
251
00:11:41,000 --> 00:11:42,000
after this part,
252
00:11:42,000 --> 00:11:43,000
you should be able
253
00:11:43,000 --> 00:11:44,000
to construct
254
00:11:44,000 --> 00:11:46,000
an
255
00:11:46,000 --> 00:11:48,000
inverted index
256
00:11:48,000 --> 00:11:50,000
using RDD operations
257
00:11:50,000 --> 00:11:52,000
and know the basic ideas
258
00:11:52,000 --> 00:11:55,000
of how to do
259
00:11:55,000 --> 00:11:57,000
join-based communication
260
00:11:57,000 --> 00:12:00,000
to combine two RDDs.
261
00:12:00,000 --> 00:12:02,000
Make sure you comprehend those
262
00:12:02,000 --> 00:12:04,000
and then we can proceed
263
00:12:04,000 --> 00:12:06,000
to the next steps.
264
00:12:06,000 --> 00:12:07,000
So now we focus on
265
00:12:07,000 --> 00:12:08,000
the performance part.
266
00:12:08,000 --> 00:12:11,000
As we stated before,
267
00:12:11,000 --> 00:12:12,000
one of our
268
00:12:12,000 --> 00:12:13,000
draft implementation's
269
00:12:13,000 --> 00:12:14,000
drawbacks is that
270
00:12:14,000 --> 00:12:15,000
it tries to join
271
00:12:15,000 --> 00:12:17,000
a large number of records,
272
00:12:17,000 --> 00:12:18,000
and each record
273
00:12:18,000 --> 00:12:21,000
is small in size.
274
00:12:21,000 --> 00:12:24,000
From the test case perspective,
275
00:12:24,000 --> 00:12:25,000
for example,
276
00:12:25,000 --> 00:12:27,000
for test case A,
277
00:12:27,000 --> 00:12:28,000
it contains over
278
00:12:28,000 --> 00:12:29,000
20 million features,
279
00:12:29,000 --> 00:12:31,000
which makes the join cost
280
00:12:31,000 --> 00:12:32,000
unacceptable
281
00:12:32,000 --> 00:12:33,000
because it joins
282
00:12:33,000 --> 00:12:37,000
on the feature side.
283
00:12:37,000 --> 00:12:38,000
Let us now
284
00:12:38,000 --> 00:12:39,000
take a look at
285
00:12:39,000 --> 00:12:40,000
a test case
286
00:12:40,000 --> 00:12:45,000
that has multiple times
287
00:12:45,000 --> 00:12:48,000
the feature dimensionality
288
00:12:48,000 --> 00:12:50,000
compared to test case A.
289
00:12:50,000 --> 00:12:51,000
So our goal is that
290
00:12:51,000 --> 00:12:53,000
we want to join
291
00:12:53,000 --> 00:12:55,000
a smaller number of records
292
00:12:55,000 --> 00:12:56,000
while each of them
293
00:12:56,000 --> 00:12:58,000
has a larger size.
294
00:12:58,000 --> 00:13:00,000
And we want to avoid
295
00:13:00,000 --> 00:13:03,000
storing multiple copies
296
00:13:03,000 --> 00:13:06,000
of each sample.
297
00:13:06,000 --> 00:13:07,000
Okay,
298
00:13:07,000 --> 00:13:09,000
so here is our second hint.
299
00:13:09,000 --> 00:13:10,000
Instead of using
300
00:13:10,000 --> 00:13:11,000
logical feature IDs
301
00:13:11,000 --> 00:13:13,000
or sample IDs,
302
00:13:13,000 --> 00:13:15,000
let's use partition IDs
303
00:13:15,000 --> 00:13:17,000
and do the join operation
304
00:13:17,000 --> 00:13:19,000
based on the partition ID.
305
00:13:19,000 --> 00:13:20,000
In this way,
306
00:13:20,000 --> 00:13:21,000
we can adjust
307
00:13:21,000 --> 00:13:23,000
the number of partitions
308
00:13:23,000 --> 00:13:24,000
so that we can control
309
00:13:24,000 --> 00:13:26,000
the size of the
310
00:13:26,000 --> 00:13:29,000
to-be-joined records.
311
00:13:29,000 --> 00:13:31,000
So we do the join
312
00:13:31,000 --> 00:13:32,000
based on
313
00:13:32,000 --> 00:13:35,000
the partition ID here.
314
00:13:35,000 --> 00:13:36,000
As shown
315
00:13:36,000 --> 00:13:38,000
in the lower half
316
00:13:38,000 --> 00:13:39,000
of the pipeline,
317
00:13:39,000 --> 00:13:42,000
instead of directly computing
318
00:13:42,000 --> 00:13:44,000
the feature ID
319
00:13:44,000 --> 00:13:45,000
to samples data set,
320
00:13:45,000 --> 00:13:47,000
we actually generate
321
00:13:47,000 --> 00:13:49,000
two data sets here.
322
00:13:49,000 --> 00:13:50,000
First is the data set
323
00:13:50,000 --> 00:13:53,000
that maps partition ID
324
00:13:53,000 --> 00:13:54,000
to a sample list.
325
00:13:54,000 --> 00:13:55,000
This can be done
326
00:13:55,000 --> 00:13:57,000
using the mapPartitionsWithIndex
327
00:13:57,000 --> 00:13:58,000
transformation
328
00:13:58,000 --> 00:14:00,000
in Spark.
329
00:14:00,000 --> 00:14:01,000
Another data set
330
00:14:01,000 --> 00:14:03,000
is a mapping from feature ID
331
00:14:03,000 --> 00:14:05,000
to a list of partition IDs.
332
00:14:05,000 --> 00:14:08,000
It is constructed through
333
00:14:08,000 --> 00:14:12,000
the inverted index approach.
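
A sketch of building these two data sets in PySpark (toy data; the real pipeline would start from the existing samples RDD):

    from pyspark import SparkContext

    sc = SparkContext.getOrCreate()

    samples_rdd = sc.parallelize([
        (1.0, [0, 3], [1.0, 2.0]),
        (0.0, [3, 4], [0.5, 1.5]),
    ], 2)

    # 1. partition_id -> list of samples in that partition
    samples_by_partition = samples_rdd.mapPartitionsWithIndex(
        lambda pid, it: [(pid, list(it))])

    # 2. feature_id -> list of partition IDs that contain it (an inverted index again,
    #    but over cheap partition IDs instead of whole samples)
    feature_to_partitions = (samples_rdd
        .mapPartitionsWithIndex(
            lambda pid, it: ((fid, pid) for s in it for fid in s[1]))
        .distinct()
        .groupByKey()
        .mapValues(list))
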
334
00:14:12,000 --> 00:14:13,000
Notably,
335
00:14:13,000 --> 00:14:14,000
rather than having
336
00:14:14,000 --> 00:14:16,000
redundant data samples,
337
00:14:16,000 --> 00:14:18,000
now we only have
338
00:14:18,000 --> 00:14:20,000
redundant partition IDs,
339
00:14:20,000 --> 00:14:22,000
which is much cheaper.
340
00:14:22,000 --> 00:14:23,000
Next,
341
00:14:23,000 --> 00:14:26,000
through the two cheaper joins
342
00:14:26,000 --> 00:14:27,000
here,
343
00:14:27,000 --> 00:14:28,000
we can obtain
344
00:14:28,000 --> 00:14:30,000
the data set
345
00:14:30,000 --> 00:14:31,000
that we want
346
00:14:31,000 --> 00:14:32,000
and be able to calculate
347
00:14:32,000 --> 00:14:34,000
gradient descent locally.
348
00:14:34,000 --> 00:14:35,000
So,
349
00:14:35,000 --> 00:14:36,000
for example,
350
00:14:36,000 --> 00:14:37,000
we can map
351
00:14:37,000 --> 00:14:38,000
partition IDs
352
00:14:38,000 --> 00:14:39,000
in each partition
353
00:14:39,000 --> 00:14:40,000
or locally
354
00:14:40,000 --> 00:14:41,000
(we mean
355
00:14:41,000 --> 00:14:42,000
like locally
356
00:14:42,000 --> 00:14:43,000
in the executors)
357
00:14:43,000 --> 00:14:44,000
when possible.
358
00:14:44,000 --> 00:14:45,000
Here is what one candidate
359
00:14:45,000 --> 00:14:46,000
RDD
360
00:14:46,000 --> 00:14:47,000
might look like.
361
00:14:47,000 --> 00:14:48,000
It's like
362
00:14:48,000 --> 00:14:49,000
a mapping
363
00:14:49,000 --> 00:14:50,000
from partition ID
364
00:14:50,000 --> 00:14:51,000
to a tuple
365
00:14:51,000 --> 00:14:52,000
containing
366
00:14:52,000 --> 00:14:53,000
a weight list
367
00:14:53,000 --> 00:14:54,000
and a sample list.
368
00:14:54,000 --> 00:14:55,000
And it's your job
369
00:14:55,000 --> 00:14:56,000
to think about
370
00:14:56,000 --> 00:14:57,000
and figure out
371
00:14:57,000 --> 00:14:58,000
how to use
372
00:14:58,000 --> 00:14:59,000
join operations
373
00:14:59,000 --> 00:15:00,000
to achieve
374
00:15:00,000 --> 00:15:01,000
this task.
375
00:15:01,000 --> 00:15:02,000
Think about
376
00:15:02,000 --> 00:15:03,000
how you can
377
00:15:03,000 --> 00:15:04,000
use
378
00:15:04,000 --> 00:15:07,000
these two RDDs,
379
00:15:07,000 --> 00:15:09,000
the samples partition RDD
380
00:15:09,000 --> 00:15:12,000
and the feature partition RDD,
381
00:15:12,000 --> 00:15:13,000
together with
382
00:15:13,000 --> 00:15:15,000
other existing RDDs
383
00:15:15,000 --> 00:15:17,000
like the weights RDD,
384
00:15:17,000 --> 00:15:18,000
to derive
385
00:15:18,000 --> 00:15:21,000
the weight sample RDD.
386
00:15:21,000 --> 00:15:22,000
Brainstorm
387
00:15:22,000 --> 00:15:23,000
and think
388
00:15:23,000 --> 00:15:24,000
about
389
00:15:24,000 --> 00:15:26,000
how you can use
390
00:15:26,000 --> 00:15:27,000
the inverted index and
391
00:15:27,000 --> 00:15:29,000
join operations
392
00:15:29,000 --> 00:15:31,000
to join on keys
393
00:15:31,000 --> 00:15:32,000
and eventually
394
00:15:32,000 --> 00:15:33,000
get to this step.
395
00:15:34,000 --> 00:15:35,000
Furthermore,
396
00:15:35,000 --> 00:15:36,000
as for
397
00:15:36,000 --> 00:15:37,000
RDD partitioning
398
00:15:37,000 --> 00:15:38,000
and join optimization,
399
00:15:38,000 --> 00:15:39,000
we have
400
00:15:39,000 --> 00:15:41,000
some suggestions here.
401
00:15:41,000 --> 00:15:42,000
First,
402
00:15:42,000 --> 00:15:43,000
we would like
403
00:15:43,000 --> 00:15:44,000
to introduce
404
00:15:44,000 --> 00:15:47,000
some RDD partitioning operations.
405
00:15:47,000 --> 00:15:48,000
So,
406
00:15:48,000 --> 00:15:49,000
for
407
00:15:49,000 --> 00:15:50,000
partition-level
408
00:15:50,000 --> 00:15:51,000
map
409
00:15:51,000 --> 00:15:52,000
and reduce
410
00:15:52,000 --> 00:15:53,000
tasks,
411
00:15:53,000 --> 00:15:54,000
you might want
412
00:15:54,000 --> 00:15:55,000
to take a look
413
00:15:55,000 --> 00:15:56,000
at mapPartitions
414
00:15:56,000 --> 00:15:58,000
and mapPartitionsWithIndex,
415
00:15:58,000 --> 00:15:59,000
and also
416
00:15:59,000 --> 00:16:01,000
glom operations.
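
For reference, tiny usage examples of these operations in PySpark:

    from pyspark import SparkContext

    sc = SparkContext.getOrCreate()
    rdd = sc.parallelize(range(8), 4)

    # mapPartitions: process one whole partition at a time
    per_partition_sums = rdd.mapPartitions(lambda it: [sum(it)])

    # mapPartitionsWithIndex: same, but the partition ID is also available
    tagged = rdd.mapPartitionsWithIndex(lambda pid, it: [(pid, list(it))])

    # glom: turn each partition into a single list element
    as_lists = rdd.glom()        # collect() would give [[0, 1], [2, 3], [4, 5], [6, 7]]
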
417
00:16:01,000 --> 00:16:02,000
And for
418
00:16:02,000 --> 00:16:03,000
repartition operations,
419
00:16:03,000 --> 00:16:04,000
you might want
420
00:16:04,000 --> 00:16:05,000
to take a look
421
00:16:05,000 --> 00:16:06,000
at repartition,
422
00:16:06,000 --> 00:16:07,000
and
423
00:16:07,000 --> 00:16:08,000
partitionBy
424
00:16:08,000 --> 00:16:09,000
is useful.
425
00:16:10,000 --> 00:16:11,000
Second point
426
00:16:11,000 --> 00:16:12,000
is that
427
00:16:12,000 --> 00:16:13,000
before doing
428
00:16:13,000 --> 00:16:14,000
a join operation,
429
00:16:14,000 --> 00:16:15,000
if you do
430
00:16:15,000 --> 00:16:16,000
a partitionBy (partition
431
00:16:16,000 --> 00:16:18,000
by key) operation first,
432
00:16:18,000 --> 00:16:19,000
then the
433
00:16:19,000 --> 00:16:20,000
join operation
434
00:16:20,000 --> 00:16:21,000
might be cheaper.
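
For instance, a small PySpark sketch (the partition count is a placeholder to be tuned):

    from pyspark import SparkContext

    sc = SparkContext.getOrCreate()

    n = 4
    a = sc.parallelize([(i, i) for i in range(100)]).partitionBy(n)
    b = sc.parallelize([(i, -i) for i in range(100)]).partitionBy(n)

    # Both sides now use the same hash partitioner, so the join should not need to
    # re-shuffle them (the Spark UI can confirm what actually happens).
    joined = a.join(b, n)
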
435
00:16:23,000 --> 00:16:24,000
One interesting
436
00:16:24,000 --> 00:16:25,000
thinking question
437
00:16:25,000 --> 00:16:26,000
is this:
438
00:16:26,000 --> 00:16:27,000
can you partition
439
00:16:27,000 --> 00:16:28,000
the data set once,
440
00:16:28,000 --> 00:16:29,000
avoid disturbing
441
00:16:29,000 --> 00:16:30,000
that partitioning,
442
00:16:30,000 --> 00:16:31,000
and reuse
443
00:16:31,000 --> 00:16:32,000
the data set
444
00:16:32,000 --> 00:16:33,000
for multiple joins
445
00:16:33,000 --> 00:16:34,000
so that it saves
446
00:16:34,000 --> 00:16:35,000
you the
447
00:16:35,000 --> 00:16:36,000
shuffle cost.
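
A sketch of that idea (PySpark, toy data):

    from pyspark import SparkContext

    sc = SparkContext.getOrCreate()

    n = 4
    base = (sc.parallelize([(i, i * i) for i in range(100)])
              .partitionBy(n)
              .cache())                      # partition once and keep the result around

    other1 = sc.parallelize([(i, "a") for i in range(100)]).partitionBy(n)
    other2 = sc.parallelize([(i, "b") for i in range(100)]).partitionBy(n)

    # Reusing the already-partitioned, cached RDD for several joins avoids
    # re-shuffling it each time.
    j1 = base.join(other1, n)
    j2 = base.join(other2, n)
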
448
00:16:36,000 --> 00:16:37,000
The third point
449
00:16:37,000 --> 00:16:38,000
is that
450
00:16:38,000 --> 00:16:39,000
you might need
451
00:16:39,000 --> 00:16:40,000
to
452
00:16:40,000 --> 00:16:41,000
think about
453
00:16:41,000 --> 00:16:42,000
the number
454
00:16:42,000 --> 00:16:43,000
of partitions and
455
00:16:43,000 --> 00:16:44,000
its relationship
456
00:16:44,000 --> 00:16:45,000
with the number
457
00:16:45,000 --> 00:16:46,000
of features
458
00:16:46,000 --> 00:16:47,000
and number
459
00:16:47,000 --> 00:16:48,000
of cores
460
00:16:48,000 --> 00:16:49,000
so that
461
00:16:49,000 --> 00:16:50,000
each record
462
00:16:50,000 --> 00:16:51,000
can fit
463
00:16:51,000 --> 00:16:52,000
in memory
464
00:16:52,000 --> 00:16:53,000
and the
465
00:16:53,000 --> 00:16:54,000
join operation
466
00:16:54,000 --> 00:16:55,000
can have
467
00:16:55,000 --> 00:16:56,000
good
468
00:16:56,000 --> 00:16:57,000
performance.
469
00:16:57,000 --> 00:16:58,000
One trade-off
470
00:16:58,000 --> 00:16:59,000
here is that
471
00:16:59,000 --> 00:17:00,000
if we
472
00:17:00,000 --> 00:17:01,000
have a higher
473
00:17:01,000 --> 00:17:02,000
number
474
00:17:02,000 --> 00:17:03,000
of partitions,
475
00:17:03,000 --> 00:17:04,000
then the
476
00:17:04,000 --> 00:17:05,000
join operation
477
00:17:05,000 --> 00:17:06,000
would be slower
478
00:17:06,000 --> 00:17:07,000
but it will use
479
00:17:07,000 --> 00:17:08,000
less memory.
480
00:17:08,000 --> 00:17:09,000
However,
481
00:17:09,000 --> 00:17:10,000
if we have
482
00:17:10,000 --> 00:17:11,000
fewer
483
00:17:11,000 --> 00:17:12,000
partitions,
484
00:17:12,000 --> 00:17:13,000
the join operation
485
00:17:13,000 --> 00:17:14,000
would be quicker
486
00:17:14,000 --> 00:17:15,000
but it will use
487
00:17:15,000 --> 00:17:16,000
a lot of memory
488
00:17:16,000 --> 00:17:17,000
and may even
489
00:17:17,000 --> 00:17:18,000
cause
490
00:17:18,000 --> 00:17:19,000
out-of-memory
491
00:17:19,000 --> 00:17:20,000
errors.
492
00:17:22,000 --> 00:17:23,000
So,
493
00:17:23,000 --> 00:17:24,000
explore
494
00:17:24,000 --> 00:17:25,000
and evaluate
495
00:17:25,000 --> 00:17:26,000
the impact
496
00:17:26,000 --> 00:17:27,000
yourself
497
00:17:27,000 --> 00:17:28,000
and figure out
498
00:17:28,000 --> 00:17:29,000
a general
499
00:17:29,000 --> 00:17:30,000
solution
500
00:17:30,000 --> 00:17:31,000
to decide
501
00:17:31,000 --> 00:17:32,000
the number
502
00:17:32,000 --> 00:17:33,000
of partitions based on the number of features
503
00:17:33,000 --> 00:17:34,000
and the number
504
00:17:34,000 --> 00:17:35,000
of cores.
505
00:17:35,000 --> 00:17:36,000
And please pay
506
00:17:36,000 --> 00:17:37,000
attention that
507
00:17:37,000 --> 00:17:38,000
we
508
00:17:38,000 --> 00:17:39,000
do not
509
00:17:39,000 --> 00:17:40,000
allow hard-coding
510
00:17:40,000 --> 00:17:41,000
values
511
00:17:41,000 --> 00:17:42,000
or making
512
00:17:42,000 --> 00:17:43,000
separate
513
00:17:43,000 --> 00:17:44,000
submissions
514
00:17:44,000 --> 00:17:45,000
for each
515
00:17:45,000 --> 00:17:46,000
test case.
516
00:17:46,000 --> 00:17:47,000
For more
517
00:17:47,000 --> 00:17:48,000
information,
518
00:17:48,000 --> 00:17:49,000
please refer
519
00:17:49,000 --> 00:17:50,000
to the
520
00:17:50,000 --> 00:17:51,000
write-up.
521
00:17:51,000 --> 00:17:52,000
In addition,
522
00:17:52,000 --> 00:17:53,000
you might
523
00:17:53,000 --> 00:17:54,000
always
524
00:17:54,000 --> 00:17:55,000
want
525
00:17:55,000 --> 00:17:56,000
to
526
00:17:56,000 --> 00:17:57,000
use
527
00:17:57,000 --> 00:17:58,000
the same
528
00:17:58,000 --> 00:17:59,000
number
529
00:17:59,000 --> 00:18:00,000
of partitions
530
00:18:00,000 --> 00:18:01,000
in your
531
00:18:01,000 --> 00:18:02,000
test cases.
532
00:18:02,000 --> 00:18:03,000
In this
533
00:18:03,000 --> 00:18:04,000
example,
534
00:18:04,000 --> 00:18:05,000
there are
535
00:18:05,000 --> 00:18:06,000
at least
536
00:18:06,000 --> 00:18:07,000
two possible
537
00:18:07,000 --> 00:18:08,000
cases where
538
00:18:08,000 --> 00:18:09,000
the output
539
00:18:09,000 --> 00:18:10,000
size might
540
00:18:10,000 --> 00:18:11,000
blow up.
541
00:18:11,000 --> 00:18:12,000
First is
542
00:18:12,000 --> 00:18:13,000
the output
543
00:18:13,000 --> 00:18:14,000
partitions
544
00:18:14,000 --> 00:18:15,000
of the join
545
00:18:15,000 --> 00:18:16,000
operation.
546
00:18:16,000 --> 00:18:17,000
Another is
547
00:18:17,000 --> 00:18:18,000
the output
548
00:18:18,000 --> 00:18:19,000
of the
549
00:18:19,000 --> 00:18:20,000
reduceByKey
550
00:18:20,000 --> 00:18:21,000
operation.
551
00:18:21,000 --> 00:18:22,000
In both
552
00:18:22,000 --> 00:18:23,000
cases,
553
00:18:23,000 --> 00:18:24,000
you might
554
00:18:24,000 --> 00:18:25,000
want to
555
00:18:25,000 --> 00:18:26,000
try using
556
00:18:26,000 --> 00:18:27,000
repartition
557
00:18:27,000 --> 00:18:28,000
or the partitionBy
558
00:18:28,000 --> 00:18:29,000
operation
559
00:18:29,000 --> 00:18:30,000
to maintain
560
00:18:30,000 --> 00:18:31,000
the partitioning of the
561
00:18:31,000 --> 00:18:32,000
result.
562
00:18:32,000 --> 00:18:33,000
In this
563
00:18:33,000 --> 00:18:34,000
example,
564
00:18:34,000 --> 00:18:35,000
you might
565
00:18:35,000 --> 00:18:36,000
be able
566
00:18:36,000 --> 00:18:37,000
to find
567
00:18:37,000 --> 00:18:38,000
situations
568
00:18:38,000 --> 00:18:39,000
where
569
00:18:39,000 --> 00:18:40,000
RDDs
570
00:18:40,000 --> 00:18:41,000
can be
571
00:18:41,000 --> 00:18:42,000
reused
572
00:18:42,000 --> 00:18:43,000
across
573
00:18:43,000 --> 00:18:44,000
iterations.
574
00:18:44,000 --> 00:18:45,000
For example,
575
00:18:45,000 --> 00:18:46,000
the samples
576
00:18:46,000 --> 00:18:47,000
RDD
577
00:18:47,000 --> 00:18:48,000
and the
578
00:18:48,000 --> 00:18:49,000
feature
579
00:18:49,000 --> 00:18:50,000
indices
580
00:18:50,000 --> 00:18:51,000
are
581
00:18:51,000 --> 00:18:52,000
read-only
582
00:18:52,000 --> 00:18:53,000
data
583
00:18:53,000 --> 00:18:54,000
without
584
00:18:54,000 --> 00:18:55,000
any
585
00:18:55,000 --> 00:18:56,000
updates.
586
00:18:56,000 --> 00:18:57,000
So you
587
00:18:57,000 --> 00:18:58,000
can
588
00:18:58,000 --> 00:18:59,000
consider caching them.
589
00:18:59,000 --> 00:19:00,000
Those were most of
590
00:19:00,000 --> 00:19:01,000
the
591
00:19:01,000 --> 00:19:02,000
important
592
00:19:02,000 --> 00:19:03,000
hints
593
00:19:03,000 --> 00:19:04,000
for P2-3.
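
As a one-line illustration of that caching suggestion (persist with an explicit storage level is an alternative to cache):

    from pyspark import SparkContext

    sc = SparkContext.getOrCreate()

    samples_rdd = sc.parallelize([(1.0, [0, 3], [1.0, 2.0])], 4)
    # Read-only data reused by every iteration can be cached once up front.
    samples_rdd = samples_rdd.cache()
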
594
00:19:04,000 --> 00:19:05,000
I'd
595
00:19:05,000 --> 00:19:06,000
like to
596
00:19:06,000 --> 00:19:07,000
give a
597
00:19:07,000 --> 00:19:08,000
brief
598
00:19:08,000 --> 00:19:09,000
introduction
599
00:19:09,000 --> 00:19:10,000
to RDD
600
00:19:10,000 --> 00:19:11,000
join
601
00:19:11,000 --> 00:19:12,000
operators.
602
00:19:12,000 --> 00:19:13,000
From a
603
00:19:13,000 --> 00:19:14,000
logical
604
00:19:14,000 --> 00:19:15,000
view,
605
00:19:15,000 --> 00:19:16,000
the use
606
00:19:16,000 --> 00:19:17,000
of RDD
607
00:19:17,000 --> 00:19:18,000
join
608
00:19:18,000 --> 00:19:19,000
operation
609
00:19:19,000 --> 00:19:20,000
is for
610
00:19:20,000 --> 00:19:21,000
combining
611
00:19:21,000 --> 00:19:22,000
two key
612
00:19:22,000 --> 00:19:23,000
value
613
00:19:23,000 --> 00:19:24,000
datasets
614
00:19:24,000 --> 00:19:25,000
together
615
00:19:25,000 --> 00:19:26,000
so that
616
00:19:26,000 --> 00:19:27,000
pairs
617
00:19:27,000 --> 00:19:28,000
with the same
618
00:19:28,000 --> 00:19:29,000
key
619
00:19:29,000 --> 00:19:30,000
can
620
00:19:30,000 --> 00:19:31,000
be
621
00:19:31,000 --> 00:19:32,000
combined
622
00:19:32,000 --> 00:19:33,000
locally.
623
00:19:33,000 --> 00:19:34,000
On the
624
00:19:34,000 --> 00:19:35,000
other hand,
625
00:19:35,000 --> 00:19:36,000
from an
626
00:19:36,000 --> 00:19:37,000
implementation
627
00:19:37,000 --> 00:19:38,000
perspective,
628
00:19:38,000 --> 00:19:39,000
the goal
629
00:19:39,000 --> 00:19:40,000
of the
630
00:19:40,000 --> 00:19:41,000
Spark RDD
631
00:19:41,000 --> 00:19:42,000
join
632
00:19:42,000 --> 00:19:43,000
operator
633
00:19:43,000 --> 00:19:44,000
is to
634
00:19:44,000 --> 00:19:45,000
bring
635
00:19:45,000 --> 00:19:46,000
corresponding
636
00:19:46,000 --> 00:19:47,000
keys
637
00:19:47,000 --> 00:19:48,000
from
638
00:19:48,000 --> 00:19:49,000
each
639
00:19:49,000 --> 00:19:50,000
RDD
640
00:19:50,000 --> 00:19:51,000
to
641
00:19:51,000 --> 00:19:52,000
the
642
00:19:52,000 --> 00:19:53,000
same
643
00:19:53,000 --> 00:19:54,000
partition
644
00:19:54,000 --> 00:19:55,000
so that
645
00:19:55,000 --> 00:19:56,000
they
646
00:19:56,000 --> 00:19:57,000
can
647
00:19:57,000 --> 00:19:58,000
be
648
00:19:58,000 --> 00:19:59,000
combined
649
00:19:59,000 --> 00:20:00,000
in
650
00:20:00,000 --> 00:20:01,000
the
651
00:20:01,000 --> 00:20:02,000
RDD
652
00:20:02,000 --> 00:20:03,000
join
653
00:20:03,000 --> 00:20:04,000
operation.
654
00:20:04,000 --> 00:20:05,000
For example,
655
00:20:05,000 --> 00:20:06,000
if RDD-A
656
00:20:06,000 --> 00:20:07,000
and
657
00:20:07,000 --> 00:20:08,000
RDD-B
658
00:20:08,000 --> 00:20:09,000
don't
659
00:20:09,000 --> 00:20:10,000
have a
660
00:20:10,000 --> 00:20:11,000
partitioner,
661
00:20:11,000 --> 00:20:12,000
we'll
662
00:20:12,000 --> 00:20:13,000
have
663
00:20:13,000 --> 00:20:14,000
two
664
00:20:14,000 --> 00:20:15,000
shuffle
665
00:20:15,000 --> 00:20:16,000
operations
666
00:20:16,000 --> 00:20:17,000
here.
667
00:20:17,000 --> 00:20:18,000
However,
668
00:20:18,000 --> 00:20:19,000
if we
669
00:20:19,000 --> 00:20:20,000
specify
670
00:20:20,000 --> 00:20:21,000
the
671
00:20:21,000 --> 00:20:22,000
partitioner
672
00:20:22,000 --> 00:20:23,000
before
673
00:20:23,000 --> 00:20:24,000
doing the
674
00:20:24,000 --> 00:20:25,000
join
675
00:20:25,000 --> 00:20:26,000
operation,
676
00:20:26,000 --> 00:20:27,000
we
677
00:20:27,000 --> 00:20:28,000
can
678
00:20:28,000 --> 00:20:29,000
use
679
00:20:29,000 --> 00:20:30,000
the
680
00:20:30,000 --> 00:20:31,000
hash
681
00:20:31,000 --> 00:20:32,000
function,
682
00:20:32,000 --> 00:20:33,000
which
683
00:20:33,000 --> 00:20:34,000
will
684
00:20:34,000 --> 00:20:35,000
use
685
00:20:35,000 --> 00:20:36,000
the
686
00:20:36,000 --> 00:20:37,000
default
687
00:20:37,000 --> 00:20:38,000
hash
688
00:20:38,000 --> 00:20:39,000
partition
689
00:20:39,000 --> 00:20:40,000
algorithm
690
00:20:40,000 --> 00:20:41,000
to
691
00:20:41,000 --> 00:20:42,000
distribute
692
00:20:42,000 --> 00:20:43,000
key
693
00:20:43,000 --> 00:20:44,000
value
694
00:20:44,000 --> 00:20:45,000
pairs.
695
00:20:45,000 --> 00:20:46,000
Another way
696
00:20:46,000 --> 00:20:47,000
to specify
697
00:20:47,000 --> 00:20:48,000
the
698
00:20:48,000 --> 00:20:49,000
partitioner
699
00:20:49,000 --> 00:20:50,000
is
700
00:20:50,000 --> 00:20:51,000
using
701
00:20:51,000 --> 00:20:52,000
the
702
00:20:52,000 --> 00:20:53,000
partitionBy
703
00:20:53,000 --> 00:20:54,000
operation,
704
00:20:54,000 --> 00:20:55,000
in
705
00:20:55,000 --> 00:20:56,000
which
706
00:20:56,000 --> 00:20:58,000
only
707
00:20:58,000 --> 00:20:59,000
one
708
00:20:59,000 --> 00:21:00,000
shuffle
709
00:21:00,000 --> 00:21:01,000
operation
710
00:21:01,000 --> 00:21:02,000
is
711
00:21:02,000 --> 00:21:03,000
required.
712
00:21:03,000 --> 00:21:04,000
In addition,
713
00:21:04,000 --> 00:21:05,000
if both
714
00:21:05,000 --> 00:21:06,000
RDDs
715
00:21:06,000 --> 00:21:07,000
use
716
00:21:07,000 --> 00:21:08,000
the
717
00:21:08,000 --> 00:21:09,000
same
718
00:21:09,000 --> 00:21:10,000
partitioner,
719
00:21:10,000 --> 00:21:11,000
which
720
00:21:11,000 --> 00:21:12,000
means
721
00:21:12,000 --> 00:21:13,000
firstly,
722
00:21:13,000 --> 00:21:14,000
they'll
723
00:21:14,000 --> 00:21:15,000
have
724
00:21:15,000 --> 00:21:16,000
the
725
00:21:16,000 --> 00:21:17,000
same
726
00:21:17,000 --> 00:21:18,000
number
727
00:21:18,000 --> 00:21:19,000
of
728
00:21:19,000 --> 00:21:20,000
partitions,
729
00:21:20,000 --> 00:21:21,000
and
730
00:21:21,000 --> 00:21:22,000
secondly,
731
00:21:22,000 --> 00:21:23,000
they'll
732
00:21:23,000 --> 00:21:24,000
use
733
00:21:24,000 --> 00:21:25,000
the
734
00:21:25,000 --> 00:21:26,000
same
735
00:21:26,000 --> 00:21:27,000
partition
736
00:21:27,000 --> 00:21:28,000
function
737
00:21:28,000 --> 00:21:29,000
passed to
738
00:21:29,000 --> 00:21:30,000
partitionBy,
739
00:21:30,000 --> 00:21:31,000
which
740
00:21:31,000 --> 00:21:32,000
saves
741
00:21:32,000 --> 00:21:33,000
you
742
00:21:33,000 --> 00:21:34,000
a lot
743
00:21:34,000 --> 00:21:35,000
of
744
00:21:35,000 --> 00:21:36,000
shuffle
745
00:21:36,000 --> 00:21:37,000
cost.
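
To make these situations concrete, here is a PySpark sketch; whether a shuffle is actually avoided can be verified in the Spark UI:

    from pyspark import SparkContext

    sc = SparkContext.getOrCreate()

    n = 4
    a = sc.parallelize([(i, i) for i in range(100)])
    b = sc.parallelize([(i, -i) for i in range(100)])

    # Neither side has a partitioner: the join shuffles both inputs.
    j1 = a.join(b, n)

    # One side pre-partitioned: roughly one of the two shuffles can be avoided.
    a_part = a.partitionBy(n)
    j2 = a_part.join(b, n)

    # Both sides share the same partitioner (same number of partitions, same
    # partition function): the join itself should need no extra shuffle.
    b_part = b.partitionBy(n)
    j3 = a_part.join(b_part, n)
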
746
00:21:37,000 --> 00:21:38,000
And
747
00:21:38,000 --> 00:21:39,000
also,
748
00:21:39,000 --> 00:21:40,000
some
749
00:21:40,000 --> 00:21:41,000
other
750
00:21:41,000 --> 00:21:42,000
general
751
00:21:42,000 --> 00:21:43,000
optimizations
752
00:21:43,000 --> 00:21:44,000
for
753
00:21:44,000 --> 00:21:45,000
RDD
754
00:21:45,000 --> 00:21:46,000
join
755
00:21:46,000 --> 00:21:47,000
operator
756
00:21:47,000 --> 00:21:48,000
are as
757
00:21:48,000 --> 00:21:49,000
follows.
758
00:21:49,000 --> 00:21:50,000
Note that
759
00:21:50,000 --> 00:21:51,000
you
760
00:21:51,000 --> 00:21:52,000
might
761
00:21:52,000 --> 00:21:53,000
not
762
00:21:53,000 --> 00:21:54,000
need
763
00:21:54,000 --> 00:21:55,000
to
764
00:21:55,000 --> 00:21:56,000
do
765
00:21:56,000 --> 00:21:57,000
a
766
00:21:57,000 --> 00:21:58,000
join
767
00:21:58,000 --> 00:21:59,000
operation
768
00:21:59,000 --> 00:22:00,000
to
769
00:22:00,000 --> 00:22:01,000
reduce
770
00:22:01,000 --> 00:22:02,000
the
771
00:22:02,000 --> 00:22:03,000
result
772
00:22:03,000 --> 00:22:04,000
size
773
00:22:04,000 --> 00:22:05,000
for
774
00:22:05,000 --> 00:22:06,000
the
775
00:22:06,000 --> 00:22:07,000
join
776
00:22:07,000 --> 00:22:08,000
operation.
777
00:22:08,000 --> 00:22:09,000
Secondly,
778
00:22:09,000 --> 00:22:10,000
pre-partitioning
779
00:22:10,000 --> 00:22:11,000
with a
780
00:22:11,000 --> 00:22:12,000
known partitioner
781
00:22:12,000 --> 00:22:13,000
can
782
00:22:13,000 --> 00:22:14,000
reduce
783
00:22:14,000 --> 00:22:15,000
the
784
00:22:15,000 --> 00:22:16,000
cost
785
00:22:16,000 --> 00:22:17,000
from
786
00:22:17,000 --> 00:22:18,000
shuffle
787
00:22:18,000 --> 00:22:19,000
operations.
788
00:22:19,000 --> 00:22:20,000
Besides,
789
00:22:20,000 --> 00:22:21,000
a common
790
00:22:21,000 --> 00:22:22,000
strategy
791
00:22:22,000 --> 00:22:23,000
is
792
00:22:23,000 --> 00:22:24,000
to
793
00:22:24,000 --> 00:22:25,000
test
794
00:22:25,000 --> 00:22:26,000
one
795
00:22:26,000 --> 00:22:27,000
iteration
796
00:22:27,000 --> 00:22:28,000
at
797
00:22:28,000 --> 00:22:29,000
a
798
00:22:29,000 --> 00:22:30,000
time
799
00:22:30,000 --> 00:22:31,000
or
800
00:22:31,000 --> 00:22:32,000
two
801
00:22:32,000 --> 00:22:33,000
iterations
802
00:22:33,000 --> 00:22:34,000
to
803
00:22:34,000 --> 00:22:35,000
have
804
00:22:35,000 --> 00:22:36,000
an
805
00:22:36,000 --> 00:22:37,000
estimate
806
00:22:37,000 --> 00:22:38,000
about
807
00:22:38,000 --> 00:22:39,000
what
808
00:22:39,000 --> 00:22:40,000
your
809
00:22:40,000 --> 00:22:41,000
runtime
810
00:22:41,000 --> 00:22:42,000
for
811
00:22:42,000 --> 00:22:43,000
the
812
00:22:43,000 --> 00:22:44,000
whole
813
00:22:44,000 --> 00:22:45,000
test
814
00:22:45,000 --> 00:22:46,000
cases will be.
815
00:22:46,000 --> 00:22:47,000
Subsequent
816
00:22:47,000 --> 00:22:48,000
iterations
817
00:22:48,000 --> 00:22:49,000
are
818
00:22:49,000 --> 00:22:50,000
about
819
00:22:50,000 --> 00:22:51,000
20%
820
00:22:51,000 --> 00:22:52,000
cheaper.
821
00:22:52,000 --> 00:22:53,000
It's
822
00:22:53,000 --> 00:22:54,000
especially
823
00:22:54,000 --> 00:22:55,000
useful
824
00:22:55,000 --> 00:22:56,000
for
825
00:22:56,000 --> 00:22:57,000
test
826
00:22:57,000 --> 00:22:58,000
cases.
827
00:22:58,000 --> 00:22:59,000
And always,
828
00:22:59,000 --> 00:23:00,000
please
829
00:23:00,000 --> 00:23:01,000
start
830
00:23:01,000 --> 00:23:02,000
early.
831
00:23:02,000 --> 00:23:03,000
Yeah.
832
00:23:03,000 --> 00:23:04,000
That's
833
00:23:04,000 --> 00:23:05,000
all for
834
00:23:05,000 --> 00:23:06,000
the
835
00:23:06,000 --> 00:23:07,000
hints
836
00:23:07,000 --> 00:23:08,000
video.
837
00:23:08,000 --> 00:23:09,000
Hope you
838
00:23:09,000 --> 00:23:10,000
enjoyed
839
00:23:10,000 --> 00:23:11,000
this project
840
00:23:11,000 --> 00:23:12,000
and best
841
00:23:12,000 --> 00:23:13,000
wishes.
842
00:23:13,000 --> 00:23:14,000
Thank you.