1
00:00:00,000 --> 00:00:09,000
Hi everyone, welcome to the HINTS video for Project 2.3, Iterative Machine Learning Training.
2
00:00:09,000 --> 00:00:13,000
In this video, I will walk you through the following topics.
3
00:00:13,000 --> 00:00:17,000
First, let's take a look at the starter code in DABS.
4
00:00:17,000 --> 00:00:23,000
We'll figure out the format of the samples RDD and the rationale behind using sparse format.
5
00:00:23,000 --> 00:00:27,000
Then we'll focus on the scalability problem of the starter code
6
00:00:27,000 --> 00:00:34,000
and think about how we can store the weights array in the worker nodes rather than purely in the driver node.
7
00:00:34,000 --> 00:00:41,000
And I will give the first hint about how to use an inverted index and join-based communication.
8
00:00:41,000 --> 00:00:47,000
Then we will talk about the performance problem of the draft optimization
9
00:00:47,000 --> 00:00:52,000
and how to utilize partitions to optimize join operators.
10
00:00:52,000 --> 00:00:56,000
I will give an introduction to RDD join operators after that.
11
00:00:56,000 --> 00:00:57,000
At last,
12
00:00:57,000 --> 00:01:01,000
I will give some further tips that we want to share with you guys.
13
00:01:01,000 --> 00:01:02,000
Yeah.
14
00:01:02,000 --> 00:01:04,000
OK, so let's get started.
15
00:01:04,000 --> 00:01:07,000
As you have watched in the overview video,
16
00:01:07,000 --> 00:01:11,000
this is the core logic of our machine learning training.
17
00:01:11,000 --> 00:01:15,000
We have data samples and model parameters as inputs.
18
00:01:15,000 --> 00:01:22,000
The data samples are stored as an RDD, partitioned across executors.
19
00:01:22,000 --> 00:01:26,000
The model parameters are stored as a weight array in the driver.
20
00:01:26,000 --> 00:01:29,000
Next is the training iteration.
21
00:01:29,000 --> 00:01:34,000
Specifically, the driver will broadcast the weights to all executors.
22
00:01:34,000 --> 00:01:42,000
Then each executor will compute the gradient updates with respect to the data partition that resides in that executor.
23
00:01:42,000 --> 00:01:50,000
Next, the driver will collect the updates from the executors and update the parameters locally.
24
00:01:50,000 --> 00:01:55,000
Then the iteration will repeat until the model converges.
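
As an illustration, here is a minimal sketch of that loop, assuming PySpark and a logistic-regression-style update; NUM_FEATURES, the toy samples, and compute_gradient are placeholders rather than the actual starter code.

    from pyspark import SparkContext
    import numpy as np

    sc = SparkContext.getOrCreate()

    NUM_FEATURES = 5
    weights = np.zeros(NUM_FEATURES)           # model parameters kept in the driver

    # Toy samples RDD: (label, feature_id_list, feature_value_list), partitioned across executors.
    samples_rdd = sc.parallelize(
        [(1.0, [0, 3], [1.0, 2.0]), (0.0, [1, 4], [0.5, 1.5])], 2).cache()

    def compute_gradient(sample, w):
        # Placeholder per-sample gradient (logistic regression assumed).
        label, ids, vals = sample
        pred = 1.0 / (1.0 + np.exp(-sum(w[i] * v for i, v in zip(ids, vals))))
        g = np.zeros(len(w))
        for i, v in zip(ids, vals):
            g[i] = (pred - label) * v
        return g

    for _ in range(10):                        # training iterations
        w_bc = sc.broadcast(weights)           # 1. driver broadcasts the weights
        grad = (samples_rdd                    # 2. executors compute gradients on their partitions
                .map(lambda s: compute_gradient(s, w_bc.value))
                .reduce(lambda a, b: a + b))   # 3. updates are aggregated back to the driver
        weights = weights - 0.1 * grad         # 4. driver updates the parameters locally
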
25
00:01:55,000 --> 00:02:03,000
Here we focus on the most significant computation, namely the computation of gradient descent.
26
00:02:03,000 --> 00:02:11,000
So for gradient descent, we have the broadcasted weight array and the worker's local samples as inputs.
27
00:02:11,000 --> 00:02:16,000
We need to compute prediction based on the result of the sigmoid function.
28
00:02:16,000 --> 00:02:24,000
But before that, we need to do a dot product between the broadcasted weight array and the feature values for each sample.
29
00:02:24,000 --> 00:02:32,000
Then we can use the prediction result to compute the gradient and of course the loss function.
30
00:02:32,000 --> 00:02:38,000
As we can see here, the dot product combines values from both inputs,
31
00:02:38,000 --> 00:02:45,000
therefore it's the most important computation that we will consider for optimization.
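
Concretely, the per-sample computation could look like this sketch (logistic regression with log loss is assumed here; check the starter code for the exact formulas used in the project):

    import numpy as np

    def per_sample_update(w, x, y):
        """w: dense weight array, x: dense feature values, y: 0/1 label."""
        margin = np.dot(w, x)                    # dot product of weights and feature values
        pred = 1.0 / (1.0 + np.exp(-margin))     # sigmoid prediction
        grad = (pred - y) * x                    # gradient contribution of this sample
        loss = -(y * np.log(pred) + (1 - y) * np.log(1 - pred))   # log loss
        return grad, loss

    g, l = per_sample_update(np.array([0.1, 0.0, 0.2]), np.array([1.0, 0.0, 3.0]), 1.0)
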
32
00:02:45,000 --> 00:02:47,000
Let's go through an example.
33
00:02:47,000 --> 00:02:52,000
Here we have five features.
34
00:02:52,000 --> 00:02:59,000
Therefore the weight array has five elements, and each sample also has five.
35
00:02:59,000 --> 00:03:02,000
By taking their dot product, we can get a result.
36
00:03:02,000 --> 00:03:11,000
However, it's clear that since there are a lot of zeros, as you can see here in each sample,
37
00:03:11,000 --> 00:03:14,000
the final result also contains a lot of zeros.
38
00:03:14,000 --> 00:03:19,000
This means if we use the whole sample array to do the dot product,
39
00:03:19,000 --> 00:03:22,000
then there will be a lot of unnecessary computation.
40
00:03:22,000 --> 00:03:26,000
This is especially true in real-world training sets,
41
00:03:26,000 --> 00:03:30,000
since there are typically a lot of zero
42
00:03:30,000 --> 00:03:33,000
or null values in those samples.
43
00:03:33,000 --> 00:03:36,000
So how can we solve this problem?
44
00:03:36,000 --> 00:03:40,000
The answer is to use a sparse format instead of a dense format.
45
00:03:40,000 --> 00:03:44,000
Rather than storing the whole dense vector for each sample,
46
00:03:44,000 --> 00:03:48,000
we store those feature values that are non-zero.
47
00:03:48,000 --> 00:03:51,000
And here is a logical representation of this.
48
00:03:52,000 --> 00:03:59,000
For sparse format, we have a key-value pair list for each sample,
49
00:03:59,000 --> 00:04:04,000
and the key is the feature ID while the value is the feature's value.
50
00:04:04,000 --> 00:04:09,000
For example, here 0 represents feature 0,
51
00:04:09,000 --> 00:04:15,000
and 1 means feature 0 has a value equal to 1.
52
00:04:15,000 --> 00:04:17,000
Note that for the starter code,
53
00:04:17,000 --> 00:04:22,000
the physical representation is actually a tuple,
54
00:04:22,000 --> 00:04:24,000
with three elements,
55
00:04:24,000 --> 00:04:26,000
a label,
56
00:04:26,000 --> 00:04:29,000
a feature ID list,
57
00:04:29,000 --> 00:04:31,000
and a feature value list.
58
00:04:31,000 --> 00:04:34,000
Then, in order to do the dot product,
59
00:04:34,000 --> 00:04:37,000
we need to use the feature ID list
60
00:04:37,000 --> 00:04:40,000
to retrieve all the corresponding weights
61
00:04:40,000 --> 00:04:42,000
and do the dot product
62
00:04:42,000 --> 00:04:47,000
between them and the feature value list.
63
00:04:47,000 --> 00:04:49,000
In this way,
64
00:04:49,000 --> 00:04:51,000
the computations
65
00:04:51,000 --> 00:04:53,000
for zero products
66
00:04:53,000 --> 00:04:55,000
can be saved.
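
As a small sketch of that sparse dot product, assuming the (label, feature ID list, feature value list) tuple just described (the exact field order in the starter code may differ):

    def sparse_dot(weights, sample):
        label, feature_ids, feature_vals = sample
        # Only the non-zero positions are touched, so zero products are skipped entirely.
        return sum(weights[i] * v for i, v in zip(feature_ids, feature_vals))

    w = [0.5, 0.0, 0.25, 1.0, 0.0]
    sample = (1.0, [0, 3], [1.0, 2.0])     # feature 0 has value 1.0, feature 3 has value 2.0
    print(sparse_dot(w, sample))           # 0.5 * 1.0 + 1.0 * 2.0 = 2.5
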
67
00:04:55,000 --> 00:04:58,000
As a quick summary,
68
00:04:58,000 --> 00:05:02,000
the final samples RDD looks like this.
69
00:05:02,000 --> 00:05:03,000
For each sample,
70
00:05:03,000 --> 00:05:06,000
we will only store useful features
71
00:05:06,000 --> 00:05:09,000
and their corresponding non-zero values.
72
00:05:09,000 --> 00:05:10,000
Next,
73
00:05:10,000 --> 00:05:15,000
we will take a look at the scalability problem in the starter code.
74
00:05:15,000 --> 00:05:16,000
From a high level,
75
00:05:16,000 --> 00:05:19,000
there are two limitations in the starter code.
76
00:05:19,000 --> 00:05:20,000
First,
77
00:05:20,000 --> 00:05:22,000
there is the single-node memory limit.
78
00:05:22,000 --> 00:05:24,000
The training set might have too many features
79
00:05:24,000 --> 00:05:28,000
to fit into a single node's memory.
80
00:05:28,000 --> 00:05:29,000
Secondly,
81
00:05:29,000 --> 00:05:30,000
for communication,
82
00:05:30,000 --> 00:05:33,000
since each partition might only require
83
00:05:33,000 --> 00:05:37,000
a subset of parameters to do the dot product,
84
00:05:37,000 --> 00:05:41,000
broadcasting the whole weight array across the nodes
85
00:05:41,000 --> 00:05:42,000
might be a waste.
86
00:05:42,000 --> 00:05:43,000
So,
87
00:05:43,000 --> 00:05:46,000
how can we improve the implementation?
88
00:05:46,000 --> 00:05:49,000
The first step is to partition the weight array,
89
00:05:49,000 --> 00:05:52,000
like what we did for the sample data,
90
00:05:52,000 --> 00:05:55,000
so that we can store them across the whole cluster
91
00:05:55,000 --> 00:05:59,000
instead of only within one single node.
92
00:05:59,000 --> 00:06:00,000
Then,
93
00:06:00,000 --> 00:06:01,000
we need to figure out a way
94
00:06:01,000 --> 00:06:04,000
to tag each weight with a feature ID.
95
00:06:04,000 --> 00:06:06,000
After this step,
96
00:06:06,000 --> 00:06:12,000
we can get the weights RDD,
97
00:06:12,000 --> 00:06:15,000
in which the key is the feature ID
98
00:06:15,000 --> 00:06:18,000
and the value is a feature weight.
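
One possible way to build such a weights RDD, as a PySpark sketch (the initial values and partition count are placeholders):

    from pyspark import SparkContext

    sc = SparkContext.getOrCreate()

    initial_weights = [0.0] * 5
    # (feature_id, weight) pairs, stored across the cluster instead of only in the driver.
    weights_rdd = sc.parallelize(list(enumerate(initial_weights)), 4)
    # An equivalent route: parallelize the values and use zipWithIndex to attach feature IDs.
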
99
00:06:18,000 --> 00:06:21,000
The next step is that we need to bring the feature weights
100
00:06:21,000 --> 00:06:22,000
to each sample
101
00:06:22,000 --> 00:06:24,000
so that we can do the dot product
102
00:06:24,000 --> 00:06:26,000
and further computation.
103
00:06:26,000 --> 00:06:27,000
In general,
104
00:06:27,000 --> 00:06:31,000
the goal is to combine the samples RDD
105
00:06:31,000 --> 00:06:35,000
and the weights RDD together.
106
00:06:35,000 --> 00:06:37,000
A simple way to do this
107
00:06:37,000 --> 00:06:40,000
is to use the RDD lookup operation
108
00:06:40,000 --> 00:06:42,000
for each sample.
109
00:06:42,000 --> 00:06:45,000
Since we know each sample's relevant feature IDs,
110
00:06:45,000 --> 00:06:47,000
we can use those feature IDs
111
00:06:47,000 --> 00:06:52,000
to fetch each weight from the weights RDD
112
00:06:52,000 --> 00:06:54,000
and then do the computation.
113
00:06:54,000 --> 00:06:55,000
However,
114
00:06:55,000 --> 00:06:58,000
since lookup can only bring the weights
115
00:06:58,000 --> 00:06:59,000
to the driver node,
116
00:06:59,000 --> 00:07:02,000
we still need to either broadcast the weights
117
00:07:02,000 --> 00:07:03,000
to the worker nodes
118
00:07:03,000 --> 00:07:06,000
or fetch each sample to the driver,
119
00:07:06,000 --> 00:07:09,000
then do the computation.
120
00:07:09,000 --> 00:07:11,000
This is very slow.
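
To see why, here is roughly what the lookup-based approach looks like as a sketch; every lookup launches a separate job and returns its result to the driver:

    from pyspark import SparkContext

    sc = SparkContext.getOrCreate()

    weights_rdd = sc.parallelize([(0, 0.5), (1, 0.0), (2, 0.25), (3, 1.0)])
    sample = (1.0, [0, 3], [1.0, 2.0])        # one sample, held on the driver for illustration

    # Anti-pattern: one lookup (one job) per feature, with the result returned to the driver.
    label, fids, fvals = sample
    fetched = [weights_rdd.lookup(fid)[0] for fid in fids]
    margin = sum(w * v for w, v in zip(fetched, fvals))   # dot product done on the driver
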
121
00:07:11,000 --> 00:07:12,000
So,
122
00:07:12,000 --> 00:07:16,000
is there any better way to do the combination?
123
00:07:16,000 --> 00:07:17,000
Actually,
124
00:07:17,000 --> 00:07:19,000
what we want here is
125
00:07:19,000 --> 00:07:21,000
a distributed shuffle operation
126
00:07:21,000 --> 00:07:23,000
to combine the samples dataset
127
00:07:23,000 --> 00:07:25,000
and the weights dataset.
128
00:07:25,000 --> 00:07:29,000
And let's use the join operation.
129
00:07:29,000 --> 00:07:32,000
One potential issue is that
130
00:07:32,000 --> 00:07:35,000
the weights RDD
131
00:07:35,000 --> 00:07:37,000
itself
132
00:07:37,000 --> 00:07:41,000
is a KV pair,
133
00:07:41,000 --> 00:07:44,000
but the samples RDD is a list.
134
00:07:44,000 --> 00:07:45,000
To solve this,
135
00:07:45,000 --> 00:07:51,000
we will consider an inverted index.
136
00:07:51,000 --> 00:07:54,000
Here is an inverted index computation pipeline
137
00:07:54,000 --> 00:07:57,000
for the samples RDD.
138
00:07:57,000 --> 00:07:58,000
As shown here,
139
00:07:58,000 --> 00:08:01,000
we can first do a map operation
140
00:08:01,000 --> 00:08:04,000
to construct multiple key-value pairs
141
00:08:04,000 --> 00:08:06,000
for each sample
142
00:08:06,000 --> 00:08:10,000
as shown in the operation here,
143
00:08:10,000 --> 00:08:12,000
where the key is a feature ID
144
00:08:12,000 --> 00:08:14,000
and the value is a
145
00:08:14,000 --> 00:08:18,000
single-sample sparse vector.
146
00:08:18,000 --> 00:08:19,000
Note that
147
00:08:19,000 --> 00:08:21,000
since different samples might have
148
00:08:21,000 --> 00:08:24,000
the same relevant feature,
149
00:08:24,000 --> 00:08:26,000
for example,
150
00:08:26,000 --> 00:08:29,000
sample 2 will generate a pair for
151
00:08:29,000 --> 00:08:31,000
feature 3,
152
00:08:31,000 --> 00:08:33,000
and sample 3 would also generate
153
00:08:33,000 --> 00:08:35,000
a key-value pair
154
00:08:35,000 --> 00:08:39,000
for the same feature,
155
00:08:39,000 --> 00:08:42,000
that is, feature 3.
156
00:08:42,000 --> 00:08:43,000
So,
157
00:08:43,000 --> 00:08:45,000
to handle this,
158
00:08:45,000 --> 00:08:46,000
we can
159
00:08:46,000 --> 00:08:49,000
use the
160
00:08:49,000 --> 00:08:51,000
groupByKey operation
161
00:08:51,000 --> 00:08:53,000
to group those
162
00:08:53,000 --> 00:08:54,000
values
163
00:08:54,000 --> 00:08:56,000
with the same key together,
164
00:08:56,000 --> 00:08:58,000
so that we don't end up
165
00:08:58,000 --> 00:09:00,000
with duplicate keys.
166
00:09:00,000 --> 00:09:01,000
Here, for example,
167
00:09:01,000 --> 00:09:02,000
for feature 3,
168
00:09:02,000 --> 00:09:04,000
the values would be a list
169
00:09:04,000 --> 00:09:05,000
that contains
170
00:09:05,000 --> 00:09:07,000
sample 2 and sample 3.
171
00:09:11,000 --> 00:09:13,000
And at last,
172
00:09:13,000 --> 00:09:15,000
we'll have a key-value data set
173
00:09:15,000 --> 00:09:17,000
mapping from feature ID
174
00:09:17,000 --> 00:09:19,000
to a list of samples.
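
In PySpark, this inverted index could be built roughly as follows (a sketch with toy samples; flatMap plays the role of the map step described above):

    from pyspark import SparkContext

    sc = SparkContext.getOrCreate()

    # Each sample: (label, feature_id_list, feature_value_list)
    samples_rdd = sc.parallelize([
        (1.0, [0, 3], [1.0, 2.0]),      # sample 1
        (0.0, [3, 4], [0.5, 1.5]),      # sample 2
    ])

    # 1. Emit one (feature_id, sample) pair per non-zero feature of each sample.
    pairs = samples_rdd.flatMap(lambda s: [(fid, s) for fid in s[1]])

    # 2. Group pairs that share the same feature ID.
    feature_to_samples = pairs.groupByKey().mapValues(list)
    # e.g. feature 3 maps to [sample 1, sample 2]
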
175
00:09:19,000 --> 00:09:20,000
Now,
176
00:09:20,000 --> 00:09:22,000
we can do a join operation
177
00:09:22,000 --> 00:09:24,000
between the intermediate data set
178
00:09:24,000 --> 00:09:26,000
and the weights RDD,
179
00:09:26,000 --> 00:09:28,000
because they share the
180
00:09:28,000 --> 00:09:31,000
same keys.
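
The join step itself might then look like this small sketch (the contents are illustrative placeholders):

    from pyspark import SparkContext

    sc = SparkContext.getOrCreate()

    # feature_id -> list of samples (output of the inverted index step)
    feature_to_samples = sc.parallelize([(0, ["s1"]), (3, ["s1", "s2"]), (4, ["s2"])])
    # feature_id -> weight
    weights_rdd = sc.parallelize([(0, 0.5), (3, 1.0), (4, 0.0)])

    # Join on the shared key: feature_id -> (sample_list, weight)
    joined = feature_to_samples.join(weights_rdd)
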
181
00:09:31,000 --> 00:09:32,000
So,
182
00:09:32,000 --> 00:09:34,000
here is the draft data flow
183
00:09:34,000 --> 00:09:36,000
for our training task.
184
00:09:36,000 --> 00:09:38,000
As shown in this whole pipeline
185
00:09:38,000 --> 00:09:41,000
after we do the join operation,
186
00:09:41,000 --> 00:09:42,000
we will get a
187
00:09:42,000 --> 00:09:45,000
new mapping
188
00:09:45,000 --> 00:09:46,000
that is from
189
00:09:46,000 --> 00:09:47,000
feature ID
190
00:09:47,000 --> 00:09:49,000
to its weight value
191
00:09:49,000 --> 00:09:50,000
and the list of
192
00:09:50,000 --> 00:09:53,000
samples
193
00:09:53,000 --> 00:09:54,000
here.
194
00:09:54,000 --> 00:09:56,000
Note that for each key-value pair,
195
00:09:56,000 --> 00:10:00,000
we only have one feature.
196
00:10:00,000 --> 00:10:01,000
So,
197
00:10:01,000 --> 00:10:02,000
we're still not able
198
00:10:02,000 --> 00:10:04,000
to do the dot product here.
199
00:10:04,000 --> 00:10:05,000
Therefore,
200
00:10:05,000 --> 00:10:06,000
in order for
201
00:10:06,000 --> 00:10:07,000
each sample
202
00:10:07,000 --> 00:10:10,000
to get all its relevant parameter weights,
203
00:10:10,000 --> 00:10:11,000
we can do an
204
00:10:11,000 --> 00:10:14,000
inverted index again.
205
00:10:14,000 --> 00:10:15,000
That is,
206
00:10:15,000 --> 00:10:17,000
to construct a new data set
207
00:10:17,000 --> 00:10:20,000
with each sample as the key
208
00:10:20,000 --> 00:10:22,000
and
209
00:10:22,000 --> 00:10:24,000
its whole relevant weight list
210
00:10:24,000 --> 00:10:26,000
as the value.
211
00:10:26,000 --> 00:10:27,000
This way,
212
00:10:27,000 --> 00:10:29,000
each sample can update itself
213
00:10:29,000 --> 00:10:31,000
based on the corresponding
214
00:10:31,000 --> 00:10:32,000
weight list.
215
00:10:32,000 --> 00:10:33,000
Unfortunately,
216
00:10:33,000 --> 00:10:35,000
life is not easy.
217
00:10:35,000 --> 00:10:36,000
You might have noticed
218
00:10:36,000 --> 00:10:39,000
the warning sign here.
219
00:10:39,000 --> 00:10:40,000
There are several problems
220
00:10:40,000 --> 00:10:41,000
for this design.
221
00:10:41,000 --> 00:10:42,000
For one thing,
222
00:10:42,000 --> 00:10:44,000
the first inverted index
223
00:10:44,000 --> 00:10:46,000
is very expensive.
224
00:10:46,000 --> 00:10:48,000
It's possible that
225
00:10:48,000 --> 00:10:50,000
we replicate the whole
226
00:10:50,000 --> 00:10:51,000
sample data set
227
00:10:51,000 --> 00:10:52,000
multiple times.
228
00:10:52,000 --> 00:10:53,000
For example,
229
00:10:53,000 --> 00:10:55,000
if we have a hot feature
230
00:10:55,000 --> 00:10:57,000
that every sample
231
00:10:57,000 --> 00:11:00,000
has a non-zero value for,
232
00:11:00,000 --> 00:11:03,000
then for that feature's
233
00:11:03,000 --> 00:11:04,000
key-value pair,
234
00:11:04,000 --> 00:11:06,000
we have to store the whole data set
235
00:11:06,000 --> 00:11:07,000
in one list,
236
00:11:07,000 --> 00:11:09,000
which is clearly not acceptable.
237
00:11:10,000 --> 00:11:12,000
For another,
238
00:11:12,000 --> 00:11:13,000
the join operation
239
00:11:13,000 --> 00:11:15,000
is also very expensive,
240
00:11:15,000 --> 00:11:17,000
since the number of features
241
00:11:17,000 --> 00:11:19,000
might be huge,
242
00:11:19,000 --> 00:11:20,000
in which case,
243
00:11:20,000 --> 00:11:22,000
we are trying to join
244
00:11:22,000 --> 00:11:24,000
a large number of records
245
00:11:24,000 --> 00:11:27,000
while each weight key-value pair
246
00:11:27,000 --> 00:11:30,000
has a very small size,
247
00:11:30,000 --> 00:11:33,000
namely one single feature weight,
248
00:11:33,000 --> 00:11:36,000
so it is also not desirable.
249
00:11:36,000 --> 00:11:39,000
Before we give
250
00:11:39,000 --> 00:11:41,000
our next hint,
251
00:11:41,000 --> 00:11:42,000
after this part,
252
00:11:42,000 --> 00:11:43,000
you should be able
253
00:11:43,000 --> 00:11:44,000
to construct
254
00:11:44,000 --> 00:11:46,000
an
255
00:11:46,000 --> 00:11:48,000
inverted index
256
00:11:48,000 --> 00:11:50,000
using RDD operations
257
00:11:50,000 --> 00:11:52,000
and know the basic ideas
258
00:11:52,000 --> 00:11:55,000
of how to do
259
00:11:55,000 --> 00:11:57,000
join-based communication
260
00:11:57,000 --> 00:12:00,000
to combine two RDDs.
261
00:12:00,000 --> 00:12:02,000
Make sure you comprehend those
262
00:12:02,000 --> 00:12:04,000
and then we can proceed
263
00:12:04,000 --> 00:12:06,000
to the next steps.
264
00:12:06,000 --> 00:12:07,000
So now we focus on
265
00:12:07,000 --> 00:12:08,000
the performance part.
266
00:12:08,000 --> 00:12:11,000
As we stated before,
267
00:12:11,000 --> 00:12:12,000
one of our
268
00:12:12,000 --> 00:12:13,000
draft implementation's
269
00:12:13,000 --> 00:12:14,000
drawbacks is that
270
00:12:14,000 --> 00:12:15,000
it tries to join
271
00:12:15,000 --> 00:12:17,000
a large number of records,
272
00:12:17,000 --> 00:12:18,000
and each record
273
00:12:18,000 --> 00:12:21,000
is small in size.
274
00:12:21,000 --> 00:12:24,000
From the test case perspective,
275
00:12:24,000 --> 00:12:25,000
for example,
276
00:12:25,000 --> 00:12:27,000
for test case A,
277
00:12:27,000 --> 00:12:28,000
it contains over
278
00:12:28,000 --> 00:12:29,000
20 million features,
279
00:12:29,000 --> 00:12:31,000
which makes the join cost
280
00:12:31,000 --> 00:12:32,000
unacceptable
281
00:12:32,000 --> 00:12:33,000
because it joins
282
00:12:33,000 --> 00:12:37,000
on the feature side.
283
00:12:37,000 --> 00:12:38,000
Let us now
284
00:12:38,000 --> 00:12:39,000
take a look at
285
00:12:39,000 --> 00:12:40,000
a test case
286
00:12:40,000 --> 00:12:45,000
that has multiple times
287
00:12:45,000 --> 00:12:48,000
the feature dimensionality
288
00:12:48,000 --> 00:12:50,000
compared to test case A.
289
00:12:50,000 --> 00:12:51,000
So our goal is that
290
00:12:51,000 --> 00:12:53,000
we want to join
291
00:12:53,000 --> 00:12:55,000
a smaller number of records
292
00:12:55,000 --> 00:12:56,000
while each of them
293
00:12:56,000 --> 00:12:58,000
has a larger size.
294
00:12:58,000 --> 00:13:00,000
And we want to avoid
295
00:13:00,000 --> 00:13:03,000
storing multiple copies
296
00:13:03,000 --> 00:13:06,000
of each sample.
297
00:13:06,000 --> 00:13:07,000
Okay,
298
00:13:07,000 --> 00:13:09,000
so here is our second hint.
299
00:13:09,000 --> 00:13:10,000
Instead of using
300
00:13:10,000 --> 00:13:11,000
logical feature IDs
301
00:13:11,000 --> 00:13:13,000
or sample IDs,
302
00:13:13,000 --> 00:13:15,000
let's use partition IDs
303
00:13:15,000 --> 00:13:17,000
and do the join operation
304
00:13:17,000 --> 00:13:19,000
based on the partition ID.
305
00:13:19,000 --> 00:13:20,000
In this way,
306
00:13:20,000 --> 00:13:21,000
we can adjust
307
00:13:21,000 --> 00:13:23,000
the number of partitions
308
00:13:23,000 --> 00:13:24,000
so that we can control
309
00:13:24,000 --> 00:13:26,000
the size of the
310
00:13:26,000 --> 00:13:29,000
to-be-joined records.
311
00:13:29,000 --> 00:13:31,000
So we do the join
312
00:13:31,000 --> 00:13:32,000
based on
313
00:13:32,000 --> 00:13:35,000
the partition ID here.
314
00:13:35,000 --> 00:13:36,000
As shown
315
00:13:36,000 --> 00:13:38,000
in the lower half
316
00:13:38,000 --> 00:13:39,000
of the pipeline,
317
00:13:39,000 --> 00:13:42,000
instead of directly computing
318
00:13:42,000 --> 00:13:44,000
the feature ID
319
00:13:44,000 --> 00:13:45,000
to samples data set,
320
00:13:45,000 --> 00:13:47,000
we actually generate
321
00:13:47,000 --> 00:13:49,000
two data sets here.
322
00:13:49,000 --> 00:13:50,000
First is the data set
323
00:13:50,000 --> 00:13:53,000
that maps partition ID
324
00:13:53,000 --> 00:13:54,000
to a sample list.
325
00:13:54,000 --> 00:13:55,000
This can be done
326
00:13:55,000 --> 00:13:57,000
using the mapPartitionsWithIndex
327
00:13:57,000 --> 00:13:58,000
transformation
328
00:13:58,000 --> 00:14:00,000
in Spark.
329
00:14:00,000 --> 00:14:01,000
Another data set
330
00:14:01,000 --> 00:14:03,000
is a mapping from feature ID
331
00:14:03,000 --> 00:14:05,000
to a list of partition IDs.
332
00:14:05,000 --> 00:14:08,000
It is constructed through
333
00:14:08,000 --> 00:14:12,000
the inverted index approach.
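
A sketch of building these two data sets in PySpark (toy data; the real pipeline would start from the existing samples RDD):

    from pyspark import SparkContext

    sc = SparkContext.getOrCreate()

    samples_rdd = sc.parallelize([
        (1.0, [0, 3], [1.0, 2.0]),
        (0.0, [3, 4], [0.5, 1.5]),
    ], 2)

    # 1. partition_id -> list of samples in that partition
    samples_by_partition = samples_rdd.mapPartitionsWithIndex(
        lambda pid, it: [(pid, list(it))])

    # 2. feature_id -> list of partition IDs that contain it (an inverted index again,
    #    but over cheap partition IDs instead of whole samples)
    feature_to_partitions = (samples_rdd
        .mapPartitionsWithIndex(
            lambda pid, it: ((fid, pid) for s in it for fid in s[1]))
        .distinct()
        .groupByKey()
        .mapValues(list))
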
334
00:14:12,000 --> 00:14:13,000
Notably,
335
00:14:13,000 --> 00:14:14,000
rather than having
336
00:14:14,000 --> 00:14:16,000
redundant data samples,
337
00:14:16,000 --> 00:14:18,000
now we only have
338
00:14:18,000 --> 00:14:20,000
redundant partition IDs,
339
00:14:20,000 --> 00:14:22,000
which is much cheaper.
340
00:14:22,000 --> 00:14:23,000
Next,
341
00:14:23,000 --> 00:14:26,000
through the two cheaper joins
342
00:14:26,000 --> 00:14:27,000
here,
343
00:14:27,000 --> 00:14:28,000
we can obtain
344
00:14:28,000 --> 00:14:30,000
the data set
345
00:14:30,000 --> 00:14:31,000
that we want
346
00:14:31,000 --> 00:14:32,000
and be able to calculate
347
00:14:32,000 --> 00:14:34,000
gradient descent locally.
348
00:14:34,000 --> 00:14:35,000
So,
349
00:14:35,000 --> 00:14:36,000
for example,
350
00:14:36,000 --> 00:14:37,000
we can map
351
00:14:37,000 --> 00:14:38,000
partition IDs
352
00:14:38,000 --> 00:14:39,000
in each partition
353
00:14:39,000 --> 00:14:40,000
or locally
354
00:14:40,000 --> 00:14:41,000
(we mean
355
00:14:41,000 --> 00:14:42,000
like locally
356
00:14:42,000 --> 00:14:43,000
in the executors)
357
00:14:43,000 --> 00:14:44,000
when possible.
358
00:14:44,000 --> 00:14:45,000
Here is what one candidate
359
00:14:45,000 --> 00:14:46,000
RDD
360
00:14:46,000 --> 00:14:47,000
might look like.
361
00:14:47,000 --> 00:14:48,000
It's like
362
00:14:48,000 --> 00:14:49,000
a mapping
363
00:14:49,000 --> 00:14:50,000
from partition ID
364
00:14:50,000 --> 00:14:51,000
to a tuple
365
00:14:51,000 --> 00:14:52,000
containing
366
00:14:52,000 --> 00:14:53,000
a weight list
367
00:14:53,000 --> 00:14:54,000
and a sample list.
368
00:14:54,000 --> 00:14:55,000
And it's your job
369
00:14:55,000 --> 00:14:56,000
to think about
370
00:14:56,000 --> 00:14:57,000
and figure out
371
00:14:57,000 --> 00:14:58,000
how to use
372
00:14:58,000 --> 00:14:59,000
join operations
373
00:14:59,000 --> 00:15:00,000
to achieve
374
00:15:00,000 --> 00:15:01,000
this task.
375
00:15:01,000 --> 00:15:02,000
Think about
376
00:15:02,000 --> 00:15:03,000
how you can
377
00:15:03,000 --> 00:15:04,000
use
378
00:15:04,000 --> 00:15:07,000
these two RDDs,
379
00:15:07,000 --> 00:15:09,000
the samples partition RDD
380
00:15:09,000 --> 00:15:12,000
and the feature partition RDD,
381
00:15:12,000 --> 00:15:13,000
together with
382
00:15:13,000 --> 00:15:15,000
other existing RDDs
383
00:15:15,000 --> 00:15:17,000
like the weights RDD,
384
00:15:17,000 --> 00:15:18,000
to derive
385
00:15:18,000 --> 00:15:21,000
the weight sample RDD.
386
00:15:21,000 --> 00:15:22,000
Brainstorm
387
00:15:22,000 --> 00:15:23,000
and think
388
00:15:23,000 --> 00:15:24,000
about
389
00:15:24,000 --> 00:15:26,000
how you can use
390
00:15:26,000 --> 00:15:27,000
the inverted index and
391
00:15:27,000 --> 00:15:29,000
join operations
392
00:15:29,000 --> 00:15:31,000
to join on keys
393
00:15:31,000 --> 00:15:32,000
and eventually
394
00:15:32,000 --> 00:15:33,000
get to this step.
395
00:15:34,000 --> 00:15:35,000
Furthermore,
396
00:15:35,000 --> 00:15:36,000
as for
397
00:15:36,000 --> 00:15:37,000
RDD partitioning
398
00:15:37,000 --> 00:15:38,000
and join optimization,
399
00:15:38,000 --> 00:15:39,000
we have
400
00:15:39,000 --> 00:15:41,000
some suggestions here.
401
00:15:41,000 --> 00:15:42,000
First,
402
00:15:42,000 --> 00:15:43,000
we would like
403
00:15:43,000 --> 00:15:44,000
to introduce
404
00:15:44,000 --> 00:15:47,000
some RDD partitioning operations.
405
00:15:47,000 --> 00:15:48,000
So,
406
00:15:48,000 --> 00:15:49,000
for
407
00:15:49,000 --> 00:15:50,000
partition-level
408
00:15:50,000 --> 00:15:51,000
map
409
00:15:51,000 --> 00:15:52,000
and reduce
410
00:15:52,000 --> 00:15:53,000
tasks,
411
00:15:53,000 --> 00:15:54,000
you might want
412
00:15:54,000 --> 00:15:55,000
to take a look
413
00:15:55,000 --> 00:15:56,000
at mapPartitions
414
00:15:56,000 --> 00:15:58,000
and mapPartitionsWithIndex,
415
00:15:58,000 --> 00:15:59,000
and also
416
00:15:59,000 --> 00:16:01,000
glom operations.
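
For reference, tiny usage examples of these operations in PySpark:

    from pyspark import SparkContext

    sc = SparkContext.getOrCreate()
    rdd = sc.parallelize(range(8), 4)

    # mapPartitions: process one whole partition at a time
    per_partition_sums = rdd.mapPartitions(lambda it: [sum(it)])

    # mapPartitionsWithIndex: same, but the partition ID is also available
    tagged = rdd.mapPartitionsWithIndex(lambda pid, it: [(pid, list(it))])

    # glom: turn each partition into a single list element
    as_lists = rdd.glom()        # collect() would give [[0, 1], [2, 3], [4, 5], [6, 7]]
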
417
00:16:01,000 --> 00:16:02,000
And for
418
00:16:02,000 --> 00:16:03,000
repartition operations,
419
00:16:03,000 --> 00:16:04,000
you might want
420
00:16:04,000 --> 00:16:05,000
to take a look
421
00:16:05,000 --> 00:16:06,000
at repartition,
422
00:16:06,000 --> 00:16:07,000
and
423
00:16:07,000 --> 00:16:08,000
partitionBy
424
00:16:08,000 --> 00:16:09,000
is useful.
425
00:16:10,000 --> 00:16:11,000
Second point
426
00:16:11,000 --> 00:16:12,000
is that
427
00:16:12,000 --> 00:16:13,000
before doing
428
00:16:13,000 --> 00:16:14,000
a join operation,
429
00:16:14,000 --> 00:16:15,000
if you do
430
00:16:15,000 --> 00:16:16,000
a partitionBy (partition
431
00:16:16,000 --> 00:16:18,000
by key) operation first,
432
00:16:18,000 --> 00:16:19,000
then the
433
00:16:19,000 --> 00:16:20,000
join operation
434
00:16:20,000 --> 00:16:21,000
might be cheaper.
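
For instance, a small PySpark sketch (the partition count is a placeholder to be tuned):

    from pyspark import SparkContext

    sc = SparkContext.getOrCreate()

    n = 4
    a = sc.parallelize([(i, i) for i in range(100)]).partitionBy(n)
    b = sc.parallelize([(i, -i) for i in range(100)]).partitionBy(n)

    # Both sides now use the same hash partitioner, so the join should not need to
    # re-shuffle them (the Spark UI can confirm what actually happens).
    joined = a.join(b, n)
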
435
00:16:23,000 --> 00:16:24,000
One interesting
436
00:16:24,000 --> 00:16:25,000
thinking question
437
00:16:25,000 --> 00:16:26,000
is this:
438
00:16:26,000 --> 00:16:27,000
can you partition
439
00:16:27,000 --> 00:16:28,000
the data set once,
440
00:16:28,000 --> 00:16:29,000
avoid disturbing
441
00:16:29,000 --> 00:16:30,000
that partitioning,
442
00:16:30,000 --> 00:16:31,000
and reuse
443
00:16:31,000 --> 00:16:32,000
the data set
444
00:16:32,000 --> 00:16:33,000
for multiple joins
445
00:16:33,000 --> 00:16:34,000
so that it saves
446
00:16:34,000 --> 00:16:35,000
you the
447
00:16:35,000 --> 00:16:36,000
shuffle cost.
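
A sketch of that idea (PySpark, toy data):

    from pyspark import SparkContext

    sc = SparkContext.getOrCreate()

    n = 4
    base = (sc.parallelize([(i, i * i) for i in range(100)])
              .partitionBy(n)
              .cache())                      # partition once and keep the result around

    other1 = sc.parallelize([(i, "a") for i in range(100)]).partitionBy(n)
    other2 = sc.parallelize([(i, "b") for i in range(100)]).partitionBy(n)

    # Reusing the already-partitioned, cached RDD for several joins avoids
    # re-shuffling it each time.
    j1 = base.join(other1, n)
    j2 = base.join(other2, n)
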
448
00:16:36,000 --> 00:16:37,000
The third point
449
00:16:37,000 --> 00:16:38,000
is that
450
00:16:38,000 --> 00:16:39,000
you might need
451
00:16:39,000 --> 00:16:40,000
to
452
00:16:40,000 --> 00:16:41,000
think about
453
00:16:41,000 --> 00:16:42,000
the number
454
00:16:42,000 --> 00:16:43,000
of partitions and
455
00:16:43,000 --> 00:16:44,000
its relationship
456
00:16:44,000 --> 00:16:45,000
with the number
457
00:16:45,000 --> 00:16:46,000
of features
458
00:16:46,000 --> 00:16:47,000
and number
459
00:16:47,000 --> 00:16:48,000
of cores
460
00:16:48,000 --> 00:16:49,000
so that
461
00:16:49,000 --> 00:16:50,000
each record
462
00:16:50,000 --> 00:16:51,000
can fit
463
00:16:51,000 --> 00:16:52,000
in memory
464
00:16:52,000 --> 00:16:53,000
and the
465
00:16:53,000 --> 00:16:54,000
join operation
466
00:16:54,000 --> 00:16:55,000
can have
467
00:16:55,000 --> 00:16:56,000
good
468
00:16:56,000 --> 00:16:57,000
performance.
469
00:16:57,000 --> 00:16:58,000
One trade-off
470
00:16:58,000 --> 00:16:59,000
here is that
471
00:16:59,000 --> 00:17:00,000
if we
472
00:17:00,000 --> 00:17:01,000
have a higher
473
00:17:01,000 --> 00:17:02,000
number
474
00:17:02,000 --> 00:17:03,000
of partitions,
475
00:17:03,000 --> 00:17:04,000
then the
476
00:17:04,000 --> 00:17:05,000
join operation
477
00:17:05,000 --> 00:17:06,000
would be slower
478
00:17:06,000 --> 00:17:07,000
but it will use
479
00:17:07,000 --> 00:17:08,000
less memory.
480
00:17:08,000 --> 00:17:09,000
However,
481
00:17:09,000 --> 00:17:10,000
if we have
482
00:17:10,000 --> 00:17:11,000
fewer
483
00:17:11,000 --> 00:17:12,000
partitions,
484
00:17:12,000 --> 00:17:13,000
the join operation
485
00:17:13,000 --> 00:17:14,000
would be quicker
486
00:17:14,000 --> 00:17:15,000
but it will use
487
00:17:15,000 --> 00:17:16,000
a lot of memory
488
00:17:16,000 --> 00:17:17,000
and may even
489
00:17:17,000 --> 00:17:18,000
cause
490
00:17:18,000 --> 00:17:19,000
out-of-memory
491
00:17:19,000 --> 00:17:20,000
errors.
492
00:17:22,000 --> 00:17:23,000
So,
493
00:17:23,000 --> 00:17:24,000
explore
494
00:17:24,000 --> 00:17:25,000
and evaluate
495
00:17:25,000 --> 00:17:26,000
the impact
496
00:17:26,000 --> 00:17:27,000
yourself
497
00:17:27,000 --> 00:17:28,000
and figure out
498
00:17:28,000 --> 00:17:29,000
a general
499
00:17:29,000 --> 00:17:30,000
solution
500
00:17:30,000 --> 00:17:31,000
to decide
501
00:17:31,000 --> 00:17:32,000
the number
502
00:17:32,000 --> 00:17:33,000
of partitions based on the number of features
503
00:17:33,000 --> 00:17:34,000
and the number
504
00:17:34,000 --> 00:17:35,000
of cores.
505
00:17:35,000 --> 00:17:36,000
And please pay
506
00:17:36,000 --> 00:17:37,000
attention that
507
00:17:37,000 --> 00:17:38,000
we
508
00:17:38,000 --> 00:17:39,000
do not
509
00:17:39,000 --> 00:17:40,000
allow hard-coding
510
00:17:40,000 --> 00:17:41,000
values
511
00:17:41,000 --> 00:17:42,000
or making
512
00:17:42,000 --> 00:17:43,000
separate
513
00:17:43,000 --> 00:17:44,000
submissions
514
00:17:44,000 --> 00:17:45,000
for each
515
00:17:45,000 --> 00:17:46,000
test case.
516
00:17:46,000 --> 00:17:47,000
For more
517
00:17:47,000 --> 00:17:48,000
information,
518
00:17:48,000 --> 00:17:49,000
please refer
519
00:17:49,000 --> 00:17:50,000
to the
520
00:17:50,000 --> 00:17:51,000
write-up.
521
00:17:51,000 --> 00:17:52,000
In addition,
522
00:17:52,000 --> 00:17:53,000
you might
523
00:17:53,000 --> 00:17:54,000
always
524
00:17:54,000 --> 00:17:55,000
want
525
00:17:55,000 --> 00:17:56,000
to
526
00:17:56,000 --> 00:17:57,000
use
527
00:17:57,000 --> 00:17:58,000
the same
528
00:17:58,000 --> 00:17:59,000
number
529
00:17:59,000 --> 00:18:00,000
of partitions
530
00:18:00,000 --> 00:18:01,000
in your
531
00:18:01,000 --> 00:18:02,000
test cases.
532
00:18:02,000 --> 00:18:03,000
In this
533
00:18:03,000 --> 00:18:04,000
example,
534
00:18:04,000 --> 00:18:05,000
there are
535
00:18:05,000 --> 00:18:06,000
at least
536
00:18:06,000 --> 00:18:07,000
two possible
537
00:18:07,000 --> 00:18:08,000
cases where
538
00:18:08,000 --> 00:18:09,000
the output
539
00:18:09,000 --> 00:18:10,000
size might
540
00:18:10,000 --> 00:18:11,000
blow up.
541
00:18:11,000 --> 00:18:12,000
First is
542
00:18:12,000 --> 00:18:13,000
the output
543
00:18:13,000 --> 00:18:14,000
partitions
544
00:18:14,000 --> 00:18:15,000
of the join
545
00:18:15,000 --> 00:18:16,000
operation.
546
00:18:16,000 --> 00:18:17,000
Another is
547
00:18:17,000 --> 00:18:18,000
the output
548
00:18:18,000 --> 00:18:19,000
of the
549
00:18:19,000 --> 00:18:20,000
reduceByKey
550
00:18:20,000 --> 00:18:21,000
operation.
551
00:18:21,000 --> 00:18:22,000
In both
552
00:18:22,000 --> 00:18:23,000
cases,
553
00:18:23,000 --> 00:18:24,000
you might
554
00:18:24,000 --> 00:18:25,000
want to
555
00:18:25,000 --> 00:18:26,000
try using
556
00:18:26,000 --> 00:18:27,000
repartition
557
00:18:27,000 --> 00:18:28,000
or the partitionBy
558
00:18:28,000 --> 00:18:29,000
operation
559
00:18:29,000 --> 00:18:30,000
to maintain
560
00:18:30,000 --> 00:18:31,000
the partitioning of the
561
00:18:31,000 --> 00:18:32,000
result.
562
00:18:32,000 --> 00:18:33,000
In this
563
00:18:33,000 --> 00:18:34,000
example,
564
00:18:34,000 --> 00:18:35,000
you might
565
00:18:35,000 --> 00:18:36,000
be able
566
00:18:36,000 --> 00:18:37,000
to find
567
00:18:37,000 --> 00:18:38,000
situations
568
00:18:38,000 --> 00:18:39,000
where
569
00:18:39,000 --> 00:18:40,000
RDDs
570
00:18:40,000 --> 00:18:41,000
can be
571
00:18:41,000 --> 00:18:42,000
reused
572
00:18:42,000 --> 00:18:43,000
across
573
00:18:43,000 --> 00:18:44,000
iterations.
574
00:18:44,000 --> 00:18:45,000
For example,
575
00:18:45,000 --> 00:18:46,000
the samples
576
00:18:46,000 --> 00:18:47,000
RDD
577
00:18:47,000 --> 00:18:48,000
and the
578
00:18:48,000 --> 00:18:49,000
feature
579
00:18:49,000 --> 00:18:50,000
indices
580
00:18:50,000 --> 00:18:51,000
are
581
00:18:51,000 --> 00:18:52,000
read-only
582
00:18:52,000 --> 00:18:53,000
data
583
00:18:53,000 --> 00:18:54,000
without
584
00:18:54,000 --> 00:18:55,000
any
585
00:18:55,000 --> 00:18:56,000
updates.
586
00:18:56,000 --> 00:18:57,000
So you
587
00:18:57,000 --> 00:18:58,000
can
588
00:18:58,000 --> 00:18:59,000
consider caching them.
589
00:18:59,000 --> 00:19:00,000
Those were most of
590
00:19:00,000 --> 00:19:01,000
the
591
00:19:01,000 --> 00:19:02,000
important
592
00:19:02,000 --> 00:19:03,000
hints
593
00:19:03,000 --> 00:19:04,000
for P2-3.
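
As a one-line illustration of that caching suggestion (persist with an explicit storage level is an alternative to cache):

    from pyspark import SparkContext

    sc = SparkContext.getOrCreate()

    samples_rdd = sc.parallelize([(1.0, [0, 3], [1.0, 2.0])], 4)
    # Read-only data reused by every iteration can be cached once up front.
    samples_rdd = samples_rdd.cache()
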
594
00:19:04,000 --> 00:19:05,000
I'd
595
00:19:05,000 --> 00:19:06,000
like to
596
00:19:06,000 --> 00:19:07,000
give a
597
00:19:07,000 --> 00:19:08,000
brief
598
00:19:08,000 --> 00:19:09,000
introduction
599
00:19:09,000 --> 00:19:10,000
to RDD
600
00:19:10,000 --> 00:19:11,000
join
601
00:19:11,000 --> 00:19:12,000
operators.
602
00:19:12,000 --> 00:19:13,000
From a
603
00:19:13,000 --> 00:19:14,000
logical
604
00:19:14,000 --> 00:19:15,000
view,
605
00:19:15,000 --> 00:19:16,000
the use
606
00:19:16,000 --> 00:19:17,000
of RDD
607
00:19:17,000 --> 00:19:18,000
join
608
00:19:18,000 --> 00:19:19,000
operation
609
00:19:19,000 --> 00:19:20,000
is for
610
00:19:20,000 --> 00:19:21,000
combining
611
00:19:21,000 --> 00:19:22,000
two key
612
00:19:22,000 --> 00:19:23,000
value
613
00:19:23,000 --> 00:19:24,000
datasets
614
00:19:24,000 --> 00:19:25,000
together
615
00:19:25,000 --> 00:19:26,000
so that
616
00:19:26,000 --> 00:19:27,000
pairs
617
00:19:27,000 --> 00:19:28,000
with the same
618
00:19:28,000 --> 00:19:29,000
key
619
00:19:29,000 --> 00:19:30,000
can
620
00:19:30,000 --> 00:19:31,000
be
621
00:19:31,000 --> 00:19:32,000
combined
622
00:19:32,000 --> 00:19:33,000
locally.
623
00:19:33,000 --> 00:19:34,000
On the
624
00:19:34,000 --> 00:19:35,000
other hand,
625
00:19:35,000 --> 00:19:36,000
from an
626
00:19:36,000 --> 00:19:37,000
implementation
627
00:19:37,000 --> 00:19:38,000
perspective,
628
00:19:38,000 --> 00:19:39,000
the goal
629
00:19:39,000 --> 00:19:40,000
of the
630
00:19:40,000 --> 00:19:41,000
Spark RDD
631
00:19:41,000 --> 00:19:42,000
join
632
00:19:42,000 --> 00:19:43,000
operator
633
00:19:43,000 --> 00:19:44,000
is to
634
00:19:44,000 --> 00:19:45,000
bring
635
00:19:45,000 --> 00:19:46,000
corresponding
636
00:19:46,000 --> 00:19:47,000
keys
637
00:19:47,000 --> 00:19:48,000
from
638
00:19:48,000 --> 00:19:49,000
each
639
00:19:49,000 --> 00:19:50,000
RDD
640
00:19:50,000 --> 00:19:51,000
to
641
00:19:51,000 --> 00:19:52,000
the
642
00:19:52,000 --> 00:19:53,000
same
643
00:19:53,000 --> 00:19:54,000
partition
644
00:19:54,000 --> 00:19:55,000
so that
645
00:19:55,000 --> 00:19:56,000
they
646
00:19:56,000 --> 00:19:57,000
can
647
00:19:57,000 --> 00:19:58,000
be
648
00:19:58,000 --> 00:19:59,000
combined
649
00:19:59,000 --> 00:20:00,000
in
650
00:20:00,000 --> 00:20:01,000
the
651
00:20:01,000 --> 00:20:02,000
RDD
652
00:20:02,000 --> 00:20:03,000
join
653
00:20:03,000 --> 00:20:04,000
operation.
654
00:20:04,000 --> 00:20:05,000
For example,
655
00:20:05,000 --> 00:20:06,000
if RDD-A
656
00:20:06,000 --> 00:20:07,000
and
657
00:20:07,000 --> 00:20:08,000
RDD-B
658
00:20:08,000 --> 00:20:09,000
don't
659
00:20:09,000 --> 00:20:10,000
have a
660
00:20:10,000 --> 00:20:11,000
partitioner,
661
00:20:11,000 --> 00:20:12,000
we'll
662
00:20:12,000 --> 00:20:13,000
have
663
00:20:13,000 --> 00:20:14,000
two
664
00:20:14,000 --> 00:20:15,000
shuffle
665
00:20:15,000 --> 00:20:16,000
operations
666
00:20:16,000 --> 00:20:17,000
here.
667
00:20:17,000 --> 00:20:18,000
However,
668
00:20:18,000 --> 00:20:19,000
if we
669
00:20:19,000 --> 00:20:20,000
specify
670
00:20:20,000 --> 00:20:21,000
the
671
00:20:21,000 --> 00:20:22,000
partitioner
672
00:20:22,000 --> 00:20:23,000
before
673
00:20:23,000 --> 00:20:24,000
doing the
674
00:20:24,000 --> 00:20:25,000
join
675
00:20:25,000 --> 00:20:26,000
operation,
676
00:20:26,000 --> 00:20:27,000
we
677
00:20:27,000 --> 00:20:28,000
can
678
00:20:28,000 --> 00:20:29,000
use
679
00:20:29,000 --> 00:20:30,000
the
680
00:20:30,000 --> 00:20:31,000
hash
681
00:20:31,000 --> 00:20:32,000
function,
682
00:20:32,000 --> 00:20:33,000
which
683
00:20:33,000 --> 00:20:34,000
will
684
00:20:34,000 --> 00:20:35,000
use
685
00:20:35,000 --> 00:20:36,000
the
686
00:20:36,000 --> 00:20:37,000
default
687
00:20:37,000 --> 00:20:38,000
hash
688
00:20:38,000 --> 00:20:39,000
partition
689
00:20:39,000 --> 00:20:40,000
algorithm
690
00:20:40,000 --> 00:20:41,000
to
691
00:20:41,000 --> 00:20:42,000
distribute
692
00:20:42,000 --> 00:20:43,000
key
693
00:20:43,000 --> 00:20:44,000
value
694
00:20:44,000 --> 00:20:45,000
pairs.
695
00:20:45,000 --> 00:20:46,000
Another way
696
00:20:46,000 --> 00:20:47,000
to specify
697
00:20:47,000 --> 00:20:48,000
the
698
00:20:48,000 --> 00:20:49,000
partitioner
699
00:20:49,000 --> 00:20:50,000
is
700
00:20:50,000 --> 00:20:51,000
using
701
00:20:51,000 --> 00:20:52,000
the
702
00:20:52,000 --> 00:20:53,000
partitionBy
703
00:20:53,000 --> 00:20:54,000
operation,
704
00:20:54,000 --> 00:20:55,000
in
705
00:20:55,000 --> 00:20:56,000
which
706
00:20:56,000 --> 00:20:58,000
only
707
00:20:58,000 --> 00:20:59,000
one
708
00:20:59,000 --> 00:21:00,000
shuffle
709
00:21:00,000 --> 00:21:01,000
operation
710
00:21:01,000 --> 00:21:02,000
is
711
00:21:02,000 --> 00:21:03,000
required.
712
00:21:03,000 --> 00:21:04,000
In addition,
713
00:21:04,000 --> 00:21:05,000
if both
714
00:21:05,000 --> 00:21:06,000
RDDs
715
00:21:06,000 --> 00:21:07,000
use
716
00:21:07,000 --> 00:21:08,000
the
717
00:21:08,000 --> 00:21:09,000
same
718
00:21:09,000 --> 00:21:10,000
partitioner,
719
00:21:10,000 --> 00:21:11,000
which
720
00:21:11,000 --> 00:21:12,000
means
721
00:21:12,000 --> 00:21:13,000
firstly,
722
00:21:13,000 --> 00:21:14,000
they'll
723
00:21:14,000 --> 00:21:15,000
have
724
00:21:15,000 --> 00:21:16,000
the
725
00:21:16,000 --> 00:21:17,000
same
726
00:21:17,000 --> 00:21:18,000
number
727
00:21:18,000 --> 00:21:19,000
of
728
00:21:19,000 --> 00:21:20,000
partitions,
729
00:21:20,000 --> 00:21:21,000
and
730
00:21:21,000 --> 00:21:22,000
secondly,
731
00:21:22,000 --> 00:21:23,000
they'll
732
00:21:23,000 --> 00:21:24,000
use
733
00:21:24,000 --> 00:21:25,000
the
734
00:21:25,000 --> 00:21:26,000
same
735
00:21:26,000 --> 00:21:27,000
partition
736
00:21:27,000 --> 00:21:28,000
function
737
00:21:28,000 --> 00:21:29,000
passed to
738
00:21:29,000 --> 00:21:30,000
partitionBy,
739
00:21:30,000 --> 00:21:31,000
which
740
00:21:31,000 --> 00:21:32,000
saves
741
00:21:32,000 --> 00:21:33,000
you
742
00:21:33,000 --> 00:21:34,000
a lot
743
00:21:34,000 --> 00:21:35,000
of
744
00:21:35,000 --> 00:21:36,000
shuffle
745
00:21:36,000 --> 00:21:37,000
cost.
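
To make these situations concrete, here is a PySpark sketch; whether a shuffle is actually avoided can be verified in the Spark UI:

    from pyspark import SparkContext

    sc = SparkContext.getOrCreate()

    n = 4
    a = sc.parallelize([(i, i) for i in range(100)])
    b = sc.parallelize([(i, -i) for i in range(100)])

    # Neither side has a partitioner: the join shuffles both inputs.
    j1 = a.join(b, n)

    # One side pre-partitioned: roughly one of the two shuffles can be avoided.
    a_part = a.partitionBy(n)
    j2 = a_part.join(b, n)

    # Both sides share the same partitioner (same number of partitions, same
    # partition function): the join itself should need no extra shuffle.
    b_part = b.partitionBy(n)
    j3 = a_part.join(b_part, n)
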
746
00:21:37,000 --> 00:21:38,000
And
747
00:21:38,000 --> 00:21:39,000
also,
748
00:21:39,000 --> 00:21:40,000
some
749
00:21:40,000 --> 00:21:41,000
other
750
00:21:41,000 --> 00:21:42,000
general
751
00:21:42,000 --> 00:21:43,000
optimizations
752
00:21:43,000 --> 00:21:44,000
for
753
00:21:44,000 --> 00:21:45,000
RDD
754
00:21:45,000 --> 00:21:46,000
join
755
00:21:46,000 --> 00:21:47,000
operator
756
00:21:47,000 --> 00:21:48,000
are as
757
00:21:48,000 --> 00:21:49,000
follows.
758
00:21:49,000 --> 00:21:50,000
Note that
759
00:21:50,000 --> 00:21:51,000
you
760
00:21:51,000 --> 00:21:52,000
might
761
00:21:52,000 --> 00:21:53,000
not
762
00:21:53,000 --> 00:21:54,000
need
763
00:21:54,000 --> 00:21:55,000
to
764
00:21:55,000 --> 00:21:56,000
do
765
00:21:56,000 --> 00:21:57,000
a
766
00:21:57,000 --> 00:21:58,000
join
767
00:21:58,000 --> 00:21:59,000
operation
768
00:21:59,000 --> 00:22:00,000
to
769
00:22:00,000 --> 00:22:01,000
reduce
770
00:22:01,000 --> 00:22:02,000
the
771
00:22:02,000 --> 00:22:03,000
result
772
00:22:03,000 --> 00:22:04,000
size
773
00:22:04,000 --> 00:22:05,000
for
774
00:22:05,000 --> 00:22:06,000
the
775
00:22:06,000 --> 00:22:07,000
join
776
00:22:07,000 --> 00:22:08,000
operation.
777
00:22:08,000 --> 00:22:09,000
Secondly,
778
00:22:09,000 --> 00:22:10,000
pre-partitioning
779
00:22:10,000 --> 00:22:11,000
with a
780
00:22:11,000 --> 00:22:12,000
known partitioner
781
00:22:12,000 --> 00:22:13,000
can
782
00:22:13,000 --> 00:22:14,000
reduce
783
00:22:14,000 --> 00:22:15,000
the
784
00:22:15,000 --> 00:22:16,000
cost
785
00:22:16,000 --> 00:22:17,000
from
786
00:22:17,000 --> 00:22:18,000
shuffle
787
00:22:18,000 --> 00:22:19,000
operations.
788
00:22:19,000 --> 00:22:20,000
Besides,
789
00:22:20,000 --> 00:22:21,000
a common
790
00:22:21,000 --> 00:22:22,000
strategy
791
00:22:22,000 --> 00:22:23,000
is
792
00:22:23,000 --> 00:22:24,000
to
793
00:22:24,000 --> 00:22:25,000
test
794
00:22:25,000 --> 00:22:26,000
one
795
00:22:26,000 --> 00:22:27,000
iteration
796
00:22:27,000 --> 00:22:28,000
at
797
00:22:28,000 --> 00:22:29,000
a
798
00:22:29,000 --> 00:22:30,000
time
799
00:22:30,000 --> 00:22:31,000
or
800
00:22:31,000 --> 00:22:32,000
two
801
00:22:32,000 --> 00:22:33,000
iterations
802
00:22:33,000 --> 00:22:34,000
to
803
00:22:34,000 --> 00:22:35,000
have
804
00:22:35,000 --> 00:22:36,000
an
805
00:22:36,000 --> 00:22:37,000
estimate
806
00:22:37,000 --> 00:22:38,000
about
807
00:22:38,000 --> 00:22:39,000
what
808
00:22:39,000 --> 00:22:40,000
your
809
00:22:40,000 --> 00:22:41,000
runtime
810
00:22:41,000 --> 00:22:42,000
for
811
00:22:42,000 --> 00:22:43,000
the
812
00:22:43,000 --> 00:22:44,000
whole
813
00:22:44,000 --> 00:22:45,000
test
814
00:22:45,000 --> 00:22:46,000
cases will be.
815
00:22:46,000 --> 00:22:47,000
Subsequent
816
00:22:47,000 --> 00:22:48,000
iterations
817
00:22:48,000 --> 00:22:49,000
are
818
00:22:49,000 --> 00:22:50,000
about
819
00:22:50,000 --> 00:22:51,000
20%
820
00:22:51,000 --> 00:22:52,000
cheaper.
821
00:22:52,000 --> 00:22:53,000
It's
822
00:22:53,000 --> 00:22:54,000
especially
823
00:22:54,000 --> 00:22:55,000
useful
824
00:22:55,000 --> 00:22:56,000
for
825
00:22:56,000 --> 00:22:57,000
test
826
00:22:57,000 --> 00:22:58,000
cases.
827
00:22:58,000 --> 00:22:59,000
And always,
828
00:22:59,000 --> 00:23:00,000
please
829
00:23:00,000 --> 00:23:01,000
start
830
00:23:01,000 --> 00:23:02,000
early.
831
00:23:02,000 --> 00:23:03,000
Yeah.
832
00:23:03,000 --> 00:23:04,000
That's
833
00:23:04,000 --> 00:23:05,000
all for
834
00:23:05,000 --> 00:23:06,000
the
835
00:23:06,000 --> 00:23:07,000
hints
836
00:23:07,000 --> 00:23:08,000
video.
837
00:23:08,000 --> 00:23:09,000
Hope you
838
00:23:09,000 --> 00:23:10,000
enjoyed
839
00:23:10,000 --> 00:23:11,000
this project
840
00:23:11,000 --> 00:23:12,000
and best
841
00:23:12,000 --> 00:23:13,000
wishes.
842
00:23:13,000 --> 00:23:14,000
Thank you.