005 Using the Tokenizer in Python (English transcript)

So in this lecture, we will be continuing our notebook from earlier in the session. As you know, the next step will be to load up our tokenizer. In this notebook, we'll be using DistilBERT as our checkpoint, since it trains much faster than BERT. But if you're doing this at home by yourself, I would recommend trying BERT in addition to DistilBERT. As usual, we call AutoTokenizer.from_pretrained, passing in the checkpoint.

The next step is to test our tokenizer on a question and context pair. We'll choose the sample at index one again. In order to understand the results, we'll take the input IDs and call tokenizer.decode to get the result back in English.

So as expected, we get both the question and context concatenated into one string. Note that by convention we'll always put the question first. As you can see, the string starts with the special CLS token as usual. Then we have the question, which is "What is in front of the Notre Dame main building?" Then we have the SEP token, then we have the context, and then it ends with another SEP.

So as you've just seen, the context can be quite long. So long, in fact, that the combination of the context and question might be too long for the model to handle. The accepted strategy for this is to split the context into multiple samples. In order to do this, we need to specify a few additional arguments when the tokenizer is called. Specifically, we set max_length to 100, specifying that the number of input tokens for any given sample should be at most 100 tokens long. We set truncation to "only_second", which means that we will only truncate the context but not the question. As you recall, the context is the second input string. We wouldn't want to truncate the question, since then we wouldn't know what the question is.

And because we split the context into multiple windows, we don't want to accidentally cut off the answer, meaning we don't want part of the answer to be in one window and the other part of the answer to be in another window. So in order to avoid that situation, we set the stride, which in this case has been set to 50. Note that 100 and 50 are just toy values for this experiment; later on, we'll be using different values.
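To make these steps concrete, here is a minimal sketch of the calls described so far. The dataset variable name raw_datasets, the use of load_dataset("squad"), and the checkpoint string "distilbert-base-cased" are assumptions for illustration, not taken from the lecture:

    from datasets import load_dataset
    from transformers import AutoTokenizer

    # Assumed setup: an earlier lecture loaded the SQuAD data; names here are illustrative.
    raw_datasets = load_dataset("squad")
    checkpoint = "distilbert-base-cased"  # assumed checkpoint; try "bert-base-cased" for BERT
    tokenizer = AutoTokenizer.from_pretrained(checkpoint)

    # Tokenize the (question, context) pair at index one; the question goes first by convention.
    sample = raw_datasets["train"][1]
    inputs = tokenizer(sample["question"], sample["context"])
    print(tokenizer.decode(inputs["input_ids"]))
    # -> [CLS] question [SEP] context [SEP]

    # Split the long context into overlapping windows
    # (return_overflowing_tokens is explained just below).
    inputs = tokenizer(
        sample["question"],
        sample["context"],
        max_length=100,             # at most 100 tokens per window (toy value)
        truncation="only_second",   # truncate only the context, never the question
        stride=50,                  # overlap between consecutive windows (toy value)
        return_overflowing_tokens=True,
    )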
Finally, we set return_overflowing_tokens to True, meaning we want to return the tokens in the overlapping parts of the context windows.

So once we've called the tokenizer, note that it will give us multiple samples. This is because the context is split into multiple windows, and it's combined with the question each time. So after we've called the tokenizer, we're going to loop through each of the resulting samples and print the decoded input IDs.

So as you can see, we get four new samples. This means that the context was split into four different parts. Note that the question has not been truncated, which is what we specified. Instead, the context in this case is just a part of the full context which we saw above. And again, note that we have the CLS and SEP tokens in their usual places.

Now, you know that when we call the tokenizer, we don't just get back the token IDs. So let's check the keys of the result to see what we actually got. So in addition to the input IDs and attention mask, which you've already seen, we get another item called overflow_to_sample_mapping. So let's print this out just to see what it is.

So it's just a list of zeroes, which is not that informative. What I'm going to do is tell you right now what this means and then do a more meaningful demonstration in the next step. So what does this mean? Well, as you recall, when we split up the context, we get back multiple samples. However, we may want to know which original sample index each one came from. That's what this tells us. Since we only passed in one original sample, this is going to be zero each time, because they all came from the original sample at index zero.

So just to get a better feel for this overflow_to_sample_mapping, we're going to call the tokenizer once again, but this time on the first three samples instead of just one. We're also going to add one additional argument, return_offsets_mapping, which has been set to True. Again, we print out the overflow_to_sample_mapping results.

So hopefully these results make more sense. As you can see, all of our input samples have been split into four parts. This tells us that the first four outputs correspond to the original sample at index zero, the next four outputs correspond to the original sample at index one, and the final four outputs correspond to the original sample at index two.
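Continuing the sketch with the same assumed names, printing the decoded windows and the overflow_to_sample_mapping, and then repeating the call on the first three samples, might look like this:

    # Each window pairs the full question with part of the context.
    for ids in inputs["input_ids"]:
        print(tokenizer.decode(ids))

    print(inputs.keys())
    # -> includes overflow_to_sample_mapping in addition to input_ids and attention_mask
    print(inputs["overflow_to_sample_mapping"])
    # -> [0, 0, 0, 0]: every window came from the single original sample at index 0

    # Repeat on the first three samples, also requesting character offsets.
    first_three = raw_datasets["train"][:3]
    inputs = tokenizer(
        first_three["question"],
        first_three["context"],
        max_length=100,
        truncation="only_second",
        stride=50,
        return_overflowing_tokens=True,
        return_offsets_mapping=True,
    )
    print(inputs["overflow_to_sample_mapping"])
    # -> e.g. [0, 0, 0, 0, 1, 1, 1, 1, 2, 2, 2, 2]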
Now, to really confirm that this is the case, we can also decode the input IDs. So as you can see, this confirms that each of the first three samples really was split into four parts. This is why we see the same three questions repeated four times.

So for the next step, we're going to go back to looking at a single question and context pair with the same arguments. As you recall, we just added a new argument called return_offsets_mapping. This will return something that will be very useful throughout this notebook. We'll follow the tokenizer call by checking the keys of the output once again. So as you can see, we get one new output, which is called offset_mapping.

So let's take a look at what this offset mapping actually is. So as you can see, it appears to be a list of lists of tuples. Note that I've also made a note in the comments describing what you see, so check that out as well if you need to.

Basically, these tuples show us the character positions of each token in the model input. As you recall, the model input can really be conceptualized as a string, and a string is really an array of characters. So if we think of the input as an array of characters, we can describe where each token is located by its positions in that array. Note that special tokens like CLS and SEP always have the tuple (0, 0).

Otherwise, consider the question for this input. Recall that it's "What is in front of the Notre Dame main building?" The first tuple other than CLS is (0, 4). This makes sense because the first token of the question starts at position zero and is four characters long, as in: the word "what" has four characters in it. The next tuple is (5, 7). This makes sense because the next word, which is "is", starts at the next position and is two characters long. The next tuple is (8, 10), which also makes sense: the next word is "in", which is two characters long. The next tuple is (11, 16), which again makes sense: this corresponds to the word "front", which is five characters long. So hopefully you get the idea. Please feel free to go through the rest of the words in the question to make sure that this makes sense to you. Note that the final token is just one character, which corresponds to the question mark.

Another important detail to notice is that when we look at the context, the indices start back at zero.
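A sketch of decoding the batched output and then inspecting the offsets for a single pair, still under the same assumptions as the earlier snippets:

    # Decoding the batched result shows each of the three questions repeated four times.
    for ids in inputs["input_ids"]:
        print(tokenizer.decode(ids))

    # Back to a single (question, context) pair, now with return_offsets_mapping=True.
    inputs = tokenizer(
        sample["question"],
        sample["context"],
        max_length=100,
        truncation="only_second",
        stride=50,
        return_overflowing_tokens=True,
        return_offsets_mapping=True,
    )
    print(inputs.keys())  # now also contains 'offset_mapping'
    # One list of (start, end) character tuples per window; special tokens map to (0, 0).
    # Question tokens index into the question string: (0, 4) for "what", (5, 7) for "is", ...
    print(inputs["offset_mapping"][0])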
This will be useful to know for any computations you want to do later on.

Now, as you recall, although we only have one question and context pair as input, the tokenizer still produced multiple samples because the context was split up. So let's have a look at the offset mapping corresponding to the next sample. So here we can see where it goes to the next sample, because a new list starts.

Now, as you can see, it starts the same: we have (0, 0) for CLS. Then we have other tuples for the question, which again is "What is in front of the Notre Dame main building?" We again have (0, 0) for SEP. But then we have something interesting: the first context tuple is (174, 180). Importantly, it does not start from zero. What this means is that these tuples tell us the positions of the characters in the original full context. Again, these facts may seem kind of random, but you'll need to know them in order to properly understand the code later on in this notebook.

Now, just to make things more concrete, let's check the size of the offset mapping list. So as you can see, it contains four elements, because the original question and context pair was split into four different model inputs. What may also be useful is checking the length of the elements inside the parent list. As you can see, it contains 100 tuples, since we truncated each input to be size 100. Now, you should note that not all of the inputs will be of this size. As an exercise, try to find one that isn't, and think about why.
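And a short sketch of these final checks, again with the assumed names from above (the specific numbers printed depend on the actual data):

    # The second window: the question offsets repeat, but the first context tuple
    # (e.g. (174, 180)) indexes into the original full context rather than starting at 0.
    print(inputs["offset_mapping"][1])

    print(len(inputs["offset_mapping"]))     # 4: one entry per window for this pair
    print(len(inputs["offset_mapping"][0]))  # 100: one tuple per token (the max_length)
    # Exercise from the lecture: not every window has exactly 100 tuples; find one that doesn't.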