005 Using the Tokenizer in Python (English transcript)

So in this lecture, we will be continuing our notebook from earlier in the session. As you know, the next step will be to load up our tokenizer. In this notebook, we'll be using DistilBERT as our checkpoint, since it trains much faster than BERT. But if you're doing this at home by yourself, I would recommend trying BERT in addition to DistilBERT. As usual, we call AutoTokenizer.from_pretrained, passing in the checkpoint.

The next step is to test our tokenizer on a question and context pair. We'll choose the sample at index one again. In order to understand the results, we'll take the input IDs and call tokenizer.decode to get the result back in English.

So as expected, we get both the question and context concatenated into one string. Note that by convention we'll always put the question first. As you can see, the string starts with the special CLS token as usual. Then we have the question, which is "What is in front of the Notre Dame main building?" Then we have the SEP token, then we have the context, and then it ends with another SEP.

So as you've just seen, the context can be quite long. So long, in fact, that the combination of the context and question might be too long for the model to handle. The accepted strategy for this is to split the context into multiple samples. In order to do this, we need to specify a few additional arguments when the tokenizer is called. Specifically, we set max_length to 100, specifying that the number of input tokens for any given sample should be at most 100 tokens long. We set truncation to "only_second", which means that we will only truncate the context but not the question. As you recall, the context is the second input string. We wouldn't want to truncate the question, since then we wouldn't know what the question is.

And because we split the context into multiple windows, we don't want to accidentally cut off the answer, meaning we don't want part of the answer to be in one window and the other part of the answer to be in another window. So in order to avoid that situation, we set the stride, which in this case has been set to 50. Note that 100 and 50 are just toy values for this experiment; later on, we'll be using different values.
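To make these steps concrete, here is a minimal sketch of the calls described so far. The dataset variable name raw_datasets, the use of load_dataset("squad"), and the checkpoint string "distilbert-base-cased" are assumptions for illustration, not taken from the lecture:

    from datasets import load_dataset
    from transformers import AutoTokenizer

    # Assumed setup: an earlier lecture loaded the SQuAD data; names here are illustrative.
    raw_datasets = load_dataset("squad")
    checkpoint = "distilbert-base-cased"  # assumed checkpoint; try "bert-base-cased" for BERT
    tokenizer = AutoTokenizer.from_pretrained(checkpoint)

    # Tokenize the (question, context) pair at index one; the question goes first by convention.
    sample = raw_datasets["train"][1]
    inputs = tokenizer(sample["question"], sample["context"])
    print(tokenizer.decode(inputs["input_ids"]))
    # -> [CLS] question [SEP] context [SEP]

    # Split the long context into overlapping windows
    # (return_overflowing_tokens is explained just below).
    inputs = tokenizer(
        sample["question"],
        sample["context"],
        max_length=100,             # at most 100 tokens per window (toy value)
        truncation="only_second",   # truncate only the context, never the question
        stride=50,                  # overlap between consecutive windows (toy value)
        return_overflowing_tokens=True,
    )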
Finally, we set return_overflowing_tokens to True, meaning we want to return the tokens in the overlapping parts of the context windows.

So once we've called the tokenizer, note that it will give us multiple samples. This is because the context is split into multiple windows, and it's combined with the question each time. So after we've called the tokenizer, we're going to loop through each of the resulting samples and print the decoded input IDs.

So as you can see, we get four new samples. This means that the context was split into four different parts. Note that the question has not been truncated, which is what we specified. Instead, the context in this case is just a part of the full context which we saw above. And again, note that we have the CLS and SEP tokens in their usual places.

Now, you know that when we call the tokenizer, we don't just get back the token IDs. So let's check the keys of the result to see what we actually got. So in addition to the input IDs and attention mask, which you've already seen, we get another item called overflow_to_sample_mapping. So let's print this out just to see what it is.

So it's just a list of zeroes, which is not that informative. What I'm going to do is tell you right now what this means and then do a more meaningful demonstration in the next step. So what does this mean? Well, as you recall, when we split up the context, we get back multiple samples. However, we may want to know which original sample index each one came from. That's what this tells us. Since we only passed in one original sample, this is going to be zero each time, because they all came from the original sample at index zero.

So just to get a better feel for this overflow_to_sample_mapping, we're going to call the tokenizer once again, but this time on the first three samples instead of just one. We're also going to add one additional argument, return_offsets_mapping, which has been set to True. Again, we print out the overflow_to_sample_mapping results.

So hopefully these results make more sense. As you can see, all of our input samples have been split into four parts. This tells us that the first four outputs correspond to the original sample at index zero, the next four outputs correspond to the original sample at index one, and the final four outputs correspond to the original sample at index two.
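Continuing the sketch with the same assumed names, printing the decoded windows and the overflow_to_sample_mapping, and then repeating the call on the first three samples, might look like this:

    # Each window pairs the full question with part of the context.
    for ids in inputs["input_ids"]:
        print(tokenizer.decode(ids))

    print(inputs.keys())
    # -> includes overflow_to_sample_mapping in addition to input_ids and attention_mask
    print(inputs["overflow_to_sample_mapping"])
    # -> [0, 0, 0, 0]: every window came from the single original sample at index 0

    # Repeat on the first three samples, also requesting character offsets.
    first_three = raw_datasets["train"][:3]
    inputs = tokenizer(
        first_three["question"],
        first_three["context"],
        max_length=100,
        truncation="only_second",
        stride=50,
        return_overflowing_tokens=True,
        return_offsets_mapping=True,
    )
    print(inputs["overflow_to_sample_mapping"])
    # -> e.g. [0, 0, 0, 0, 1, 1, 1, 1, 2, 2, 2, 2]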
Now, to really confirm that this is the case, we can also decode the input IDs. So as you can see, this confirms that each of the first three samples really was split into four parts. This is why we see the same three questions repeated four times.

So for the next step, we're going to go back to looking at a single question and context pair with the same arguments. As you recall, we just added a new argument called return_offsets_mapping. This will return something that will be very useful throughout this notebook. We'll follow the tokenizer call by checking the keys of the output once again. So as you can see, we get one new output, which is called offset_mapping.

So let's take a look at what this offset mapping actually is. So as you can see, it appears to be a list of lists of tuples. Note that I've also made a note in the comments describing what you see, so check that out as well if you need to.

Basically, these tuples show us the character positions of each token in the model input. As you recall, the model input can really be conceptualized as a string, and a string is really an array of characters. So if we think of the input as an array of characters, we can describe where each token is located by its positions in that array. Note that special tokens like CLS and SEP always have the tuple (0, 0).

Otherwise, consider the question for this input. Recall that it's "What is in front of the Notre Dame main building?" The first tuple other than CLS is (0, 4). This makes sense because the first token of the question starts at position zero and is four characters long, as in: the word "what" has four characters in it. The next tuple is (5, 7). This makes sense because the next word, which is "is", starts at the next position and is two characters long. The next tuple is (8, 10), which also makes sense: the next word is "in", which is two characters long. The next tuple is (11, 16), which again makes sense: this corresponds to the word "front", which is five characters long. So hopefully you get the idea. Please feel free to go through the rest of the words in the question to make sure that this makes sense to you. Note that the final token is just one character, which corresponds to the question mark.

Another important detail to notice is that when we look at the context, the indices start back at zero.
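A sketch of decoding the batched output and then inspecting the offsets for a single pair, still under the same assumptions as the earlier snippets:

    # Decoding the batched result shows each of the three questions repeated four times.
    for ids in inputs["input_ids"]:
        print(tokenizer.decode(ids))

    # Back to a single (question, context) pair, now with return_offsets_mapping=True.
    inputs = tokenizer(
        sample["question"],
        sample["context"],
        max_length=100,
        truncation="only_second",
        stride=50,
        return_overflowing_tokens=True,
        return_offsets_mapping=True,
    )
    print(inputs.keys())  # now also contains 'offset_mapping'
    # One list of (start, end) character tuples per window; special tokens map to (0, 0).
    # Question tokens index into the question string: (0, 4) for "what", (5, 7) for "is", ...
    print(inputs["offset_mapping"][0])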
This will be useful to know for any computations you want to do later on.

Now, as you recall, although we only have one question and context pair as input, the tokenizer still produced multiple samples because the context was split up. So let's have a look at the offset mapping corresponding to the next sample. So here we can see where it goes to the next sample, because a new list starts.

Now, as you can see, it starts the same: we have (0, 0) for CLS. Then we have other tuples for the question, which again is "What is in front of the Notre Dame main building?" We again have (0, 0) for SEP. But then we have something interesting: the first context tuple is (174, 180). Importantly, it does not start from zero. What this means is that these tuples tell us the positions of the characters in the original full context. Again, these facts may seem kind of random, but you'll need to know them in order to properly understand the code later on in this notebook.

Now, just to make things more concrete, let's check the size of the offset mapping list. So as you can see, it contains four elements, because the original question and context pair was split into four different model inputs. What may also be useful is checking the length of the elements inside the parent list. As you can see, it contains 100 tuples, since we truncated each input to be size 100. Now, you should note that not all of the inputs will be of this size. As an exercise, try to find one that isn't, and think about why.
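And a short sketch of these final checks, again with the assumed names from above (the specific numbers printed depend on the actual data):

    # The second window: the question offsets repeat, but the first context tuple
    # (e.g. (174, 180)) indexes into the original full context rather than starting at 0.
    print(inputs["offset_mapping"][1])

    print(len(inputs["offset_mapping"]))     # 4: one entry per window for this pair
    print(len(inputs["offset_mapping"][0]))  # 100: one tuple per token (the max_length)
    # Exercise from the lecture: not every window has exactly 100 tuples; find one that doesn't.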