1
00:00:10,940 --> 00:00:15,500
So in this lecture, we will be continuing our notebook from earlier in the session.
2
00:00:16,160 --> 00:00:19,000
As you know, the next step will be to load up our tokenizer.
3
00:00:19,820 --> 00:00:24,890
In this notebook, we'll be using DistilBERT as our checkpoint, since it trains much faster than
4
00:00:24,920 --> 00:00:25,580
BERT.
5
00:00:25,700 --> 00:00:30,600
But if you're doing this at home by yourself, I would recommend trying BERT instead of,
6
00:00:30,620 --> 00:00:32,689
or in addition to, DistilBERT.
7
00:00:34,010 --> 00:00:39,350
As usual, we call AutoTokenizer.from_pretrained, passing in the checkpoint.
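A minimal sketch of that call, assuming the Hugging Face transformers library; the exact checkpoint name here is an assumption:

```python
from transformers import AutoTokenizer

# Assumed checkpoint name; swap in "bert-base-cased" to try BERT instead.
checkpoint = "distilbert-base-cased"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
```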
8
00:00:47,100 --> 00:00:51,510
The next step is to test our tokenizer on a question and context pair.
9
00:00:51,690 --> 00:00:54,060
We'll choose the sample at index one again.
10
00:00:55,820 --> 00:01:02,450
In order to understand the results, we'll take the input IDs and call tokenizer.decode to get the
11
00:01:02,450 --> 00:01:03,890
result back in English.
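A sketch of this step, assuming the SQuAD dataset and the variable names below (the notebook's exact code may differ):

```python
from datasets import load_dataset

raw_datasets = load_dataset("squad")
sample = raw_datasets["train"][1]  # the sample at index one

# Tokenize the (question, context) pair, then decode the token IDs back to text.
inputs = tokenizer(sample["question"], sample["context"])
print(tokenizer.decode(inputs["input_ids"]))
```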
12
00:01:10,500 --> 00:01:14,140
So as expected, we get both the question and context,
13
00:01:14,160 --> 00:01:16,320
concatenated into one string.
14
00:01:17,470 --> 00:01:20,800
Note that by convention we'll always put the question first.
15
00:01:21,190 --> 00:01:25,750
As you can see, the string starts with the special CLS token as usual.
16
00:01:25,840 --> 00:01:30,520
Then we have the question, which is "What is in front of the Notre Dame main building?"
17
00:01:30,700 --> 00:01:35,440
Then we have the SEP token, then we have the context, and then it ends with another SEP.
18
00:01:41,880 --> 00:01:45,420
So as you've just seen, the context can be quite long.
19
00:01:45,450 --> 00:01:51,150
So long, in fact, that the combination of the context and question might be too long for the model
20
00:01:51,150 --> 00:01:53,820
to handle.
21
00:01:53,820 --> 00:01:57,270
The accepted strategy for this is to split the context into multiple samples.
22
00:01:57,810 --> 00:02:03,390
In order to do this, we need to specify a few additional arguments when the tokenizer is called.
23
00:02:03,630 --> 00:02:10,590
Specifically, we set max_length to 100, specifying that the number of input tokens for any given sample
24
00:02:10,590 --> 00:02:13,350
should be at most 100 tokens long.
25
00:02:13,980 --> 00:02:20,100
We set truncation to "only_second", which means that we will only truncate the context but not the
26
00:02:20,100 --> 00:02:20,820
question.
27
00:02:21,180 --> 00:02:24,660
As you recall, the context is the second input string.
28
00:02:25,200 --> 00:02:27,840
We wouldn't want to truncate the question, since then
29
00:02:27,840 --> 00:02:29,790
we wouldn't know what the question is.
30
00:02:30,300 --> 00:02:36,450
And because we split the context into multiple windows, we don't want to accidentally cut off the answer.
31
00:02:36,690 --> 00:02:41,310
Meaning we don't want part of the answer to be in one window and the other part of the answer to be
32
00:02:41,310 --> 00:02:42,450
in another window.
33
00:02:42,840 --> 00:02:48,840
So in order to avoid that situation, we set the stride, which in this case has been set to 50.
34
00:02:49,320 --> 00:02:53,370
Note that 100 and 50 are just toy values for this experiment.
35
00:02:53,550 --> 00:02:56,100
Later on, we'll be using different values.
36
00:02:57,590 --> 00:03:02,030
Finally, we set return_overflowing_tokens to True, meaning
37
00:03:02,030 --> 00:03:06,800
we want to return the tokens in the overlapping parts of the context windows.
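Putting those arguments together, the call might look like the following sketch, reusing the sample from earlier (100 and 50 are the toy values just mentioned):

```python
inputs = tokenizer(
    sample["question"],
    sample["context"],
    max_length=100,                  # each window is at most 100 tokens
    truncation="only_second",        # truncate only the context, never the question
    stride=50,                       # overlap between consecutive context windows
    return_overflowing_tokens=True,  # return all windows, not just the first
)
```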
38
00:03:11,570 --> 00:03:15,980
So once we've called the tokenizer, note that it will give us multiple samples.
39
00:03:16,250 --> 00:03:21,560
This is because the context is split into multiple windows and it's combined with the question each
40
00:03:21,560 --> 00:03:22,190
time.
41
00:03:22,670 --> 00:03:28,010
So after we've called the tokenizer, we're going to loop through each of the resulting samples and
42
00:03:28,010 --> 00:03:30,320
print the decoded input IDs.
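That loop might look like this:

```python
# Each entry in input_ids is one window: the question plus a slice of the context.
for ids in inputs["input_ids"]:
    print(tokenizer.decode(ids))
```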
43
00:03:36,650 --> 00:03:39,830
So as you can see, we get four new samples.
44
00:03:39,950 --> 00:03:43,550
This means that the context was split into four different parts.
45
00:03:44,700 --> 00:03:48,930
Note that the question has not been truncated, which is what we specified.
46
00:03:49,710 --> 00:03:55,350
Instead, the context in this case is just a part of the full context which we saw above.
47
00:03:56,490 --> 00:04:01,470
And again, note that we have the CLS and SEP tokens in their usual places.
48
00:04:04,990 --> 00:04:10,000
Now, you know that when we call the tokenizer, we don't just get back the token IDs.
49
00:04:10,120 --> 00:04:13,990
So let's check the keys of the result to see what we actually got.
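In code, something like:

```python
# The tokenizer output is a dict-like BatchEncoding, so keys() works directly.
print(inputs.keys())
# dict_keys(['input_ids', 'attention_mask', 'overflow_to_sample_mapping'])
```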
50
00:04:19,040 --> 00:04:24,950
So in addition to the input IDs and attention mask, which you've already seen, we get another item
51
00:04:24,950 --> 00:04:27,350
called overflow_to_sample_mapping.
52
00:04:31,810 --> 00:04:34,600
So let's print this out just to see what it is.
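For example:

```python
print(inputs["overflow_to_sample_mapping"])
# [0, 0, 0, 0]
```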
53
00:04:39,630 --> 00:04:42,810
So it's just a list of zeroes, which is not that informative.
54
00:04:43,140 --> 00:04:48,600
What I'm going to do is tell you right now what this means and then do a more meaningful demonstration
55
00:04:48,600 --> 00:04:49,710
in the next step.
56
00:04:50,310 --> 00:04:51,870
So what does this mean?
57
00:04:52,170 --> 00:04:57,630
Well, as you recall, when we split up the context, we get back multiple samples.
58
00:04:57,990 --> 00:05:02,880
However, we may want to know which original sample index each one came from.
59
00:05:03,000 --> 00:05:04,560
That's what this tells us.
60
00:05:04,860 --> 00:05:10,800
Since we only passed in one original sample, this is going to be zero each time because they all came
61
00:05:10,800 --> 00:05:12,960
from the original sample at index zero.
62
00:05:18,570 --> 00:05:23,880
So just to get a better feel for this overflow_to_sample_mapping, we're going to call the tokenizer
63
00:05:23,910 --> 00:05:30,360
once again, but this time on the first three samples instead of just one. We're also going to add
64
00:05:30,360 --> 00:05:35,100
one additional argument, return_offsets_mapping, which has been set to True.
65
00:05:35,670 --> 00:05:39,090
Again, we print out the overflow_to_sample_mapping results.
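A sketch of that call; the slicing syntax assumes a Hugging Face Dataset, where selecting a range returns the columns of those samples:

```python
first_three = raw_datasets["train"][:3]
inputs = tokenizer(
    first_three["question"],
    first_three["context"],
    max_length=100,
    truncation="only_second",
    stride=50,
    return_overflowing_tokens=True,
    return_offsets_mapping=True,  # the new argument
)
print(inputs["overflow_to_sample_mapping"])
# e.g. [0, 0, 0, 0, 1, 1, 1, 1, 2, 2, 2, 2]
```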
66
00:05:45,990 --> 00:05:48,480
So hopefully these results make more sense.
67
00:05:48,900 --> 00:05:53,190
As you can see, all of our input samples have been split into four parts.
68
00:05:53,550 --> 00:05:59,070
This tells us that the first four outputs correspond to the original sample at index zero.
69
00:05:59,400 --> 00:06:05,370
The next four outputs correspond to the original sample at index one, and the final four outputs correspond
70
00:06:05,370 --> 00:06:07,470
to the original sample at index two.
71
00:06:14,320 --> 00:06:19,240
Now to really confirm that this is the case, we can also decode the input IDs.
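For instance:

```python
# One decoded string per window; the question repeats for each window
# that came from the same original sample.
for ids in inputs["input_ids"]:
    print(tokenizer.decode(ids))
```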
72
00:06:25,850 --> 00:06:32,450
So as you can see, this confirms that each of the first three samples really was split into four parts.
73
00:06:32,660 --> 00:06:36,530
This is why we see the same three questions repeated four times.
74
00:06:43,070 --> 00:06:48,890
So for the next step, we're going to go back to looking at a single question and context pair with the
75
00:06:48,890 --> 00:06:50,000
same arguments.
76
00:06:50,330 --> 00:06:54,830
As you recall, we just added a new argument called return_offsets_mapping.
77
00:06:55,100 --> 00:06:58,850
This will return something that will be very useful throughout this notebook.
78
00:07:00,190 --> 00:07:04,870
We'll follow the tokenizer call by checking the keys of the output once again.
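A sketch of that call, back on the single sample from before:

```python
inputs = tokenizer(
    sample["question"],
    sample["context"],
    max_length=100,
    truncation="only_second",
    stride=50,
    return_overflowing_tokens=True,
    return_offsets_mapping=True,
)
print(inputs.keys())
# ...now also contains 'offset_mapping'
```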
79
00:07:11,390 --> 00:07:15,980
So as you can see, we get one new output, which is called offset_mapping.
80
00:07:24,410 --> 00:07:28,220
So let's take a look at what this offset_mapping actually is.
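For example:

```python
# One list of (start_char, end_char) tuples per window.
print(inputs["offset_mapping"])
```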
81
00:07:37,560 --> 00:07:41,670
So as you can see, it appears to be a list of lists of tuples.
82
00:07:44,780 --> 00:07:48,110
Note that I've also added a comment describing what you see.
83
00:07:48,140 --> 00:07:50,330
So check that out as well if you need to.
84
00:07:53,300 --> 00:07:59,540
Basically these tuples show us the character positions of each token in the model input.
85
00:07:59,990 --> 00:08:06,140
As you recall, the model input can really be conceptualized as a string, and a string is really an
86
00:08:06,140 --> 00:08:07,430
array of characters.
87
00:08:07,610 --> 00:08:13,700
So if we think of the input as an array of characters, we can describe where each token is located
88
00:08:13,700 --> 00:08:16,310
by its positions in that array.
89
00:08:16,970 --> 00:08:22,010
Note that special tokens like CLS and SEP always have the tuple (0, 0).
90
00:08:22,760 --> 00:08:25,640
Otherwise, consider the question for this input.
91
00:08:26,060 --> 00:08:29,750
Recall that it's "What is in front of the Notre Dame main building?"
92
00:08:30,200 --> 00:08:33,620
The first tuple, other than the one for CLS, is (0, 4).
93
00:08:34,159 --> 00:08:40,309
This makes sense because the first token of the question starts at position zero and is four characters
94
00:08:40,309 --> 00:08:44,450
long, as the word "what" has four characters in it.
95
00:08:45,260 --> 00:08:47,210
The next tuple is (5, 7).
96
00:08:47,540 --> 00:08:53,450
This makes sense because the next word, which is "is", starts at the next position and is two characters
97
00:08:53,450 --> 00:08:54,050
long.
98
00:08:54,950 --> 00:08:58,310
The next tuple is (8, 10), which also makes sense.
99
00:08:58,640 --> 00:09:01,580
The next word is "in", which is two characters long.
100
00:09:02,330 --> 00:09:06,200
The next tuple is (11, 16), which again makes sense.
101
00:09:06,410 --> 00:09:10,370
This corresponds to the word "front", which is five characters long.
102
00:09:10,730 --> 00:09:12,710
So hopefully you get the idea.
103
00:09:13,010 --> 00:09:17,660
Please feel free to go through the rest of the words in the question to make sure that this makes sense
104
00:09:17,660 --> 00:09:18,260
to you.
105
00:09:18,800 --> 00:09:23,510
Note that the final token is just one character which corresponds to the question mark.
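As a quick sanity check, slicing the question string with those (start, end) pairs recovers each token:

```python
question = "What is in front of the Notre Dame main building?"
print(question[0:4])    # 'What'  -> (0, 4)
print(question[5:7])    # 'is'    -> (5, 7)
print(question[8:10])   # 'in'    -> (8, 10)
print(question[11:16])  # 'front' -> (11, 16)
```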
106
00:09:29,830 --> 00:09:36,460
Another important detail to notice is that when we look at the context, the indices start back at zero.
107
00:09:36,760 --> 00:09:40,930
This will be useful to know for any computations you want to do later on.
108
00:09:42,220 --> 00:09:48,160
Now, as you recall, although we only have one question and context pair as input, the tokenizer
109
00:09:48,160 --> 00:09:52,000
still produced multiple samples because the context was split up.
110
00:09:52,420 --> 00:09:56,620
So let's have a look at the offset mapping corresponding to the next sample.
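For example:

```python
print(inputs["offset_mapping"][1])  # offsets for the second window
```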
111
00:10:06,310 --> 00:10:10,780
So here we can see that it goes to the next sample because a new list starts.
112
00:10:14,060 --> 00:10:16,300
Now, as you can see, it starts the same:
113
00:10:16,310 --> 00:10:18,560
we have (0, 0) for CLS.
114
00:10:18,890 --> 00:10:24,020
Then we have other tuples for the question, which again is "What is in front of the Notre Dame main
115
00:10:24,020 --> 00:10:25,490
building?"
116
00:10:26,750 --> 00:10:28,910
We again have (0, 0) for SEP.
117
00:10:33,520 --> 00:10:35,440
But then we have something interesting.
118
00:10:35,830 --> 00:10:39,550
The first context tuple is (174, 180).
119
00:10:40,000 --> 00:10:42,460
Importantly, it does not start from zero.
120
00:10:43,030 --> 00:10:49,840
What this means is that these tuples tell us the positions of the characters in the original full context.
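As a check (variable names follow the earlier sketch, and (174, 180) is the tuple seen above), slicing the original context at those positions should recover that first context token:

```python
start, end = 174, 180
print(sample["context"][start:end])
```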
121
00:10:50,470 --> 00:10:55,810
Again, these facts may seem kind of random, but you'll need to know them in order to properly understand
122
00:10:55,810 --> 00:10:56,600
the code
123
00:10:56,620 --> 00:10:58,390
later on in this notebook.
124
00:11:04,670 --> 00:11:10,100
Now, just to make things more concrete, let's check the size of the offset mapping list.
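For example:

```python
print(len(inputs["offset_mapping"]))  # 4, one entry per window
```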
125
00:11:13,610 --> 00:11:19,580
So as you can see, it contains four elements because the original question context pair was split into
126
00:11:19,580 --> 00:11:21,320
four different model inputs.
127
00:11:25,430 --> 00:11:30,710
What may also be useful is checking the length of the elements inside the parent list.
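For example:

```python
print(len(inputs["offset_mapping"][0]))  # 100, matching max_length
```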
128
00:11:34,180 --> 00:11:40,330
As you can see, it contains 100 tuples since we truncated each input to be size 100.
129
00:11:41,440 --> 00:11:45,430
Now you should note that not all of the inputs will be of this size.
130
00:11:45,700 --> 00:11:49,780
As an exercise, try to find one that isn't and think about why.