003 Exploring the Dataset (SQuAD) in Python

In this lecture, we will begin looking at our question answering notebook. As usual, we begin by installing transformers and datasets.

The next step is to import the function load_dataset. We then call this function, passing in the string "squad". We'll call the result raw_datasets.

Note that the dataset already comes split into two parts, one for train and one for validation. The train set has about 88,000 samples, while the validation set has about 10,000. Note that each dataset comes with the columns id, title, context, question, and answers.

So far in this course, we haven't had much need for the id column, but in this section it will be crucial. On the other hand, we won't be using the title.

Let's look at one of the titles anyway, just to see what it is. As you can see, it says University of Notre Dame. Now let's look at the context for the same sample. As you can see, it's the same context I showed you in the previous lecture. Now let's look at the corresponding question. It says, "What is in front of the Notre Dame main building?" Finally, let's look at the answers. The answer is "a copper statue of Christ".
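The loading and inspection steps described above can be sketched roughly as follows. This is a minimal sketch, not the notebook's actual code: the helper names load_raw_datasets and describe_sample are hypothetical, and the toy sample uses a placeholder id and a shortened placeholder context so that the example runs without downloading anything.

```python
def load_raw_datasets():
    """Download SQuAD; needs `pip install datasets` and network access."""
    # Lazy import so the rest of this sketch runs without the library.
    from datasets import load_dataset
    return load_dataset("squad")  # has "train" and "validation" splits


def describe_sample(sample):
    """Return one printable line per SQuAD column of a single sample."""
    columns = ["id", "title", "context", "question", "answers"]
    return [f"{name}: {sample[name]!r}" for name in columns]


# Hypothetical toy sample in the SQuAD layout. The id and the shortened
# context are placeholders, not the real dataset entry.
toy_context = "In front of the main building is a copper statue of Christ."
toy_answer = "a copper statue of Christ"
toy_sample = {
    "id": "toy-0",
    "title": "University of Notre Dame",
    "context": toy_context,
    "question": "What is in front of the Notre Dame main building?",
    "answers": {
        "text": [toy_answer],
        # answer_start is the character offset of the answer in the context.
        "answer_start": [toy_context.index(toy_answer)],
    },
}

for line in describe_sample(toy_sample):
    print(line)

# With the real data, something like:
#   raw_datasets = load_raw_datasets()
#   describe_sample(raw_datasets["train"][0])
```

Note that answers is a dictionary holding two parallel lists, text and answer_start, which is why the answer appears inside a list even when there is only one.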
Now, as you've seen, the answers are stored in a list, and the column name is the plural "answers", indicating that there can be multiple answers per input. I stated that although this is always possible, it is not the case for the train set. So in this line, we simply check how many samples in the train set have a list of answers with any length not equal to one.

As you can see, after we apply this filter, we find that zero samples have an answer list with length not equal to one. This means that for the train set, every input has precisely one answer.

The next step is to check out one of the answers from the validation set. As you can see, this particular sample has three answers. So for the validation set, it is possible for one question to have multiple answers.

Let's now check the corresponding context to see why this question may have multiple answers. We find the text: "The game was played on February 7, 2016, at Levi's Stadium in the San Francisco Bay Area at Santa Clara, California." This explains the three possible answers we saw above. In fact, there are probably even more valid answers. For instance, we could say "San Francisco Bay Area" or simply "Bay Area", although these are not in the dataset.

Now, just for completeness' sake, let's check out the question.
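The answer-count check described above can be sketched in plain Python as follows. This is an illustration under stated assumptions: count_not_single_answer and make_sample are hypothetical helpers, the Notre Dame context is a shortened placeholder, and the three answer strings are substrings of the quoted sentence chosen for illustration, which may not match the dataset's stored answers exactly.

```python
def count_not_single_answer(samples):
    """Count samples whose answer list has a length other than one."""
    return sum(1 for s in samples if len(s["answers"]["text"]) != 1)


def make_sample(question, context, answer_texts):
    """Build a toy SQuAD-style sample, locating each answer in the context."""
    return {
        "question": question,
        "context": context,
        "answers": {
            "text": list(answer_texts),
            "answer_start": [context.index(t) for t in answer_texts],
        },
    }


# Toy "train" split: one sample with exactly one answer (placeholder context).
toy_train = [
    make_sample(
        "What is in front of the Notre Dame main building?",
        "In front of the main building is a copper statue of Christ.",
        ["a copper statue of Christ"],
    )
]

# Toy "validation" split: one question with three acceptable answers.
super_bowl_context = (
    "The game was played on February 7, 2016, at Levi's Stadium in the "
    "San Francisco Bay Area at Santa Clara, California."
)
toy_validation = [
    make_sample(
        "Where did Super Bowl 50 take place?",
        super_bowl_context,
        [
            "Santa Clara, California",
            "Levi's Stadium",
            "Levi's Stadium in the San Francisco Bay Area",
        ],
    )
]

print(count_not_single_answer(toy_train))       # every toy train sample has one answer
print(count_not_single_answer(toy_validation))  # the toy validation sample has three

# With the real data, the same check could be written with Dataset.filter:
#   raw_datasets["train"].filter(lambda x: len(x["answers"]["text"]) != 1)
```

All three toy answers are spans of the same context, which is the reason a single question can legitimately have several correct answers.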
As you can see, the question is, as expected, "Where did Super Bowl 50 take place?"

Now, one weird aspect of this dataset is that in some cases where there are multiple answers, the answers are actually all the same. Let's check the first sample to see an example. As you can see, we get "Denver Broncos" three times. Now, you might think that these could be different answers if the same string shows up multiple times in the context. But in this case, we see that the answer's start index is the same in all three cases.
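One way to tell true duplicates apart from repeated occurrences of the same string is to compare the text and the start index together. The sketch below assumes the SQuAD answers layout (parallel text and answer_start lists); the helper name all_answers_identical and both toy contexts are hypothetical, written only for this illustration.

```python
def all_answers_identical(answers):
    """True if there are several answers and every (text, start) pair matches."""
    pairs = set(zip(answers["text"], answers["answer_start"]))
    return len(answers["text"]) > 1 and len(pairs) == 1


# Toy case 1: the same answer at the same offset, listed three times.
context_a = "The Denver Broncos played in Super Bowl 50."
start_a = context_a.index("Denver Broncos")
duplicated = {
    "text": ["Denver Broncos"] * 3,
    "answer_start": [start_a] * 3,
}

# Toy case 2: the same string, but taken from two different positions.
context_b = "Denver Broncos fans cheered the Denver Broncos."
repeated_in_context = {
    "text": ["Denver Broncos", "Denver Broncos"],
    "answer_start": [
        context_b.index("Denver Broncos"),   # first occurrence
        context_b.rindex("Denver Broncos"),  # last occurrence
    ],
}

print(all_answers_identical(duplicated))           # True
print(all_answers_identical(repeated_in_context))  # False
```

Comparing the pairs rather than the text alone is what separates the two situations: identical text with identical offsets is a genuine duplicate, while identical text with different offsets points to distinct spans.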