subtitlecat.com

All language subtitles for 029 Object Detection - Step 3-en

Afrikaans

Akan

Albanian

Amharic

Arabic

Armenian

Azerbaijani

Basque

Belarusian

Bemba

Bengali

Bihari

Bosnian

Breton

Bulgarian

Cambodian

Catalan

Cebuano

Cherokee

Chichewa

Chinese (Simplified)

Chinese (Traditional)

Corsican

Croatian

Czech

Danish

Dutch

English

Esperanto

Estonian

Ewe

Faroese

Filipino

Finnish

French

Frisian

Galician

Georgian

German

Greek

Guarani

Gujarati

Haitian Creole

Hausa

Hawaiian

Hebrew

Hindi

Hmong

Hungarian

Icelandic

Igbo

Indonesian

Interlingua

Irish

Italian

Japanese

Javanese

Kannada

Kazakh

Kinyarwanda

Kirundi

Kongo

Korean

Krio (Sierra Leone)

Kurdish

Kurdish (Soranî)

Kyrgyz

Laothian

Latin

Latvian

Lingala

Lithuanian

Lozi

Luganda

Luo

Luxembourgish

Macedonian

Malagasy

Malay

Malayalam

Maltese

Maori

Marathi

Mauritian Creole

Moldavian

Mongolian

Myanmar (Burmese)

Montenegrin

Nepali

Nigerian Pidgin

Northern Sotho

Norwegian

Norwegian (Nynorsk)

Occitan

Oriya

Oromo

Pashto

Persian

Polish

Portuguese (Brazil)

Portuguese (Portugal)

Punjabi

Quechua

Romanian

Romansh

Runyakitara

Russian

Samoan

Scots Gaelic

Serbian

Serbo-Croatian

Sesotho

Setswana

Seychellois Creole

Shona

Sindhi

Sinhalese

Slovak

Slovenian

Somali

Spanish

Spanish (Latin American)

Sundanese

Swahili

Swedish

Tajik

Tamil

Tatar

Telugu

Thai

Tigrinya

Tonga

Tshiluba

Tumbuka

Turkish

Turkmen

Twi

Uighur

Ukrainian

Urdu

Uzbek

Vietnamese Download

Welsh

Wolof

Xhosa

Yiddish

Yoruba

Zulu

Would you like to inspect the original subtitles? These are the user uploaded subtitles that are being translated: 1 00:00:00,540 --> 00:00:07,080 Hello and welcome to this new tutorial today in this to all we are going to define the detect function 2 00:00:07,380 --> 00:00:12,660 that will do the detections exactly like what we did for open city only this time it is not going to 3 00:00:12,660 --> 00:00:14,260 be based on open city. 4 00:00:14,370 --> 00:00:20,340 It's going to be based on deep learning because we're going to do that detection through the SSD model 5 00:00:20,430 --> 00:00:23,350 single shot multi-book detection. 6 00:00:23,400 --> 00:00:28,310 So excited to start because this time we're really taking things at the next level. 7 00:00:28,320 --> 00:00:29,710 All right so let's do this. 8 00:00:29,760 --> 00:00:35,100 The first thing very important to understand is that exactly like before we are going to do a frame 9 00:00:35,100 --> 00:00:42,180 by frame detection that is that the detect function that we're about to implement will work on single 10 00:00:42,270 --> 00:00:49,170 images it will not do the detection on the video directly it will do the detection on each single image 11 00:00:49,380 --> 00:00:56,940 of the video and then using some tricks with actually image IO we will manage to extract all the frames 12 00:00:57,030 --> 00:01:02,880 of the video apply the detect function on the frames and then reassemble the whole thing to make the 13 00:01:02,880 --> 00:01:07,320 video with the rectangles detecting the dog and the humans. 14 00:01:07,350 --> 00:01:11,530 And Carol Carol is actually going to be detected on just one frame you'll see. 15 00:01:11,580 --> 00:01:15,150 But anyway that's frame by frame detection. 16 00:01:15,360 --> 00:01:17,490 That's the first thing important to understand. 17 00:01:17,500 --> 00:01:20,430 And now let's start to implement that function. 18 00:01:20,460 --> 00:01:27,530 So as usual we start with that to define a new function that we need to give a name to these functions 19 00:01:27,540 --> 00:01:29,920 we are going to call it detect. 20 00:01:30,180 --> 00:01:31,090 Just like that. 21 00:01:31,140 --> 00:01:36,560 And now we need to specify the arguments that this function is going to take. 22 00:01:36,660 --> 00:01:39,530 So this function detector is going to take three arguments. 23 00:01:39,720 --> 00:01:46,770 The first one is the image on which the detect function is going to be applied to detect the object. 24 00:01:46,770 --> 00:01:50,970 That's our first argument and we're going to call it Freyne. 25 00:01:51,200 --> 00:02:00,360 Now as opposed to before with open C we only need to input as arguments the frame and not the great 26 00:02:00,480 --> 00:02:01,130 image. 27 00:02:01,140 --> 00:02:06,710 We only need to put the original image here the frame the original frame in color and not the grayscale 28 00:02:06,720 --> 00:02:12,490 because remember we had to do this before because Open City only works on great images. 29 00:02:12,690 --> 00:02:17,370 But here we're doing something totally different and therefore we don't have to take the black and white 30 00:02:17,370 --> 00:02:18,920 version of the frame. 31 00:02:18,960 --> 00:02:26,350 All right so that's our first argument then the second argument will be net which will be the SS The 32 00:02:26,400 --> 00:02:33,780 neural network the single shots multi-book detection neural network and then the third and last argument 33 00:02:33,900 --> 00:02:40,320 is transform because there is going to be some transformations applied to the image but not to put them 34 00:02:40,440 --> 00:02:46,970 in black and white just to make sure that the images are compatible with the new one that work you know 35 00:02:47,130 --> 00:02:51,630 the images will be the input of the neural network and therefore they have to have a certain format 36 00:02:51,870 --> 00:02:59,310 and this transform argument that is the final argument of this detect function will transform the images 37 00:02:59,400 --> 00:03:02,950 so that they have the right format to get into the new network. 38 00:03:02,970 --> 00:03:03,720 All right. 39 00:03:03,870 --> 00:03:06,950 And then it's important to understand what this function will do. 40 00:03:07,020 --> 00:03:13,880 So as you might have guessed it will do the detections on the images the single images one by one. 41 00:03:14,130 --> 00:03:17,660 But what exactly is this function going to return. 42 00:03:17,850 --> 00:03:25,350 Well it will simply return this same frame but with the rectangle detecting the objects the dog and 43 00:03:25,350 --> 00:03:31,370 the humans and not only will there be direct Englebert Also there will be the label on the rectangle 44 00:03:31,380 --> 00:03:37,560 So we'll see some rectangles with the label humans because there are several persons on the video and 45 00:03:37,650 --> 00:03:39,810 one rectangle with the label Doug. 46 00:03:39,990 --> 00:03:44,850 So there will be everything that will be totally clear and amazing detection on the video. 47 00:03:44,850 --> 00:03:47,490 I can't wait to show you this. 48 00:03:47,550 --> 00:03:48,090 All right. 49 00:03:48,090 --> 00:03:54,520 So there are three arguments and now we're ready to go inside the function to define what we wanted 50 00:03:54,540 --> 00:03:56,250 to do. 51 00:03:56,250 --> 00:04:01,710 All right so the first thing we have to do is to get the height and the weight of the image and the 52 00:04:01,710 --> 00:04:06,210 frame the frame that is the argument here on which the function is applied. 53 00:04:06,240 --> 00:04:08,790 So we're going to introduce two new variables. 54 00:04:08,880 --> 00:04:19,310 Hide and with right and to get these height and width of the frame we're going to do that very efficiently. 55 00:04:19,540 --> 00:04:26,370 We're going to get it from our frame obviously because this is some information specific to the frame. 56 00:04:26,620 --> 00:04:29,530 And then this frame has some attributes. 57 00:04:29,530 --> 00:04:35,910 One of them is shape and shape is actually an attribute that returns a vector of three elements. 58 00:04:35,950 --> 00:04:38,820 The first one is the height of the frame. 59 00:04:38,830 --> 00:04:45,790 The second one therefore of index One is the width of the frame and is there one of index 2 is the number 60 00:04:45,790 --> 00:04:46,410 of channels. 61 00:04:46,510 --> 00:04:51,850 So the number of channels means that if you have a black and white image you will have one channel and 62 00:04:51,860 --> 00:04:56,010 if you have a color image you will have three channels for red blue and green. 63 00:04:56,290 --> 00:04:59,400 But we just want the height and width. 64 00:04:59,560 --> 00:05:05,710 So we're going to get the first index which is zero corresponding to the height and the second index 65 00:05:05,920 --> 00:05:08,270 one corresponding to the width. 66 00:05:08,470 --> 00:05:14,920 But the correct way to do this is actually to type a colon here and then to because that means we're 67 00:05:14,920 --> 00:05:19,090 taking the range from zero to two but with two excluded. 68 00:05:19,130 --> 00:05:23,900 So we're just taking zero and 1 and therefore we're taking the hide and do it. 69 00:05:24,340 --> 00:05:24,750 All right. 70 00:05:24,790 --> 00:05:26,240 So that's the first thing we had to do. 71 00:05:26,350 --> 00:05:32,590 And now we're going to do several transformations to go from the original image which is our friend 72 00:05:32,590 --> 00:05:39,000 right now to a torch variable that will be accepted into the as is the neural network. 73 00:05:39,070 --> 00:05:44,200 So there is a series of transformations to do before getting to this torche viable. 74 00:05:44,200 --> 00:05:51,580 The first one is to apply the transform transformation to make sure that the image has the right format. 75 00:05:51,580 --> 00:05:54,990 That is the right dimensions and the right color values. 76 00:05:55,000 --> 00:05:57,070 That's the first transformation we need to make. 77 00:05:57,220 --> 00:06:04,770 Once we have done this transformation then we will need to convert this transform frame from a number 78 00:06:04,850 --> 00:06:10,250 array because it will still be an entire array from a number of array to a torch tensor. 79 00:06:10,300 --> 00:06:11,380 That's not the same. 80 00:06:11,380 --> 00:06:18,520 That sensor is a more advanced matrix A more advanced array and therefore that's exactly what is our 81 00:06:18,520 --> 00:06:19,910 second transformation. 82 00:06:19,930 --> 00:06:26,890 We convert the transform frame from an umpire array to a torch tensor then that's not all we will need 83 00:06:26,890 --> 00:06:33,890 to do a third transformation which will be to add a fake dimension to the torch sensor and that fig 84 00:06:33,890 --> 00:06:35,840 damage and will respond to the batch. 85 00:06:35,950 --> 00:06:42,510 And then finally the fourth and final transformation to do before it is ready to go into the new one 86 00:06:42,510 --> 00:06:46,840 that work will be to convert it into a torch variable. 87 00:06:46,840 --> 00:06:53,990 Remember this variable closets we imported here is a class that converts a torch sensor into a torch 88 00:06:54,040 --> 00:06:58,620 variable that contains both the tensor and a gradient. 89 00:06:58,740 --> 00:07:04,420 And this torch variable will then be an element of the dynamic graph which will allow us later to do 90 00:07:04,420 --> 00:07:09,560 some very fast and efficient computation of the gradients during backward propagation. 91 00:07:09,850 --> 00:07:12,490 So there we go we have four transformations to make. 92 00:07:12,490 --> 00:07:15,820 We'll make them in the next tutorial starting with the first one. 93 00:07:16,000 --> 00:07:21,310 And then finally we'll be able to feed the neural network with this image. 94 00:07:21,550 --> 00:07:22,610 So let's match this. 95 00:07:22,660 --> 00:07:24,760 And until then enjoy computer vision. 10799