Would you like to inspect the original subtitles? These are the user uploaded subtitles that are being translated:
1
00:00:00,540 --> 00:00:07,080
Hello and welcome to this new tutorial today in this to all we are going to define the detect function
2
00:00:07,380 --> 00:00:12,660
that will do the detections exactly like what we did for open city only this time it is not going to
3
00:00:12,660 --> 00:00:14,260
be based on open city.
4
00:00:14,370 --> 00:00:20,340
It's going to be based on deep learning because we're going to do that detection through the SSD model
5
00:00:20,430 --> 00:00:23,350
single shot multi-book detection.
6
00:00:23,400 --> 00:00:28,310
So excited to start because this time we're really taking things at the next level.
7
00:00:28,320 --> 00:00:29,710
All right so let's do this.
8
00:00:29,760 --> 00:00:35,100
The first thing very important to understand is that exactly like before we are going to do a frame
9
00:00:35,100 --> 00:00:42,180
by frame detection that is that the detect function that we're about to implement will work on single
10
00:00:42,270 --> 00:00:49,170
images it will not do the detection on the video directly it will do the detection on each single image
11
00:00:49,380 --> 00:00:56,940
of the video and then using some tricks with actually image IO we will manage to extract all the frames
12
00:00:57,030 --> 00:01:02,880
of the video apply the detect function on the frames and then reassemble the whole thing to make the
13
00:01:02,880 --> 00:01:07,320
video with the rectangles detecting the dog and the humans.
14
00:01:07,350 --> 00:01:11,530
And Carol Carol is actually going to be detected on just one frame you'll see.
15
00:01:11,580 --> 00:01:15,150
But anyway that's frame by frame detection.
16
00:01:15,360 --> 00:01:17,490
That's the first thing important to understand.
17
00:01:17,500 --> 00:01:20,430
And now let's start to implement that function.
18
00:01:20,460 --> 00:01:27,530
So as usual we start with that to define a new function that we need to give a name to these functions
19
00:01:27,540 --> 00:01:29,920
we are going to call it detect.
20
00:01:30,180 --> 00:01:31,090
Just like that.
21
00:01:31,140 --> 00:01:36,560
And now we need to specify the arguments that this function is going to take.
22
00:01:36,660 --> 00:01:39,530
So this function detector is going to take three arguments.
23
00:01:39,720 --> 00:01:46,770
The first one is the image on which the detect function is going to be applied to detect the object.
24
00:01:46,770 --> 00:01:50,970
That's our first argument and we're going to call it Freyne.
25
00:01:51,200 --> 00:02:00,360
Now as opposed to before with open C we only need to input as arguments the frame and not the great
26
00:02:00,480 --> 00:02:01,130
image.
27
00:02:01,140 --> 00:02:06,710
We only need to put the original image here the frame the original frame in color and not the grayscale
28
00:02:06,720 --> 00:02:12,490
because remember we had to do this before because Open City only works on great images.
29
00:02:12,690 --> 00:02:17,370
But here we're doing something totally different and therefore we don't have to take the black and white
30
00:02:17,370 --> 00:02:18,920
version of the frame.
31
00:02:18,960 --> 00:02:26,350
All right so that's our first argument then the second argument will be net which will be the SS The
32
00:02:26,400 --> 00:02:33,780
neural network the single shots multi-book detection neural network and then the third and last argument
33
00:02:33,900 --> 00:02:40,320
is transform because there is going to be some transformations applied to the image but not to put them
34
00:02:40,440 --> 00:02:46,970
in black and white just to make sure that the images are compatible with the new one that work you know
35
00:02:47,130 --> 00:02:51,630
the images will be the input of the neural network and therefore they have to have a certain format
36
00:02:51,870 --> 00:02:59,310
and this transform argument that is the final argument of this detect function will transform the images
37
00:02:59,400 --> 00:03:02,950
so that they have the right format to get into the new network.
38
00:03:02,970 --> 00:03:03,720
All right.
39
00:03:03,870 --> 00:03:06,950
And then it's important to understand what this function will do.
40
00:03:07,020 --> 00:03:13,880
So as you might have guessed it will do the detections on the images the single images one by one.
41
00:03:14,130 --> 00:03:17,660
But what exactly is this function going to return.
42
00:03:17,850 --> 00:03:25,350
Well it will simply return this same frame but with the rectangle detecting the objects the dog and
43
00:03:25,350 --> 00:03:31,370
the humans and not only will there be direct Englebert Also there will be the label on the rectangle
44
00:03:31,380 --> 00:03:37,560
So we'll see some rectangles with the label humans because there are several persons on the video and
45
00:03:37,650 --> 00:03:39,810
one rectangle with the label Doug.
46
00:03:39,990 --> 00:03:44,850
So there will be everything that will be totally clear and amazing detection on the video.
47
00:03:44,850 --> 00:03:47,490
I can't wait to show you this.
48
00:03:47,550 --> 00:03:48,090
All right.
49
00:03:48,090 --> 00:03:54,520
So there are three arguments and now we're ready to go inside the function to define what we wanted
50
00:03:54,540 --> 00:03:56,250
to do.
51
00:03:56,250 --> 00:04:01,710
All right so the first thing we have to do is to get the height and the weight of the image and the
52
00:04:01,710 --> 00:04:06,210
frame the frame that is the argument here on which the function is applied.
53
00:04:06,240 --> 00:04:08,790
So we're going to introduce two new variables.
54
00:04:08,880 --> 00:04:19,310
Hide and with right and to get these height and width of the frame we're going to do that very efficiently.
55
00:04:19,540 --> 00:04:26,370
We're going to get it from our frame obviously because this is some information specific to the frame.
56
00:04:26,620 --> 00:04:29,530
And then this frame has some attributes.
57
00:04:29,530 --> 00:04:35,910
One of them is shape and shape is actually an attribute that returns a vector of three elements.
58
00:04:35,950 --> 00:04:38,820
The first one is the height of the frame.
59
00:04:38,830 --> 00:04:45,790
The second one therefore of index One is the width of the frame and is there one of index 2 is the number
60
00:04:45,790 --> 00:04:46,410
of channels.
61
00:04:46,510 --> 00:04:51,850
So the number of channels means that if you have a black and white image you will have one channel and
62
00:04:51,860 --> 00:04:56,010
if you have a color image you will have three channels for red blue and green.
63
00:04:56,290 --> 00:04:59,400
But we just want the height and width.
64
00:04:59,560 --> 00:05:05,710
So we're going to get the first index which is zero corresponding to the height and the second index
65
00:05:05,920 --> 00:05:08,270
one corresponding to the width.
66
00:05:08,470 --> 00:05:14,920
But the correct way to do this is actually to type a colon here and then to because that means we're
67
00:05:14,920 --> 00:05:19,090
taking the range from zero to two but with two excluded.
68
00:05:19,130 --> 00:05:23,900
So we're just taking zero and 1 and therefore we're taking the hide and do it.
69
00:05:24,340 --> 00:05:24,750
All right.
70
00:05:24,790 --> 00:05:26,240
So that's the first thing we had to do.
71
00:05:26,350 --> 00:05:32,590
And now we're going to do several transformations to go from the original image which is our friend
72
00:05:32,590 --> 00:05:39,000
right now to a torch variable that will be accepted into the as is the neural network.
73
00:05:39,070 --> 00:05:44,200
So there is a series of transformations to do before getting to this torche viable.
74
00:05:44,200 --> 00:05:51,580
The first one is to apply the transform transformation to make sure that the image has the right format.
75
00:05:51,580 --> 00:05:54,990
That is the right dimensions and the right color values.
76
00:05:55,000 --> 00:05:57,070
That's the first transformation we need to make.
77
00:05:57,220 --> 00:06:04,770
Once we have done this transformation then we will need to convert this transform frame from a number
78
00:06:04,850 --> 00:06:10,250
array because it will still be an entire array from a number of array to a torch tensor.
79
00:06:10,300 --> 00:06:11,380
That's not the same.
80
00:06:11,380 --> 00:06:18,520
That sensor is a more advanced matrix A more advanced array and therefore that's exactly what is our
81
00:06:18,520 --> 00:06:19,910
second transformation.
82
00:06:19,930 --> 00:06:26,890
We convert the transform frame from an umpire array to a torch tensor then that's not all we will need
83
00:06:26,890 --> 00:06:33,890
to do a third transformation which will be to add a fake dimension to the torch sensor and that fig
84
00:06:33,890 --> 00:06:35,840
damage and will respond to the batch.
85
00:06:35,950 --> 00:06:42,510
And then finally the fourth and final transformation to do before it is ready to go into the new one
86
00:06:42,510 --> 00:06:46,840
that work will be to convert it into a torch variable.
87
00:06:46,840 --> 00:06:53,990
Remember this variable closets we imported here is a class that converts a torch sensor into a torch
88
00:06:54,040 --> 00:06:58,620
variable that contains both the tensor and a gradient.
89
00:06:58,740 --> 00:07:04,420
And this torch variable will then be an element of the dynamic graph which will allow us later to do
90
00:07:04,420 --> 00:07:09,560
some very fast and efficient computation of the gradients during backward propagation.
91
00:07:09,850 --> 00:07:12,490
So there we go we have four transformations to make.
92
00:07:12,490 --> 00:07:15,820
We'll make them in the next tutorial starting with the first one.
93
00:07:16,000 --> 00:07:21,310
And then finally we'll be able to feed the neural network with this image.
94
00:07:21,550 --> 00:07:22,610
So let's match this.
95
00:07:22,660 --> 00:07:24,760
And until then enjoy computer vision.
10799
Can't find what you're looking for?
Get subtitles in any language from opensubtitles.com, and translate them here.