subtitlecat.com

10. Cost Functions and Gradient Descent

Welcome back, everyone. In this lecture we're going to talk about cost functions, which allow us to measure how far off the output predictions of our neural network are. Then we'll talk about gradient descent, which helps us minimize that cost, or error. You'll often hear cost functions called loss functions or error functions. It's a really interesting topic, and it's fundamental to understanding how a neural network actually learns.

We already understand that neural networks take in inputs at the first layer, multiply them by weights, and add biases to them. The result may then be passed through an activation function, such as the sigmoid function or the rectified linear unit (ReLU), and goes on to another layer with another set of weights and biases, and so on, all the way to the last output layer. We can refer to the output of that last layer as y hat (ŷ). That's essentially the model's estimate of what it predicts the label to be.

After the network creates that prediction, we have two main questions. How do we actually evaluate it against the true label?
And then, after the evaluation, how can we update the network's weights and biases? What we're going to focus on in this lecture is the first question: how do we evaluate how far off our prediction is? For updating the network's weights and biases, we'll have to learn about backpropagation, which is coming up in a future lecture.

So right now let's focus on that first question. What we need to do is take the estimated outputs of the network and compare them to the real values of the label. Keep in mind that what I'm referring to right now happens during the training, or fitting, portion of the supervised learning process. Right now we're using just the training data set, so that we can go back and update our weights and biases. On the test set we're not updating weights and biases; instead we're just evaluating how our neural network performs overall on the entire dataset. During training we're doing these small evaluations on training batches in order to go back and update the weights and biases in our network. So keep that in mind.

OK. To compare our neural network's output to the true value, we use what's known as a cost function, which is also often referred to as a loss function or error function.
Essentially, it's just something that measures how far off your prediction is from the true value. One important caveat is that it should be an average, so that it outputs a single value. You can then keep track of that loss or cost during training to monitor network performance. Hopefully, during each epoch of training, your loss or cost goes down and down until you converge to some minimum cost value.

Here I want to introduce a couple of variables. We'll use y to represent the true value and a to represent the neuron's prediction. In terms of weights and biases, recall that we set z to be the weights times x plus b (z = wx + b), and then we pass z into an activation function such as the sigmoid function. That z passed into the sigmoid is equal to a. All I'm trying to say is that a represents the final output of a neuron. It takes into account the activation function, which in turn takes into account z, which in turn takes into account w and b, the weights and the biases. So keep in mind that a holds a lot of information: it holds information about the activation function, the weights, and the biases.

Probably the most common cost function you'll see is known as the quadratic cost function.
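The slide showing the formula is not part of this transcript. For reference, the quadratic cost as it is usually written, which matches the spoken description in this lecture (a factor of one over 2n, the true value y(x), and the last-layer output a^L(x)), is sketched below. The single-neuron relations on the second line restate the recap of z and a; the symbol σ for the sigmoid is an assumption about notation.

```latex
C = \frac{1}{2n} \sum_{x} \left\| y(x) - a^{L}(x) \right\|^{2}
\qquad \text{with, for a single neuron,} \qquad
a = \sigma(z), \quad z = w x + b
```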
If you've done any machine learning, this probably looks really familiar, because it looks a lot like mean squared error, which is essentially what it is, up to a constant factor. Here it's just notated for multidimensional data.

At the end of the day, all we're doing is calculating the difference between the real values and the predictions. The real values are labeled y(x): for some input x, y is the true function, so y(x) is the true value. Then we subtract our predicted values, which here are a^L(x). Keep in mind that the notation a^L signifies the activation output of layer L, where L is your last layer. That means the layer before it in the network is L − 1, the layer before that is L − 2, and so on. Later on we'll see why it's more convenient to mark L as the last layer and work backwards from there instead of starting from the beginning. So again, a^L(x) is essentially your predicted output.

Keep in mind that the notation shown here corresponds to vector inputs and outputs, since we're really dealing with a batch of training points and predictions as we go along. But the main idea is that you have C as the cost function and you're doing some sort of averaging. That's why we have one over 2n, where n is the number of points.
Then you're squaring all those differences and taking the sum. A common question is: why are we squaring? Squaring does two useful things for us.

First, it keeps everything positive. Whether you have a positive error or a negative error, if you square it it becomes positive, which is good because we want some sort of absolute measurement of error. If we weren't squaring and had errors that were both negative and positive, the average could hover around zero, which is not a true indication of how far off you are in absolute terms. So squaring makes sure that everything's positive.

The other, and much more important, thing it does is punish really large errors. Sometimes there are data points that you're going to be really off on, and if you square that error, its contribution to your cost grows quadratically. So maybe you're off by ten dollars, or whatever unit you're measuring, but your cost reports that in units squared, so it says you're off by one hundred instead of just ten. You're really punishing your network for being far off on certain points, which is good, because you don't want a network that predicts very badly on even just a few points, where it gives you a huge error.
You'd rather suffer a little bit on all the other points than be totally off on those few edge cases. So squaring really helps punish large errors.

Now, in general, we can think of the cost function as a function of four main things: W, our neural network's weights; B, our neural network's biases; S^r, the input of a single training sample; and E^r, the desired output of that training sample. That makes a lot of sense, because the cost depends on what the current weights and biases are, on what you passed in as the actual training example, and on what you're comparing it to, which is E^r.

Notice how that information was all encoded in the simplified notation we had. I showed you the quadratic cost function, and you may be wondering: didn't you just say that the cost function is a function of W and B? Where are W and B in the quadratic function? Well, they're encoded within a(x). Remember that a(x) holds information about the weights and biases, because a(x) is z passed into the activation function, and z contains information about W and B.

OK, so this means that if we have a huge network, we can expect the actual cost function to be quite complex, with a huge vector or tensor of weights and another huge tensor of biases.
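To make the point that the cost depends on the weights and biases through a, here is a minimal Python sketch of the quadratic cost for a single sigmoid neuron. The function names, the toy data, and the two (w, b) settings are illustrative assumptions, not taken from the lecture.

```python
import math


def sigmoid(z):
    """Logistic sigmoid activation."""
    return 1.0 / (1.0 + math.exp(-z))


def quadratic_cost(w, b, xs, ys):
    """C = 1/(2n) * sum of (y - a)^2 over the samples, with a = sigmoid(w*x + b)."""
    n = len(xs)
    total = 0.0
    for x, y in zip(xs, ys):
        a = sigmoid(w * x + b)   # the prediction carries w and b inside it
        total += (y - a) ** 2    # squaring keeps every term non-negative
    return total / (2 * n)


# Hypothetical toy data (not from the lecture): inputs and true labels.
xs = [-2.0, -1.0, 1.0, 2.0]
ys = [0.0, 0.0, 1.0, 1.0]

# Same training data, different weights and biases, different cost.
print(quadratic_cost(w=0.0, b=0.0, xs=xs, ys=ys))  # every prediction is 0.5
print(quadratic_cost(w=3.0, b=0.0, xs=xs, ys=ys))  # predictions close to the labels
```

The second call returns a much smaller value than the first, which is the sense in which some weights are "better" than others for a given training set.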
For example, if we take a small network and start labeling every weight, every bias, and every output of the activation function, which is a, you can see all the parameters labeled here. It gets really complicated really fast, and this is a really small network: there are only four layers here, two of them hidden, and those hidden layers aren't even that big. Once we start notating everything, it can become quite complex. You already have quite a large matrix of weights and biases.

So how do we actually calculate that cost function and then figure out how to minimize it? In a real case, we have some cost function C that depends on lots of weights: the weight on the first input, the weight on the second, the weight on the third, all the way to w_m. What we need to do is figure out which particular weights lead us to the lowest cost, because we want to go back and figure out, for all these weights, how to change them to minimize the cost function at the very end.

For simplicity, let's imagine that we're dealing with a really simple network that only has a single weight, essentially just one neuron.
What we want to do is minimize our loss or cost, essentially our overall error, which means we need to figure out what value of w results in the minimum value of C(w).

Here we've plotted a really simple cost function for a really simple network that contains only one weight. We want to figure out what value of w minimizes this cost function. Since this is such a simple example, you can probably just tell by looking that the minimum falls somewhere around where that arrow is. That's the weight that minimizes the cost function, which means it's probably the weight we want on that input to the neuron, because it reduces the cost to its minimum.

Now, students of calculus know that we could just take the derivative of this cost function, set it equal to zero, and solve. But recall that our real cost function is going to be super complex, and it's not going to be one-, two-, or three-dimensional. If you look back at that network, it's going to have as many dimensions as there are weights.
That's not even something I can plot. Again, it's going to be n-dimensional, which means that taking that derivative and setting it equal to zero is not something you'll be able to calculate without spending, say, a thousand years of computational time. Our networks, especially when we build out really large ones, are going to have hundreds or thousands of weights, so we're not going to be able to take that derivative and solve it directly. Instead, we can use gradient descent, an iterative process that in practice is often stochastic, to solve this sort of problem.

Let's go back to the simplified version of our network, where we have just one weight, and see how gradient descent works on this simple example. Then we can easily expand it to more complex examples.

We start off at one point on this cost function. Again, what we're searching for is the value of w that minimizes the cost. So we calculate the slope at that point, then move in the downward direction of the slope. You keep repeating this process until eventually the slope converges to zero, indicating a minimum.
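The loop just described (measure the slope, step downhill, repeat) can be sketched in a few lines of Python. The cost curve below is a hypothetical stand-in for the one on the slide, chosen so that its minimum is known to sit at w = 3; the starting point, step size, and number of steps are likewise assumptions.

```python
def cost(w):
    """Hypothetical one-weight cost curve with its minimum at w = 3."""
    return (w - 3.0) ** 2 + 1.0


def slope(w):
    """Derivative of cost(w) with respect to w."""
    return 2.0 * (w - 3.0)


def gradient_descent(w_start, step_size, num_steps):
    """Repeatedly step against the slope; returns every weight visited."""
    w = w_start
    path = [w]
    for _ in range(num_steps):
        w = w - step_size * slope(w)   # move in the downhill direction
        path.append(w)
    return path


path = gradient_descent(w_start=0.0, step_size=0.1, num_steps=50)
print(f"final w = {path[-1]:.4f}, slope there = {slope(path[-1]):.6f}")
```

On this curve the slope shrinks toward zero as w approaches 3, which is the convergence signal the lecture refers to.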
Keep in mind that we could have changed our step size for finding the next point. Here we took equal step sizes. If you take smaller steps, it takes longer to find the minimum. If you take larger steps, you'll go faster, but you risk overshooting the minimum: with too large a step size you may miss the minimizing weight, or weight tensor, overshoot, and never end up converging.

That step size is known as the learning rate. So if you ever see someone editing the learning rate of a neural network, what they're really doing is editing how fast they try to find that minimizing weight value. It works the same for biases: you're really finding the values of the weights and biases that minimize the cost function.

Something I should note about those previous examples is that the learning rate was constant; that is to say, each step size was equal. Whether we're looking at the smaller step sizes or the larger ones, the step is the same size throughout each run.

Now, we can actually be a little clever and adapt our step size as we go along.
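Before turning to adaptive steps, the trade-off described above for a constant learning rate can be checked numerically. This is a sketch under stated assumptions: the cost curve and the three learning rates are hypothetical, and the update uses the common form w ← w − rate × slope, so the distance moved also depends on the slope rather than being literally equal at each step as in the lecture's picture.

```python
def slope(w):
    """Derivative of the hypothetical cost (w - 3)^2, which is minimized at w = 3."""
    return 2.0 * (w - 3.0)


def run(learning_rate, num_steps=20, w_start=0.0):
    """Gradient descent with a constant learning rate; returns the final w."""
    w = w_start
    for _ in range(num_steps):
        w = w - learning_rate * slope(w)
    return w


# Assumed rates: one small, one moderate, one too large for this curve.
for rate in (0.01, 0.4, 1.1):
    w_final = run(rate)
    print(f"learning rate {rate}: distance from minimum = {abs(w_final - 3.0):.4g}")
```

With these choices, the small rate is still far from the minimum after 20 steps, the moderate rate has converged, and the large rate ends up further away than where it started.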
You can imagine that, since you're starting off at a random position in this n-dimensional space of possible weights and biases, you can start with larger steps and then go smaller and smaller in your step size as the gradient, or slope, gets closer to zero. This is known as adaptive gradient descent: depending on the gradient you get back, you adapt your step size.

In 2015, Kingma and Ba published a paper called "Adam: A Method for Stochastic Optimization." Adam is a much more efficient way of searching for these minimums, so you'll sometimes see us specify Adam as our optimizer in the code. Keep in mind that whenever you see Adam, all we're really referring to is this optimized way of performing gradient descent with an adaptive step size: we start off large and then, depending on where we are, maybe go smaller and smaller. You get the best of both worlds. You get to use larger step sizes to speed up finding the minimum, but as you get closer to it and don't want to overshoot, you can move to smaller step sizes.

You can then compare Adam against other gradient descent algorithms.
Here the plot shows training cost versus iterations over the entire dataset, and you can see that Adam outperforms the other adaptive gradient descent algorithms. All the ones listed here that are not Adam are also adaptive gradient descent methods; however, Adam performs better than all of them in this comparison. So we'll be using Adam, since it's really quite common to use it for neural networks. If you see it, all we're doing is saying: OK, optimize the way we find this minimum.

Now, realistically, that illustration showed gradient descent on just a single weight w, so it was one-dimensional. What we're really doing is calculating gradient descent in an n-dimensional space for all our weights. Here you can see gradient descent calculated over two dimensions, a weight and a bias. Realistically, I'm not going to be able to illustrate the n-dimensional space, because it has as many dimensions as there are weights and biases, tens or hundreds of them. There's really no way we can illustrate that for you, which is why we simplify it down to a single weight, so you can understand what gradient descent is doing.
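As a rough illustration of what "Adam as our optimizer" does on a two-parameter problem (one weight and one bias), here is a minimal pure-Python sketch of the Adam update rule. It is not the code used in the course: the bowl-shaped cost, the learning rate of 0.1, and the step count are assumptions chosen for a small demo, while beta1, beta2, and eps follow the commonly used defaults.

```python
import math


def gradient(w, b):
    """Gradient of the hypothetical cost (w - 3)^2 + (b + 1)^2."""
    return 2.0 * (w - 3.0), 2.0 * (b + 1.0)


def adam(params, grad_fn, learning_rate=0.1, beta1=0.9, beta2=0.999,
         eps=1e-8, num_steps=1000):
    """Adam update rule applied to a list of scalar parameters."""
    m = [0.0] * len(params)   # running average of the gradients
    v = [0.0] * len(params)   # running average of the squared gradients
    params = list(params)
    for t in range(1, num_steps + 1):
        grads = grad_fn(*params)
        for i, g in enumerate(grads):
            m[i] = beta1 * m[i] + (1.0 - beta1) * g
            v[i] = beta2 * v[i] + (1.0 - beta2) * g * g
            m_hat = m[i] / (1.0 - beta1 ** t)   # bias-corrected averages
            v_hat = v[i] / (1.0 - beta2 ** t)
            # Each parameter gets its own effective step size.
            params[i] -= learning_rate * m_hat / (math.sqrt(v_hat) + eps)
    return params


w, b = adam([0.0, 0.0], gradient)
print(f"w = {w:.3f}, b = {b:.3f}")
```

The per-parameter scaling by the running average of squared gradients is what makes the step size adaptive; in a real project one would call a library optimizer rather than hand-roll this loop.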
What we're actually doing, or what the computer will be doing for us, is the same sort of calculation in an n-dimensional space that we can't realistically illustrate for you.

When dealing with these n-dimensional vectors, otherwise known as tensors, the notation changes from derivative to gradient. That's why you've heard me say "gradient" a couple of times instead of "derivative": when we're dealing with n dimensions, the correct term becomes gradient. It means we calculate the gradient of the cost function with respect to all these weights. If you see that upside-down triangle (∇), that's the way you notate the gradient instead of the derivative.

Before we wrap up this lecture on cost functions and gradient descent, I want to quickly mention that for classification problems, instead of the quadratic cost function, we often end up using the cross-entropy loss function. What's nice about cross-entropy is that it assumes your model predicts a probability distribution over the classes, so it gives a probability for class one, class two, class three, and so on. The formula takes two forms, depending on how many classes there are, starting with binary classification, that is, only two classes.
For binary classification you use the top formula, where the logs are natural logs; for any number of classes greater than two, you use the formula below. Don't worry too much about this formula, because when we're coding it out we'll just specify that we want cross-entropy. So when we perform classification, especially multi-class classification with more than two classes, we'll call upon cross-entropy as our cost function. OK, so just keep in mind that if you see us specify cross-entropy in code, these are the actual formulas we're telling the computer to use to score the predicted probabilities for each of those classes.

As a quick review: we talked about cost functions and gradient descent, about the fact that there are different optimizers, like the Adam optimizer, and about quadratic cost and cross-entropy.

So far we understand that networks take an input and apply weights, biases, and activation functions to it to produce an estimated output. Then we learned how to evaluate that output against the true labels. The last thing we need in our theory discussion is the answer to the following question. Once we get that cost or loss value, we understand how far off we are.
But we still haven't talked about how we go back and adjust all those weights and biases. That's still kind of a magical thing, and in just a second it will no longer be magical; it'll be mathematical. It's called backpropagation: you essentially propagate backwards through your network and update all those weights and biases. That's exactly what we're going to cover in the next lecture. I'll see you there.
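For readers who want to see the cross-entropy loss mentioned above in concrete form: the slide formulas are not part of this transcript, so the sketch below uses the standard definitions, −(y ln p + (1 − y) ln(1 − p)) for two classes and −Σ_c y_c ln p_c for more than two, with natural logs as the lecture states. The function names and sample probabilities are illustrative assumptions.

```python
import math


def binary_cross_entropy(y, p):
    """Loss for one sample: y is the true label (0 or 1), p is the predicted
    probability of class 1 and must lie strictly between 0 and 1."""
    return -(y * math.log(p) + (1 - y) * math.log(1 - p))


def categorical_cross_entropy(y_onehot, probs):
    """Loss for one sample with more than two classes; y_onehot marks the true
    class and probs are the predicted class probabilities (summing to 1)."""
    return -sum(y * math.log(p) for y, p in zip(y_onehot, probs) if y > 0)


# A confident correct prediction costs little; a confident wrong one costs a lot.
print(binary_cross_entropy(1, 0.9))
print(binary_cross_entropy(1, 0.1))

# Three classes (assumed example), where the true class is the second one.
print(categorical_cross_entropy([0, 1, 0], [0.2, 0.7, 0.1]))
```

As with the quadratic cost, these per-sample values would be averaged over a batch to give the single number that is tracked during training.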