Mathematics For Machine Learning Essential Mathematics - Machine Learning Tutorial Simplilearn [English (auto-generated) subtitles]

Mathematics for machine learning. My name is Richard Kirschner with the Simplilearn team. That's get certified, get ahead. We're going to cover mathematics for machine learning.

Today's agenda is going to cover data and its types. Then we're going to dive into linear algebra and its concepts, calculus, statistics for machine learning, probability for machine learning, and hands-on demos. And of course, thrown in there in the middle is going to be your matrices and a few other things to go along with all this.

Data and its types. Data denotes the individual pieces of factual information collected from various sources. It is stored, processed, and later used for analysis. So we see here just a huge grouping of information: a lot of tech stuff, money, dollar signs, numbers. Then you have your performing analytics to drive insights, and hopefully you have your shareholders gathered at the meeting and you're able to explain it in something they can understand.

So we talk about types of data. In our types of data we have qualitative, or categorical, which you can think of as nominal or ordinal, and then you have your quantitative, or numerical, which is discrete or continuous. Let's look a little closer at that data type vocabulary. Always people's favorite is the vocabulary words. Okay, not mine. But let's dive into what we mean by nominal.

Nominal values are used to label our variables without providing any measurable value: country, gender, race, hair color, etc. It's something that you either mark true or false. This is a label; it's on or off. Either they have a red hat on or they do not. So a lot of times when you're thinking of nominal data, labels, think of it as a true/false kind of setup.

When we look at ordinal, this is categorical data with a set order or a scale to it. You can think of salary range as a great one, movie ratings, etc. You can see here the salary ranges: if you have 10,000 to 20,000, the number of employees earning that rate is 150; 20,000 to 30,000, 100; and so forth. One of the terms you'll hear is bucket. This is where you have 10 different buckets and you want to separate the data into something that makes sense across those 10 buckets. So when we start talking about ordinal, a lot of times when you get down to the brass tacks, again we're talking true/false. If you're a member of the 10 to 20K range and so forth, you're either part of that group or you're not. But now we're talking about buckets, so we want to count how many people are in that bucket.

Quantitative, or numerical, data falls into two classes: discrete or continuous. Discrete data is data with a finite set of values which can be categorized: class strength, questions answered correctly, and runs hit in cricket. A lot of times when you see this you can think integer, and a very restricted integer. That is, you can only have 100 questions on a test, so it's very discrete: I only have 100 different values that it can attain. So usually you're talking about integers, but within a very small range. They don't have an open end or anything like that. So discrete is very solid, simple to count, a set number.

Continuous data, on the other hand, can take any numerical value within a range: water pressure, weight of a person, etc. Usually we start thinking about float values, where they can get phenomenally small in what they're worth. And there's a whole series of values that falls right between discrete and continuous. You can think of the stock market. You have dollar amounts. It's still discrete, but it starts to get complicated enough when you have, say, a jump in the stock market from 525.33 to 580.67. There are a lot of point values in there. It'd still be called discrete, but you start looking at it as almost continuous because it does have such a variance in it.

Now, we went over nominal and ordinal, which are almost true/false charts, and we looked at quantitative, or numerical, data, where we start to get into numbers. Discrete data could be put into true/false, but usually it's not. We want to address this stuff, and the first thing you want to look at is the very basic, which is your algebra. So we're going to take a look at linear algebra. You can remember back to your Euclidean geometry: we have a line. Well, let's go through this. Linear algebra is the domain of mathematics concerning linear equations and their representations in vector spaces and through matrices. I told you we were going to talk about matrices. So a linear equation is simply 2x + 4y - 3z = 10.
Very linear. 10x + 12.4y = z. And now you can actually solve these two equations by combining them. That's what we're talking about with a linear equation.

In vectors we have a + b = c. Now we're starting to look at a direction. For these values, usually think of an x, y, z plot, so each one is a direction, and the actual distance is like a triangle: a and b give c.

And then your matrix can describe all kinds of things. I find matrices confuse a lot of people, not because they're particularly difficult, but because of the magnitude and the different things they're used for. A matrix is a chart, or think of a spreadsheet: you have your rows and your columns. You'll see here we have a times b equals c. It's very important to know your counts. Depending on how the math is being done and what you're using it for, making sure you have the same number of rows and columns, or a single number, there's all kinds of things that play into that and can make matrices confusing. But it really has a lot more to do with what domain you're working in. Are you adding in multiple polynomials, where you have something like a x squared plus b y plus and so on? You start to see that it can be very confusing versus a very straightforward matrix.

Let's go a little deeper into these, because this is what we're here to talk about: these different mathematical computations that come up. So we're looking at linear equations. Let's dig deeper into that one. An equation having a maximum order of one is called a linear equation. It's linear because when you look at this we have ax + b = c, which is one variable; we have two variables, ax + by = c; three variables, ax + by + cz = d; and so forth. But all of these are to the power of one. You don't see x squared, you don't see x cubed. So that's what we're talking about with linear equations and their addition. If you have already dived into, say, neural networks, you should recognize this ax + by + cz setup plus the intercept, which is basically your neural network: each node adding up all the different inputs.

And we can drill down into that. The most common formula is your y = mx + c. So you have your y equals m, which is your slope, times your x value, plus c, which is your y-intercept. They kind of labeled it wrong here; it threw me for a loop. But the c would be your y-intercept, so when you set x equal to zero, y equals c, and that's your y-intercept right there. They just had it reversed: the value of y when x equals zero is the y-intercept, which is c, and the slope, or gradient, of the line is your m. So you get your y = 2x + 3. There are lots of easy ways to compute this, and this is why we always start with the most basic one when we're solving one of these problems. And of course, one of the most important takeaways is the slope, or gradient, of the line. The slope is very important, that m value. In this case we went ahead and solved this: if you have y = 2x + 3, you can see how it has a nice line graph here on the right.

So, matrices. A matrix refers to a rectangular representation of an array of numbers arranged in columns and rows. We're talking m rows by n columns here. a11 denotes the element of the first row and the first column. It's really pronounced "a one one" in this particular setup, so it's row one, column one. Similarly, a12 is row one, column two: first row and second column, and so on. There are a lot of ways to denote this. I've seen these as a capital letter A, or a lowercase a for the top row; you can see where they can go all kinds of different directions as far as the value. Just take a moment to realize there needs to be some designation as far as what row it's in and what column it's in.

And we have our basic operations. We have addition. When you think about addition, you have two matrices of two by two and you just add each individual number in that matrix: in this case 10 plus 2 is 12, 5 plus 3 is 8, and so on. And it's the same thing with subtraction. Now again, you're counting matrices: you want to check the dimensions of the matrix, the shape. You'll see shape come up a lot in programming. So when we're talking about dimensions, we're talking about the shape. If the two shapes are equal, this is what happens when you add them together or subtract them.

And we have multiplication. When you look at the multiplication, you end up with a slightly different setup. If we look at our last one, we're like, why? This always gets to me when we get to matrices: they don't really say why you multiply matrices the way you do. My first thought is 1 times 2, 4 times 3. But if you look at this, we get 1 times 2 plus 4 times 3, 1 times 3 plus 4 times 5, 6 times 2 plus 3 times 3, and 6 times 3 plus 3 times 5. If you're looking at these matrices, think of this more as an equation. If you remember, we went back up here for our multiple linear equations. Let's just go back up a couple of slides, where we were looking at two variables. So this is a two-variable equation, ax + by = c, and this is a way to make it very quick to solve for these variables. That's why you have the matrix, and that's why you do the multiplication the way they do. This is the dot product: 1 times 2 plus 4 times 3, 1 times 3 plus 4 times 5, 6 times 2 plus 3 times 3, 6 times 3 plus 3 times 5, and it gives us a nice little 14, 23, 21, and 33 over here, which then can be used and reduced down to a simple formula as far as solving the variables, as long as you have enough inputs.

And then in matrix operations, when you're dealing with a lot of matrices, keep in mind that multiplying matrices is different from finding the product of two matrices. When we're talking about multiplication, we're talking about solving for equations. When you're finding the product, you are just finding 1 times 2. Keep that in mind, because that does come up. I've had that come up a number of times where I am altering data and I get confused as to what I'm doing with it.

Transpose: flipping the matrix over its diagonal comes up all the time. You still have 12, but instead of it being 12, 8, it's now 12, 14, 8, 21. You're just flipping the columns and the rows. And then, of course, you can do an inverse, changing the signs of the values across the main diagonal. You can see here we have the inverse, A to the minus 1, and it ends up with, instead of 12, 8, 14, 12, it's now minus 22, minus 12.
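
The operations in this part of the talk can be tried out in a few lines of NumPy. This is a minimal sketch, not the presenter's code: the two matrices are inferred from the spoken arithmetic (1 times 2 plus 4 times 3, and so on), and the variable names are illustrative. Note that `np.linalg.inv` computes the inverse in the standard sense, the matrix that multiplies with `A` to give the identity, which may not match the numbers read off the slide.

```python
import numpy as np

# The two 2x2 matrices implied by the spoken multiplication example (assumed).
A = np.array([[1, 4],
              [6, 3]])
B = np.array([[2, 3],
              [3, 5]])

# Matrix multiplication: each entry is a row of A dotted with a column of B.
matmul = A @ B

# Element-wise product: multiplies matching entries only ("just 1 times 2").
elementwise = A * B

# Transpose: rows become columns.
A_T = A.T

# Standard matrix inverse: A @ A_inv is the identity matrix.
A_inv = np.linalg.inv(A)

print(matmul)        # [[14 23]
                     #  [21 33]]
print(elementwise)   # [[ 2 12]
                     #  [18 15]]
print(A_T)           # [[1 6]
                     #  [4 3]]
print(np.allclose(A @ A_inv, np.eye(2)))  # True
```

The `@` operator and `*` give different results on the same inputs, which is the distinction between multiplying matrices and taking the plain product of their entries.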
Vectors. A vector just means we have a value and a direction, and we have four numbers here on our vector. In mathematics, a one-dimensional matrix is called a vector. So if you have your x plot and you have a single value, that value is along the x axis and it's a single dimension. If you have two dimensions, you can think about putting them on a graph. You might have x and you might have y, and each value denotes a direction, and then of course the actual distance is going to be the hypotenuse of that triangle. You can do that with three dimensions, x, y, and z, and you can do it all the way to n dimensions. So when they talk about k-means for categorizing and how close data is together, they will compute that based on the Pythagorean theorem. You would take the square of each value, add them all together, and find the square root, and that gives you a distance as far as where that point is, where that vector exists, or an actual point value.
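
The square, sum, and square-root recipe described here can be sketched as follows. The function names are hypothetical, chosen for illustration; k-means implementations typically apply the same formula to the difference between a data point and a cluster center, which is what the second function shows.

```python
import numpy as np

def euclidean_length(v):
    """Square each component, add them up, and take the square root."""
    v = np.asarray(v, dtype=float)
    return float(np.sqrt(np.sum(v ** 2)))

def euclidean_distance(p, q):
    """Distance between two points: the length of their difference."""
    diff = np.asarray(p, dtype=float) - np.asarray(q, dtype=float)
    return euclidean_length(diff)

print(euclidean_length([3, 4]))                  # 5.0
print(euclidean_distance([1, 2, 3], [4, 6, 3]))  # 5.0
```

The same code works unchanged for 2, 3, or n dimensions, since the sum runs over however many components the vector has. NumPy's built-in `np.linalg.norm` computes the same length.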
00:13:48,880 and then you can compare that point 379 00:13:48,880 --> 00:13:51,440 value to another one it makes a very 380 00:13:51,440 --> 00:13:54,880 easy comparison versus comparing 50 or 381 00:13:54,880 --> 00:13:56,560 60 different numbers 382 00:13:56,560 --> 00:13:58,959 and that brings us up to i gene vectors 383 00:13:58,959 --> 00:14:01,519 and i gene values 384 00:14:01,519 --> 00:14:04,160 hygiene vectors the vectors that don't 385 00:14:04,160 --> 00:14:07,519 change their span while transformation 386 00:14:07,519 --> 00:14:10,160 and i gene values the scalar values that 387 00:14:10,160 --> 00:14:12,639 are associated to the vectors 388 00:14:12,639 --> 00:14:14,000 conceptually 389 00:14:14,000 --> 00:14:16,320 you can think of the vector as your 390 00:14:16,320 --> 00:14:18,720 picture you have a picture it's uh 391 00:14:18,720 --> 00:14:21,279 two dimensions x and y 392 00:14:21,279 --> 00:14:22,959 and so when you do those two dimensions 393 00:14:22,959 --> 00:14:25,120 and those two values or whatever that 394 00:14:25,120 --> 00:14:27,760 value is 395 00:14:27,920 --> 00:14:30,720 that is that point but the values change 396 00:14:30,720 --> 00:14:32,959 when you skew it and so 397 00:14:32,959 --> 00:14:36,320 if we take and we have a vector a 398 00:14:36,320 --> 00:14:38,639 and that's a set value uh 399 00:14:38,639 --> 00:14:41,519 b is um your is your you have a and b 400 00:14:41,519 --> 00:14:44,399 which is your i gene vector two is the i 401 00:14:44,399 --> 00:14:47,199 gene value so we're altering 402 00:14:47,199 --> 00:14:50,240 all the values by two that means we're 403 00:14:50,240 --> 00:14:51,839 maybe we're stretching it out one 404 00:14:51,839 --> 00:14:53,760 direction making it tall uh if you're 405 00:14:53,760 --> 00:14:56,160 doing picture editing 406 00:14:56,160 --> 00:14:57,920 that's one of the places this comes in 407 00:14:57,920 --> 00:15:00,079 but you can see when you're transforming 408 00:15:00,079 --> 
00:15:02,399 uh your different information how you 409 00:15:02,399 --> 00:15:05,600 transform it is then your hygiene value 410 00:15:05,600 --> 00:15:07,839 and you can see here vector after line 411 00:15:07,839 --> 00:15:11,519 transit transition uh we have 3a a is 412 00:15:11,519 --> 00:15:14,959 the hygiene vector 3 is the aging value 413 00:15:14,959 --> 00:15:17,199 so a doesn't change that's whatever we 414 00:15:17,199 --> 00:15:18,399 started with that's your original 415 00:15:18,399 --> 00:15:21,040 picture and 3 416 00:15:21,040 --> 00:15:23,680 is skewing it one direction and maybe 417 00:15:23,680 --> 00:15:26,079 a b is being skewed another direction 418 00:15:26,079 --> 00:15:27,600 and so you have a nice tilted picture 419 00:15:27,600 --> 00:15:29,600 because you altered it by those by the 420 00:15:29,600 --> 00:15:31,760 hygiene values 421 00:15:31,760 --> 00:15:34,720 so let's go ahead and pull up a demo on 422 00:15:34,720 --> 00:15:37,519 linear algebra and to do this i'm going 423 00:15:37,519 --> 00:15:40,320 to go through my trusted anaconda into 424 00:15:40,320 --> 00:15:42,399 my jupiter notebook 425 00:15:42,399 --> 00:15:44,320 and we'll create a new 426 00:15:44,320 --> 00:15:47,199 notebook called linear algebra 427 00:15:47,199 --> 00:15:49,600 since we are working in python we're 428 00:15:49,600 --> 00:15:51,600 going to use our numpy i always import 429 00:15:51,600 --> 00:15:54,320 that as np or numpy array probably the 430 00:15:54,320 --> 00:15:56,160 most popular 431 00:15:56,160 --> 00:16:00,480 module for doing matrixes and things in 432 00:16:00,480 --> 00:16:02,240 given that this is part of a series i'm 433 00:16:02,240 --> 00:16:04,480 not going to go too much into numpy we 434 00:16:04,480 --> 00:16:06,000 are going to go ahead and create two 435 00:16:06,000 --> 00:16:08,240 different variables a for a numpy array 436 00:16:08,240 --> 00:16:12,000 10 15 and b 29 437 00:16:12,000 --> 00:16:13,199 we'll go ahead and 
run this and you can 438 00:16:13,199 --> 00:16:16,079 see there's our two arrays 10 15 29 and 439 00:16:16,079 --> 00:16:17,519 i went and added a space there in 440 00:16:17,519 --> 00:16:18,959 between 441 00:16:18,959 --> 00:16:20,880 so it's easier to read 442 00:16:20,880 --> 00:16:23,440 and since it's the last line we don't 443 00:16:23,440 --> 00:16:24,720 have to put the print statement on it 444 00:16:24,720 --> 00:16:27,279 unless you want we can simply but we can 445 00:16:27,279 --> 00:16:30,639 simply do a plus b so when i run this uh 446 00:16:30,639 --> 00:16:35,680 we have 10 15 29 and we get 30 24 which 447 00:16:35,680 --> 00:16:39,759 is what you expect 10 plus 20 15 plus 9 448 00:16:39,759 --> 00:16:41,600 you could almost look at this addition 449 00:16:41,600 --> 00:16:44,399 as being 450 00:16:45,360 --> 00:16:46,959 just adding up the columns on here 451 00:16:46,959 --> 00:16:49,279 coming down and if we wanted to do it a 452 00:16:49,279 --> 00:16:52,560 different way we could also do a dot t 453 00:16:52,560 --> 00:16:54,399 plus b dot t 454 00:16:54,399 --> 00:16:56,880 remember that t flips them and so if we 455 00:16:56,880 --> 00:17:00,160 do that we now get them uh we now have 456 00:17:00,160 --> 00:17:02,800 30 24 going the other way 457 00:17:02,800 --> 00:17:05,119 we could also do something kind of fun 458 00:17:05,119 --> 00:17:06,559 there's a lot of different ways to do 459 00:17:06,559 --> 00:17:07,919 this 460 00:17:07,919 --> 00:17:10,799 as far as a plus b i can also do a plus 461 00:17:10,799 --> 00:17:11,679 b 462 00:17:11,679 --> 00:17:13,280 dot t 463 00:17:13,280 --> 00:17:14,480 and you're going to see that that will 464 00:17:14,480 --> 00:17:17,119 come out the same the 30 24 whether i 465 00:17:17,119 --> 00:17:19,359 transpose a and b or transpose them both 466 00:17:19,359 --> 00:17:21,918 at the end 467 00:17:22,400 --> 00:17:24,559 and likewise we can very easily subtract 468 00:17:24,559 --> 00:17:28,000 
two vectors i can go a minus b 469 00:17:28,000 --> 00:17:31,679 and we run that and we get minus 10 6 470 00:17:31,679 --> 00:17:33,440 now remember this is the last line in 471 00:17:33,440 --> 00:17:35,120 this particular section that's right not 472 00:17:35,120 --> 00:17:37,600 to put the print around it 473 00:17:37,600 --> 00:17:40,080 and just like we did before 474 00:17:40,080 --> 00:17:42,799 we can transpose either the individual 475 00:17:42,799 --> 00:17:45,440 or we can transpose the main setup and 476 00:17:45,440 --> 00:17:48,000 then we get a minus 10 6 going the other 477 00:17:48,000 --> 00:17:50,240 way 478 00:17:51,520 --> 00:17:53,440 now we didn't mention this in our notes 479 00:17:53,440 --> 00:17:56,160 but you can also do a scalar 480 00:17:56,160 --> 00:17:57,760 multiplication 481 00:17:57,760 --> 00:17:59,440 and just put down the scalar so you can 482 00:17:59,440 --> 00:18:00,960 remember that 483 00:18:00,960 --> 00:18:02,880 uh what we're talking about here is i 484 00:18:02,880 --> 00:18:04,000 have 485 00:18:04,000 --> 00:18:04,880 this 486 00:18:04,880 --> 00:18:06,799 array here u 487 00:18:06,799 --> 00:18:08,880 and if i go a 488 00:18:08,880 --> 00:18:11,039 times u 489 00:18:11,039 --> 00:18:12,960 we'll take the value 2 we'll multiply it 490 00:18:12,960 --> 00:18:15,840 by every value in here so 2 times 30 is 491 00:18:15,840 --> 00:18:19,360 60 2 times 15 492 00:18:19,360 --> 00:18:22,400 and just like we did before 493 00:18:22,400 --> 00:18:24,240 this happens a lot because when you're 494 00:18:24,240 --> 00:18:26,640 doing matrixes you do need to flip them 495 00:18:26,640 --> 00:18:29,600 you get 60 30 coming this way 496 00:18:29,600 --> 00:18:32,320 so in numpy uh we have what they call 497 00:18:32,320 --> 00:18:35,440 dot product 498 00:18:35,600 --> 00:18:37,760 and uh with this this is in a two 499 00:18:37,760 --> 00:18:40,160 dimensional vectors it is equivalent of 500 00:18:40,160 --> 00:18:42,799 two 
matrix multiplication remember we 501 00:18:42,799 --> 00:18:45,120 were talking about matrix multiplication 502 00:18:45,120 --> 00:18:47,919 uh where it is the 503 00:18:47,919 --> 00:18:50,960 well let's walk through it 504 00:18:50,960 --> 00:18:54,240 we'll go ahead and start by defining two 505 00:18:54,240 --> 00:18:58,160 numpy arrays we'll have uh 10 20 25 6 or 506 00:18:58,160 --> 00:19:00,400 our u and our v uh and then we're going 507 00:19:00,400 --> 00:19:02,000 to go ahead and do 508 00:19:02,000 --> 00:19:04,080 if we take 509 00:19:04,080 --> 00:19:05,679 the values 510 00:19:05,679 --> 00:19:08,640 and if you remember correctly 511 00:19:08,640 --> 00:19:11,679 an array like this would be 10 times 25 512 00:19:11,679 --> 00:19:14,640 plus 20 times 6. 513 00:19:14,640 --> 00:19:16,240 we'll go ahead and 514 00:19:16,240 --> 00:19:18,880 print that 515 00:19:20,960 --> 00:19:23,200 there we go 516 00:19:23,200 --> 00:19:25,919 and then we'll go ahead and do the np 517 00:19:25,919 --> 00:19:27,600 dot dot 518 00:19:27,600 --> 00:19:30,559 of u comma 519 00:19:31,200 --> 00:19:32,880 v 520 00:19:32,880 --> 00:19:35,120 and we'll find when we do this we go and 521 00:19:35,120 --> 00:19:36,640 run this 522 00:19:36,640 --> 00:19:39,720 we're going to get 370 523 00:19:39,720 --> 00:19:42,799 370. 
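Here is a rough sketch of the vector steps walked through so far, assuming NumPy is installed. The numbers are the ones read out in the demo; b = [20, 9] is inferred from the sums (10 plus 20, 15 plus 9), and the name w for the scalar example is made up here for illustration.

```python
import numpy as np

# Vectors from the demo; b is inferred from the sums read out (10 + 20, 15 + 9).
a = np.array([10, 15])
b = np.array([20, 9])

vec_sum = a + b    # element-wise addition: 30, 24
vec_diff = a - b   # element-wise subtraction: -10, 6

# Scalar multiplication: every entry is multiplied by 2 (w is a hypothetical name).
w = np.array([30, 15])
scaled = 2 * w     # 60, 30

# Dot product: multiply matching entries, then add them up.
u = np.array([10, 20])
v = np.array([25, 6])
manual_dot = 10 * 25 + 20 * 6   # 250 + 120 = 370
numpy_dot = np.dot(u, v)        # same value via NumPy

print(vec_sum, vec_diff, scaled, manual_dot, numpy_dot)
```

The manual line and np.dot give the same number, which is the point of the demo: for two 1-D arrays the dot product is just the sum of the pairwise products.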
so this is a strain multiplication 524 00:19:42,799 --> 00:19:45,679 where they use it to solve 525 00:19:45,679 --> 00:19:48,880 linear algebra when you have multiple 526 00:19:48,880 --> 00:19:50,880 numbers going across and so this could 527 00:19:50,880 --> 00:19:52,320 be very complicated we could have a 528 00:19:52,320 --> 00:19:53,679 whole string of different variables 529 00:19:53,679 --> 00:19:56,480 going in here but for this we get a nice 530 00:19:56,480 --> 00:20:00,160 value for our dot multiplication 531 00:20:00,480 --> 00:20:01,840 and we did 532 00:20:01,840 --> 00:20:03,600 addition earlier which is just your 533 00:20:03,600 --> 00:20:05,200 basic addition 534 00:20:05,200 --> 00:20:06,799 and of course the matrix you can get 535 00:20:06,799 --> 00:20:09,120 very complicated on these or 536 00:20:09,120 --> 00:20:11,840 in this case we'll go ahead and do 537 00:20:11,840 --> 00:20:13,440 let's create two 538 00:20:13,440 --> 00:20:15,840 complex matrixes 539 00:20:15,840 --> 00:20:19,280 this one is a matrix of 540 00:20:19,280 --> 00:20:22,640 you know 12 10 4 6 4 31. 
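A small sketch of the matrix setup at this point in the demo, assuming NumPy. Matrix A uses the six numbers just read out, laid out as 2 by 3. For B, only the first row (2, 8, 46) can be inferred from the arithmetic read out later in the demo; the second row here is a made-up placeholder, not the value from the video.

```python
import numpy as np

# A as read out in the demo, arranged as a 2 x 3 matrix.
A = np.array([[12, 10, 4],
              [6, 4, 31]])

# First row of B is inferred from the demo's arithmetic (12 + 2, 10 + 8, 4 - 46).
# The second row is a hypothetical placeholder.
B = np.array([[2, 8, 46],
              [9, 7, 5]])

print(A.shape)      # number of rows and columns of A

mat_sum = A + B     # matrix addition is element by element, same as for vectors
mat_diff = A - B    # matrix subtraction works the same way

print(mat_sum)
print(mat_diff)
```

Both matrices need the same shape for this to work entry by entry.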
we'll just 541 00:20:22,640 --> 00:20:24,080 print out a so you can see what that 542 00:20:24,080 --> 00:20:25,919 looks like here's print 543 00:20:25,919 --> 00:20:27,440 a 544 00:20:27,440 --> 00:20:30,000 we print a out you can see that we have 545 00:20:30,000 --> 00:20:31,840 a 546 00:20:31,840 --> 00:20:35,840 2 by 3 layer matrix for a 547 00:20:35,840 --> 00:20:37,919 and we can also put together 548 00:20:37,919 --> 00:20:39,280 always kind of fun when you're playing 549 00:20:39,280 --> 00:20:41,039 with print values 550 00:20:41,039 --> 00:20:42,720 we could do something like this we could 551 00:20:42,720 --> 00:20:44,400 go in here 552 00:20:44,400 --> 00:20:45,520 there we go 553 00:20:45,520 --> 00:20:48,080 we could print a we have it end with 554 00:20:48,080 --> 00:20:49,280 equals a 555 00:20:49,280 --> 00:20:50,159 run 556 00:20:50,159 --> 00:20:52,000 and this kind of gives it a nice look 557 00:20:52,000 --> 00:20:54,400 here's your matrix that's all this is 558 00:20:54,400 --> 00:20:56,559 comma n means it just tags it on the end 559 00:20:56,559 --> 00:20:59,120 that's all that is doing on there 560 00:20:59,120 --> 00:21:01,200 and then we can simply add in what is a 561 00:21:01,200 --> 00:21:02,960 plus b and you should already guess 562 00:21:02,960 --> 00:21:04,240 because this is the same as what we did 563 00:21:04,240 --> 00:21:06,240 before there's no difference 564 00:21:06,240 --> 00:21:08,400 we do a simple vector addition we have 565 00:21:08,400 --> 00:21:12,080 12 plus 2 is 14 10 plus 8 is 18 and so 566 00:21:12,080 --> 00:21:13,120 on 567 00:21:13,120 --> 00:21:15,760 and just like we did the matrix addition 568 00:21:15,760 --> 00:21:19,280 we can also do a minus b 569 00:21:19,280 --> 00:21:21,760 and do our matrix subtraction 570 00:21:21,760 --> 00:21:24,320 and we look at this we have what 12 571 00:21:24,320 --> 00:21:31,400 minus 2 is 10 10 minus 8 um where are we 572 00:21:32,640 --> 00:21:35,760 oh there we go 
eight minus uh 573 00:21:35,760 --> 00:21:37,200 confusing what i'm looking at i should 574 00:21:37,200 --> 00:21:39,120 have reprinted out the original numbers 575 00:21:39,120 --> 00:21:41,840 uh but we can see here 12 minus 2 is of 576 00:21:41,840 --> 00:21:45,039 course 10 10 minus 8 is 2 577 00:21:45,039 --> 00:21:48,880 4 minus 46 is minus 42 and so forth so 578 00:21:48,880 --> 00:21:50,799 same as a subtraction as before we just 579 00:21:50,799 --> 00:21:52,240 call it matrix subtraction it's 580 00:21:52,240 --> 00:21:54,559 identical 581 00:21:54,559 --> 00:21:56,400 now if you remember up here we had 582 00:21:56,400 --> 00:21:59,120 scalar addition we're adding just one 583 00:21:59,120 --> 00:22:00,159 number 584 00:22:00,159 --> 00:22:02,480 to a matrix you can also do scalar 585 00:22:02,480 --> 00:22:04,159 multiplication 586 00:22:04,159 --> 00:22:06,559 and so simply if you have a single value 587 00:22:06,559 --> 00:22:09,039 a and you have b which is your array we 588 00:22:09,039 --> 00:22:13,440 can also do a times b when we run that 589 00:22:13,440 --> 00:22:16,880 you can see here we have 2 times 4 is 8 590 00:22:16,880 --> 00:22:19,840 5 times 4 is 20 and so forth you're just 591 00:22:19,840 --> 00:22:21,679 multiplying the 4 across each one of 592 00:22:21,679 --> 00:22:23,280 these values 593 00:22:23,280 --> 00:22:24,799 and this is an interesting one that 594 00:22:24,799 --> 00:22:27,840 comes up a little bit of a brain teaser 595 00:22:27,840 --> 00:22:32,400 is matrix and vector multiplication 596 00:22:32,400 --> 00:22:35,520 and so when we're looking at this 597 00:22:35,520 --> 00:22:36,640 we are 598 00:22:36,640 --> 00:22:37,919 just doing regular arrays it doesn't 599 00:22:37,919 --> 00:22:40,080 necessarily have to be a numpy array we 600 00:22:40,080 --> 00:22:41,679 have a 601 00:22:41,679 --> 00:22:44,080 which has our 602 00:22:44,080 --> 00:22:46,880 array of arrays and b which is a single 603 00:22:46,880 --> 
00:22:51,280 array and so we can from here 604 00:22:51,520 --> 00:22:53,600 do the dot 605 00:22:53,600 --> 00:22:55,360 a b 606 00:22:55,360 --> 00:22:57,600 and this is going to return two values 607 00:22:57,600 --> 00:23:00,000 and the first value is that is you could 608 00:23:00,000 --> 00:23:01,600 say it's like 609 00:23:01,600 --> 00:23:03,039 we're doing the 610 00:23:03,039 --> 00:23:05,120 this array b array 611 00:23:05,120 --> 00:23:07,760 first with a and then with a second one 612 00:23:07,760 --> 00:23:09,200 and so it splits it up so you have a 613 00:23:09,200 --> 00:23:10,880 matrix of vector multiplication and you 614 00:23:10,880 --> 00:23:12,240 can mix and match 615 00:23:12,240 --> 00:23:14,320 when you get into really complicated uh 616 00:23:14,320 --> 00:23:16,640 back end stuff this becomes more common 617 00:23:16,640 --> 00:23:18,320 because you're now you've got layers 618 00:23:18,320 --> 00:23:21,600 upon layers of data and so you'll end up 619 00:23:21,600 --> 00:23:23,760 with a matrix and a set of bolt uh 620 00:23:23,760 --> 00:23:26,880 vector matrices do you want to multiply 621 00:23:26,880 --> 00:23:28,799 now keep in mind 622 00:23:28,799 --> 00:23:31,039 that if you're doing data science a lot 623 00:23:31,039 --> 00:23:32,640 of times you're not looking at this this 624 00:23:32,640 --> 00:23:34,960 is what's going on behind the scenes so 625 00:23:34,960 --> 00:23:38,000 if you're in the scikit looking at sk 626 00:23:38,000 --> 00:23:39,280 learn where you're doing linear 627 00:23:39,280 --> 00:23:40,720 regression models 628 00:23:40,720 --> 00:23:42,320 this is some of the math that's hidden 629 00:23:42,320 --> 00:23:44,640 behind the scenes that's going on 630 00:23:44,640 --> 00:23:46,320 other times you might find yourself 631 00:23:46,320 --> 00:23:48,559 having to do part of this and manipulate 632 00:23:48,559 --> 00:23:50,240 the data around so it fits right and 633 00:23:50,240 --> 00:23:52,159 then you go 
back in and you run it 634 00:23:52,159 --> 00:23:56,480 through the scikit and just like 635 00:23:56,480 --> 00:23:58,799 up here where we did a 636 00:23:58,799 --> 00:24:00,720 matrix and vector multiplication we can 637 00:24:00,720 --> 00:24:03,440 also do matrix to matrix multiplication 638 00:24:03,440 --> 00:24:05,039 and if we run this where we have the two 639 00:24:05,039 --> 00:24:06,480 matrixes 640 00:24:06,480 --> 00:24:07,840 you can see we have a very complicated 641 00:24:07,840 --> 00:24:09,440 array that of course comes out on there 642 00:24:09,440 --> 00:24:10,720 for our dot 643 00:24:10,720 --> 00:24:13,120 and just to reiterate it we have our 644 00:24:13,120 --> 00:24:16,559 transpose of a matrix which is your dot t 645 00:24:16,559 --> 00:24:19,120 and so if we create a matrix a and we 646 00:24:19,120 --> 00:24:21,039 uh transpose it you can see how it flips 647 00:24:21,039 --> 00:24:22,000 it from 648 00:24:22,000 --> 00:24:26,880 5 10 15 20 25 30 to 5 15 25 649 00:24:26,880 --> 00:24:28,480 10 20 30 650 00:24:28,480 --> 00:24:31,440 rows and columns 651 00:24:31,440 --> 00:24:33,279 and certainly with the math uh this 652 00:24:33,279 --> 00:24:36,240 comes up a lot um it also comes up a lot 653 00:24:36,240 --> 00:24:38,720 with xy plotting when you put it in 654 00:24:38,720 --> 00:24:40,799 pyplot you have one format where they're 655 00:24:40,799 --> 00:24:42,400 looking at pairs of numbers and then 656 00:24:42,400 --> 00:24:45,039 they want all the x's and all the y's so you 657 00:24:45,039 --> 00:24:47,200 know the transpose is an important tool 658 00:24:47,200 --> 00:24:49,360 both for your math and for plotting and 659 00:24:49,360 --> 00:24:51,039 all kinds of things 660 00:24:51,039 --> 00:24:53,679 another tool that we didn't discuss uh 661 00:24:53,679 --> 00:24:56,960 is your identity matrix 662 00:24:56,960 --> 00:25:01,200 uh and this one is more definition 663 00:25:01,200 --> 00:25:03,679 the identity matrix 
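A quick sketch of the transpose step above and of the identity matrix being introduced here, assuming NumPy. The matrix A uses the numbers read out in the demo; the points array is made-up data to illustrate the plotting remark, and using np.identity for the identity step is an assumption about which call the notebook makes.

```python
import numpy as np

# Transpose: rows become columns.
A = np.array([[5, 10],
              [15, 20],
              [25, 30]])
A_t = A.T   # rows are now 5 15 25 and 10 20 30

# Plotting use case: a list of (x, y) pairs split into all x's and all y's.
# These points are hypothetical, purely for illustration.
points = np.array([[0, 1],
                   [1, 3],
                   [2, 5]])
xs, ys = points.T

# Identity matrices: ones on the diagonal, zeros everywhere else.
I2 = np.identity(2)
I3 = np.identity(3)

print(A_t)
print(xs, ys)
print(I2)
print(I3)
```

The xs and ys arrays are in the shape a plotting call that takes separate x and y sequences would expect.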
664 00:25:03,679 --> 00:25:06,559 we have here one where we just did two 665 00:25:06,559 --> 00:25:09,600 so it comes down as one zero zero one 666 00:25:09,600 --> 00:25:12,159 one zero zero zero one zero it creates a 667 00:25:12,159 --> 00:25:14,799 diagonal of one and what that is is when 668 00:25:14,799 --> 00:25:16,880 you're doing your identities you could 669 00:25:16,880 --> 00:25:18,320 be comparing 670 00:25:18,320 --> 00:25:20,320 all your different features to the 671 00:25:20,320 --> 00:25:21,600 different features and how they 672 00:25:21,600 --> 00:25:22,720 correlate 673 00:25:22,720 --> 00:25:25,200 and of course when you have feature one 674 00:25:25,200 --> 00:25:27,520 compared to feature one to itself it is 675 00:25:27,520 --> 00:25:28,480 always 676 00:25:28,480 --> 00:25:29,760 one 677 00:25:29,760 --> 00:25:30,880 where 678 00:25:30,880 --> 00:25:32,480 usually it's between zero one depending 679 00:25:32,480 --> 00:25:34,640 on how well correlates so when we're 680 00:25:34,640 --> 00:25:36,720 talking about identity matrix that's 681 00:25:36,720 --> 00:25:38,720 what we're talking about right here is 682 00:25:38,720 --> 00:25:41,039 that you create this preset matrix and 683 00:25:41,039 --> 00:25:42,480 then you might adjust these numbers 684 00:25:42,480 --> 00:25:44,000 depending on what you're working with 685 00:25:44,000 --> 00:25:45,600 and what the domain is 686 00:25:45,600 --> 00:25:47,520 and then another thing we can do 687 00:25:47,520 --> 00:25:49,039 to kind of wrap this up we'll hit you 688 00:25:49,039 --> 00:25:51,200 with the most complicated 689 00:25:51,200 --> 00:25:52,320 piece of this 690 00:25:52,320 --> 00:25:55,120 puzzle here is an inverse 691 00:25:55,120 --> 00:25:57,120 a matrix 692 00:25:57,120 --> 00:25:59,840 and let's just go ahead and put the um 693 00:25:59,840 --> 00:26:02,640 it's a lengthy description 694 00:26:02,640 --> 00:26:04,159 let's go and put the description this is 695 00:26:04,159 
--> 00:26:06,480 straight out of the uh 696 00:26:06,480 --> 00:26:08,559 the website for 697 00:26:08,559 --> 00:26:12,880 numpy so given a square matrix a here's 698 00:26:12,880 --> 00:26:16,000 our square matrix a which is 2 1 0 0 1 0 699 00:26:16,000 --> 00:26:19,279 1 2 1. keep in mind 3 by 3 is square 700 00:26:19,279 --> 00:26:20,960 it's got to be equal it's going to 701 00:26:20,960 --> 00:26:24,559 return the matrix a inverse satisfying 702 00:26:24,559 --> 00:26:26,720 dot a 703 00:26:26,720 --> 00:26:29,360 a inverse so here's our matrix 704 00:26:29,360 --> 00:26:32,080 multiplication 705 00:26:32,080 --> 00:26:34,559 and then of course it equals the dot 706 00:26:34,559 --> 00:26:37,440 yeah a inverse of a 707 00:26:37,440 --> 00:26:39,600 with an identity shape of 708 00:26:39,600 --> 00:26:41,919 a dot shaped zero this is just reshaping 709 00:26:41,919 --> 00:26:43,840 the identity 710 00:26:43,840 --> 00:26:45,919 that's a little complicated there uh so 711 00:26:45,919 --> 00:26:48,480 we're going to have our here's our array 712 00:26:48,480 --> 00:26:50,159 we'll go ahead and run this 713 00:26:50,159 --> 00:26:52,640 and you can see what we end up with 714 00:26:52,640 --> 00:26:55,919 is we end up with uh an array 0.5 minus 715 00:26:55,919 --> 00:26:59,440 0.5 and so forth with our 2 1 1 going 716 00:26:59,440 --> 00:27:04,000 down 2 1 0 0 1 0 1 2 1. 
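The description being read out says the inverse satisfies dot(a, ainv) = dot(ainv, a) = eye(a.shape[0]). A minimal sketch of that check with the demo's 3 by 3 matrix, assuming NumPy and assuming the notebook calls np.linalg.inv:

```python
import numpy as np

# Square matrix from the demo (3 x 3).
A = np.array([[2, 1, 0],
              [0, 1, 0],
              [1, 2, 1]])

A_inv = np.linalg.inv(A)          # the inverse of A
identity = np.eye(A.shape[0])     # 3 x 3 identity, sized from A itself

# Multiplying in either order should give the identity, up to float rounding.
left_ok = np.allclose(np.dot(A, A_inv), identity)
right_ok = np.allclose(np.dot(A_inv, A), identity)

print(A_inv)
print(left_ok, right_ok)
```

np.allclose is used instead of an exact comparison because the inverse is computed in floating point, so tiny rounding differences are normal.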
717 00:27:04,000 --> 00:27:06,080 getting in a little deep on the math 718 00:27:06,080 --> 00:27:07,200 understanding 719 00:27:07,200 --> 00:27:09,360 when you need this is probably 720 00:27:09,360 --> 00:27:10,880 what's really important when you're 721 00:27:10,880 --> 00:27:12,480 doing data science 722 00:27:12,480 --> 00:27:13,760 versus 723 00:27:13,760 --> 00:27:15,520 handwriting this out and looking up the 724 00:27:15,520 --> 00:27:17,520 math and handwriting all the pieces out 725 00:27:17,520 --> 00:27:19,200 you do need to know about the linear 726 00:27:19,200 --> 00:27:21,440 algebra inverse of a 727 00:27:21,440 --> 00:27:23,360 so if it comes up you can easily pull it 728 00:27:23,360 --> 00:27:24,640 up or at least remember where to look it 729 00:27:24,640 --> 00:27:27,600 up we took a look at the algebra side of 730 00:27:27,600 --> 00:27:29,200 it let's go ahead and take a look at the 731 00:27:29,200 --> 00:27:31,679 calculus side of what's going on here 732 00:27:31,679 --> 00:27:35,039 with the machine learning so calculus oh 733 00:27:35,039 --> 00:27:36,880 my goodness and differential equations 734 00:27:36,880 --> 00:27:38,159 you got to throw that in there because 735 00:27:38,159 --> 00:27:40,159 that's all part of the 736 00:27:40,159 --> 00:27:42,400 bag of tricks especially when you're 737 00:27:42,400 --> 00:27:44,400 doing large neural networks but it also 738 00:27:44,400 --> 00:27:46,640 comes up in many other areas the good 739 00:27:46,640 --> 00:27:48,320 news is most of it's already done for 740 00:27:48,320 --> 00:27:50,480 you in the back end uh so when it comes 741 00:27:50,480 --> 00:27:52,240 up you really do need to understand it from 742 00:27:52,240 --> 00:27:54,480 the data science side not data analytics data 743 00:27:54,480 --> 00:27:56,960 analytics means you're digging deep into 744 00:27:56,960 --> 00:27:59,600 actually solving these math equations 745 00:27:59,600 --> 00:28:01,679 and a neural 
network is just a giant 746 00:28:01,679 --> 00:28:04,080 differential equation 747 00:28:04,080 --> 00:28:06,000 so we talk about calculus 748 00:28:06,000 --> 00:28:07,840 we're going to go ahead 749 00:28:07,840 --> 00:28:10,880 and understand it by talking about cars 750 00:28:10,880 --> 00:28:13,520 versus time and speed 751 00:28:13,520 --> 00:28:16,320 so it helps to calculate the instantaneous 752 00:28:16,320 --> 00:28:19,120 rate of change 753 00:28:19,120 --> 00:28:21,039 so suppose we plot a graph of the speed 754 00:28:21,039 --> 00:28:23,120 of a car with respect to time 755 00:28:23,120 --> 00:28:25,279 so as you can see here going down the 756 00:28:25,279 --> 00:28:27,760 highway probably merged into the highway 757 00:28:27,760 --> 00:28:30,080 from an on-ramp so i had to accelerate 758 00:28:30,080 --> 00:28:31,919 so my speed went way up 759 00:28:31,919 --> 00:28:34,799 uh stuck in traffic merged into the 760 00:28:34,799 --> 00:28:36,559 traffic traffic opens up and i 761 00:28:36,559 --> 00:28:38,880 accelerate again up to the speed limit 762 00:28:38,880 --> 00:28:41,840 and uh maybe it peters off up there so you 763 00:28:41,840 --> 00:28:44,559 can look at this as 764 00:28:44,559 --> 00:28:46,720 the speed versus time i'm getting faster 765 00:28:46,720 --> 00:28:48,080 and faster because i'm continually 766 00:28:48,080 --> 00:28:50,480 accelerating and if i hit the brakes it 767 00:28:50,480 --> 00:28:52,080 goes the other way 768 00:28:52,080 --> 00:28:53,679 so the rate of change of speed with 769 00:28:53,679 --> 00:28:55,760 respect to time is nothing but 770 00:28:55,760 --> 00:28:57,760 acceleration how fast are we 771 00:28:57,760 --> 00:28:59,039 accelerating 772 00:28:59,039 --> 00:29:00,799 the acceleration is the area between the 773 00:29:00,799 --> 00:29:02,799 start point of x and the end point of 774 00:29:02,799 --> 00:29:04,960 delta x 775 00:29:04,960 --> 00:29:07,279 so we can calculate a simple 776 
00:29:09,360 if you had x and delta x we could put a 777 00:29:09,360 --> 00:29:12,480 line there and that slope of the line is 778 00:29:12,480 --> 00:29:14,559 our acceleration 779 00:29:14,559 --> 00:29:16,640 now that's pretty easy when you're doing 780 00:29:16,640 --> 00:29:18,960 linear algebra but i don't want to know 781 00:29:18,960 --> 00:29:20,720 it just for that line in those two 782 00:29:20,720 --> 00:29:22,559 points i want to know what across 783 00:29:22,559 --> 00:29:24,960 the whole of what i'm working with 784 00:29:24,960 --> 00:29:26,960 that's where we get into calculus 785 00:29:26,960 --> 00:29:28,799 so we talk about the distance between x 786 00:29:28,799 --> 00:29:31,120 and delta x it has to be the smallest 787 00:29:31,120 --> 00:29:33,279 possible near to zero in order to 788 00:29:33,279 --> 00:29:36,399 approximate the acceleration 789 00:29:36,399 --> 00:29:38,799 uh so the idea is instead of i mean if 790 00:29:38,799 --> 00:29:41,520 you ever did took a basic calculus class 791 00:29:41,520 --> 00:29:43,760 they would draw bars down here and you 792 00:29:43,760 --> 00:29:45,840 would divide this area up 793 00:29:45,840 --> 00:29:47,840 let's go back up the screen you divide 794 00:29:47,840 --> 00:29:50,559 this area of this time period up into 795 00:29:50,559 --> 00:29:53,039 maybe 10 sections and you'd use that and 796 00:29:53,039 --> 00:29:54,559 you could calculate the acceleration 797 00:29:54,559 --> 00:29:56,320 between each one of those 10 sections 798 00:29:56,320 --> 00:29:57,760 kind of thing 799 00:29:57,760 --> 00:29:59,600 and then we just keep making that space 800 00:29:59,600 --> 00:30:02,640 smaller and smaller until delta x is 801 00:30:02,640 --> 00:30:03,919 almost 802 00:30:03,919 --> 00:30:06,559 infinitesimally small 803 00:30:06,559 --> 00:30:08,960 and so we get a function of a 804 00:30:08,960 --> 00:30:12,559 equals a limit as h goes to 0 of a 805 00:30:12,559 --> 00:30:14,720 function of a plus 
h minus a function of 806 00:30:14,720 --> 00:30:17,440 a over h and that is you're 807 00:30:17,440 --> 00:30:20,799 computing the slope of the line 808 00:30:20,799 --> 00:30:22,399 we're just computing that slope and 809 00:30:22,399 --> 00:30:23,919 they're smaller and smaller and smaller 810 00:30:23,919 --> 00:30:25,679 samples 811 00:30:25,679 --> 00:30:27,679 and that's what calculus is calculus is 812 00:30:27,679 --> 00:30:29,760 the integral you can see down here we 813 00:30:29,760 --> 00:30:32,000 have our nice uh integral sign looks 814 00:30:32,000 --> 00:30:33,679 like a giant s 815 00:30:33,679 --> 00:30:35,679 and that's what that means is that we've 816 00:30:35,679 --> 00:30:39,120 taken this down to as small as we can 817 00:30:39,120 --> 00:30:41,120 for that sampling 818 00:30:41,120 --> 00:30:42,399 so we're talking about calculus we're 819 00:30:42,399 --> 00:30:45,279 finding the area under the slope is the 820 00:30:45,279 --> 00:30:48,000 main process in the integration 821 00:30:48,000 --> 00:30:50,000 similar small intervals are made of the 822 00:30:50,000 --> 00:30:52,320 smallest possible length of x plus delta 823 00:30:52,320 --> 00:30:55,279 x where delta x approaches almost an 824 00:30:55,279 --> 00:30:57,440 infinitesimally small space 825 00:30:57,440 --> 00:30:59,039 and then it helps to find the overall 826 00:30:59,039 --> 00:31:01,279 acceleration by summing up all the links 827 00:31:01,279 --> 00:31:02,559 together 828 00:31:02,559 --> 00:31:03,840 so we're summing up all the 829 00:31:03,840 --> 00:31:05,600 accelerations from the beginning to the 830 00:31:05,600 --> 00:31:06,480 end 831 00:31:06,480 --> 00:31:08,799 and so here's our integral we sum of a 832 00:31:08,799 --> 00:31:11,200 of x times d of x 833 00:31:11,200 --> 00:31:13,279 equals a plus c 834 00:31:13,279 --> 00:31:16,799 uh that is our basic calculus here 835 00:31:16,799 --> 00:31:19,440 so when we talk about multivariate 836 00:31:19,440 --> 
00:31:20,799 calculus 837 00:31:20,799 --> 00:31:22,640 multivariate calculus deals with 838 00:31:22,640 --> 00:31:25,440 functions that have multiple variables 839 00:31:25,440 --> 00:31:26,799 and you can see here we start getting 840 00:31:26,799 --> 00:31:30,399 into some very complicated equations um 841 00:31:30,399 --> 00:31:33,039 changing w over change of time 842 00:31:33,039 --> 00:31:35,520 equals change of w over change of z 843 00:31:35,520 --> 00:31:38,159 the differential of z to dx differential 844 00:31:38,159 --> 00:31:41,200 of x to dt it gets pretty complicated uh 845 00:31:41,200 --> 00:31:43,120 and it really translates into the 846 00:31:43,120 --> 00:31:45,039 multivariate integration using double 847 00:31:45,039 --> 00:31:47,840 integrals and so you have the the sum of 848 00:31:47,840 --> 00:31:51,200 the sum of f of x of y of d of a equals 849 00:31:51,200 --> 00:31:54,399 the sum from c to d and a to b of f of x 850 00:31:54,399 --> 00:31:56,720 y d x d y equals 851 00:31:56,720 --> 00:31:59,679 uh the sum of a to b sum of c to d of f 852 00:31:59,679 --> 00:32:03,039 x y d y d x 853 00:32:03,039 --> 00:32:05,600 understanding the very specifics of 854 00:32:05,600 --> 00:32:07,519 everything going on in here and actually 855 00:32:07,519 --> 00:32:10,880 doing the math is usually calculus 1 856 00:32:10,880 --> 00:32:14,000 calculus 2 and differential equations so 857 00:32:14,000 --> 00:32:15,440 you're talking about three full length 858 00:32:15,440 --> 00:32:17,519 courses to dig into 859 00:32:17,519 --> 00:32:20,000 and solve these math equations 860 00:32:20,000 --> 00:32:21,600 what we want to take from here is we're 861 00:32:21,600 --> 00:32:23,440 talking about calculus 862 00:32:23,440 --> 00:32:26,080 we're talking about summing of all these 863 00:32:26,080 --> 00:32:28,320 different slopes and so we're still 864 00:32:28,320 --> 00:32:30,159 solving a linear 865 00:32:30,159 --> 00:32:32,159 expression we're still 
solving 866 00:32:32,159 --> 00:32:34,960 y equals m x plus b 867 00:32:34,960 --> 00:32:36,960 but we're doing this for infinitesimally 868 00:32:36,960 --> 00:32:38,880 small x's and then we want to sum them 869 00:32:38,880 --> 00:32:40,880 up that's what this integral sign means 870 00:32:40,880 --> 00:32:45,760 the sum of a of x d of x equals a plus c 871 00:32:45,760 --> 00:32:48,000 and when you see these very complicated 872 00:32:48,000 --> 00:32:50,320 multivariate differentiation using the 873 00:32:50,320 --> 00:32:51,760 chain rule 874 00:32:51,760 --> 00:32:53,360 when we come in here and we have the 875 00:32:53,360 --> 00:32:55,840 change of w to the change of t equals 876 00:32:55,840 --> 00:32:57,760 the change of w dz 877 00:32:57,760 --> 00:32:59,919 uh and so forth 878 00:32:59,919 --> 00:33:01,360 that's what's going on here that's what 879 00:33:01,360 --> 00:33:03,279 these means we're basically looking for 880 00:33:03,279 --> 00:33:05,120 the area under the curve which really 881 00:33:05,120 --> 00:33:06,480 comes to 882 00:33:06,480 --> 00:33:09,519 how is the change changing speeds going 883 00:33:09,519 --> 00:33:12,080 up how is that changing and then you end 884 00:33:12,080 --> 00:33:14,240 up with a multiple layer so if i have 885 00:33:14,240 --> 00:33:16,320 three layers of neural networks how is 886 00:33:16,320 --> 00:33:18,399 the third layer changing based on the 887 00:33:18,399 --> 00:33:20,080 second layer changing which is based on 888 00:33:20,080 --> 00:33:22,399 the first layer changing and you get the 889 00:33:22,399 --> 00:33:23,760 picture here that now we have a very 890 00:33:23,760 --> 00:33:25,039 complicated 891 00:33:25,039 --> 00:33:27,200 multivariate integration 892 00:33:27,200 --> 00:33:29,679 with integrals 893 00:33:29,679 --> 00:33:32,480 the good news is we can solve this 894 00:33:32,480 --> 00:33:34,000 mathematically and that's what we do 895 00:33:34,000 --> 00:33:35,519 when you do neural 
networks and reverse 896 00:33:35,519 --> 00:33:37,039 propagation 897 00:33:37,039 --> 00:33:38,720 so the nice thing is that you don't have 898 00:33:38,720 --> 00:33:40,480 to solve this on paper unless you're a 899 00:33:40,480 --> 00:33:42,000 data analysis and you're working on the 900 00:33:42,000 --> 00:33:44,399 back end of integrating these formulas 901 00:33:44,399 --> 00:33:45,600 and building the script to actually 902 00:33:45,600 --> 00:33:47,760 build them so we talk about applications 903 00:33:47,760 --> 00:33:50,320 of calculus it provides us the tools to 904 00:33:50,320 --> 00:33:53,039 build an accurate predictive model 905 00:33:53,039 --> 00:33:55,360 so it's really behind the scenes we want 906 00:33:55,360 --> 00:33:56,640 to guess at what the change of the 907 00:33:56,640 --> 00:33:59,120 change of the change is 908 00:33:59,120 --> 00:34:01,360 that's a little goofy i know i just 909 00:34:01,360 --> 00:34:02,960 threw that out there it's kind of a meta 910 00:34:02,960 --> 00:34:05,360 term but if you can guess how things are 911 00:34:05,360 --> 00:34:07,519 going to change then you can guess what 912 00:34:07,519 --> 00:34:09,199 the new numbers are 913 00:34:09,199 --> 00:34:11,679 multivariate calculus explains the 914 00:34:11,679 --> 00:34:13,520 change in our target variable in 915 00:34:13,520 --> 00:34:15,440 relation to the rate of change in the 916 00:34:15,440 --> 00:34:17,440 input variables 917 00:34:17,440 --> 00:34:19,599 so there's our multiple variables going 918 00:34:19,599 --> 00:34:20,399 in there 919 00:34:20,399 --> 00:34:22,239 if one variable is changing how does it 920 00:34:22,239 --> 00:34:24,320 affect the other variable 921 00:34:24,320 --> 00:34:27,119 and then in gradient descent calculus is 922 00:34:27,119 --> 00:34:30,800 used to find the local and global maxima 923 00:34:30,800 --> 00:34:32,719 and this is really big 924 00:34:32,719 --> 00:34:33,839 we're actually going to have a whole 925 
00:34:33,839 --> 00:34:36,239 section here on gradient descent because 926 00:34:36,239 --> 00:34:37,520 it is 927 00:34:37,520 --> 00:34:39,359 really i mean i talked about neural 928 00:34:39,359 --> 00:34:41,199 networks and how you can see how the 929 00:34:41,199 --> 00:34:42,719 different layers go in there but 930 00:34:42,719 --> 00:34:46,079 gradient descent is one of the most key 931 00:34:46,079 --> 00:34:48,079 things for trying to guess the best 932 00:34:48,079 --> 00:34:49,839 answer to something 933 00:34:49,839 --> 00:34:53,760 so let's take a look at the code behind 934 00:34:53,760 --> 00:34:55,679 gradient descent 935 00:34:55,679 --> 00:34:57,839 and uh before we open up the code let's 936 00:34:57,839 --> 00:34:59,599 just do real quick 937 00:34:59,599 --> 00:35:03,280 uh gradient descent 938 00:35:03,280 --> 00:35:05,599 let's say we have a curve like this and 939 00:35:05,599 --> 00:35:07,280 most common 940 00:35:07,280 --> 00:35:09,200 is that this is going to represent your 941 00:35:09,200 --> 00:35:12,400 error oops 942 00:35:12,400 --> 00:35:15,200 error there we go error 943 00:35:15,200 --> 00:35:17,119 hard to read there and i want to make 944 00:35:17,119 --> 00:35:20,160 the error as low as possible 945 00:35:20,160 --> 00:35:22,160 and so what i'm looking at it is i want 946 00:35:22,160 --> 00:35:25,040 to find this line here 947 00:35:25,040 --> 00:35:26,480 which is the 948 00:35:26,480 --> 00:35:28,960 minimum value so we're looking for the 949 00:35:28,960 --> 00:35:33,599 minimum and it does that by uh sampling 950 00:35:33,599 --> 00:35:35,040 there 951 00:35:35,040 --> 00:35:37,200 and then based on this it guesses it 952 00:35:37,200 --> 00:35:39,280 might be someplace here and it goes hey 953 00:35:39,280 --> 00:35:41,119 this is still going down 954 00:35:41,119 --> 00:35:43,280 it goes here and then goes back over 955 00:35:43,280 --> 00:35:45,680 here and then goes a little bit closer 956 00:35:45,680 --> 
00:35:47,359 and it's just playing a high low until 957 00:35:47,359 --> 00:35:50,720 it gets to that spot that bottom spot 958 00:35:50,720 --> 00:35:54,960 and so we want to minimize the error and 959 00:35:54,960 --> 00:35:57,280 on the flip side you could also want to 960 00:35:57,280 --> 00:35:59,359 be maximizing something you want to get 961 00:35:59,359 --> 00:36:02,320 the best output of it uh that's simply 962 00:36:02,320 --> 00:36:04,320 uh minus the value 963 00:36:04,320 --> 00:36:05,839 so if you're looking for where the peak 964 00:36:05,839 --> 00:36:06,880 is 965 00:36:06,880 --> 00:36:09,760 this is the same as a negative 966 00:36:09,760 --> 00:36:11,760 for where the valley is looking for that 967 00:36:11,760 --> 00:36:12,960 valley 968 00:36:12,960 --> 00:36:14,480 that's all that is and this is a way of 969 00:36:14,480 --> 00:36:16,560 finding it so 970 00:36:16,560 --> 00:36:18,800 the cool thing is all the heavy lifting 971 00:36:18,800 --> 00:36:20,880 is done i actually 972 00:36:20,880 --> 00:36:22,720 ended up putting together one of these a 973 00:36:22,720 --> 00:36:25,119 while back when i didn't know about 974 00:36:25,119 --> 00:36:27,760 scikit and it was just starting 975 00:36:27,760 --> 00:36:29,839 boy it's a long while back 976 00:36:29,839 --> 00:36:31,040 and uh 977 00:36:31,040 --> 00:36:32,880 it's playing high low how do you play high 978 00:36:32,880 --> 00:36:35,359 low not get stuck in the valleys uh 979 00:36:35,359 --> 00:36:37,119 figure out these curves and things like 980 00:36:37,119 --> 00:36:39,440 that well you do that and the back end 981 00:36:39,440 --> 00:36:40,960 is all the calculus and differential 982 00:36:40,960 --> 00:36:43,520 equations to calculate this out 983 00:36:43,520 --> 00:36:45,280 the good news is you don't have to do 984 00:36:45,280 --> 00:36:47,040 those 985 00:36:47,040 --> 00:36:48,400 so instead we're going to put together 986 00:36:48,400 --> 00:36:52,480 the code and let's 
go ahead 987 00:36:52,480 --> 00:36:55,839 and see what we can do with that 988 00:36:56,800 --> 00:36:59,040 so uh guys in the back put together a 989 00:36:59,040 --> 00:37:00,800 nice little piece of code here which is 990 00:37:00,800 --> 00:37:02,079 kind of fun 991 00:37:02,079 --> 00:37:04,320 uh some things we're gonna note and this 992 00:37:04,320 --> 00:37:06,240 is this is really important stuff 993 00:37:06,240 --> 00:37:07,680 because when you start doing your data 994 00:37:07,680 --> 00:37:09,680 science and digging into your machine 995 00:37:09,680 --> 00:37:11,280 learning models 996 00:37:11,280 --> 00:37:12,720 you're going to find 997 00:37:12,720 --> 00:37:14,800 these things are stumbling blocks 998 00:37:14,800 --> 00:37:17,119 the first one is current x where do we 999 00:37:17,119 --> 00:37:18,480 start at 1000 00:37:18,480 --> 00:37:20,079 keep in mind 1001 00:37:20,079 --> 00:37:20,880 your 1002 00:37:20,880 --> 00:37:23,119 model that you're working with is very 1003 00:37:23,119 --> 00:37:25,119 generic so whatever you use to minimize 1004 00:37:25,119 --> 00:37:27,119 it the first question is where do we 1005 00:37:27,119 --> 00:37:28,640 start 1006 00:37:28,640 --> 00:37:30,320 and we started at this because the 1007 00:37:30,320 --> 00:37:32,720 algorithm starts at x equals three 1008 00:37:32,720 --> 00:37:35,280 so we arbitrarily picked five 1009 00:37:35,280 --> 00:37:37,200 learning rate is uh how many bars to 1010 00:37:37,200 --> 00:37:39,599 skip going one way or the other i'm in 1011 00:37:39,599 --> 00:37:40,880 fact i'm going to separate that a little 1012 00:37:40,880 --> 00:37:42,079 bit because these two are really 1013 00:37:42,079 --> 00:37:43,119 important 1014 00:37:43,119 --> 00:37:45,440 um if we're dealing with something like 1015 00:37:45,440 --> 00:37:47,440 this where we're talking about 1016 00:37:47,440 --> 00:37:49,119 well here's our here's the function 1017 00:37:49,119 --> 00:37:51,280 we're going to 
use our 1018 00:37:51,280 --> 00:37:53,280 gradient of our function 1019 00:37:53,280 --> 00:37:56,160 2 times x plus 5 keep it simple so 1020 00:37:56,160 --> 00:37:57,440 that's a function we're going to work 1021 00:37:57,440 --> 00:37:59,839 with so if i'm dealing with increments 1022 00:37:59,839 --> 00:38:02,880 of 1000 then 0.1 is going to take a very long 1023 00:38:02,880 --> 00:38:05,839 time and if i'm dealing with increments 1024 00:38:05,839 --> 00:38:08,480 of 0.001 1025 00:38:08,480 --> 00:38:11,520 then 0.1 is going to skip over my answer so i 1026 00:38:11,520 --> 00:38:13,520 won't get a very good answer 1027 00:38:13,520 --> 00:38:15,280 and then we look at precision this tells 1028 00:38:15,280 --> 00:38:18,400 us when to stop the algorithm so again 1029 00:38:18,400 --> 00:38:21,280 very specific to what you're working on 1030 00:38:21,280 --> 00:38:23,040 if you're working with money 1031 00:38:23,040 --> 00:38:24,240 and 1032 00:38:24,240 --> 00:38:28,240 you don't convert it into a float value 1033 00:38:28,240 --> 00:38:31,040 you might be dealing with 0.01 which is 1034 00:38:31,040 --> 00:38:32,640 a penny that might be your precision 1035 00:38:32,640 --> 00:38:34,560 you're working with 1036 00:38:34,560 --> 00:38:36,160 and then of course the previous step 1037 00:38:36,160 --> 00:38:39,200 size max iterations we want something to 1038 00:38:39,200 --> 00:38:41,040 cut out at a certain point usually 1039 00:38:41,040 --> 00:38:43,680 that's built into a lot of minimization 1040 00:38:43,680 --> 00:38:44,880 functions 1041 00:38:44,880 --> 00:38:47,040 and then here's our actual uh formula 1042 00:38:47,040 --> 00:38:48,720 we're going to be working with 1043 00:38:48,720 --> 00:38:50,720 and then we come in we go while previous 1044 00:38:50,720 --> 00:38:52,800 step size is greater than precision and 1045 00:38:52,800 --> 00:38:56,960 iters is less than max iters 1046 00:38:56,960 --> 00:38:59,280 say that 10 times fast 1047 00:38:59,280 --> 
00:39:00,960 um 1048 00:39:00,960 --> 00:39:02,960 we're just saying if it's uh if we're if 1049 00:39:02,960 --> 00:39:04,480 we're still greater than our precision 1050 00:39:04,480 --> 00:39:05,920 level we still got to keep digging 1051 00:39:05,920 --> 00:39:07,440 deeper 1052 00:39:07,440 --> 00:39:09,359 and then we also don't want to go past a 1053 00:39:09,359 --> 00:39:11,680 thou or whatever this is a million or 10 1054 00:39:11,680 --> 00:39:12,720 000 1055 00:39:12,720 --> 00:39:14,880 running that's actually pretty high we 1056 00:39:14,880 --> 00:39:16,880 almost never do max iterations more than 1057 00:39:16,880 --> 00:39:19,200 like 100 or 200 1058 00:39:19,200 --> 00:39:20,640 rare occasions you might go up to four 1059 00:39:20,640 --> 00:39:22,720 or 500 depending on the problem 1060 00:39:22,720 --> 00:39:24,000 you're working with 1061 00:39:24,000 --> 00:39:26,079 uh so we have our previous equals our 1062 00:39:26,079 --> 00:39:29,760 current that way we can track time wise 1063 00:39:29,760 --> 00:39:31,760 the current now equals the current minus 1064 00:39:31,760 --> 00:39:34,480 the rate times the gradient at our 1065 00:39:34,480 --> 00:39:36,000 previous x 1066 00:39:36,000 --> 00:39:38,640 so now we've generated our new version 1067 00:39:38,640 --> 00:39:41,280 previous step size equals the absolute value of 1068 00:39:41,280 --> 00:39:43,359 current minus previous 1069 00:39:43,359 --> 00:39:46,320 so we're looking for the change in x 1070 00:39:46,320 --> 00:39:48,560 iters equals iters plus one that's 1071 00:39:48,560 --> 00:39:50,800 so we know to stop if we get too far 1072 00:39:50,800 --> 00:39:51,920 and then we're just going to print the 1073 00:39:51,920 --> 00:39:54,480 local minimum occurs at 1074 00:39:54,480 --> 00:39:56,800 x on here and if we go ahead and run 1075 00:39:56,800 --> 00:39:58,720 this 1076 00:39:58,720 --> 00:40:00,400 you can see right here it gets down to 1077 00:40:00,400 --> 00:40:03,839 this point 
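The loop described above can be sketched in a few lines of Python. This is a minimal sketch, not the exact notebook from the video: the variable names follow the narration, but the gradient `2 * (x + 5)` is only one reading of the spoken "2 times x plus 5", and the learning rate, precision and iteration cap are assumed values, so the printed number may differ from the one quoted in the video.

```python
# Minimal gradient descent sketch; the numeric settings are assumptions.
cur_x = 3.0               # starting point (the narration starts at x = 3)
rate = 0.01               # learning rate (assumed value)
precision = 0.000001      # stop once a step is smaller than this (assumed)
previous_step_size = 1.0  # larger than precision so the loop starts
max_iters = 10000         # hard cap on iterations (assumed value)
iters = 0                 # iteration counter


def df(x):
    """Gradient of f(x) = (x + 5) ** 2, one reading of '2 times x plus 5'."""
    return 2 * (x + 5)


while previous_step_size > precision and iters < max_iters:
    prev_x = cur_x                            # remember where we were
    cur_x = cur_x - rate * df(prev_x)         # step against the gradient
    previous_step_size = abs(cur_x - prev_x)  # how far x moved this step
    iters += 1                                # count the iteration

print("The local minimum occurs at", cur_x)
```

With this choice of gradient the iterate moves toward x = -5, where the gradient is zero. A larger rate takes bigger steps, and a smaller precision keeps the loop running longer before it stops.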
and it says hey 1078 00:40:03,839 --> 00:40:06,280 local minimum is minus 1079 00:40:06,280 --> 00:40:08,960 3.3222 for this particular series we 1080 00:40:08,960 --> 00:40:11,440 created this is created off of our 1081 00:40:11,440 --> 00:40:15,200 formula here lambda x 2 times x plus five 1082 00:40:15,200 --> 00:40:16,400 now 1083 00:40:16,400 --> 00:40:18,480 when i'm running this stuff you'll see 1084 00:40:18,480 --> 00:40:21,040 this come up a lot 1085 00:40:21,040 --> 00:40:21,920 in 1086 00:40:21,920 --> 00:40:24,880 with the sklearn kit and one of the 1087 00:40:24,880 --> 00:40:26,640 nice reasons of breaking this down the 1088 00:40:26,640 --> 00:40:27,920 way we did 1089 00:40:27,920 --> 00:40:30,560 is i could go over those top pieces 1090 00:40:30,560 --> 00:40:32,560 those top pieces are everything when you 1091 00:40:32,560 --> 00:40:34,880 start looking at these minimization tool 1092 00:40:34,880 --> 00:40:39,359 kits in built-in code and so from 1093 00:40:39,359 --> 00:40:41,760 we'll just do it's actually 1094 00:40:41,760 --> 00:40:43,359 docs 1095 00:40:43,359 --> 00:40:44,720 dot 1096 00:40:44,720 --> 00:40:46,880 scipy.org 1097 00:40:46,880 --> 00:40:49,200 and we're looking at 1098 00:40:49,200 --> 00:40:50,960 the scipy 1099 00:40:50,960 --> 00:40:54,720 there we go optimize minimize 1100 00:40:54,720 --> 00:40:57,119 you can only minimize one value 1101 00:40:57,119 --> 00:40:58,560 you have the function that's going in 1102 00:40:58,560 --> 00:41:01,359 this function can be very complicated 1103 00:41:01,359 --> 00:41:03,040 so we used a very simple function up 1104 00:41:03,040 --> 00:41:03,920 here 1105 00:41:03,920 --> 00:41:05,760 it could be 1106 00:41:05,760 --> 00:41:06,960 there's all kinds of things that could 1107 00:41:06,960 --> 00:41:08,319 be on there and there's a number of 1108 00:41:08,319 --> 00:41:10,319 methods to solve this as far as how they 1109 00:41:10,319 --> 00:41:11,599 shrink down 1110 00:41:11,599 --> 
00:41:13,440 uh and your x naught there's your 1111 00:41:13,440 --> 00:41:15,040 there's your start value so your 1112 00:41:15,040 --> 00:41:18,160 function your start value 1113 00:41:18,160 --> 00:41:19,520 there's all kinds of things that come in 1114 00:41:19,520 --> 00:41:20,720 here that you can look at which we're 1115 00:41:20,720 --> 00:41:22,319 not going to 1116 00:41:22,319 --> 00:41:24,400 optimization automatically creates 1117 00:41:24,400 --> 00:41:26,720 constraints bounds 1118 00:41:26,720 --> 00:41:28,560 some of this it does automatically but 1119 00:41:28,560 --> 00:41:30,640 you really the big thing i want to point 1120 00:41:30,640 --> 00:41:32,160 out here is you need to have a starting 1121 00:41:32,160 --> 00:41:34,079 point you want to start with something 1122 00:41:34,079 --> 00:41:35,599 that you already know is mostly the 1123 00:41:35,599 --> 00:41:36,880 answer 1124 00:41:36,880 --> 00:41:38,079 if you don't then it's going to have a 1125 00:41:38,079 --> 00:41:39,680 heck of a time trying to calculate it 1126 00:41:39,680 --> 00:41:41,359 out 1127 00:41:41,359 --> 00:41:42,640 or you can write your own little script 1128 00:41:42,640 --> 00:41:44,480 that does this and does a high low 1129 00:41:44,480 --> 00:41:47,280 guessing and tries to find the max value 1130 00:41:47,280 --> 00:41:50,240 that brings us to statistics what this 1131 00:41:50,240 --> 00:41:52,319 is kind of all about is figuring things 1132 00:41:52,319 --> 00:41:55,359 out a lot of vocabulary and statistics 1133 00:41:55,359 --> 00:41:58,319 uh so statistics well i guess it's all 1134 00:41:58,319 --> 00:42:00,079 relative it's definitely not an edel 1135 00:42:00,079 --> 00:42:01,359 class 1136 00:42:01,359 --> 00:42:03,359 so a bunch of stuff going on statistics 1137 00:42:03,359 --> 00:42:05,680 statistics concerns with the collection 1138 00:42:05,680 --> 00:42:08,640 organization analysis interpretation 1139 00:42:08,640 --> 00:42:12,160 and presentation 
of data 1140 00:42:12,160 --> 00:42:14,640 that is a mouthful 1141 00:42:14,640 --> 00:42:17,280 so we have from end to end 1142 00:42:17,280 --> 00:42:19,280 where does it come from is it valid what 1143 00:42:19,280 --> 00:42:22,079 does it mean how do we organize it um 1144 00:42:22,079 --> 00:42:24,240 how do we analyze it and then you gotta 1145 00:42:24,240 --> 00:42:26,079 take those analysis and interpret it 1146 00:42:26,079 --> 00:42:28,240 into something that people can use kind 1147 00:42:28,240 --> 00:42:29,680 of reduce it to 1148 00:42:29,680 --> 00:42:31,359 understandable 1149 00:42:31,359 --> 00:42:32,880 and nowadays you have to be able to 1150 00:42:32,880 --> 00:42:34,800 present it if you can't present it then 1151 00:42:34,800 --> 00:42:36,000 no one else is going to understand what 1152 00:42:36,000 --> 00:42:38,640 the heck you did 1153 00:42:38,640 --> 00:42:41,440 so we look at the terminologies 1154 00:42:41,440 --> 00:42:43,040 there is a lot of terminologies 1155 00:42:43,040 --> 00:42:45,040 depending on what domain you're working 1156 00:42:45,040 --> 00:42:45,839 in 1157 00:42:45,839 --> 00:42:49,119 so clearly if you're working in 1158 00:42:49,119 --> 00:42:52,000 a domain that deals with 1159 00:42:52,000 --> 00:42:56,160 viruses and t cells and and 1160 00:42:56,160 --> 00:42:57,680 how does you know where does that come 1161 00:42:57,680 --> 00:42:58,800 from you're studying the different 1162 00:42:58,800 --> 00:43:00,880 people that you can have a population 1163 00:43:00,880 --> 00:43:04,640 if you are working with um 1164 00:43:04,640 --> 00:43:06,720 mechanical gear 1165 00:43:06,720 --> 00:43:07,760 you know a little bit different if 1166 00:43:07,760 --> 00:43:09,200 you're looking for the wobbling 1167 00:43:09,200 --> 00:43:12,000 statistics uh to know when to replace a 1168 00:43:12,000 --> 00:43:13,520 rotor on a machine or something like 1169 00:43:13,520 --> 00:43:14,319 that 1170 00:43:14,319 --> 00:43:15,920 that 
can be a big deal you know we have 1171 00:43:15,920 --> 00:43:18,560 these huge fans that turn 1172 00:43:18,560 --> 00:43:21,760 in our sewage processing systems and so 1173 00:43:21,760 --> 00:43:24,160 those fans they start to wobble and hum 1174 00:43:24,160 --> 00:43:25,680 and do different things that the sensors 1175 00:43:25,680 --> 00:43:28,240 pick up at what point do you replace them 1176 00:43:28,240 --> 00:43:29,599 instead of waiting for it to break in 1177 00:43:29,599 --> 00:43:31,119 which case it costs a lot of money 1178 00:43:31,119 --> 00:43:32,640 instead of replacing a bushing you're 1179 00:43:32,640 --> 00:43:35,200 replacing the whole fan unit 1180 00:43:35,200 --> 00:43:37,200 an interesting project that came up for 1181 00:43:37,200 --> 00:43:39,200 our city a while back 1182 00:43:39,200 --> 00:43:41,440 so population all objects or 1183 00:43:41,440 --> 00:43:43,760 measurements whose properties are being 1184 00:43:43,760 --> 00:43:45,040 observed 1185 00:43:45,040 --> 00:43:46,960 so that's your population all the 1186 00:43:46,960 --> 00:43:49,359 objects it's easy to see it with people 1187 00:43:49,359 --> 00:43:52,880 because we have our population at large 1188 00:43:52,880 --> 00:43:55,200 but in the case of the sewer fans we're 1189 00:43:55,200 --> 00:43:56,880 talking about having the fan units 1190 00:43:56,880 --> 00:43:58,400 that's the population of fans that we're 1191 00:43:58,400 --> 00:44:00,560 working with 1192 00:44:00,560 --> 00:44:03,040 you have a parameter a metric that is 1193 00:44:03,040 --> 00:44:04,960 used to represent a population or 1194 00:44:04,960 --> 00:44:06,560 characteristic 1195 00:44:06,560 --> 00:44:08,560 you have your sample a subset of the 1196 00:44:08,560 --> 00:44:10,800 population studied you don't want to do 1197 00:44:10,800 --> 00:44:12,560 them all because then you don't have a 1198 00:44:12,560 --> 00:44:14,319 if you come up with a conclusion for 1199 00:44:14,319 --> 
00:44:16,000 everyone you don't have a way of testing 1200 00:44:16,000 --> 00:44:17,760 it so you take a sample 1201 00:44:17,760 --> 00:44:18,880 sometimes you don't have a choice you 1202 00:44:18,880 --> 00:44:20,480 can only take a sample of what's going 1203 00:44:20,480 --> 00:44:24,000 on you can't study the whole population 1204 00:44:24,000 --> 00:44:26,319 and a variable a metric of interest for 1205 00:44:26,319 --> 00:44:30,240 each person or object in a population 1206 00:44:30,240 --> 00:44:31,839 types of sampling 1207 00:44:31,839 --> 00:44:34,160 we have a probabilistic approach 1208 00:44:34,160 --> 00:44:35,920 selecting samples from a larger 1209 00:44:35,920 --> 00:44:38,319 population using a method based on the 1210 00:44:38,319 --> 00:44:40,880 theory of probability 1211 00:44:40,880 --> 00:44:42,079 and we'll go into a little bit more 1212 00:44:42,079 --> 00:44:44,000 deeper on these we have random 1213 00:44:44,000 --> 00:44:46,640 systematic stratified and then you have 1214 00:44:46,640 --> 00:44:49,040 a non-probabilistic approach selecting 1215 00:44:49,040 --> 00:44:51,520 samples based on the subjective judgment 1216 00:44:51,520 --> 00:44:53,760 of the researcher rather than random 1217 00:44:53,760 --> 00:44:55,119 selection 1218 00:44:55,119 --> 00:44:56,960 it has to do with convenience trying to 1219 00:44:56,960 --> 00:44:58,480 reach a quota 1220 00:44:58,480 --> 00:45:00,800 or snowball 1221 00:45:00,800 --> 00:45:02,560 and they're very biased that's one of 1222 00:45:02,560 --> 00:45:04,160 the reasons you'll see this big stamp on 1223 00:45:04,160 --> 00:45:06,160 that says biased so you gotta be very 1224 00:45:06,160 --> 00:45:08,079 careful on that 1225 00:45:08,079 --> 00:45:10,720 so probabilistic sampling uh when we 1226 00:45:10,720 --> 00:45:12,800 talk about a random sampling we select 1227 00:45:12,800 --> 00:45:15,040 random size samples from each group or 1228 00:45:15,040 --> 00:45:17,520 category so we it's 
as random as you can 1229 00:45:17,520 --> 00:45:21,440 get we talk about systematic sampling 1230 00:45:21,440 --> 00:45:23,760 we're selecting random size samples from 1231 00:45:23,760 --> 00:45:25,760 each group or category with a fixed 1232 00:45:25,760 --> 00:45:28,079 periodic interval 1233 00:45:28,079 --> 00:45:29,920 uh so we kind of split it up this would 1234 00:45:29,920 --> 00:45:31,599 be like a time set up or different 1235 00:45:31,599 --> 00:45:32,960 categories 1236 00:45:32,960 --> 00:45:34,400 and you might ask your question what is 1237 00:45:34,400 --> 00:45:37,040 a category or a group 1238 00:45:37,040 --> 00:45:38,640 if you look at i'm going to go back a 1239 00:45:38,640 --> 00:45:41,359 window let's say we're studying 1240 00:45:41,359 --> 00:45:44,800 economics of different of an area 1241 00:45:44,800 --> 00:45:47,680 we know pretty much that based on their 1242 00:45:47,680 --> 00:45:49,839 culture where they came from 1243 00:45:49,839 --> 00:45:52,560 they might need to be separated and so 1244 00:45:52,560 --> 00:45:54,960 uh and when i say separated i don't mean 1245 00:45:54,960 --> 00:45:56,640 separated from their 1246 00:45:56,640 --> 00:45:58,800 place where they live i mean as far as 1247 00:45:58,800 --> 00:46:00,400 the analysis we want to look at the 1248 00:46:00,400 --> 00:46:01,920 different groups and make sure they're 1249 00:46:01,920 --> 00:46:03,520 all represented 1250 00:46:03,520 --> 00:46:06,240 so if we had like an eighty percent uh 1251 00:46:06,240 --> 00:46:09,760 of a group that is uh say hispanic and 1252 00:46:09,760 --> 00:46:13,040 or indian and also in that same area we 1253 00:46:13,040 --> 00:46:15,760 have 20 20 percent who are 1254 00:46:15,760 --> 00:46:17,359 let's call our expatriates they left 1255 00:46:17,359 --> 00:46:19,520 america and they're nice and 1256 00:46:19,520 --> 00:46:22,000 your caucasian group we might want to 1257 00:46:22,000 --> 00:46:24,400 sample a group that is 
representative of 1258 00:46:24,400 --> 00:46:26,720 both uh so we're talking about 1259 00:46:26,720 --> 00:46:29,280 stratified sampling and we're talking 1260 00:46:29,280 --> 00:46:30,560 about groups those are the groups we're 1261 00:46:30,560 --> 00:46:32,079 talking about and it brings us to 1262 00:46:32,079 --> 00:46:33,599 stratified sampling selecting 1263 00:46:33,599 --> 00:46:35,760 approximately equal size samples from 1264 00:46:35,760 --> 00:46:38,160 each group or category 1265 00:46:38,160 --> 00:46:40,160 this way we can actually separate the 1266 00:46:40,160 --> 00:46:43,359 categories and give us an insight into 1267 00:46:43,359 --> 00:46:44,800 the different cultures and how that 1268 00:46:44,800 --> 00:46:47,119 might affect them in that area 1269 00:46:47,119 --> 00:46:49,040 so you can see these are very very 1270 00:46:49,040 --> 00:46:50,480 different kind of 1271 00:46:50,480 --> 00:46:52,720 depends on what you're working with 1272 00:46:52,720 --> 00:46:54,640 as far as your data and what you're 1273 00:46:54,640 --> 00:46:55,599 studying 1274 00:46:55,599 --> 00:46:57,520 and so we can see here just a little bit 1275 00:46:57,520 --> 00:46:59,440 more we'd have selecting 25 employees 1276 00:46:59,440 --> 00:47:02,240 from a company of 250 employees randomly 1277 00:47:02,240 --> 00:47:03,440 don't care anything about them what 1278 00:47:03,440 --> 00:47:05,200 groups are in which office they're in 1279 00:47:05,200 --> 00:47:06,800 nothing 1280 00:47:06,800 --> 00:47:08,560 and we might be selecting one employee 1281 00:47:08,560 --> 00:47:10,720 from every 50 unique employees and a 1282 00:47:10,720 --> 00:47:13,280 company of 250 employees 1283 00:47:13,280 --> 00:47:15,040 and then we have selecting one employee 1284 00:47:15,040 --> 00:47:17,359 from every branch in the company office 1285 00:47:17,359 --> 00:47:18,880 so we have all the different branches 1286 00:47:18,880 --> 00:47:20,960 there's our group or categories 
by the 1287 00:47:20,960 --> 00:47:23,040 branch and the category could depend on 1288 00:47:23,040 --> 00:47:25,040 what you're studying so it has a lot of 1289 00:47:25,040 --> 00:47:26,640 variation on there 1290 00:47:26,640 --> 00:47:28,000 you see this kind of grouping and 1291 00:47:28,000 --> 00:47:30,400 categorizing is also used to generate a 1292 00:47:30,400 --> 00:47:33,359 lot of misinformation 1293 00:47:33,359 --> 00:47:35,520 so if you only study one group and you 1294 00:47:35,520 --> 00:47:37,359 say this is what it is 1295 00:47:37,359 --> 00:47:38,960 then everybody assumes that's what it is 1296 00:47:38,960 --> 00:47:40,480 for everybody and so you've got to be 1297 00:47:40,480 --> 00:47:41,680 very careful of that and it's very 1298 00:47:41,680 --> 00:47:44,400 unethical thing to kind of do 1299 00:47:44,400 --> 00:47:46,880 so types of statistics uh we talk about 1300 00:47:46,880 --> 00:47:48,160 statistics 1301 00:47:48,160 --> 00:47:50,240 we're going to talk about descriptive 1302 00:47:50,240 --> 00:47:52,880 and inferential statistics 1303 00:47:52,880 --> 00:47:54,720 there are so many different terms and 1304 00:47:54,720 --> 00:47:58,000 statistics to break it up uh so we so 1305 00:47:58,000 --> 00:48:00,240 we're talking about a particular 1306 00:48:00,240 --> 00:48:01,440 setup 1307 00:48:01,440 --> 00:48:03,200 so we're talking about descriptive and 1308 00:48:03,200 --> 00:48:05,920 inferential uh statistics 1309 00:48:05,920 --> 00:48:08,560 the base of the word describe 1310 00:48:08,560 --> 00:48:10,560 is pretty solid you're describing the 1311 00:48:10,560 --> 00:48:13,040 data what does it look like with 1312 00:48:13,040 --> 00:48:15,040 inferential statistics we're going to 1313 00:48:15,040 --> 00:48:17,119 take that from the small population to a 1314 00:48:17,119 --> 00:48:19,200 large population so if you're working 1315 00:48:19,200 --> 00:48:21,200 with a drug company you might look at 1316 00:48:21,200 --> 
00:48:23,040 the data and say these people were 1317 00:48:23,040 --> 00:48:24,720 helped by this drug 1318 00:48:24,720 --> 00:48:25,599 they did 1319 00:48:25,599 --> 00:48:27,200 80 percent better 1320 00:48:27,200 --> 00:48:29,040 as far as their health or 80 percent 1321 00:48:29,040 --> 00:48:32,400 better survival rate than the people 1322 00:48:32,400 --> 00:48:34,160 who did not have the drug so we can 1323 00:48:34,160 --> 00:48:36,160 infer that that drug will work in the 1324 00:48:36,160 --> 00:48:38,400 greater populace and will help people so 1325 00:48:38,400 --> 00:48:40,400 that's where you get your inferential so 1326 00:48:40,400 --> 00:48:41,280 we are 1327 00:48:41,280 --> 00:48:42,880 predicting how it's going to affect the 1328 00:48:42,880 --> 00:48:44,880 greater population 1329 00:48:44,880 --> 00:48:46,880 so descriptive statistics it is used to 1330 00:48:46,880 --> 00:48:49,280 describe the basic features of data and 1331 00:48:49,280 --> 00:48:51,839 form the basis of quantitative analysis 1332 00:48:51,839 --> 00:48:53,280 of data 1333 00:48:53,280 --> 00:48:54,960 so we have a measure of central 1334 00:48:54,960 --> 00:48:57,119 tendencies we have your mean median and 1335 00:48:57,119 --> 00:48:58,240 mode 1336 00:48:58,240 --> 00:49:00,400 and then we have a measure of spread 1337 00:49:00,400 --> 00:49:02,880 like your range your interquartile range 1338 00:49:02,880 --> 00:49:04,319 your variance and your standard 1339 00:49:04,319 --> 00:49:05,839 deviation 1340 00:49:05,839 --> 00:49:06,960 and we're going to look at all these a 1341 00:49:06,960 --> 00:49:08,880 little deeper here in a second 1342 00:49:08,880 --> 00:49:12,640 but one of them you can think of is 1343 00:49:12,640 --> 00:49:14,960 how it the data difference 1344 00:49:14,960 --> 00:49:17,359 differences you know what's the max min 1345 00:49:17,359 --> 00:49:19,839 range all that stuff is your spread and 1346 00:49:19,839 --> 00:49:21,680 anything that's just 
a single number is 1347 00:49:21,680 --> 00:49:24,000 usually your central uh tendencies 1348 00:49:24,000 --> 00:49:26,160 measure of central tendencies 1349 00:49:26,160 --> 00:49:28,000 so we talk about the mean it is the 1350 00:49:28,000 --> 00:49:30,480 average of the set of values considered 1351 00:49:30,480 --> 00:49:32,480 what is the average outcome of whatever 1352 00:49:32,480 --> 00:49:33,920 is going on 1353 00:49:33,920 --> 00:49:35,520 and then your median 1354 00:49:35,520 --> 00:49:37,680 separates the higher half and the lower 1355 00:49:37,680 --> 00:49:40,400 half of data 1356 00:49:40,400 --> 00:49:42,400 so where's the center point of all your 1357 00:49:42,400 --> 00:49:45,760 different data points so your mean might 1358 00:49:45,760 --> 00:49:47,839 have some a couple really big numbers 1359 00:49:47,839 --> 00:49:49,440 that skew it 1360 00:49:49,440 --> 00:49:51,760 so that the average is much higher than 1361 00:49:51,760 --> 00:49:54,640 if you took those outliers out where the 1362 00:49:54,640 --> 00:49:57,359 median would by separating the high from 1363 00:49:57,359 --> 00:49:59,599 the low might give you a much lower 1364 00:49:59,599 --> 00:50:01,040 number you might look at and say oh 1365 00:50:01,040 --> 00:50:03,200 that's that's odd why is the average so 1366 00:50:03,200 --> 00:50:04,880 much higher than the median well it's 1367 00:50:04,880 --> 00:50:06,400 because you have some outliers or why is 1368 00:50:06,400 --> 00:50:07,839 it so much lower 1369 00:50:07,839 --> 00:50:09,599 and then the mode is the most frequent 1370 00:50:09,599 --> 00:50:11,200 appearing value 1371 00:50:11,200 --> 00:50:12,400 this is really interesting if you're 1372 00:50:12,400 --> 00:50:14,480 studying economics and how people are 1373 00:50:14,480 --> 00:50:16,319 doing you might find that the most 1374 00:50:16,319 --> 00:50:17,839 common 1375 00:50:17,839 --> 00:50:19,880 income like in the us was 1376 00:50:19,880 --> 00:50:22,880 
1.24 000 a year 1377 00:50:22,880 --> 00:50:26,240 where the average was closer to 80 000 1378 00:50:26,240 --> 00:50:28,240 and it's like wow what a difference well 1379 00:50:28,240 --> 00:50:29,839 there's some people have a lot of money 1380 00:50:29,839 --> 00:50:32,160 and so that skews that way up so the 1381 00:50:32,160 --> 00:50:34,079 average person is not making that kind 1382 00:50:34,079 --> 00:50:36,000 of money and then you look at the median 1383 00:50:36,000 --> 00:50:37,359 income and you're like well the median 1384 00:50:37,359 --> 00:50:39,280 income is a little bit closer to the 1385 00:50:39,280 --> 00:50:41,200 average so it does create a very 1386 00:50:41,200 --> 00:50:43,520 interesting way of looking at the data 1387 00:50:43,520 --> 00:50:45,680 again these are all uh central 1388 00:50:45,680 --> 00:50:47,520 tendencies single numbers you can look 1389 00:50:47,520 --> 00:50:50,480 at for the whole spread of the data 1390 00:50:50,480 --> 00:50:52,640 and we look at the measure of central 1391 00:50:52,640 --> 00:50:54,880 tendencies the mean is the average marks 1392 00:50:54,880 --> 00:50:56,960 of a students in a classroom so here we 1393 00:50:56,960 --> 00:50:58,880 have the mean sum of the marks of the 1394 00:50:58,880 --> 00:51:01,280 students total number of students and as 1395 00:51:01,280 --> 00:51:03,359 we talked about the median 1396 00:51:03,359 --> 00:51:04,480 we have 1397 00:51:04,480 --> 00:51:07,040 0 through 10 and we take half the 1398 00:51:07,040 --> 00:51:08,400 numbers and put them on one side of the 1399 00:51:08,400 --> 00:51:09,920 line half the numbers on the other side 1400 00:51:09,920 --> 00:51:12,079 of the line uh we end up with five in 1401 00:51:12,079 --> 00:51:14,000 the middle and then the mode what mark 1402 00:51:14,000 --> 00:51:16,400 was scored by most of the students in a 1403 00:51:16,400 --> 00:51:17,680 test 1404 00:51:17,680 --> 00:51:19,680 in a simple case where most people 1405 
00:51:19,680 --> 00:51:21,839 scored like an 82 percent and got 1406 00:51:21,839 --> 00:51:24,400 certain problems wrong easy to figure 1407 00:51:24,400 --> 00:51:27,440 out uh not so easy when you have 1408 00:51:27,440 --> 00:51:29,359 different areas where like you have like 1409 00:51:29,359 --> 00:51:31,680 the um oh let's go back to economy a 1410 00:51:31,680 --> 00:51:33,280 little bit more difficult to calculate 1411 00:51:33,280 --> 00:51:34,880 if you have a large group that scores 1412 00:51:34,880 --> 00:51:36,720 that makes 30 000 1413 00:51:36,720 --> 00:51:38,480 and a slightly bigger group that makes 1414 00:51:38,480 --> 00:51:40,880 26 000 so what do you put down for the 1415 00:51:40,880 --> 00:51:42,800 mode uh certainly there's a number of 1416 00:51:42,800 --> 00:51:44,319 ways to calculate that and there's 1417 00:51:44,319 --> 00:51:45,599 actually a different variations 1418 00:51:45,599 --> 00:51:47,280 depending on what you're doing 1419 00:51:47,280 --> 00:51:49,040 so now we're looking at a measure of 1420 00:51:49,040 --> 00:51:51,359 spread uh range what's the difference 1421 00:51:51,359 --> 00:51:53,599 between the highest and the lowest value 1422 00:51:53,599 --> 00:51:55,200 first thing you want to look at you know 1423 00:51:55,200 --> 00:51:56,559 it's uh we had everybody in the test 1424 00:51:56,559 --> 00:51:59,440 scored between 60 and 100 so we got 100 1425 00:51:59,440 --> 00:52:02,800 or maybe 60 to 90 it was so hard a lot 1426 00:52:02,800 --> 00:52:04,000 of people could not get a hundred 1427 00:52:04,000 --> 00:52:05,839 percent 1428 00:52:05,839 --> 00:52:06,920 you have your 1429 00:52:06,920 --> 00:52:10,079 inter-quartile range quartiles divide a 1430 00:52:10,079 --> 00:52:12,720 rank ordered data set into four equal 1431 00:52:12,720 --> 00:52:14,319 parts 1432 00:52:14,319 --> 00:52:16,319 very common thing to do as part of all 1433 00:52:16,319 --> 00:52:18,160 the basic packages whether you're 1434 
00:52:18,160 --> 00:52:20,160 working in 1435 00:52:20,160 --> 00:52:22,319 data frames with pandas whether you're 1436 00:52:22,319 --> 00:52:24,079 working in scala whether you're working 1437 00:52:24,079 --> 00:52:25,440 in r 1438 00:52:25,440 --> 00:52:26,800 you'll see this come up where they have 1439 00:52:26,800 --> 00:52:29,280 range your min your max and then it'll 1440 00:52:29,280 --> 00:52:31,440 have your interquartile range how does 1441 00:52:31,440 --> 00:52:33,599 it look like in each quarter of data 1442 00:52:33,599 --> 00:52:36,240 variance measures how far each number in 1443 00:52:36,240 --> 00:52:38,640 the set is from the mean and therefore 1444 00:52:38,640 --> 00:52:41,760 from every other number in the set 1445 00:52:41,760 --> 00:52:43,520 so you have like how much turbulence is 1446 00:52:43,520 --> 00:52:44,839 going on in this 1447 00:52:44,839 --> 00:52:48,079 data and then the standard deviation 1448 00:52:48,079 --> 00:52:49,839 it is to measure the variance or the 1449 00:52:49,839 --> 00:52:51,680 dispersion of a set of values from the 1450 00:52:51,680 --> 00:52:52,800 mean 1451 00:52:52,800 --> 00:52:55,200 and you'll usually see uh if i'm doing a 1452 00:52:55,200 --> 00:52:58,160 graph i might have the value graphed 1453 00:52:58,160 --> 00:53:00,880 and then based on the the error i might 1454 00:53:00,880 --> 00:53:03,040 graph graph the standard deviation in 1455 00:53:03,040 --> 00:53:04,800 the error on the graph as a background 1456 00:53:04,800 --> 00:53:07,200 so you can see how far off it is 1457 00:53:07,200 --> 00:53:10,160 uh so standard deviation is used a lot 1458 00:53:10,160 --> 00:53:12,480 so measurement of spread uh marks of a 1459 00:53:12,480 --> 00:53:15,200 student out of 100 we have here from 50 1460 00:53:15,200 --> 00:53:18,319 to 63 or 50 to 90. 
1461 00:53:18,319 --> 00:53:20,800 so the range maximum marks minus minimum marks 1462 00:53:20,800 --> 00:53:23,200 we have 90 to 45 and the spread of that 1463 00:53:23,200 --> 00:53:26,559 is 45 90 minus 45. and then we have the 1464 00:53:26,559 --> 00:53:28,400 interquartile range 1465 00:53:28,400 --> 00:53:30,480 using the same marks over there you can 1466 00:53:30,480 --> 00:53:32,400 see here where the median is 1467 00:53:32,400 --> 00:53:34,960 and then there's the first quartile the 1468 00:53:34,960 --> 00:53:36,960 second quartile and the third quartile 1469 00:53:36,960 --> 00:53:38,640 based on splitting it apart by those 1470 00:53:38,640 --> 00:53:39,839 values 1471 00:53:39,839 --> 00:53:41,280 and to understand the variance and 1472 00:53:41,280 --> 00:53:43,440 standard deviation we first need to find 1473 00:53:43,440 --> 00:53:46,079 out the mean uh so here's our our you 1474 00:53:46,079 --> 00:53:48,319 know calculating the average there we 1475 00:53:48,319 --> 00:53:50,400 end up at approximately 66 for the 1476 00:53:50,400 --> 00:53:52,319 average and then we look at the 1477 00:53:52,319 --> 00:53:54,160 variance once we know the mean we can 1478 00:53:54,160 --> 00:53:56,240 do variance equals the sum of the marks minus the mean 1479 00:53:56,240 --> 00:53:57,520 squared 1480 00:53:57,520 --> 00:53:59,599 why is it squared 1481 00:53:59,599 --> 00:54:01,680 because one you want to make sure it's 1482 00:54:01,680 --> 00:54:04,079 you don't have like if you if you're 1483 00:54:04,079 --> 00:54:05,760 putting all this stuff together you end 1484 00:54:05,760 --> 00:54:07,839 up with an error as far as one's 1485 00:54:07,839 --> 00:54:09,359 negative one's positive one's a little 1486 00:54:09,359 --> 00:54:11,280 higher one's a little lower 1487 00:54:11,280 --> 00:54:12,720 so you always see 1488 00:54:12,720 --> 00:54:14,880 the squared value and over the total 1489 00:54:14,880 --> 00:54:16,319 observations 1490 00:54:16,319 --> 00:54:18,559 and so the 
standard deviation equals the 1491 00:54:18,559 --> 00:54:20,400 square root of the variance which is 1492 00:54:20,400 --> 00:54:22,640 approximately 16. 1493 00:54:22,640 --> 00:54:24,640 and if you were looking at 1494 00:54:24,640 --> 00:54:26,559 a predictable model you would be looking 1495 00:54:26,559 --> 00:54:29,680 at the deviation based on the error how 1496 00:54:29,680 --> 00:54:31,839 much error does it have 1497 00:54:31,839 --> 00:54:33,200 that's again 1498 00:54:33,200 --> 00:54:35,040 really important to know if your if your 1499 00:54:35,040 --> 00:54:37,119 prediction is predicting something 1500 00:54:37,119 --> 00:54:39,200 what's the chance of it being way off or 1501 00:54:39,200 --> 00:54:42,000 just a little bit off 1502 00:54:42,000 --> 00:54:44,000 now that we've looked at the 1503 00:54:44,000 --> 00:54:46,240 tools as far as some of the basics for 1504 00:54:46,240 --> 00:54:47,839 doing your statistics and we're talking 1505 00:54:47,839 --> 00:54:48,800 about 1506 00:54:48,800 --> 00:54:51,119 let's go ahead and pull up a little demo 1507 00:54:51,119 --> 00:54:52,319 and show you what that looks like in 1508 00:54:52,319 --> 00:54:53,760 python code 1509 00:54:53,760 --> 00:54:55,599 so you can get some little hands on here 1510 00:54:55,599 --> 00:54:57,520 for that let's go back into our jupiter 1511 00:54:57,520 --> 00:55:00,079 notebook and python now almost all of 1512 00:55:00,079 --> 00:55:02,880 this you can do in numpy last time we 1513 00:55:02,880 --> 00:55:05,040 worked in numpy this time we're going to 1514 00:55:05,040 --> 00:55:06,880 go ahead and use pandas 1515 00:55:06,880 --> 00:55:10,079 and if you remember from pandas on here 1516 00:55:10,079 --> 00:55:12,960 this is basically a data frame rows 1517 00:55:12,960 --> 00:55:15,119 columns let's just go ahead and do a 1518 00:55:15,119 --> 00:55:16,079 print 1519 00:55:16,079 --> 00:55:18,800 df.head 1520 00:55:18,800 --> 00:55:21,040 and run that 1521 
00:55:21,040 --> 00:55:23,520 and you can see we have the names jane 1522 00:55:23,520 --> 00:55:25,119 michael william rosie hannah sat in 1523 00:55:25,119 --> 00:55:27,280 their salaries on here and of course 1524 00:55:27,280 --> 00:55:29,280 instead of having to do all those hand 1525 00:55:29,280 --> 00:55:31,280 calculations and add everything together 1526 00:55:31,280 --> 00:55:32,960 and divide by the total 1527 00:55:32,960 --> 00:55:35,440 we can do something very simple on this 1528 00:55:35,440 --> 00:55:36,160 like 1529 00:55:36,160 --> 00:55:39,280 use the command mean in pandas and so if 1530 00:55:39,280 --> 00:55:41,520 i go ahead and do this print of df 1531 00:55:41,520 --> 00:55:43,440 pick our column salary because we want 1532 00:55:43,440 --> 00:55:46,480 to find the mean of that salary 1533 00:55:46,480 --> 00:55:49,359 we want to find the mean of that column 1534 00:55:49,359 --> 00:55:50,960 and we go and print this out and you can 1535 00:55:50,960 --> 00:55:52,480 see that the 1536 00:55:52,480 --> 00:55:56,799 average income on here is 71,000. 
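The mean calculation just described might look like the sketch below. The notebook's actual data is not shown in the transcript, so the salary values here are hypothetical, chosen only so that the result matches the quoted average of 71,000; the column names `Name` and `Salary` and the last two names are also assumptions.

```python
import pandas as pd

# Hypothetical data: the first five names are read out in the transcript,
# the last two names and all salary values are placeholders picked to
# reproduce the figures quoted in the demo.
df = pd.DataFrame({
    "Name": ["Jane", "Michael", "William", "Rosie", "Hannah", "PersonF", "PersonG"],
    "Salary": [50000, 54000, 50000, 189000, 55000, 40000, 59000],
})

# Show the first five rows, as in the demo.
print(df.head())

# Mean of the Salary column: sum of the values divided by their count.
mean_salary = df["Salary"].mean()
print(mean_salary)
```

With these placeholder values the sum is 497,000 over 7 rows, so the mean comes out to 71,000.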
1537 00:55:56,799 --> 00:55:58,000 and let's just go ahead and do this 1538 00:55:58,000 --> 00:55:59,680 we'll go ahead and put in 1539 00:55:59,680 --> 00:56:02,680 means 1540 00:56:03,280 --> 00:56:04,720 and if we're going to do that we also 1541 00:56:04,720 --> 00:56:08,559 might want to find the median 1542 00:56:09,040 --> 00:56:11,440 and the median is 1543 00:56:11,440 --> 00:56:13,359 very similar 1544 00:56:13,359 --> 00:56:15,599 except it actually is just median we're 1545 00:56:15,599 --> 00:56:17,440 used to means in average it's kind of 1546 00:56:17,440 --> 00:56:18,720 interesting that those are the use of 1547 00:56:18,720 --> 00:56:20,160 two different words 1548 00:56:20,160 --> 00:56:23,359 uh there can be in some computation 1549 00:56:23,359 --> 00:56:25,440 slight differences but for the most part 1550 00:56:25,440 --> 00:56:27,920 the means is the average uh and then the 1551 00:56:27,920 --> 00:56:29,119 median 1552 00:56:29,119 --> 00:56:31,839 oops let's put a 1553 00:56:32,880 --> 00:56:34,319 median here 1554 00:56:34,319 --> 00:56:35,920 do you have salary that way it displays 1555 00:56:35,920 --> 00:56:38,160 a little better we can see the median is 1556 00:56:38,160 --> 00:56:39,839 54 1557 00:56:39,839 --> 00:56:42,400 000 so the halfway mark is significantly 1558 00:56:42,400 --> 00:56:44,799 below the average why because we have 1559 00:56:44,799 --> 00:56:48,000 somebody in here makes 189 000. 
darn you 1560 00:56:48,000 --> 00:56:50,480 rosie for throwing off our numbers 1561 00:56:50,480 --> 00:56:51,359 but that's something you'd want to 1562 00:56:51,359 --> 00:56:53,520 notice this is this is the difference 1563 00:56:53,520 --> 00:56:56,079 between these is huge and so is what is 1564 00:56:56,079 --> 00:56:57,280 the meaning behind that when you're 1565 00:56:57,280 --> 00:56:59,359 studying a populace and looking at 1566 00:56:59,359 --> 00:57:01,520 the different data coming in and of 1567 00:57:01,520 --> 00:57:03,280 course we also want to find out hey 1568 00:57:03,280 --> 00:57:04,480 what's the most 1569 00:57:04,480 --> 00:57:05,920 common 1570 00:57:05,920 --> 00:57:08,000 income that people make 1571 00:57:08,000 --> 00:57:10,000 in this little tiny sample and so we'll 1572 00:57:10,000 --> 00:57:12,480 go ahead and do the mode and you can see 1573 00:57:12,480 --> 00:57:14,160 here with the mode 1574 00:57:14,160 --> 00:57:16,720 it's at 50 000. 1575 00:57:16,720 --> 00:57:18,640 so this is this is very telling that 1576 00:57:18,640 --> 00:57:21,040 most people are making 50 000 1577 00:57:21,040 --> 00:57:24,160 the middle point is at 54 000. 
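To see the three measures of central tendency side by side, here is a minimal sketch. The salary values are hypothetical, chosen only to reproduce the figures quoted in the demo (mean 71,000, median 54,000, mode 50,000, maximum 189,000); the real notebook data is not shown in the transcript.

```python
import pandas as pd

# Hypothetical salaries picked to match the numbers quoted in the demo.
salary = pd.Series([50000, 54000, 50000, 189000, 55000, 40000, 59000], name="Salary")

mean_salary = salary.mean()      # pulled upward by the single 189,000 outlier
median_salary = salary.median()  # middle value once the salaries are sorted
# mode() returns a Series because several values can tie for most common;
# index 0 takes the first of them.
mode_salary = salary.mode()[0]

print(mean_salary, median_salary, mode_salary)

# A mean well above the median is a common sign of a right-skewed distribution.
print(mean_salary > median_salary > mode_salary)
```

The gap between the mean and the median is the thing to notice: a single large value moves the mean a lot and the median hardly at all.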
so half 1578 00:57:24,160 --> 00:57:26,240 the people are making more than that 1579 00:57:26,240 --> 00:57:28,960 what that tells me is that if the most 1580 00:57:28,960 --> 00:57:31,359 common income is below the 1581 00:57:31,359 --> 00:57:32,559 median 1582 00:57:32,559 --> 00:57:33,839 then 1583 00:57:33,839 --> 00:57:35,599 there's a skew there are 1584 00:57:35,599 --> 00:57:37,599 some high salaries pulling things up but 1585 00:57:37,599 --> 00:57:39,520 there's some really low salaries in 1586 00:57:39,520 --> 00:57:42,079 there and so this trend is very 1587 00:57:42,079 --> 00:57:44,319 common in statistics when you're 1588 00:57:44,319 --> 00:57:45,599 analyzing 1589 00:57:45,599 --> 00:57:48,000 the economy and different people's incomes 1590 00:57:48,000 --> 00:57:49,680 and how big the 1591 00:57:49,680 --> 00:57:51,839 difference between these is is also 1592 00:57:51,839 --> 00:57:53,119 very important when we're studying 1593 00:57:53,119 --> 00:57:55,040 statistics 1594 00:57:55,040 --> 00:57:56,480 and when you hear someone just say hey 1595 00:57:56,480 --> 00:57:58,960 the average income was you might start 1596 00:57:58,960 --> 00:58:00,559 asking questions at that point why 1597 00:58:00,559 --> 00:58:02,000 aren't you talking about the median 1598 00:58:02,000 --> 00:58:03,599 income why aren't you talking about the 1599 00:58:03,599 --> 00:58:05,920 mode the most common income what are you 1600 00:58:05,920 --> 00:58:07,359 hiding 1601 00:58:07,359 --> 00:58:08,799 and if you're doing these analyses you 1602 00:58:08,799 --> 00:58:10,240 should be looking at these saying hey 1603 00:58:10,240 --> 00:58:11,760 why are there these discrepancies why are 1604 00:58:11,760 --> 00:58:13,520 these so different and of course with 1605 00:58:13,520 --> 00:58:15,920 any analysis it's important to find out 1606 00:58:15,920 --> 00:58:17,760 the minimum 1607 00:58:17,760 --> 00:58:20,240 and the maximum so we'll go ahead 
it's 1608 00:58:20,240 --> 00:58:22,880 just simply uh 1609 00:58:22,880 --> 00:58:24,240 dot min 1610 00:58:24,240 --> 00:58:26,559 it'll pull up your minimum and then dot 1611 00:58:26,559 --> 00:58:28,640 max pulls up the maximum 1612 00:58:28,640 --> 00:58:32,559 pretty straightforward on as far as um 1613 00:58:32,559 --> 00:58:34,559 translating it and knowing which you 1614 00:58:34,559 --> 00:58:36,480 know put the your lowest value which 1615 00:58:36,480 --> 00:58:38,480 your highest value is here 1616 00:58:38,480 --> 00:58:40,640 um which you'll use to generate like a 1617 00:58:40,640 --> 00:58:43,359 spread later on and real quick on no 1618 00:58:43,359 --> 00:58:46,160 mode uh note that it puts mode zero like 1619 00:58:46,160 --> 00:58:47,440 i said there's a couple different ways 1620 00:58:47,440 --> 00:58:50,079 you can compute the mode 1621 00:58:50,079 --> 00:58:51,920 although the standard one is pretty good 1622 00:58:51,920 --> 00:58:53,839 we can of course do the range 1623 00:58:53,839 --> 00:58:56,480 which is your max minus your min so now 1624 00:58:56,480 --> 00:58:59,760 we have a range of 149 000 between the 1625 00:58:59,760 --> 00:59:02,240 upper end and the lower end and you 1626 00:59:02,240 --> 00:59:03,280 might want to be looking up the 1627 00:59:03,280 --> 00:59:05,760 individual values on all of these but it 1628 00:59:05,760 --> 00:59:09,839 turns out there is a describe 1629 00:59:09,839 --> 00:59:12,400 feature in pandas 1630 00:59:12,400 --> 00:59:14,559 and so in pandas we can actually do df 1631 00:59:14,559 --> 00:59:16,960 salary describe and if we do this you 1632 00:59:16,960 --> 00:59:19,520 can see we have that there's seven uh 1633 00:59:19,520 --> 00:59:21,760 setups here's our mean 1634 00:59:21,760 --> 00:59:23,520 um our standard deviation which we 1635 00:59:23,520 --> 00:59:25,200 didn't compute yet which would just be a 1636 00:59:25,200 --> 00:59:26,799 dot std 1637 00:59:26,799 --> 00:59:27,839 and 
you gotta be a little careful 1638 00:59:27,839 --> 00:59:29,599 because when it computes it it looks for 1639 00:59:29,599 --> 00:59:31,520 axes and things like that 1640 00:59:31,520 --> 00:59:33,280 we have our minimum value and here's our 1641 00:59:33,280 --> 00:59:35,520 quartiles 1642 00:59:35,520 --> 00:59:37,200 our maximum value and then of course the 1643 00:59:37,200 --> 00:59:38,720 name salary 1644 00:59:38,720 --> 00:59:40,079 so these are the these are the basic 1645 00:59:40,079 --> 00:59:41,599 statistics you can pull them up and like 1646 00:59:41,599 --> 00:59:43,119 just describe 1647 00:59:43,119 --> 00:59:45,520 this is a dictionary so i could actually 1648 00:59:45,520 --> 00:59:48,079 do something like 1649 00:59:48,079 --> 00:59:51,040 in here i could actually go uh count 1650 00:59:51,040 --> 00:59:52,319 and run 1651 00:59:52,319 --> 00:59:54,799 and now it just prints the count 1652 00:59:54,799 --> 00:59:56,319 so because this is a dictionary you can 1653 00:59:56,319 --> 00:59:59,119 pull any one of these values out of here 1654 00:59:59,119 --> 01:00:00,720 it's kind of a quick and dirty way to 1655 01:00:00,720 --> 01:00:02,640 pull all the different information and 1656 01:00:02,640 --> 01:00:04,160 then split it up and depending on what 1657 01:00:04,160 --> 01:00:05,200 you need 1658 01:00:05,200 --> 01:00:06,960 now if i just walked in and gave you 1659 01:00:06,960 --> 01:00:08,640 this information 1660 01:00:08,640 --> 01:00:10,160 in a meeting 1661 01:00:10,160 --> 01:00:12,240 at some point you would just kind of 1662 01:00:12,240 --> 01:00:14,799 fall asleep that's what i would do 1663 01:00:14,799 --> 01:00:15,760 anyway 1664 01:00:15,760 --> 01:00:16,720 um 1665 01:00:16,720 --> 01:00:18,400 so we want to go ahead and see about 1666 01:00:18,400 --> 01:00:20,079 graphing it here and we'll go ahead and 1667 01:00:20,079 --> 01:00:22,799 put it into a histogram and plot that 1668 01:00:22,799 --> 01:00:24,400 graph on it 
1669 01:00:24,400 --> 01:00:26,079 of the salaries and let's just go ahead 1670 01:00:26,079 --> 01:00:28,480 and put that in here so 1671 01:00:28,480 --> 01:00:30,400 we do our %matplotlib inline remember 1672 01:00:30,400 --> 01:00:32,400 that's a jupyter notebook thing 1673 01:00:32,400 --> 01:00:34,319 a lot of the newer versions of 1674 01:00:34,319 --> 01:00:36,640 matplotlib do it automatically 1675 01:00:36,640 --> 01:00:38,240 but just in case i always put it in 1676 01:00:38,240 --> 01:00:40,720 there import matplotlib.pyplot as 1677 01:00:40,720 --> 01:00:43,920 plt that's my plotting 1678 01:00:43,920 --> 01:00:46,079 and then we have our data frame 1679 01:00:46,079 --> 01:00:47,440 i guess i really don't need to respell 1680 01:00:47,440 --> 01:00:48,720 the data frame 1681 01:00:48,720 --> 01:00:50,160 maybe we could just remind ourselves 1682 01:00:50,160 --> 01:00:52,000 what's in it so we'll go ahead and just 1683 01:00:52,000 --> 01:00:53,680 print 1684 01:00:53,680 --> 01:00:54,720 df 1685 01:00:54,720 --> 01:00:56,400 that way we still have it 1686 01:00:56,400 --> 01:00:58,799 and then we have salary equals df salary 1687 01:00:58,799 --> 01:01:01,040 salary.plot.hist title salary 1688 01:01:01,040 --> 01:01:03,599 distribution color gray 1689 01:01:03,599 --> 01:01:07,520 uh plt.axvline at the salary mean value 1690 01:01:07,520 --> 01:01:10,240 so we're going to take the mean value 1691 01:01:10,240 --> 01:01:11,760 color violet 1692 01:01:11,760 --> 01:01:14,079 linestyle dashed this is just all making 1693 01:01:14,079 --> 01:01:15,200 it pretty 1694 01:01:15,200 --> 01:01:18,079 uh what color dashed line linewidth of 1695 01:01:18,079 --> 01:01:19,839 2 that kind of thing 1696 01:01:19,839 --> 01:01:21,599 and the median and let's go ahead and 1697 01:01:21,599 --> 01:01:22,799 run this just so you can see what we're 1698 01:01:22,799 --> 01:01:25,359 talking about 1699 01:01:25,359 --> 01:01:29,119 and so up here 
we are taking on our plot 1700 01:01:29,119 --> 01:01:31,680 um so here's the data here's our our 1701 01:01:31,680 --> 01:01:33,280 data frame print it out so you can see 1702 01:01:33,280 --> 01:01:35,200 it with the salaries we'll look at the 1703 01:01:35,200 --> 01:01:37,520 salary distribution and just look at 1704 01:01:37,520 --> 01:01:41,200 this the way the salary is distributed 1705 01:01:41,200 --> 01:01:42,480 you have our 1706 01:01:42,480 --> 01:01:44,160 in this case we did 1707 01:01:44,160 --> 01:01:47,599 let's see we had red for the median 1708 01:01:47,599 --> 01:01:49,760 we have violet 1709 01:01:49,760 --> 01:01:52,640 for our average or mean 1710 01:01:52,640 --> 01:01:54,880 and you can just see how it really 1711 01:01:54,880 --> 01:01:56,480 i mean here's our outlier here's our 1712 01:01:56,480 --> 01:01:58,799 person who makes a lot of money here's 1713 01:01:58,799 --> 01:01:59,599 the 1714 01:01:59,599 --> 01:02:02,559 average and here's the median 1715 01:02:02,559 --> 01:02:04,000 and so as you look at this you can say 1716 01:02:04,000 --> 01:02:05,119 wow 1717 01:02:05,119 --> 01:02:06,720 based on the average it really doesn't 1718 01:02:06,720 --> 01:02:07,920 tell you much about what people are 1719 01:02:07,920 --> 01:02:09,839 really taking home all it does is tell 1720 01:02:09,839 --> 01:02:10,559 you 1721 01:02:10,559 --> 01:02:12,720 how much money is in this you know what 1722 01:02:12,720 --> 01:02:14,880 the average salary is 1723 01:02:14,880 --> 01:02:16,000 so 1724 01:02:16,000 --> 01:02:17,520 some of the things you want to take away 1725 01:02:17,520 --> 01:02:20,160 in addition to this is that it's very 1726 01:02:20,160 --> 01:02:22,720 easy to plot 1727 01:02:22,720 --> 01:02:24,640 an ax v line 1728 01:02:24,640 --> 01:02:26,240 these are these up and down lines for 1729 01:02:26,240 --> 01:02:28,720 your markers 1730 01:02:28,720 --> 01:02:30,480 and as you just display the data i mean 1731 01:02:30,480 --> 
01:02:31,839 you can add all kinds of things to this 1732 01:02:31,839 --> 01:02:33,680 and get really complicated keeping it 1733 01:02:33,680 --> 01:02:35,119 simple is pretty straightforward i look 1734 01:02:35,119 --> 01:02:36,960 at this and i can see we have a major 1735 01:02:36,960 --> 01:02:38,880 outlier out here we can definitely do a 1736 01:02:38,880 --> 01:02:41,280 histogram and stuff like that 1737 01:02:41,280 --> 01:02:42,799 but you know pictures worth a thousand 1738 01:02:42,799 --> 01:02:43,680 words 1739 01:02:43,680 --> 01:02:44,799 what you really want to make sure you 1740 01:02:44,799 --> 01:02:47,520 take away is that we can do a basic 1741 01:02:47,520 --> 01:02:48,960 describe 1742 01:02:48,960 --> 01:02:51,200 which pulls all this information out and 1743 01:02:51,200 --> 01:02:52,880 we can print any of the individual 1744 01:02:52,880 --> 01:02:55,200 information from the describe 1745 01:02:55,200 --> 01:02:58,400 because this is a dictionary 1746 01:02:58,880 --> 01:03:00,720 and so if we want to go ahead and look 1747 01:03:00,720 --> 01:03:02,559 up 1748 01:03:02,559 --> 01:03:04,559 the mean value we can also do describe 1749 01:03:04,559 --> 01:03:05,760 mean so if you're doing a lot of 1750 01:03:05,760 --> 01:03:07,359 statistics 1751 01:03:07,359 --> 01:03:10,160 being able to 1752 01:03:10,240 --> 01:03:11,440 doesn't have the print on there so it's 1753 01:03:11,440 --> 01:03:13,280 only going to print the last one which 1754 01:03:13,280 --> 01:03:14,960 happens to be the mean 1755 01:03:14,960 --> 01:03:16,559 you can very easily reference any one of 1756 01:03:16,559 --> 01:03:18,720 these and then you can also if you're 1757 01:03:18,720 --> 01:03:19,760 doing something a little bit more 1758 01:03:19,760 --> 01:03:21,520 complicated and you don't need just the 1759 01:03:21,520 --> 01:03:24,559 basics you can come through and pull any 1760 01:03:24,559 --> 01:03:27,760 one of the individual 1761 01:03:28,240 --> 
01:03:30,960 references from the from the pandas on 1762 01:03:30,960 --> 01:03:33,680 here so now we've had a chance to 1763 01:03:33,680 --> 01:03:35,839 describe our data 1764 01:03:35,839 --> 01:03:39,039 let's get into inferential statistics 1765 01:03:39,039 --> 01:03:40,799 inferential statistics allows you to 1766 01:03:40,799 --> 01:03:44,880 make predictions or inferences from data 1767 01:03:44,880 --> 01:03:46,319 and you can see here we have a nice 1768 01:03:46,319 --> 01:03:49,760 little picture movie ratings and 1769 01:03:49,760 --> 01:03:52,000 if we took this group of people and said 1770 01:03:52,000 --> 01:03:53,520 hey how many people like the movie 1771 01:03:53,520 --> 01:03:55,920 dislike it can't say and then you ask 1772 01:03:55,920 --> 01:03:57,520 just a random person who comes out of 1773 01:03:57,520 --> 01:04:00,079 the movie who hasn't been in this study 1774 01:04:00,079 --> 01:04:03,039 you can infer that 55 percent chance of 1775 01:04:03,039 --> 01:04:04,400 saying liked 1776 01:04:04,400 --> 01:04:07,440 35 chance of saying disliked or a 10 or 1777 01:04:07,440 --> 01:04:09,760 11 chance of can't say 1778 01:04:09,760 --> 01:04:11,520 so that's real basics of what we're 1779 01:04:11,520 --> 01:04:13,200 talking about is you're going to infer 1780 01:04:13,200 --> 01:04:14,799 that the next person is going to follow 1781 01:04:14,799 --> 01:04:17,760 these statistics 1782 01:04:18,319 --> 01:04:20,960 so let's look at point estimation 1783 01:04:20,960 --> 01:04:22,480 it is a process of finding an 1784 01:04:22,480 --> 01:04:24,559 approximate value for a population's 1785 01:04:24,559 --> 01:04:26,720 parameter like mean 1786 01:04:26,720 --> 01:04:28,640 or average from random samples of the 1787 01:04:28,640 --> 01:04:30,799 population let's take an example of 1788 01:04:30,799 --> 01:04:33,440 testing vaccines for covid19 1789 01:04:33,440 --> 01:04:35,680 vaccines and flu bugs all that it's a 1790 01:04:35,680 --> 
01:04:37,200 pretty big thing of how do you test 1791 01:04:37,200 --> 01:04:38,559 these out and make sure they're going to 1792 01:04:38,559 --> 01:04:40,400 work on the populace 1793 01:04:40,400 --> 01:04:42,319 a group of people are chosen from the 1794 01:04:42,319 --> 01:04:45,200 population medical trials are performed 1795 01:04:45,200 --> 01:04:47,200 results are generalized for the whole 1796 01:04:47,200 --> 01:04:49,839 population so here's a protected there's 1797 01:04:49,839 --> 01:04:51,280 our small group up here where we've 1798 01:04:51,280 --> 01:04:53,680 selected them we run medical trials on 1799 01:04:53,680 --> 01:04:55,200 them and then the results work for the 1800 01:04:55,200 --> 01:04:56,559 population 1801 01:04:56,559 --> 01:04:58,160 nice diagram with the arrows going back 1802 01:04:58,160 --> 01:05:00,240 and forth and the very scary coveted 1803 01:05:00,240 --> 01:05:01,920 virus in the middle of one 1804 01:05:01,920 --> 01:05:03,280 and let's take a look at the 1805 01:05:03,280 --> 01:05:06,880 applications of inferential statistics 1806 01:05:06,880 --> 01:05:08,480 very central is what they call 1807 01:05:08,480 --> 01:05:11,200 hypotheses testing 1808 01:05:11,200 --> 01:05:13,280 and the confidence interval which go 1809 01:05:13,280 --> 01:05:17,440 with that and then as we get into 1810 01:05:17,440 --> 01:05:19,920 probability we get into our binomial 1811 01:05:19,920 --> 01:05:22,000 theorem our normal distribution in 1812 01:05:22,000 --> 01:05:24,000 central limit theorem 1813 01:05:24,000 --> 01:05:26,640 hypothesis testing hypothesis testing is 1814 01:05:26,640 --> 01:05:28,880 used to measure the plausibility of a 1815 01:05:28,880 --> 01:05:30,319 hypothesis 1816 01:05:30,319 --> 01:05:33,440 assumption by using sample data 1817 01:05:33,440 --> 01:05:34,319 now 1818 01:05:34,319 --> 01:05:36,880 when we talk about theorem's 1819 01:05:36,880 --> 01:05:38,119 theory 1820 01:05:38,119 --> 01:05:41,760 
hypothesis uh keep in mind that if you 1821 01:05:41,760 --> 01:05:43,920 are in a philosophy class 1822 01:05:43,920 --> 01:05:44,880 theory 1823 01:05:44,880 --> 01:05:48,240 is the same as hypothesis where theorem 1824 01:05:48,240 --> 01:05:50,720 is a scientific uh statement that is 1825 01:05:50,720 --> 01:05:52,880 something that has been proven 1826 01:05:52,880 --> 01:05:54,640 although it is always up for debate 1827 01:05:54,640 --> 01:05:56,160 because in science we always want to 1828 01:05:56,160 --> 01:05:57,760 make sure things are up to debate so 1829 01:05:57,760 --> 01:06:00,240 hypothesis is the same as a 1830 01:06:00,240 --> 01:06:02,480 philosophical class calling a theory 1831 01:06:02,480 --> 01:06:04,880 where theory in science is not the same 1832 01:06:04,880 --> 01:06:06,559 theory in science says this has been 1833 01:06:06,559 --> 01:06:09,680 well proven gravity is a theory uh so if 1834 01:06:09,680 --> 01:06:11,520 you want to debate the theory of gravity 1835 01:06:11,520 --> 01:06:13,760 try jumping up and down if you want to 1836 01:06:13,760 --> 01:06:16,400 have a theory about why the economy is 1837 01:06:16,400 --> 01:06:18,400 collapsing in your area 1838 01:06:18,400 --> 01:06:20,720 that is a philosophical debate 1839 01:06:20,720 --> 01:06:22,559 very important i've heard people mix 1840 01:06:22,559 --> 01:06:25,280 those up and it is a pet peeve of mine 1841 01:06:25,280 --> 01:06:27,599 when we talk about hypotheses testing 1842 01:06:27,599 --> 01:06:29,599 the steps involved in hypotheses testing 1843 01:06:29,599 --> 01:06:32,240 is first we formulate a hypothesis 1844 01:06:32,240 --> 01:06:34,160 we figure out the right test to test our 1845 01:06:34,160 --> 01:06:36,960 hypothesis we execute the test and we 1846 01:06:36,960 --> 01:06:39,280 make a decision and so when you're 1847 01:06:39,280 --> 01:06:40,960 talking about hypothesis you're usually 1848 01:06:40,960 --> 01:06:43,359 trying to disprove it if you 
can't 1849 01:06:43,359 --> 01:06:44,799 disprove it 1850 01:06:44,799 --> 01:06:46,799 and it works for all the facts then you 1851 01:06:46,799 --> 01:06:49,680 might call that a theorem at some point 1852 01:06:49,680 --> 01:06:51,839 so in a use case uh let's consider an 1853 01:06:51,839 --> 01:06:53,920 example four students were given 1854 01:06:53,920 --> 01:06:56,240 a task to clean a room every day 1855 01:06:56,240 --> 01:06:58,079 sounds like working with my kids they 1856 01:06:58,079 --> 01:06:59,520 decided to distribute the job of 1857 01:06:59,520 --> 01:07:01,680 cleaning the room among themselves they 1858 01:07:01,680 --> 01:07:04,000 did so by making four chits which have 1859 01:07:04,000 --> 01:07:05,920 their names on them and the name that gets 1860 01:07:05,920 --> 01:07:07,760 picked up has to do the cleaning for 1861 01:07:07,760 --> 01:07:08,960 that day 1862 01:07:08,960 --> 01:07:10,960 rob took the opportunity to make chits 1863 01:07:10,960 --> 01:07:13,039 and wrote everyone's name on them so 1864 01:07:13,039 --> 01:07:15,839 here's our four people nick rob 1865 01:07:15,839 --> 01:07:19,200 emilia and summer 1866 01:07:19,200 --> 01:07:21,520 now nick emilia and summer are asking us 1867 01:07:21,520 --> 01:07:23,680 to decide whether rob has done some 1868 01:07:23,680 --> 01:07:26,240 mischief in preparing the chits i.e. 1869 01:07:26,240 --> 01:07:27,920 whether rob has written his name on one 1870 01:07:27,920 --> 01:07:29,039 of the chits 1871 01:07:29,039 --> 01:07:30,400 for that we will find out the 1872 01:07:30,400 --> 01:07:32,240 probability of rob not getting the cleaning 1873 01:07:32,240 --> 01:07:34,480 job on first day second day third day 1874 01:07:34,480 --> 01:07:36,640 and so on till 12 days 1875 01:07:36,640 --> 01:07:38,720 the probability of rob avoiding the job that many days in a row 1876 01:07:38,720 --> 01:07:41,920 decreases every day so if his turn never 1877 01:07:41,920 --> 01:07:43,920 comes up then definitely he has 
done 1878 01:07:43,920 --> 01:07:46,480 some mischief while making the chits 1879 01:07:46,480 --> 01:07:48,640 so the probability of rob not doing work 1880 01:07:48,640 --> 01:07:51,039 on day one is uh three out of four 1881 01:07:51,039 --> 01:07:53,359 there's a 0.75 chance that he didn't do 1882 01:07:53,359 --> 01:07:54,240 work 1883 01:07:54,240 --> 01:07:56,559 uh two days three fourths times three 1884 01:07:56,559 --> 01:07:59,280 fourths equals 0.56 1885 01:07:59,280 --> 01:08:00,799 three days you have three fourths times three 1886 01:08:00,799 --> 01:08:03,920 fourths times three fourths which equals 0.42 1887 01:08:03,920 --> 01:08:07,359 when you get to day 12 it's 0.032 which 1888 01:08:07,359 --> 01:08:09,599 is less than 0.05 1889 01:08:09,599 --> 01:08:12,480 remember this 0.05 that comes up a lot 1890 01:08:12,480 --> 01:08:14,640 when we're talking about 1891 01:08:14,640 --> 01:08:16,640 certain values when we're looking at 1892 01:08:16,640 --> 01:08:17,920 statistics 1893 01:08:17,920 --> 01:08:19,920 rob is probably cheating as he wasn't chosen for 1894 01:08:19,920 --> 01:08:22,319 12 consecutive days there's a very low 1895 01:08:22,319 --> 01:08:25,679 probability that with fair chits on day 12 he still 1896 01:08:25,679 --> 01:08:29,040 hasn't gotten the job cleaning the room 1897 01:08:29,040 --> 01:08:31,520 so we come up to our important 1898 01:08:31,520 --> 01:08:33,120 terminologies 1899 01:08:33,120 --> 01:08:36,000 we have null hypothesis 1900 01:08:36,000 --> 01:08:37,520 a general statement that states that 1901 01:08:37,520 --> 01:08:39,520 there is no relationship between two 1902 01:08:39,520 --> 01:08:42,399 measured phenomena or no association 1903 01:08:42,399 --> 01:08:44,238 among the groups 1904 01:08:44,238 --> 01:08:46,399 alternative hypothesis 1905 01:08:46,399 --> 01:08:48,399 contrary to the null hypothesis it 1906 01:08:48,399 --> 01:08:50,880 states whenever something is happening a 1907 01:08:50,880 --> 01:08:52,960 new 
theory is preferred instead of an 1908 01:08:52,960 --> 01:08:55,839 old one and so the two hypotheses go 1909 01:08:55,839 --> 01:08:58,799 hand in hand uh so your null this is 1910 01:08:58,799 --> 01:09:00,399 always interesting in in talking about 1911 01:09:00,399 --> 01:09:02,799 data science and the math behind it 1912 01:09:02,799 --> 01:09:05,040 it's about proving that the things have 1913 01:09:05,040 --> 01:09:06,399 no correlation 1914 01:09:06,399 --> 01:09:08,719 null hypothesis says these two have zero 1915 01:09:08,719 --> 01:09:10,479 relation to each other where the 1916 01:09:10,479 --> 01:09:12,719 alternative hypothesis says hey we found 1917 01:09:12,719 --> 01:09:14,880 a relation this is what it is 1918 01:09:14,880 --> 01:09:17,439 we have p-value the p-value is a 1919 01:09:17,439 --> 01:09:19,839 probability of finding the observed or 1920 01:09:19,839 --> 01:09:21,439 more extreme results when the null 1921 01:09:21,439 --> 01:09:24,799 hypothesis of a study question is true 1922 01:09:24,799 --> 01:09:26,799 and the t value it is simply the 1923 01:09:26,799 --> 01:09:28,640 calculated difference represented in 1924 01:09:28,640 --> 01:09:30,880 units of standard error the greater the 1925 01:09:30,880 --> 01:09:32,880 magnitude of t the greater the evidence 1926 01:09:32,880 --> 01:09:35,359 against the null hypothesis and you can 1927 01:09:35,359 --> 01:09:38,000 look at the t values being specific to 1928 01:09:38,000 --> 01:09:39,679 the test you're doing 1929 01:09:39,679 --> 01:09:42,238 where the p value is derived from your t 1930 01:09:42,238 --> 01:09:44,319 value and you're looking for what they 1931 01:09:44,319 --> 01:09:47,439 call the five percent or the 0.05 1932 01:09:47,439 --> 01:09:49,520 showing that it has a high correlation 1933 01:09:49,520 --> 01:09:50,238 so 1934 01:09:50,238 --> 01:09:52,158 digging in deeper let's assume that a 1935 01:09:52,158 --> 01:09:53,920 new drug is developed with the goal of 1936 
01:09:53,920 --> 01:09:55,679 lowering the blood pressure more than 1937 01:09:55,679 --> 01:09:57,520 the existing drug 1938 01:09:57,520 --> 01:09:59,840 and this is a good one because the null 1939 01:09:59,840 --> 01:10:01,679 hypothesis here isn't that you don't have any 1940 01:10:01,679 --> 01:10:03,199 drug the null hypothesis here is that it's 1941 01:10:03,199 --> 01:10:04,880 not better than the existing drug 1942 01:10:04,880 --> 01:10:06,480 the new drug doesn't lower the blood 1943 01:10:06,480 --> 01:10:09,360 pressure more than the existing drug 1944 01:10:09,360 --> 01:10:11,360 now if we get that 1945 01:10:11,360 --> 01:10:13,679 that says we can't reject our null hypothesis 1946 01:10:13,679 --> 01:10:16,159 there is no correlation and the new drug 1947 01:10:16,159 --> 01:10:18,880 is not doing its job the alternative 1948 01:10:18,880 --> 01:10:21,120 hypothesis the new drug does 1949 01:10:21,120 --> 01:10:22,800 significantly lower the blood pressure 1950 01:10:22,800 --> 01:10:25,600 more than the existing drug uh yay we 1951 01:10:25,600 --> 01:10:27,520 got a new drug out there and that's our 1952 01:10:27,520 --> 01:10:31,520 alternative hypothesis or the h1 or ha 1953 01:10:31,520 --> 01:10:33,920 and we look at the p-value which results from 1954 01:10:33,920 --> 01:10:35,840 the evidence like medical trials showing 1955 01:10:35,840 --> 01:10:37,840 positive results which will reject the 1956 01:10:37,840 --> 01:10:39,600 null hypothesis 1957 01:10:39,600 --> 01:10:41,360 and again they're looking for 1958 01:10:41,360 --> 01:10:44,800 a p-value below 0.05 or 5 percent and the t value 1959 01:10:44,800 --> 01:10:46,880 comparing all the positive test results 1960 01:10:46,880 --> 01:10:48,640 and finding means of different samples 1961 01:10:48,640 --> 01:10:50,800 in order to test the hypothesis 1962 01:10:50,800 --> 01:10:53,520 so this is specific to the test what 1963 01:10:53,520 --> 01:10:56,000 percentage of increase did they have 1964 01:10:56,000 --> 01:10:57,920 
and this leads us to the confidence 1965 01:10:57,920 --> 01:10:59,120 intervals 1966 01:10:59,120 --> 01:11:01,280 a confidence interval is a range of 1967 01:11:01,280 --> 01:11:03,679 values we are fairly sure our true values of 1968 01:11:03,679 --> 01:11:06,159 observations lie in 1969 01:11:06,159 --> 01:11:08,080 let's say you asked the dog owners around 1970 01:11:08,080 --> 01:11:10,719 you how many cans of food 1971 01:11:10,719 --> 01:11:13,199 do you buy per year for your 1972 01:11:13,199 --> 01:11:14,239 dog 1973 01:11:14,239 --> 01:11:16,159 through calculations you got to know 1974 01:11:16,159 --> 01:11:18,880 that on average around 95 percent 1975 01:11:18,880 --> 01:11:21,199 of the people bought around 200 to 300 1976 01:11:21,199 --> 01:11:23,360 cans of food hence we can say that we 1977 01:11:23,360 --> 01:11:26,320 have a confidence interval of 200 to 300 1978 01:11:26,320 --> 01:11:28,880 where 95 percent of our values lie in 1979 01:11:28,880 --> 01:11:31,120 that data spread 1980 01:11:31,120 --> 01:11:33,520 and the graph really helps a lot so 1981 01:11:33,520 --> 01:11:35,199 you can start seeing what you're looking 1982 01:11:35,199 --> 01:11:37,760 at here we have the 95 percent you have 1983 01:11:37,760 --> 01:11:39,600 your peak in this case it's a normal 1984 01:11:39,600 --> 01:11:41,040 distribution so you have a nice bell 1985 01:11:41,040 --> 01:11:42,640 curve equal on both sides it's not 1986 01:11:42,640 --> 01:11:45,840 asymmetrical and 95 percent of all the values 1987 01:11:45,840 --> 01:11:48,080 lie within a very small range and then 1988 01:11:48,080 --> 01:11:50,320 you have your outliers the 2.5 percent 1989 01:11:50,320 --> 01:11:52,159 going each way 1990 01:11:52,159 --> 01:11:54,960 so we touched upon hypothesis uh we're 1991 01:11:54,960 --> 01:11:57,760 going to move into probability so you 1992 01:11:57,760 --> 01:11:58,960 have your hypothesis once you've 1993 01:11:58,960 --> 
01:12:00,400 generated your hypothesis we want to 1994 01:12:00,400 --> 01:12:01,760 know the probability of something 1995 01:12:01,760 --> 01:12:04,000 occurring probability is a measure of 1996 01:12:04,000 --> 01:12:06,640 the likelihood of an event to occur any 1997 01:12:06,640 --> 01:12:08,480 event cannot be predicted with total 1998 01:12:08,480 --> 01:12:09,600 certainty 1999 01:12:09,600 --> 01:12:11,199 it can only be predicted as a 2000 01:12:11,199 --> 01:12:13,679 likelihood of its occurrence so any 2001 01:12:13,679 --> 01:12:15,520 event cannot be predicted with total 2002 01:12:15,520 --> 01:12:17,280 certainty can only be predicted as a 2003 01:12:17,280 --> 01:12:19,679 likelihood of its occurrence 2004 01:12:19,679 --> 01:12:21,360 score prediction how good you're going 2005 01:12:21,360 --> 01:12:23,600 to do in whatever 2006 01:12:23,600 --> 01:12:26,080 sport you're in weather prediction stock 2007 01:12:26,080 --> 01:12:27,360 prediction 2008 01:12:27,360 --> 01:12:29,600 if you've studied physics and chaos 2009 01:12:29,600 --> 01:12:31,760 theory even the location of the chair 2010 01:12:31,760 --> 01:12:33,520 you're sitting on has a probability that 2011 01:12:33,520 --> 01:12:35,600 it might move three feet over 2012 01:12:35,600 --> 01:12:37,840 granted that probability is one in like 2013 01:12:37,840 --> 01:12:40,320 uh i think we calculated as under one in 2014 01:12:40,320 --> 01:12:43,440 trillions upon trillions so it's 2015 01:12:43,440 --> 01:12:44,719 the higher the probability the more 2016 01:12:44,719 --> 01:12:46,000 likely it's going to happen there are 2017 01:12:46,000 --> 01:12:47,120 some things that have such a low 2018 01:12:47,120 --> 01:12:48,400 probability 2019 01:12:48,400 --> 01:12:50,880 that we don't see them so we talk about 2020 01:12:50,880 --> 01:12:53,360 a random variable a random variable is a 2021 01:12:53,360 --> 01:12:54,880 variable whose possible values are 2022 01:12:54,880 --> 01:12:57,920 
numerical outcomes of a random phenomenon 2023 01:12:57,920 --> 01:13:00,960 so uh we have the coin toss how many 2024 01:13:00,960 --> 01:13:02,880 heads will occur in the series of 20 2025 01:13:02,880 --> 01:13:05,360 coin flips probably you know on 2026 01:13:05,360 --> 01:13:07,440 average there are 10 but you really can't 2027 01:13:07,440 --> 01:13:09,679 know because it's very random how many 2028 01:13:09,679 --> 01:13:11,600 times a red ball is picked from a bag of 2029 01:13:11,600 --> 01:13:14,480 balls if there's equal number of red 2030 01:13:14,480 --> 01:13:16,080 balls and blue balls and green balls in 2031 01:13:16,080 --> 01:13:18,159 there how many times the sum of digits 2032 01:13:18,159 --> 01:13:19,760 on two dice 2033 01:13:19,760 --> 01:13:22,960 results in five each 2034 01:13:22,960 --> 01:13:24,320 so you know that's how often you're 2035 01:13:24,320 --> 01:13:27,440 gonna roll two fives on your pair of dice 2036 01:13:27,440 --> 01:13:29,600 so in a use case uh let's consider the 2037 01:13:29,600 --> 01:13:31,280 example of rolling two dice we have a 2038 01:13:31,280 --> 01:13:33,360 random variable outcome equals y which can 2039 01:13:33,360 --> 01:13:35,520 take values two three four five six 2040 01:13:35,520 --> 01:13:38,000 seven eight nine ten eleven twelve 2041 01:13:38,000 --> 01:13:39,920 so we have a random variable and a 2042 01:13:39,920 --> 01:13:41,760 combination of dice 2043 01:13:41,760 --> 01:13:44,640 and instead of looking at how many times 2044 01:13:44,640 --> 01:13:46,320 both dice roll five let's go ahead 2045 01:13:46,320 --> 01:13:48,880 and look at uh total sum of five and you 2046 01:13:48,880 --> 01:13:50,880 have as far as your random variables 2047 01:13:50,880 --> 01:13:53,040 you can have a one four equals five four 2048 01:13:53,040 --> 01:13:55,360 one two three three two 2049 01:13:55,360 --> 01:13:58,320 so four of those rolls can be five if 2050 01:13:58,320 --> 01:13:59,920 you look at 
all the different options 2051 01:13:59,920 --> 01:14:02,000 you have four of those random rolls can 2052 01:14:02,000 --> 01:14:03,920 be a five 2053 01:14:03,920 --> 01:14:07,199 and if we look at the total number 2054 01:14:07,199 --> 01:14:10,320 which happens to be 36 different options 2055 01:14:10,320 --> 01:14:12,880 you can see that we have four out of 36 2056 01:14:12,880 --> 01:14:14,719 chance every time you roll the dice that 2057 01:14:14,719 --> 01:14:16,400 you're gonna roll a total of five you 2058 01:14:16,400 --> 01:14:18,400 can have an outcome of five 2059 01:14:18,400 --> 01:14:21,040 and uh we'll look a little deeper as to 2060 01:14:21,040 --> 01:14:23,120 what that means but you could think of 2061 01:14:23,120 --> 01:14:25,199 that at what point if someone never 2062 01:14:25,199 --> 01:14:27,679 rolls a five or they always roll a five 2063 01:14:27,679 --> 01:14:29,440 can you say hey that person's probably 2064 01:14:29,440 --> 01:14:30,480 cheating 2065 01:14:30,480 --> 01:14:32,080 we'll look a little closer at the math 2066 01:14:32,080 --> 01:14:34,239 behind that but let's just consider this 2067 01:14:34,239 --> 01:14:36,400 is one of the cases is rolling two dice 2068 01:14:36,400 --> 01:14:37,920 and gambling 2069 01:14:37,920 --> 01:14:40,239 there's also binomial distribution is 2070 01:14:40,239 --> 01:14:42,239 the probability of getting success or 2071 01:14:42,239 --> 01:14:44,239 failure as an outcome in an experiment 2072 01:14:44,239 --> 01:14:47,199 or trial that is repeated multiple times 2073 01:14:47,199 --> 01:14:49,840 and the key is is by meaning two 2074 01:14:49,840 --> 01:14:53,040 binomial so passing or failing an exam 2075 01:14:53,040 --> 01:14:55,280 winning or losing a game and getting 2076 01:14:55,280 --> 01:14:57,600 either head or tails so if you ever see 2077 01:14:57,600 --> 01:15:01,120 binomial distribution it's based on a 2078 01:15:01,120 --> 01:15:04,080 true false kind of setup you win or lose 
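The two-dice count described above (four ways to total five out of 36 equally likely rolls) can be checked by brute-force enumeration. This is a minimal sketch, assuming two fair six-sided dice; the names `faces`, `event_space`, `favorable` and `p_sum_five` are illustrative and not taken from the video.

```python
from fractions import Fraction
from itertools import product

# Assumption: two fair six-sided dice, so every ordered pair is equally likely.
faces = range(1, 7)
event_space = list(product(faces, repeat=2))  # all 36 ordered rolls

# Keep only the rolls whose two faces add up to five.
favorable = [roll for roll in event_space if sum(roll) == 5]

# Probability = favorable outcomes / total outcomes, kept as an exact fraction.
p_sum_five = Fraction(len(favorable), len(event_space))

print(favorable)
print(p_sum_five, float(p_sum_five))
```

`Fraction` reduces 4/36 to 1/9, which makes it easy to compare against a hand calculation before converting to a decimal.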
2079 01:15:04,080 --> 01:15:05,760 let's consider a 2080 01:15:05,760 --> 01:15:08,400 use case and let's consider the game of 2081 01:15:08,400 --> 01:15:11,199 football between two clubs 2082 01:15:11,199 --> 01:15:13,920 barcelona and dortmund the teams will 2083 01:15:13,920 --> 01:15:16,000 have to play a total of five matches and 2084 01:15:16,000 --> 01:15:18,080 we have to find out the chances of 2085 01:15:18,080 --> 01:15:20,880 barcelona winning the series so we look 2086 01:15:20,880 --> 01:15:22,960 at the total games we're looking at five 2087 01:15:22,960 --> 01:15:24,960 different games or matches 2088 01:15:24,960 --> 01:15:26,320 let's say that the winning chance for 2089 01:15:26,320 --> 01:15:30,000 barcelona is 75 percent or 0.75 2090 01:15:30,000 --> 01:15:32,400 that means that each game they have a 75 percent 2091 01:15:32,400 --> 01:15:33,600 chance that they're going to win that 2092 01:15:33,600 --> 01:15:36,239 game and losing chances are 25 percent 2093 01:15:36,239 --> 01:15:40,960 or 0.25 clearly 0.75 plus 0.25 equals 1. 
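For the setup above (five matches, a 0.75 chance of winning each one), the chance of any given number of wins follows the standard binomial formula P(X = k) = C(n, k) * p^k * (1 - p)^(n - k). The sketch below assumes the matches are independent and that the win probability stays the same every match; the helper name `binomial_pmf` is made up for illustration.

```python
from math import comb

def binomial_pmf(k: int, n: int, p: float) -> float:
    """Probability of exactly k successes in n independent trials."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)

# Assumed inputs, taken from the example: 5 matches, 0.75 win chance each.
n, p = 5, 0.75

# One probability per possible number of wins, k = 0..5.
pmf = [binomial_pmf(k, n, p) for k in range(n + 1)]
for k, prob in enumerate(pmf):
    print(f"P(X = {k}) = {prob:.4f}")

# Winning the series here means three or more wins out of five.
p_three_or_more = sum(pmf[3:])
print(f"P(X >= 3) = {p_three_or_more:.4f}")
```

Summing the last three entries gives about 0.896, and the six probabilities add up to 1, which is a quick sanity check on the formula.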
2094 01:15:40,960 --> 01:15:43,040 so that accounts for 100 percent of the game 2095 01:15:43,040 --> 01:15:46,239 probability for getting k wins in n 2096 01:15:46,239 --> 01:15:48,560 matches is calculated 2097 01:15:48,560 --> 01:15:50,159 and we're talking like so if you have 2098 01:15:50,159 --> 01:15:52,400 five games and you want to know if i 2099 01:15:52,400 --> 01:15:53,600 play 2100 01:15:53,600 --> 01:15:55,840 how many wins in those five games should 2101 01:15:55,840 --> 01:15:58,159 i get what's a percentage on those and 2102 01:15:58,159 --> 01:16:01,199 the probability for getting k wins in 2103 01:16:01,199 --> 01:16:04,640 n matches is calculated by p of x equals k 2104 01:16:04,640 --> 01:16:08,960 equals n choose k times p to the k times q to the n minus 2105 01:16:08,960 --> 01:16:09,760 k 2106 01:16:09,760 --> 01:16:12,719 here p is the probability of success and 2107 01:16:12,719 --> 01:16:15,120 q is the probability of failure and so 2108 01:16:15,120 --> 01:16:17,679 we can do total games of n equals 5 2109 01:16:17,679 --> 01:16:21,360 where k equals 0 one two three four five 2110 01:16:21,360 --> 01:16:23,520 p which is the chance of winning is 2111 01:16:23,520 --> 01:16:26,159 point seven five q the chance of losing 2112 01:16:26,159 --> 01:16:28,320 equals one minus p 2113 01:16:28,320 --> 01:16:30,239 which equals one minus point seven 2114 01:16:30,239 --> 01:16:32,400 five which equals point two five the 2115 01:16:32,400 --> 01:16:34,320 probability that barcelona will lose all 2116 01:16:34,320 --> 01:16:36,719 of the matches we can then just plug in the 2117 01:16:36,719 --> 01:16:37,840 numbers 2118 01:16:37,840 --> 01:16:42,040 and we end up with a .0009765625 2119 01:16:44,239 --> 01:16:45,679 so very small chance they're going to 2120 01:16:45,679 --> 01:16:48,080 lose all their matches 2121 01:16:48,080 --> 01:16:50,480 and we can plug in the value for two 2122 01:16:50,480 --> 01:16:53,440 matches probability that barcelona will 2123 
01:16:53,440 --> 01:16:57,360 win exactly two matches is 0.0878 and 2124 01:16:57,360 --> 01:16:58,480 of course we can go on to the 2125 01:16:58,480 --> 01:16:59,920 probability that barcelona will win 2126 01:16:59,920 --> 01:17:03,520 three matches is 0.264 and of course four 2127 01:17:03,520 --> 01:17:06,239 matches and so on and it's always nice 2128 01:17:06,239 --> 01:17:08,320 to take this information 2129 01:17:08,320 --> 01:17:10,239 and let's find the accumulated discrete 2130 01:17:10,239 --> 01:17:12,480 probabilities for each of the outcomes 2131 01:17:12,480 --> 01:17:15,120 where barcelona has won three or more 2132 01:17:15,120 --> 01:17:17,440 matches x equals three x equals four x 2133 01:17:17,440 --> 01:17:18,960 equals five 2134 01:17:18,960 --> 01:17:20,960 and we end up with the p equals point 2135 01:17:20,960 --> 01:17:23,120 two six four plus point three nine five 2136 01:17:23,120 --> 01:17:25,199 plus point two three seven which equals point 2137 01:17:25,199 --> 01:17:26,560 eight nine six 2138 01:17:26,560 --> 01:17:28,159 in reality 2139 01:17:28,159 --> 01:17:29,679 the probability of barcelona winning the 2140 01:17:29,679 --> 01:17:32,640 series is much higher than 0.75 2141 01:17:32,640 --> 01:17:35,440 and it's always nice to 2142 01:17:35,440 --> 01:17:37,280 put out a nice graph so you can actually 2143 01:17:37,280 --> 01:17:38,880 see the number of wins to the 2144 01:17:38,880 --> 01:17:41,760 probability and how that pans out with 2145 01:17:41,760 --> 01:17:43,840 our binomial case 2146 01:17:43,840 --> 01:17:46,480 continuing in our important terminology 2147 01:17:46,480 --> 01:17:48,880 location the location of the center of 2148 01:17:48,880 --> 01:17:51,280 the graph depends on the mean 2149 01:17:51,280 --> 01:17:53,840 value and these are some very important 2150 01:17:53,840 --> 01:17:56,480 things so much of the data we look at 2151 01:17:56,480 --> 01:17:57,600 and when you start looking at 2152 01:17:57,600 --> 
01:17:59,440 probabilities almost always has a 2153 01:17:59,440 --> 01:18:01,280 normalized look like the graph in the 2154 01:18:01,280 --> 01:18:02,800 middle 2155 01:18:02,800 --> 01:18:04,880 but you do have left skewed where the 2156 01:18:04,880 --> 01:18:06,480 data is skewed off to the left and you 2157 01:18:06,480 --> 01:18:07,679 have more stuff happening off to the 2158 01:18:07,679 --> 01:18:10,080 left and you have right skewed data 2159 01:18:10,080 --> 01:18:11,600 and so when this comes up and these 2160 01:18:11,600 --> 01:18:13,040 probabilities come up where they're 2161 01:18:13,040 --> 01:18:14,960 skewed it's really important to take a 2162 01:18:14,960 --> 01:18:16,560 closer look at that 2163 01:18:16,560 --> 01:18:18,640 mostly you end up with a normalized set 2164 01:18:18,640 --> 01:18:20,080 of data but you got to also be aware 2165 01:18:20,080 --> 01:18:22,560 that sometimes it's a skewed data 2166 01:18:22,560 --> 01:18:24,560 and then the height height of the slope 2167 01:18:24,560 --> 01:18:26,560 inversely depends upon the standard 2168 01:18:26,560 --> 01:18:28,560 deviation 2169 01:18:28,560 --> 01:18:29,760 so you can see down here the standard 2170 01:18:29,760 --> 01:18:31,360 deviation is really large it kind of 2171 01:18:31,360 --> 01:18:33,440 squishes it out and if the standard 2172 01:18:33,440 --> 01:18:35,679 deviation is small then most of your 2173 01:18:35,679 --> 01:18:36,960 data is going to hit right there in the 2174 01:18:36,960 --> 01:18:39,120 middle you can have a nice peak 2175 01:18:39,120 --> 01:18:40,960 and so being aware of this that you 2176 01:18:40,960 --> 01:18:43,040 might have a probability that fits 2177 01:18:43,040 --> 01:18:44,719 certain data but it has a lot of 2178 01:18:44,719 --> 01:18:46,719 outliers so you're if you have a really 2179 01:18:46,719 --> 01:18:49,199 high standard deviation 2180 01:18:49,199 --> 01:18:52,239 if you're doing stock market analysis 2181 01:18:52,239 --> 
01:18:53,840 this means your predictions are probably 2182 01:18:53,840 --> 01:18:55,760 not going to make you much money 2183 01:18:55,760 --> 01:18:57,600 where if you have a very small deviation 2184 01:18:57,600 --> 01:18:59,600 you might be right on target and set to 2185 01:18:59,600 --> 01:19:01,040 become a millionaire 2186 01:19:01,040 --> 01:19:03,840 which leads us to the z-score z-score 2187 01:19:03,840 --> 01:19:06,320 tells you how far from the mean a data 2188 01:19:06,320 --> 01:19:09,040 point is it is measured in terms of 2189 01:19:09,040 --> 01:19:11,199 standard deviations from the mean around 2190 01:19:11,199 --> 01:19:12,880 68 percent of the results are found 2191 01:19:12,880 --> 01:19:15,360 between one standard deviation 2192 01:19:15,360 --> 01:19:17,280 around 95 percent of the results are 2193 01:19:17,280 --> 01:19:20,239 found between two standard deviations 2194 01:19:20,239 --> 01:19:22,159 and you read the symbols of course they 2195 01:19:22,159 --> 01:19:23,600 love to throw some greek letters in 2196 01:19:23,600 --> 01:19:27,120 there we have mu minus two sigma 2197 01:19:27,120 --> 01:19:29,760 mu is just a quick way it's a kind of 2198 01:19:29,760 --> 01:19:32,719 funky u it just means the mean 2199 01:19:32,719 --> 01:19:34,719 uh and then the sigma is the standard 2200 01:19:34,719 --> 01:19:37,040 deviation and that's the o with the 2201 01:19:37,040 --> 01:19:38,640 little arrow off to the right or the 2202 01:19:38,640 --> 01:19:39,520 little 2203 01:19:39,520 --> 01:19:41,440 wagy tail going up the o with it with 2204 01:19:41,440 --> 01:19:45,440 the line on it uh so mu minus two sigma 2205 01:19:45,440 --> 01:19:46,640 is your 2206 01:19:46,640 --> 01:19:48,640 uh 95 percent of the results are found 2207 01:19:48,640 --> 01:19:51,679 between two standard deviations 2208 01:19:51,679 --> 01:19:53,679 central limit theorem 2209 01:19:53,679 --> 01:19:55,679 this goes back to the skew if you 2210 01:19:55,679 --> 
01:19:57,120 remember we were looking at the skew 2211 01:19:57,120 --> 01:20:00,560 values on this previous slide have left 2212 01:20:00,560 --> 01:20:03,280 skewed normalized and right skewed when 2213 01:20:03,280 --> 01:20:04,880 we're talking about it being skewed or 2214 01:20:04,880 --> 01:20:07,199 not skewed the distribution of the 2215 01:20:07,199 --> 01:20:09,600 sample means will be approximately 2216 01:20:09,600 --> 01:20:11,920 normally distributed evenly distributed 2217 01:20:11,920 --> 01:20:13,120 not skewed 2218 01:20:13,120 --> 01:20:15,679 if you take large random samples from 2219 01:20:15,679 --> 01:20:18,400 the population with the mean mu 2220 01:20:18,400 --> 01:20:21,360 and the standard deviation sigma with 2221 01:20:21,360 --> 01:20:23,040 replacement 2222 01:20:23,040 --> 01:20:24,800 and you can see here 2223 01:20:24,800 --> 01:20:26,719 of course we have our 2224 01:20:26,719 --> 01:20:28,880 mu minus two sigma and the spread down 2225 01:20:28,880 --> 01:20:31,360 here the mean the median and the mode 2226 01:20:31,360 --> 01:20:32,719 and so you're talking about very large 2227 01:20:32,719 --> 01:20:35,040 populations 2228 01:20:35,040 --> 01:20:36,480 these numbers should come together and 2229 01:20:36,480 --> 01:20:38,800 you shouldn't have a skewed value if you 2230 01:20:38,800 --> 01:20:41,520 do that's a flag that something's wrong 2231 01:20:41,520 --> 01:20:43,040 that's why this is so important to be 2232 01:20:43,040 --> 01:20:45,360 aware of what's going on with your data 2233 01:20:45,360 --> 01:20:47,679 where your samples are coming from 2234 01:20:47,679 --> 01:20:49,840 and the math behind it 2235 01:20:49,840 --> 01:20:51,040 and if we're going to do all this we got 2236 01:20:51,040 --> 01:20:54,640 to jump into conditional probability 2237 01:20:54,640 --> 01:20:56,560 the conditional probability of an event 2238 01:20:56,560 --> 01:20:58,960 a is a probability that the event will 2239 01:20:58,960 --> 
01:21:01,440 occur given the knowledge that an event 2240 01:21:01,440 --> 01:21:04,080 b has already occurred 2241 01:21:04,080 --> 01:21:06,320 and you'll see this as bayes theorem 2242 01:21:06,320 --> 01:21:08,960 b-a-y-e-s bayes 2243 01:21:08,960 --> 01:21:10,560 and this is read 2244 01:21:10,560 --> 01:21:11,760 i mean you have these funky looking 2245 01:21:11,760 --> 01:21:12,800 little 2246 01:21:12,800 --> 01:21:13,440 p 2247 01:21:13,440 --> 01:21:15,120 brackets a given b 2248 01:21:15,120 --> 01:21:17,840 this is the probability of a being true 2249 01:21:17,840 --> 01:21:21,040 while b is already true 2250 01:21:21,040 --> 01:21:22,640 and you have the probability of b being 2251 01:21:22,640 --> 01:21:25,280 true when a is already true so p 2252 01:21:25,280 --> 01:21:26,640 of b given a times the 2253 01:21:26,640 --> 01:21:29,280 probability of a being true divided by 2254 01:21:29,280 --> 01:21:31,920 the probability of b being true 2255 01:21:31,920 --> 01:21:33,920 and we talk about bayes theorem which 2256 01:21:33,920 --> 01:21:35,840 goes back to the 1700s when he 2257 01:21:35,840 --> 01:21:37,600 discovered this this is such an 2258 01:21:37,600 --> 01:21:39,760 important formula and it's really it's 2259 01:21:39,760 --> 01:21:41,840 not complicated if you actually do the math you 2260 01:21:41,840 --> 01:21:44,560 could just kind of do 2261 01:21:44,560 --> 01:21:48,080 x y equals j k and then you divide them 2262 01:21:48,080 --> 01:21:49,360 out and you're going to see the same 2263 01:21:49,360 --> 01:21:51,520 math but it works with probabilities 2264 01:21:51,520 --> 01:21:53,360 which makes it really nice 2265 01:21:53,360 --> 01:21:55,600 and so if you have a set you might have 2266 01:21:55,600 --> 01:21:57,920 uh eight or nine different studies going 2267 01:21:57,920 --> 01:22:00,400 on in different areas different people 2268 01:22:00,400 --> 01:22:01,920 have done the studies they brought them 2269 01:22:01,920 --> 01:22:03,679 together 2270 
01:22:03,679 --> 01:22:05,679 if we look at today's covid virus the 2271 01:22:05,679 --> 01:22:07,520 virus spread 2272 01:22:07,520 --> 01:22:09,920 certainly the studies done in china 2273 01:22:09,920 --> 01:22:11,199 versus the studies the way they're done 2274 01:22:11,199 --> 01:22:12,719 in the us 2275 01:22:12,719 --> 01:22:14,800 that data is different in each of those 2276 01:22:14,800 --> 01:22:16,800 studies but if you can find a place 2277 01:22:16,800 --> 01:22:19,199 where it overlaps where they're studying 2278 01:22:19,199 --> 01:22:21,360 the same thing together you can then 2279 01:22:21,360 --> 01:22:23,040 compute the changes that you need to 2280 01:22:23,040 --> 01:22:25,679 make in one study to make them equal 2281 01:22:25,679 --> 01:22:27,920 and this is also true if you have a 2282 01:22:27,920 --> 01:22:29,679 study of 2283 01:22:29,679 --> 01:22:31,280 one group and you want to find out more 2284 01:22:31,280 --> 01:22:33,679 about it so this formula is very 2285 01:22:33,679 --> 01:22:35,840 powerful and it really has to do with 2286 01:22:35,840 --> 01:22:37,679 the data collection part of the math and 2287 01:22:37,679 --> 01:22:40,000 data science and understanding where 2288 01:22:40,000 --> 01:22:41,920 your data is coming from and how you're 2289 01:22:41,920 --> 01:22:44,000 going to combine different studies in 2290 01:22:44,000 --> 01:22:45,840 different groups 2291 01:22:45,840 --> 01:22:47,760 and we're going to go into a use case 2292 01:22:47,760 --> 01:22:49,520 let's find out the chance of a person 2293 01:22:49,520 --> 01:22:52,719 getting lung disease due to smoking 2294 01:22:52,719 --> 01:22:54,080 this is kind of interesting the way they 2295 01:22:54,080 --> 01:22:55,120 word this 2296 01:22:55,120 --> 01:22:56,719 let's say that a medical 2297 01:22:56,719 --> 01:22:59,199 report provided by the hospital states 2298 01:22:59,199 --> 01:23:01,920 that around 10 percent of all patients 2299 01:23:01,920 
--> 01:23:05,280 they treated suffered lung disease 2300 01:23:05,280 --> 01:23:07,440 so we have kind of a generic medical 2301 01:23:07,440 --> 01:23:08,560 report 2302 01:23:08,560 --> 01:23:10,320 they further found out 2303 01:23:10,320 --> 01:23:12,320 by a survey that 15 percent of the 2304 01:23:12,320 --> 01:23:15,360 patients that visit them smoke 2305 01:23:15,360 --> 01:23:16,960 so we have 10 percent that have lung 2306 01:23:16,960 --> 01:23:18,639 disease and 2307 01:23:18,639 --> 01:23:21,040 15 percent of the patients smoke 2308 01:23:21,040 --> 01:23:23,040 and finally five percent of the people 2309 01:23:23,040 --> 01:23:25,840 continued to smoke even when they had lung 2310 01:23:25,840 --> 01:23:29,280 disease uh not the brightest choice um 2311 01:23:29,280 --> 01:23:30,719 but you know it is an addiction so it 2312 01:23:30,719 --> 01:23:32,480 can be really difficult to kick and so 2313 01:23:32,480 --> 01:23:35,520 we can look at the probability of a uh 2314 01:23:35,520 --> 01:23:37,679 prior probability of 10 percent of people having 2315 01:23:37,679 --> 01:23:39,199 lung disease 2316 01:23:39,199 --> 01:23:41,440 and then probability b probability that 2317 01:23:41,440 --> 01:23:45,440 a patient smokes is 15 percent 2318 01:23:45,440 --> 01:23:48,880 uh and the probability of b 2319 01:23:48,880 --> 01:23:51,040 given a the probability that a patient 2320 01:23:51,040 --> 01:23:53,040 smokes even though they have lung 2321 01:23:53,040 --> 01:23:55,520 disease is five percent 2322 01:23:55,520 --> 01:23:57,920 and probability of a given b is the probability 2323 01:23:57,920 --> 01:23:59,760 that the patient will have lung disease 2324 01:23:59,760 --> 01:24:01,600 if they smoke and then when you put the 2325 01:24:01,600 --> 01:24:03,280 formulas together you get a nice 2326 01:24:03,280 --> 01:24:05,199 solution here you get the probability of 2327 01:24:05,199 --> 01:24:07,120 a given b probability that the patient will 2328 01:24:07,120 --> 01:24:09,440 have lung 
disease if they smoke 2329 01:24:09,440 --> 01:24:11,120 and you can just plug the numbers right 2330 01:24:11,120 --> 01:24:14,639 in and we get a 3.33 percent chance 2331 01:24:14,639 --> 01:24:16,880 hence there is a 3.33 percent chance that a 2332 01:24:16,880 --> 01:24:18,719 person who smokes will get a lung 2333 01:24:18,719 --> 01:24:20,239 disease 2334 01:24:20,239 --> 01:24:22,000 so we're going to pull up a little 2335 01:24:22,000 --> 01:24:24,639 python code always my favorite roll 2336 01:24:24,639 --> 01:24:26,159 up the sleeves 2337 01:24:26,159 --> 01:24:28,080 keep in mind we're going to be doing 2338 01:24:28,080 --> 01:24:31,360 this um kind of like the back end way 2339 01:24:31,360 --> 01:24:33,600 so that you can see what's going on and 2340 01:24:33,600 --> 01:24:37,040 then later on we're going to create 2341 01:24:37,040 --> 01:24:39,679 we'll get into another demo which shows 2342 01:24:39,679 --> 01:24:40,880 you some of the tools that are already 2343 01:24:40,880 --> 01:24:42,480 pre-built for this 2344 01:24:42,480 --> 01:24:45,840 let's start by creating a set so we're 2345 01:24:45,840 --> 01:24:48,320 going to create a set with curly braces 2346 01:24:48,320 --> 01:24:51,440 this means that our set has 2347 01:24:51,440 --> 01:24:55,280 only unique values so you have a list 2348 01:24:55,280 --> 01:24:57,199 you have your tuples which can never 2349 01:24:57,199 --> 01:24:59,840 change and then you have 2350 01:24:59,840 --> 01:25:03,360 in this case the set so four seven 2351 01:25:03,360 --> 01:25:05,600 you can't create a four seven comma four 2352 01:25:05,600 --> 01:25:07,520 it'll delete the extra four out so it's only 2353 01:25:07,520 --> 01:25:09,040 unique values 2354 01:25:09,040 --> 01:25:12,320 and if you use dictionaries 2355 01:25:12,320 --> 01:25:14,800 quick reminder this should look familiar 2356 01:25:14,800 --> 01:25:17,280 because it is like a dictionary we have a 2357 01:25:17,280 --> 01:25:20,159 value and that value is 
assigned to or 2358 01:25:20,159 --> 01:25:23,040 that key is assigned to a value 2359 01:25:23,040 --> 01:25:24,719 so you could have a key value set up as 2360 01:25:24,719 --> 01:25:26,800 a dictionary so it's like a dictionary 2361 01:25:26,800 --> 01:25:28,719 without the value it's just the keys and 2362 01:25:28,719 --> 01:25:31,920 they all have to be unique 2363 01:25:31,920 --> 01:25:34,080 and if we run this we have a 2364 01:25:34,080 --> 01:25:37,440 set of four seven 2365 01:25:37,920 --> 01:25:40,960 we can also take a list a regular 2366 01:25:40,960 --> 01:25:42,320 setup and i'm going to go ahead and just 2367 01:25:42,320 --> 01:25:44,639 throw in another number in here four 2368 01:25:44,639 --> 01:25:47,040 and run it uh and you can see here if i 2369 01:25:47,040 --> 01:25:50,000 take my list one two three four four 2370 01:25:50,000 --> 01:25:53,199 and i convert it to a set and here it is 2371 01:25:53,199 --> 01:25:57,199 my set from list equals set my list 2372 01:25:57,199 --> 01:25:58,960 the result is one two three four so it 2373 01:25:58,960 --> 01:26:01,040 just deletes that last four right out of 2374 01:26:01,040 --> 01:26:02,960 there 2375 01:26:02,960 --> 01:26:05,440 and with the sets you can also go in 2376 01:26:05,440 --> 01:26:06,880 there and 2377 01:26:06,880 --> 01:26:09,760 print here is my set my set 2378 01:26:09,760 --> 01:26:12,080 uh three is in the set and then if you 2379 01:26:12,080 --> 01:26:15,440 do three in my set 2380 01:26:15,440 --> 01:26:17,679 that's going to be a logic function 2381 01:26:17,679 --> 01:26:20,719 uh and one in my set six is not in the 2382 01:26:20,719 --> 01:26:24,880 set and so forth if we run this 2383 01:26:24,880 --> 01:26:27,679 we get three is in the set true one is 2384 01:26:27,679 --> 01:26:29,199 in the set false because three five 2385 01:26:29,199 --> 01:26:32,159 seven is another one six is in the set 2386 01:26:32,159 --> 01:26:36,639 six is not in the set so not in my 
set 2387 01:26:36,639 --> 01:26:39,040 you can also use this with the list we 2388 01:26:39,040 --> 01:26:41,120 could have just used three five seven 2389 01:26:41,120 --> 01:26:42,639 and it would have 2390 01:26:42,639 --> 01:26:45,760 the same response on there three in and you 2391 01:26:45,760 --> 01:26:48,080 usually do if three in but three in 2392 01:26:48,080 --> 01:26:50,320 my list still works on just a regular 2393 01:26:50,320 --> 01:26:51,440 list 2394 01:26:51,440 --> 01:26:52,639 then we'll go ahead and do a little 2395 01:26:52,639 --> 01:26:54,800 iteration we're going to do kind of the 2396 01:26:54,800 --> 01:26:56,880 dice one remember 2397 01:26:56,880 --> 01:26:59,280 one two three four five six and so we're 2398 01:26:59,280 --> 01:27:01,600 going to bring in itertools and 2399 01:27:01,600 --> 01:27:04,960 import product as product 2400 01:27:04,960 --> 01:27:06,719 and i'll show you what that means in 2401 01:27:06,719 --> 01:27:09,199 just a second so we have our two dice we 2402 01:27:09,199 --> 01:27:11,040 have dice a 2403 01:27:11,040 --> 01:27:13,440 and it's going to be a set of values 2404 01:27:13,440 --> 01:27:15,040 they can only have one value for each 2405 01:27:15,040 --> 01:27:17,040 one that's why they put it in a set and 2406 01:27:17,040 --> 01:27:20,000 if you remember from range it is up to 2407 01:27:20,000 --> 01:27:21,600 seven so this is going to be one two 2408 01:27:21,600 --> 01:27:24,239 three four five six it will not include 2409 01:27:24,239 --> 01:27:26,639 the seven and the same thing for our 2410 01:27:26,639 --> 01:27:29,199 dice b 2411 01:27:29,199 --> 01:27:30,480 and then what we're gonna do is we're gonna 2412 01:27:30,480 --> 01:27:34,480 create a list which is the product 2413 01:27:34,480 --> 01:27:38,480 of a and b so every pairing of a and b 2414 01:27:38,480 --> 01:27:40,320 and if we go ahead and run this it'll 2415 01:27:40,320 --> 01:27:42,639 print that out and you'll see 2416 01:27:42,639 --> 
01:27:44,239 in this case when they say product 2417 01:27:44,239 --> 01:27:47,760 because it's an iteration tool 2418 01:27:47,760 --> 01:27:49,679 we're talking about creating a tuple of 2419 01:27:49,679 --> 01:27:50,639 the two 2420 01:27:50,639 --> 01:27:52,719 so we've now created a list of tuples of all 2421 01:27:52,719 --> 01:27:55,600 possible outcomes of the dice where dice 2422 01:27:55,600 --> 01:27:58,480 a is one to six and dice b 2423 01:27:58,480 --> 01:28:00,159 is one to six and you can see one one 2424 01:28:00,159 --> 01:28:02,480 one two one three and so forth 2425 01:28:02,480 --> 01:28:03,840 you remember we had a slide on this 2426 01:28:03,840 --> 01:28:06,480 earlier where we talked about 2427 01:28:06,480 --> 01:28:07,920 all the different outcomes 2428 01:28:07,920 --> 01:28:09,280 of the dice 2429 01:28:09,280 --> 01:28:11,040 we can play around with this a little 2430 01:28:11,040 --> 01:28:14,719 bit we can do n dice equals two 2431 01:28:14,719 --> 01:28:18,080 dice faces one two three four five six 2432 01:28:18,080 --> 01:28:19,760 uh another way of doing what we did 2433 01:28:19,760 --> 01:28:21,520 before and then we can create an event 2434 01:28:21,520 --> 01:28:23,840 space where we have a set which is the 2435 01:28:23,840 --> 01:28:25,920 product of the dice faces 2436 01:28:25,920 --> 01:28:27,920 repeat equals n dice and we'll go ahead 2437 01:28:27,920 --> 01:28:29,679 and just run this 2438 01:28:29,679 --> 01:28:32,000 and you can see here it just again puts 2439 01:28:32,000 --> 01:28:33,520 it through all the different possible 2440 01:28:33,520 --> 01:28:35,520 outcomes we can have 2441 01:28:35,520 --> 01:28:37,760 and then if we wanted to take the same 2442 01:28:37,760 --> 01:28:40,639 set on here and print them all out like 2443 01:28:40,639 --> 01:28:42,400 we had before 2444 01:28:42,400 --> 01:28:44,080 we can just go through for outcome in 2445 01:28:44,080 --> 01:28:47,280 
event space print outcome end equals 2446 01:28:47,280 --> 01:28:50,719 so the event space is creating 2447 01:28:50,719 --> 01:28:52,719 a sequence and as you can see here when 2448 01:28:52,719 --> 01:28:55,360 we print it out it stacks them versus 2449 01:28:55,360 --> 01:28:56,719 going through and putting them in a nice 2450 01:28:56,719 --> 01:28:58,880 line 2451 01:28:58,880 --> 01:29:01,120 and we'll go ahead and do something 2452 01:29:01,120 --> 01:29:03,040 let's go print 2453 01:29:03,040 --> 01:29:04,880 since we have the end printing with a 2454 01:29:04,880 --> 01:29:07,520 comma that just means 2455 01:29:07,520 --> 01:29:09,199 it's not gonna hit the return going down 2456 01:29:09,199 --> 01:29:10,960 to the next line 2457 01:29:10,960 --> 01:29:14,400 and we'll go ahead and do the length 2458 01:29:15,120 --> 01:29:17,840 of our event space that'll be an 2459 01:29:17,840 --> 01:29:19,040 important variable we're going to want 2460 01:29:19,040 --> 01:29:21,840 to know in a minute 2461 01:29:22,239 --> 01:29:23,679 and of course if i get carried away with 2462 01:29:23,679 --> 01:29:25,840 my typing of length uh we'll print it 2463 01:29:25,840 --> 01:29:27,840 twice and it'll give me an error 2464 01:29:27,840 --> 01:29:30,480 so we have 36 different possible 2465 01:29:30,480 --> 01:29:33,120 variations here 2466 01:29:33,120 --> 01:29:34,960 and we might want to calculate something 2467 01:29:34,960 --> 01:29:36,239 like 2468 01:29:36,239 --> 01:29:38,159 what about a multiple of three what if 2469 01:29:38,159 --> 01:29:40,000 we want to have 2470 01:29:40,000 --> 01:29:42,639 uh the probability of the sum being a multiple of three 2471 01:29:42,639 --> 01:29:45,440 in our setup 2472 01:29:46,000 --> 01:29:48,159 and so we can put together the code for 2473 01:29:48,159 --> 01:29:50,880 outcome in event space x y 2474 01:29:50,880 --> 01:29:52,400 equals outcome 2475 01:29:52,400 --> 01:29:55,360 if x plus y 2476 01:29:55,360 --> 
01:29:57,199 remainder 3 so we're going to divide by 2477 01:29:57,199 --> 01:29:58,480 3 and look at the remainder and it 2478 01:29:58,480 --> 01:30:01,120 equals 0 2479 01:30:01,120 --> 01:30:02,800 then it's a favorable outcome we're 2480 01:30:02,800 --> 01:30:04,320 going to pop that outcome on the end 2481 01:30:04,320 --> 01:30:06,480 there 2482 01:30:06,480 --> 01:30:08,400 and we'll turn it into a set so the 2483 01:30:08,400 --> 01:30:10,480 favor outcome equals a set 2484 01:30:10,480 --> 01:30:12,000 not necessary 2485 01:30:12,000 --> 01:30:13,600 because we know it's not going to be 2486 01:30:13,600 --> 01:30:15,360 repeating itself but just in case we'll 2487 01:30:15,360 --> 01:30:18,239 go ahead and do that 2488 01:30:19,120 --> 01:30:22,320 and if we want to print out the outcome 2489 01:30:22,320 --> 01:30:23,760 we can go ahead and see what that looks 2490 01:30:23,760 --> 01:30:26,560 like and you can see here these are all 2491 01:30:26,560 --> 01:30:28,960 multiples of three uh one plus two is 2492 01:30:28,960 --> 01:30:30,880 three five plus four is nine which 2493 01:30:30,880 --> 01:30:35,719 divided by three is three and so forth 2494 01:30:35,760 --> 01:30:37,920 and just like we looked up the length of 2495 01:30:37,920 --> 01:30:40,960 the one before let's go ahead and print 2496 01:30:40,960 --> 01:30:42,639 the length 2497 01:30:42,639 --> 01:30:44,000 of our 2498 01:30:44,000 --> 01:30:45,280 f outcome 2499 01:30:45,280 --> 01:30:49,560 so we can see what that looks like 2500 01:30:51,120 --> 01:30:53,360 there we go 2501 01:30:53,360 --> 01:30:55,280 and of course i did forget to add the 2502 01:30:55,280 --> 01:30:56,320 print in the middle because we're 2503 01:30:56,320 --> 01:30:58,080 looping through and putting an end on 2504 01:30:58,080 --> 01:30:59,520 the on the setup on there so we're going 2505 01:30:59,520 --> 01:31:01,600 to put the print in there and if i run 2506 01:31:01,600 --> 01:31:04,400 this you can see 
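The two-dice walkthrough above can be sketched in a few lines of Python. This is a minimal sketch under assumptions, not the notebook shown in the video: the names `dice_faces`, `n_dice`, `event_space` and `favorable_outcomes` are chosen here to mirror the narration, and a set comprehension stands in for the loop-and-append the narrator describes.

```python
from itertools import product

dice_faces = [1, 2, 3, 4, 5, 6]
n_dice = 2  # number of dice rolled together (assumed name)

# Every possible roll: the Cartesian product of the faces with itself.
event_space = set(product(dice_faces, repeat=n_dice))

# Keep only the rolls whose faces add up to a multiple of three.
favorable_outcomes = {
    outcome for outcome in event_space if sum(outcome) % 3 == 0
}

# Probability = favorable outcomes / all outcomes.
probability = len(favorable_outcomes) / len(event_space)
print(len(event_space), len(favorable_outcomes), round(probability, 4))
# prints: 36 12 0.3333
```

Because the filter uses `sum(outcome)`, the same code works for any number of dice by changing `n_dice`; only the condition inside the comprehension needs to change for a different question.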
2507 01:31:06,880 --> 01:31:10,239 we end up with 12. so we have 36 total 2508 01:31:10,239 --> 01:31:11,920 options 2509 01:31:11,920 --> 01:31:14,880 we have 12 that add up 2510 01:31:14,880 --> 01:31:17,840 to a multiple of three 2511 01:31:17,840 --> 01:31:20,159 and we can easily compute the 2512 01:31:20,159 --> 01:31:22,239 probability of this 2513 01:31:22,239 --> 01:31:24,480 by simply taking the length 2514 01:31:24,480 --> 01:31:26,159 of our favorable outcome over the length 2515 01:31:26,159 --> 01:31:29,719 of the event space 2516 01:31:30,000 --> 01:31:31,600 if we print it out let me put that in 2517 01:31:31,600 --> 01:31:32,480 there 2518 01:31:32,480 --> 01:31:34,400 probability 2519 01:31:34,400 --> 01:31:36,400 last line so we just type it in we end 2520 01:31:36,400 --> 01:31:39,360 up with the 0.3333 chance 2521 01:31:39,360 --> 01:31:42,400 that's roughly a third 2522 01:31:42,400 --> 01:31:44,239 and we want to make this look nice so 2523 01:31:44,239 --> 01:31:45,679 let's go ahead and put in another line 2524 01:31:45,679 --> 01:31:47,520 there the probability of getting the sum 2525 01:31:47,520 --> 01:31:51,199 which is a multiple of 3 is 2526 01:31:51,199 --> 01:31:54,199 0.3333 2527 01:31:54,800 --> 01:31:56,960 we can compute the same thing for five 2528 01:31:56,960 --> 01:31:59,040 dice 2529 01:31:59,040 --> 01:32:01,440 and if we do this for five dice and go 2530 01:32:01,440 --> 01:32:03,840 and run it uh you can see we just have a 2531 01:32:03,840 --> 01:32:05,920 huge amount of choices 2532 01:32:05,920 --> 01:32:08,400 so just goes on and on down here and we 2533 01:32:08,400 --> 01:32:09,920 can look at 2534 01:32:09,920 --> 01:32:10,719 the 2535 01:32:10,719 --> 01:32:14,159 length of the event space 2536 01:32:19,760 --> 01:32:23,040 and we have 7776 2537 01:32:23,040 --> 01:32:26,080 choices that's a lot of choices 2538 01:32:26,080 --> 01:32:27,840 and if we want to ask the question like 2539
01:32:27,840 --> 01:32:29,679 we did above uh 2540 01:32:29,679 --> 01:32:31,440 what is the probability that the sum is a 2541 01:32:31,440 --> 01:32:33,920 multiple of five but not a multiple of 2542 01:32:33,920 --> 01:32:35,120 three 2543 01:32:35,120 --> 01:32:37,040 we can go through all of these different 2544 01:32:37,040 --> 01:32:38,880 options and then 2545 01:32:38,880 --> 01:32:40,480 you can see here 2546 01:32:40,480 --> 01:32:44,239 d1 d2 d3 d4 d5 equals the outcome 2547 01:32:44,239 --> 01:32:46,880 and if you add these all together and 2548 01:32:46,880 --> 01:32:48,560 the 2549 01:32:48,560 --> 01:32:50,239 division by five has a 2550 01:32:50,239 --> 01:32:52,320 remainder of zero 2551 01:32:52,320 --> 01:32:54,719 but the remainder of a division 2552 01:32:54,719 --> 01:32:56,960 by three is not equal to zero 2553 01:32:56,960 --> 01:32:59,679 so the remainder for five is equal to zero 2554 01:32:59,679 --> 01:33:01,840 but the remainder for three is not we can 2555 01:33:01,840 --> 01:33:03,840 just append that on here and then we can 2556 01:33:03,840 --> 01:33:06,960 look at that uh favorable outcome 2557 01:33:06,960 --> 01:33:08,400 we'll go ahead and set that and we'll 2558 01:33:08,400 --> 01:33:10,159 just take a look at this what's our 2559 01:33:10,159 --> 01:33:11,440 length 2560 01:33:11,440 --> 01:33:15,360 of our favorable outcome 2561 01:33:19,280 --> 01:33:20,480 it's always good to see what we're 2562 01:33:20,480 --> 01:33:23,199 working with and so we have 904 out of 2563 01:33:23,199 --> 01:33:26,199 7776 2564 01:33:26,480 --> 01:33:28,800 and then of course we can just do a 2565 01:33:28,800 --> 01:33:30,880 simple division to get the probability 2566 01:33:30,880 --> 01:33:32,320 on here what's the probability that 2567 01:33:32,320 --> 01:33:33,600 we're going to roll 2568 01:33:33,600 --> 01:33:35,600 a multiple of 5 when you add them 2569 01:33:35,600 --> 01:33:37,199 together 2570 01:33:37,199 --> 01:33:40,080 but
not a multiple of three 2571 01:33:40,080 --> 01:33:41,360 and so we're just going to divide those 2572 01:33:41,360 --> 01:33:43,440 two numbers and you can see here we get 2573 01:33:43,440 --> 01:33:45,679 point one one six two five five or 2574 01:33:45,679 --> 01:33:49,960 eleven point six two percent 2575 01:33:50,880 --> 01:33:53,120 and so you can really have a nice visual 2576 01:33:53,120 --> 01:33:55,920 that this is not really complicated math 2577 01:33:55,920 --> 01:33:58,000 right here on probabilities 2578 01:33:58,000 --> 01:34:00,080 it's just how many options do you have 2579 01:34:00,080 --> 01:34:02,400 and how many of those are you possibly 2580 01:34:02,400 --> 01:34:05,040 going to be able to come up with with 2581 01:34:05,040 --> 01:34:07,120 the solution you're looking for 2582 01:34:07,120 --> 01:34:10,400 and this leads us to a confusion matrix 2583 01:34:10,400 --> 01:34:12,480 a confusion matrix is a table which is 2584 01:34:12,480 --> 01:34:14,239 used to describe the performance of a 2585 01:34:14,239 --> 01:34:16,239 classification model on a set of test 2586 01:34:16,239 --> 01:34:19,280 data for which the true values are known 2587 01:34:19,280 --> 01:34:20,960 and so you'll see in the left we have 2588 01:34:20,960 --> 01:34:23,840 the predicted and the actual 2589 01:34:23,840 --> 01:34:26,639 and we have a negative uh false negative 2590 01:34:26,639 --> 01:34:29,520 positive true positive 2591 01:34:29,520 --> 01:34:32,159 and then we have false positive and true 2592 01:34:32,159 --> 01:34:35,199 negative and you can think of this as 2593 01:34:35,199 --> 01:34:38,560 your predicted model what does that mean 2594 01:34:38,560 --> 01:34:40,800 that means if you divided your data and 2595 01:34:40,800 --> 01:34:42,560 you use two-third of this to create the 2596 01:34:42,560 --> 01:34:43,600 model 2597 01:34:43,600 --> 01:34:45,520 you might then test it against an actual 2598 01:34:45,520 --> 01:34:47,440 case for the last 
third to see how well 2599 01:34:47,440 --> 01:34:50,000 it comes out how many times was it 2600 01:34:50,000 --> 01:34:52,719 true positive versus a 2601 01:34:52,719 --> 01:34:54,560 false positive it gave a false positive 2602 01:34:54,560 --> 01:34:55,679 response 2603 01:34:55,679 --> 01:34:58,239 and you can imagine in medical 2604 01:34:58,239 --> 01:35:00,480 situations this is a pretty big deal you 2605 01:35:00,480 --> 01:35:02,480 don't want to give a false positive so 2606 01:35:02,480 --> 01:35:04,480 you might adjust your model accordingly 2607 01:35:04,480 --> 01:35:06,800 so you don't have a false positive say 2608 01:35:06,800 --> 01:35:09,199 with a covid virus test it'd be better to 2609 01:35:09,199 --> 01:35:10,800 have a false negative and they go back 2610 01:35:10,800 --> 01:35:13,760 and get re-tested than to have 30 false 2611 01:35:13,760 --> 01:35:16,239 positives where then the test is pretty 2612 01:35:16,239 --> 01:35:17,760 much invalid 2613 01:35:17,760 --> 01:35:20,480 so in a use case like cancer prediction 2614 01:35:20,480 --> 01:35:22,560 let's consider an example where a cancer 2615 01:35:22,560 --> 01:35:24,320 prediction model is put to the test for 2616 01:35:24,320 --> 01:35:26,400 its accuracy and precision 2617 01:35:26,400 --> 01:35:28,239 the actual result of a person's medical 2618 01:35:28,239 --> 01:35:30,480 report is compared with the prediction 2619 01:35:30,480 --> 01:35:33,280 made by the machine learning model 2620 01:35:33,280 --> 01:35:34,800 and so you can see here here's our 2621 01:35:34,800 --> 01:35:36,639 actual and predicted whether they have 2622 01:35:36,639 --> 01:35:38,400 cancer or not you know cancer is a big one 2623 01:35:38,400 --> 01:35:40,239 you don't want to have a 2624 01:35:40,239 --> 01:35:42,880 false positive i mean a false negative 2625 01:35:42,880 --> 01:35:44,239 in other words you don't want to have it 2626 01:35:44,239 --> 01:35:46,239 tell you that you don't have cancer when 2627
01:35:46,239 --> 01:35:48,400 you do so that would be something you'd 2628 01:35:48,400 --> 01:35:50,719 really be looking for in this particular 2629 01:35:50,719 --> 01:35:53,760 domain you don't want a false negative 2630 01:35:53,760 --> 01:35:55,120 and this is again you know you've 2631 01:35:55,120 --> 01:35:57,199 created a model you have hundreds of 2632 01:35:57,199 --> 01:35:59,600 people or thousands of pieces of data 2633 01:35:59,600 --> 01:36:00,800 that come in 2634 01:36:00,800 --> 01:36:02,639 there's a real famous case study where 2635 01:36:02,639 --> 01:36:04,159 they have the imagery and all the 2636 01:36:04,159 --> 01:36:05,440 measurements they take and there's about 2637 01:36:05,440 --> 01:36:08,239 36 different measurements they take 2638 01:36:08,239 --> 01:36:11,040 and then if you run the a basic model 2639 01:36:11,040 --> 01:36:12,719 you want to know just how accurate it is 2640 01:36:12,719 --> 01:36:15,120 how many negative results do you have 2641 01:36:15,120 --> 01:36:16,560 that are either telling people they have 2642 01:36:16,560 --> 01:36:18,159 cancer that don't or telling people that 2643 01:36:18,159 --> 01:36:20,159 don't have cancer that they do and then 2644 01:36:20,159 --> 01:36:22,639 we can take these numbers and we can 2645 01:36:22,639 --> 01:36:24,560 feed them into our accuracy our 2646 01:36:24,560 --> 01:36:27,120 precision and our recall 2647 01:36:27,120 --> 01:36:28,800 so accuracy precision and recall 2648 01:36:28,800 --> 01:36:30,800 accuracy metric to measure how 2649 01:36:30,800 --> 01:36:33,520 accurately the results are predicted 2650 01:36:33,520 --> 01:36:35,040 and this is your 2651 01:36:35,040 --> 01:36:36,239 total 2652 01:36:36,239 --> 01:36:38,159 true where you got the right results you 2653 01:36:38,159 --> 01:36:39,760 add them together the true positive the 2654 01:36:39,760 --> 01:36:40,960 true negative 2655 01:36:40,960 --> 01:36:43,440 over all the results so what percentage 2656 
01:36:43,440 --> 01:36:45,920 of them were accurate versus what were 2657 01:36:45,920 --> 01:36:47,199 wrong 2658 01:36:47,199 --> 01:36:49,040 we talked about precision is a metric to 2659 01:36:49,040 --> 01:36:50,480 measure how many of the cases 2660 01:36:50,480 --> 01:36:52,320 predicted positive actually turned out 2661 01:36:52,320 --> 01:36:54,080 to be positive 2662 01:36:54,080 --> 01:36:57,199 uh so we have a precision on 2663 01:36:57,199 --> 01:36:58,639 true positive 2664 01:36:58,639 --> 01:37:00,639 again if you're talking about like uh 2665 01:37:00,639 --> 01:37:04,480 covid testing with the viruses uh you 2666 01:37:04,480 --> 01:37:07,040 really want this to be a high number you 2667 01:37:07,040 --> 01:37:08,400 want this true 2668 01:37:08,400 --> 01:37:10,480 that to be the center point where you 2669 01:37:10,480 --> 01:37:11,840 might have the opposite if you're 2670 01:37:11,840 --> 01:37:14,320 dealing with a cancer where you want no 2671 01:37:14,320 --> 01:37:16,480 false negatives 2672 01:37:16,480 --> 01:37:18,560 so this is your metric on here precision 2673 01:37:18,560 --> 01:37:20,880 is your true positive over 2674 01:37:20,880 --> 01:37:22,560 true positive plus 2675 01:37:22,560 --> 01:37:24,320 false positive 2676 01:37:24,320 --> 01:37:26,239 and then your recall how many of the 2677 01:37:26,239 --> 01:37:28,480 actual positive cases we were able to 2678 01:37:28,480 --> 01:37:30,800 predict correctly with our model 2679 01:37:30,800 --> 01:37:33,040 so recall is the true positive over the true positive 2680 01:37:33,040 --> 01:37:36,239 plus the false negative on there and 2681 01:37:36,239 --> 01:37:38,320 we'll want to go ahead and do a demo on 2682 01:37:38,320 --> 01:37:42,000 the naive bayes classifier before i get 2683 01:37:42,000 --> 01:37:45,280 too far into a naive bayes classifier 2684 01:37:45,280 --> 01:37:46,320 because we're going to pull it from the 2685 01:37:46,320 --> 01:37:49,920 sk learn or the scikit 2686
01:37:49,920 --> 01:37:51,920 let's go ahead kind of an interesting 2687 01:37:51,920 --> 01:37:53,679 page here for classifiers when you go 2688 01:37:53,679 --> 01:37:55,840 into the sk learn kit there's a lot of 2689 01:37:55,840 --> 01:37:57,840 ways to do classification and i'll just 2690 01:37:57,840 --> 01:37:59,840 zoom up in here so you can see some of 2691 01:37:59,840 --> 01:38:01,440 the titles 2692 01:38:01,440 --> 01:38:03,199 there's everything from the nearest 2693 01:38:03,199 --> 01:38:05,760 neighbor linear 2694 01:38:05,760 --> 01:38:07,119 but we're going to be focusing on the 2695 01:38:07,119 --> 01:38:09,520 naive bayes over here 2696 01:38:09,520 --> 01:38:12,480 and this is just a sample data set that 2697 01:38:12,480 --> 01:38:14,880 they put together and you can see how 2698 01:38:14,880 --> 01:38:16,560 some of these have a very different 2699 01:38:16,560 --> 01:38:17,600 output 2700 01:38:17,600 --> 01:38:20,400 the naive bayes remember is set up as 2701 01:38:20,400 --> 01:38:22,480 probably the most simplified uh 2702 01:38:22,480 --> 01:38:24,880 calculator or set of predictions out 2703 01:38:24,880 --> 01:38:25,679 there 2704 01:38:25,679 --> 01:38:27,600 and so what we've been talking about 2705 01:38:27,600 --> 01:38:29,360 with the true false and stuff like that 2706 01:38:29,360 --> 01:38:31,760 where there's a 2707 01:38:31,760 --> 01:38:33,760 and then an assumption that there is 2708 01:38:33,760 --> 01:38:35,040 independence between the 2709 01:38:35,040 --> 01:38:36,719 features where the features are 2710 01:38:36,719 --> 01:38:39,600 assumed not to depend on each other 2711 01:38:39,600 --> 01:38:42,239 uh then we can go ahead and use that for 2712 01:38:42,239 --> 01:38:43,760 the prediction and so that's what we're 2713 01:38:43,760 --> 01:38:46,800 using is a naive bayes classifier versus 2714 01:38:46,800 --> 01:38:48,239 many of the other classifiers that are 2715 01:38:48,239 --> 01:38:50,639 out
there 2716 01:38:51,199 --> 01:38:54,080 for this we're going to use the social 2717 01:38:54,080 --> 01:38:56,480 network ads it's a little data set on 2718 01:38:56,480 --> 01:38:57,520 here 2719 01:38:57,520 --> 01:39:00,000 and let me go and just open that up the 2720 01:39:00,000 --> 01:39:01,520 file 2721 01:39:01,520 --> 01:39:04,560 here we go it has user id gender age 2722 01:39:04,560 --> 01:39:07,440 estimated salary uh purchased 2723 01:39:07,440 --> 01:39:10,000 and so we have you can see the user id 2724 01:39:10,000 --> 01:39:12,159 male 19 2725 01:39:12,159 --> 01:39:14,800 estimated salary 19 000 2726 01:39:14,800 --> 01:39:17,360 and purchased zero so it's either gonna 2727 01:39:17,360 --> 01:39:19,280 make a purchase or not 2728 01:39:19,280 --> 01:39:21,760 so look at that last one zero one we 2729 01:39:21,760 --> 01:39:23,600 should be thinking of binomials we 2730 01:39:23,600 --> 01:39:26,000 should be thinking of a simple naive 2731 01:39:26,000 --> 01:39:29,679 bayes classifier kind of setup 2732 01:39:29,760 --> 01:39:31,280 so if we close this out we're going to 2733 01:39:31,280 --> 01:39:34,880 go ahead and import our numpy as np 2734 01:39:34,880 --> 01:39:36,480 we're going to nice to have a good 2735 01:39:36,480 --> 01:39:38,639 visual of our data so we'll put in our 2736 01:39:38,639 --> 01:39:41,520 matplot library here's our pandas our 2737 01:39:41,520 --> 01:39:44,239 data frame 2738 01:39:44,239 --> 01:39:45,360 and then we're going to go ahead and 2739 01:39:45,360 --> 01:39:47,600 import the data set and the data set's 2740 01:39:47,600 --> 01:39:49,040 going to be we're going to read it from 2741 01:39:49,040 --> 01:39:51,920 the social network ads.csv then we're 2742 01:39:51,920 --> 01:39:53,199 going to print the head just so you can 2743 01:39:53,199 --> 01:39:54,480 see it again 2744 01:39:54,480 --> 01:39:56,400 even though i showed you it in the file 2745 01:39:56,400 --> 01:39:59,679 and x equals the data set i 
location 2746 01:39:59,679 --> 01:40:02,080 two three values and y is going to be 2747 01:40:02,080 --> 01:40:03,199 the four 2748 01:40:03,199 --> 01:40:05,440 column four let me just run this it's a 2749 01:40:05,440 --> 01:40:07,360 little easier to go over that 2750 01:40:07,360 --> 01:40:08,639 you can see right here we're going to be 2751 01:40:08,639 --> 01:40:09,760 looking at 2752 01:40:09,760 --> 01:40:12,159 0 1 2 as age 2753 01:40:12,159 --> 01:40:15,600 and estimated salary so 2 3 2754 01:40:15,600 --> 01:40:18,080 and that's what i location just means 2755 01:40:18,080 --> 01:40:19,440 that we're 2756 01:40:19,440 --> 01:40:21,679 looking at the number versus a regular 2757 01:40:21,679 --> 01:40:23,600 location a regular location you'd 2758 01:40:23,600 --> 01:40:27,119 actually say age and estimated salary 2759 01:40:27,119 --> 01:40:28,880 and then column four is did they make a 2760 01:40:28,880 --> 01:40:31,600 purchase they purchased something 2761 01:40:31,600 --> 01:40:32,880 so those are the three columns we're 2762 01:40:32,880 --> 01:40:34,480 going to be looking at when we do this 2763 01:40:34,480 --> 01:40:36,080 and we've gone ahead and imported these 2764 01:40:36,080 --> 01:40:36,800 and 2765 01:40:36,800 --> 01:40:38,960 imported the data so now our data set is 2766 01:40:38,960 --> 01:40:42,560 all set with this information in it 2767 01:40:43,760 --> 01:40:45,199 and we'll need to go ahead and split the 2768 01:40:45,199 --> 01:40:48,320 data up so we need our from the sk learn 2769 01:40:48,320 --> 01:40:51,119 model selection we can import train test 2770 01:40:51,119 --> 01:40:52,560 split 2771 01:40:52,560 --> 01:40:54,400 this does a nice job we can set the 2772 01:40:54,400 --> 01:40:56,880 random state so randomly picks the data 2773 01:40:56,880 --> 01:40:59,360 and we're just going to take uh 25 of 2774 01:40:59,360 --> 01:41:01,440 it's going to go into the test our x 2775 01:41:01,440 --> 01:41:03,119 test and our y test 
2776 01:41:03,119 --> 01:41:05,760 and the 75 will go to x train and y 2777 01:41:05,760 --> 01:41:06,800 train 2778 01:41:06,800 --> 01:41:08,400 that way once we 2779 01:41:08,400 --> 01:41:10,639 create our model we can then have data 2780 01:41:10,639 --> 01:41:12,719 to see just how accurate or how well it 2781 01:41:12,719 --> 01:41:16,960 has performed with our prediction 2782 01:41:17,280 --> 01:41:20,400 the next step in pre-processing our data 2783 01:41:20,400 --> 01:41:23,600 is to go ahead and do feature scaling 2784 01:41:23,600 --> 01:41:25,119 now a lot of this is start to look 2785 01:41:25,119 --> 01:41:26,800 familiar if you've done a number of the 2786 01:41:26,800 --> 01:41:29,119 other modules and setup you should start 2787 01:41:29,119 --> 01:41:30,800 noticing that we 2788 01:41:30,800 --> 01:41:32,719 bring in our data we take a look at what 2789 01:41:32,719 --> 01:41:34,480 we're working with 2790 01:41:34,480 --> 01:41:35,920 we go ahead and split it up into 2791 01:41:35,920 --> 01:41:37,920 training and testing 2792 01:41:37,920 --> 01:41:39,440 in this case we're going to go ahead and 2793 01:41:39,440 --> 01:41:41,360 scale it scale it means we're putting it 2794 01:41:41,360 --> 01:41:44,880 between a value of minus 1 and 1 2795 01:41:44,880 --> 01:41:46,719 or or someplace in the middle ground 2796 01:41:46,719 --> 01:41:47,520 there 2797 01:41:47,520 --> 01:41:49,520 this way if you have any huge set you 2798 01:41:49,520 --> 01:41:51,600 don't have this huge um 2799 01:41:51,600 --> 01:41:53,360 setup if we go back up to here where 2800 01:41:53,360 --> 01:41:55,920 salary the salary is 2801 01:41:55,920 --> 01:41:59,280 20 000 versus age 35. 
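The split and scaling steps described above can be sketched in plain Python, so the example runs without scikit-learn. This is a sketch under stated assumptions, not the demo's code: it assumes the scaling step is standardization (subtract the mean, divide by the standard deviation), which the narration does not state explicitly; the helper names `train_test_split_rows`, `fit_scaler` and `scale` are hypothetical; and the rows are a small illustrative sample rather than the real file (the first two echo values mentioned in the narration, the rest are placeholders).

```python
import random
import statistics

# Illustrative (age, estimated_salary, purchased) rows; not read from the real CSV.
rows = [
    (19, 19000, 0), (35, 20000, 0), (26, 43000, 0), (27, 57000, 0),
    (47, 25000, 1), (45, 26000, 1), (46, 28000, 1), (48, 141000, 1),
]

def train_test_split_rows(rows, test_size=0.25, seed=0):
    """Shuffle a copy of the rows and hold out a test_size fraction."""
    shuffled = rows[:]
    random.Random(seed).shuffle(shuffled)
    n_test = round(len(shuffled) * test_size)
    return shuffled[n_test:], shuffled[:n_test]

def fit_scaler(train_rows):
    """Learn each feature's mean and standard deviation from training rows only."""
    columns = zip(*[(age, salary) for age, salary, _ in train_rows])
    return [(statistics.mean(c), statistics.pstdev(c)) for c in columns]

def scale(rows, params):
    """Standardize each feature: (value - mean) / standard deviation."""
    return [
        tuple((value - mean) / std for value, (mean, std) in zip(row[:2], params))
        for row in rows
    ]

train, test = train_test_split_rows(rows)
params = fit_scaler(train)  # fitted on the training set only
x_train, x_test = scale(train, params), scale(test, params)
print(len(train), len(test))
```

After standardization each training feature has mean 0 and standard deviation 1, so age and salary sit on a comparable scale; the values cluster around zero but are not strictly limited to the range -1 to 1.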
2802 01:41:59,280 --> 01:42:01,119 well there's a good chance with a lot of 2803 01:42:01,119 --> 01:42:03,920 the back end math that 20 000 will skew 2804 01:42:03,920 --> 01:42:05,920 the results and the estimated salary 2805 01:42:05,920 --> 01:42:08,239 will have a higher impact than the age 2806 01:42:08,239 --> 01:42:09,600 instead of balancing them out and 2807 01:42:09,600 --> 01:42:11,360 letting the calculations weigh them 2808 01:42:11,360 --> 01:42:13,360 properly 2809 01:42:13,360 --> 01:42:16,159 and finally we get to actually create 2810 01:42:16,159 --> 01:42:19,520 our naive bayes model 2811 01:42:19,520 --> 01:42:21,360 um and then we're going to go ahead and 2812 01:42:21,360 --> 01:42:25,440 import the gaussian naive bayes 2813 01:42:25,440 --> 01:42:28,480 and the gaussian is is the most basic 2814 01:42:28,480 --> 01:42:30,800 one that's what we're looking at now it 2815 01:42:30,800 --> 01:42:33,600 turns out though if you go to the sk 2816 01:42:33,600 --> 01:42:34,400 learn 2817 01:42:34,400 --> 01:42:35,440 kit 2818 01:42:35,440 --> 01:42:36,960 they have a number of different ones you 2819 01:42:36,960 --> 01:42:39,360 can pull in there there's a 2820 01:42:39,360 --> 01:42:41,119 bernoulli i know i've never used that 2821 01:42:41,119 --> 01:42:44,080 one categorical 2822 01:42:44,080 --> 01:42:46,639 complement and here's our gaussian 2823 01:42:46,639 --> 01:42:48,159 so there's a number of different options 2824 01:42:48,159 --> 01:42:49,440 you can look at 2825 01:42:49,440 --> 01:42:51,679 gaussian when you come to the naive 2826 01:42:51,679 --> 01:42:54,239 bayes is the most commonly used 2827 01:42:54,239 --> 01:42:56,080 so we're talking about the naive bayes 2828 01:42:56,080 --> 01:42:57,280 that's usually what people are talking 2829 01:42:57,280 --> 01:42:58,639 about when they when they're pulling 2830 01:42:58,639 --> 01:42:59,600 this in 2831 01:42:59,600 --> 01:43:01,040 and one of the nice things about the 2832 
01:43:01,040 --> 01:43:03,119 gaussian if you go to their website um 2833 01:43:03,119 --> 01:43:06,000 to sk learn the naive bayes gaussian 2834 01:43:06,000 --> 01:43:07,520 there's a lot of cool features one of 2835 01:43:07,520 --> 01:43:10,400 them is you can do partial fit on here 2836 01:43:10,400 --> 01:43:11,920 that means if you have a huge amount of 2837 01:43:11,920 --> 01:43:13,440 data you don't have to process it all 2838 01:43:13,440 --> 01:43:17,440 at once you can batch it into the 2839 01:43:17,440 --> 01:43:20,239 gaussian nb model and there's many other 2840 01:43:20,239 --> 01:43:21,920 different things you can do with it as 2841 01:43:21,920 --> 01:43:24,480 far as fitting the data and how you 2842 01:43:24,480 --> 01:43:25,920 manipulate it 2843 01:43:25,920 --> 01:43:27,440 we're just doing the basics so we're 2844 01:43:27,440 --> 01:43:28,400 going to go ahead and create our 2845 01:43:28,400 --> 01:43:30,239 classifier we're going to set it equal to the 2846 01:43:30,239 --> 01:43:32,639 gaussian nb 2847 01:43:32,639 --> 01:43:33,920 and then we're going to do a fit we're 2848 01:43:33,920 --> 01:43:36,639 going to fit our training data and our 2849 01:43:36,639 --> 01:43:40,960 training solution so x train y train 2850 01:43:41,280 --> 01:43:43,280 and we'll go ahead and run this uh it's 2851 01:43:43,280 --> 01:43:45,040 going to tell us that it ran the code 2852 01:43:45,040 --> 01:43:47,280 right there 2853 01:43:47,280 --> 01:43:49,840 and now we have our trained classifier 2854 01:43:49,840 --> 01:43:51,280 model 2855 01:43:51,280 --> 01:43:53,199 so the next step is we need to go ahead 2856 01:43:53,199 --> 01:43:54,880 and run a prediction we're going to do 2857 01:43:54,880 --> 01:43:56,480 our y predict equals the 2858 01:43:56,480 --> 01:43:58,400 classifier.predict 2859 01:43:58,400 --> 01:43:59,600 x test 2860 01:43:59,600 --> 01:44:01,600 so here we fit the data and now we're 2861 01:44:01,600 --> 01:44:05,239 going to go
ahead and predict 2862 01:44:06,400 --> 01:44:10,239 and now we get to our confusion matrix 2863 01:44:10,239 --> 01:44:11,280 so from 2864 01:44:11,280 --> 01:44:13,520 the sk learn metrics you can 2865 01:44:13,520 --> 01:44:15,679 import your confusion matrix 2866 01:44:15,679 --> 01:44:17,440 it just saves you from doing all the 2867 01:44:17,440 --> 01:44:19,760 simple math it does it all for you 2868 01:44:19,760 --> 01:44:20,880 and then we'll go ahead and create our 2869 01:44:20,880 --> 01:44:23,199 confusion matrix with the y test and 2870 01:44:23,199 --> 01:44:26,320 the y predict so we have our actual 2871 01:44:26,320 --> 01:44:29,119 and we have our predicted value 2872 01:44:29,119 --> 01:44:30,880 and you can see from here this is the 2873 01:44:30,880 --> 01:44:32,800 chart we looked at here's predicted so 2874 01:44:32,800 --> 01:44:35,040 true negative false positive 2875 01:44:35,040 --> 01:44:38,960 false negative true positive 2876 01:44:39,280 --> 01:44:41,119 and if we go ahead and run this there we 2877 01:44:41,119 --> 01:44:45,600 have it 65 3 7 25 2878 01:44:45,600 --> 01:44:48,320 and in this particular prediction we had 2879 01:44:48,320 --> 01:44:51,760 65 non-buyers we predicted correctly as 2880 01:44:51,760 --> 01:44:52,719 not going to make a 2881 01:44:52,719 --> 01:44:53,679 purchase 2882 01:44:53,679 --> 01:44:55,920 and three non-buyers we guessed wrong 2883 01:44:55,920 --> 01:44:58,000 and then we had 25 we correctly predicted would 2884 01:44:58,000 --> 01:45:00,960 purchase and seven buyers we missed so 2885 01:45:00,960 --> 01:45:05,280 there's our confusion matrix 2886 01:45:05,280 --> 01:45:07,360 at this point if you were with your 2887 01:45:07,360 --> 01:45:09,760 shareholders or a board meeting 2888 01:45:09,760 --> 01:45:11,520 you would start to hear some snoozing if 2889 01:45:11,520 --> 01:45:12,960 they were looking at the numbers and you 2890 01:45:12,960 --> 01:45:14,840 say hey here's my confusion 2891
01:45:14,840 --> 01:45:17,520 matrix so let's go ahead and visualize 2892 01:45:17,520 --> 01:45:19,280 the results 2893 01:45:19,280 --> 01:45:20,639 we're going to pull from the matplot 2894 01:45:20,639 --> 01:45:24,960 library colors import listed color map 2895 01:45:25,520 --> 01:45:27,440 and this is actually my machine is going 2896 01:45:27,440 --> 01:45:31,760 to throw an error because this is being 2897 01:45:31,760 --> 01:45:33,840 because of the way the setup is i have a 2898 01:45:33,840 --> 01:45:35,920 newer version on here than when they put 2899 01:45:35,920 --> 01:45:37,440 together the demo 2900 01:45:37,440 --> 01:45:39,679 and we need our x set and our y set 2901 01:45:39,679 --> 01:45:42,560 which is our x train and y train 2902 01:45:42,560 --> 01:45:45,440 and then we'll create our x1 x2 2903 01:45:45,440 --> 01:45:47,520 and we'll put that into a grid 2904 01:45:47,520 --> 01:45:50,560 uh and we set our x set minimum stop and 2905 01:45:50,560 --> 01:45:52,560 our x at max stop 2906 01:45:52,560 --> 01:45:53,840 and if you come all the way over here 2907 01:45:53,840 --> 01:45:56,239 we're going to step .01 this is going to 2908 01:45:56,239 --> 01:45:59,040 give us a nice line is what that's doing 2909 01:45:59,040 --> 01:46:01,600 and we're going to plot the contour 2910 01:46:01,600 --> 01:46:04,400 plot the x limit plot the y limit 2911 01:46:04,400 --> 01:46:06,719 and put the scatter plot in there let's 2912 01:46:06,719 --> 01:46:08,239 go ahead and run this 2913 01:46:08,239 --> 01:46:11,119 to be honest when i'm doing these graphs 2914 01:46:11,119 --> 01:46:12,560 there's so many different ways to do 2915 01:46:12,560 --> 01:46:13,920 that there's so many different ways to 2916 01:46:13,920 --> 01:46:15,760 put this code together 2917 01:46:15,760 --> 01:46:18,159 to show you what we're doing it's a lot 2918 01:46:18,159 --> 01:46:20,480 easier to pull up the graph and then go 2919 01:46:20,480 --> 01:46:22,880 back up and 
explain it 2920 01:46:22,880 --> 01:46:25,280 so the first thing we want to note here 2921 01:46:25,280 --> 01:46:28,000 when we're looking at the data 2922 01:46:28,000 --> 01:46:30,639 is this is the training set 2923 01:46:30,639 --> 01:46:32,719 and so we have those who didn't make a 2924 01:46:32,719 --> 01:46:34,880 purchase we've drawn a nice area for 2925 01:46:34,880 --> 01:46:36,000 that 2926 01:46:36,000 --> 01:46:38,639 that's defined by the naive bayes setup 2927 01:46:38,639 --> 01:46:40,400 and then we have those who did make a 2928 01:46:40,400 --> 01:46:42,480 purchase the green and you can see that 2929 01:46:42,480 --> 01:46:44,480 some of the green drops fall into the 2930 01:46:44,480 --> 01:46:46,239 red area and some of the red dots fall 2931 01:46:46,239 --> 01:46:47,440 into the green 2932 01:46:47,440 --> 01:46:49,280 so even our training set isn't going to 2933 01:46:49,280 --> 01:46:51,600 be a hundred percent uh we couldn't do 2934 01:46:51,600 --> 01:46:52,560 that 2935 01:46:52,560 --> 01:46:54,239 and so we're looking at our different 2936 01:46:54,239 --> 01:46:56,159 data coming down 2937 01:46:56,159 --> 01:46:58,880 uh we can kind of arrange our x1 x2 so 2938 01:46:58,880 --> 01:47:00,800 we have a nice plot going on and we're 2939 01:47:00,800 --> 01:47:02,159 going to create the 2940 01:47:02,159 --> 01:47:04,400 contour 2941 01:47:04,400 --> 01:47:05,920 that's that nice line that's drawn down 2942 01:47:05,920 --> 01:47:09,040 the middle on here with the red green 2943 01:47:09,040 --> 01:47:10,480 that's where that's what this is doing 2944 01:47:10,480 --> 01:47:12,400 right here with the reshape and notice 2945 01:47:12,400 --> 01:47:13,679 that we had to 2946 01:47:13,679 --> 01:47:14,800 uh 2947 01:47:14,800 --> 01:47:17,040 do the dot t if you remember from numpy 2948 01:47:17,040 --> 01:47:19,360 um if you did the numpy module 2949 01:47:19,360 --> 01:47:23,760 you end up with pairs you know x x1 x2 2950 01:47:23,760 
--> 01:47:27,280 x1 x2 next row and so forth you have to 2951 01:47:27,280 --> 01:47:29,199 flip it so it's all one row you have all 2952 01:47:29,199 --> 01:47:31,440 your x1s and all your x2s 2953 01:47:31,440 --> 01:47:32,480 so this is what we're kind of looking 2954 01:47:32,480 --> 01:47:36,159 for right here on this setup 2955 01:47:36,159 --> 01:47:38,960 and then the scatter plot is of course 2956 01:47:38,960 --> 01:47:40,560 your scattered data across there we're 2957 01:47:40,560 --> 01:47:42,159 just going through all the points that 2958 01:47:42,159 --> 01:47:44,480 puts these nice little dots on to our 2959 01:47:44,480 --> 01:47:46,560 setup on here and we have our estimated 2960 01:47:46,560 --> 01:47:48,800 salary and our h and then of course the 2961 01:47:48,800 --> 01:47:51,360 dots are did they make a purchase or not 2962 01:47:51,360 --> 01:47:52,960 and just a quick note this is kind of 2963 01:47:52,960 --> 01:47:54,880 funny you can see up here where it says 2964 01:47:54,880 --> 01:47:59,040 x set y set equals x train y train which 2965 01:47:59,040 --> 01:48:02,239 seems kind of a little weird to do 2966 01:48:02,239 --> 01:48:03,679 this is because this is probably 2967 01:48:03,679 --> 01:48:05,679 originally a definition 2968 01:48:05,679 --> 01:48:07,280 so it was its own module that could be 2969 01:48:07,280 --> 01:48:09,199 called over and over again 2970 01:48:09,199 --> 01:48:11,040 and which is really a good way to do it 2971 01:48:11,040 --> 01:48:12,159 because the next thing we're going to do 2972 01:48:12,159 --> 01:48:14,480 is do the exact same thing 2973 01:48:14,480 --> 01:48:16,159 but we're going to visualize the test 2974 01:48:16,159 --> 01:48:17,679 set results 2975 01:48:17,679 --> 01:48:19,119 that way we can see what happened with 2976 01:48:19,119 --> 01:48:22,719 our test group our 25 percent 2977 01:48:22,719 --> 01:48:25,040 and you can see down here we have the 2978 01:48:25,040 --> 01:48:26,239 test set 
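The grid-and-transpose step behind the coloured regions can be illustrated without numpy or matplotlib. This is a minimal sketch under assumptions: it assumes the demo flattens a grid of x1 and x2 values, stacks them, transposes them into one pair per grid point, predicts a label for each pair and reshapes the labels back to the grid. The helpers `make_grid` and `transpose` are hypothetical, and `toy_classifier` is a made-up stand-in for the trained model.

```python
def make_grid(x1_values, x2_values):
    """Flat x1 and x2 lists covering every grid point, one grid row per x2 value."""
    x1_flat, x2_flat = [], []
    for x2 in x2_values:
        for x1 in x1_values:
            x1_flat.append(x1)
            x2_flat.append(x2)
    return x1_flat, x2_flat

def transpose(rows):
    """Swap rows and columns of a list of equal-length lists."""
    return [list(column) for column in zip(*rows)]

def toy_classifier(point):
    # Hypothetical stand-in for the trained model: label 1 when x1 + x2 > 0.
    x1, x2 = point
    return 1 if x1 + x2 > 0 else 0

x1_values = [-1.0, 0.0, 1.0]
x2_values = [-1.0, 1.0]

x1_flat, x2_flat = make_grid(x1_values, x2_values)
stacked = [x1_flat, x2_flat]   # 2 rows: all the x1s, then all the x2s
pairs = transpose(stacked)     # 6 rows: one [x1, x2] pair per grid point

labels = [toy_classifier(pair) for pair in pairs]

# Reshape the flat labels back into the grid layout used for the contour.
n_cols = len(x1_values)
label_grid = [labels[i:i + n_cols] for i in range(0, len(labels), n_cols)]
print(label_grid)
# prints: [[0, 0, 0], [0, 1, 1]]
```

A contour plot then colours each grid cell by its label, which is what produces the two regions and the boundary between them; a smaller step between grid values gives a smoother boundary.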
2979 01:48:26,239 --> 01:48:27,119 and it 2980 01:48:27,119 --> 01:48:28,719 if you look at the two 2981 01:48:28,719 --> 01:48:30,080 graphs next to each other this one 2982 01:48:30,080 --> 01:48:31,600 obviously has 2983 01:48:31,600 --> 01:48:33,280 75 percent of the data so it's going to 2984 01:48:33,280 --> 01:48:34,960 show a lot more 2985 01:48:34,960 --> 01:48:37,920 this is only 25 percent of the data you can see 2986 01:48:37,920 --> 01:48:39,920 that there's a number that are kind of 2987 01:48:39,920 --> 01:48:41,360 on the edge as to whether it could 2988 01:48:41,360 --> 01:48:43,360 guess by age and income that they're going to 2989 01:48:43,360 --> 01:48:45,360 make a purchase or not 2990 01:48:45,360 --> 01:48:47,280 but that said it still is pretty clear 2991 01:48:47,280 --> 01:48:48,880 it's pretty good as far as how accurate the 2992 01:48:48,880 --> 01:48:52,080 estimate is and how well it does 2993 01:48:52,080 --> 01:48:52,880 now 2994 01:48:52,880 --> 01:48:55,360 graphs are really effective 2995 01:48:55,360 --> 01:48:57,760 for showing people what's going on but 2996 01:48:57,760 --> 01:49:00,000 you also need to have the numbers and so 2997 01:49:00,000 --> 01:49:01,679 we're going to do from sklearn we're 2998 01:49:01,679 --> 01:49:03,679 going to import metrics 2999 01:49:03,679 --> 01:49:04,719 and then we're going to print our 3000 01:49:04,719 --> 01:49:06,880 metrics classification report from the y 3001 01:49:06,880 --> 01:49:10,159 test and the y predict 3002 01:49:10,639 --> 01:49:13,119 and you can see here we have precision 3003 01:49:13,119 --> 01:49:15,280 precision of zeros is 0.90 there's our 3004 01:49:15,280 --> 01:49:16,560 recall 3005 01:49:16,560 --> 01:49:20,960 0.96 we have an f1 score and a support 3006 01:49:20,960 --> 01:49:23,920 and we have our precision the recall on 3007 01:49:23,920 --> 01:49:25,440 getting it right 3008 01:49:25,440 --> 01:49:27,440 and then we can do our accuracy the 3009 01:49:27,440 --> 01:49:30,000 
macro average and the weighted average 3010 01:49:30,000 --> 01:49:32,239 so you can see it pulls in 3011 01:49:32,239 --> 01:49:34,000 pretty good as far as 3012 01:49:34,000 --> 01:49:35,600 how accurate it is 3013 01:49:35,600 --> 01:49:37,600 you could say about 90 percent of the 3014 01:49:37,600 --> 01:49:40,800 time it's going to guess correctly 3015 01:49:40,800 --> 01:49:42,000 that they're not going to 3016 01:49:42,000 --> 01:49:44,800 purchase and we had an 89 percent chance that 3017 01:49:44,800 --> 01:49:46,480 they are going to purchase 3018 01:49:46,480 --> 01:49:48,320 um and then the other numbers as you get 3019 01:49:48,320 --> 01:49:49,520 down 3020 01:49:49,520 --> 01:49:50,960 have a little bit different meaning but 3021 01:49:50,960 --> 01:49:52,080 it's pretty straightforward on here 3022 01:49:52,080 --> 01:49:55,199 here's our accuracy and here's our macro 3023 01:49:55,199 --> 01:49:56,800 average and the weighted average and 3024 01:49:56,800 --> 01:49:58,320 everything else you might need and if 3025 01:49:58,320 --> 01:50:00,400 you forgot the exact definition of 3026 01:50:00,400 --> 01:50:02,560 accuracy it is the 3027 01:50:02,560 --> 01:50:05,280 true positive plus true negative over all of 3028 01:50:05,280 --> 01:50:06,880 the predictions 3029 01:50:06,880 --> 01:50:09,840 precision is your true positive over all 3030 01:50:09,840 --> 01:50:12,080 predicted positives true and false 3031 01:50:12,080 --> 01:50:14,560 and recall is your true positive over true 3032 01:50:14,560 --> 01:50:17,360 positive plus false negative 3033 01:50:17,360 --> 01:50:18,719 and we can just real quick flip back 3034 01:50:18,719 --> 01:50:20,000 there 3035 01:50:20,000 --> 01:50:22,639 so you can see those numbers on here 3036 01:50:22,639 --> 01:50:25,600 here's our precision here's our recall 3037 01:50:25,600 --> 01:50:28,320 and here's our accuracy on this 3038 01:50:28,320 --> 01:50:30,080 thank you for joining us for mathematics 3039 01:50:30,080 --> 
01:50:32,320 for machine learning my name is richard 3040 01:50:32,320 --> 01:50:36,760 kirschner with the simplilearn team 3041 01:50:37,150 --> 01:50:38,880 [Music] 3042 01:50:38,880 --> 01:50:40,639 hi there if you like this video 3043 01:50:40,639 --> 01:50:42,400 subscribe to the simplilearn youtube 3044 01:50:42,400 --> 01:50:44,960 channel and click here to watch similar 3045 01:50:44,960 --> 01:50:47,199 videos to nerd up and get certified 3046 01:50:47,199 --> 01:50:50,520 click here
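The definitions of accuracy, precision and recall given near the end of the session can be checked by hand with a short sketch. The labels below are made-up illustration data, not the results from the video, and `classification_metrics` is a hypothetical helper written for this example; it is not part of sklearn.

```python
def classification_metrics(y_true, y_pred, positive=1):
    """Compute accuracy, precision and recall for one positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    tn = sum(1 for t, p in pairs if t != positive and p != positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)

    # Accuracy: correct predictions over all predictions.
    accuracy = (tp + tn) / len(pairs)
    # Precision: true positives over everything predicted positive.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    # Recall: true positives over everything that really is positive.
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall}


# Made-up labels: 1 = made a purchase, 0 = did not.
y_test = [0, 0, 0, 0, 1, 1, 1, 0, 1, 0]
y_pred = [0, 0, 1, 0, 1, 1, 0, 0, 1, 0]

scores = classification_metrics(y_test, y_pred)
print(scores)
```

Calling the helper with `positive=0` gives the row for the people who did not purchase. The classification report from sklearn's metrics module shows the same quantities per class, along with the f1 score and support.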
