Mathematics For Machine Learning Essential Mathematics - Machine Learning Tutorial Simplilearn [English (auto-generated)]
1
00:00:07,359 --> 00:00:09,840
mathematics for machine learning
2
00:00:09,840 --> 00:00:11,759
my name is richard kirschner with the
3
00:00:11,759 --> 00:00:13,759
simplilearn team that's get certified
4
00:00:13,759 --> 00:00:15,040
get ahead
5
00:00:15,040 --> 00:00:16,400
we're going to cover mathematics for
6
00:00:16,400 --> 00:00:18,800
machine learning so today's agenda is
7
00:00:18,800 --> 00:00:21,039
going to cover data and its types then
8
00:00:21,039 --> 00:00:23,760
we're going to dive into linear algebra
9
00:00:23,760 --> 00:00:25,439
and its concepts
10
00:00:25,439 --> 00:00:28,240
calculus statistics for machine learning
11
00:00:28,240 --> 00:00:30,320
probability for machine learning
12
00:00:30,320 --> 00:00:32,159
hands-on demos
13
00:00:32,159 --> 00:00:33,360
and of course thrown in there in the
14
00:00:33,360 --> 00:00:35,120
middle is going to be your matrixes and
15
00:00:35,120 --> 00:00:36,960
a few other things to go along with all
16
00:00:36,960 --> 00:00:38,800
this
17
00:00:38,800 --> 00:00:40,719
data and its types
18
00:00:40,719 --> 00:00:42,559
data denotes the individual pieces of
19
00:00:42,559 --> 00:00:44,239
factual information collected from
20
00:00:44,239 --> 00:00:46,559
various sources it is stored processed
21
00:00:46,559 --> 00:00:49,360
and later used for analysis
22
00:00:49,360 --> 00:00:51,680
and so we see here just a huge grouping
23
00:00:51,680 --> 00:00:54,800
of information a lot of tech stuff money
24
00:00:54,800 --> 00:00:57,760
dollar signs numbers
25
00:00:57,760 --> 00:00:58,879
and then you have your performing
26
00:00:58,879 --> 00:01:00,719
analytics to drive insights and
27
00:01:00,719 --> 00:01:02,480
hopefully you have your
28
00:01:02,480 --> 00:01:04,000
shareholders gathered at the meeting
29
00:01:04,000 --> 00:01:05,280
and you're able to explain it in
30
00:01:05,280 --> 00:01:07,280
something they can understand
31
00:01:07,280 --> 00:01:10,000
so we talk about data types of data
32
00:01:10,000 --> 00:01:12,320
we have in our types of data we have a
33
00:01:12,320 --> 00:01:14,799
qualitative categorical
34
00:01:14,799 --> 00:01:17,280
you think nominal or ordinal
35
00:01:17,280 --> 00:01:18,880
and then you have your quantitative or
36
00:01:18,880 --> 00:01:20,880
numerical which is discrete or
37
00:01:20,880 --> 00:01:22,640
continuous
38
00:01:22,640 --> 00:01:24,320
and let's look a little closer at those
39
00:01:24,320 --> 00:01:27,040
data type vocabulary always people's
40
00:01:27,040 --> 00:01:29,680
favorite is the vocabulary words okay
41
00:01:29,680 --> 00:01:31,200
not mine
42
00:01:31,200 --> 00:01:33,200
but let's dive into this what we mean by
43
00:01:33,200 --> 00:01:34,159
nominal
44
00:01:34,159 --> 00:01:36,799
nominal data is used to label
45
00:01:36,799 --> 00:01:39,600
our variables without providing
46
00:01:39,600 --> 00:01:42,159
any measurable value
47
00:01:42,159 --> 00:01:46,320
country gender race hair color etc
48
00:01:46,320 --> 00:01:48,079
it's something that you either mark true
49
00:01:48,079 --> 00:01:50,640
or false this is a label it's on or off
50
00:01:50,640 --> 00:01:52,640
either they have a red hat on or they do
51
00:01:52,640 --> 00:01:53,680
not
52
00:01:53,680 --> 00:01:54,880
so a lot of times when you're thinking
53
00:01:54,880 --> 00:01:56,399
nominal data
54
00:01:56,399 --> 00:01:57,840
labels
55
00:01:57,840 --> 00:01:59,759
think of it as a true false kind of
56
00:01:59,759 --> 00:02:02,079
setup and we look at ordinal this is
57
00:02:02,079 --> 00:02:04,399
categorical data with a set order or a
58
00:02:04,399 --> 00:02:06,079
scale to it
59
00:02:06,079 --> 00:02:07,600
and you can think of salary range as a
60
00:02:07,600 --> 00:02:08,720
great one
61
00:02:08,720 --> 00:02:11,120
movie ratings etc you can see here the
62
00:02:11,120 --> 00:02:13,520
salary rates if you have 10 000 to 20
63
00:02:13,520 --> 00:02:16,000
000 number of employees earning that
64
00:02:16,000 --> 00:02:18,879
rate is 150 20 000 to 30
65
00:02:18,879 --> 00:02:21,760
100 and so forth some of the terms
66
00:02:21,760 --> 00:02:23,840
you'll hear is bucket
67
00:02:23,840 --> 00:02:25,200
this is where you have 10 different
68
00:02:25,200 --> 00:02:27,040
buckets and you want to separate it into
69
00:02:27,040 --> 00:02:28,640
something that makes sense into those 10
70
00:02:28,640 --> 00:02:29,840
buckets
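A minimal sketch of the bucket idea, assuming made-up salaries and a hypothetical helper named salary_bucket (neither comes from the video). It puts each salary into a 10,000-wide range and counts how many employees land in each one:

```python
from collections import Counter

# Hypothetical salaries; the values are made up for illustration.
salaries = [12_000, 18_500, 24_000, 27_300, 15_750, 31_000]


def salary_bucket(salary, width=10_000):
    """Return a label such as '10000-20000' for the range a salary falls in."""
    low = (salary // width) * width
    return f"{low}-{low + width}"


# Count how many employees land in each bucket.
bucket_counts = Counter(salary_bucket(s) for s in salaries)

for label, count in sorted(bucket_counts.items()):
    print(label, count)
```

The range itself is the ordinal category, and the count per range is what a salary table like the one on the slide reports.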
71
00:02:29,840 --> 00:02:32,239
and so when we start talking about
72
00:02:32,239 --> 00:02:34,640
ordinal a lot of times when you get down
73
00:02:34,640 --> 00:02:36,800
to the brass tacks again we're talking
74
00:02:36,800 --> 00:02:38,239
true false
75
00:02:38,239 --> 00:02:40,000
so if you're a member of the 10 to 20k
76
00:02:40,000 --> 00:02:41,519
range
77
00:02:41,519 --> 00:02:43,920
so forth those would each be either part
78
00:02:43,920 --> 00:02:45,920
of that group or you're not but now
79
00:02:45,920 --> 00:02:47,280
we're talking about buckets so we want
80
00:02:47,280 --> 00:02:48,840
to count how many people are in that
81
00:02:48,840 --> 00:02:52,800
bucket quantitative numerical data
82
00:02:52,800 --> 00:02:55,360
falls into two classes discrete or
83
00:02:55,360 --> 00:02:56,640
continuous
84
00:02:56,640 --> 00:02:58,879
and so discrete is data with a finite set of values
85
00:02:58,879 --> 00:03:01,360
which can be categorized class strength
86
00:03:01,360 --> 00:03:04,319
questions answered correctly and runs
87
00:03:04,319 --> 00:03:06,640
hit in cricket a lot of times when you
88
00:03:06,640 --> 00:03:09,040
see this you can think integer
89
00:03:09,040 --> 00:03:11,840
and a very restricted integer i.e
90
00:03:11,840 --> 00:03:13,840
you can only have 100 questions
91
00:03:13,840 --> 00:03:16,239
on a test so it's very discrete
92
00:03:16,239 --> 00:03:18,159
i only have 100 different values that it
93
00:03:18,159 --> 00:03:20,560
can attain so think usually you're
94
00:03:20,560 --> 00:03:22,800
talking about integers but within a very
95
00:03:22,800 --> 00:03:24,879
small range they don't have an open end
96
00:03:24,879 --> 00:03:26,879
or anything like that
97
00:03:26,879 --> 00:03:28,959
so discrete is very solid
98
00:03:28,959 --> 00:03:30,480
simple to count
99
00:03:30,480 --> 00:03:31,760
set number
100
00:03:31,760 --> 00:03:33,680
continuous on the other hand uh
101
00:03:33,680 --> 00:03:36,000
continuous data can take any numerical
102
00:03:36,000 --> 00:03:38,720
value within a range so water pressure
103
00:03:38,720 --> 00:03:41,440
weight of a person etc usually we start
104
00:03:41,440 --> 00:03:43,360
thinking about float values where they
105
00:03:43,360 --> 00:03:45,519
can get phenomenally small and they're
106
00:03:45,519 --> 00:03:47,120
in what they're worth
107
00:03:47,120 --> 00:03:48,640
and there's a whole series of values
108
00:03:48,640 --> 00:03:50,640
that falls right between discrete and
109
00:03:50,640 --> 00:03:52,480
continuous
110
00:03:52,480 --> 00:03:54,400
you can think of the stock market you
111
00:03:54,400 --> 00:03:57,280
have dollar amounts it's still discrete
112
00:03:57,280 --> 00:03:59,439
but it starts to get complicated enough
113
00:03:59,439 --> 00:04:01,040
when you have like you know jump in the
114
00:04:01,040 --> 00:04:03,480
stock market from
115
00:04:03,480 --> 00:04:05,840
525.33 to
116
00:04:05,840 --> 00:04:08,640
580.67
117
00:04:08,640 --> 00:04:10,560
there's a lot of point values in there
118
00:04:10,560 --> 00:04:12,799
it'd still be called discrete but you
119
00:04:12,799 --> 00:04:14,720
start looking at it as almost continuous
120
00:04:14,720 --> 00:04:17,120
because it does have such a variance in
121
00:04:17,120 --> 00:04:19,600
it now uh we talked about no we did we
122
00:04:19,600 --> 00:04:22,240
went over nominal and ordinal
123
00:04:22,240 --> 00:04:24,800
almost true false charts and we looked
124
00:04:24,800 --> 00:04:27,120
at quantitative and numerical data which
125
00:04:27,120 --> 00:04:29,520
will start to get into numbers discrete
126
00:04:29,520 --> 00:04:31,759
you can usually a lot of times discrete
127
00:04:31,759 --> 00:04:33,440
will be put into it could be put into
128
00:04:33,440 --> 00:04:35,680
true false but usually it's not uh so we
129
00:04:35,680 --> 00:04:36,880
want to address this stuff and the first
130
00:04:36,880 --> 00:04:38,960
thing you want to look at is the very
131
00:04:38,960 --> 00:04:40,800
basic which is your algebra so we're
132
00:04:40,800 --> 00:04:43,280
going to take a look at linear algebra
133
00:04:43,280 --> 00:04:44,800
you can remember back when your
134
00:04:44,800 --> 00:04:47,840
euclidean geometry we have a line well
135
00:04:47,840 --> 00:04:49,440
let's go through this we have a linear
136
00:04:49,440 --> 00:04:51,360
algebra is the domain of mathematics
137
00:04:51,360 --> 00:04:54,000
concerning linear equations
138
00:04:54,000 --> 00:04:56,560
and the representations in vector spaces
139
00:04:56,560 --> 00:04:58,479
and through matrixes i told you we're
140
00:04:58,479 --> 00:05:00,320
going to talk about matrixes
141
00:05:00,320 --> 00:05:03,199
uh so a linear equation
142
00:05:03,199 --> 00:05:05,120
is simply
143
00:05:05,120 --> 00:05:09,039
2x plus 4y minus 3z equals 10. very
144
00:05:09,039 --> 00:05:13,520
linear 10x plus 12.4 y equals z and now
145
00:05:13,520 --> 00:05:14,880
you can actually solve these two
146
00:05:14,880 --> 00:05:17,360
equations by combining them
147
00:05:17,360 --> 00:05:18,880
and that's we're talking about a linear
148
00:05:18,880 --> 00:05:20,560
equation
149
00:05:20,560 --> 00:05:24,000
in the vectors we have a plus b equals c
150
00:05:24,000 --> 00:05:25,120
now we're starting to look at a
151
00:05:25,120 --> 00:05:26,479
direction
152
00:05:26,479 --> 00:05:29,039
and these values usually think of an x y
153
00:05:29,039 --> 00:05:30,479
z plot
154
00:05:30,479 --> 00:05:32,320
so each one is a direction
155
00:05:32,320 --> 00:05:33,680
and the actual
156
00:05:33,680 --> 00:05:37,039
distance of like a triangle a b is c
157
00:05:37,039 --> 00:05:38,960
and then your matrix can describe all
158
00:05:38,960 --> 00:05:40,400
kinds of things
159
00:05:40,400 --> 00:05:42,080
i find matrixes
160
00:05:42,080 --> 00:05:44,240
confuse a lot of people
161
00:05:44,240 --> 00:05:46,000
not because they're particularly
162
00:05:46,000 --> 00:05:49,360
difficult but because of the magnitude
163
00:05:49,360 --> 00:05:50,840
and the different things they're used
164
00:05:50,840 --> 00:05:54,800
for and a matrix is a chart
165
00:05:54,800 --> 00:05:57,440
or a you know think of a spreadsheet but
166
00:05:57,440 --> 00:05:59,680
you have your rows and your columns
167
00:05:59,680 --> 00:06:02,240
and you'll see here we have a times b
168
00:06:02,240 --> 00:06:03,840
equals c
169
00:06:03,840 --> 00:06:07,120
very important to know your counts
170
00:06:07,120 --> 00:06:09,199
so depending on how the math is being
171
00:06:09,199 --> 00:06:11,199
done what you're using it for making
172
00:06:11,199 --> 00:06:12,800
sure you have the same rows and number
173
00:06:12,800 --> 00:06:15,120
of columns or a single number there's
174
00:06:15,120 --> 00:06:16,479
all kinds of things that play in that
175
00:06:16,479 --> 00:06:19,039
that can make matrixes confusing
176
00:06:19,039 --> 00:06:20,639
but really has a lot more to do with
177
00:06:20,639 --> 00:06:23,039
what domain you're working in are you
178
00:06:23,039 --> 00:06:24,240
adding in
179
00:06:24,240 --> 00:06:28,400
multiple polynomials where you have like
180
00:06:28,400 --> 00:06:31,600
a x squared plus b y plus you know you
181
00:06:31,600 --> 00:06:32,800
start to see that it can be very
182
00:06:32,800 --> 00:06:34,720
confusing versus a very straightforward
183
00:06:34,720 --> 00:06:35,840
matrix
184
00:06:35,840 --> 00:06:37,520
and let's just go a little deeper into
185
00:06:37,520 --> 00:06:39,600
these because these are such primary
186
00:06:39,600 --> 00:06:41,520
this is what we're here to talk about is
187
00:06:41,520 --> 00:06:43,840
these different math uh mathematical
188
00:06:43,840 --> 00:06:45,919
computations that come up
189
00:06:45,919 --> 00:06:47,440
so we're looking at linear equations
190
00:06:47,440 --> 00:06:49,039
let's dig deeper into that one an
191
00:06:49,039 --> 00:06:51,120
equation having a maximum order of one
192
00:06:51,120 --> 00:06:54,080
is called a linear equation
193
00:06:54,080 --> 00:06:55,919
so it's linear because when you look at
194
00:06:55,919 --> 00:06:58,800
this we have ax plus b equals c which is
195
00:06:58,800 --> 00:07:00,479
a one variable
196
00:07:00,479 --> 00:07:03,440
we have two variable a x plus b y equals
197
00:07:03,440 --> 00:07:07,360
c a x plus b y plus c z equals d
198
00:07:07,360 --> 00:07:09,919
and so forth but all of these
199
00:07:09,919 --> 00:07:12,240
are to the power of one you don't see x
200
00:07:12,240 --> 00:07:14,479
squared you don't see x cubed so we're
201
00:07:14,479 --> 00:07:16,080
talking about linear equations that's
202
00:07:16,080 --> 00:07:17,680
what we're talking about in their
203
00:07:17,680 --> 00:07:18,639
addition
204
00:07:18,639 --> 00:07:20,720
if you have already dived into say
205
00:07:20,720 --> 00:07:22,720
neural networks you should recognize
206
00:07:22,720 --> 00:07:25,919
this ax plus by plus cz
207
00:07:25,919 --> 00:07:28,160
setup plus the intercept
208
00:07:28,160 --> 00:07:30,560
uh which is basically your your neural
209
00:07:30,560 --> 00:07:32,400
network each node adding up all the
210
00:07:32,400 --> 00:07:35,039
different inputs and we can drill down
211
00:07:35,039 --> 00:07:38,000
into that most common formula is your y
212
00:07:38,000 --> 00:07:41,199
equals mx plus c
213
00:07:41,199 --> 00:07:43,599
so you have your y
214
00:07:43,599 --> 00:07:46,240
equals the m which is your slope
215
00:07:46,240 --> 00:07:48,639
your x value plus c
216
00:07:48,639 --> 00:07:50,400
which is your
217
00:07:50,400 --> 00:07:52,560
y-intercept you kind of labeled it wrong
218
00:07:52,560 --> 00:07:54,400
here
219
00:07:54,400 --> 00:07:56,240
threw me for a loop but the the c would
220
00:07:56,240 --> 00:07:58,479
be your y-intercept so when you set x
221
00:07:58,479 --> 00:08:01,680
equal to zero y equals c and that's
222
00:08:01,680 --> 00:08:04,000
that's your y-intercept right there
223
00:08:04,000 --> 00:08:06,080
uh and that's they just had a reverse
224
00:08:06,080 --> 00:08:08,879
value of y when x equals zero equals the
225
00:08:08,879 --> 00:08:11,280
y-intercept which is c and your slope
226
00:08:11,280 --> 00:08:13,759
gradient line which is your m so you get
227
00:08:13,759 --> 00:08:16,319
your y equals two x plus three
228
00:08:16,319 --> 00:08:18,400
and there's lots of easy ways to compute
229
00:08:18,400 --> 00:08:20,160
this this way this is why we always
230
00:08:20,160 --> 00:08:21,680
start with the most basic one when we're
231
00:08:21,680 --> 00:08:23,520
solving one of these problems and then
232
00:08:23,520 --> 00:08:24,720
of course the
233
00:08:24,720 --> 00:08:26,400
one of the most important takeaways is
234
00:08:26,400 --> 00:08:28,879
the slope gradient of the line
235
00:08:28,879 --> 00:08:30,800
so the slope is very important that m
236
00:08:30,800 --> 00:08:31,759
value
237
00:08:31,759 --> 00:08:33,200
in this case we went ahead and solved
238
00:08:33,200 --> 00:08:34,479
this
239
00:08:34,479 --> 00:08:37,120
if you have y equals 2x plus 3 you can
240
00:08:37,120 --> 00:08:39,360
see how it has a nice line graph here on
241
00:08:39,360 --> 00:08:41,279
the right
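As a small sketch of the y equals 2x plus 3 example, the function name line and the sample x values below are assumptions chosen for illustration:

```python
def line(x, m=2, c=3):
    """Evaluate y = m*x + c, where m is the slope and c is the y-intercept."""
    return m * x + c


xs = [0, 1, 2, 3]
ys = [line(x) for x in xs]

# Setting x = 0 returns the y-intercept c.
intercept = line(0)

# The change in y per unit step in x is the slope m.
slope = (ys[1] - ys[0]) / (xs[1] - xs[0])

print(ys, intercept, slope)
```

Plotting xs against ys gives the straight line described on the slide.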
242
00:08:41,279 --> 00:08:44,240
so matrixes a matrix refers to a
243
00:08:44,240 --> 00:08:46,480
rectangular representation of an array
244
00:08:46,480 --> 00:08:50,000
of numbers arranged in columns and rows
245
00:08:50,000 --> 00:08:52,640
so we're talking m rows by n columns
246
00:08:52,640 --> 00:08:55,200
here a11 denotes the element of the
247
00:08:55,200 --> 00:08:58,399
first row in the first column similarly
248
00:08:58,399 --> 00:09:01,279
a12 and it's really pronounced a one one in
249
00:09:01,279 --> 00:09:04,320
this particular setup so it's a row one
250
00:09:04,320 --> 00:09:07,440
column one a 12 is a
251
00:09:07,440 --> 00:09:09,680
row one column two
252
00:09:09,680 --> 00:09:13,040
first row and second column and so on
253
00:09:13,040 --> 00:09:14,880
and there's a lot of ways to denote this
254
00:09:14,880 --> 00:09:17,200
i've seen these as like a capital letter
255
00:09:17,200 --> 00:09:20,160
a smaller case a for the top row or i
256
00:09:20,160 --> 00:09:22,160
mean you can see where they can go all
257
00:09:22,160 --> 00:09:23,760
kinds of different directions as far as
258
00:09:23,760 --> 00:09:25,519
the value
259
00:09:25,519 --> 00:09:26,800
you just take a moment to realize
260
00:09:26,800 --> 00:09:28,399
there needs to be some designation as
261
00:09:28,399 --> 00:09:30,640
far as what row it's in and what column
262
00:09:30,640 --> 00:09:32,320
it's in
263
00:09:32,320 --> 00:09:34,880
and we have our basic operations we have
264
00:09:34,880 --> 00:09:36,240
addition so when you think about
265
00:09:36,240 --> 00:09:38,399
addition you have uh
266
00:09:38,399 --> 00:09:41,680
two matrixes of two by two and you just
267
00:09:41,680 --> 00:09:44,320
add each individual number in that
268
00:09:44,320 --> 00:09:46,080
matrix and then when you get to the
269
00:09:46,080 --> 00:09:48,000
bottom you have uh in this case the
270
00:09:48,000 --> 00:09:50,800
solution is twelve 10 plus 2 is 12 5
271
00:09:50,800 --> 00:09:53,360
plus 3 is 8 and so on and the same thing
272
00:09:53,360 --> 00:09:55,279
with subtraction
273
00:09:55,279 --> 00:09:57,680
now again you're counting matrixes you
274
00:09:57,680 --> 00:09:59,519
want to check your
275
00:09:59,519 --> 00:10:01,519
dimensions of the matrix
276
00:10:01,519 --> 00:10:03,760
the shape you'll see shape come up a lot
277
00:10:03,760 --> 00:10:05,279
in programming so we're talking about
278
00:10:05,279 --> 00:10:07,920
dimensions we're talking about the shape
279
00:10:07,920 --> 00:10:10,399
if the two shapes are equal
280
00:10:10,399 --> 00:10:12,080
this is what happens when you add them
281
00:10:12,080 --> 00:10:14,480
together or subtract them
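A short NumPy sketch of the addition and subtraction rule. Only the top row sums (10 plus 2 and 5 plus 3) are spoken in the video; the bottom rows of A and B are assumed values chosen so the sum matches the 12, 8, 14, 21 result that shows up later in the transpose example:

```python
import numpy as np

# Top rows follow the spoken example; bottom rows are assumed.
A = np.array([[10, 5],
              [6, 9]])
B = np.array([[2, 3],
              [8, 12]])

# Addition and subtraction work entry by entry, so the shapes must match.
assert A.shape == B.shape

total = A + B
difference = A - B

print(total)
print(difference)
```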
282
00:10:14,480 --> 00:10:16,560
and we have multiplication when you look
283
00:10:16,560 --> 00:10:18,320
at the multiplication you end up with a
284
00:10:18,320 --> 00:10:21,279
very uh a slightly different setup going
285
00:10:21,279 --> 00:10:22,720
now
286
00:10:22,720 --> 00:10:25,040
if we look at our last one we're uh uh
287
00:10:25,040 --> 00:10:26,640
we're like why
288
00:10:26,640 --> 00:10:28,000
this always gets to me when we get to
289
00:10:28,000 --> 00:10:30,240
matrixes they don't really say why you
290
00:10:30,240 --> 00:10:32,800
multiply matrixes
291
00:10:32,800 --> 00:10:34,399
you know my first thought is one times
292
00:10:34,399 --> 00:10:36,480
two four times three but if you look at
293
00:10:36,480 --> 00:10:38,560
this we get one times two plus four
294
00:10:38,560 --> 00:10:39,920
times three
295
00:10:39,920 --> 00:10:43,040
one times three plus four times five
296
00:10:43,040 --> 00:10:45,200
uh six times two plus three times three
297
00:10:45,200 --> 00:10:47,760
six times three plus three times five if
298
00:10:47,760 --> 00:10:50,000
you're looking at these matrixes uh
299
00:10:50,000 --> 00:10:52,560
think of this more as an equation
300
00:10:52,560 --> 00:10:54,560
and so we have if you remember we went
301
00:10:54,560 --> 00:10:56,240
back up here for our multiple line
302
00:10:56,240 --> 00:10:58,240
equations let's just go back up a couple
303
00:10:58,240 --> 00:11:00,399
slides where we were looking at
304
00:11:00,399 --> 00:11:02,399
two variables so this is a two variable
305
00:11:02,399 --> 00:11:06,959
equation a x plus b y equals c
306
00:11:06,959 --> 00:11:09,440
and this is a way to make it very quick
307
00:11:09,440 --> 00:11:11,200
to solve these variables and that's why
308
00:11:11,200 --> 00:11:12,720
you have the matrix and that's why you
309
00:11:12,720 --> 00:11:14,480
do
310
00:11:14,480 --> 00:11:16,720
the multiplication the way they do
311
00:11:16,720 --> 00:11:19,680
and this is the dot product of uh one
312
00:11:19,680 --> 00:11:20,959
times two
313
00:11:20,959 --> 00:11:24,240
plus four times three
314
00:11:24,240 --> 00:11:28,800
one times three plus four times five
315
00:11:28,800 --> 00:11:32,320
six times two plus three times three
316
00:11:32,320 --> 00:11:34,320
six times three plus three times five
317
00:11:34,320 --> 00:11:36,079
and it gives us a nice little
318
00:11:36,079 --> 00:11:39,519
14 23 21 and 33 over here which then can
319
00:11:39,519 --> 00:11:42,959
be used and reduced down to a simple
320
00:11:42,959 --> 00:11:45,279
formula as far as solving the variables
321
00:11:45,279 --> 00:11:47,600
as you have enough inputs
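The spoken arithmetic can be checked with NumPy. The two input matrices below are inferred from the products read out in the video (one times two plus four times three, and so on), so treat them as a reconstruction rather than a copy of the slide:

```python
import numpy as np

A = np.array([[1, 4],
              [6, 3]])
B = np.array([[2, 3],
              [3, 5]])

# Each entry is a row of A dotted with a column of B,
# e.g. the top-left entry is 1*2 + 4*3 = 14.
C = A @ B

print(C)
```

The @ operator is NumPy's matrix multiplication; np.dot(A, B) gives the same result for 2-D arrays.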
322
00:11:47,600 --> 00:11:49,279
and then in matrix operations when
323
00:11:49,279 --> 00:11:51,600
you're dealing with a lot of matrixes
324
00:11:51,600 --> 00:11:54,480
now keep in mind multiplying matrixes is
325
00:11:54,480 --> 00:11:56,000
different than finding the element-wise product of
326
00:11:56,000 --> 00:11:58,560
two matrixes okay so we're talking about
327
00:11:58,560 --> 00:12:00,240
multiplication we're talking about
328
00:12:00,240 --> 00:12:02,959
solving uh for equations when you're
329
00:12:02,959 --> 00:12:04,560
finding the element-wise product you are just finding
330
00:12:04,560 --> 00:12:06,639
one times two keep that in mind because
331
00:12:06,639 --> 00:12:08,160
that does come up i've had that come up
332
00:12:08,160 --> 00:12:10,160
a number of times where i am altering
333
00:12:10,160 --> 00:12:12,160
data and i get confused as to what i'm
334
00:12:12,160 --> 00:12:13,360
doing with it
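To make the distinction concrete, here is a hedged NumPy sketch using the same inferred matrices as the multiplication example. In NumPy the * operator multiplies entry by entry, while @ performs matrix multiplication:

```python
import numpy as np

A = np.array([[1, 4],
              [6, 3]])
B = np.array([[2, 3],
              [3, 5]])

# Entry-by-entry product: 1*2, 4*3, 6*3, 3*5.
elementwise = A * B

# Matrix multiplication: rows of A dotted with columns of B.
matrix_product = A @ B

print(elementwise)
print(matrix_product)
```

Mixing up the two is an easy mistake because both run without error when the shapes are equal.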
335
00:12:13,360 --> 00:12:16,240
uh transpose flipping the matrix over its
336
00:12:16,240 --> 00:12:19,120
diagonal comes up all the time where you
337
00:12:19,120 --> 00:12:20,959
have you still have 12 but instead of it
338
00:12:20,959 --> 00:12:24,399
being 12 8 it's now 12 14
339
00:12:24,399 --> 00:12:26,560
8 21 you're just flipping the columns
340
00:12:26,560 --> 00:12:28,079
and the rows
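A quick sketch of the transpose with the values read out above, plus an assumed 2 by 3 matrix to show that the shape flips as well:

```python
import numpy as np

A = np.array([[12, 8],
              [14, 21]])

# .T swaps rows and columns across the main diagonal.
A_T = A.T

# Assumed extra example: a 2x3 matrix becomes 3x2.
wide = np.array([[1, 2, 3],
                 [4, 5, 6]])
tall = wide.T

print(A_T)
print(tall.shape)
```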
341
00:12:28,079 --> 00:12:30,720
and then of course you can do an inverse
342
00:12:30,720 --> 00:12:32,399
changing the signs of the values across
343
00:12:32,399 --> 00:12:34,000
this main diagonal
344
00:12:34,000 --> 00:12:35,600
and you can see here we have the inverse
345
00:12:35,600 --> 00:12:38,560
a to the minus 1 and ends up with
346
00:12:38,560 --> 00:12:41,760
instead of 12 8 14 12 is now minus 22
347
00:12:41,760 --> 00:12:43,279
minus 12.
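The slide's numbers are hard to recover from the captions, so this sketch uses a made-up matrix instead. It relies on the standard definition that a matrix times its inverse gives the identity matrix, and on NumPy's np.linalg.inv:

```python
import numpy as np

# Hypothetical invertible matrix (determinant is 4*6 - 7*2 = 10).
A = np.array([[4.0, 7.0],
              [2.0, 6.0]])

A_inv = np.linalg.inv(A)

# Multiplying a matrix by its inverse should give the identity matrix.
identity_check = A @ A_inv

print(A_inv)
print(np.round(identity_check, 10))
```

Only square matrices with a non-zero determinant have an inverse; np.linalg.inv raises LinAlgError otherwise.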
348
00:12:43,279 --> 00:12:44,560
vectors
349
00:12:44,560 --> 00:12:47,440
vector just means we have
350
00:12:47,440 --> 00:12:50,320
a value and a direction
351
00:12:50,320 --> 00:12:52,639
and we have down four numbers here on
352
00:12:52,639 --> 00:12:54,240
our vector
353
00:12:54,240 --> 00:12:56,240
uh in mathematics a one dimensional
354
00:12:56,240 --> 00:12:59,600
matrix is called a vector uh so
355
00:12:59,600 --> 00:13:01,360
if you have your x plot and you have a
356
00:13:01,360 --> 00:13:04,000
single value that value is along the x
357
00:13:04,000 --> 00:13:06,720
axis and it's a single dimension
358
00:13:06,720 --> 00:13:08,480
if you have two dimensions you can think
359
00:13:08,480 --> 00:13:10,480
about putting them on a graph you might
360
00:13:10,480 --> 00:13:13,519
have x and you might have y and each
361
00:13:13,519 --> 00:13:15,440
value denotes a direction and then of
362
00:13:15,440 --> 00:13:17,120
course the actual distance is going to
363
00:13:17,120 --> 00:13:20,480
be the hypotenuse of that triangle and
364
00:13:20,480 --> 00:13:22,000
you can do that with three dimensions
365
00:13:22,000 --> 00:13:23,600
x y and z
366
00:13:23,600 --> 00:13:25,040
and you can do it all the way to nth
367
00:13:25,040 --> 00:13:27,920
dimensions so when they talk about the k
368
00:13:27,920 --> 00:13:29,519
means
369
00:13:29,519 --> 00:13:31,760
for categorizing and how close data is
370
00:13:31,760 --> 00:13:34,560
together they will compute that based on
371
00:13:34,560 --> 00:13:36,320
the pythagorean theorem so you would
372
00:13:36,320 --> 00:13:37,440
take
373
00:13:37,440 --> 00:13:39,360
the square of each value add them all
374
00:13:39,360 --> 00:13:41,040
together and find the square root and
375
00:13:41,040 --> 00:13:42,880
that gives you a distance
376
00:13:42,880 --> 00:13:44,880
as far as where that point is where that
377
00:13:44,880 --> 00:13:47,440
vector exists or an actual point value
378
00:13:47,440 --> 00:13:48,880
and then you can compare that point
379
00:13:48,880 --> 00:13:51,440
value to another one it makes a very
380
00:13:51,440 --> 00:13:54,880
easy comparison versus comparing 50 or
381
00:13:54,880 --> 00:13:56,560
60 different numbers
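The square, add, and square-root recipe described here is the Euclidean distance. A minimal sketch, with a hypothetical function name and made-up points:

```python
import math


def euclidean_distance(p, q):
    """Square each coordinate difference, add them up, take the square root."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


# Length of the vector (3, 4) measured from the origin: a 3-4-5 triangle.
length = euclidean_distance((3, 4), (0, 0))

# The same recipe works in three or more dimensions.
gap = euclidean_distance((1, 2, 3), (4, 6, 3))

print(length, gap)
```

Clustering methods such as k-means typically compare points with this kind of distance.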
382
00:13:56,560 --> 00:13:58,959
and that brings us up to eigenvectors
383
00:13:58,959 --> 00:14:01,519
and eigenvalues
384
00:14:01,519 --> 00:14:04,160
eigenvectors the vectors that don't
385
00:14:04,160 --> 00:14:07,519
change their span during transformation
386
00:14:07,519 --> 00:14:10,160
and eigenvalues the scalar values that
387
00:14:10,160 --> 00:14:12,639
are associated to the vectors
388
00:14:12,639 --> 00:14:14,000
conceptually
389
00:14:14,000 --> 00:14:16,320
you can think of the vector as your
390
00:14:16,320 --> 00:14:18,720
picture you have a picture it's uh
391
00:14:18,720 --> 00:14:21,279
two dimensions x and y
392
00:14:21,279 --> 00:14:22,959
and so when you do those two dimensions
393
00:14:22,959 --> 00:14:25,120
and those two values or whatever that
394
00:14:25,120 --> 00:14:27,760
value is
395
00:14:27,920 --> 00:14:30,720
that is that point but the values change
396
00:14:30,720 --> 00:14:32,959
when you skew it and so
397
00:14:32,959 --> 00:14:36,320
if we take and we have a vector a
398
00:14:36,320 --> 00:14:38,639
and that's a set value uh
399
00:14:38,639 --> 00:14:41,519
b is um you have a and b
400
00:14:41,519 --> 00:14:44,399
which is your eigenvector two is the
401
00:14:44,399 --> 00:14:47,199
eigenvalue so we're altering
402
00:14:47,199 --> 00:14:50,240
all the values by two that means we're
403
00:14:50,240 --> 00:14:51,839
maybe we're stretching it out one
404
00:14:51,839 --> 00:14:53,760
direction making it tall uh if you're
405
00:14:53,760 --> 00:14:56,160
doing picture editing
406
00:14:56,160 --> 00:14:57,920
that's one of the places this comes in
407
00:14:57,920 --> 00:15:00,079
but you can see when you're transforming
408
00:15:00,079 --> 00:15:02,399
uh your different information how you
409
00:15:02,399 --> 00:15:05,600
transform it is then your eigenvalue
410
00:15:05,600 --> 00:15:07,839
and you can see here vector after linear
411
00:15:07,839 --> 00:15:11,519
transformation uh we have 3a a is
412
00:15:11,519 --> 00:15:14,959
the eigenvector 3 is the eigenvalue
413
00:15:14,959 --> 00:15:17,199
so a doesn't change that's whatever we
414
00:15:17,199 --> 00:15:18,399
started with that's your original
415
00:15:18,399 --> 00:15:21,040
picture and 3
416
00:15:21,040 --> 00:15:23,680
is skewing it one direction and maybe
417
00:15:23,680 --> 00:15:26,079
a b is being skewed another direction
418
00:15:26,079 --> 00:15:27,600
and so you have a nice tilted picture
419
00:15:27,600 --> 00:15:29,600
because you altered it by those by the
420
00:15:29,600 --> 00:15:31,760
eigenvalues
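A hedged sketch of the same idea in NumPy. The transformation matrix T is an assumption picked to echo the factors 2 and 3 mentioned above: it stretches the x direction by 2 and the y direction by 3.

```python
import numpy as np

# Hypothetical transformation: stretch x by 2 and y by 3.
T = np.array([[2.0, 0.0],
              [0.0, 3.0]])

eigenvalues, eigenvectors = np.linalg.eig(T)

# Columns of `eigenvectors` pair up with the entries of `eigenvalues`.
for value, vector in zip(eigenvalues, eigenvectors.T):
    # Transforming an eigenvector only rescales it by its eigenvalue;
    # its direction (span) stays the same.
    assert np.allclose(T @ vector, value * vector)

print(eigenvalues)
```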
421
00:15:31,760 --> 00:15:34,720
so let's go ahead and pull up a demo on
422
00:15:34,720 --> 00:15:37,519
linear algebra and to do this i'm going
423
00:15:37,519 --> 00:15:40,320
to go through my trusted anaconda into
424
00:15:40,320 --> 00:15:42,399
my jupyter notebook
425
00:15:42,399 --> 00:15:44,320
and we'll create a new
426
00:15:44,320 --> 00:15:47,199
notebook called linear algebra
427
00:15:47,199 --> 00:15:49,600
since we are working in python we're
428
00:15:49,600 --> 00:15:51,600
going to use our numpy i always import
429
00:15:51,600 --> 00:15:54,320
that as np or numpy array probably the
430
00:15:54,320 --> 00:15:56,160
most popular
431
00:15:56,160 --> 00:16:00,480
module for doing matrixes and things in
432
00:16:00,480 --> 00:16:02,240
given that this is part of a series i'm
433
00:16:02,240 --> 00:16:04,480
not going to go too much into numpy we
434
00:16:04,480 --> 00:16:06,000
are going to go ahead and create two
435
00:16:06,000 --> 00:16:08,240
different variables a for a numpy array
436
00:16:08,240 --> 00:16:12,000
10 15 and b 20 9
437
00:16:12,000 --> 00:16:13,199
we'll go ahead and run this and you can
438
00:16:13,199 --> 00:16:16,079
see there's our two arrays 10 15 20 9 and
439
00:16:16,079 --> 00:16:17,519
i went and added a space there in
440
00:16:17,519 --> 00:16:18,959
between
441
00:16:18,959 --> 00:16:20,880
so it's easier to read
442
00:16:20,880 --> 00:16:23,440
and since it's the last line we don't
443
00:16:23,440 --> 00:16:24,720
have to put the print statement on it
444
00:16:24,720 --> 00:16:27,279
unless you want we can simply but we can
445
00:16:27,279 --> 00:16:30,639
simply do a plus b so when i run this uh
446
00:16:30,639 --> 00:16:35,680
we have 10 15 20 9 and we get 30 24 which
447
00:16:35,680 --> 00:16:39,759
is what you expect 10 plus 20 15 plus 9
448
00:16:39,759 --> 00:16:41,600
you could almost look at this addition
449
00:16:41,600 --> 00:16:44,399
as being
450
00:16:45,360 --> 00:16:46,959
just adding up the columns on here
451
00:16:46,959 --> 00:16:49,279
coming down and if we wanted to do it a
452
00:16:49,279 --> 00:16:52,560
different way we could also do a dot t
453
00:16:52,560 --> 00:16:54,399
plus b dot t
454
00:16:54,399 --> 00:16:56,880
remember that t flips them and so if we
455
00:16:56,880 --> 00:17:00,160
do that we now get them uh we now have
456
00:17:00,160 --> 00:17:02,800
30 24 going the other way
457
00:17:02,800 --> 00:17:05,119
we could also do something kind of fun
458
00:17:05,119 --> 00:17:06,559
there's a lot of different ways to do
459
00:17:06,559 --> 00:17:07,919
this
460
00:17:07,919 --> 00:17:10,799
as far as a plus b i can also do a plus
461
00:17:10,799 --> 00:17:11,679
b
462
00:17:11,679 --> 00:17:13,280
dot t
463
00:17:13,280 --> 00:17:14,480
and you're going to see that that will
464
00:17:14,480 --> 00:17:17,119
come out the same the 30 24 whether i
465
00:17:17,119 --> 00:17:19,359
transpose a and b or transpose them both
466
00:17:19,359 --> 00:17:21,918
at the end
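The addition steps above can be sketched as follows. This is a minimal sketch, assuming a and b are defined as 1-by-2 NumPy arrays; the exact shape typed on screen is not recoverable from the audio, so the double brackets are an assumption.

```python
import numpy as np

# Assumed shapes: 1x2 row vectors, so that .T visibly flips them into columns
a = np.array([[10, 15]])
b = np.array([[20, 9]])

total = a + b               # element-wise addition: 10 + 20 and 15 + 9
flipped_each = a.T + b.T    # transpose each operand first, then add
flipped_sum = (a + b).T     # add first, then transpose the result

print(total)
print(flipped_each)
print(flipped_sum)
```

With a plain 1-D array, .T returns the array unchanged, so the flip only shows up for 2-D arrays like the ones assumed here.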
467
00:17:22,400 --> 00:17:24,559
and likewise we can very easily subtract
468
00:17:24,559 --> 00:17:28,000
two vectors i can go a minus b
469
00:17:28,000 --> 00:17:31,679
and we run that and we get minus 10 6
470
00:17:31,679 --> 00:17:33,440
now remember this is the last line in
471
00:17:33,440 --> 00:17:35,120
this particular section so it's all right not
472
00:17:35,120 --> 00:17:37,600
to put the print around it
473
00:17:37,600 --> 00:17:40,080
and just like we did before
474
00:17:40,080 --> 00:17:42,799
we can transpose either the individual
475
00:17:42,799 --> 00:17:45,440
or we can transpose the main setup and
476
00:17:45,440 --> 00:17:48,000
then we get a minus 10 6 going the other
477
00:17:48,000 --> 00:17:50,240
way
478
00:17:51,520 --> 00:17:53,440
now we didn't mention this in our notes
479
00:17:53,440 --> 00:17:56,160
but you can also do a scalar
480
00:17:56,160 --> 00:17:57,760
multiplication
481
00:17:57,760 --> 00:17:59,440
and just put down the scalar so you can
482
00:17:59,440 --> 00:18:00,960
remember that
483
00:18:00,960 --> 00:18:02,880
uh what we're talking about here is i
484
00:18:02,880 --> 00:18:04,000
have
485
00:18:04,000 --> 00:18:04,880
this
486
00:18:04,880 --> 00:18:06,799
array here u
487
00:18:06,799 --> 00:18:08,880
and if i go a
488
00:18:08,880 --> 00:18:11,039
times u
489
00:18:11,039 --> 00:18:12,960
we'll take the value 2 we'll multiply it
490
00:18:12,960 --> 00:18:15,840
by every value in here so 2 times 30 is
491
00:18:15,840 --> 00:18:19,360
60 2 times 15 is 30
492
00:18:19,360 --> 00:18:22,400
and just like we did before
493
00:18:22,400 --> 00:18:24,240
this happens a lot because when you're
494
00:18:24,240 --> 00:18:26,640
doing matrices you do need to flip them
495
00:18:26,640 --> 00:18:29,600
you get 60 30 coming this way
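The subtraction and scalar multiplication just described might look like this. It is a sketch under assumptions: the arrays are written as 1-by-2 so the transpose is visible, and the scalar is named `scalar` here (a hypothetical name) to avoid clashing with the array a.

```python
import numpy as np

a = np.array([[10, 15]])
b = np.array([[20, 9]])
difference = a - b          # element-wise: 10 - 20 and 15 - 9

scalar = 2                  # the single value (hypothetical variable name)
u = np.array([[30, 15]])    # the array being scaled
scaled = scalar * u         # every entry is multiplied by 2

print(difference)
print(difference.T)         # same numbers, going the other way
print(scaled)
print(scaled.T)
```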
496
00:18:29,600 --> 00:18:32,320
so in numpy uh we have what they call
497
00:18:32,320 --> 00:18:35,440
dot product
498
00:18:35,600 --> 00:18:37,760
and uh with this when it's used on two
499
00:18:37,760 --> 00:18:40,160
dimensional arrays it is the equivalent of
500
00:18:40,160 --> 00:18:42,799
matrix multiplication remember we
501
00:18:42,799 --> 00:18:45,120
were talking about matrix multiplication
502
00:18:45,120 --> 00:18:47,919
uh where it is the
503
00:18:47,919 --> 00:18:50,960
well let's walk through it
504
00:18:50,960 --> 00:18:54,240
we'll go ahead and start by defining two
505
00:18:54,240 --> 00:18:58,160
numpy arrays we'll have uh 10 20 and 25 6 for
506
00:18:58,160 --> 00:19:00,400
our u and our v uh and then we're going
507
00:19:00,400 --> 00:19:02,000
to go ahead and do
508
00:19:02,000 --> 00:19:04,080
if we take
509
00:19:04,080 --> 00:19:05,679
the values
510
00:19:05,679 --> 00:19:08,640
and if you remember correctly
511
00:19:08,640 --> 00:19:11,679
a dot product like this would be 10 times 25
512
00:19:11,679 --> 00:19:14,640
plus 20 times 6.
513
00:19:14,640 --> 00:19:16,240
we'll go ahead and
514
00:19:16,240 --> 00:19:18,880
print that
515
00:19:20,960 --> 00:19:23,200
there we go
516
00:19:23,200 --> 00:19:25,919
and then we'll go ahead and do the np
517
00:19:25,919 --> 00:19:27,600
dot dot
518
00:19:27,600 --> 00:19:30,559
of u comma
519
00:19:31,200 --> 00:19:32,880
v
520
00:19:32,880 --> 00:19:35,120
and we'll find when we do this we go and
521
00:19:35,120 --> 00:19:36,640
run this
522
00:19:36,640 --> 00:19:39,720
we're going to get 370
523
00:19:39,720 --> 00:19:42,799
370. so this is a straight multiplication
524
00:19:42,799 --> 00:19:45,679
where they use it to solve
525
00:19:45,679 --> 00:19:48,880
linear algebra when you have multiple
526
00:19:48,880 --> 00:19:50,880
numbers going across and so this could
527
00:19:50,880 --> 00:19:52,320
be very complicated we could have a
528
00:19:52,320 --> 00:19:53,679
whole string of different variables
529
00:19:53,679 --> 00:19:56,480
going in here but for this we get a nice
530
00:19:56,480 --> 00:20:00,160
value for our dot multiplication
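The dot product walked through above can be checked both by hand and with np.dot. A minimal sketch using the u and v values from the narration:

```python
import numpy as np

u = np.array([10, 20])
v = np.array([25, 6])

# Multiply matching entries and add them up: 10*25 + 20*6
by_hand = u[0] * v[0] + u[1] * v[1]

# np.dot does the same multiply-and-sum for 1-D arrays
with_numpy = np.dot(u, v)

print(by_hand, with_numpy)  # both are 370
```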
531
00:20:00,480 --> 00:20:01,840
and we did
532
00:20:01,840 --> 00:20:03,600
addition earlier which is just your
533
00:20:03,600 --> 00:20:05,200
basic addition
534
00:20:05,200 --> 00:20:06,799
and of course the matrix you can get
535
00:20:06,799 --> 00:20:09,120
very complicated on these or
536
00:20:09,120 --> 00:20:11,840
in this case we'll go ahead and do
537
00:20:11,840 --> 00:20:13,440
let's create two
538
00:20:13,440 --> 00:20:15,840
more complex matrices
539
00:20:15,840 --> 00:20:19,280
this one is a matrix of
540
00:20:19,280 --> 00:20:22,640
you know 12 10 4 6 4 31. we'll just
541
00:20:22,640 --> 00:20:24,080
print out a so you can see what that
542
00:20:24,080 --> 00:20:25,919
looks like here's print
543
00:20:25,919 --> 00:20:27,440
a
544
00:20:27,440 --> 00:20:30,000
we print a out you can see that we have
545
00:20:30,000 --> 00:20:31,840
a
546
00:20:31,840 --> 00:20:35,840
2 by 3 layer matrix for a
547
00:20:35,840 --> 00:20:37,919
and we can also put together
548
00:20:37,919 --> 00:20:39,280
always kind of fun when you're playing
549
00:20:39,280 --> 00:20:41,039
with print values
550
00:20:41,039 --> 00:20:42,720
we could do something like this we could
551
00:20:42,720 --> 00:20:44,400
go in here
552
00:20:44,400 --> 00:20:45,520
there we go
553
00:20:45,520 --> 00:20:48,080
we could print a we have it end with
554
00:20:48,080 --> 00:20:49,280
equals a
555
00:20:49,280 --> 00:20:50,159
run
556
00:20:50,159 --> 00:20:52,000
and this kind of gives it a nice look
557
00:20:52,000 --> 00:20:54,400
here's your matrix that's all this is
558
00:20:54,400 --> 00:20:56,559
comma n means it just tags it on the end
559
00:20:56,559 --> 00:20:59,120
that's all that is doing on there
560
00:20:59,120 --> 00:21:01,200
and then we can simply add in what is a
561
00:21:01,200 --> 00:21:02,960
plus b and you should already guess
562
00:21:02,960 --> 00:21:04,240
because this is the same as what we did
563
00:21:04,240 --> 00:21:06,240
before there's no difference
564
00:21:06,240 --> 00:21:08,400
we do a simple vector addition we have
565
00:21:08,400 --> 00:21:12,080
12 plus 2 is 14 10 plus 8 is 18 and so
566
00:21:12,080 --> 00:21:13,120
on
567
00:21:13,120 --> 00:21:15,760
and just like we did the matrix addition
568
00:21:15,760 --> 00:21:19,280
we can also do a minus b
569
00:21:19,280 --> 00:21:21,760
and do our matrix subtraction
570
00:21:21,760 --> 00:21:24,320
and we look at this we have what 12
571
00:21:24,320 --> 00:21:31,400
minus 2 is 10 10 minus 8 um where are we
572
00:21:32,640 --> 00:21:35,760
oh there we go eight minus uh
573
00:21:35,760 --> 00:21:37,200
confusing what i'm looking at i should
574
00:21:37,200 --> 00:21:39,120
have reprinted out the original numbers
575
00:21:39,120 --> 00:21:41,840
uh but we can see here 12 minus 2 is of
576
00:21:41,840 --> 00:21:45,039
course 10 10 minus 8 is 2
577
00:21:45,039 --> 00:21:48,880
4 minus 46 is minus 42 and so forth so
578
00:21:48,880 --> 00:21:50,799
same subtraction as before we just
579
00:21:50,799 --> 00:21:52,240
call it matrix subtraction it's
580
00:21:52,240 --> 00:21:54,559
identical
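A sketch of the matrix addition and subtraction above. The matrix a and the first row of b follow the numbers read out in the narration; the second row of b is never stated, so the values used for it here are hypothetical placeholders.

```python
import numpy as np

a = np.array([[12, 10, 4],
              [6, 4, 31]])

# First row matches the narration (12 + 2 = 14, 10 + 8 = 18, 4 - 46 = -42).
# Second row is a hypothetical placeholder; it is not given in the audio.
b = np.array([[2, 8, 46],
              [1, 2, 3]])

matrix_sum = a + b          # element-wise, same rule as vector addition
matrix_diff = a - b         # element-wise, same rule as vector subtraction

print(matrix_sum)
print(matrix_diff)
```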
581
00:21:54,559 --> 00:21:56,400
now if you remember up here we had
582
00:21:56,400 --> 00:21:59,120
scalar addition we're adding just one
583
00:21:59,120 --> 00:22:00,159
number
584
00:22:00,159 --> 00:22:02,480
to a matrix you can also do scalar
585
00:22:02,480 --> 00:22:04,159
multiplication
586
00:22:04,159 --> 00:22:06,559
and so simply if you have a single value
587
00:22:06,559 --> 00:22:09,039
a and you have b which is your array we
588
00:22:09,039 --> 00:22:13,440
can also do a times b when we run that
589
00:22:13,440 --> 00:22:16,880
you can see here we have 2 times 4 is 8
590
00:22:16,880 --> 00:22:19,840
5 times 4 is 20 and so forth you're just
591
00:22:19,840 --> 00:22:21,679
multiplying the 4 across each one of
592
00:22:21,679 --> 00:22:23,280
these values
593
00:22:23,280 --> 00:22:24,799
and this is an interesting one that
594
00:22:24,799 --> 00:22:27,840
comes up a little bit of a brain teaser
595
00:22:27,840 --> 00:22:32,400
is matrix and vector multiplication
596
00:22:32,400 --> 00:22:35,520
and so when we're looking at this
597
00:22:35,520 --> 00:22:36,640
we are
598
00:22:36,640 --> 00:22:37,919
just doing regular arrays it doesn't
599
00:22:37,919 --> 00:22:40,080
necessarily have to be a numpy array we
600
00:22:40,080 --> 00:22:41,679
have a
601
00:22:41,679 --> 00:22:44,080
which has our
602
00:22:44,080 --> 00:22:46,880
array of arrays and b which is a single
603
00:22:46,880 --> 00:22:51,280
array and so we can from here
604
00:22:51,520 --> 00:22:53,600
do the dot
605
00:22:53,600 --> 00:22:55,360
a b
606
00:22:55,360 --> 00:22:57,600
and this is going to return two values
607
00:22:57,600 --> 00:23:00,000
and the first value is that is you could
608
00:23:00,000 --> 00:23:01,600
say it's like
609
00:23:01,600 --> 00:23:03,039
we're doing the
610
00:23:03,039 --> 00:23:05,120
this array b array
611
00:23:05,120 --> 00:23:07,760
first with a and then with a second one
612
00:23:07,760 --> 00:23:09,200
and so it splits it up so you have a
613
00:23:09,200 --> 00:23:10,880
matrix of vector multiplication and you
614
00:23:10,880 --> 00:23:12,240
can mix and match
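The matrix and vector multiplication described above might be sketched like this. The numbers are hypothetical, since the values on screen are not read out; plain Python lists are used because np.dot accepts them as well as NumPy arrays.

```python
import numpy as np

# Hypothetical values: a is an array of arrays, b is a single array
a = [[1, 2],
     [3, 4]]
b = [5, 6]

# Each output value is the dot product of one row of a with b
result = np.dot(a, b)

first = 1 * 5 + 2 * 6       # first row of a with b
second = 3 * 5 + 4 * 6      # second row of a with b

print(result, first, second)
```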
615
00:23:12,240 --> 00:23:14,320
when you get into really complicated uh
616
00:23:14,320 --> 00:23:16,640
back end stuff this becomes more common
617
00:23:16,640 --> 00:23:18,320
because you're now you've got layers
618
00:23:18,320 --> 00:23:21,600
upon layers of data and so you'll end up
619
00:23:21,600 --> 00:23:23,760
with a matrix and a set of both uh
620
00:23:23,760 --> 00:23:26,880
vectors and matrices that you want to multiply
621
00:23:26,880 --> 00:23:28,799
now keep in mind
622
00:23:28,799 --> 00:23:31,039
that if you're doing data science a lot
623
00:23:31,039 --> 00:23:32,640
of times you're not looking at this this
624
00:23:32,640 --> 00:23:34,960
is what's going on behind the scenes so
625
00:23:34,960 --> 00:23:38,000
if you're in the scikit looking at sk
626
00:23:38,000 --> 00:23:39,280
learn where you're doing linear
627
00:23:39,280 --> 00:23:40,720
regression models
628
00:23:40,720 --> 00:23:42,320
this is some of the math that's hidden
629
00:23:42,320 --> 00:23:44,640
behind the scenes that's going on
630
00:23:44,640 --> 00:23:46,320
other times you might find yourself
631
00:23:46,320 --> 00:23:48,559
having to do part of this and manipulate
632
00:23:48,559 --> 00:23:50,240
the data around so it fits right and
633
00:23:50,240 --> 00:23:52,159
then you go back in and you run it
634
00:23:52,159 --> 00:23:56,480
through scikit and if we can do
635
00:23:56,480 --> 00:23:58,799
up here where we did a
636
00:23:58,799 --> 00:24:00,720
matrix and vector multiplication we can
637
00:24:00,720 --> 00:24:03,440
also do matrix to matrix multiplication
638
00:24:03,440 --> 00:24:05,039
and if we run this where we have the two
639
00:24:05,039 --> 00:24:06,480
matrices
640
00:24:06,480 --> 00:24:07,840
you can see we have a very complicated
641
00:24:07,840 --> 00:24:09,440
array that of course comes out on there
642
00:24:09,440 --> 00:24:10,720
for our dot
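A sketch of the matrix to matrix multiplication mentioned above. The two matrices are hypothetical, because the ones used in the demo are not read out; the only requirement is that the number of columns in the first equals the number of rows in the second.

```python
import numpy as np

# Hypothetical matrices: 2x3 times 3x2 gives a 2x2 result
a = np.array([[1, 2, 3],
              [4, 5, 6]])
b = np.array([[7, 8],
              [9, 10],
              [11, 12]])

product = np.dot(a, b)      # each entry is a row of a dotted with a column of b
same_product = a @ b        # the @ operator is matrix multiplication for 2-D arrays

print(product)
```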
643
00:24:10,720 --> 00:24:13,120
and just to reiterate it we have our
644
00:24:13,120 --> 00:24:16,559
transpose of a matrix which is your dot t
645
00:24:16,559 --> 00:24:19,120
and so if we create a matrix a and we do
646
00:24:19,120 --> 00:24:21,039
uh transpose it you can see how it flips
647
00:24:21,039 --> 00:24:22,000
it from
648
00:24:22,000 --> 00:24:26,880
5 10 15 20 25 30 to 5 15 25
649
00:24:26,880 --> 00:24:28,480
10 20 30
650
00:24:28,480 --> 00:24:31,440
rows and columns
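The transpose above, as a minimal sketch. The second half illustrates the plotting use case with the same numbers treated as (x, y) pairs, which is an assumption for illustration only.

```python
import numpy as np

a = np.array([[5, 10],
              [15, 20],
              [25, 30]])

flipped = a.T               # rows become columns: 3x2 turns into 2x3
print(flipped)

# Plotting use case: a list of (x, y) pairs split into all x's and all y's
points = a
xs, ys = points.T
print(xs, ys)
```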
651
00:24:31,440 --> 00:24:33,279
and certainly with the math uh this
652
00:24:33,279 --> 00:24:36,240
comes up a lot um it also comes up a lot
653
00:24:36,240 --> 00:24:38,720
with xy plotting when you put it into
654
00:24:38,720 --> 00:24:40,799
pyplot you have one format where they're
655
00:24:40,799 --> 00:24:42,400
looking at pairs of numbers and then
656
00:24:42,400 --> 00:24:45,039
they want all of x's and all y's so you
657
00:24:45,039 --> 00:24:47,200
know the transpose is an important tool
658
00:24:47,200 --> 00:24:49,360
both for your math and for plotting and
659
00:24:49,360 --> 00:24:51,039
all kinds of things
660
00:24:51,039 --> 00:24:53,679
another tool that we didn't discuss uh
661
00:24:53,679 --> 00:24:56,960
is your identity matrix
662
00:24:56,960 --> 00:25:01,200
uh and this one is more definition
663
00:25:01,200 --> 00:25:03,679
the identity matrix
664
00:25:03,679 --> 00:25:06,559
we have here one where we just did two
665
00:25:06,559 --> 00:25:09,600
so it comes down as one zero zero one
666
00:25:09,600 --> 00:25:12,159
one zero zero zero one zero it creates a
667
00:25:12,159 --> 00:25:14,799
diagonal of one and what that is is when
668
00:25:14,799 --> 00:25:16,880
you're doing your identities you could
669
00:25:16,880 --> 00:25:18,320
be comparing
670
00:25:18,320 --> 00:25:20,320
all your different features to the
671
00:25:20,320 --> 00:25:21,600
different features and how they
672
00:25:21,600 --> 00:25:22,720
correlate
673
00:25:22,720 --> 00:25:25,200
and of course when you have feature one
674
00:25:25,200 --> 00:25:27,520
compared to feature one to itself it is
675
00:25:27,520 --> 00:25:28,480
always
676
00:25:28,480 --> 00:25:29,760
one
677
00:25:29,760 --> 00:25:30,880
where
678
00:25:30,880 --> 00:25:32,480
usually it's between zero one depending
679
00:25:32,480 --> 00:25:34,640
on how well it correlates so when we're
680
00:25:34,640 --> 00:25:36,720
talking about identity matrix that's
681
00:25:36,720 --> 00:25:38,720
what we're talking about right here is
682
00:25:38,720 --> 00:25:41,039
that you create this preset matrix and
683
00:25:41,039 --> 00:25:42,480
then you might adjust these numbers
684
00:25:42,480 --> 00:25:44,000
depending on what you're working with
685
00:25:44,000 --> 00:25:45,600
and what the domain is
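A short sketch of the identity matrix using np.identity. Besides the diagonal of ones described above, its defining property is that multiplying any matching matrix by it leaves that matrix unchanged; the matrix m below is a hypothetical example used to show this.

```python
import numpy as np

i2 = np.identity(2)         # ones on the diagonal, zeros elsewhere
i3 = np.identity(3)
print(i2)
print(i3)

# Hypothetical 3x3 matrix, only here to demonstrate the property
m = np.array([[2.0, 7.0, 1.0],
              [0.0, 3.0, 5.0],
              [4.0, 6.0, 8.0]])
unchanged = np.dot(m, i3)   # multiplying by the identity returns m itself
print(unchanged)
```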
686
00:25:45,600 --> 00:25:47,520
and then another thing we can do
687
00:25:47,520 --> 00:25:49,039
to kind of wrap this up we'll hit you
688
00:25:49,039 --> 00:25:51,200
with the most complicated
689
00:25:51,200 --> 00:25:52,320
piece of this
690
00:25:52,320 --> 00:25:55,120
puzzle here is the inverse
691
00:25:55,120 --> 00:25:57,120
of a matrix
692
00:25:57,120 --> 00:25:59,840
and let's just go ahead and put the um
693
00:25:59,840 --> 00:26:02,640
it's a lengthy description
694
00:26:02,640 --> 00:26:04,159
let's go and put the description this is
695
00:26:04,159 --> 00:26:06,480
straight out of the uh
696
00:26:06,480 --> 00:26:08,559
the website for
697
00:26:08,559 --> 00:26:12,880
numpy so given a square matrix a here's
698
00:26:12,880 --> 00:26:16,000
our square matrix a which is 2 1 0 0 1 0
699
00:26:16,000 --> 00:26:19,279
1 2 1. keep in mind 3 by 3 is square
700
00:26:19,279 --> 00:26:20,960
it's got to be equal it's going to
701
00:26:20,960 --> 00:26:24,559
return the matrix a inverse satisfying
702
00:26:24,559 --> 00:26:26,720
dot a
703
00:26:26,720 --> 00:26:29,360
a inverse so here's our matrix
704
00:26:29,360 --> 00:26:32,080
multiplication
705
00:26:32,080 --> 00:26:34,559
and then of course it equals the dot
706
00:26:34,559 --> 00:26:37,440
yeah a inverse of a
707
00:26:37,440 --> 00:26:39,600
with an identity shape of
708
00:26:39,600 --> 00:26:41,919
a dot shape zero this is just sizing
709
00:26:41,919 --> 00:26:43,840
the identity
710
00:26:43,840 --> 00:26:45,919
that's a little complicated there uh so
711
00:26:45,919 --> 00:26:48,480
we're going to have our here's our array
712
00:26:48,480 --> 00:26:50,159
we'll go ahead and run this
713
00:26:50,159 --> 00:26:52,640
and you can see what we end up with
714
00:26:52,640 --> 00:26:55,919
is we end up with uh an array 0.5 minus
715
00:26:55,919 --> 00:26:59,440
0.5 and so forth with our 2 1 1 going
716
00:26:59,440 --> 00:27:04,000
down 2 1 0 0 1 0 1 2 1.
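The inverse described above can be sketched with np.linalg.inv, using the 3 by 3 matrix from the narration. The check at the end mirrors the property quoted from the NumPy documentation: the matrix times its inverse, in either order, gives the identity sized by a.shape[0].

```python
import numpy as np

a = np.array([[2.0, 1.0, 0.0],
              [0.0, 1.0, 0.0],
              [1.0, 2.0, 1.0]])

a_inv = np.linalg.inv(a)            # only defined for square, non-singular matrices
identity = np.eye(a.shape[0])       # 3x3 identity, sized from a

print(a_inv)
# allclose is used because floating point results are only approximately exact
print(np.allclose(np.dot(a, a_inv), identity))
print(np.allclose(np.dot(a_inv, a), identity))
```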
717
00:27:04,000 --> 00:27:06,080
getting into a little deep on the math
718
00:27:06,080 --> 00:27:07,200
understanding
719
00:27:07,200 --> 00:27:09,360
when you need this is probably
720
00:27:09,360 --> 00:27:10,880
what's really important when you're
721
00:27:10,880 --> 00:27:12,480
doing data science
722
00:27:12,480 --> 00:27:13,760
versus
723
00:27:13,760 --> 00:27:15,520
handwriting this out and looking up the
724
00:27:15,520 --> 00:27:17,520
math and handwriting all the pieces out
725
00:27:17,520 --> 00:27:19,200
you do need to know about the linear
726
00:27:19,200 --> 00:27:21,440
algebra inverse of a
727
00:27:21,440 --> 00:27:23,360
so if it comes up you can easily pull it
728
00:27:23,360 --> 00:27:24,640
up or at least remember where to look it
729
00:27:24,640 --> 00:27:27,600
up we took a look at the algebra side of
730
00:27:27,600 --> 00:27:29,200
it let's go ahead and take a look at the
731
00:27:29,200 --> 00:27:31,679
calculus side of what's going on here
732
00:27:31,679 --> 00:27:35,039
with the machine learning so calculus oh
733
00:27:35,039 --> 00:27:36,880
my goodness and differential equations
734
00:27:36,880 --> 00:27:38,159
you got to throw that in there because
735
00:27:38,159 --> 00:27:40,159
that's all part of the
736
00:27:40,159 --> 00:27:42,400
bag of tricks especially when you're
737
00:27:42,400 --> 00:27:44,400
doing large neural networks but also
738
00:27:44,400 --> 00:27:46,640
comes up in many other areas the good
739
00:27:46,640 --> 00:27:48,320
news is most of it's already done for
740
00:27:48,320 --> 00:27:50,480
you in the back end uh so when it comes
741
00:27:50,480 --> 00:27:52,240
up you really do need to understand from
742
00:27:52,240 --> 00:27:54,480
the data science not data analytics data
743
00:27:54,480 --> 00:27:56,960
analytics means you're digging deep into
744
00:27:56,960 --> 00:27:59,600
actually solving these math equations
745
00:27:59,600 --> 00:28:01,679
and a neural network is just a giant
746
00:28:01,679 --> 00:28:04,080
differential equation
747
00:28:04,080 --> 00:28:06,000
so we talk about calculus
748
00:28:06,000 --> 00:28:07,840
we're going to go ahead
749
00:28:07,840 --> 00:28:10,880
and understand it by talking about cars
750
00:28:10,880 --> 00:28:13,520
versus time and speed
751
00:28:13,520 --> 00:28:16,320
so it helps to calculate the instantaneous
752
00:28:16,320 --> 00:28:19,120
rate of change
753
00:28:19,120 --> 00:28:21,039
so suppose we plot a graph of the speed
754
00:28:21,039 --> 00:28:23,120
of a car with respect to time
755
00:28:23,120 --> 00:28:25,279
so as you can see here going down the
756
00:28:25,279 --> 00:28:27,760
highway probably merged into the highway
757
00:28:27,760 --> 00:28:30,080
from an on-ramp so i had to accelerate
758
00:28:30,080 --> 00:28:31,919
so my speed went way up
759
00:28:31,919 --> 00:28:34,799
uh stuck in traffic merged into the
760
00:28:34,799 --> 00:28:36,559
traffic traffic opens up and i
761
00:28:36,559 --> 00:28:38,880
accelerate again up to the speed limit
762
00:28:38,880 --> 00:28:41,840
and uh maybe peters off up there so you
763
00:28:41,840 --> 00:28:44,559
can look at this as as
764
00:28:44,559 --> 00:28:46,720
the speed versus time i'm getting faster
765
00:28:46,720 --> 00:28:48,080
and faster because i'm continually
766
00:28:48,080 --> 00:28:50,480
accelerating and if i hit the brakes you
767
00:28:50,480 --> 00:28:52,080
go the other way
768
00:28:52,080 --> 00:28:53,679
so the rate of change of speed with
769
00:28:53,679 --> 00:28:55,760
respect to time is nothing but
770
00:28:55,760 --> 00:28:57,760
acceleration how fast are we
771
00:28:57,760 --> 00:28:59,039
accelerating
772
00:28:59,039 --> 00:29:00,799
the average acceleration is the slope between the
773
00:29:00,799 --> 00:29:02,799
start point x and the end point x plus
774
00:29:02,799 --> 00:29:04,960
delta x
775
00:29:04,960 --> 00:29:07,279
so we can calculate a simple
776
00:29:07,279 --> 00:29:09,360
if you had x and delta x we could put a
777
00:29:09,360 --> 00:29:12,480
line there and that slope of the line is
778
00:29:12,480 --> 00:29:14,559
our acceleration
779
00:29:14,559 --> 00:29:16,640
now that's pretty easy when you're doing
780
00:29:16,640 --> 00:29:18,960
linear algebra but i don't want to know
781
00:29:18,960 --> 00:29:20,720
it just for that line in those two
782
00:29:20,720 --> 00:29:22,559
points i want to know it across
783
00:29:22,559 --> 00:29:24,960
the whole of what i'm working with
784
00:29:24,960 --> 00:29:26,960
that's where we get into calculus
785
00:29:26,960 --> 00:29:28,799
so we talk about the distance between x
786
00:29:28,799 --> 00:29:31,120
and x plus delta x it has to be the smallest
787
00:29:31,120 --> 00:29:33,279
possible near to zero in order to
788
00:29:33,279 --> 00:29:36,399
approximate the acceleration
789
00:29:36,399 --> 00:29:38,799
uh so the idea is instead of i mean if
790
00:29:38,799 --> 00:29:41,520
you ever took a basic calculus class
791
00:29:41,520 --> 00:29:43,760
they would draw bars down here and you
792
00:29:43,760 --> 00:29:45,840
would divide this area up
793
00:29:45,840 --> 00:29:47,840
let's go back up the screen you divide
794
00:29:47,840 --> 00:29:50,559
this area of this time period up into
795
00:29:50,559 --> 00:29:53,039
maybe 10 sections and you'd use that and
796
00:29:53,039 --> 00:29:54,559
you could calculate the acceleration
797
00:29:54,559 --> 00:29:56,320
between each one of those 10 sections
798
00:29:56,320 --> 00:29:57,760
kind of thing
799
00:29:57,760 --> 00:29:59,600
and then we just keep making that space
800
00:29:59,600 --> 00:30:02,640
smaller and smaller until delta x is
801
00:30:02,640 --> 00:30:03,919
almost
802
00:30:03,919 --> 00:30:06,559
infinitesimally small
803
00:30:06,559 --> 00:30:08,960
and so we get f prime of a
804
00:30:08,960 --> 00:30:12,559
equals the limit as h goes to 0 of
805
00:30:12,559 --> 00:30:14,720
f of a plus h minus f of
806
00:30:14,720 --> 00:30:17,440
a all over h and that is you're
807
00:30:17,440 --> 00:30:20,799
computing the slope of the line
808
00:30:20,799 --> 00:30:22,399
we're just computing that slope and
809
00:30:22,399 --> 00:30:23,919
over smaller and smaller and smaller
810
00:30:23,919 --> 00:30:25,679
samples
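The shrinking-slope idea above can be tried numerically. This is a sketch under assumptions: the speed curve t squared is a hypothetical stand-in for the car's speed, and the helper name difference_quotient is made up for illustration.

```python
def speed(t):
    """Hypothetical speed curve: speed = t squared (arbitrary units)."""
    return t ** 2

def difference_quotient(f, a, h):
    """Slope of the line through (a, f(a)) and (a + h, f(a + h))."""
    return (f(a + h) - f(a)) / h

a = 3.0
steps = (1.0, 0.1, 0.01, 0.001)
estimates = [difference_quotient(speed, a, h) for h in steps]

# As h shrinks, the slope settles toward the exact acceleration 2 * a
for h, estimate in zip(steps, estimates):
    print(h, estimate)
```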
811
00:30:25,679 --> 00:30:27,679
and that's what calculus is calculus is
812
00:30:27,679 --> 00:30:29,760
the integral you can see down here we
813
00:30:29,760 --> 00:30:32,000
have our nice uh integral sign looks
814
00:30:32,000 --> 00:30:33,679
like a giant s
815
00:30:33,679 --> 00:30:35,679
and that's what that means is that we've
816
00:30:35,679 --> 00:30:39,120
taken this down to as small as we can
817
00:30:39,120 --> 00:30:41,120
for that sampling
818
00:30:41,120 --> 00:30:42,399
so we're talking about calculus we're
819
00:30:42,399 --> 00:30:45,279
finding the area under the curve is the
820
00:30:45,279 --> 00:30:48,000
main process in the integration
821
00:30:48,000 --> 00:30:50,000
similar small intervals are made of the
822
00:30:50,000 --> 00:30:52,320
smallest possible length of x plus delta
823
00:30:52,320 --> 00:30:55,279
x where delta x approaches almost an
824
00:30:55,279 --> 00:30:57,440
infinitesimally small space
825
00:30:57,440 --> 00:30:59,039
and then it helps to find the overall
826
00:30:59,039 --> 00:31:01,279
acceleration by summing up all the links
827
00:31:01,279 --> 00:31:02,559
together
828
00:31:02,559 --> 00:31:03,840
so we're summing up all the
829
00:31:03,840 --> 00:31:05,600
accelerations from the beginning to the
830
00:31:05,600 --> 00:31:06,480
end
831
00:31:06,480 --> 00:31:08,799
and so here's our integral the sum of a
832
00:31:08,799 --> 00:31:11,200
of x times d of x
833
00:31:11,200 --> 00:31:13,279
equals a plus c
834
00:31:13,279 --> 00:31:16,799
uh that is our basic calculus here
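The summing of small slices described above can be sketched as a Riemann sum. The acceleration curve a(t) = 2t is a hypothetical example, and riemann_sum is an illustrative helper, not a library function; the exact area under this curve from 0 to 3 is 9.

```python
def acceleration(t):
    """Hypothetical acceleration curve: a(t) = 2t."""
    return 2.0 * t

def riemann_sum(f, start, end, slices):
    """Add up slice areas: height at the left edge times the slice width."""
    width = (end - start) / slices
    return sum(f(start + i * width) * width for i in range(slices))

coarse = riemann_sum(acceleration, 0.0, 3.0, 10)      # 10 wide bars
fine = riemann_sum(acceleration, 0.0, 3.0, 10000)     # many thin bars

# The thinner the bars, the closer the total gets to the exact area
print(coarse, fine)
```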
835
00:31:16,799 --> 00:31:19,440
so when we talk about multivariate
836
00:31:19,440 --> 00:31:20,799
calculus
837
00:31:20,799 --> 00:31:22,640
multivariate calculus deals with
838
00:31:22,640 --> 00:31:25,440
functions that have multiple variables
839
00:31:25,440 --> 00:31:26,799
and you can see here we start getting
840
00:31:26,799 --> 00:31:30,399
into some very complicated equations um
841
00:31:30,399 --> 00:31:33,039
change of w over change of time
842
00:31:33,039 --> 00:31:35,520
equals change of w over change of z times
843
00:31:35,520 --> 00:31:38,159
the differential of z over dx times the differential
844
00:31:38,159 --> 00:31:41,200
of x over dt it gets pretty complicated uh
845
00:31:41,200 --> 00:31:43,120
and it really translates into the
846
00:31:43,120 --> 00:31:45,039
multivariate integration using double
847
00:31:45,039 --> 00:31:47,840
integrals and so you have the the sum of
848
00:31:47,840 --> 00:31:51,200
the sum of f of x y d a equals
849
00:31:51,200 --> 00:31:54,399
the sum from c to d and a to b of f of x
850
00:31:54,399 --> 00:31:56,720
y d x d y equals
851
00:31:56,720 --> 00:31:59,679
uh the sum of a to b sum of c to d of f
852
00:31:59,679 --> 00:32:03,039
x y d y d x
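Written out, the two formulas read aloud above are most likely the chain rule and the double integral over a rectangle. This is a reconstruction from the spoken description, assuming w depends on z, z on x, and x on t, and that the region R runs from a to b in x and from c to d in y; the slide itself is not visible in the transcript.

```latex
% Chain rule through two intermediate variables
\frac{dw}{dt} = \frac{dw}{dz} \cdot \frac{dz}{dx} \cdot \frac{dx}{dt}

% Double integral over the rectangle R = [a, b] \times [c, d],
% evaluated in either order
\iint_{R} f(x, y)\, dA
  = \int_{c}^{d} \int_{a}^{b} f(x, y)\, dx\, dy
  = \int_{a}^{b} \int_{c}^{d} f(x, y)\, dy\, dx
```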
853
00:32:03,039 --> 00:32:05,600
understanding the very specifics of
854
00:32:05,600 --> 00:32:07,519
everything going on in here and actually
855
00:32:07,519 --> 00:32:10,880
doing the math is usually calculus 1
856
00:32:10,880 --> 00:32:14,000
calculus 2 and differential equations so
857
00:32:14,000 --> 00:32:15,440
you're talking about three full length
858
00:32:15,440 --> 00:32:17,519
courses to dig into
859
00:32:17,519 --> 00:32:20,000
and solve these math equations
860
00:32:20,000 --> 00:32:21,600
what we want to take from here is we're
861
00:32:21,600 --> 00:32:23,440
talking about calculus
862
00:32:23,440 --> 00:32:26,080
we're talking about summing of all these
863
00:32:26,080 --> 00:32:28,320
different slopes and so we're still
864
00:32:28,320 --> 00:32:30,159
solving a linear
865
00:32:30,159 --> 00:32:32,159
expression we're still solving
866
00:32:32,159 --> 00:32:34,960
y equals m x plus b
867
00:32:34,960 --> 00:32:36,960
but we're doing this for infinitesimally
868
00:32:36,960 --> 00:32:38,880
small x's and then we want to sum them
869
00:32:38,880 --> 00:32:40,880
up that's what this integral sign means
870
00:32:40,880 --> 00:32:45,760
the sum of a of x d of x equals a plus c
871
00:32:45,760 --> 00:32:48,000
and when you see these very complicated
872
00:32:48,000 --> 00:32:50,320
multivariate differentiation using the
873
00:32:50,320 --> 00:32:51,760
chain rule
874
00:32:51,760 --> 00:32:53,360
when we come in here and we have the
875
00:32:53,360 --> 00:32:55,840
change of w to the change of t equals
876
00:32:55,840 --> 00:32:57,760
the change of w dz
877
00:32:57,760 --> 00:32:59,919
uh and so forth
878
00:32:59,919 --> 00:33:01,360
that's what's going on here that's what
879
00:33:01,360 --> 00:33:03,279
this means we're basically looking for
880
00:33:03,279 --> 00:33:05,120
the area under the curve which really
881
00:33:05,120 --> 00:33:06,480
comes to
882
00:33:06,480 --> 00:33:09,519
how is the change changing speeds going
883
00:33:09,519 --> 00:33:12,080
up how is that changing and then you end
884
00:33:12,080 --> 00:33:14,240
up with a multiple layer so if i have
885
00:33:14,240 --> 00:33:16,320
three layers of neural networks how is
886
00:33:16,320 --> 00:33:18,399
the third layer changing based on the
887
00:33:18,399 --> 00:33:20,080
second layer changing which is based on
888
00:33:20,080 --> 00:33:22,399
the first layer changing and you get the
889
00:33:22,399 --> 00:33:23,760
picture here that now we have a very
890
00:33:23,760 --> 00:33:25,039
complicated
891
00:33:25,039 --> 00:33:27,200
multivariate integration
892
00:33:27,200 --> 00:33:29,679
with integrals
893
00:33:29,679 --> 00:33:32,480
the good news is we can solve this
894
00:33:32,480 --> 00:33:34,000
mathematically and that's what we do
895
00:33:34,000 --> 00:33:35,519
when you do neural networks and back
896
00:33:35,519 --> 00:33:37,039
propagation
897
00:33:37,039 --> 00:33:38,720
so the nice thing is that you don't have
898
00:33:38,720 --> 00:33:40,480
to solve this on paper unless you're a
899
00:33:40,480 --> 00:33:42,000
data analyst and you're working on the
900
00:33:42,000 --> 00:33:44,399
back end of integrating these formulas
901
00:33:44,399 --> 00:33:45,600
and building the script to actually
902
00:33:45,600 --> 00:33:47,760
build them so we talk about applications
903
00:33:47,760 --> 00:33:50,320
of calculus it provides us the tools to
904
00:33:50,320 --> 00:33:53,039
build an accurate predictive model
905
00:33:53,039 --> 00:33:55,360
so it's really behind the scenes we want
906
00:33:55,360 --> 00:33:56,640
to guess at what the change of the
907
00:33:56,640 --> 00:33:59,120
change of the change is
908
00:33:59,120 --> 00:34:01,360
that's a little goofy i know i just
909
00:34:01,360 --> 00:34:02,960
threw that out there it's kind of a meta
910
00:34:02,960 --> 00:34:05,360
term but if you can guess how things are
911
00:34:05,360 --> 00:34:07,519
going to change then you can guess what
912
00:34:07,519 --> 00:34:09,199
the new numbers are
913
00:34:09,199 --> 00:34:11,679
multivariate calculus explains the
914
00:34:11,679 --> 00:34:13,520
change in our target variable in
915
00:34:13,520 --> 00:34:15,440
relation to the rate of change in the
916
00:34:15,440 --> 00:34:17,440
input variables
917
00:34:17,440 --> 00:34:19,599
so there's our multiple variables going
918
00:34:19,599 --> 00:34:20,399
in there
919
00:34:20,399 --> 00:34:22,239
if one variable is changing how does it
920
00:34:22,239 --> 00:34:24,320
affect the other variable
921
00:34:24,320 --> 00:34:27,119
and then in gradient descent calculus is
922
00:34:27,119 --> 00:34:30,800
used to find the local and global minima
923
00:34:30,800 --> 00:34:32,719
and this is really big
924
00:34:32,719 --> 00:34:33,839
we're actually going to have a whole
925
00:34:33,839 --> 00:34:36,239
section here on gradient descent because
926
00:34:36,239 --> 00:34:37,520
it is
927
00:34:37,520 --> 00:34:39,359
really i mean i talked about neural
928
00:34:39,359 --> 00:34:41,199
networks and how you can see how the
929
00:34:41,199 --> 00:34:42,719
different layers go in there but
930
00:34:42,719 --> 00:34:46,079
gradient descent is one of the most key
931
00:34:46,079 --> 00:34:48,079
things for trying to guess the best
932
00:34:48,079 --> 00:34:49,839
answer to something
933
00:34:49,839 --> 00:34:53,760
so let's take a look at the code behind
934
00:34:53,760 --> 00:34:55,679
gradient descent
935
00:34:55,679 --> 00:34:57,839
and uh before we open up the code let's
936
00:34:57,839 --> 00:34:59,599
just do real quick
937
00:34:59,599 --> 00:35:03,280
uh gradient descent
938
00:35:03,280 --> 00:35:05,599
let's say we have a curve like this and
939
00:35:05,599 --> 00:35:07,280
most common
940
00:35:07,280 --> 00:35:09,200
is that this is going to represent your
941
00:35:09,200 --> 00:35:12,400
error oops
942
00:35:12,400 --> 00:35:15,200
error there we go error
943
00:35:15,200 --> 00:35:17,119
hard to read there and i want to make
944
00:35:17,119 --> 00:35:20,160
the error as low as possible
945
00:35:20,160 --> 00:35:22,160
and so what i'm looking at is i want
946
00:35:22,160 --> 00:35:25,040
to find this line here
947
00:35:25,040 --> 00:35:26,480
which is the
948
00:35:26,480 --> 00:35:28,960
minimum value so we're looking for the
949
00:35:28,960 --> 00:35:33,599
minimum and it does that by uh sampling
950
00:35:33,599 --> 00:35:35,040
there
951
00:35:35,040 --> 00:35:37,200
and then based on this it guesses it
952
00:35:37,200 --> 00:35:39,280
might be someplace here and it goes hey
953
00:35:39,280 --> 00:35:41,119
this is still going down
954
00:35:41,119 --> 00:35:43,280
it goes here and then goes back over
955
00:35:43,280 --> 00:35:45,680
here and then goes a little bit closer
956
00:35:45,680 --> 00:35:47,359
and it's just playing a high low until
957
00:35:47,359 --> 00:35:50,720
it gets to that spot that bottom spot
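The search for the bottom of the error curve can be sketched as plain gradient descent. This is not the code from the video: the error curve, its slope function, the starting guess, and the learning rate are all hypothetical, and the method follows the slope downhill rather than literally bracketing high and low guesses.

```python
def error(w):
    """Hypothetical error curve with its minimum at w = 3."""
    return (w - 3.0) ** 2 + 1.0

def error_slope(w):
    """Derivative of the hypothetical error curve."""
    return 2.0 * (w - 3.0)

w = 0.0                     # starting guess (assumption)
learning_rate = 0.1         # step size (assumption)

for step in range(100):
    # Move against the slope: downhill on the error curve
    w -= learning_rate * error_slope(w)

print(w, error(w))          # w ends up very close to 3, the bottom of the curve
```

To look for a peak instead of a valley, the same loop can be run on the negative of the function being maximized.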
958
00:35:50,720 --> 00:35:54,960
and so we want to minimize the error in
959
00:35:54,960 --> 00:35:57,280
on the flip note you could also want to
960
00:35:57,280 --> 00:35:59,359
be maximizing something you want to get
961
00:35:59,359 --> 00:36:02,320
the best output of it uh that's simply
962
00:36:02,320 --> 00:36:04,320
uh minus the value
963
00:36:04,320 --> 00:36:05,839
so if you're looking for where the peak
964
00:36:05,839 --> 00:36:06,880
is
965
00:36:06,880 --> 00:36:09,760
this is the same as a negative
966
00:36:09,760 --> 00:36:11,760
for where the valley is looking for that
967
00:36:11,760 --> 00:36:12,960
valley
968
00:36:12,960 --> 00:36:14,480
that's all that is and this is a way of
969
00:36:14,480 --> 00:36:16,560
finding it so
970
00:36:16,560 --> 00:36:18,800
the cool thing is all the heavy lifting
971
00:36:18,800 --> 00:36:20,880
is done i actually
972
00:36:20,880 --> 00:36:22,720
ended up putting together one of these a
973
00:36:22,720 --> 00:36:25,119
while back is when i didn't know about
974
00:36:25,119 --> 00:36:27,760
scikit and it was just starting
975
00:36:27,760 --> 00:36:29,839
boy it's a long while back
976
00:36:29,839 --> 00:36:31,040
and uh
977
00:36:31,040 --> 00:36:32,880
is playing high low how do you play high
978
00:36:32,880 --> 00:36:35,359
low not get stuck in the valleys uh
979
00:36:35,359 --> 00:36:37,119
figure out these curves and things like
980
00:36:37,119 --> 00:36:39,440
that well you do that and the back end
981
00:36:39,440 --> 00:36:40,960
is all the calculus and differential
982
00:36:40,960 --> 00:36:43,520
equations to calculate this out
983
00:36:43,520 --> 00:36:45,280
the good news is you don't have to do
984
00:36:45,280 --> 00:36:47,040
those
985
00:36:47,040 --> 00:36:48,400
so instead we're going to put together
986
00:36:48,400 --> 00:36:52,480
the code and let's go ahead
987
00:36:52,480 --> 00:36:55,839
and see what we can do with that
988
00:36:56,800 --> 00:36:59,040
so uh guys in the back put together a
989
00:36:59,040 --> 00:37:00,800
nice little piece of code here which is
990
00:37:00,800 --> 00:37:02,079
kind of fun
991
00:37:02,079 --> 00:37:04,320
uh some things we're gonna note and this
992
00:37:04,320 --> 00:37:06,240
is this is really important stuff
993
00:37:06,240 --> 00:37:07,680
because when you start doing your data
994
00:37:07,680 --> 00:37:09,680
science and digging into your machine
995
00:37:09,680 --> 00:37:11,280
learning models
996
00:37:11,280 --> 00:37:12,720
you're going to find
997
00:37:12,720 --> 00:37:14,800
these things are stumbling blocks
998
00:37:14,800 --> 00:37:17,119
the first one is current x where do we
999
00:37:17,119 --> 00:37:18,480
start at
1000
00:37:18,480 --> 00:37:20,079
keep in mind
1001
00:37:20,079 --> 00:37:20,880
your
1002
00:37:20,880 --> 00:37:23,119
model that you're working with is very
1003
00:37:23,119 --> 00:37:25,119
generic so whatever you use to minimize
1004
00:37:25,119 --> 00:37:27,119
it the first question is where do we
1005
00:37:27,119 --> 00:37:28,640
start
1006
00:37:28,640 --> 00:37:30,320
and we started at this because the
1007
00:37:30,320 --> 00:37:32,720
algorithm starts at x equals three
1008
00:37:32,720 --> 00:37:35,280
so we arbitrarily picked five
1009
00:37:35,280 --> 00:37:37,200
learning rate is uh how many bars to
1010
00:37:37,200 --> 00:37:39,599
skip going one way or the other in
1011
00:37:39,599 --> 00:37:40,880
fact i'm going to separate that a little
1012
00:37:40,880 --> 00:37:42,079
bit because these two are really
1013
00:37:42,079 --> 00:37:43,119
important
1014
00:37:43,119 --> 00:37:45,440
um if we're dealing with something like
1015
00:37:45,440 --> 00:37:47,440
this where we're talking about
1016
00:37:47,440 --> 00:37:49,119
well here's our here's the function
1017
00:37:49,119 --> 00:37:51,280
we're going to use our
1018
00:37:51,280 --> 00:37:53,280
gradient of our function
1019
00:37:53,280 --> 00:37:56,160
2 times x plus 5 keep it simple so
1020
00:37:56,160 --> 00:37:57,440
that's a function we're going to work
1021
00:37:57,440 --> 00:37:59,839
with so if i'm dealing with increments
1022
00:37:59,839 --> 00:38:02,880
of 1000 then 0.1 is going to take a very long
1023
00:38:02,880 --> 00:38:05,839
time and if i'm dealing with increments
1024
00:38:05,839 --> 00:38:08,480
of 0.001
1025
00:38:08,480 --> 00:38:11,520
0.1 is going to skip over my answer so i
1026
00:38:11,520 --> 00:38:13,520
won't get a very good answer
1027
00:38:13,520 --> 00:38:15,280
and then we look at precision this tells
1028
00:38:15,280 --> 00:38:18,400
us when to stop the algorithm so again
1029
00:38:18,400 --> 00:38:21,280
very specific to what you're working on
1030
00:38:21,280 --> 00:38:23,040
if you're working with money
1031
00:38:23,040 --> 00:38:24,240
and
1032
00:38:24,240 --> 00:38:28,240
you don't convert it into a float value
1033
00:38:28,240 --> 00:38:31,040
you might be dealing with 0.01 which is
1034
00:38:31,040 --> 00:38:32,640
a penny that might be your precision
1035
00:38:32,640 --> 00:38:34,560
you're working with
1036
00:38:34,560 --> 00:38:36,160
and then of course the previous step
1037
00:38:36,160 --> 00:38:39,200
size max iterations we want something to
1038
00:38:39,200 --> 00:38:41,040
cut out at a certain point usually
1039
00:38:41,040 --> 00:38:43,680
that's built into a lot of minimization
1040
00:38:43,680 --> 00:38:44,880
functions
1041
00:38:44,880 --> 00:38:47,040
and then here's our actual uh formula
1042
00:38:47,040 --> 00:38:48,720
we're going to be working with
1043
00:38:48,720 --> 00:38:50,720
and then we come in we go while previous
1044
00:38:50,720 --> 00:38:52,800
step size is greater than precision and
1045
00:38:52,800 --> 00:38:56,960
iters is less than max iters
1046
00:38:56,960 --> 00:38:59,280
say that 10 times fast
1047
00:38:59,280 --> 00:39:00,960
um
1048
00:39:00,960 --> 00:39:02,960
we're just saying if it's uh if we're if
1049
00:39:02,960 --> 00:39:04,480
we're still greater than our precision
1050
00:39:04,480 --> 00:39:05,920
level we still got to keep digging
1051
00:39:05,920 --> 00:39:07,440
deeper
1052
00:39:07,440 --> 00:39:09,359
and then we also don't want to go past a
1053
00:39:09,359 --> 00:39:11,680
thou or whatever this is a million or 10
1054
00:39:11,680 --> 00:39:12,720
000
1055
00:39:12,720 --> 00:39:14,880
running that's actually pretty high we
1056
00:39:14,880 --> 00:39:16,880
almost never do max iterations more than
1057
00:39:16,880 --> 00:39:19,200
like 100 or 200
1058
00:39:19,200 --> 00:39:20,640
rare occasions you might go up to four
1059
00:39:20,640 --> 00:39:22,720
or 500 if it's depending on the problem
1060
00:39:22,720 --> 00:39:24,000
you're working with
1061
00:39:24,000 --> 00:39:26,079
uh so we have our previous equals our
1062
00:39:26,079 --> 00:39:29,760
current that way we can track time wise
1063
00:39:29,760 --> 00:39:31,760
the current now equals the current minus
1064
00:39:31,760 --> 00:39:34,480
the rate times the formula of our
1065
00:39:34,480 --> 00:39:36,000
previous x
1066
00:39:36,000 --> 00:39:38,640
so now we've generated our new version
1067
00:39:38,640 --> 00:39:41,280
previous step size equals the absolute
1068
00:39:41,280 --> 00:39:43,359
current minus previous
1069
00:39:43,359 --> 00:39:46,320
so we're looking for the change in x
1070
00:39:46,320 --> 00:39:48,560
iters equals iters plus one that's
1071
00:39:48,560 --> 00:39:50,800
so we know to stop if we get too far
1072
00:39:50,800 --> 00:39:51,920
and then we're just going to print the
1073
00:39:51,920 --> 00:39:54,480
local minimum occurs at
1074
00:39:54,480 --> 00:39:56,800
x on here and if we go ahead and run
1075
00:39:56,800 --> 00:39:58,720
this
1076
00:39:58,720 --> 00:40:00,400
you can see right here it gets down to
1077
00:40:00,400 --> 00:40:03,839
this point and it says hey
1078
00:40:03,839 --> 00:40:06,280
local minimum is minus
1079
00:40:06,280 --> 00:40:08,960
3.3222 for this particular series we
1080
00:40:08,960 --> 00:40:11,440
created this is created off of our
1081
00:40:11,440 --> 00:40:15,200
formula here lambda x 2 times x plus five
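The loop just described can be sketched roughly like this. The variable names follow the ones read out in the walkthrough; the learning rate and precision values are assumptions, since they are not spoken in the video. The gradient is read here as 2 * (x + 5), which is the gradient of (x + 5) ** 2 and puts the minimum at x = -5. If the on-screen code was 2 * x + 5 instead, the loop would settle at x = -2.5.

```python
# Minimal gradient descent sketch, following the spoken walkthrough.
# Assumptions (not stated in the video): rate=0.01, precision=1e-6,
# and the gradient being 2 * (x + 5), i.e. f(x) = (x + 5) ** 2.

def gradient_descent(df, cur_x=3.0, rate=0.01, precision=1e-6, max_iters=10_000):
    """Step downhill until the change in x drops below `precision`
    or `max_iters` iterations have been used."""
    previous_step_size = 1.0  # any value larger than precision, to enter the loop
    iters = 0
    while previous_step_size > precision and iters < max_iters:
        prev_x = cur_x                            # remember where we were
        cur_x = cur_x - rate * df(prev_x)         # move against the gradient
        previous_step_size = abs(cur_x - prev_x)  # change in x this step
        iters += 1                                # count iterations so we can bail out
    return cur_x, iters


def df(x):
    # Gradient of the assumed function f(x) = (x + 5) ** 2
    return 2 * (x + 5)


x_min, n_iters = gradient_descent(df)
print("The local minimum occurs at", x_min, "after", n_iters, "iterations")
```

Changing cur_x, rate and precision and re-running is a quick way to see why the starting point and the step size matter so much.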
1082
00:40:15,200 --> 00:40:16,400
now
1083
00:40:16,400 --> 00:40:18,480
when i'm running this stuff you'll see
1084
00:40:18,480 --> 00:40:21,040
this come up a lot
1085
00:40:21,040 --> 00:40:21,920
in
1086
00:40:21,920 --> 00:40:24,880
with the sk learn kit and one of the
1087
00:40:24,880 --> 00:40:26,640
nice reasons of breaking this down the
1088
00:40:26,640 --> 00:40:27,920
way we did
1089
00:40:27,920 --> 00:40:30,560
is i could go over those top pieces
1090
00:40:30,560 --> 00:40:32,560
those top pieces are everything when you
1091
00:40:32,560 --> 00:40:34,880
start looking at these minimization tool
1092
00:40:34,880 --> 00:40:39,359
kits in built-in code and so from
1093
00:40:39,359 --> 00:40:41,760
we'll just do it's actually
1094
00:40:41,760 --> 00:40:43,359
docs
1095
00:40:43,359 --> 00:40:44,720
dot
1096
00:40:44,720 --> 00:40:46,880
scipy.org
1097
00:40:46,880 --> 00:40:49,200
and we're looking at
1098
00:40:49,200 --> 00:40:50,960
the scipy
1099
00:40:50,960 --> 00:40:54,720
there we go optimize minimize
1100
00:40:54,720 --> 00:40:57,119
you can only minimize one value
1101
00:40:57,119 --> 00:40:58,560
you have the function that's going in
1102
00:40:58,560 --> 00:41:01,359
this function can be very complicated
1103
00:41:01,359 --> 00:41:03,040
so we used a very simple function up
1104
00:41:03,040 --> 00:41:03,920
here
1105
00:41:03,920 --> 00:41:05,760
it could be
1106
00:41:05,760 --> 00:41:06,960
there's all kinds of things that could
1107
00:41:06,960 --> 00:41:08,319
be on there and there's a number of
1108
00:41:08,319 --> 00:41:10,319
methods to solve this as far as how they
1109
00:41:10,319 --> 00:41:11,599
shrink down
1110
00:41:11,599 --> 00:41:13,440
uh and your x naught there's your
1111
00:41:13,440 --> 00:41:15,040
there's your start value so your
1112
00:41:15,040 --> 00:41:18,160
function your start value
1113
00:41:18,160 --> 00:41:19,520
there's all kinds of things that come in
1114
00:41:19,520 --> 00:41:20,720
here that you can look at which we're
1115
00:41:20,720 --> 00:41:22,319
not going to
1116
00:41:22,319 --> 00:41:24,400
optimization automatically creates
1117
00:41:24,400 --> 00:41:26,720
constraints bounds
1118
00:41:26,720 --> 00:41:28,560
some of this it does automatically but
1119
00:41:28,560 --> 00:41:30,640
you really the big thing i want to point
1120
00:41:30,640 --> 00:41:32,160
out here is you need to have a starting
1121
00:41:32,160 --> 00:41:34,079
point you want to start with something
1122
00:41:34,079 --> 00:41:35,599
that you already know is mostly the
1123
00:41:35,599 --> 00:41:36,880
answer
1124
00:41:36,880 --> 00:41:38,079
if you don't then it's going to have a
1125
00:41:38,079 --> 00:41:39,680
heck of a time trying to calculate it
1126
00:41:39,680 --> 00:41:41,359
out
1127
00:41:41,359 --> 00:41:42,640
or you can write your own little script
1128
00:41:42,640 --> 00:41:44,480
that does this and does a high low
1129
00:41:44,480 --> 00:41:47,280
guessing and tries to find the max value
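As a rough sketch of the built-in route, scipy.optimize.minimize takes the function and a start value x0, the two pieces pointed out above. The functions f and g below are made-up examples, not the ones from the video. The second call shows the earlier point about peaks: to find a maximum, minimize the negative of the function.

```python
import numpy as np
from scipy.optimize import minimize


def f(x):
    # Hypothetical function to minimize; x arrives as a 1-element array.
    return (x[0] + 5) ** 2


# x0 is the required starting guess.
result = minimize(f, x0=np.array([3.0]))
print("minimum near x =", result.x[0], "converged:", result.success)


def g(x):
    # Hypothetical function with a peak at x = 2.
    return -(x[0] - 2) ** 2 + 7


# Maximizing g is the same as minimizing -g.
peak = minimize(lambda x: -g(x), x0=np.array([0.0]))
print("maximum near x =", peak.x[0], "value:", g(peak.x))
```

With no method given and no bounds or constraints, minimize picks a default solver, so the only things that must be supplied are the function and x0.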
1130
00:41:47,280 --> 00:41:50,240
that brings us to statistics what this
1131
00:41:50,240 --> 00:41:52,319
is kind of all about is figuring things
1132
00:41:52,319 --> 00:41:55,359
out a lot of vocabulary in statistics
1133
00:41:55,359 --> 00:41:58,319
uh so statistics well i guess it's all
1134
00:41:58,319 --> 00:42:00,079
relative it's definitely not an edel
1135
00:42:00,079 --> 00:42:01,359
class
1136
00:42:01,359 --> 00:42:03,359
so a bunch of stuff going on statistics
1137
00:42:03,359 --> 00:42:05,680
statistics is concerned with the collection
1138
00:42:05,680 --> 00:42:08,640
organization analysis interpretation
1139
00:42:08,640 --> 00:42:12,160
and presentation of data
1140
00:42:12,160 --> 00:42:14,640
that is a mouthful
1141
00:42:14,640 --> 00:42:17,280
so we have from end to end
1142
00:42:17,280 --> 00:42:19,280
where does it come from is it valid what
1143
00:42:19,280 --> 00:42:22,079
does it mean how do we organize it um
1144
00:42:22,079 --> 00:42:24,240
how do we analyze it and then you gotta
1145
00:42:24,240 --> 00:42:26,079
take those analysis and interpret it
1146
00:42:26,079 --> 00:42:28,240
into something that people can use kind
1147
00:42:28,240 --> 00:42:29,680
of reduce it to
1148
00:42:29,680 --> 00:42:31,359
understandable
1149
00:42:31,359 --> 00:42:32,880
and nowadays you have to be able to
1150
00:42:32,880 --> 00:42:34,800
present it if you can't present it then
1151
00:42:34,800 --> 00:42:36,000
no one else is going to understand what
1152
00:42:36,000 --> 00:42:38,640
the heck you did
1153
00:42:38,640 --> 00:42:41,440
so we look at the terminologies
1154
00:42:41,440 --> 00:42:43,040
there is a lot of terminologies
1155
00:42:43,040 --> 00:42:45,040
depending on what domain you're working
1156
00:42:45,040 --> 00:42:45,839
in
1157
00:42:45,839 --> 00:42:49,119
so clearly if you're working in
1158
00:42:49,119 --> 00:42:52,000
a domain that deals with
1159
00:42:52,000 --> 00:42:56,160
viruses and t cells and and
1160
00:42:56,160 --> 00:42:57,680
how does you know where does that come
1161
00:42:57,680 --> 00:42:58,800
from you're studying the different
1162
00:42:58,800 --> 00:43:00,880
people that you can have a population
1163
00:43:00,880 --> 00:43:04,640
if you are working with um
1164
00:43:04,640 --> 00:43:06,720
mechanical gear
1165
00:43:06,720 --> 00:43:07,760
you know a little bit different if
1166
00:43:07,760 --> 00:43:09,200
you're looking for the wobbling
1167
00:43:09,200 --> 00:43:12,000
statistics uh to know when to replace a
1168
00:43:12,000 --> 00:43:13,520
rotor on a machine or something like
1169
00:43:13,520 --> 00:43:14,319
that
1170
00:43:14,319 --> 00:43:15,920
that can be a big deal you know we have
1171
00:43:15,920 --> 00:43:18,560
these huge fans that turn
1172
00:43:18,560 --> 00:43:21,760
in our sewage processing systems and so
1173
00:43:21,760 --> 00:43:24,160
those fans they start to wobble and hum
1174
00:43:24,160 --> 00:43:25,680
and do different things that the sensors
1175
00:43:25,680 --> 00:43:28,240
pick up at what point do you replace them
1176
00:43:28,240 --> 00:43:29,599
instead of waiting for it to break in
1177
00:43:29,599 --> 00:43:31,119
which case it costs a lot of money
1178
00:43:31,119 --> 00:43:32,640
instead of replacing a bushing you're
1179
00:43:32,640 --> 00:43:35,200
replacing the whole fan unit
1180
00:43:35,200 --> 00:43:37,200
an interesting project that came up for
1181
00:43:37,200 --> 00:43:39,200
our city a while back
1182
00:43:39,200 --> 00:43:41,440
so population all objects or
1183
00:43:41,440 --> 00:43:43,760
measurements whose properties are being
1184
00:43:43,760 --> 00:43:45,040
observed
1185
00:43:45,040 --> 00:43:46,960
so that's your population all the
1186
00:43:46,960 --> 00:43:49,359
objects it's easy to see it with people
1187
00:43:49,359 --> 00:43:52,880
because we have our population at large
1188
00:43:52,880 --> 00:43:55,200
but in the case of the sewer fans we're
1189
00:43:55,200 --> 00:43:56,880
talking about having the fan units
1190
00:43:56,880 --> 00:43:58,400
that's the population of fans that we're
1191
00:43:58,400 --> 00:44:00,560
working with
1192
00:44:00,560 --> 00:44:03,040
you have a parameter a metric that is
1193
00:44:03,040 --> 00:44:04,960
used to represent a population or
1194
00:44:04,960 --> 00:44:06,560
characteristic
1195
00:44:06,560 --> 00:44:08,560
you have your sample a subset of the
1196
00:44:08,560 --> 00:44:10,800
population studied you don't want to do
1197
00:44:10,800 --> 00:44:12,560
them all because then you don't have a
1198
00:44:12,560 --> 00:44:14,319
if you come up with a conclusion for
1199
00:44:14,319 --> 00:44:16,000
everyone you don't have a way of testing
1200
00:44:16,000 --> 00:44:17,760
it so you take a sample
1201
00:44:17,760 --> 00:44:18,880
sometimes you don't have a choice you
1202
00:44:18,880 --> 00:44:20,480
can only take a sample of what's going
1203
00:44:20,480 --> 00:44:24,000
on you can't study the whole population
1204
00:44:24,000 --> 00:44:26,319
and a variable a metric of interest for
1205
00:44:26,319 --> 00:44:30,240
each person or object in a population
1206
00:44:30,240 --> 00:44:31,839
types of sampling
1207
00:44:31,839 --> 00:44:34,160
we have a probabilistic approach
1208
00:44:34,160 --> 00:44:35,920
selecting samples from a larger
1209
00:44:35,920 --> 00:44:38,319
population using a method based on the
1210
00:44:38,319 --> 00:44:40,880
theory of probability
1211
00:44:40,880 --> 00:44:42,079
and we'll go into a little bit more
1212
00:44:42,079 --> 00:44:44,000
deeper on these we have random
1213
00:44:44,000 --> 00:44:46,640
systematic stratified and then you have
1214
00:44:46,640 --> 00:44:49,040
a non-probabilistic approach selecting
1215
00:44:49,040 --> 00:44:51,520
samples based on the subjective judgment
1216
00:44:51,520 --> 00:44:53,760
of the researcher rather than random
1217
00:44:53,760 --> 00:44:55,119
selection
1218
00:44:55,119 --> 00:44:56,960
it has to do with convenience trying to
1219
00:44:56,960 --> 00:44:58,480
reach a quota
1220
00:44:58,480 --> 00:45:00,800
or snowball
1221
00:45:00,800 --> 00:45:02,560
and they're very biased that's one of
1222
00:45:02,560 --> 00:45:04,160
the reasons you'll see this big stamp on
1223
00:45:04,160 --> 00:45:06,160
that says biased so you gotta be very
1224
00:45:06,160 --> 00:45:08,079
careful on that
1225
00:45:08,079 --> 00:45:10,720
so probabilistic sampling uh when we
1226
00:45:10,720 --> 00:45:12,800
talk about a random sampling we select
1227
00:45:12,800 --> 00:45:15,040
random size samples from each group or
1228
00:45:15,040 --> 00:45:17,520
category so we it's as random as you can
1229
00:45:17,520 --> 00:45:21,440
get we talk about systematic sampling
1230
00:45:21,440 --> 00:45:23,760
we're selecting random size samples from
1231
00:45:23,760 --> 00:45:25,760
each group or category with a fixed
1232
00:45:25,760 --> 00:45:28,079
periodic interval
1233
00:45:28,079 --> 00:45:29,920
uh so we kind of split it up this would
1234
00:45:29,920 --> 00:45:31,599
be like a time set up or different
1235
00:45:31,599 --> 00:45:32,960
categories
1236
00:45:32,960 --> 00:45:34,400
and you might ask your question what is
1237
00:45:34,400 --> 00:45:37,040
a category or a group
1238
00:45:37,040 --> 00:45:38,640
if you look at i'm going to go back a
1239
00:45:38,640 --> 00:45:41,359
window let's say we're studying
1240
00:45:41,359 --> 00:45:44,800
economics of different of an area
1241
00:45:44,800 --> 00:45:47,680
we know pretty much that based on their
1242
00:45:47,680 --> 00:45:49,839
culture where they came from
1243
00:45:49,839 --> 00:45:52,560
they might need to be separated and so
1244
00:45:52,560 --> 00:45:54,960
uh and when i say separated i don't mean
1245
00:45:54,960 --> 00:45:56,640
separated from their
1246
00:45:56,640 --> 00:45:58,800
place where they live i mean as far as
1247
00:45:58,800 --> 00:46:00,400
the analysis we want to look at the
1248
00:46:00,400 --> 00:46:01,920
different groups and make sure they're
1249
00:46:01,920 --> 00:46:03,520
all represented
1250
00:46:03,520 --> 00:46:06,240
so if we had like an eighty percent uh
1251
00:46:06,240 --> 00:46:09,760
of a group that is uh say hispanic and
1252
00:46:09,760 --> 00:46:13,040
or indian and also in that same area we
1253
00:46:13,040 --> 00:46:15,760
have 20 percent who are
1254
00:46:15,760 --> 00:46:17,359
let's call our expatriates they left
1255
00:46:17,359 --> 00:46:19,520
america and they're nice and
1256
00:46:19,520 --> 00:46:22,000
your caucasian group we might want to
1257
00:46:22,000 --> 00:46:24,400
sample a group that is representative of
1258
00:46:24,400 --> 00:46:26,720
both uh so we're talking about
1259
00:46:26,720 --> 00:46:29,280
stratified sampling and we're talking
1260
00:46:29,280 --> 00:46:30,560
about groups those are the groups we're
1261
00:46:30,560 --> 00:46:32,079
talking about and it brings us to
1262
00:46:32,079 --> 00:46:33,599
stratified sampling selecting
1263
00:46:33,599 --> 00:46:35,760
approximately equal size samples from
1264
00:46:35,760 --> 00:46:38,160
each group or category
1265
00:46:38,160 --> 00:46:40,160
this way we can actually separate the
1266
00:46:40,160 --> 00:46:43,359
categories and give us an insight into
1267
00:46:43,359 --> 00:46:44,800
the different cultures and how that
1268
00:46:44,800 --> 00:46:47,119
might affect them in that area
1269
00:46:47,119 --> 00:46:49,040
so you can see these are very very
1270
00:46:49,040 --> 00:46:50,480
different kind of
1271
00:46:50,480 --> 00:46:52,720
depends on what you're working with
1272
00:46:52,720 --> 00:46:54,640
as far as your data and what you're
1273
00:46:54,640 --> 00:46:55,599
studying
1274
00:46:55,599 --> 00:46:57,520
and so we can see here just a little bit
1275
00:46:57,520 --> 00:46:59,440
more we'd have selecting 25 employees
1276
00:46:59,440 --> 00:47:02,240
from a company of 250 employees randomly
1277
00:47:02,240 --> 00:47:03,440
don't care anything about them what
1278
00:47:03,440 --> 00:47:05,200
groups are in which office they're in
1279
00:47:05,200 --> 00:47:06,800
nothing
1280
00:47:06,800 --> 00:47:08,560
and we might be selecting one employee
1281
00:47:08,560 --> 00:47:10,720
from every 50 unique employees in a
1282
00:47:10,720 --> 00:47:13,280
company of 250 employees
1283
00:47:13,280 --> 00:47:15,040
and then we have selecting one employee
1284
00:47:15,040 --> 00:47:17,359
from every branch in the company office
1285
00:47:17,359 --> 00:47:18,880
so we have all the different branches
1286
00:47:18,880 --> 00:47:20,960
there's our group or categories by the
1287
00:47:20,960 --> 00:47:23,040
branch and the category could depend on
1288
00:47:23,040 --> 00:47:25,040
what you're studying so it has a lot of
1289
00:47:25,040 --> 00:47:26,640
variation on there
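The three employee examples above can be sketched with Python's built-in random module. The employee list and the five branch names are invented for illustration, and the fixed seed is only there to make the run repeatable.

```python
import random

rng = random.Random(0)  # fixed seed so the sketch is repeatable

# Hypothetical company: 250 employees spread over 5 branches.
employees = [{"id": i, "branch": f"branch_{i % 5}"} for i in range(250)]

# Random sampling: 25 of the 250, ignoring group and office.
random_sample = rng.sample(employees, 25)

# Systematic sampling: one employee from every 50,
# using a fixed interval after a random starting position.
interval = 50
start = rng.randrange(interval)
systematic_sample = employees[start::interval]

# Stratified sampling: one employee from every branch.
by_branch = {}
for employee in employees:
    by_branch.setdefault(employee["branch"], []).append(employee)
stratified_sample = [rng.choice(members) for members in by_branch.values()]

print(len(random_sample), len(systematic_sample), len(stratified_sample))
```

The stratified version is the only one that guarantees every branch shows up in the sample.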
1290
00:47:26,640 --> 00:47:28,000
you see this kind of grouping and
1291
00:47:28,000 --> 00:47:30,400
categorizing is also used to generate a
1292
00:47:30,400 --> 00:47:33,359
lot of misinformation
1293
00:47:33,359 --> 00:47:35,520
so if you only study one group and you
1294
00:47:35,520 --> 00:47:37,359
say this is what it is
1295
00:47:37,359 --> 00:47:38,960
then everybody assumes that's what it is
1296
00:47:38,960 --> 00:47:40,480
for everybody and so you've got to be
1297
00:47:40,480 --> 00:47:41,680
very careful of that and it's very
1298
00:47:41,680 --> 00:47:44,400
unethical thing to kind of do
1299
00:47:44,400 --> 00:47:46,880
so types of statistics uh we talk about
1300
00:47:46,880 --> 00:47:48,160
statistics
1301
00:47:48,160 --> 00:47:50,240
we're going to talk about descriptive
1302
00:47:50,240 --> 00:47:52,880
and inferential statistics
1303
00:47:52,880 --> 00:47:54,720
there are so many different terms in
1304
00:47:54,720 --> 00:47:58,000
statistics to break it up uh so we so
1305
00:47:58,000 --> 00:48:00,240
we're talking about a particular
1306
00:48:00,240 --> 00:48:01,440
setup
1307
00:48:01,440 --> 00:48:03,200
so we're talking about descriptive and
1308
00:48:03,200 --> 00:48:05,920
inferential uh statistics
1309
00:48:05,920 --> 00:48:08,560
the base of the word describe
1310
00:48:08,560 --> 00:48:10,560
is pretty solid you're describing the
1311
00:48:10,560 --> 00:48:13,040
data what does it look like with
1312
00:48:13,040 --> 00:48:15,040
inferential statistics we're going to
1313
00:48:15,040 --> 00:48:17,119
take that from the small population to a
1314
00:48:17,119 --> 00:48:19,200
large population so if you're working
1315
00:48:19,200 --> 00:48:21,200
with a drug company you might look at
1316
00:48:21,200 --> 00:48:23,040
the data and say these people were
1317
00:48:23,040 --> 00:48:24,720
helped by this drug
1318
00:48:24,720 --> 00:48:25,599
they did
1319
00:48:25,599 --> 00:48:27,200
80 percent better
1320
00:48:27,200 --> 00:48:29,040
as far as their health or 80 percent
1321
00:48:29,040 --> 00:48:32,400
better survival rate than the people
1322
00:48:32,400 --> 00:48:34,160
who did not have the drug so we can
1323
00:48:34,160 --> 00:48:36,160
infer that that drug will work in the
1324
00:48:36,160 --> 00:48:38,400
greater populace and will help people so
1325
00:48:38,400 --> 00:48:40,400
that's where you get your inferential so
1326
00:48:40,400 --> 00:48:41,280
we are
1327
00:48:41,280 --> 00:48:42,880
predicting how it's going to affect the
1328
00:48:42,880 --> 00:48:44,880
greater population
1329
00:48:44,880 --> 00:48:46,880
so descriptive statistics it is used to
1330
00:48:46,880 --> 00:48:49,280
describe the basic features of data and
1331
00:48:49,280 --> 00:48:51,839
form the basis of quantitative analysis
1332
00:48:51,839 --> 00:48:53,280
of data
1333
00:48:53,280 --> 00:48:54,960
so we have a measure of central
1334
00:48:54,960 --> 00:48:57,119
tendencies we have your mean median and
1335
00:48:57,119 --> 00:48:58,240
mode
1336
00:48:58,240 --> 00:49:00,400
and then we have a measure of spread
1337
00:49:00,400 --> 00:49:02,880
like your range your interquartile range
1338
00:49:02,880 --> 00:49:04,319
your variance and your standard
1339
00:49:04,319 --> 00:49:05,839
deviation
1340
00:49:05,839 --> 00:49:06,960
and we're going to look at all these a
1341
00:49:06,960 --> 00:49:08,880
little deeper here in a second
1342
00:49:08,880 --> 00:49:12,640
but one of them you can think of is
1343
00:49:12,640 --> 00:49:14,960
how it the data difference
1344
00:49:14,960 --> 00:49:17,359
differences you know what's the max min
1345
00:49:17,359 --> 00:49:19,839
range all that stuff is your spread and
1346
00:49:19,839 --> 00:49:21,680
anything that's just a single number is
1347
00:49:21,680 --> 00:49:24,000
usually your central uh tendencies
1348
00:49:24,000 --> 00:49:26,160
measure of central tendencies
1349
00:49:26,160 --> 00:49:28,000
so we talk about the mean it is the
1350
00:49:28,000 --> 00:49:30,480
average of the set of values considered
1351
00:49:30,480 --> 00:49:32,480
what is the average outcome of whatever
1352
00:49:32,480 --> 00:49:33,920
is going on
1353
00:49:33,920 --> 00:49:35,520
and then your median
1354
00:49:35,520 --> 00:49:37,680
separates the higher half and the lower
1355
00:49:37,680 --> 00:49:40,400
half of data
1356
00:49:40,400 --> 00:49:42,400
so where's the center point of all your
1357
00:49:42,400 --> 00:49:45,760
different data points so your mean might
1358
00:49:45,760 --> 00:49:47,839
have some a couple really big numbers
1359
00:49:47,839 --> 00:49:49,440
that skew it
1360
00:49:49,440 --> 00:49:51,760
so that the average is much higher than
1361
00:49:51,760 --> 00:49:54,640
if you took those outliers out where the
1362
00:49:54,640 --> 00:49:57,359
median would by separating the high from
1363
00:49:57,359 --> 00:49:59,599
the low might give you a much lower
1364
00:49:59,599 --> 00:50:01,040
number you might look at and say oh
1365
00:50:01,040 --> 00:50:03,200
that's that's odd why is the average so
1366
00:50:03,200 --> 00:50:04,880
much higher than the median well it's
1367
00:50:04,880 --> 00:50:06,400
because you have some outliers or why is
1368
00:50:06,400 --> 00:50:07,839
it so much lower
1369
00:50:07,839 --> 00:50:09,599
and then the mode is the most frequent
1370
00:50:09,599 --> 00:50:11,200
appearing value
1371
00:50:11,200 --> 00:50:12,400
this is really interesting if you're
1372
00:50:12,400 --> 00:50:14,480
studying economics and how people are
1373
00:50:14,480 --> 00:50:16,319
doing you might find that the most
1374
00:50:16,319 --> 00:50:17,839
common
1375
00:50:17,839 --> 00:50:19,880
income like in the us was
1376
00:50:19,880 --> 00:50:22,880
1.24 000 a year
1377
00:50:22,880 --> 00:50:26,240
where the average was closer to 80 000
1378
00:50:26,240 --> 00:50:28,240
and it's like wow what a difference well
1379
00:50:28,240 --> 00:50:29,839
there's some people have a lot of money
1380
00:50:29,839 --> 00:50:32,160
and so that skews that way up so the
1381
00:50:32,160 --> 00:50:34,079
average person is not making that kind
1382
00:50:34,079 --> 00:50:36,000
of money and then you look at the median
1383
00:50:36,000 --> 00:50:37,359
income and you're like well the median
1384
00:50:37,359 --> 00:50:39,280
income is a little bit closer to the
1385
00:50:39,280 --> 00:50:41,200
average so it does create a very
1386
00:50:41,200 --> 00:50:43,520
interesting way of looking at the data
1387
00:50:43,520 --> 00:50:45,680
again these are all uh central
1388
00:50:45,680 --> 00:50:47,520
tendencies single numbers you can look
1389
00:50:47,520 --> 00:50:50,480
at for the whole spread of the data
1390
00:50:50,480 --> 00:50:52,640
and we look at the measure of central
1391
00:50:52,640 --> 00:50:54,880
tendencies the mean is the average marks
1392
00:50:54,880 --> 00:50:56,960
of a students in a classroom so here we
1393
00:50:56,960 --> 00:50:58,880
have the mean sum of the marks of the
1394
00:50:58,880 --> 00:51:01,280
students total number of students and as
1395
00:51:01,280 --> 00:51:03,359
we talked about the median
1396
00:51:03,359 --> 00:51:04,480
we have
1397
00:51:04,480 --> 00:51:07,040
0 through 10 and we take half the
1398
00:51:07,040 --> 00:51:08,400
numbers and put them on one side of the
1399
00:51:08,400 --> 00:51:09,920
line half the numbers on the other side
1400
00:51:09,920 --> 00:51:12,079
of the line uh we end up with five in
1401
00:51:12,079 --> 00:51:14,000
the middle and then the mode what mark
1402
00:51:14,000 --> 00:51:16,400
was scored by most of the students in a
1403
00:51:16,400 --> 00:51:17,680
test
1404
00:51:17,680 --> 00:51:19,680
in a simple case where most people
1405
00:51:19,680 --> 00:51:21,839
scored like an 82 percent and got
1406
00:51:21,839 --> 00:51:24,400
certain problems wrong easy to figure
1407
00:51:24,400 --> 00:51:27,440
out uh not so easy when you have
1408
00:51:27,440 --> 00:51:29,359
different areas where like you have like
1409
00:51:29,359 --> 00:51:31,680
the um oh let's go back to economy a
1410
00:51:31,680 --> 00:51:33,280
little bit more difficult to calculate
1411
00:51:33,280 --> 00:51:34,880
if you have a large group that scores
1412
00:51:34,880 --> 00:51:36,720
that makes 30 000
1413
00:51:36,720 --> 00:51:38,480
and a slightly bigger group that makes
1414
00:51:38,480 --> 00:51:40,880
26 000 so what do you put down for the
1415
00:51:40,880 --> 00:51:42,800
mode uh certainly there's a number of
1416
00:51:42,800 --> 00:51:44,319
ways to calculate that and there's
1417
00:51:44,319 --> 00:51:45,599
actually a different variations
1418
00:51:45,599 --> 00:51:47,280
depending on what you're doing
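A small sketch of mean, median and mode using Python's statistics module. The income figures are made up to show how a single outlier pulls the mean above the median, and the last list shows the tie case, where multimode returns every value that is most common.

```python
from statistics import mean, median, mode, multimode

# Marks 0 through 10: half the values sit on each side of the median.
marks = list(range(11))
print("mean:", mean(marks), "median:", median(marks))

# Hypothetical incomes with one very large outlier.
incomes = [24_000] * 4 + [30_000] * 3 + [45_000, 60_000, 1_000_000]
print("mean income:  ", mean(incomes))    # pulled up by the outlier
print("median income:", median(incomes))  # middle of the sorted values
print("mode income:  ", mode(incomes))    # most frequent value

# When two values are equally common there is more than one mode.
tied = [26_000, 26_000, 30_000, 30_000]
print("modes:", multimode(tied))
```

Comparing the mean against the median is a quick first check for outliers in a column of data.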
1419
00:51:47,280 --> 00:51:49,040
so now we're looking at a measure of
1420
00:51:49,040 --> 00:51:51,359
spread uh range what's the difference
1421
00:51:51,359 --> 00:51:53,599
between the highest and the lowest value
1422
00:51:53,599 --> 00:51:55,200
first thing you want to look at you know
1423
00:51:55,200 --> 00:51:56,559
it's uh we had everybody in the test
1424
00:51:56,559 --> 00:51:59,440
scored between 60 and 100 so we got 100
1425
00:51:59,440 --> 00:52:02,800
or maybe 60 to 90 it was so hard a lot
1426
00:52:02,800 --> 00:52:04,000
of people could not get a hundred
1427
00:52:04,000 --> 00:52:05,839
percent
1428
00:52:05,839 --> 00:52:06,920
you have your
1429
00:52:06,920 --> 00:52:10,079
inter-quartile range quartiles divide a
1430
00:52:10,079 --> 00:52:12,720
rank ordered data set into four equal
1431
00:52:12,720 --> 00:52:14,319
parts
1432
00:52:14,319 --> 00:52:16,319
very common thing to do as part of all
1433
00:52:16,319 --> 00:52:18,160
the basic packages whether you're
1434
00:52:18,160 --> 00:52:20,160
working in
1435
00:52:20,160 --> 00:52:22,319
data frames with pandas whether you're
1436
00:52:22,319 --> 00:52:24,079
working in scala whether you're working
1437
00:52:24,079 --> 00:52:25,440
in r
1438
00:52:25,440 --> 00:52:26,800
you'll see this come up where they have
1439
00:52:26,800 --> 00:52:29,280
range your min your max and then it'll
1440
00:52:29,280 --> 00:52:31,440
have your interquartile range what does
1441
00:52:31,440 --> 00:52:33,599
it look like in each quarter of data
1442
00:52:33,599 --> 00:52:36,240
variance measures how far each number in
1443
00:52:36,240 --> 00:52:38,640
the set is from the mean and therefore
1444
00:52:38,640 --> 00:52:41,760
from every other number in the set
1445
00:52:41,760 --> 00:52:43,520
so you have like how much turbulence is
1446
00:52:43,520 --> 00:52:44,839
going on in this
1447
00:52:44,839 --> 00:52:48,079
data and then the standard deviation
1448
00:52:48,079 --> 00:52:49,839
it is a measure of the variance or the
1449
00:52:49,839 --> 00:52:51,680
dispersion of a set of values from the
1450
00:52:51,680 --> 00:52:52,800
mean
1451
00:52:52,800 --> 00:52:55,200
and you'll usually see uh if i'm doing a
1452
00:52:55,200 --> 00:52:58,160
graph i might have the value graphed
1453
00:52:58,160 --> 00:53:00,880
and then based on the the error i might
1454
00:53:00,880 --> 00:53:03,040
graph the standard deviation in
1455
00:53:03,040 --> 00:53:04,800
the error on the graph as a background
1456
00:53:04,800 --> 00:53:07,200
so you can see how far off it is
1457
00:53:07,200 --> 00:53:10,160
uh so standard deviation is used a lot
1458
00:53:10,160 --> 00:53:12,480
so measurement of spread uh marks of a
1459
00:53:12,480 --> 00:53:15,200
student out of 100 we have here from 50
1460
00:53:15,200 --> 00:53:18,319
to 63 or 50 to 90.
1461
00:53:18,319 --> 00:53:20,800
so the range maximum marks minimum marks
1462
00:53:20,800 --> 00:53:23,200
we have 90 to 45 and the spread of that
1463
00:53:23,200 --> 00:53:26,559
is 45 90 minus 45. and then we have the
1464
00:53:26,559 --> 00:53:28,400
interquartile range
1465
00:53:28,400 --> 00:53:30,480
using the same marks over there you can
1466
00:53:30,480 --> 00:53:32,400
see here where the median is
1467
00:53:32,400 --> 00:53:34,960
and then there's the first quarter the
1468
00:53:34,960 --> 00:53:36,960
second quarter and the third quarter
1469
00:53:36,960 --> 00:53:38,640
based on splitting it apart by those
1470
00:53:38,640 --> 00:53:39,839
values
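A minimal sketch of the range and interquartile range just described, using only Python's standard library. The marks list is made up for illustration (the slide's actual values are not in the transcript), and statistics.quantiles follows one of several quartile conventions, so other packages may report slightly different cut points.

```python
import statistics

# Hypothetical marks out of 100 (not the values from the slide).
marks = [45, 52, 58, 63, 67, 71, 78, 84, 90]

# Range: highest value minus lowest value.
value_range = max(marks) - min(marks)

# Quartiles: three cut points that split the sorted data into four parts.
q1, q2, q3 = statistics.quantiles(marks, n=4)

# Interquartile range: the spread of the middle half of the data.
iqr = q3 - q1

print("range:", value_range)
print("quartiles:", q1, q2, q3)
print("IQR:", iqr)
```

The middle cut point q2 is the median, which is why the median sits between the first and third quartile on the slide.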
1471
00:53:39,839 --> 00:53:41,280
and to understand the variance and
1472
00:53:41,280 --> 00:53:43,440
standard deviation we first need to find
1473
00:53:43,440 --> 00:53:46,079
out the mean uh so here's our our you
1474
00:53:46,079 --> 00:53:48,319
know calculating the average there we
1475
00:53:48,319 --> 00:53:50,400
end up at approximately 66 for the
1476
00:53:50,400 --> 00:53:52,319
average and then we look at that the
1477
00:53:52,319 --> 00:53:54,160
variance once we know the means we can
1478
00:53:54,160 --> 00:53:56,240
do equals the marks minus the mean
1479
00:53:56,240 --> 00:53:57,520
squared
1480
00:53:57,520 --> 00:53:59,599
why is it squared
1481
00:53:59,599 --> 00:54:01,680
because one you want to make sure it's
1482
00:54:01,680 --> 00:54:04,079
you don't have like if you if you're
1483
00:54:04,079 --> 00:54:05,760
putting all this stuff together you end
1484
00:54:05,760 --> 00:54:07,839
up with an error as far as one's
1485
00:54:07,839 --> 00:54:09,359
negative one's positive one's a little
1486
00:54:09,359 --> 00:54:11,280
higher one's a little lower
1487
00:54:11,280 --> 00:54:12,720
so you always see
1488
00:54:12,720 --> 00:54:14,880
the squared value and over the total
1489
00:54:14,880 --> 00:54:16,319
observations
1490
00:54:16,319 --> 00:54:18,559
and so the standard deviation equals the
1491
00:54:18,559 --> 00:54:20,400
square root of the variance which is
1492
00:54:20,400 --> 00:54:22,640
approximately 16.
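The variance and standard deviation steps above can be sketched by hand in a few lines. The marks are hypothetical, so the numbers will not match the 66 and 16 quoted from the slide. This follows the "over the total observations" version (population variance); many packages, including pandas .std(), divide by n minus 1 by default instead.

```python
import math
import statistics

# Hypothetical marks out of 100 (not the values from the slide).
marks = [45, 52, 58, 63, 67, 71, 78, 84, 90]

# Step 1: the mean.
mean = sum(marks) / len(marks)

# Step 2: squared distance of each mark from the mean.
# Squaring keeps positive and negative differences from cancelling out.
squared_diffs = [(m - mean) ** 2 for m in marks]

# Step 3: variance is the average of the squared differences.
variance = sum(squared_diffs) / len(marks)

# Step 4: standard deviation is the square root of the variance.
std_dev = math.sqrt(variance)

print("mean:", mean)
print("variance:", variance)
print("standard deviation:", std_dev)
```

Without the square, the raw differences from the mean always sum to zero, which is the reason the speaker gives for squaring.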
1493
00:54:22,640 --> 00:54:24,640
and if you were looking at
1494
00:54:24,640 --> 00:54:26,559
a predictive model you would be looking
1495
00:54:26,559 --> 00:54:29,680
at the deviation based on the error how
1496
00:54:29,680 --> 00:54:31,839
much error does it have
1497
00:54:31,839 --> 00:54:33,200
that's again
1498
00:54:33,200 --> 00:54:35,040
really important to know if your if your
1499
00:54:35,040 --> 00:54:37,119
prediction is predicting something
1500
00:54:37,119 --> 00:54:39,200
what's the chance of it being way off or
1501
00:54:39,200 --> 00:54:42,000
just a little bit off
1502
00:54:42,000 --> 00:54:44,000
now that we've looked at the
1503
00:54:44,000 --> 00:54:46,240
tools as far as some of the basics for
1504
00:54:46,240 --> 00:54:47,839
doing your statistics and we're talking
1505
00:54:47,839 --> 00:54:48,800
about
1506
00:54:48,800 --> 00:54:51,119
let's go ahead and pull up a little demo
1507
00:54:51,119 --> 00:54:52,319
and show you what that looks like in
1508
00:54:52,319 --> 00:54:53,760
python code
1509
00:54:53,760 --> 00:54:55,599
so you can get some little hands on here
1510
00:54:55,599 --> 00:54:57,520
for that let's go back into our jupyter
1511
00:54:57,520 --> 00:55:00,079
notebook and python now almost all of
1512
00:55:00,079 --> 00:55:02,880
this you can do in numpy last time we
1513
00:55:02,880 --> 00:55:05,040
worked in numpy this time we're going to
1514
00:55:05,040 --> 00:55:06,880
go ahead and use pandas
1515
00:55:06,880 --> 00:55:10,079
and if you remember from pandas on here
1516
00:55:10,079 --> 00:55:12,960
this is basically a data frame rows
1517
00:55:12,960 --> 00:55:15,119
columns let's just go ahead and do a
1518
00:55:15,119 --> 00:55:16,079
print
1519
00:55:16,079 --> 00:55:18,800
df.head
1520
00:55:18,800 --> 00:55:21,040
and run that
1521
00:55:21,040 --> 00:55:23,520
and you can see we have the name jane
1522
00:55:23,520 --> 00:55:25,119
michael william rosie hannah sat in
1523
00:55:25,119 --> 00:55:27,280
their salaries on here and of course
1524
00:55:27,280 --> 00:55:29,280
instead of having to do all those hand
1525
00:55:29,280 --> 00:55:31,280
calculations and add everything together
1526
00:55:31,280 --> 00:55:32,960
and divide by the total
1527
00:55:32,960 --> 00:55:35,440
we can do something very simple on this
1528
00:55:35,440 --> 00:55:36,160
like
1529
00:55:36,160 --> 00:55:39,280
use the command mean in pandas and so if
1530
00:55:39,280 --> 00:55:41,520
i go ahead and do this print df
1531
00:55:41,520 --> 00:55:43,440
pick our column salary because we want
1532
00:55:43,440 --> 00:55:46,480
to find the mean of that salary
1533
00:55:46,480 --> 00:55:49,359
we want to find the means of that column
1534
00:55:49,359 --> 00:55:50,960
and we go and print this out and you can
1535
00:55:50,960 --> 00:55:52,480
see that the
1536
00:55:52,480 --> 00:55:56,799
average income on here is 71 000.
1537
00:55:56,799 --> 00:55:58,000
and let's just go ahead and do this
1538
00:55:58,000 --> 00:55:59,680
we'll go ahead and put in
1539
00:55:59,680 --> 00:56:02,680
means
1540
00:56:03,280 --> 00:56:04,720
and if we're going to do that we also
1541
00:56:04,720 --> 00:56:08,559
might want to find the median
1542
00:56:09,040 --> 00:56:11,440
and the median is
1543
00:56:11,440 --> 00:56:13,359
very similar
1544
00:56:13,359 --> 00:56:15,599
except it actually is just median we're
1545
00:56:15,599 --> 00:56:17,440
used to mean and average it's kind of
1546
00:56:17,440 --> 00:56:18,720
interesting that those are the use of
1547
00:56:18,720 --> 00:56:20,160
two different words
1548
00:56:20,160 --> 00:56:23,359
uh there can be in some computation
1549
00:56:23,359 --> 00:56:25,440
slight differences but for the most part
1550
00:56:25,440 --> 00:56:27,920
the means is the average uh and then the
1551
00:56:27,920 --> 00:56:29,119
median
1552
00:56:29,119 --> 00:56:31,839
oops let's put a
1553
00:56:32,880 --> 00:56:34,319
median here
1554
00:56:34,319 --> 00:56:35,920
df salary that way it displays
1555
00:56:35,920 --> 00:56:38,160
a little better we can see the median is
1556
00:56:38,160 --> 00:56:39,839
54
1557
00:56:39,839 --> 00:56:42,400
000 so the halfway mark is significantly
1558
00:56:42,400 --> 00:56:44,799
below the average why because we have
1559
00:56:44,799 --> 00:56:48,000
somebody in here makes 189 000. darn you
1560
00:56:48,000 --> 00:56:50,480
rosie for throwing off our numbers
1561
00:56:50,480 --> 00:56:51,359
but that's something you'd want to
1562
00:56:51,359 --> 00:56:53,520
notice this is this is the difference
1563
00:56:53,520 --> 00:56:56,079
between these is huge and so is what is
1564
00:56:56,079 --> 00:56:57,280
the meaning behind that when you're
1565
00:56:57,280 --> 00:56:59,359
studying a populace and looking at
1566
00:56:59,359 --> 00:57:01,520
the different data coming in and of
1567
00:57:01,520 --> 00:57:03,280
course we also want to find out hey
1568
00:57:03,280 --> 00:57:04,480
what's the most
1569
00:57:04,480 --> 00:57:05,920
common
1570
00:57:05,920 --> 00:57:08,000
income that people make
1571
00:57:08,000 --> 00:57:10,000
in this little tiny sample and so we'll
1572
00:57:10,000 --> 00:57:12,480
go ahead and do the mode and you can see
1573
00:57:12,480 --> 00:57:14,160
here with the mode
1574
00:57:14,160 --> 00:57:16,720
it's at 50 000.
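A rough sketch of the three calls used so far. The demo's actual data frame is not shown in the transcript, so the seven salaries below are invented and were picked only so that they reproduce the quoted figures (mean 71 000, median 54 000, mode 50 000). The column name Salary is also an assumption.

```python
import pandas as pd

# Hypothetical salaries, chosen to match the statistics quoted in the demo.
df = pd.DataFrame(
    {"Salary": [50000, 54000, 50000, 189000, 55000, 40000, 59000]}
)

mean_salary = df["Salary"].mean()      # sum of the values divided by the count
median_salary = df["Salary"].median()  # middle value of the sorted column
# mode() returns a Series because several values can tie; [0] takes the first.
mode_salary = df["Salary"].mode()[0]

print("mean:", mean_salary)
print("median:", median_salary)
print("mode:", mode_salary)
```

The single 189 000 entry is what pulls the mean so far above the median.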
1575
00:57:16,720 --> 00:57:18,640
so this is this is very telling that
1576
00:57:18,640 --> 00:57:21,040
most people are making 50 000
1577
00:57:21,040 --> 00:57:24,160
the middle point is at 54 000. so half
1578
00:57:24,160 --> 00:57:26,240
the people are making more than that
1579
00:57:26,240 --> 00:57:28,960
what that tells me is that if the most
1580
00:57:28,960 --> 00:57:31,359
common income is below the
1581
00:57:31,359 --> 00:57:32,559
median
1582
00:57:32,559 --> 00:57:33,839
then
1583
00:57:33,839 --> 00:57:35,599
there's a skew there's a
1584
00:57:35,599 --> 00:57:37,599
lot of high salaries going up but
1585
00:57:37,599 --> 00:57:39,520
there's some really low salaries in
1586
00:57:39,520 --> 00:57:42,079
there and so this trend which is very
1587
00:57:42,079 --> 00:57:44,319
common in statistics when you're
1588
00:57:44,319 --> 00:57:45,599
analyzing
1589
00:57:45,599 --> 00:57:48,000
the economy in different people's income
1590
00:57:48,000 --> 00:57:49,680
is pretty common and the bigger
1591
00:57:49,680 --> 00:57:51,839
difference between these is also
1592
00:57:51,839 --> 00:57:53,119
very important when we're studying
1593
00:57:53,119 --> 00:57:55,040
statistics
1594
00:57:55,040 --> 00:57:56,480
and when you hear someone just say hey
1595
00:57:56,480 --> 00:57:58,960
the average income was you might start
1596
00:57:58,960 --> 00:58:00,559
asking questions at that point why
1597
00:58:00,559 --> 00:58:02,000
aren't you talking about the median
1598
00:58:02,000 --> 00:58:03,599
income why aren't you talking about the
1599
00:58:03,599 --> 00:58:05,920
mode the most common income what are you
1600
00:58:05,920 --> 00:58:07,359
hiding
1601
00:58:07,359 --> 00:58:08,799
and if you're doing these analyses you
1602
00:58:08,799 --> 00:58:10,240
should be looking at these saying hey
1603
00:58:10,240 --> 00:58:11,760
why are these discrepancies why are
1604
00:58:11,760 --> 00:58:13,520
these so different and of course with
1605
00:58:13,520 --> 00:58:15,920
any analysis it's important to find out
1606
00:58:15,920 --> 00:58:17,760
the minimum
1607
00:58:17,760 --> 00:58:20,240
and the maximum so we'll go ahead it's
1608
00:58:20,240 --> 00:58:22,880
just simply uh
1609
00:58:22,880 --> 00:58:24,240
dot min
1610
00:58:24,240 --> 00:58:26,559
it'll pull up your minimum and then dot
1611
00:58:26,559 --> 00:58:28,640
max pulls up the maximum
1612
00:58:28,640 --> 00:58:32,559
pretty straightforward on as far as um
1613
00:58:32,559 --> 00:58:34,559
translating it and knowing which you
1614
00:58:34,559 --> 00:58:36,480
know put the your lowest value which
1615
00:58:36,480 --> 00:58:38,480
your highest value is here
1616
00:58:38,480 --> 00:58:40,640
um which you'll use to generate like a
1617
00:58:40,640 --> 00:58:43,359
spread later on and real quick on the
1618
00:58:43,359 --> 00:58:46,160
mode uh note that it puts mode zero like
1619
00:58:46,160 --> 00:58:47,440
i said there's a couple different ways
1620
00:58:47,440 --> 00:58:50,079
you can compute the mode
1621
00:58:50,079 --> 00:58:51,920
although the standard one is pretty good
1622
00:58:51,920 --> 00:58:53,839
we can of course do the range
1623
00:58:53,839 --> 00:58:56,480
which is your max minus your min so now
1624
00:58:56,480 --> 00:58:59,760
we have a range of 149 000 between the
1625
00:58:59,760 --> 00:59:02,240
upper end and the lower end and you
1626
00:59:02,240 --> 00:59:03,280
might want to be looking up the
1627
00:59:03,280 --> 00:59:05,760
individual values on all of these but it
1628
00:59:05,760 --> 00:59:09,839
turns out there is a describe
1629
00:59:09,839 --> 00:59:12,400
feature in pandas
1630
00:59:12,400 --> 00:59:14,559
and so in pandas we can actually do df
1631
00:59:14,559 --> 00:59:16,960
salary describe and if we do this you
1632
00:59:16,960 --> 00:59:19,520
can see we have that there's seven uh
1633
00:59:19,520 --> 00:59:21,760
setups here's our mean
1634
00:59:21,760 --> 00:59:23,520
um our standard deviation which we
1635
00:59:23,520 --> 00:59:25,200
didn't compute yet which would just be a
1636
00:59:25,200 --> 00:59:26,799
dot std
1637
00:59:26,799 --> 00:59:27,839
and you gotta be a little careful
1638
00:59:27,839 --> 00:59:29,599
because when it computes it it looks for
1639
00:59:29,599 --> 00:59:31,520
axes and things like that
1640
00:59:31,520 --> 00:59:33,280
we have our minimum value and here's our
1641
00:59:33,280 --> 00:59:35,520
quartiles
1642
00:59:35,520 --> 00:59:37,200
our maximum value and then of course the
1643
00:59:37,200 --> 00:59:38,720
name salary
1644
00:59:38,720 --> 00:59:40,079
so these are the these are the basic
1645
00:59:40,079 --> 00:59:41,599
statistics you can pull them up and like
1646
00:59:41,599 --> 00:59:43,119
just describe
1647
00:59:43,119 --> 00:59:45,520
this works like a dictionary so i could actually
1648
00:59:45,520 --> 00:59:48,079
do something like
1649
00:59:48,079 --> 00:59:51,040
in here i could actually go uh count
1650
00:59:51,040 --> 00:59:52,319
and run
1651
00:59:52,319 --> 00:59:54,799
and now it just prints the count
1652
00:59:54,799 --> 00:59:56,319
so because this works like a dictionary you can
1653
00:59:56,319 --> 00:59:59,119
pull any one of these values out of here
1654
00:59:59,119 --> 01:00:00,720
it's kind of a quick and dirty way to
1655
01:00:00,720 --> 01:00:02,640
pull all the different information and
1656
01:00:02,640 --> 01:00:04,160
then split it up and depending on what
1657
01:00:04,160 --> 01:00:05,200
you need
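A small sketch of min, max, range and describe on a salary column. The salaries are hypothetical values chosen to match the figures quoted in the demo, and the column name Salary is an assumption. Note that describe returns a pandas Series whose index labels are the statistic names, which is why it can be indexed the way a dictionary would be.

```python
import pandas as pd

# Hypothetical salaries, chosen to match the statistics quoted in the demo.
df = pd.DataFrame(
    {"Salary": [50000, 54000, 50000, 189000, 55000, 40000, 59000]}
)
salary = df["Salary"]

# Range is simply the maximum minus the minimum.
salary_range = salary.max() - salary.min()

# describe() bundles count, mean, std, min, quartiles and max in one Series.
stats = salary.describe()

print("range:", salary_range)
print(stats)

# Pull out individual entries by label.
print("count:", stats["count"])
print("median (50% quartile):", stats["50%"])
```

The std entry uses the sample formula (dividing by n minus 1), the same default as calling .std() on the column directly.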
1658
01:00:05,200 --> 01:00:06,960
now if i just walked in and gave you
1659
01:00:06,960 --> 01:00:08,640
this information
1660
01:00:08,640 --> 01:00:10,160
in a meeting
1661
01:00:10,160 --> 01:00:12,240
at some point you would just kind of
1662
01:00:12,240 --> 01:00:14,799
fall asleep that's what i would do
1663
01:00:14,799 --> 01:00:15,760
anyway
1664
01:00:15,760 --> 01:00:16,720
um
1665
01:00:16,720 --> 01:00:18,400
so we want to go ahead and see about
1666
01:00:18,400 --> 01:00:20,079
graphing it here and we'll go ahead and
1667
01:00:20,079 --> 01:00:22,799
put it into a histogram and plot that
1668
01:00:22,799 --> 01:00:24,400
graph on it
1669
01:00:24,400 --> 01:00:26,079
of the salaries and let's just go ahead
1670
01:00:26,079 --> 01:00:28,480
and put that in here so
1671
01:00:28,480 --> 01:00:30,400
we do our matplotlib inline remember
1672
01:00:30,400 --> 01:00:32,400
that's a jupyter notebook thing
1673
01:00:32,400 --> 01:00:34,319
a lot of the new versions of
1674
01:00:34,319 --> 01:00:36,640
matplotlib do it automatically
1675
01:00:36,640 --> 01:00:38,240
but just in case i always put it in
1676
01:00:38,240 --> 01:00:40,720
there import matplotlib.pyplot as
1677
01:00:40,720 --> 01:00:43,920
plt that's my plotting
1678
01:00:43,920 --> 01:00:46,079
and then we have our data frame i don't
1679
01:00:46,079 --> 01:00:47,440
i guess i really don't need to respell
1680
01:00:47,440 --> 01:00:48,720
the data frame
1681
01:00:48,720 --> 01:00:50,160
maybe we could just remind ourselves
1682
01:00:50,160 --> 01:00:52,000
what's in it so we'll go ahead and just
1683
01:00:52,000 --> 01:00:53,680
print
1684
01:00:53,680 --> 01:00:54,720
df
1685
01:00:54,720 --> 01:00:56,400
that way we still have it
1686
01:00:56,400 --> 01:00:58,799
and then we have our salary df salary
1687
01:00:58,799 --> 01:01:01,040
salary.plot hist title salary
1688
01:01:01,040 --> 01:01:03,599
distribution color gray
1689
01:01:03,599 --> 01:01:07,520
uh plt axvline salary the mean value
1690
01:01:07,520 --> 01:01:10,240
so we're going to take the mean value
1691
01:01:10,240 --> 01:01:11,760
color violet
1692
01:01:11,760 --> 01:01:14,079
line style dash this is just all making
1693
01:01:14,079 --> 01:01:15,200
it pretty
1694
01:01:15,200 --> 01:01:18,079
uh what color dashed line line width of
1695
01:01:18,079 --> 01:01:19,839
2 that kind of thing
1696
01:01:19,839 --> 01:01:21,599
and the median and let's go ahead and
1697
01:01:21,599 --> 01:01:22,799
run this just so you can see what we're
1698
01:01:22,799 --> 01:01:25,359
talking about
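The plotting cell being read out is roughly the following. This is a reconstruction from the spoken description, not the exact notebook code: the salaries are hypothetical, the column name Salary is assumed, and the median line's styling is a guess apart from its red color. Inside a Jupyter notebook you would use %matplotlib inline instead of selecting the Agg backend.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so this also runs outside a notebook
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical salaries, chosen to match the statistics quoted in the demo.
df = pd.DataFrame(
    {"Salary": [50000, 54000, 50000, 189000, 55000, 40000, 59000]}
)

# Histogram of the salary column; the call returns the matplotlib Axes.
ax = df["Salary"].plot.hist(title="Salary Distribution", color="gray")

# Vertical marker lines: violet for the mean, red for the median.
ax.axvline(df["Salary"].mean(), color="violet", linestyle="dashed", linewidth=2)
ax.axvline(df["Salary"].median(), color="red", linestyle="dashed", linewidth=2)

# In a notebook the figure renders inline; elsewhere it can be saved with plt.savefig.
```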
1699
01:01:25,359 --> 01:01:29,119
and so up here we are taking on our plot
1700
01:01:29,119 --> 01:01:31,680
um so here's the data here's our our
1701
01:01:31,680 --> 01:01:33,280
data frame print it out so you can see
1702
01:01:33,280 --> 01:01:35,200
it with the salaries we'll look at the
1703
01:01:35,200 --> 01:01:37,520
salary distribution and just look at
1704
01:01:37,520 --> 01:01:41,200
this the way the salary is distributed
1705
01:01:41,200 --> 01:01:42,480
you have our
1706
01:01:42,480 --> 01:01:44,160
in this case we did
1707
01:01:44,160 --> 01:01:47,599
let's see we had red for the median
1708
01:01:47,599 --> 01:01:49,760
we have violet
1709
01:01:49,760 --> 01:01:52,640
for our average or mean
1710
01:01:52,640 --> 01:01:54,880
and you can just see how it really
1711
01:01:54,880 --> 01:01:56,480
i mean here's our outlier here's our
1712
01:01:56,480 --> 01:01:58,799
person who makes a lot of money here's
1713
01:01:58,799 --> 01:01:59,599
the
1714
01:01:59,599 --> 01:02:02,559
average and here's the median
1715
01:02:02,559 --> 01:02:04,000
and so as you look at this you can say
1716
01:02:04,000 --> 01:02:05,119
wow
1717
01:02:05,119 --> 01:02:06,720
based on the average it really doesn't
1718
01:02:06,720 --> 01:02:07,920
tell you much about what people are
1719
01:02:07,920 --> 01:02:09,839
really taking home all it does is tell
1720
01:02:09,839 --> 01:02:10,559
you
1721
01:02:10,559 --> 01:02:12,720
how much money is in this you know what
1722
01:02:12,720 --> 01:02:14,880
the average salary is
1723
01:02:14,880 --> 01:02:16,000
so
1724
01:02:16,000 --> 01:02:17,520
some of the things you want to take away
1725
01:02:17,520 --> 01:02:20,160
in addition to this is that it's very
1726
01:02:20,160 --> 01:02:22,720
easy to plot
1727
01:02:22,720 --> 01:02:24,640
an axvline
1728
01:02:24,640 --> 01:02:26,240
these are these up and down lines for
1729
01:02:26,240 --> 01:02:28,720
your markers
1730
01:02:28,720 --> 01:02:30,480
and as you just display the data i mean
1731
01:02:30,480 --> 01:02:31,839
you can add all kinds of things to this
1732
01:02:31,839 --> 01:02:33,680
and get really complicated keeping it
1733
01:02:33,680 --> 01:02:35,119
simple is pretty straightforward i look
1734
01:02:35,119 --> 01:02:36,960
at this and i can see we have a major
1735
01:02:36,960 --> 01:02:38,880
outlier out here we can definitely do a
1736
01:02:38,880 --> 01:02:41,280
histogram and stuff like that
1737
01:02:41,280 --> 01:02:42,799
but you know a picture's worth a thousand
1738
01:02:42,799 --> 01:02:43,680
words
1739
01:02:43,680 --> 01:02:44,799
what you really want to make sure you
1740
01:02:44,799 --> 01:02:47,520
take away is that we can do a basic
1741
01:02:47,520 --> 01:02:48,960
describe
1742
01:02:48,960 --> 01:02:51,200
which pulls all this information out and
1743
01:02:51,200 --> 01:02:52,880
we can print any of the individual
1744
01:02:52,880 --> 01:02:55,200
information from the describe
1745
01:02:55,200 --> 01:02:58,400
because it is a pandas series indexed like a dictionary
1746
01:02:58,880 --> 01:03:00,720
and so if we want to go ahead and look
1747
01:03:00,720 --> 01:03:02,559
up
1748
01:03:02,559 --> 01:03:04,559
the mean value we can also do describe
1749
01:03:04,559 --> 01:03:05,760
mean so if you're doing a lot of
1750
01:03:05,760 --> 01:03:07,359
statistics
1751
01:03:07,359 --> 01:03:10,160
being able to
1752
01:03:10,240 --> 01:03:11,440
doesn't have the print on there so it's
1753
01:03:11,440 --> 01:03:13,280
only going to print the last one which
1754
01:03:13,280 --> 01:03:14,960
happens to be the mean
1755
01:03:14,960 --> 01:03:16,559
you can very easily reference any one of
1756
01:03:16,559 --> 01:03:18,720
these and then you can also if you're
1757
01:03:18,720 --> 01:03:19,760
doing something a little bit more
1758
01:03:19,760 --> 01:03:21,520
complicated and you don't need just the
1759
01:03:21,520 --> 01:03:24,559
basics you can come through and pull any
1760
01:03:24,559 --> 01:03:27,760
one of the individual
1761
01:03:28,240 --> 01:03:30,960
references from the from the pandas on
1762
01:03:30,960 --> 01:03:33,680
here so now we've had a chance to
1763
01:03:33,680 --> 01:03:35,839
describe our data
1764
01:03:35,839 --> 01:03:39,039
let's get into inferential statistics
1765
01:03:39,039 --> 01:03:40,799
inferential statistics allows you to
1766
01:03:40,799 --> 01:03:44,880
make predictions or inferences from data
1767
01:03:44,880 --> 01:03:46,319
and you can see here we have a nice
1768
01:03:46,319 --> 01:03:49,760
little picture movie ratings and
1769
01:03:49,760 --> 01:03:52,000
if we took this group of people and said
1770
01:03:52,000 --> 01:03:53,520
hey how many people like the movie
1771
01:03:53,520 --> 01:03:55,920
dislike it can't say and then you ask
1772
01:03:55,920 --> 01:03:57,520
just a random person who comes out of
1773
01:03:57,520 --> 01:04:00,079
the movie who hasn't been in this study
1774
01:04:00,079 --> 01:04:03,039
you can infer that 55 percent chance of
1775
01:04:03,039 --> 01:04:04,400
saying liked
1776
01:04:04,400 --> 01:04:07,440
35 percent chance of saying disliked or a 10 or
1777
01:04:07,440 --> 01:04:09,760
11 percent chance of can't say
1778
01:04:09,760 --> 01:04:11,520
so that's real basics of what we're
1779
01:04:11,520 --> 01:04:13,200
talking about is you're going to infer
1780
01:04:13,200 --> 01:04:14,799
that the next person is going to follow
1781
01:04:14,799 --> 01:04:17,760
these statistics
1782
01:04:18,319 --> 01:04:20,960
so let's look at point estimation
1783
01:04:20,960 --> 01:04:22,480
it is a process of finding an
1784
01:04:22,480 --> 01:04:24,559
approximate value for a population's
1785
01:04:24,559 --> 01:04:26,720
parameter like mean
1786
01:04:26,720 --> 01:04:28,640
or average from random samples of the
1787
01:04:28,640 --> 01:04:30,799
population let's take an example of
1788
01:04:30,799 --> 01:04:33,440
testing vaccines for covid-19
1789
01:04:33,440 --> 01:04:35,680
vaccines and flu bugs all that it's a
1790
01:04:35,680 --> 01:04:37,200
pretty big thing of how do you test
1791
01:04:37,200 --> 01:04:38,559
these out and make sure they're going to
1792
01:04:38,559 --> 01:04:40,400
work on the populace
1793
01:04:40,400 --> 01:04:42,319
a group of people are chosen from the
1794
01:04:42,319 --> 01:04:45,200
population medical trials are performed
1795
01:04:45,200 --> 01:04:47,200
results are generalized for the whole
1796
01:04:47,200 --> 01:04:49,839
population so here's a protected there's
1797
01:04:49,839 --> 01:04:51,280
our small group up here where we've
1798
01:04:51,280 --> 01:04:53,680
selected them we run medical trials on
1799
01:04:53,680 --> 01:04:55,200
them and then the results work for the
1800
01:04:55,200 --> 01:04:56,559
population
1801
01:04:56,559 --> 01:04:58,160
nice diagram with the arrows going back
1802
01:04:58,160 --> 01:05:00,240
and forth and the very scary covid
1803
01:05:00,240 --> 01:05:01,920
virus in the middle of one
1804
01:05:01,920 --> 01:05:03,280
and let's take a look at the
1805
01:05:03,280 --> 01:05:06,880
applications of inferential statistics
1806
01:05:06,880 --> 01:05:08,480
very central is what they call
1807
01:05:08,480 --> 01:05:11,200
hypothesis testing
1808
01:05:11,200 --> 01:05:13,280
and the confidence interval which go
1809
01:05:13,280 --> 01:05:17,440
with that and then as we get into
1810
01:05:17,440 --> 01:05:19,920
probability we get into our binomial
1811
01:05:19,920 --> 01:05:22,000
theorem our normal distribution and
1812
01:05:22,000 --> 01:05:24,000
central limit theorem
1813
01:05:24,000 --> 01:05:26,640
hypothesis testing hypothesis testing is
1814
01:05:26,640 --> 01:05:28,880
used to measure the plausibility of a
1815
01:05:28,880 --> 01:05:30,319
hypothesis
1816
01:05:30,319 --> 01:05:33,440
assumption by using sample data
1817
01:05:33,440 --> 01:05:34,319
now
1818
01:05:34,319 --> 01:05:36,880
when we talk about theorems
1819
01:05:36,880 --> 01:05:38,119
theory
1820
01:05:38,119 --> 01:05:41,760
hypothesis uh keep in mind that if you
1821
01:05:41,760 --> 01:05:43,920
are in a philosophy class
1822
01:05:43,920 --> 01:05:44,880
theory
1823
01:05:44,880 --> 01:05:48,240
is the same as hypothesis where theorem
1824
01:05:48,240 --> 01:05:50,720
is a scientific uh statement that is
1825
01:05:50,720 --> 01:05:52,880
something that has been proven
1826
01:05:52,880 --> 01:05:54,640
although it is always up for debate
1827
01:05:54,640 --> 01:05:56,160
because in science we always want to
1828
01:05:56,160 --> 01:05:57,760
make sure things are up to debate so
1829
01:05:57,760 --> 01:06:00,240
hypothesis is the same as a
1830
01:06:00,240 --> 01:06:02,480
philosophical class calling a theory
1831
01:06:02,480 --> 01:06:04,880
where theory in science is not the same
1832
01:06:04,880 --> 01:06:06,559
theory in science says this has been
1833
01:06:06,559 --> 01:06:09,680
well proven gravity is a theory uh so if
1834
01:06:09,680 --> 01:06:11,520
you want to debate the theory of gravity
1835
01:06:11,520 --> 01:06:13,760
try jumping up and down if you want to
1836
01:06:13,760 --> 01:06:16,400
have a theory about why the economy is
1837
01:06:16,400 --> 01:06:18,400
collapsing in your area
1838
01:06:18,400 --> 01:06:20,720
that is a philosophical debate
1839
01:06:20,720 --> 01:06:22,559
very important i've heard people mix
1840
01:06:22,559 --> 01:06:25,280
those up and it is a pet peeve of mine
1841
01:06:25,280 --> 01:06:27,599
when we talk about hypothesis testing
1842
01:06:27,599 --> 01:06:29,599
the steps involved in hypothesis testing
1843
01:06:29,599 --> 01:06:32,240
is first we formulate a hypothesis
1844
01:06:32,240 --> 01:06:34,160
we figure out the right test to test our
1845
01:06:34,160 --> 01:06:36,960
hypothesis we execute the test and we
1846
01:06:36,960 --> 01:06:39,280
make a decision and so when you're
1847
01:06:39,280 --> 01:06:40,960
talking about hypothesis you're usually
1848
01:06:40,960 --> 01:06:43,359
trying to disprove it if you can't
1849
01:06:43,359 --> 01:06:44,799
disprove it
1850
01:06:44,799 --> 01:06:46,799
and it works for all the facts then you
1851
01:06:46,799 --> 01:06:49,680
might call that a theory at some point
1852
01:06:49,680 --> 01:06:51,839
so in a use case uh let's consider an
1853
01:06:51,839 --> 01:06:53,920
example where four students were given
1854
01:06:53,920 --> 01:06:56,240
a task to clean a room every day
1855
01:06:56,240 --> 01:06:58,079
sounds like working with my kids they
1856
01:06:58,079 --> 01:06:59,520
decided to distribute the job of
1857
01:06:59,520 --> 01:07:01,680
cleaning the room among themselves they
1858
01:07:01,680 --> 01:07:04,000
did so by making four chits which have
1859
01:07:04,000 --> 01:07:05,920
their names on it and the name that gets
1860
01:07:05,920 --> 01:07:07,760
picked up has to do the cleaning for
1861
01:07:07,760 --> 01:07:08,960
that day
1862
01:07:08,960 --> 01:07:10,960
rob took the opportunity to make chits
1863
01:07:10,960 --> 01:07:13,039
and wrote everyone's name on it so
1864
01:07:13,039 --> 01:07:15,839
here's our four people nick rob emilia
1865
01:07:15,839 --> 01:07:19,200
and summer
1866
01:07:19,200 --> 01:07:21,520
now rick emilia and summer are asking us
1867
01:07:21,520 --> 01:07:23,680
to decide whether rob has done some
1868
01:07:23,680 --> 01:07:26,240
mischief in preparing the chits i.e
1869
01:07:26,240 --> 01:07:27,920
whether rob has written his name on one
1870
01:07:27,920 --> 01:07:29,039
of the chits
1871
01:07:29,039 --> 01:07:30,400
for that we will find out the
1872
01:07:30,400 --> 01:07:32,240
probability of rob getting the cleaning
1873
01:07:32,240 --> 01:07:34,480
job on first day second day third day
1874
01:07:34,480 --> 01:07:36,640
and so on till 12 days
1875
01:07:36,640 --> 01:07:38,720
if the chance of rob still not getting the job
1876
01:07:38,720 --> 01:07:41,920
decreases every day i.e his turn never
1877
01:07:41,920 --> 01:07:43,920
comes up then definitely he has done
1878
01:07:43,920 --> 01:07:46,480
some mischief while making the chits
1879
01:07:46,480 --> 01:07:48,640
so the probability of rob not doing work
1880
01:07:48,640 --> 01:07:51,039
on day one is uh three out of four
1881
01:07:51,039 --> 01:07:53,359
there's a 0.75 chance that he didn't do
1882
01:07:53,359 --> 01:07:54,240
work
1883
01:07:54,240 --> 01:07:56,559
uh two days three fourths times three
1884
01:07:56,559 --> 01:07:59,280
fourths equals point five six
1885
01:07:59,280 --> 01:08:00,799
three days you have three fourths three
1886
01:08:00,799 --> 01:08:03,920
fourths three fourths which equals 0.42
1887
01:08:03,920 --> 01:08:07,359
when you get to day 12 it's 0.032 which
1888
01:08:07,359 --> 01:08:09,599
is less than 0.05
1889
01:08:09,599 --> 01:08:12,480
remember this .05 that comes up a lot
1890
01:08:12,480 --> 01:08:14,640
when we're talking about
1891
01:08:14,640 --> 01:08:16,640
certain values when we're looking at
1892
01:08:16,640 --> 01:08:17,920
statistics
1893
01:08:17,920 --> 01:08:19,920
rob is cheating as he wasn't chosen for
1894
01:08:19,920 --> 01:08:22,319
12 consecutive days that's a very low
1895
01:08:22,319 --> 01:08:25,679
probability when on day 12 he still
1896
01:08:25,679 --> 01:08:29,040
hasn't gotten the job cleaning the room
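The arithmetic in this example can be checked with a short sketch. It assumes a fair draw (the null hypothesis): each day Rob has a 3 in 4 chance of not being picked, and the days are independent. The 0.05 cutoff is the conventional threshold mentioned above, and the variable names are only illustrative.

```python
# Chance that Rob is NOT picked on a single day with four fair chits.
p_not_picked = 3 / 4

# Conventional significance threshold mentioned in the talk.
alpha = 0.05

# Probability of Rob avoiding the job for n days in a row, n = 1..12.
streak = {day: p_not_picked ** day for day in range(1, 13)}

for day, p in streak.items():
    flag = "below 0.05" if p < alpha else ""
    print(f"day {day:2d}: {p:.3f} {flag}")

# First day on which the streak becomes less likely than the threshold.
first_below = next(day for day, p in streak.items() if p < alpha)
print("first day below threshold:", first_below)
```

A streak this unlikely under a fair draw is the evidence against the fair-draw assumption; it makes cheating plausible rather than proving it.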
1897
01:08:29,040 --> 01:08:31,520
so we come up to our important
1898
01:08:31,520 --> 01:08:33,120
terminologies
1899
01:08:33,120 --> 01:08:36,000
we have null hypothesis
1900
01:08:36,000 --> 01:08:37,520
a general statement that states that
1901
01:08:37,520 --> 01:08:39,520
there is no relationship between two
1902
01:08:39,520 --> 01:08:42,399
measured phenomena or no association
1903
01:08:42,399 --> 01:08:44,238
among the groups
1904
01:08:44,238 --> 01:08:46,399
alternative hypothesis
1905
01:08:46,399 --> 01:08:48,399
contrary to the null hypothesis it
1906
01:08:48,399 --> 01:08:50,880
states whenever something is happening a
1907
01:08:50,880 --> 01:08:52,960
new theory is preferred instead of an
1908
01:08:52,960 --> 01:08:55,839
old one and so the two hypotheses go
1909
01:08:55,839 --> 01:08:58,799
hand in hand uh so your null this is
1910
01:08:58,799 --> 01:09:00,399
always interesting in in talking about
1911
01:09:00,399 --> 01:09:02,799
data science and the math behind it
1912
01:09:02,799 --> 01:09:05,040
it's about proving that the things have
1913
01:09:05,040 --> 01:09:06,399
no correlation
1914
01:09:06,399 --> 01:09:08,719
null hypothesis says these two have zero
1915
01:09:08,719 --> 01:09:10,479
relation to each other where the
1916
01:09:10,479 --> 01:09:12,719
alternative hypothesis says hey we found
1917
01:09:12,719 --> 01:09:14,880
a relation this is what it is
1918
01:09:14,880 --> 01:09:17,439
we have p-value the p-value is a
1919
01:09:17,439 --> 01:09:19,839
probability of finding the observed or
1920
01:09:19,839 --> 01:09:21,439
more extreme results when the null
1921
01:09:21,439 --> 01:09:24,799
hypothesis of a study question is true
1922
01:09:24,799 --> 01:09:26,799
and the t value it is simply the
1923
01:09:26,799 --> 01:09:28,640
calculated difference represented in
1924
01:09:28,640 --> 01:09:30,880
units of standard error the greater the
1925
01:09:30,880 --> 01:09:32,880
magnitude of t the greater the evidence
1926
01:09:32,880 --> 01:09:35,359
against the null hypothesis and you can
1927
01:09:35,359 --> 01:09:38,000
look at the t values being specific to
1928
01:09:38,000 --> 01:09:39,679
the test you're doing
1929
01:09:39,679 --> 01:09:42,238
where the p value is derived from your t
1930
01:09:42,238 --> 01:09:44,319
value and you're looking for what they
1931
01:09:44,319 --> 01:09:47,439
call the five percent or the 0.05
1932
01:09:47,439 --> 01:09:49,520
showing that the result is statistically significant
1933
01:09:49,520 --> 01:09:50,238
so
1934
01:09:50,238 --> 01:09:52,158
digging in deeper let's assume that a
1935
01:09:52,158 --> 01:09:53,920
new drug is developed with the goal of
1936
01:09:53,920 --> 01:09:55,679
lowering the blood pressure more than
1937
01:09:55,679 --> 01:09:57,520
the existing drug
1938
01:09:57,520 --> 01:09:59,840
and this is a good one because the null
1939
01:09:59,840 --> 01:10:01,679
value here isn't that you don't have any
1940
01:10:01,679 --> 01:10:03,199
drug the null value here is that it's not
1941
01:10:03,199 --> 01:10:04,880
better than the existing drug
1942
01:10:04,880 --> 01:10:06,480
the new drug doesn't lower the blood
1943
01:10:06,480 --> 01:10:09,360
pressure more than the existing drug
1944
01:10:09,360 --> 01:10:11,360
now if we get that
1945
01:10:11,360 --> 01:10:13,679
that says our null hypothesis is correct
1946
01:10:13,679 --> 01:10:16,159
there is no correlation and the new drug
1947
01:10:16,159 --> 01:10:18,880
is not doing its job the alternative
1948
01:10:18,880 --> 01:10:21,120
hypothesis the new drug does
1949
01:10:21,120 --> 01:10:22,800
significantly lower the blood pressure
1950
01:10:22,800 --> 01:10:25,600
more than the existing drug uh yay we
1951
01:10:25,600 --> 01:10:27,520
got a new drug out there and that's our
1952
01:10:27,520 --> 01:10:31,520
alternative hypothesis or the h1 or ha
1953
01:10:31,520 --> 01:10:33,920
and we look at the p-value results from
1954
01:10:33,920 --> 01:10:35,840
the evidence like medical trial showing
1955
01:10:35,840 --> 01:10:37,840
positive results which will reject the
1956
01:10:37,840 --> 01:10:39,600
null hypothesis
1957
01:10:39,600 --> 01:10:41,360
and again they're looking for
1958
01:10:41,360 --> 01:10:44,800
a 0.05 or 5 percent and the t value
1959
01:10:44,800 --> 01:10:46,880
comparing all the positive test results
1960
01:10:46,880 --> 01:10:48,640
and finding means of different samples
1961
01:10:48,640 --> 01:10:50,800
in order to test hypothesis
1962
01:10:50,800 --> 01:10:53,520
so this is specific to the test what
1963
01:10:53,520 --> 01:10:56,000
percentage of increase did they have
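To make the t value concrete, here is a minimal sketch using only the standard library. The blood pressure numbers are invented for illustration, and `t_value` is a hypothetical helper that uses the Welch form of the standard error, which may differ from whatever test the slide has in mind:

```python
from math import sqrt
from statistics import mean, variance

# Hypothetical drops in blood pressure (mmHg); these numbers are invented.
new_drug = [12.1, 10.4, 13.3, 11.8, 12.9, 10.9, 13.0, 11.5]
old_drug = [9.8, 10.1, 9.2, 10.9, 9.5, 10.4, 8.9, 10.0]

def t_value(a, b):
    """Difference in sample means, expressed in units of standard error."""
    standard_error = sqrt(variance(a) / len(a) + variance(b) / len(b))
    return (mean(a) - mean(b)) / standard_error

t = t_value(new_drug, old_drug)
print(f"t = {t:.2f}")
```

The larger the magnitude of t, the stronger the evidence against the null hypothesis. Turning t into a p-value needs the t distribution, which a statistics package would normally supply.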
1964
01:10:56,000 --> 01:10:57,920
and this leads us to the confidence
1965
01:10:57,920 --> 01:10:59,120
intervals
1966
01:10:59,120 --> 01:11:01,280
a confidence interval is a range of
1967
01:11:01,280 --> 01:11:03,679
values we are fairly sure our true values of
1968
01:11:03,679 --> 01:11:06,159
observations lie in
1969
01:11:06,159 --> 01:11:08,080
let's say you went to the dog owners around
1970
01:11:08,080 --> 01:11:10,719
you and asked them how many cans of food
1971
01:11:10,719 --> 01:11:13,199
do you buy per year for your
1972
01:11:13,199 --> 01:11:14,239
dog
1973
01:11:14,239 --> 01:11:16,159
through calculations you got to know
1974
01:11:16,159 --> 01:11:18,880
that on average around 95 percent
1975
01:11:18,880 --> 01:11:21,199
of the people bought around 200 to 300
1976
01:11:21,199 --> 01:11:23,360
cans of food hence we can say that we
1977
01:11:23,360 --> 01:11:26,320
have a confidence interval of 200 to 300
1978
01:11:26,320 --> 01:11:28,880
where 95 percent of our values lie in
1979
01:11:28,880 --> 01:11:31,120
that data spread
1980
01:11:31,120 --> 01:11:33,520
and the graph really helps a lot so
1981
01:11:33,520 --> 01:11:35,199
you can start seeing what you're looking
1982
01:11:35,199 --> 01:11:37,760
at here we have the 95 percent you have
1983
01:11:37,760 --> 01:11:39,600
your peak in this case it's a normal
1984
01:11:39,600 --> 01:11:41,040
distribution so you have a nice bell
1985
01:11:41,040 --> 01:11:42,640
curve equal on both sides it's not
1986
01:11:42,640 --> 01:11:45,840
asymmetrical and 95 percent of all the values
1987
01:11:45,840 --> 01:11:48,080
lie within a very small range and then
1988
01:11:48,080 --> 01:11:50,320
you have your outliers the 2.5 percent
1989
01:11:50,320 --> 01:11:52,159
going each way
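The 95 percent picture with 2.5 percent in each tail can be sketched as below. It assumes, purely for illustration, that cans per year follow a normal distribution centered on 250 with a spread chosen so the central 95 percent lands near 200 to 300. Strictly speaking, a confidence interval describes uncertainty in an estimate such as the mean; this follows the transcript's looser "range that holds 95 percent of the values" usage:

```python
from statistics import NormalDist

# Assumed for illustration: mean 250 cans, sigma picked so that the
# central 95% of the distribution spans roughly 200 to 300 cans.
cans = NormalDist(mu=250, sigma=50 / 1.96)

lower = cans.inv_cdf(0.025)  # 2.5% of values fall below this point
upper = cans.inv_cdf(0.975)  # 2.5% of values fall above this point
inside = cans.cdf(upper) - cans.cdf(lower)

print(f"central range: {lower:.1f} to {upper:.1f} cans")
print(f"share of values inside: {inside:.2%}")
```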
1990
01:11:52,159 --> 01:11:54,960
so we touched upon hypothesis uh we're
1991
01:11:54,960 --> 01:11:57,760
going to move into probability so you
1992
01:11:57,760 --> 01:11:58,960
have your hypothesis once you've
1993
01:11:58,960 --> 01:12:00,400
generated your hypothesis we want to
1994
01:12:00,400 --> 01:12:01,760
know the probability of something
1995
01:12:01,760 --> 01:12:04,000
occurring probability is a measure of
1996
01:12:04,000 --> 01:12:06,640
the likelihood of an event to occur any
1997
01:12:06,640 --> 01:12:08,480
event cannot be predicted with total
1998
01:12:08,480 --> 01:12:09,600
certainty
1999
01:12:09,600 --> 01:12:11,199
and can only be predicted as a
2000
01:12:11,199 --> 01:12:13,679
likelihood of its occurrence so any
2001
01:12:13,679 --> 01:12:15,520
event cannot be predicted with total
2002
01:12:15,520 --> 01:12:17,280
certainty can only be predicted as a
2003
01:12:17,280 --> 01:12:19,679
likelihood of its occurrence
2004
01:12:19,679 --> 01:12:21,360
score prediction how good you're going
2005
01:12:21,360 --> 01:12:23,600
to do in whatever
2006
01:12:23,600 --> 01:12:26,080
sport you're in weather prediction stock
2007
01:12:26,080 --> 01:12:27,360
prediction
2008
01:12:27,360 --> 01:12:29,600
if you've studied physics and chaos
2009
01:12:29,600 --> 01:12:31,760
theory even the location of the chair
2010
01:12:31,760 --> 01:12:33,520
you're sitting on has a probability that
2011
01:12:33,520 --> 01:12:35,600
it might move three feet over
2012
01:12:35,600 --> 01:12:37,840
granted that probability is one in like
2013
01:12:37,840 --> 01:12:40,320
uh i think we calculated as under one in
2014
01:12:40,320 --> 01:12:43,440
trillions upon trillions so it's
2015
01:12:43,440 --> 01:12:44,719
the better the probability the more
2016
01:12:44,719 --> 01:12:46,000
likely it's going to happen there are
2017
01:12:46,000 --> 01:12:47,120
some things that have such a low
2018
01:12:47,120 --> 01:12:48,400
probability
2019
01:12:48,400 --> 01:12:50,880
that we don't see them so we talk about
2020
01:12:50,880 --> 01:12:53,360
a random variable a random variable is a
2021
01:12:53,360 --> 01:12:54,880
variable whose possible values are
2022
01:12:54,880 --> 01:12:57,920
numerical outcomes of a random phenomenon
2023
01:12:57,920 --> 01:13:00,960
so uh we have the coin toss how many
2024
01:13:00,960 --> 01:13:02,880
heads will occur in the series of 20
2025
01:13:02,880 --> 01:13:05,360
coin flips probably you know on
2026
01:13:05,360 --> 01:13:07,440
average they're 10 but you really can't
2027
01:13:07,440 --> 01:13:09,679
know because it's very random how many
2028
01:13:09,679 --> 01:13:11,600
times a red ball is picked from a bag of
2029
01:13:11,600 --> 01:13:14,480
balls if there's equal number of red
2030
01:13:14,480 --> 01:13:16,080
balls and blue balls and green balls in
2031
01:13:16,080 --> 01:13:18,159
there how many times the sum of digits
2032
01:13:18,159 --> 01:13:19,760
on two dice
2033
01:13:19,760 --> 01:13:22,960
results are five each
2034
01:13:22,960 --> 01:13:24,320
so you know there's how often you're
2035
01:13:24,320 --> 01:13:27,440
gonna roll two fives on your pair of dice
2036
01:13:27,440 --> 01:13:29,600
so in a use case uh let's consider the
2037
01:13:29,600 --> 01:13:31,280
example of rolling two dice we have a
2038
01:13:31,280 --> 01:13:33,360
random variable outcome equals y which can
2039
01:13:33,360 --> 01:13:35,520
take values two three four five six
2040
01:13:35,520 --> 01:13:38,000
seven eight nine ten eleven twelve
2041
01:13:38,000 --> 01:13:39,920
so we have a random variable and a
2042
01:13:39,920 --> 01:13:41,760
combination of dice
2043
01:13:41,760 --> 01:13:44,640
and instead of looking at how many times
2044
01:13:44,640 --> 01:13:46,320
both dice roll five let's go ahead
2045
01:13:46,320 --> 01:13:48,880
and look at uh total sum of five and you
2046
01:13:48,880 --> 01:13:50,880
have as far as your random variables
2047
01:13:50,880 --> 01:13:53,040
you can have a one four equals five four
2048
01:13:53,040 --> 01:13:55,360
one two three three two
2049
01:13:55,360 --> 01:13:58,320
so four of those rolls can be five if
2050
01:13:58,320 --> 01:13:59,920
you look at all the different options
2051
01:13:59,920 --> 01:14:02,000
you have four of those random rolls can
2052
01:14:02,000 --> 01:14:03,920
be a five
2053
01:14:03,920 --> 01:14:07,199
and if we look at the total number
2054
01:14:07,199 --> 01:14:10,320
which happens to be 36 different options
2055
01:14:10,320 --> 01:14:12,880
you can see that we have four out of 36
2056
01:14:12,880 --> 01:14:14,719
chance every time you roll the dice that
2057
01:14:14,719 --> 01:14:16,400
you're gonna roll a total of five you
2058
01:14:16,400 --> 01:14:18,400
can have an outcome of five
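The four-out-of-36 count can be checked by listing every roll. A minimal sketch, with variable names chosen here for illustration:

```python
from fractions import Fraction

faces = range(1, 7)  # 1 through 6; range stops before 7

# Every ordered (first die, second die) pair.
outcomes = [(a, b) for a in faces for b in faces]

# Keep only the rolls whose total is five.
sum_is_five = [roll for roll in outcomes if sum(roll) == 5]

p_five = Fraction(len(sum_is_five), len(outcomes))
print(sum_is_five)
print(f"{len(sum_is_five)} out of {len(outcomes)}, which reduces to {p_five}")
```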
2059
01:14:18,400 --> 01:14:21,040
and uh we'll look a little deeper as to
2060
01:14:21,040 --> 01:14:23,120
what that means but you could think of
2061
01:14:23,120 --> 01:14:25,199
that at what point if someone never
2062
01:14:25,199 --> 01:14:27,679
rolls a five or they always roll a five
2063
01:14:27,679 --> 01:14:29,440
can you say hey that person's probably
2064
01:14:29,440 --> 01:14:30,480
cheating
2065
01:14:30,480 --> 01:14:32,080
we'll look a little closer at the math
2066
01:14:32,080 --> 01:14:34,239
behind that but let's just consider this
2067
01:14:34,239 --> 01:14:36,400
as one of the cases rolling two dice
2068
01:14:36,400 --> 01:14:37,920
and gambling
2069
01:14:37,920 --> 01:14:40,239
there's also binomial distribution which is
2070
01:14:40,239 --> 01:14:42,239
the probability of getting success or
2071
01:14:42,239 --> 01:14:44,239
failure as an outcome in an experiment
2072
01:14:44,239 --> 01:14:47,199
or trial that is repeated multiple times
2073
01:14:47,199 --> 01:14:49,840
and the key is bi meaning two
2074
01:14:49,840 --> 01:14:53,040
binomial so passing or failing an exam
2075
01:14:53,040 --> 01:14:55,280
winning or losing a game and getting
2076
01:14:55,280 --> 01:14:57,600
either head or tails so if you ever see
2077
01:14:57,600 --> 01:15:01,120
binomial distribution it's based on a
2078
01:15:01,120 --> 01:15:04,080
true false kind of setup you win or lose
2079
01:15:04,080 --> 01:15:05,760
let's consider a
2080
01:15:05,760 --> 01:15:08,400
use case and let's consider the game of
2081
01:15:08,400 --> 01:15:11,199
football between two clubs
2082
01:15:11,199 --> 01:15:13,920
barcelona and dortmund the teams will
2083
01:15:13,920 --> 01:15:16,000
have to play a total of five matches and
2084
01:15:16,000 --> 01:15:18,080
we have to find out the chances of
2085
01:15:18,080 --> 01:15:20,880
barcelona winning the series so we look
2086
01:15:20,880 --> 01:15:22,960
at the total games we're looking at five
2087
01:15:22,960 --> 01:15:24,960
different games or matches
2088
01:15:24,960 --> 01:15:26,320
let's say that the winning chance for
2089
01:15:26,320 --> 01:15:30,000
barcelona is 75 percent or 0.75
2090
01:15:30,000 --> 01:15:32,400
that means that each game they have a 75 percent
2091
01:15:32,400 --> 01:15:33,600
chance that they're going to win that
2092
01:15:33,600 --> 01:15:36,239
game and losing chances are 25 percent
2093
01:15:36,239 --> 01:15:40,960
or 0.25 clearly 0.75 plus 0.25 equals 1.
2094
01:15:40,960 --> 01:15:43,040
so that accounts for 100 percent of the game
2095
01:15:43,040 --> 01:15:46,239
probability for getting k wins in n
2096
01:15:46,239 --> 01:15:48,560
matches is calculated
2097
01:15:48,560 --> 01:15:50,159
and we're talking like so if you have
2098
01:15:50,159 --> 01:15:52,400
five games and you want to know if i
2099
01:15:52,400 --> 01:15:53,600
play
2100
01:15:53,600 --> 01:15:55,840
how many wins in those five games should
2101
01:15:55,840 --> 01:15:58,159
i get what's a percentage on those and
2102
01:15:58,159 --> 01:16:01,199
the probability for getting k wins in
2103
01:16:01,199 --> 01:16:04,640
n matches is calculated by p of x equals k
2104
01:16:04,640 --> 01:16:08,960
equals n choose k times p to the k times q to the n minus
2105
01:16:08,960 --> 01:16:09,760
k
2106
01:16:09,760 --> 01:16:12,719
here p is the probability of success and
2107
01:16:12,719 --> 01:16:15,120
q is the probability of failure and so
2108
01:16:15,120 --> 01:16:17,679
we can do total games of n equals 5
2109
01:16:17,679 --> 01:16:21,360
where k equals 0 one two three four five
2110
01:16:21,360 --> 01:16:23,520
p which is the chance of winning is
2111
01:16:23,520 --> 01:16:26,159
point seven five q the chance of losing
2112
01:16:26,159 --> 01:16:28,320
equals one minus p
2113
01:16:28,320 --> 01:16:30,239
which equals one minus point seven
2114
01:16:30,239 --> 01:16:32,400
five which equals point two five the
2115
01:16:32,400 --> 01:16:34,320
probability that barcelona will lose all
2116
01:16:34,320 --> 01:16:36,719
of the matches we can then just plug in the
2117
01:16:36,719 --> 01:16:37,840
numbers
2118
01:16:37,840 --> 01:16:42,040
and we end up with 0.0009765625
2119
01:16:44,239 --> 01:16:45,679
so very small chance they're going to
2120
01:16:45,679 --> 01:16:48,080
lose all their matches
2121
01:16:48,080 --> 01:16:50,480
and we can plug in the value for two
2122
01:16:50,480 --> 01:16:53,440
matches probability that barcelona will
2123
01:16:53,440 --> 01:16:57,360
win exactly two matches is 0.0878 and
2124
01:16:57,360 --> 01:16:58,480
of course we can go on to the
2125
01:16:58,480 --> 01:16:59,920
probability that barcelona will win
2126
01:16:59,920 --> 01:17:03,520
three matches is 0.26 and of course four
2127
01:17:03,520 --> 01:17:06,239
matches and so on and it's always nice
2128
01:17:06,239 --> 01:17:08,320
to take this information
2129
01:17:08,320 --> 01:17:10,239
and let's find the cumulative discrete
2130
01:17:10,239 --> 01:17:12,480
probabilities for each of the outcomes
2131
01:17:12,480 --> 01:17:15,120
where barcelona has won three or more
2132
01:17:15,120 --> 01:17:17,440
matches x equals three x equals four x
2133
01:17:17,440 --> 01:17:18,960
equals five
2134
01:17:18,960 --> 01:17:20,960
and we end up with the p equals point
2135
01:17:20,960 --> 01:17:23,120
two six four plus point three nine five
2136
01:17:23,120 --> 01:17:25,199
plus point two three seven which equals point
2137
01:17:25,199 --> 01:17:26,560
eight nine six
2138
01:17:26,560 --> 01:17:28,159
in reality
2139
01:17:28,159 --> 01:17:29,679
the probability of barcelona winning the
2140
01:17:29,679 --> 01:17:32,640
series is much higher than 0.75
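The Barcelona numbers can be reproduced with the binomial formula P(X = k) = C(n, k) * p^k * q^(n - k). A minimal sketch, assuming n = 5 matches and p = 0.75 as in the worked figures; `binomial_pmf` is a helper name made up here:

```python
from math import comb

def binomial_pmf(k, n, p):
    """Probability of exactly k successes in n independent trials."""
    q = 1 - p  # probability of failure
    return comb(n, k) * p**k * q ** (n - k)

n, p = 5, 0.75
pmf = {k: binomial_pmf(k, n, p) for k in range(n + 1)}

# Winning the series here means winning three or more of the five matches.
p_three_or_more = sum(pmf[k] for k in (3, 4, 5))

for k, prob in pmf.items():
    print(f"P(X = {k}) = {prob:.4f}")
print(f"P(X >= 3) = {p_three_or_more:.4f}")
```

Note that the 0.0878 figure corresponds to exactly two wins, and the three probabilities for three, four and five wins add up to a little under 0.90.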
2141
01:17:32,640 --> 01:17:35,440
and it's always nice to
2142
01:17:35,440 --> 01:17:37,280
put out a nice graph so you can actually
2143
01:17:37,280 --> 01:17:38,880
see the number of wins to the
2144
01:17:38,880 --> 01:17:41,760
probability and how that pans out with
2145
01:17:41,760 --> 01:17:43,840
our binomial case
2146
01:17:43,840 --> 01:17:46,480
continuing in our important terminology
2147
01:17:46,480 --> 01:17:48,880
location the location of the center of
2148
01:17:48,880 --> 01:17:51,280
the graph depends on the mean
2149
01:17:51,280 --> 01:17:53,840
value and this is some very important
2150
01:17:53,840 --> 01:17:56,480
things so much of the data we look at
2151
01:17:56,480 --> 01:17:57,600
and when you start looking at
2152
01:17:57,600 --> 01:17:59,440
probabilities almost always has a
2153
01:17:59,440 --> 01:18:01,280
normalized look like the graph in the
2154
01:18:01,280 --> 01:18:02,800
middle
2155
01:18:02,800 --> 01:18:04,880
but you do have left skewed where the
2156
01:18:04,880 --> 01:18:06,480
data is skewed off to the left and you
2157
01:18:06,480 --> 01:18:07,679
have more stuff happening off to the
2158
01:18:07,679 --> 01:18:10,080
left and you have right skewed data
2159
01:18:10,080 --> 01:18:11,600
and so when this comes up and these
2160
01:18:11,600 --> 01:18:13,040
probabilities come up where they're
2161
01:18:13,040 --> 01:18:14,960
skewed it's really important to take a
2162
01:18:14,960 --> 01:18:16,560
closer look at that
2163
01:18:16,560 --> 01:18:18,640
mostly you end up with a normalized set
2164
01:18:18,640 --> 01:18:20,080
of data but you got to also be aware
2165
01:18:20,080 --> 01:18:22,560
that sometimes it's a skewed data
2166
01:18:22,560 --> 01:18:24,560
and then the height of the slope
2167
01:18:24,560 --> 01:18:26,560
inversely depends upon the standard
2168
01:18:26,560 --> 01:18:28,560
deviation
2169
01:18:28,560 --> 01:18:29,760
so you can see down here the standard
2170
01:18:29,760 --> 01:18:31,360
deviation is really large it kind of
2171
01:18:31,360 --> 01:18:33,440
squishes it out and if the standard
2172
01:18:33,440 --> 01:18:35,679
deviation is small then most of your
2173
01:18:35,679 --> 01:18:36,960
data is going to hit right there in the
2174
01:18:36,960 --> 01:18:39,120
middle you can have a nice peak
2175
01:18:39,120 --> 01:18:40,960
and so being aware of this that you
2176
01:18:40,960 --> 01:18:43,040
might have a probability that fits
2177
01:18:43,040 --> 01:18:44,719
certain data but it has a lot of
2178
01:18:44,719 --> 01:18:46,719
outliers so if you have a really
2179
01:18:46,719 --> 01:18:49,199
high standard deviation
2180
01:18:49,199 --> 01:18:52,239
if you're doing stock market analysis
2181
01:18:52,239 --> 01:18:53,840
this means your predictions are probably
2182
01:18:53,840 --> 01:18:55,760
not going to make you much money
2183
01:18:55,760 --> 01:18:57,600
where if you have a very small deviation
2184
01:18:57,600 --> 01:18:59,600
you might be right on target and set to
2185
01:18:59,600 --> 01:19:01,040
become a millionaire
2186
01:19:01,040 --> 01:19:03,840
which leads us to the z-score z-score
2187
01:19:03,840 --> 01:19:06,320
tells you how far from the mean a data
2188
01:19:06,320 --> 01:19:09,040
point is it is measured in terms of
2189
01:19:09,040 --> 01:19:11,199
standard deviations from the mean around
2190
01:19:11,199 --> 01:19:12,880
68 percent of the results are found
2191
01:19:12,880 --> 01:19:15,360
within one standard deviation of the mean
2192
01:19:15,360 --> 01:19:17,280
around 95 percent of the results are
2193
01:19:17,280 --> 01:19:20,239
found within two standard deviations
2194
01:19:20,239 --> 01:19:22,159
and you read the symbols of course they
2195
01:19:22,159 --> 01:19:23,600
love to throw some greek letters in
2196
01:19:23,600 --> 01:19:27,120
there we have mu minus two sigma
2197
01:19:27,120 --> 01:19:29,760
mu is just a quick way it's a kind of
2198
01:19:29,760 --> 01:19:32,719
funky u it just means the mean
2199
01:19:32,719 --> 01:19:34,719
uh and then the sigma is the standard
2200
01:19:34,719 --> 01:19:37,040
deviation and that's the o with the
2201
01:19:37,040 --> 01:19:38,640
little arrow off to the right or the
2202
01:19:38,640 --> 01:19:39,520
little
2203
01:19:39,520 --> 01:19:41,440
waggy tail going up the o with
2204
01:19:41,440 --> 01:19:45,440
the line on it uh so mu minus two sigma
2205
01:19:45,440 --> 01:19:46,640
to mu plus two sigma is where
2206
01:19:46,640 --> 01:19:48,640
uh 95 percent of the results are found
2207
01:19:48,640 --> 01:19:51,679
between two standard deviations
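The z-score and the 68 and 95 percent figures can be checked against the standard normal distribution. A short sketch; `z_score` is a helper name chosen here, and the example values passed to it are invented:

```python
from statistics import NormalDist

def z_score(x, mu, sigma):
    """How many standard deviations x lies from the mean."""
    return (x - mu) / sigma

standard = NormalDist()  # mean 0, standard deviation 1

# Share of values between mu - 1 sigma and mu + 1 sigma, and likewise for 2 sigma.
within_1 = standard.cdf(1) - standard.cdf(-1)
within_2 = standard.cdf(2) - standard.cdf(-2)

print(f"z-score of 300 when mu=250, sigma=25: {z_score(300, 250, 25)}")
print(f"within 1 sigma: {within_1:.2%}")
print(f"within 2 sigma: {within_2:.2%}")
```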
2208
01:19:51,679 --> 01:19:53,679
central limit theorem
2209
01:19:53,679 --> 01:19:55,679
this goes back to the skew if you
2210
01:19:55,679 --> 01:19:57,120
remember we were looking at the skew
2211
01:19:57,120 --> 01:20:00,560
values on this previous slide we have left
2212
01:20:00,560 --> 01:20:03,280
skewed normalized and right skewed when
2213
01:20:03,280 --> 01:20:04,880
we're talking about it being skewed or
2214
01:20:04,880 --> 01:20:07,199
not skewed the distribution of the
2215
01:20:07,199 --> 01:20:09,600
sample means will be approximately
2216
01:20:09,600 --> 01:20:11,920
normally distributed evenly distributed
2217
01:20:11,920 --> 01:20:13,120
not skewed
2218
01:20:13,120 --> 01:20:15,679
if you take large random samples from
2219
01:20:15,679 --> 01:20:18,400
the population with the mean mu
2220
01:20:18,400 --> 01:20:21,360
and the standard deviation sigma with
2221
01:20:21,360 --> 01:20:23,040
replacement
2222
01:20:23,040 --> 01:20:24,800
and you can see here
2223
01:20:24,800 --> 01:20:26,719
of course we have our
2224
01:20:26,719 --> 01:20:28,880
mu minus two sigma and the spread down
2225
01:20:28,880 --> 01:20:31,360
here the mean the median and the mode
2226
01:20:31,360 --> 01:20:32,719
and so you're talking about very large
2227
01:20:32,719 --> 01:20:35,040
populations
2228
01:20:35,040 --> 01:20:36,480
these numbers should come together and
2229
01:20:36,480 --> 01:20:38,800
you shouldn't have a skewed value if you
2230
01:20:38,800 --> 01:20:41,520
do that's a flag that something's wrong
2231
01:20:41,520 --> 01:20:43,040
that's why this is so important to be
2232
01:20:43,040 --> 01:20:45,360
aware of what's going on with your data
2233
01:20:45,360 --> 01:20:47,679
where your samples are coming from
2234
01:20:47,679 --> 01:20:49,840
and the math behind it
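A small simulation can illustrate the central limit theorem. This sketch assumes a deliberately skewed population (exponential, with mean 1 and standard deviation 1) and a sample size of 50; both choices are arbitrary and not from the video. The sample means should cluster around the population mean, with a spread close to sigma divided by the square root of the sample size:

```python
import random
from statistics import mean, stdev

random.seed(0)  # fixed seed so the run is repeatable

def sample_mean(size):
    # expovariate(1.0) is right skewed: population mean 1, standard deviation 1
    return mean(random.expovariate(1.0) for _ in range(size))

sample_size = 50
sample_means = [sample_mean(sample_size) for _ in range(2000)]

print(f"mean of sample means:   {mean(sample_means):.3f}")
print(f"spread of sample means: {stdev(sample_means):.3f}")
print(f"sigma / sqrt(n):        {1 / sample_size ** 0.5:.3f}")
```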
2235
01:20:49,840 --> 01:20:51,040
and if we're going to do all this we got
2236
01:20:51,040 --> 01:20:54,640
to jump into conditional probability
2237
01:20:54,640 --> 01:20:56,560
the conditional probability of an event
2238
01:20:56,560 --> 01:20:58,960
a is a probability that the event will
2239
01:20:58,960 --> 01:21:01,440
occur given the knowledge that an event
2240
01:21:01,440 --> 01:21:04,080
b has already occurred
2241
01:21:04,080 --> 01:21:06,320
and you'll see this as bayes theorem
2242
01:21:06,320 --> 01:21:08,960
b-a-y-e-s bayes
2243
01:21:08,960 --> 01:21:10,560
and this is read
2244
01:21:10,560 --> 01:21:11,760
i mean you have these funky looking
2245
01:21:11,760 --> 01:21:12,800
little
2246
01:21:12,800 --> 01:21:13,440
p
2247
01:21:13,440 --> 01:21:15,120
brackets a given b
2248
01:21:15,120 --> 01:21:17,840
this is the probability of a being true
2249
01:21:17,840 --> 01:21:21,040
while b is already true
2250
01:21:21,040 --> 01:21:22,640
and you have the probability of b being
2251
01:21:22,640 --> 01:21:25,280
true when a is already true so p
2252
01:21:25,280 --> 01:21:26,640
b given a times the
2253
01:21:26,640 --> 01:21:29,280
probability of a being true divided by
2254
01:21:29,280 --> 01:21:31,920
the probability of b being true
2255
01:21:31,920 --> 01:21:33,920
and we talk about bayes theorem which
2256
01:21:33,920 --> 01:21:35,840
goes back to the 1700s when he
2257
01:21:35,840 --> 01:21:37,600
discovered this this is such an
2258
01:21:37,600 --> 01:21:39,760
important formula and it's really it's
2259
01:21:39,760 --> 01:21:41,840
not if you actually do the math you
2260
01:21:41,840 --> 01:21:44,560
could just kind of do
2261
01:21:44,560 --> 01:21:48,080
x y equals j k and then you divide them
2262
01:21:48,080 --> 01:21:49,360
out and you're going to see the same
2263
01:21:49,360 --> 01:21:51,520
math but it works with probabilities
2264
01:21:51,520 --> 01:21:53,360
which makes it really nice
2265
01:21:53,360 --> 01:21:55,600
and so if you have a set you might have
2266
01:21:55,600 --> 01:21:57,920
uh eight or nine different studies going
2267
01:21:57,920 --> 01:22:00,400
on in different areas different people
2268
01:22:00,400 --> 01:22:01,920
have done the studies they brought them
2269
01:22:01,920 --> 01:22:03,679
together
2270
01:22:03,679 --> 01:22:05,679
if we look at today's covid virus the
2271
01:22:05,679 --> 01:22:07,520
virus spread
2272
01:22:07,520 --> 01:22:09,920
certainly the studies done in china
2273
01:22:09,920 --> 01:22:11,199
versus the studies the way they're done
2274
01:22:11,199 --> 01:22:12,719
in the us
2275
01:22:12,719 --> 01:22:14,800
that data is different in each of those
2276
01:22:14,800 --> 01:22:16,800
studies but if you can find a place
2277
01:22:16,800 --> 01:22:19,199
where it overlaps where they're studying
2278
01:22:19,199 --> 01:22:21,360
the same thing together you can then
2279
01:22:21,360 --> 01:22:23,040
compute the changes that you need to
2280
01:22:23,040 --> 01:22:25,679
make in one study to make them equal
2281
01:22:25,679 --> 01:22:27,920
and this is also true if you have a
2282
01:22:27,920 --> 01:22:29,679
study of
2283
01:22:29,679 --> 01:22:31,280
one group and you want to find out more
2284
01:22:31,280 --> 01:22:33,679
about it so this formula is very
2285
01:22:33,679 --> 01:22:35,840
powerful and it really has to do with
2286
01:22:35,840 --> 01:22:37,679
the data collection part of the math and
2287
01:22:37,679 --> 01:22:40,000
data science and understanding where
2288
01:22:40,000 --> 01:22:41,920
your data is coming from and how you're
2289
01:22:41,920 --> 01:22:44,000
going to combine different studies in
2290
01:22:44,000 --> 01:22:45,840
different groups
2291
01:22:45,840 --> 01:22:47,760
and we're going to go into a use case
2292
01:22:47,760 --> 01:22:49,520
let's find out the chance of a person
2293
01:22:49,520 --> 01:22:52,719
getting lung disease due to smoking
2294
01:22:52,719 --> 01:22:54,080
this is kind of interesting the way they
2295
01:22:54,080 --> 01:22:55,120
word this
2296
01:22:55,120 --> 01:22:56,719
let's say that a medical
2297
01:22:56,719 --> 01:22:59,199
report provided by the hospital states
2298
01:22:59,199 --> 01:23:01,920
that around 10 percent of all patients
2299
01:23:01,920 --> 01:23:05,280
they treated suffered lung disease
2300
01:23:05,280 --> 01:23:07,440
so we have kind of a generic medical
2301
01:23:07,440 --> 01:23:08,560
report
2302
01:23:08,560 --> 01:23:10,320
they further found out
2303
01:23:10,320 --> 01:23:12,320
by a survey that 15 percent of the
2304
01:23:12,320 --> 01:23:15,360
patients that visit them smoke
2305
01:23:15,360 --> 01:23:16,960
so we have 10 percent that are lung
2306
01:23:16,960 --> 01:23:18,639
disease and
2307
01:23:18,639 --> 01:23:21,040
15 percent of the patients smoke
2308
01:23:21,040 --> 01:23:23,040
and finally five percent of the people
2309
01:23:23,040 --> 01:23:25,840
continued to smoke even when they had lung
2310
01:23:25,840 --> 01:23:29,280
disease uh not the brightest choice um
2311
01:23:29,280 --> 01:23:30,719
but you know it is an addiction so it
2312
01:23:30,719 --> 01:23:32,480
can be really difficult to kick and so
2313
01:23:32,480 --> 01:23:35,520
we can look at the probability of a uh
2314
01:23:35,520 --> 01:23:37,679
prior probability of 10 percent of people having
2315
01:23:37,679 --> 01:23:39,199
lung disease
2316
01:23:39,199 --> 01:23:41,440
and then probability b probability that
2317
01:23:41,440 --> 01:23:45,440
a patient smokes is 15 percent
2318
01:23:45,440 --> 01:23:48,880
uh and the probability of b
2319
01:23:48,880 --> 01:23:51,040
given a the probability that a patient
2320
01:23:51,040 --> 01:23:53,040
smokes even though they have lung
2321
01:23:53,040 --> 01:23:55,520
disease is five percent
2322
01:23:55,520 --> 01:23:57,920
and probability of a given b the probability
2323
01:23:57,920 --> 01:23:59,760
that the patient will have lung disease
2324
01:23:59,760 --> 01:24:01,600
if they smoke and then when you put the
2325
01:24:01,600 --> 01:24:03,280
formulas together you get a nice
2326
01:24:03,280 --> 01:24:05,199
solution here you get the probability of
2327
01:24:05,199 --> 01:24:07,120
a given b the probability that the patient will
2328
01:24:07,120 --> 01:24:09,440
have lung disease if they smoke
2329
01:24:09,440 --> 01:24:11,120
and you can just plug the numbers right
2330
01:24:11,120 --> 01:24:14,639
in and we get a 3.33 percent chance
2331
01:24:14,639 --> 01:24:16,880
hence there is a 3.33 percent chance that a
2332
01:24:16,880 --> 01:24:18,719
person who smokes will get a lung
2333
01:24:18,719 --> 01:24:20,239
disease
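Plugging the three percentages into Bayes' theorem, P(A|B) = P(B|A) * P(A) / P(B), can be sketched as follows. The function and variable names are chosen here for illustration:

```python
def bayes(p_b_given_a, p_a, p_b):
    """Return P(A|B) from P(B|A), P(A) and P(B)."""
    return p_b_given_a * p_a / p_b

p_lung = 0.10               # P(A): patient has lung disease
p_smoker = 0.15             # P(B): patient smokes
p_smoker_given_lung = 0.05  # P(B|A): patient smokes, given lung disease

p_lung_given_smoker = bayes(p_smoker_given_lung, p_lung, p_smoker)
print(f"P(lung disease | smoker) = {p_lung_given_smoker:.2%}")
```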
2334
01:24:20,239 --> 01:24:22,000
so we're going to pull up a little
2335
01:24:22,000 --> 01:24:24,639
python code always my favorite roll
2336
01:24:24,639 --> 01:24:26,159
up the sleeves
2337
01:24:26,159 --> 01:24:28,080
keep in mind we're going to be doing
2338
01:24:28,080 --> 01:24:31,360
this um kind of like the back end way
2339
01:24:31,360 --> 01:24:33,600
so that you can see what's going on and
2340
01:24:33,600 --> 01:24:37,040
then later on we're going to create
2341
01:24:37,040 --> 01:24:39,679
we'll get into another demo which shows
2342
01:24:39,679 --> 01:24:40,880
you some of the tools that are already
2343
01:24:40,880 --> 01:24:42,480
pre-built for this
2344
01:24:42,480 --> 01:24:45,840
let's start by creating a set so we're
2345
01:24:45,840 --> 01:24:48,320
going to create a set with curly braces
2346
01:24:48,320 --> 01:24:51,440
this means that our set has
2347
01:24:51,440 --> 01:24:55,280
only unique values so you have a list
2348
01:24:55,280 --> 01:24:57,199
you have your tuples which can never
2349
01:24:57,199 --> 01:24:59,840
change and then you have
2350
01:24:59,840 --> 01:25:03,360
in this case the set so four seven
2351
01:25:03,360 --> 01:25:05,600
you can't create a four seven comma four
2352
01:25:05,600 --> 01:25:07,520
it'll delete the four out so it's only
2353
01:25:07,520 --> 01:25:09,040
unique values
2354
01:25:09,040 --> 01:25:12,320
and if you use dictionaries
2355
01:25:12,320 --> 01:25:14,800
quick reminder this should look familiar
2356
01:25:14,800 --> 01:25:17,280
because it looks like a dictionary we have a
2357
01:25:17,280 --> 01:25:20,159
value and that value is assigned to or
2358
01:25:20,159 --> 01:25:23,040
that key is assigned to a value
2359
01:25:23,040 --> 01:25:24,719
so you could have a key value set up as
2360
01:25:24,719 --> 01:25:26,800
a dictionary so it's like a dictionary
2361
01:25:26,800 --> 01:25:28,719
without the value it's just the keys and
2362
01:25:28,719 --> 01:25:31,920
they all have to be unique
2363
01:25:31,920 --> 01:25:34,080
and if we run this we have a
2364
01:25:34,080 --> 01:25:37,440
set of four seven
2365
01:25:37,920 --> 01:25:40,960
we can also take a list a regular
2366
01:25:40,960 --> 01:25:42,320
setup and i'm going to go ahead and just
2367
01:25:42,320 --> 01:25:44,639
throw in another number in here four
2368
01:25:44,639 --> 01:25:47,040
and run it uh and you can see here if i
2369
01:25:47,040 --> 01:25:50,000
take my list one two three four four
2370
01:25:50,000 --> 01:25:53,199
and i convert it to a set and here it is
2371
01:25:53,199 --> 01:25:57,199
my set from list equals set my list
2372
01:25:57,199 --> 01:25:58,960
the result is one two three four so it
2373
01:25:58,960 --> 01:26:01,040
just deletes that last four right out of
2374
01:26:01,040 --> 01:26:02,960
there
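A reconstruction of what is being typed on screen, based only on the narration. The exact variable names in the video may differ; `my_set`, `my_list` and `my_set_from_list` are guesses taken from the spoken words:

```python
# Curly braces create a set; the repeated 4 is dropped automatically.
my_set = {4, 7, 4}
print(my_set)

# A list keeps duplicates, but converting it to a set removes them.
my_list = [1, 2, 3, 4, 4]
my_set_from_list = set(my_list)
print(my_set_from_list)
```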
2375
01:26:02,960 --> 01:26:05,440
and with the sets you can also go in
2376
01:26:05,440 --> 01:26:06,880
there and
2377
01:26:06,880 --> 01:26:09,760
print here is my set my set
2378
01:26:09,760 --> 01:26:12,080
uh three is in the set and then if you
2379
01:26:12,080 --> 01:26:15,440
do three in my set
2380
01:26:15,440 --> 01:26:17,679
that's going to be a logic function
2381
01:26:17,679 --> 01:26:20,719
uh and one in my set six is not in the
2382
01:26:20,719 --> 01:26:24,880
set and so forth if we run this
2383
01:26:24,880 --> 01:26:27,679
we get three is in the set true one is
2384
01:26:27,679 --> 01:26:29,199
in the set false because three five
2385
01:26:29,199 --> 01:26:32,159
seven is another one six is in the set
2386
01:26:32,159 --> 01:26:36,639
six is not in the set so not in my set
2387
01:26:36,639 --> 01:26:39,040
you can also use this with the list we
2388
01:26:39,040 --> 01:26:41,120
could have just used three five seven
2389
01:26:41,120 --> 01:26:42,639
and it would have
2390
01:26:42,639 --> 01:26:45,760
the same response on there is three and
2391
01:26:45,760 --> 01:26:48,080
usually do if three is in but three in
2392
01:26:48,080 --> 01:26:50,320
my set still works on just a regular
2393
01:26:50,320 --> 01:26:51,440
list
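A rough sketch of the set behavior just described. The variable names are illustrative and may not match what was typed in the demo; the values follow the ones read out above.

```python
# Illustrative names; values follow the ones read out in the walkthrough.
my_list = [1, 2, 3, 4, 4]
my_set_from_list = set(my_list)   # the duplicate 4 collapses: {1, 2, 3, 4}

my_set = {3, 5, 7}
print("3 is in the set:", 3 in my_set)          # True
print("1 is in the set:", 1 in my_set)          # False
print("6 is not in the set:", 6 not in my_set)  # True

# Membership tests also work on a plain list.
print(3 in [3, 5, 7])                           # True
```

On a list the `in` check scans element by element, while a set uses hashing, so the set version tends to be faster on large collections.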
2394
01:26:51,440 --> 01:26:52,639
then we'll go ahead and do a little
2395
01:26:52,639 --> 01:26:54,800
iteration we're going to do kind of the
2396
01:26:54,800 --> 01:26:56,880
dice one remember
2397
01:26:56,880 --> 01:26:59,280
one two three four five six and so we're
2398
01:26:59,280 --> 01:27:01,600
going to bring in itertools and
2399
01:27:01,600 --> 01:27:04,960
import product as product
2400
01:27:04,960 --> 01:27:06,719
and i'll show you what that means in
2401
01:27:06,719 --> 01:27:09,199
just a second so we have our two dice we
2402
01:27:09,199 --> 01:27:11,040
have dice a
2403
01:27:11,040 --> 01:27:13,440
and it's going to be a set of values
2404
01:27:13,440 --> 01:27:15,040
they can only have one value for each
2405
01:27:15,040 --> 01:27:17,040
one that's why they put it in a set and
2406
01:27:17,040 --> 01:27:20,000
if you remember from range it is up to
2407
01:27:20,000 --> 01:27:21,600
seven so this is going to be one two
2408
01:27:21,600 --> 01:27:24,239
three four five six it will not include
2409
01:27:24,239 --> 01:27:26,639
the seven and the same thing for our
2410
01:27:26,639 --> 01:27:29,199
dice b
2411
01:27:29,199 --> 01:27:30,480
and then we're gonna do is we're gonna
2412
01:27:30,480 --> 01:27:34,480
create a list which is the product
2413
01:27:34,480 --> 01:27:38,480
of a and b so every pairing of a with b
2414
01:27:38,480 --> 01:27:40,320
and if we go ahead and run this it'll
2415
01:27:40,320 --> 01:27:42,639
print that out and you'll see
2416
01:27:42,639 --> 01:27:44,239
in this case when they say product
2417
01:27:44,239 --> 01:27:47,760
because it's an iteration tool
2418
01:27:47,760 --> 01:27:49,679
we're talking about creating a tuple of
2419
01:27:49,679 --> 01:27:50,639
the two
2420
01:27:50,639 --> 01:27:52,719
so we've now created a tuple of all
2421
01:27:52,719 --> 01:27:55,600
possible outcomes of the dice where dice
2422
01:27:55,600 --> 01:27:58,480
a is one two three one to six and dice b
2423
01:27:58,480 --> 01:28:00,159
is one to six and you can see one to one
2424
01:28:00,159 --> 01:28:02,480
one to two one to three and so forth
2425
01:28:02,480 --> 01:28:03,840
you remember we had a slide on this
2426
01:28:03,840 --> 01:28:06,480
earlier where we talked about
2427
01:28:06,480 --> 01:28:07,920
the different all the different outcomes
2428
01:28:07,920 --> 01:28:09,280
of a dice
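The two-dice product described above could look roughly like this. The names `dice_a`, `dice_b`, and `outcomes` are assumptions for illustration; `itertools.product` is the standard-library function being discussed.

```python
from itertools import product

# Each die is a set of its six faces; range(1, 7) stops before 7.
dice_a = set(range(1, 7))
dice_b = set(range(1, 7))

# product() yields every (a, b) pairing, i.e. the Cartesian product,
# not an arithmetic product or sum of the faces.
outcomes = list(product(dice_a, dice_b))
print(outcomes[:3])
print(len(outcomes))  # 36
```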
2429
01:28:09,280 --> 01:28:11,040
we can play around with this a little
2430
01:28:11,040 --> 01:28:14,719
bit we can do n dice equals two
2431
01:28:14,719 --> 01:28:18,080
dice faces one two three four five six
2432
01:28:18,080 --> 01:28:19,760
uh another way of doing what we did
2433
01:28:19,760 --> 01:28:21,520
before and then we can create an event
2434
01:28:21,520 --> 01:28:23,840
space where we have a set which is the
2435
01:28:23,840 --> 01:28:25,920
product of the dice faces
2436
01:28:25,920 --> 01:28:27,920
repeat equals n dice and we'll go ahead
2437
01:28:27,920 --> 01:28:29,679
and just run this
2438
01:28:29,679 --> 01:28:32,000
and you can see here it just again puts
2439
01:28:32,000 --> 01:28:33,520
it through all the different possible
2440
01:28:33,520 --> 01:28:35,520
variables we can have
2441
01:28:35,520 --> 01:28:37,760
and then if we wanted to take the same
2442
01:28:37,760 --> 01:28:40,639
set on here and print them all out like
2443
01:28:40,639 --> 01:28:42,400
we had before
2444
01:28:42,400 --> 01:28:44,080
we can just go through for outcome in
2445
01:28:44,080 --> 01:28:47,280
event space print outcome end equals
2446
01:28:47,280 --> 01:28:50,719
so the event space is creating
2447
01:28:50,719 --> 01:28:52,719
a sequence and as you can see here when
2448
01:28:52,719 --> 01:28:55,360
we print it out it stacks them versus
2449
01:28:55,360 --> 01:28:56,719
going through and putting them in a nice
2450
01:28:56,719 --> 01:28:58,880
line
2451
01:28:58,880 --> 01:29:01,120
and we'll go ahead and do something
2452
01:29:01,120 --> 01:29:03,040
let's go print
2453
01:29:03,040 --> 01:29:04,880
since we have the end printing with a
2454
01:29:04,880 --> 01:29:07,520
comma that just means it's just gonna
2455
01:29:07,520 --> 01:29:09,199
it's not gonna hit the return going down
2456
01:29:09,199 --> 01:29:10,960
to the next line
2457
01:29:10,960 --> 01:29:14,400
and we'll go ahead and do the length
2458
01:29:15,120 --> 01:29:17,840
of our event space that'll be an
2459
01:29:17,840 --> 01:29:19,040
important variable we're going to want
2460
01:29:19,040 --> 01:29:21,840
to know in a minute
2461
01:29:22,239 --> 01:29:23,679
and of course if i get carried away with
2462
01:29:23,679 --> 01:29:25,840
my typing of length uh we'll print it
2463
01:29:25,840 --> 01:29:27,840
twice and it'll give me an error
2464
01:29:27,840 --> 01:29:30,480
so we have 36 different possible
2465
01:29:30,480 --> 01:29:33,120
variations here
2466
01:29:33,120 --> 01:29:34,960
and we might want to calculate something
2467
01:29:34,960 --> 01:29:36,239
like
2468
01:29:36,239 --> 01:29:38,159
what about the multiple of three what if
2469
01:29:38,159 --> 01:29:40,000
we want to have
2470
01:29:40,000 --> 01:29:42,639
uh the probability of the multiple three
2471
01:29:42,639 --> 01:29:45,440
in our setup
2472
01:29:46,000 --> 01:29:48,159
and so we can put together the code for
2473
01:29:48,159 --> 01:29:50,880
outcome in event space x y
2474
01:29:50,880 --> 01:29:52,400
equals outcome
2475
01:29:52,400 --> 01:29:55,360
if x plus y
2476
01:29:55,360 --> 01:29:57,199
remainder 3 so we're going to divide by
2477
01:29:57,199 --> 01:29:58,480
3 and look at the remainder and it
2478
01:29:58,480 --> 01:30:01,120
equals 0
2479
01:30:01,120 --> 01:30:02,800
then it's a favorable outcome we're
2480
01:30:02,800 --> 01:30:04,320
going to pop that outcome on the end
2481
01:30:04,320 --> 01:30:06,480
there
2482
01:30:06,480 --> 01:30:08,400
and we'll turn it into a set so the
2483
01:30:08,400 --> 01:30:10,480
favorable outcome equals a set
2484
01:30:10,480 --> 01:30:12,000
not necessary
2485
01:30:12,000 --> 01:30:13,600
because we know it's not going to be
2486
01:30:13,600 --> 01:30:15,360
repeating itself but just in case we'll
2487
01:30:15,360 --> 01:30:18,239
go ahead and do that
2488
01:30:19,120 --> 01:30:22,320
and if we want to print out the outcome
2489
01:30:22,320 --> 01:30:23,760
we can go ahead and see what that looks
2490
01:30:23,760 --> 01:30:26,560
like and you can see here these are all
2491
01:30:26,560 --> 01:30:28,960
multiples of three uh one plus two is
2492
01:30:28,960 --> 01:30:30,880
three five plus four is nine which
2493
01:30:30,880 --> 01:30:35,719
divided by three is three and so forth
2494
01:30:35,760 --> 01:30:37,920
and just like we looked up the length of
2495
01:30:37,920 --> 01:30:40,960
the one before let's go ahead and print
2496
01:30:40,960 --> 01:30:42,639
the length
2497
01:30:42,639 --> 01:30:44,000
of our
2498
01:30:44,000 --> 01:30:45,280
f outcome
2499
01:30:45,280 --> 01:30:49,560
so we can see what that looks like
2500
01:30:51,120 --> 01:30:53,360
there we go
2501
01:30:53,360 --> 01:30:55,280
and of course i did forget to add the
2502
01:30:55,280 --> 01:30:56,320
print in the middle because we're
2503
01:30:56,320 --> 01:30:58,080
looping through and putting an end on
2504
01:30:58,080 --> 01:30:59,520
the on the setup on there so we're going
2505
01:30:59,520 --> 01:31:01,600
to put the print in there and if i run
2506
01:31:01,600 --> 01:31:04,400
this you can see
2507
01:31:06,880 --> 01:31:10,239
we end up with 12. so we have 36 total
2508
01:31:10,239 --> 01:31:11,920
options
2509
01:31:11,920 --> 01:31:14,880
we have 12 that are multiple that add up
2510
01:31:14,880 --> 01:31:17,840
to a multiple of three
2511
01:31:17,840 --> 01:31:20,159
and we can easily compute the
2512
01:31:20,159 --> 01:31:22,239
probability of this
2513
01:31:22,239 --> 01:31:24,480
by simply taking the length
2514
01:31:24,480 --> 01:31:26,159
of our favorable outcome over the length
2515
01:31:26,159 --> 01:31:29,719
of the event space
2516
01:31:30,000 --> 01:31:31,600
if we print it out let me put that in
2517
01:31:31,600 --> 01:31:32,480
there
2518
01:31:32,480 --> 01:31:34,400
probability
2519
01:31:34,400 --> 01:31:36,400
last line so we just type it in we end
2520
01:31:36,400 --> 01:31:39,360
up with the 0.3333 chance
2521
01:31:39,360 --> 01:31:42,400
that's roughly a third
2522
01:31:42,400 --> 01:31:44,239
and we want to make this look nice so
2523
01:31:44,239 --> 01:31:45,679
let's go ahead and put in another line
2524
01:31:45,679 --> 01:31:47,520
there the probability of getting the sum
2525
01:31:47,520 --> 01:31:51,199
which is a multiple of 3 is
2526
01:31:51,199 --> 01:31:54,199
0.3333
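Putting the two-dice steps together, a minimal sketch might look like the following. The identifiers `n_dice`, `dice_faces`, `event_space`, and `favorable_outcome` are reconstructed from what was said aloud and may not match the demo exactly.

```python
from itertools import product

n_dice = 2                       # assumed name for the number of dice
dice_faces = {1, 2, 3, 4, 5, 6}
event_space = set(product(dice_faces, repeat=n_dice))

# Keep the outcomes whose sum leaves no remainder when divided by 3.
favorable_outcome = {
    outcome for outcome in event_space if sum(outcome) % 3 == 0
}

probability = len(favorable_outcome) / len(event_space)
print(len(event_space), len(favorable_outcome))  # 36 12
print(f"The probability of getting a sum which is a multiple of 3 is {probability:.4f}")
```

The probability is simply the count of favorable outcomes divided by the size of the event space, 12 / 36.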
2527
01:31:54,800 --> 01:31:56,960
we can compute the same thing for five
2528
01:31:56,960 --> 01:31:59,040
dice
2529
01:31:59,040 --> 01:32:01,440
and if we do this for five dice and go
2530
01:32:01,440 --> 01:32:03,840
and run it uh you can see we just have a
2531
01:32:03,840 --> 01:32:05,920
huge amount of choices
2532
01:32:05,920 --> 01:32:08,400
so just goes on and on down here and we
2533
01:32:08,400 --> 01:32:09,920
can look at
2534
01:32:09,920 --> 01:32:10,719
the
2535
01:32:10,719 --> 01:32:14,159
length of the event space
2536
01:32:19,760 --> 01:32:23,040
and we have 7776
2537
01:32:23,040 --> 01:32:26,080
choices that's a lot of choices
2538
01:32:26,080 --> 01:32:27,840
and if we want to ask the question like
2539
01:32:27,840 --> 01:32:29,679
we did above uh
2540
01:32:29,679 --> 01:32:31,440
what is the sum where the sum is a
2541
01:32:31,440 --> 01:32:33,920
multiple of five but not a multiple of
2542
01:32:33,920 --> 01:32:35,120
three
2543
01:32:35,120 --> 01:32:37,040
we can go through all of these different
2544
01:32:37,040 --> 01:32:38,880
options and then
2545
01:32:38,880 --> 01:32:40,480
you can see here
2546
01:32:40,480 --> 01:32:44,239
d1 d2 d3 d4 d5 equals the outcome
2547
01:32:44,239 --> 01:32:46,880
and if you add these all together and
2548
01:32:46,880 --> 01:32:48,560
the
2549
01:32:48,560 --> 01:32:50,239
division by five has a
2550
01:32:50,239 --> 01:32:52,320
remainder of zero
2551
01:32:52,320 --> 01:32:54,719
but the remainder of a division
2552
01:32:54,719 --> 01:32:56,960
by three is not equal to zero
2553
01:32:56,960 --> 01:32:59,679
so the remainder for five is equal to zero
2554
01:32:59,679 --> 01:33:01,840
but the remainder for three is not we can
2555
01:33:01,840 --> 01:33:03,840
just append that on here and then we can
2556
01:33:03,840 --> 01:33:06,960
look at that uh favorable outcome
2557
01:33:06,960 --> 01:33:08,400
we'll go ahead and set that and we'll
2558
01:33:08,400 --> 01:33:10,159
just take a look at this what's our
2559
01:33:10,159 --> 01:33:11,440
length
2560
01:33:11,440 --> 01:33:15,360
of our favorable outcome
2561
01:33:19,280 --> 01:33:20,480
it's always good to see what we're
2562
01:33:20,480 --> 01:33:23,199
working with and so we have 904 out of
2563
01:33:23,199 --> 01:33:26,199
7776
2564
01:33:26,480 --> 01:33:28,800
and then of course we can just do a
2565
01:33:28,800 --> 01:33:30,880
simple division to get the probability
2566
01:33:30,880 --> 01:33:32,320
on here what's the probability that
2567
01:33:32,320 --> 01:33:33,600
we're going to roll
2568
01:33:33,600 --> 01:33:35,600
a multiple of 5 when you add them
2569
01:33:35,600 --> 01:33:37,199
together
2570
01:33:37,199 --> 01:33:40,080
but not a multiple of three
2571
01:33:40,080 --> 01:33:41,360
and so we're just going to divide those
2572
01:33:41,360 --> 01:33:43,440
two numbers and you can see here we get
2573
01:33:43,440 --> 01:33:45,679
point one one six two five five or
2574
01:33:45,679 --> 01:33:49,960
eleven point six two percent
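The five-dice version follows the same pattern. This is a sketch under the same naming assumptions as before; the condition is "sum is a multiple of 5 but not a multiple of 3".

```python
from itertools import product

n_dice = 5
dice_faces = {1, 2, 3, 4, 5, 6}
event_space = set(product(dice_faces, repeat=n_dice))

favorable_outcome = set()
for outcome in event_space:
    d1, d2, d3, d4, d5 = outcome
    total = d1 + d2 + d3 + d4 + d5
    # Remainder 0 when divided by 5, non-zero remainder when divided by 3.
    if total % 5 == 0 and total % 3 != 0:
        favorable_outcome.add(outcome)

probability = len(favorable_outcome) / len(event_space)
print(len(event_space))        # 7776
print(len(favorable_outcome))  # 904
print(round(probability, 6))   # 0.116255
```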
2575
01:33:50,880 --> 01:33:53,120
and so you can really have a nice visual
2576
01:33:53,120 --> 01:33:55,920
that this is not really complicated math
2577
01:33:55,920 --> 01:33:58,000
right here on probabilities
2578
01:33:58,000 --> 01:34:00,080
it's just how many options do you have
2579
01:34:00,080 --> 01:34:02,400
and how many of those are you possibly
2580
01:34:02,400 --> 01:34:05,040
going to be able to come up with with
2581
01:34:05,040 --> 01:34:07,120
the solution you're looking for
2582
01:34:07,120 --> 01:34:10,400
and this leads us to a confusion matrix
2583
01:34:10,400 --> 01:34:12,480
a confusion matrix is a table which is
2584
01:34:12,480 --> 01:34:14,239
used to describe the performance of a
2585
01:34:14,239 --> 01:34:16,239
classification model on a set of test
2586
01:34:16,239 --> 01:34:19,280
data for which the true values are known
2587
01:34:19,280 --> 01:34:20,960
and so you'll see in the left we have
2588
01:34:20,960 --> 01:34:23,840
the predicted and the actual
2589
01:34:23,840 --> 01:34:26,639
and we have a negative uh false negative
2590
01:34:26,639 --> 01:34:29,520
positive true positive
2591
01:34:29,520 --> 01:34:32,159
and then we have false positive and true
2592
01:34:32,159 --> 01:34:35,199
negative and you can think of this as
2593
01:34:35,199 --> 01:34:38,560
your predicted model what does that mean
2594
01:34:38,560 --> 01:34:40,800
that means if you divided your data and
2595
01:34:40,800 --> 01:34:42,560
you use two-thirds of this to create the
2596
01:34:42,560 --> 01:34:43,600
model
2597
01:34:43,600 --> 01:34:45,520
you might then test it against an actual
2598
01:34:45,520 --> 01:34:47,440
case for the last third to see how well
2599
01:34:47,440 --> 01:34:50,000
it comes out how many times was it
2600
01:34:50,000 --> 01:34:52,719
true positive versus a
2601
01:34:52,719 --> 01:34:54,560
false positive it gave a false positive
2602
01:34:54,560 --> 01:34:55,679
response
2603
01:34:55,679 --> 01:34:58,239
and you can imagine in medical
2604
01:34:58,239 --> 01:35:00,480
situations this is a pretty big deal you
2605
01:35:00,480 --> 01:35:02,480
don't want to give a false positive so
2606
01:35:02,480 --> 01:35:04,480
you might adjust your model accordingly
2607
01:35:04,480 --> 01:35:06,800
so you don't have a false positive say
2608
01:35:06,800 --> 01:35:09,199
with a coronavirus test it'd be better to
2609
01:35:09,199 --> 01:35:10,800
have a false negative and they go back
2610
01:35:10,800 --> 01:35:13,760
and get re-tested than to have 30 false
2611
01:35:13,760 --> 01:35:16,239
positives where then the test is pretty
2612
01:35:16,239 --> 01:35:17,760
much invalid
2613
01:35:17,760 --> 01:35:20,480
so in a use case like cancer prediction
2614
01:35:20,480 --> 01:35:22,560
let's consider an example where a cancer
2615
01:35:22,560 --> 01:35:24,320
prediction model is put to the test for
2616
01:35:24,320 --> 01:35:26,400
its accuracy and precision
2617
01:35:26,400 --> 01:35:28,239
actual result of a person's medical
2618
01:35:28,239 --> 01:35:30,480
report is compared with the prediction
2619
01:35:30,480 --> 01:35:33,280
made by the machine learning model
2620
01:35:33,280 --> 01:35:34,800
and so you can see here here's our
2621
01:35:34,800 --> 01:35:36,639
actual predicted whether they have
2622
01:35:36,639 --> 01:35:38,400
cancer or not you know cancer a big one
2623
01:35:38,400 --> 01:35:40,239
you don't want to have a
2624
01:35:40,239 --> 01:35:42,880
false positive i mean a false negative
2625
01:35:42,880 --> 01:35:44,239
in other words you don't want to have it
2626
01:35:44,239 --> 01:35:46,239
tell you that you don't have cancer when
2627
01:35:46,239 --> 01:35:48,400
you do so that would be something you'd
2628
01:35:48,400 --> 01:35:50,719
really be looking for in this particular
2629
01:35:50,719 --> 01:35:53,760
domain you don't want a false negative
2630
01:35:53,760 --> 01:35:55,120
and this is again you know you've
2631
01:35:55,120 --> 01:35:57,199
created a model you have hundreds of
2632
01:35:57,199 --> 01:35:59,600
people or thousands of pieces of data
2633
01:35:59,600 --> 01:36:00,800
that come in
2634
01:36:00,800 --> 01:36:02,639
there's a real famous case study where
2635
01:36:02,639 --> 01:36:04,159
they have the imagery and all the
2636
01:36:04,159 --> 01:36:05,440
measurements they take and there's about
2637
01:36:05,440 --> 01:36:08,239
36 different measurements they take
2638
01:36:08,239 --> 01:36:11,040
and then if you run a basic model
2639
01:36:11,040 --> 01:36:12,719
you want to know just how accurate it is
2640
01:36:12,719 --> 01:36:15,120
how many negative results do you have
2641
01:36:15,120 --> 01:36:16,560
that are either telling people they have
2642
01:36:16,560 --> 01:36:18,159
cancer that don't or telling people that
2643
01:36:18,159 --> 01:36:20,159
have cancer that they don't and then
2644
01:36:20,159 --> 01:36:22,639
we can take these numbers and we can
2645
01:36:22,639 --> 01:36:24,560
feed them into our accuracy our
2646
01:36:24,560 --> 01:36:27,120
precision and our recall
2647
01:36:27,120 --> 01:36:28,800
so accuracy precision and recall
2648
01:36:28,800 --> 01:36:30,800
accuracy is a metric to measure how
2649
01:36:30,800 --> 01:36:33,520
accurately the results are predicted
2650
01:36:33,520 --> 01:36:35,040
and this is your
2651
01:36:35,040 --> 01:36:36,239
total
2652
01:36:36,239 --> 01:36:38,159
true where you got the right results you
2653
01:36:38,159 --> 01:36:39,760
add them together the true positive the
2654
01:36:39,760 --> 01:36:40,960
true negative
2655
01:36:40,960 --> 01:36:43,440
over all the results so what percentage
2656
01:36:43,440 --> 01:36:45,920
of them were accurate versus what were
2657
01:36:45,920 --> 01:36:47,199
wrong
2658
01:36:47,199 --> 01:36:49,040
we talked about precision is a metric to
2659
01:36:49,040 --> 01:36:50,480
measure how many of the positively
2660
01:36:50,480 --> 01:36:52,320
predicted cases actually turned out
2661
01:36:52,320 --> 01:36:54,080
to be positive
2662
01:36:54,080 --> 01:36:57,199
uh so we have a precision on
2663
01:36:57,199 --> 01:36:58,639
true positive
2664
01:36:58,639 --> 01:37:00,639
again if you're talking about like uh
2665
01:37:00,639 --> 01:37:04,480
covid testing with the viruses uh you
2666
01:37:04,480 --> 01:37:07,040
really want this to be a high number you
2667
01:37:07,040 --> 01:37:08,400
want this true
2668
01:37:08,400 --> 01:37:10,480
that to be the center point where you
2669
01:37:10,480 --> 01:37:11,840
might have the opposite if you're
2670
01:37:11,840 --> 01:37:14,320
dealing with a cancer where you want no
2671
01:37:14,320 --> 01:37:16,480
false negatives
2672
01:37:16,480 --> 01:37:18,560
so this is your metric on here precision
2673
01:37:18,560 --> 01:37:20,880
is your true positive over
2674
01:37:20,880 --> 01:37:22,560
true positive plus
2675
01:37:22,560 --> 01:37:24,320
false positive
2676
01:37:24,320 --> 01:37:26,239
and then your recall how many of the
2677
01:37:26,239 --> 01:37:28,480
actual positive cases we were able to
2678
01:37:28,480 --> 01:37:30,800
predict correctly with our model
2679
01:37:30,800 --> 01:37:33,040
so recall is the true positive over the true positive
2680
01:37:33,040 --> 01:37:36,239
plus the false negative on there and
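The three metrics just described can be written out as a small helper. The function name and the example counts are hypothetical, chosen only to make the arithmetic easy to follow.

```python
def classification_metrics(tp, fp, fn, tn):
    """Return (accuracy, precision, recall) from the four confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + fp + fn + tn)  # correct predictions over all predictions
    precision = tp / (tp + fp)                  # of predicted positives, how many were right
    recall = tp / (tp + fn)                     # of actual positives, how many were found
    return accuracy, precision, recall


# Hypothetical counts, not taken from the video.
accuracy, precision, recall = classification_metrics(tp=40, fp=10, fn=20, tn=30)
print(accuracy, precision, recall)
```

With these counts accuracy is 70 / 100, precision is 40 / 50, and recall is 40 / 60. The sketch does not guard against a zero denominator, which can happen if a model never predicts the positive class.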
2681
01:37:36,239 --> 01:37:38,320
we'll want to go ahead and do a demo on
2682
01:37:38,320 --> 01:37:42,000
the naive bayes classifier before i get
2683
01:37:42,000 --> 01:37:45,280
too far into a naive bayes classifier
2684
01:37:45,280 --> 01:37:46,320
because we're going to pull it from the
2685
01:37:46,320 --> 01:37:49,920
sklearn or scikit-learn
2686
01:37:49,920 --> 01:37:51,920
let's go ahead kind of an interesting
2687
01:37:51,920 --> 01:37:53,679
page here for classifiers when you go
2688
01:37:53,679 --> 01:37:55,840
into the sk learn kit there's a lot of
2689
01:37:55,840 --> 01:37:57,840
ways to do classification and i'll just
2690
01:37:57,840 --> 01:37:59,840
zoom up in here so you can see some of
2691
01:37:59,840 --> 01:38:01,440
the titles
2692
01:38:01,440 --> 01:38:03,199
there's everything from the nearest
2693
01:38:03,199 --> 01:38:05,760
neighbor linear
2694
01:38:05,760 --> 01:38:07,119
but we're going to be focusing on the
2695
01:38:07,119 --> 01:38:09,520
naive bayes over here
2696
01:38:09,520 --> 01:38:12,480
and this is just a sample data set that
2697
01:38:12,480 --> 01:38:14,880
they put together and you can see how
2698
01:38:14,880 --> 01:38:16,560
some of these have a very different
2699
01:38:16,560 --> 01:38:17,600
output
2700
01:38:17,600 --> 01:38:20,400
the naive bayes remember is set up as
2701
01:38:20,400 --> 01:38:22,480
probably the most simplified uh
2702
01:38:22,480 --> 01:38:24,880
calculator or set of predictions out
2703
01:38:24,880 --> 01:38:25,679
there
2704
01:38:25,679 --> 01:38:27,600
and so what we've been talking about
2705
01:38:27,600 --> 01:38:29,360
with the true false and stuff like that
2706
01:38:29,360 --> 01:38:31,760
where there's a
2707
01:38:31,760 --> 01:38:33,760
and then a belief that there is an
2708
01:38:33,760 --> 01:38:35,040
independence assumption between the
2709
01:38:35,040 --> 01:38:36,719
features where the features are
2710
01:38:36,719 --> 01:38:39,600
assumed to be independent of each other
2711
01:38:39,600 --> 01:38:42,239
uh then we can go ahead and use that for
2712
01:38:42,239 --> 01:38:43,760
the prediction and so that's what we're
2713
01:38:43,760 --> 01:38:46,800
using is a naive bayes classifier versus
2714
01:38:46,800 --> 01:38:48,239
many of the other classifiers that are
2715
01:38:48,239 --> 01:38:50,639
out there
2716
01:38:51,199 --> 01:38:54,080
for this we're going to use the social
2717
01:38:54,080 --> 01:38:56,480
network ads it's a little data set on
2718
01:38:56,480 --> 01:38:57,520
here
2719
01:38:57,520 --> 01:39:00,000
and let me go and just open that up the
2720
01:39:00,000 --> 01:39:01,520
file
2721
01:39:01,520 --> 01:39:04,560
here we go it has user id gender age
2722
01:39:04,560 --> 01:39:07,440
estimated salary uh purchased
2723
01:39:07,440 --> 01:39:10,000
and so we have you can see the user id
2724
01:39:10,000 --> 01:39:12,159
male 19
2725
01:39:12,159 --> 01:39:14,800
estimated salary 19 000
2726
01:39:14,800 --> 01:39:17,360
and purchased zero so it's either gonna
2727
01:39:17,360 --> 01:39:19,280
make a purchase or not
2728
01:39:19,280 --> 01:39:21,760
so look at that last one zero one we
2729
01:39:21,760 --> 01:39:23,600
should be thinking of binomials we
2730
01:39:23,600 --> 01:39:26,000
should be thinking of a simple naive
2731
01:39:26,000 --> 01:39:29,679
bayes classifier kind of setup
2732
01:39:29,760 --> 01:39:31,280
so if we close this out we're going to
2733
01:39:31,280 --> 01:39:34,880
go ahead and import our numpy as np
2734
01:39:34,880 --> 01:39:36,480
it's nice to have a good
2735
01:39:36,480 --> 01:39:38,639
visual of our data so we'll put in our
2736
01:39:38,639 --> 01:39:41,520
matplotlib here's our pandas our
2737
01:39:41,520 --> 01:39:44,239
data frame
2738
01:39:44,239 --> 01:39:45,360
and then we're going to go ahead and
2739
01:39:45,360 --> 01:39:47,600
import the data set and the data set's
2740
01:39:47,600 --> 01:39:49,040
going to be we're going to read it from
2741
01:39:49,040 --> 01:39:51,920
the social network ads.csv then we're
2742
01:39:51,920 --> 01:39:53,199
going to print the head just so you can
2743
01:39:53,199 --> 01:39:54,480
see it again
2744
01:39:54,480 --> 01:39:56,400
even though i showed you it in the file
2745
01:39:56,400 --> 01:39:59,679
and x equals the data set iloc
2746
01:39:59,679 --> 01:40:02,080
two three values and y is going to be
2747
01:40:02,080 --> 01:40:03,199
the four
2748
01:40:03,199 --> 01:40:05,440
column four let me just run this it's a
2749
01:40:05,440 --> 01:40:07,360
little easier to go over that
2750
01:40:07,360 --> 01:40:08,639
you can see right here we're going to be
2751
01:40:08,639 --> 01:40:09,760
looking at
2752
01:40:09,760 --> 01:40:12,159
0 1 2 as age
2753
01:40:12,159 --> 01:40:15,600
and estimated salary so 2 3
2754
01:40:15,600 --> 01:40:18,080
and that's what iloc just means
2755
01:40:18,080 --> 01:40:19,440
that we're
2756
01:40:19,440 --> 01:40:21,679
looking at the number versus a regular
2757
01:40:21,679 --> 01:40:23,600
location a regular location you'd
2758
01:40:23,600 --> 01:40:27,119
actually say age and estimated salary
2759
01:40:27,119 --> 01:40:28,880
and then column four is did they make a
2760
01:40:28,880 --> 01:40:31,600
purchase they purchased something
2761
01:40:31,600 --> 01:40:32,880
so those are the three columns we're
2762
01:40:32,880 --> 01:40:34,480
going to be looking at when we do this
2763
01:40:34,480 --> 01:40:36,080
and we've gone ahead and imported these
2764
01:40:36,080 --> 01:40:36,800
and
2765
01:40:36,800 --> 01:40:38,960
imported the data so now our data set is
2766
01:40:38,960 --> 01:40:42,560
all set with this information in it
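To show the positional selection without needing the CSV file, here is a sketch using a tiny in-memory stand-in. The column header spellings and every value other than the first row's gender, age, salary, and label are assumptions for illustration.

```python
import pandas as pd

# Stand-in for the social network ads file; only the first row's
# gender / age / salary / purchased values come from the walkthrough.
dataset = pd.DataFrame({
    "User ID": [1, 2, 3],
    "Gender": ["Male", "Female", "Male"],
    "Age": [19, 35, 26],
    "EstimatedSalary": [19000, 20000, 43000],
    "Purchased": [0, 0, 1],
})
print(dataset.head())

# iloc selects by integer position: columns 2 and 3 are age and salary.
X = dataset.iloc[:, [2, 3]].values
# Column 4 is the purchased label.
y = dataset.iloc[:, 4].values

# loc selects by label, so the same features can be named explicitly.
X_by_name = dataset.loc[:, ["Age", "EstimatedSalary"]].values
```

With the real file, the DataFrame would instead come from `pd.read_csv(...)` pointed at the CSV.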
2767
01:40:43,760 --> 01:40:45,199
and we'll need to go ahead and split the
2768
01:40:45,199 --> 01:40:48,320
data up so we need our from the sk learn
2769
01:40:48,320 --> 01:40:51,119
model selection we can import train test
2770
01:40:51,119 --> 01:40:52,560
split
2771
01:40:52,560 --> 01:40:54,400
this does a nice job we can set the
2772
01:40:54,400 --> 01:40:56,880
random state so randomly picks the data
2773
01:40:56,880 --> 01:40:59,360
and we're just going to take uh 25 percent of
2774
01:40:59,360 --> 01:41:01,440
it's going to go into the test our x
2775
01:41:01,440 --> 01:41:03,119
test and our y test
2776
01:41:03,119 --> 01:41:05,760
and the 75 percent will go to x train and y
2777
01:41:05,760 --> 01:41:06,800
train
2778
01:41:06,800 --> 01:41:08,400
that way once we
2779
01:41:08,400 --> 01:41:10,639
create our model we can then have data
2780
01:41:10,639 --> 01:41:12,719
to see just how accurate or how well it
2781
01:41:12,719 --> 01:41:16,960
has performed with our prediction
2782
01:41:17,280 --> 01:41:20,400
the next step in pre-processing our data
2783
01:41:20,400 --> 01:41:23,600
is to go ahead and do feature scaling
2784
01:41:23,600 --> 01:41:25,119
now a lot of this should start to look
2785
01:41:25,119 --> 01:41:26,800
familiar if you've done a number of the
2786
01:41:26,800 --> 01:41:29,119
other modules and setup you should start
2787
01:41:29,119 --> 01:41:30,800
noticing that we
2788
01:41:30,800 --> 01:41:32,719
bring in our data we take a look at what
2789
01:41:32,719 --> 01:41:34,480
we're working with
2790
01:41:34,480 --> 01:41:35,920
we go ahead and split it up into
2791
01:41:35,920 --> 01:41:37,920
training and testing
2792
01:41:37,920 --> 01:41:39,440
in this case we're going to go ahead and
2793
01:41:39,440 --> 01:41:41,360
scale it scale it means we're putting it
2794
01:41:41,360 --> 01:41:44,880
between a value of minus 1 and 1
2795
01:41:44,880 --> 01:41:46,719
or or someplace in the middle ground
2796
01:41:46,719 --> 01:41:47,520
there
2797
01:41:47,520 --> 01:41:49,520
this way if you have any huge set you
2798
01:41:49,520 --> 01:41:51,600
don't have this huge um
2799
01:41:51,600 --> 01:41:53,360
setup if we go back up to here where
2800
01:41:53,360 --> 01:41:55,920
salary the salary is
2801
01:41:55,920 --> 01:41:59,280
20 000 versus age 35.
2802
01:41:59,280 --> 01:42:01,119
well there's a good chance with a lot of
2803
01:42:01,119 --> 01:42:03,920
the back end math that 20 000 will skew
2804
01:42:03,920 --> 01:42:05,920
the results and the estimated salary
2805
01:42:05,920 --> 01:42:08,239
will have a higher impact than the age
2806
01:42:08,239 --> 01:42:09,600
instead of balancing them out and
2807
01:42:09,600 --> 01:42:11,360
letting the calculations weigh them
2808
01:42:11,360 --> 01:42:13,360
properly
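The transcript does not name the scaler being used, so as an assumption this sketch shows standardization (zero mean, unit standard deviation), which is what scikit-learn's `StandardScaler` computes. The sample values are illustrative.

```python
from statistics import mean, pstdev


def standardize(values):
    """Rescale values to zero mean and unit (population) standard deviation."""
    mu = mean(values)
    sigma = pstdev(values)
    return [(v - mu) / sigma for v in values]


ages = [19, 35, 26, 27]                  # illustrative values
salaries = [19000, 20000, 43000, 57000]  # illustrative values

# After scaling, both features live on a comparable scale.
scaled_ages = standardize(ages)
scaled_salaries = standardize(salaries)
print(scaled_ages)
print(scaled_salaries)
```

Standardized values usually land within a few units of zero but are not strictly bounded between -1 and 1; a min-max scaler would be the choice if hard bounds are needed.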
2809
01:42:13,360 --> 01:42:16,159
and finally we get to actually create
2810
01:42:16,159 --> 01:42:19,520
our naive bayes model
2811
01:42:19,520 --> 01:42:21,360
um and then we're going to go ahead and
2812
01:42:21,360 --> 01:42:25,440
import the gaussian naive bayes
2813
01:42:25,440 --> 01:42:28,480
and the gaussian is the most basic
2814
01:42:28,480 --> 01:42:30,800
one that's what we're looking at now it
2815
01:42:30,800 --> 01:42:33,600
turns out though if you go to the sk
2816
01:42:33,600 --> 01:42:34,400
learn
2817
01:42:34,400 --> 01:42:35,440
kit
2818
01:42:35,440 --> 01:42:36,960
they have a number of different ones you
2819
01:42:36,960 --> 01:42:39,360
can pull in there there's a
2820
01:42:39,360 --> 01:42:41,119
bernoulli i know i've never used that
2821
01:42:41,119 --> 01:42:44,080
one categorical
2822
01:42:44,080 --> 01:42:46,639
complement and here's our gaussian
2823
01:42:46,639 --> 01:42:48,159
so there's a number of different options
2824
01:42:48,159 --> 01:42:49,440
you can look at
2825
01:42:49,440 --> 01:42:51,679
gaussian when you come to the naive
2826
01:42:51,679 --> 01:42:54,239
bayes is the most commonly used
2827
01:42:54,239 --> 01:42:56,080
so we're talking about the naive bayes
2828
01:42:56,080 --> 01:42:57,280
that's usually what people are talking
2829
01:42:57,280 --> 01:42:58,639
about when they when they're pulling
2830
01:42:58,639 --> 01:42:59,600
this in
2831
01:42:59,600 --> 01:43:01,040
and one of the nice things about the
2832
01:43:01,040 --> 01:43:03,119
gaussian if you go to their website um
2833
01:43:03,119 --> 01:43:06,000
to sk learn the naive bayes gaussian
2834
01:43:06,000 --> 01:43:07,520
there's a lot of cool features one of
2835
01:43:07,520 --> 01:43:10,400
them is you can do partial fit on here
2836
01:43:10,400 --> 01:43:11,920
that means if you have a huge amount of
2837
01:43:11,920 --> 01:43:13,440
data you don't have to process it all at
2838
01:43:13,440 --> 01:43:17,440
once you can batch it into the
2839
01:43:17,440 --> 01:43:20,239
GaussianNB model and there's many other
2840
01:43:20,239 --> 01:43:21,920
different things you can do with it as
2841
01:43:21,920 --> 01:43:24,480
far as fitting the data and how you
2842
01:43:24,480 --> 01:43:25,920
manipulate it
2843
01:43:25,920 --> 01:43:27,440
we're just doing the basics so we're
2844
01:43:27,440 --> 01:43:28,400
going to go ahead and create our
2845
01:43:28,400 --> 01:43:30,239
classifier we're going to equal the
2846
01:43:30,239 --> 01:43:32,639
GaussianNB
2847
01:43:32,639 --> 01:43:33,920
and then we're going to do a fit we're
2848
01:43:33,920 --> 01:43:36,639
going to fit our training data and our
2849
01:43:36,639 --> 01:43:40,960
training solution so x train y train
2850
01:43:41,280 --> 01:43:43,280
and we'll go ahead and run this uh it's
2851
01:43:43,280 --> 01:43:45,040
going to tell us that it ran the code
2852
01:43:45,040 --> 01:43:47,280
right there
2853
01:43:47,280 --> 01:43:49,840
and now we have our trained classifier
2854
01:43:49,840 --> 01:43:51,280
model
2855
01:43:51,280 --> 01:43:53,199
so the next step is we need to go ahead
2856
01:43:53,199 --> 01:43:54,880
and run a prediction we're going to do
2857
01:43:54,880 --> 01:43:56,480
our y predict equals the
2858
01:43:56,480 --> 01:43:58,400
classifier.predict
2859
01:43:58,400 --> 01:43:59,600
x test
2860
01:43:59,600 --> 01:44:01,600
so here we fit the data and now we're
2861
01:44:01,600 --> 01:44:05,239
going to go ahead and predict
2862
01:44:06,400 --> 01:44:10,239
and now we get to our confusion matrix
2863
01:44:10,239 --> 01:44:11,280
so from
2864
01:44:11,280 --> 01:44:13,520
sklearn.metrics you can
2865
01:44:13,520 --> 01:44:15,679
import your confusion matrix
2866
01:44:15,679 --> 01:44:17,440
it just saves you from doing all the
2867
01:44:17,440 --> 01:44:19,760
simple math it does it all for you
2868
01:44:19,760 --> 01:44:20,880
and then we'll go ahead and create our
2869
01:44:20,880 --> 01:44:23,199
confusion matrix with the y test and
2870
01:44:23,199 --> 01:44:26,320
the y predict so we have our actual
2871
01:44:26,320 --> 01:44:29,119
and we have our predicted value
2872
01:44:29,119 --> 01:44:30,880
and you can see from here this is the
2873
01:44:30,880 --> 01:44:32,800
chart we looked at here's predicted so
2874
01:44:32,800 --> 01:44:35,040
true positive false positive
2875
01:44:35,040 --> 01:44:38,960
false negative true negative
2876
01:44:39,280 --> 01:44:41,119
and if we go ahead and run this there we
2877
01:44:41,119 --> 01:44:45,600
have it 65 3 7 25
2878
01:44:45,600 --> 01:44:48,320
and in this particular prediction we had
2879
01:44:48,320 --> 01:44:51,760
65 we predicted correctly as far as no
2880
01:44:51,760 --> 01:44:52,719
purchase they're not going to make a
2881
01:44:52,719 --> 01:44:53,679
purchase
2882
01:44:53,679 --> 01:44:55,920
and three we wrongly predicted would purchase
2883
01:44:55,920 --> 01:44:58,000
and then we had 25 we correctly predicted would
2884
01:44:58,000 --> 01:45:00,960
purchase and seven purchasers we missed so
2885
01:45:00,960 --> 01:45:05,280
there's our confusion matrix
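a small sketch of how scikit-learn lays out the confusion matrix, using made-up labels rather than the demo's data. in `confusion_matrix(y_true, y_pred)` the rows are the actual classes and the columns are the predicted classes, sorted by label, so with labels 0 and 1 the four cells read true negative, false positive, false negative, true positive.

```python
from sklearn.metrics import confusion_matrix

# toy labels (assumption): 0 = no purchase, 1 = purchase
y_true = [0, 0, 0, 1, 1, 1, 1]
y_hat = [0, 0, 1, 1, 1, 0, 1]

cm = confusion_matrix(y_true, y_hat)

# ravel walks the matrix row by row: actual 0 first, then actual 1
tn, fp, fn, tp = cm.ravel()
```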
2886
01:45:05,280 --> 01:45:07,360
at this point if you were with your
2887
01:45:07,360 --> 01:45:09,760
shareholders or a board meeting
2888
01:45:09,760 --> 01:45:11,520
you would start to hear some snoozing if
2889
01:45:11,520 --> 01:45:12,960
they were looking at the numbers and you
2890
01:45:12,960 --> 01:45:14,840
say hey here's my confusion
2891
01:45:14,840 --> 01:45:17,520
matrix so let's go ahead and visualize
2892
01:45:17,520 --> 01:45:19,280
the results
2893
01:45:19,280 --> 01:45:20,639
we're going to pull from
2894
01:45:20,639 --> 01:45:24,960
matplotlib.colors import ListedColormap
2895
01:45:25,520 --> 01:45:27,440
and actually my machine is going
2896
01:45:27,440 --> 01:45:31,760
to throw an error on this
2897
01:45:31,760 --> 01:45:33,840
because of the way the setup is i have a
2898
01:45:33,840 --> 01:45:35,920
newer version on here than when they put
2899
01:45:35,920 --> 01:45:37,440
together the demo
2900
01:45:37,440 --> 01:45:39,679
and we need our x set and our y set
2901
01:45:39,679 --> 01:45:42,560
which is our x train and y train
2902
01:45:42,560 --> 01:45:45,440
and then we'll create our x1 x2
2903
01:45:45,440 --> 01:45:47,520
and we'll put that into a grid
2904
01:45:47,520 --> 01:45:50,560
and we set our x set minimum as the start and
2905
01:45:50,560 --> 01:45:52,560
our x set max as the stop
2906
01:45:52,560 --> 01:45:53,840
and if you come all the way over here
2907
01:45:53,840 --> 01:45:56,239
we're going to step 0.01 this is going to
2908
01:45:56,239 --> 01:45:59,040
give us a nice line is what that's doing
2909
01:45:59,040 --> 01:46:01,600
and we're going to plot the contour
2910
01:46:01,600 --> 01:46:04,400
plot the x limit plot the y limit
2911
01:46:04,400 --> 01:46:06,719
and put the scatter plot in there let's
2912
01:46:06,719 --> 01:46:08,239
go ahead and run this
2913
01:46:08,239 --> 01:46:11,119
to be honest when i'm doing these graphs
2914
01:46:11,119 --> 01:46:12,560
there's so many different ways to do
2915
01:46:12,560 --> 01:46:13,920
that there's so many different ways to
2916
01:46:13,920 --> 01:46:15,760
put this code together
2917
01:46:15,760 --> 01:46:18,159
to show you what we're doing it's a lot
2918
01:46:18,159 --> 01:46:20,480
easier to pull up the graph and then go
2919
01:46:20,480 --> 01:46:22,880
back up and explain it
2920
01:46:22,880 --> 01:46:25,280
so the first thing we want to note here
2921
01:46:25,280 --> 01:46:28,000
when we're looking at the data
2922
01:46:28,000 --> 01:46:30,639
is this is the training set
2923
01:46:30,639 --> 01:46:32,719
and so we have those who didn't make a
2924
01:46:32,719 --> 01:46:34,880
purchase we've drawn a nice area for
2925
01:46:34,880 --> 01:46:36,000
that
2926
01:46:36,000 --> 01:46:38,639
that's defined by the naive bayes setup
2927
01:46:38,639 --> 01:46:40,400
and then we have those who did make a
2928
01:46:40,400 --> 01:46:42,480
purchase the green and you can see that
2929
01:46:42,480 --> 01:46:44,480
some of the green dots fall into the
2930
01:46:44,480 --> 01:46:46,239
red area and some of the red dots fall
2931
01:46:46,239 --> 01:46:47,440
into the green
2932
01:46:47,440 --> 01:46:49,280
so even our training set isn't going to
2933
01:46:49,280 --> 01:46:51,600
be a hundred percent uh we couldn't do
2934
01:46:51,600 --> 01:46:52,560
that
2935
01:46:52,560 --> 01:46:54,239
and so we're looking at our different
2936
01:46:54,239 --> 01:46:56,159
data coming down
2937
01:46:56,159 --> 01:46:58,880
uh we can kind of arrange our x1 x2 so
2938
01:46:58,880 --> 01:47:00,800
we have a nice plot going on and we're
2939
01:47:00,800 --> 01:47:02,159
going to create the
2940
01:47:02,159 --> 01:47:04,400
contour
2941
01:47:04,400 --> 01:47:05,920
that's that nice line that's drawn down
2942
01:47:05,920 --> 01:47:09,040
the middle on here with the red green
2943
01:47:09,040 --> 01:47:10,480
that's what this is doing
2944
01:47:10,480 --> 01:47:12,400
right here with the reshape and notice
2945
01:47:12,400 --> 01:47:13,679
that we had to
2946
01:47:13,679 --> 01:47:14,800
uh
2947
01:47:14,800 --> 01:47:17,040
do the .T if you remember from numpy
2948
01:47:17,040 --> 01:47:19,360
if you did the numpy module
2949
01:47:19,360 --> 01:47:23,760
you start with one row of all your x1s and
2950
01:47:23,760 --> 01:47:27,280
one row of all your x2s you have to
2951
01:47:27,280 --> 01:47:29,199
flip it so each row is a pair x1 x2
2952
01:47:29,199 --> 01:47:31,440
x1 x2 next row and so forth
2953
01:47:31,440 --> 01:47:32,480
so this is what we're kind of looking
2954
01:47:32,480 --> 01:47:36,159
for right here on this setup
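a rough sketch of the grid, `.T` and reshape steps, assuming the plotting code follows the common pattern `predict(np.array([x1.ravel(), x2.ravel()]).T).reshape(x1.shape)`. the `predict` function below is a hypothetical stand-in for `classifier.predict`, and the grid is kept coarse so the shapes are easy to follow.

```python
import numpy as np

# hypothetical stand-in for classifier.predict: class 1 when x1 + x2 > 0
def predict(points):
    return (points[:, 0] + points[:, 1] > 0).astype(int)

# coarse step for illustration; the demo steps by 0.01
x1, x2 = np.meshgrid(np.arange(-1.0, 1.0, 0.5), np.arange(-2.0, 2.0, 0.5))

# ravel flattens each grid; stacking gives 2 rows (all x1s, then all x2s)
stacked = np.array([x1.ravel(), x2.ravel()])

# .T flips that to one (x1, x2) pair per row, which is what predict expects
pairs = stacked.T

# reshape the flat predictions back to the grid so a contour plot can use them
z = predict(pairs).reshape(x1.shape)
```

`z` has the same shape as `x1` and `x2`, so a call like `plt.contourf(x1, x2, z)` can color the two regions.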
2955
01:47:36,159 --> 01:47:38,960
and then the scatter plot is of course
2956
01:47:38,960 --> 01:47:40,560
your scattered data across there we're
2957
01:47:40,560 --> 01:47:42,159
just going through all the points that
2958
01:47:42,159 --> 01:47:44,480
puts these nice little dots on to our
2959
01:47:44,480 --> 01:47:46,560
setup on here and we have our estimated
2960
01:47:46,560 --> 01:47:48,800
salary and our age and then of course the
2961
01:47:48,800 --> 01:47:51,360
dots are did they make a purchase or not
2962
01:47:51,360 --> 01:47:52,960
and just a quick note this is kind of
2963
01:47:52,960 --> 01:47:54,880
funny you can see up here where it says
2964
01:47:54,880 --> 01:47:59,040
x set y set equals x train y train which
2965
01:47:59,040 --> 01:48:02,239
seems kind of a little weird to do
2966
01:48:02,239 --> 01:48:03,679
this is because this is probably
2967
01:48:03,679 --> 01:48:05,679
originally a function definition
2968
01:48:05,679 --> 01:48:07,280
so it was its own function that could be
2969
01:48:07,280 --> 01:48:09,199
called over and over again
2970
01:48:09,199 --> 01:48:11,040
and which is really a good way to do it
2971
01:48:11,040 --> 01:48:12,159
because the next thing we're going to do
2972
01:48:12,159 --> 01:48:14,480
is do the exact same thing
2973
01:48:14,480 --> 01:48:16,159
but we're going to visualize the test
2974
01:48:16,159 --> 01:48:17,679
set results
2975
01:48:17,679 --> 01:48:19,119
that way we can see what happened with
2976
01:48:19,119 --> 01:48:22,719
our test group our 25 percent
2977
01:48:22,719 --> 01:48:25,040
and you can see down here we have the
2978
01:48:25,040 --> 01:48:26,239
test set
2979
01:48:26,239 --> 01:48:27,119
and it
2980
01:48:27,119 --> 01:48:28,719
if you look at the two
2981
01:48:28,719 --> 01:48:30,080
graphs next to each other this one
2982
01:48:30,080 --> 01:48:31,600
obviously has
2983
01:48:31,600 --> 01:48:33,280
75 percent of the data so it's going to
2984
01:48:33,280 --> 01:48:34,960
show a lot more
2985
01:48:34,960 --> 01:48:37,920
this is only 25 percent of the data you can see
2986
01:48:37,920 --> 01:48:39,920
that there's a number that are kind of
2987
01:48:39,920 --> 01:48:41,360
on the edge as to whether they could
2988
01:48:41,360 --> 01:48:43,360
guess by age and income they're going to
2989
01:48:43,360 --> 01:48:45,360
make a purchase or not
2990
01:48:45,360 --> 01:48:47,280
but that said it still is pretty clear
2991
01:48:47,280 --> 01:48:48,880
it's pretty good as far as how much the
2992
01:48:48,880 --> 01:48:52,080
estimate is and how good it does
2993
01:48:52,080 --> 01:48:52,880
now
2994
01:48:52,880 --> 01:48:55,360
graphs are really effective
2995
01:48:55,360 --> 01:48:57,760
for showing people what's going on but
2996
01:48:57,760 --> 01:49:00,000
you also need to have the numbers and so
2997
01:49:00,000 --> 01:49:01,679
we're going to do from sklearn we're
2998
01:49:01,679 --> 01:49:03,679
going to import metrics
2999
01:49:03,679 --> 01:49:04,719
and then we're going to print our
3000
01:49:04,719 --> 01:49:06,880
metrics.classification_report from the y
3001
01:49:06,880 --> 01:49:10,159
test and the y predict
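a sketch of the classification report call. the label arrays here are rebuilt from the confusion matrix counts read out earlier (65, 3, 7, 25) and are an assumption, not the demo's actual `y_test` and `y_pred`; the layout assumed is scikit-learn's, with 0 = no purchase and 1 = purchase.

```python
from sklearn import metrics

# rebuild label lists from the counts (assumption):
# 68 actual zeros (65 predicted 0, 3 predicted 1)
# 32 actual ones  (7 predicted 0, 25 predicted 1)
y_test = [0] * 68 + [1] * 32
y_pred = [0] * 65 + [1] * 3 + [0] * 7 + [1] * 25

# the text table the video prints
print(metrics.classification_report(y_test, y_pred))

# the same numbers as a dictionary, handy for checking individual values
report = metrics.classification_report(y_test, y_pred, output_dict=True)
```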
3002
01:49:10,639 --> 01:49:13,119
and you can see here we have precision
3003
01:49:13,119 --> 01:49:15,280
precision of zeros is 0.90 there's our
3004
01:49:15,280 --> 01:49:16,560
recall
3005
01:49:16,560 --> 01:49:20,960
0.96 we have an f1 score and a support
3006
01:49:20,960 --> 01:49:23,920
and we have our precision the recall on
3007
01:49:23,920 --> 01:49:25,440
getting it right
3008
01:49:25,440 --> 01:49:27,440
and then we can do our accuracy the
3009
01:49:27,440 --> 01:49:30,000
macro average and the weighted average
3010
01:49:30,000 --> 01:49:32,239
so you can see it pulls in
3011
01:49:32,239 --> 01:49:34,000
pretty good as far as
3012
01:49:34,000 --> 01:49:35,600
how accurate it is
3013
01:49:35,600 --> 01:49:37,600
you could say it's going to be about 90
3014
01:49:37,600 --> 01:49:40,800
percent it's going to guess correctly
3015
01:49:40,800 --> 01:49:42,000
that they're not going to
3016
01:49:42,000 --> 01:49:44,800
purchase and we had an 89 percent chance that
3017
01:49:44,800 --> 01:49:46,480
they are going to purchase
3018
01:49:46,480 --> 01:49:48,320
um and then the other numbers as you get
3019
01:49:48,320 --> 01:49:49,520
down
3020
01:49:49,520 --> 01:49:50,960
have a little bit different meaning but
3021
01:49:50,960 --> 01:49:52,080
it's pretty straightforward on here
3022
01:49:52,080 --> 01:49:55,199
here's our accuracy and here's our macro
3023
01:49:55,199 --> 01:49:56,800
average and the weighted average and
3024
01:49:56,800 --> 01:49:58,320
everything else you might need and if
3025
01:49:58,320 --> 01:50:00,400
you forgot the exact definition of
3026
01:50:00,400 --> 01:50:02,560
accuracy it is the
3027
01:50:02,560 --> 01:50:05,280
true positive plus true negative over all of
3028
01:50:05,280 --> 01:50:06,880
the different setups
3029
01:50:06,880 --> 01:50:09,840
precision is your true positive over all
3030
01:50:09,840 --> 01:50:12,080
positives true and false
3031
01:50:12,080 --> 01:50:14,560
and recall is a true positive over true
3032
01:50:14,560 --> 01:50:17,360
positive plus false negative
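the three definitions just given can be worked through by hand with the counts from the demo's confusion matrix. this sketch assumes scikit-learn's layout, so 65 3 7 25 is read as true negative, false positive, false negative, true positive, with purchase (label 1) as the positive class.

```python
# counts read from the confusion matrix 65 3 / 7 25 (layout is an assumption)
tn, fp, fn, tp = 65, 3, 7, 25

# accuracy: everything we got right over everything
accuracy = (tp + tn) / (tp + tn + fp + fn)

# precision: true positives over all predicted positives
precision = tp / (tp + fp)

# recall: true positives over all actual positives
recall = tp / (tp + fn)

# the same two ratios for the zeros class (no purchase)
precision_zeros = tn / (tn + fn)
recall_zeros = tn / (tn + fp)
```

rounded to two places these come out to 0.90 accuracy, 0.89 precision and 0.78 recall for purchases, and 0.90 precision and 0.96 recall for the zeros, which lines up with the figures read from the report.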
3033
01:50:17,360 --> 01:50:18,719
and we can just real quick flip back
3034
01:50:18,719 --> 01:50:20,000
there
3035
01:50:20,000 --> 01:50:22,639
so you can see those numbers on here
3036
01:50:22,639 --> 01:50:25,600
here's our precision here's our recall
3037
01:50:25,600 --> 01:50:28,320
and here's our accuracy on this
3038
01:50:28,320 --> 01:50:30,080
thank you for joining us for mathematics
3039
01:50:30,080 --> 01:50:32,320
for machine learning my name is richard
3040
01:50:32,320 --> 01:50:36,760
kirschner with the simplilearn team
3041
01:50:37,150 --> 01:50:38,880
[Music]
3042
01:50:38,880 --> 01:50:40,639
hi there if you like this video
3043
01:50:40,639 --> 01:50:42,400
subscribe to the simplilearn youtube
3044
01:50:42,400 --> 01:50:44,960
channel and click here to watch similar
3045
01:50:44,960 --> 01:50:47,199
videos to nerd up and get certified
3046
01:50:47,199 --> 01:50:50,520
click here