1
00:00:02,399 --> 00:00:03,399
Good day,
2
00:00:03,399 --> 00:00:08,500
everyone. This is your lecturer, Monica Wahi.
And we're going to start now with section
3
00:00:08,500 --> 00:00:16,870
1.1. What is statistics? So here's our learning
objectives for this lecture. At the end of
4
00:00:16,870 --> 00:00:21,679
this lecture, the students should be able
to state at least one definition of statistics.
5
00:00:21,679 --> 00:00:27,699
Yes, there's more than one. Give one example
of a population parameter, and one example
6
00:00:27,699 --> 00:00:34,580
of a sample statistic. Also, the student should
be able to classify a variable into quantitative
7
00:00:34,580 --> 00:00:43,070
or qualitative, and as nominal, ordinal, interval,
or ratio. So what we're going to cover in
8
00:00:43,070 --> 00:00:48,220
this lecture is, first I'm going to go over
some definitions of statistics. Like I said,
9
00:00:48,220 --> 00:00:52,620
there's more than one. But they all sort of
relate to the basic concept of why you're
10
00:00:52,620 --> 00:00:57,900
doing statistics, and especially not math.
So what's the difference, right? Then we're
11
00:00:57,900 --> 00:01:02,880
gonna go over a population parameter and sample
statistic. And you'll know what those mean,
12
00:01:02,880 --> 00:01:09,680
at the end of the lecture. And finally, we're
going to go over classifying levels of measurement.
13
00:01:09,680 --> 00:01:16,170
So let's start with the definition of statistics.
And so we're going to go over these concepts
14
00:01:16,170 --> 00:01:22,050
like what it is. And also I'm going to define
for you the concept of individuals versus
15
00:01:22,050 --> 00:01:26,940
variables. You may know definitions for those
words already, but I'm going to give you them
16
00:01:26,940 --> 00:01:32,150
in statistics-ese. And then I'm going to
give you examples of statistics, individuals
17
00:01:32,150 --> 00:01:40,470
and variables in healthcare. So here are the
definitions. What is statistics? Statistics
18
00:01:40,470 --> 00:01:46,460
is the study of how to collect, organize, analyze,
and interpret numerical information and data.
19
00:01:46,460 --> 00:01:53,250
Well, that sounds pretty esoteric, right?
But if you actually think about it, even if
20
00:01:53,250 --> 00:01:55,490
you did a simple survey, like you just did
21
00:01:55,490 --> 00:01:56,710
a wiki, you just
22
00:01:56,710 --> 00:02:01,070
look on Yelp, right? You look on Yelp, and
you see, you know, the restaurant you want
23
00:02:01,070 --> 00:02:04,850
to go to. Some people say five stars or four
stars, but there's a few two stars, one star.
24
00:02:04,850 --> 00:02:10,810
Well, do you go? I mean, there's a whole bunch
of different answers. So how do you do that?
25
00:02:10,810 --> 00:02:16,069
You kind of have to analyze it, you kind of
have to interpret it. So it's not that easy.
26
00:02:16,069 --> 00:02:21,540
So statistics is both the science of uncertainty,
and the technology of extracting information
27
00:02:21,540 --> 00:02:27,950
from data. So in other words, if you've got
a bunch of data about like a restaurant, um,
28
00:02:27,950 --> 00:02:32,480
you don't know how it's gonna be if you actually
go there, right? You don't know for sure.
29
00:02:32,480 --> 00:02:38,349
But, uh, so it's the science of uncertainty.
If you look on Yelp, and you're seeing almost
30
00:02:38,349 --> 00:02:44,560
everybody's giving it a four or five star,
maybe it's gonna be good for you, right? But
31
00:02:44,560 --> 00:02:50,989
you don't know, maybe there's new management.
That's the uncertainty. So statistics is used
32
00:02:50,989 --> 00:02:56,109
to help us make decisions, not just whether
to go to the restaurant or not, but important
33
00:02:56,109 --> 00:03:01,439
statistics, such as in health care and public
health. Well, I guess if it's an expensive
34
00:03:01,439 --> 00:03:05,700
restaurant, maybe it's important. But anyway,
in health care and public health, you really
35
00:03:05,700 --> 00:03:09,609
need these statistics, because they really
guide you. Like, for example, let's think
36
00:03:09,609 --> 00:03:14,569
of the Centers for Disease Control and Prevention
in the United States. So what do they do?
37
00:03:14,569 --> 00:03:19,279
They spend the whole year studying the different
flu viruses that go round, because there's
38
00:03:19,279 --> 00:03:20,279
more than one.
39
00:03:20,279 --> 00:03:21,279
They spend
40
00:03:21,279 --> 00:03:25,900
the whole year doing that. They organize, analyze,
and interpret numerical information and data
41
00:03:25,900 --> 00:03:32,709
about these different viruses, the different
influenza viruses that are going around. They
42
00:03:32,709 --> 00:03:37,859
extract that information. And you know what
decisions they make? They make the decisions about
43
00:03:37,859 --> 00:03:46,500
what viruses to include in the next year's
vaccine. Are they always right? Sure enough,
44
00:03:46,500 --> 00:03:49,189
they're not. I mean, have you ever had a year
where you're like, Oh, my gosh, everybody
45
00:03:49,189 --> 00:03:54,780
I know, got vaccinated, and they're still
getting sick? Well, you know, give them a break.
46
00:03:54,780 --> 00:03:59,859
It's the science of uncertainty; it just
didn't work out that time. However, this is
47
00:03:59,859 --> 00:04:06,879
probably better than just randomly guessing.
Right. So that's statistics for you. Now,
48
00:04:06,879 --> 00:04:13,309
I promised you I'd tell you the statistics-ese
version of individuals and variables.
49
00:04:13,309 --> 00:04:18,608
Now, if you're outside statistics, you know
that individuals are people, right. And you
50
00:04:18,608 --> 00:04:24,210
know that a variable is a factor, like a factor
that can vary, you know, like, the only variable
51
00:04:24,210 --> 00:04:25,400
is I don't know what time
52
00:04:25,400 --> 00:04:27,590
something's going to happen.
53
00:04:27,590 --> 00:04:28,590
But when you
54
00:04:28,590 --> 00:04:34,910
enter the land of statistics, there are specific
meanings to these two words. Individuals are
55
00:04:34,910 --> 00:04:40,090
people or objects included in a study. So
if you're gonna do an animal study with some
56
00:04:40,090 --> 00:04:46,319
mice in it, those would be the individuals.
If you do a randomized clinical trial, and
57
00:04:46,319 --> 00:04:51,860
you include people who have Alzheimer's in
it, then patients are your individuals. But
58
00:04:51,860 --> 00:04:56,259
we do a lot of different things in healthcare.
We sometimes study hospitals, like the rate
59
00:04:56,259 --> 00:05:01,300
of nosocomial infections, in which case if
you're looking at a whole bunch of stuff in hospitals,
60
00:05:01,300 --> 00:05:07,590
those would be the individuals. Sometimes
we look at states: rates of infant mortality,
61
00:05:07,590 --> 00:05:12,889
for example, in different states. In that
case, states would be individuals. So as you
62
00:05:12,889 --> 00:05:16,621
can see at the bottom of the slide, a variable
then is a characteristic of the individual
63
00:05:16,621 --> 00:05:22,840
to be measured, or observed. I give some examples
on the slide. But like I was saying, you know,
64
00:05:22,840 --> 00:05:28,370
if you wanted to study a hospital, for example,
I gave you the example of a variable of a
65
00:05:28,370 --> 00:05:34,110
rate of nosocomial infections. You could also
have other variables about that individual
66
00:05:34,110 --> 00:05:35,879
or hospital,
67
00:05:35,879 --> 00:05:41,949
like the rate of in-hospital mortality. And
so, as you can see, one of the things we do
68
00:05:41,949 --> 00:05:46,479
in statistics is we sit down and we decide,
well, who are going to be our individuals
69
00:05:46,479 --> 00:05:53,570
that we're going to measure? And what variables
are we going to measure? So I just threw up
70
00:05:53,570 --> 00:06:00,360
here a few examples of different kinds of
individuals we have, that we use a lot in
71
00:06:00,360 --> 00:06:07,389
health care and public health, and an example
of just one variable, about those example
72
00:06:07,389 --> 00:06:12,561
individuals. But there would theoretically
be many variables about them. And I just want
73
00:06:12,561 --> 00:06:19,240
you to notice, a lot of times, the individuals
are geographic locations. Other times they
74
00:06:19,240 --> 00:06:26,960
might be institutions, like I said, like hospitals,
or clinics, or programs. There's other things
75
00:06:26,960 --> 00:06:34,090
that they are, but these are just kind of
the big ones. So, um, as I was describing,
76
00:06:34,090 --> 00:06:39,949
and just to review, what I went over, statistics
is used in healthcare and other disciplines
77
00:06:39,949 --> 00:06:46,789
to aid in decision making, like I gave
the example of the CDC and their vaccine for
78
00:06:46,789 --> 00:06:52,419
influenza. And so therefore, it's really important
to understand statistics, because you need
79
00:06:52,419 --> 00:06:56,020
to understand these processes in healthcare,
like how do we figure out
80
00:06:56,020 --> 00:06:57,289
what to do?
81
00:06:57,289 --> 00:07:03,110
Like not only what do we do, but how do we
figure out what to do. And that's really important
82
00:07:03,110 --> 00:07:09,409
because we use statistics a lot in healthcare.
Now, we're going to move on to talk about
83
00:07:09,409 --> 00:07:15,729
what a population parameter is, and what a
sample statistic is. So we're going to go
84
00:07:15,729 --> 00:07:21,370
over first the definition of a population and
the definition of a sample. So you're sure
85
00:07:21,370 --> 00:07:26,280
about what those mean. And we're going to
talk about the data about a population and
86
00:07:26,280 --> 00:07:30,849
the data about a sample and how those are
different. And then we're going to get into
87
00:07:30,849 --> 00:07:36,930
what I was just describing: parameters and
statistics. And I'll give you a few examples.
88
00:07:36,930 --> 00:07:41,759
So let's start with: what is a population?
Again, another case where you just have a
89
00:07:41,759 --> 00:07:47,849
normal word, but it has a special meaning
in statistics. Well, it's a group of people
90
00:07:47,849 --> 00:07:54,340
or objects with a common theme. And when every
member of that group is considered, that's the population,
91
00:07:54,340 --> 00:08:00,060
right. So here, here's just one example. So
the theme would be like nurses who work at
92
00:08:00,060 --> 00:08:06,229
Massachusetts General Hospital,
so the population then if that was your theme,
93
00:08:06,229 --> 00:08:14,550
would be the list from human resources of every
nurse currently employed at MGH. Now,
94
00:08:14,550 --> 00:08:21,699
it really does depend on how you define that
theme. Like I could have said, nurses who
95
00:08:21,699 --> 00:08:28,169
belong to the American Nursing Association,
right? And then we'd be looking at a different
96
00:08:28,169 --> 00:08:35,640
list. I could say nurses who live in New Orleans,
in the city limits of New Orleans who live
97
00:08:35,640 --> 00:08:41,460
there, right? Then we'd be looking at a different
population. So it really has to do with the details
98
00:08:41,460 --> 00:08:48,730
of how you describe the theme around that
population. But the point is, once you describe
99
00:08:48,730 --> 00:08:56,320
that theme, the population is every single
individual in there. So then, what is the
100
00:08:56,320 --> 00:09:03,980
sample? Well, it's a small portion of that
population. It can be a representative sample,
101
00:09:03,980 --> 00:09:10,460
but it can also be a biased sample, and we're
going to get into that. So let's just go back
102
00:09:10,460 --> 00:09:17,130
to MGH. Let's say we were going
to survey a sample of the population of nurses
103
00:09:17,130 --> 00:09:24,130
at MGH. Let's say we only surveyed nurses
in the intensive care unit. That would be
104
00:09:24,130 --> 00:09:29,250
a sample, but not a representative sample.
So it would be a small portion of that population,
105
00:09:29,250 --> 00:09:35,840
but not a representative one. Probably more
representative would be if we asked at least
106
00:09:35,840 --> 00:09:42,600
one nurse from each department. And so I just
want to get in your head that the whole concept
107
00:09:42,600 --> 00:09:49,200
of a sample is that it's just a small portion
of the population. And it's not a portion
108
00:09:49,200 --> 00:09:55,570
of some other population. It's just that one.
But the problem is you can get a biased one
109
00:09:55,570 --> 00:10:02,400
or a representative one. So you have to think
about that. So when you think about it, if you've
110
00:10:02,400 --> 00:10:09,620
got a whole population, then you would get
variables about each individual in that population.
111
00:10:09,620 --> 00:10:15,230
And those variables would be your data. But
if you chose a sample, that's, you know, just
112
00:10:15,230 --> 00:10:20,950
a portion. It'll be a lot less work, right?
You'd still have to get variables about those
113
00:10:20,950 --> 00:10:25,800
individuals, but there's way fewer individuals,
so it'd probably be easier. So in population
114
00:10:25,800 --> 00:10:31,480
data, data from every single individual in
the population is available. And that's called
115
00:10:31,480 --> 00:10:40,600
a census. So I knew a person who decided
to do a survey of every single professor at
116
00:10:40,600 --> 00:10:46,830
a college. She didn't take just some professors
from each department, she sent the survey
117
00:10:46,830 --> 00:10:55,280
to every single professor. So she did not
use a sample, she used a census. But in sample
118
00:10:55,280 --> 00:11:01,830
data, the data are only available from some
of the individuals in the population. So if
119
00:11:01,830 --> 00:11:08,180
we go back to the researcher I described,
if she had only taken some of the list, the
120
00:11:08,180 --> 00:11:17,180
email list of the professors at that college,
then she would have been surveying a sample.
121
00:11:17,180 --> 00:11:23,490
And that's actually very commonly used in
research studies, especially of patients.
122
00:11:23,490 --> 00:11:29,320
Why would you need to go get every, for example,
kidney dialysis patient and study every single
123
00:11:29,320 --> 00:11:37,510
one? You only need a sample. And why is that?
Because we have statistics. So I'm going to
124
00:11:37,510 --> 00:11:44,970
just give you a few examples of real population
data in healthcare. You're probably familiar
125
00:11:44,970 --> 00:11:52,220
with Medicare. Medicare is the public insurance
program in the United States for elders.
126
00:11:52,220 --> 00:11:56,310
So even my grandma was on Medicare when she
was alive,
127
00:11:56,310 --> 00:12:02,980
and she was not a US citizen, she was from
India. So we really do a good job of covering
128
00:12:02,980 --> 00:12:08,430
our elders in the US with Medicare. In fact,
I even read a statistic that said, almost
129
00:12:08,430 --> 00:12:16,750
100% of people aged 65 and over are in Medicare.
And so therefore, if you download data from
130
00:12:16,750 --> 00:12:21,380
Medicare, they make it confidential, you know, they
just replace all the personal identifiers.
131
00:12:21,380 --> 00:12:25,910
But there's this thing called the Medicare
claims data set for every single transaction
132
00:12:25,910 --> 00:12:32,390
that happens, like if you're in Medicare,
and you go get some treatment, that's in there.
133
00:12:32,390 --> 00:12:39,290
So it has all the insurance claims filed by
the Medicare population. Because it has everybody,
134
00:12:39,290 --> 00:12:45,290
everything, then that is population data. Also,
in the United States, every 10 years, the
135
00:12:45,290 --> 00:12:50,390
government hires a bunch of people to go out
and survey a bunch of people. And also, they
136
00:12:50,390 --> 00:12:54,370
send out a bunch of surveys. And the idea
is to try to get every single person in the
137
00:12:54,370 --> 00:13:00,500
United States to fill out that survey. And
that's called the United States Census. So
138
00:13:00,500 --> 00:13:07,610
now, I'm going to give you sort of a mirror
image of the sample data. Okay. Remember how
139
00:13:07,610 --> 00:13:13,910
I was just talking about Medicare? People
who are enrolled in Medicare are called Medicare
140
00:13:13,910 --> 00:13:20,410
beneficiaries, and Medicare cares what they
think. So they do a survey of a sample of
141
00:13:20,410 --> 00:13:27,150
individuals on Medicare. And they do this
kind of often. I think they do it once a year.
142
00:13:27,150 --> 00:13:33,260
I'm not sure. It's a phone survey. They only
do a sample because they're going to use statistics
143
00:13:33,260 --> 00:13:38,640
to try and extrapolate that knowledge back
to the population of Medicare beneficiaries.
144
00:13:38,640 --> 00:13:45,580
Also, in case you noticed, the United States
Census only takes place every 10 years. Do
145
00:13:45,580 --> 00:13:51,130
you think changes happen in between? Yep,
lots of changes. Like you just think about
146
00:13:51,130 --> 00:13:58,200
Hurricane Katrina. That's very sad. It changed
the population distribution in Louisiana,
147
00:13:58,200 --> 00:14:03,300
very, very dramatically, and also other states
around there. So how did they keep up? Well,
148
00:14:03,300 --> 00:14:08,450
they used the American Community Survey. The
government does this, the United States Census
149
00:14:08,450 --> 00:14:15,130
Bureau, and that, again, is done by phone.
And that's conducted yearly. And it's a sample
150
00:14:15,130 --> 00:14:21,860
and so the US doesn't know exactly how many
people would be in Louisiana or anywhere else.
151
00:14:21,860 --> 00:14:27,790
But they can use statistics to extrapolate
that from the sample of the American Community
152
00:14:27,790 --> 00:14:38,730
Survey. I want to just do a shout out to statistical
notation. So from now on, when we see a capital
153
00:14:38,730 --> 00:14:47,190
N, like let's say you saw capital N equals
25, then you can assume that 25 means a population.
154
00:14:47,190 --> 00:14:52,820
That's just kind of a secret code we use in
statistics. However, if you saw a lowercase
155
00:14:52,820 --> 00:15:00,160
n, n equals 25, and it was lowercase, then
you could assume that this was a sample of
156
00:15:00,160 --> 00:15:06,440
the population. And again, it's just kind
of like a secret code, you have to pay attention.
157
00:15:06,440 --> 00:15:12,040
When I'm talking and I say n, and you can't
see uppercase or lowercase, you don't know
158
00:15:12,040 --> 00:15:21,660
if I'm talking about a population, or a sample.
Now I'm going to get into the concept of parameter
159
00:15:21,660 --> 00:15:28,980
versus statistic, I want you to notice that
the word parameter starts with P, P-A. So a parameter
160
00:15:28,980 --> 00:15:35,930
is a measure that describes the entire population.
So for instance, anything that would come
161
00:15:35,930 --> 00:15:43,540
out of that whole Medicare claims data set,
or that whole United States Census would be
162
00:15:43,540 --> 00:15:52,420
a parameter. On the other hand, a statistic,
statistic starts with S, and a statistic is
163
00:15:52,420 --> 00:15:59,550
a measure that describes only a sample of
a population. Here we have, again, a situation
164
00:15:59,550 --> 00:16:06,790
where the word statistic is used, like daily
on the news. In fact, sometimes I hear on
165
00:16:06,790 --> 00:16:14,290
the news, something like Oh, look at the rate
of HIV in Africa, it's going up. That's a
166
00:16:14,290 --> 00:16:21,230
terrible statistic. I agree. It's terrible.
But they mean parameter, because they're talking
167
00:16:21,230 --> 00:16:27,340
about all of Africa, every single person in
Africa, if the rate of HIV is going up in
168
00:16:27,340 --> 00:16:33,860
Africa, they mean a parameter, they don't
mean a statistic.
169
00:16:33,860 --> 00:16:40,820
So here's an example of parameters and statistics
that are based on the same population. So
170
00:16:40,820 --> 00:16:46,260
for example, the mean age of every American
on Medicare is a parameter. That's every single
171
00:16:46,260 --> 00:16:52,990
person. However, remember the Medicare beneficiary
survey? That's just a sample. So if we took
172
00:16:52,990 --> 00:16:58,350
the mean age of those people, we would just
have a statistic. And again, you just have
173
00:16:58,350 --> 00:17:03,030
to pay attention, because if you listen to
the news, you'll hear them use the word statistic
174
00:17:03,030 --> 00:17:10,420
to mean both parameter and statistic. But
in this situation, when you're practicing
175
00:17:10,420 --> 00:17:16,400
in the field of statistics, it's very important
to point out when the number you're talking
176
00:17:16,400 --> 00:17:22,450
about comes from a population versus comes
from a sample. So you should really use the
177
00:17:22,450 --> 00:17:31,800
term: this is a parameter, if it's from a population,
or this is a statistic, if it's from a sample.
178
00:17:31,800 --> 00:17:38,700
And so again, don't get confused. If you're
listening to someone talk in a lecture or
179
00:17:38,700 --> 00:17:46,150
in a video, you might want to look for clues
that a number is a population parameter, or
180
00:17:46,150 --> 00:17:52,920
is a sample statistic. Say you hear that the
data set that they use encompasses an entire
181
00:17:52,920 --> 00:17:58,630
population. And usually that's the kind of
stuff done by governments, like remember when
182
00:17:58,630 --> 00:18:04,430
I was talking about the rate of HIV in Africa,
that'd probably be done by governments or the
183
00:18:04,430 --> 00:18:09,310
United Nations, or the World Health Organization.
So when you're talking about numbers that
184
00:18:09,310 --> 00:18:10,310
might have come
185
00:18:10,310 --> 00:18:11,310
out of an
186
00:18:11,310 --> 00:18:17,380
entire population, usually done by the government,
that's probably a population parameter. Clues
187
00:18:17,380 --> 00:18:23,130
that someone's talking about a sample statistic
is if you hear them talking about a study
188
00:18:23,130 --> 00:18:24,890
that recruited volunteers.
189
00:18:24,890 --> 00:18:25,890
Well,
190
00:18:25,890 --> 00:18:30,340
then, if it's volunteers, they didn't get
everybody in the population. So it's going
191
00:18:30,340 --> 00:18:37,440
to be a sample. Also, like surveys, for instance,
surveys about who people are going to vote
192
00:18:37,440 --> 00:18:44,040
for, you know, public opinion surveys. They're never
going to ask every single person in the
193
00:18:44,040 --> 00:18:50,100
state, who are you going to vote for? They'll
just ask a sample. So if you hear about a survey,
194
00:18:50,100 --> 00:18:56,490
you might even have them tell you, say, n equals
maybe a few thousand people, because that's all
195
00:18:56,490 --> 00:19:02,000
they surveyed. And so that's a clue that we're
talking about a sample statistic rather than
196
00:19:02,000 --> 00:19:09,300
a population parameter. Now, I'm going to
talk about the difference between descriptive
197
00:19:09,300 --> 00:19:14,510
statistics and inferential statistics. But
first I'm going to remind you what the word
198
00:19:14,510 --> 00:19:22,230
infer means. So infer means to kind of get
a hint from something indirectly. It's kind
199
00:19:22,230 --> 00:19:31,720
of the complement to imply. So if I said my
friend implied that I should not call after
200
00:19:31,720 --> 00:19:38,370
9 p.m., and I figured that out, I would say I
inferred that I should not call my friend
201
00:19:38,370 --> 00:19:44,010
after 9 p.m. Okay. So inferential is what
I'm going to talk about next. But first I'm
202
00:19:44,010 --> 00:19:49,740
going to talk about descriptive. Descriptive
is pretty easy, because you can do it to samples
203
00:19:49,740 --> 00:19:56,570
and you can do it to populations. Well, variables
from samples and populations, right. And so,
204
00:19:56,570 --> 00:20:01,050
descriptive statistics involve methods of
organizing, picturing, and summarizing information
205
00:20:01,050 --> 00:20:05,540
from samples and populations. It's basically
just making pictures of it right? Like look
206
00:20:05,540 --> 00:20:09,809
at that bar chart. And that's just a simple
picture. And that can be made with just about
207
00:20:09,809 --> 00:20:17,140
any data. You get data from surveying people
at work, you get data from surveying your
208
00:20:17,140 --> 00:20:22,750
friends, what they're going to bring to the
potluck. Any of that can be used. You can
209
00:20:22,750 --> 00:20:28,950
go download the census data, you can make
descriptive statistics out of that. But there's
210
00:20:28,950 --> 00:20:36,059
something very special about inferential statistics.
And that involves methods of using information
211
00:20:36,059 --> 00:20:44,010
from a sample to draw conclusions regarding
the population. Therefore, inferential statistics
212
00:20:44,010 --> 00:20:52,370
can only be done on a sample. And
that's why that's called inferential.
213
00:20:52,370 --> 00:20:59,210
Right? Because infer, because the sample is
going to give a hint about what the population
214
00:20:59,210 --> 00:21:03,370
is, right? It's not going to say it directly,
which is annoying, right? But that's that
215
00:21:03,370 --> 00:21:09,000
uncertainty thing I was telling you about.
So the sample is going to imply something.
216
00:21:09,000 --> 00:21:14,840
Well, we're gonna infer something from the
sample about the population, right? So that's
217
00:21:14,840 --> 00:21:19,240
what inferential statistics is, is where you
take a sample, and you infer something about
218
00:21:19,240 --> 00:21:23,530
the population. Whereas descriptive statistics
is more loosey goosey. You can just do that
219
00:21:23,530 --> 00:21:32,370
to samples and populations, kind of like make
pictures out of it, right. So in statistics,
220
00:21:32,370 --> 00:21:37,880
it's really important to properly identify
measures as either population parameters,
221
00:21:37,880 --> 00:21:44,130
or sample statistics. Because as you can see,
you can only do inferential statistics on
222
00:21:44,130 --> 00:21:49,260
samples. And so you have to really know what
you're doing when you're doing statistics,
223
00:21:49,260 --> 00:21:53,900
what you're talking about, because different
types of data are used for parameters versus
224
00:21:53,900 --> 00:22:00,750
statistics. Alrighty, now we're going to get
into classifying variables into different
225
00:22:00,750 --> 00:22:06,550
levels of measurement. So remember our variables,
right, like we have individuals, and then
226
00:22:06,550 --> 00:22:11,390
we have variables about them. And those variables
actually can only fall into two groups, quantitative
227
00:22:11,390 --> 00:22:15,730
versus qualitative. And then depending on
which group they fall into, you can further
228
00:22:15,730 --> 00:22:21,221
classify them as interval versus ratio, or
nominal versus ordinal. And I'm going to give
229
00:22:21,221 --> 00:22:28,610
you some examples of how to classify a few
healthcare data types of variables. Alrighty,
230
00:22:28,610 --> 00:22:34,020
so I like to draw this picture. It's a four-level
data classification. I'll draw it slowly
231
00:22:34,020 --> 00:22:39,800
here for you. So we start with human research
data, that's what I like to start with. Alright,
232
00:22:39,800 --> 00:22:44,500
so we're going to split that into two. Remember,
I said that, we're going to start by talking
233
00:22:44,500 --> 00:22:49,960
about quantitative. Another word that's often
used for that is continuous, but we're going
234
00:22:49,960 --> 00:22:55,620
to use the word quantitative. So what does
that mean? That is a numerical measurement
235
00:22:55,620 --> 00:22:59,100
of something. So like, this gives an example
of temperature. So something
236
00:22:59,100 --> 00:23:03,810
with a number in it, I always think if I can
make a mean out of it, it must be a quantitative
237
00:23:03,810 --> 00:23:11,050
variable, right? And so here's an example
of quantitative variables. So time of admit,
238
00:23:11,050 --> 00:23:21,520
right? So imagine that you work a shift in
the ER, right? And from maybe 8 p.m. to 12, like
239
00:23:21,520 --> 00:23:27,309
midnight, right? So you have these four hours.
And you could say, what the average time of
240
00:23:27,309 --> 00:23:32,540
admit would be for those who got admitted
to the hospital, you know, somebody got admitted
241
00:23:32,540 --> 00:23:36,920
at like, eight o'clock, and then somebody
at 8:15, and whatever. You could put that together,
242
00:23:36,920 --> 00:23:43,650
and you'd say what the average time was. Also,
like, if you were doing a study, and you
243
00:23:43,650 --> 00:23:49,230
were seeing patients with a particular
condition like Alzheimer's disease, you could
244
00:23:49,230 --> 00:23:54,360
ask them their year of diagnosis, and then
you could make an average of that. And so
245
00:23:54,360 --> 00:24:00,250
you know that that is quantitative. Systolic
blood pressure is also numerical, and platelet
246
00:24:00,250 --> 00:24:05,500
count. And these are variables we run into
all the time in healthcare. So we've
247
00:24:05,500 --> 00:24:11,380
said that this is quantitative. Now, we'll
get back to our picture. So that's one side.
248
00:24:11,380 --> 00:24:16,000
So what if it's not quantitative? What else
could it be? Well, the only other category,
249
00:24:16,000 --> 00:24:21,510
it could be is categorical or qualitative.
I use the term qualitative, but some people
250
00:24:21,510 --> 00:24:28,300
use the term categorical, but that's kind
of what it is, is that it's a quality of something
251
00:24:28,300 --> 00:24:35,170
or a characteristic of something like sex
or race. So here are some qualitative variables
252
00:24:35,170 --> 00:24:41,080
in healthcare, like you can have type of health
insurance, like whether you're on Medicare
253
00:24:41,080 --> 00:24:47,370
or Medicaid or different types of private
insurance. Those are all just categorical,
254
00:24:47,370 --> 00:24:53,210
right? You can't make a mean out of that.
Also country of origin. If you're in a group
255
00:24:53,210 --> 00:24:58,110
of students and there are international students
in there, well, what countries are they from?
256
00:24:58,110 --> 00:25:03,090
Right? Well, you can't make a mean out of
that. Also you have situations where you do
257
00:25:03,090 --> 00:25:08,370
have numbers involved, like the stage of cancer,
right? That's depressing. Stage one cancer,
258
00:25:08,370 --> 00:25:13,630
stage two cancer, stage three. Well, you
never can make a mean out of the stage of
259
00:25:13,630 --> 00:25:19,330
cancer. You wouldn't say, well, the mean stage
is 1.4, or something like that. It's just
260
00:25:19,330 --> 00:25:25,430
a category. And of course, stage four is a
lot worse than stage one. You know, they're
261
00:25:25,430 --> 00:25:32,430
not just equal categories, but they're categories.
Same with trauma center level, like level four trauma
262
00:25:32,430 --> 00:25:38,809
center, where you wouldn't make a mean out
of the number after the term trauma center,
263
00:25:38,809 --> 00:25:44,870
right, like what level it is. But you could
say, well, in the state, maybe so many percent
264
00:25:44,870 --> 00:25:49,390
of our trauma centers are level four trauma
centers. So it's really just a categorical
265
00:25:49,390 --> 00:25:55,590
variable, even though there's a number involved.
Alright, so let's get back to our diagram.
266
00:25:55,590 --> 00:26:01,510
We figured out how to take any variable, and
first split it into one of two categories.
267
00:26:01,510 --> 00:26:08,020
It's either quantitative, if it's numerical,
or qualitative, if it's a characteristic.
268
00:26:08,020 --> 00:26:14,730
Now, we're going to just concentrate on quantitative
because we're going to separate those variables
269
00:26:14,730 --> 00:26:19,410
into two categories. And the first one we're
going to look at is interval. And the second
270
00:26:19,410 --> 00:26:26,309
one we're going to look at is ratio. So if
you happen to decide a variable is quantitative,
271
00:26:26,309 --> 00:26:31,710
then it could be interval or ratio, but not
if it's qualitative. Okay, if it's qualitative,
272
00:26:31,710 --> 00:26:38,640
it doesn't get to do that. So let's look at
interval versus ratio. So on the left side
273
00:26:38,640 --> 00:26:43,810
of the slide, we have interval, which is where
it's quantitative, and the differences between
274
00:26:43,810 --> 00:26:46,580
data values are meaningful.
275
00:26:46,580 --> 00:26:47,580
And
276
00:26:47,580 --> 00:26:51,210
ratio has the same thing, the differences
between the data values are meaningful. What
277
00:26:51,210 --> 00:26:56,630
do I mean by that? Well, remember how
I was talking before about how level one trauma
278
00:26:56,630 --> 00:27:01,700
center and level two trauma center, that
those are really categories, and not quantitative
279
00:27:01,700 --> 00:27:08,570
variables, because the difference actually
between them is not equal. Especially if you
280
00:27:08,570 --> 00:27:16,010
think of job classifications that might go
in 1, 2, 3, 4, like nurse one, nurse two, nurse three,
281
00:27:16,010 --> 00:27:21,471
nurse four, or I worked at a job where we
had office specialist one, office specialist
282
00:27:21,471 --> 00:27:23,860
two, office specialist three.
283
00:27:23,860 --> 00:27:26,850
And you know what, the deal
284
00:27:26,850 --> 00:27:32,950
for going from office specialist two to office
specialist three was really hard. You really
285
00:27:32,950 --> 00:27:39,360
had to do a lot there. But to go from one
to two wasn't that hard. So that was a categorical
286
00:27:39,360 --> 00:27:46,580
variable, right? Because the differences between
the values were meaningless. Okay. Like the
287
00:27:46,580 --> 00:27:52,529
difference between OS one and OS two versus
OS two and OS three, they weren't equal.
288
00:27:52,529 --> 00:27:56,440
Whereas when you're dealing with a quantitative
variable, regardless of whether it's interval
289
00:27:56,440 --> 00:28:03,010
or ratio, you're talking like years, or systolic
blood pressure, one year for you is one year
290
00:28:03,010 --> 00:28:10,100
for me. So that's fine, right? But here's
where the difference comes in between interval
291
00:28:10,100 --> 00:28:16,880
and ratio. So all quantitative variables have
meaningful differences between their data
292
00:28:16,880 --> 00:28:24,920
values, but this hairsplitting thing here
is that in interval, there is no true zero.
293
00:28:24,920 --> 00:28:32,010
And in ratio, there is a true zero. And this
is how I try to think about it. An interval
294
00:28:32,010 --> 00:28:38,740
means kind of like, a space between two things.
Like if you think of the word intermission,
295
00:28:38,740 --> 00:28:43,540
it's kind of like an interval. It's like an
interval of time during a show where you get
296
00:28:43,540 --> 00:28:48,610
to get up and go to the bathroom and get some
coffee. So that's interval. And so if you
297
00:28:48,610 --> 00:28:53,230
have something that's a space in between,
that's not going to have a zero, it doesn't
298
00:28:53,230 --> 00:28:59,150
really start anywhere, or end anywhere. It's
in between. Whereas ratio, how I remember
299
00:28:59,150 --> 00:29:05,130
that is, I don't know if you remember from
like high school, but you can't have a zero
300
00:29:05,130 --> 00:29:11,290
on the bottom of a ratio or a fraction. So
that's the way I use a mnemonic. That ratio
301
00:29:11,290 --> 00:29:18,930
means that you can have a true zero. But
how does this work out literally? Well, I'll
302
00:29:18,930 --> 00:29:25,690
show you. So let's go back to those examples
I showed you of quantitative variables, right?
303
00:29:25,690 --> 00:29:30,120
Because those are the only ones we have to
make this decision about whether they are
304
00:29:30,120 --> 00:29:37,010
interval or ratio. So these are these examples.
Now I'm going to remind you that ratio has
305
00:29:37,010 --> 00:29:42,059
a true zero. Remember that little mnemonic
I said, like don't divide by zero. And so
306
00:29:42,059 --> 00:29:47,110
you know, like in a ratio, so they have a
true zero. Well, let's think about it. It's
307
00:29:47,110 --> 00:29:53,031
not very pleasant to have a zero systolic
blood pressure because you'd be dead. Same
308
00:29:53,031 --> 00:29:58,980
with the platelet count, but it is possible,
right? But now when we go on to interval,
309
00:29:58,980 --> 00:30:05,799
we can't have, like, zero time, like time of
admission, you know, or year of diagnosis. There's
310
00:30:05,799 --> 00:30:13,230
no, like, year zero. So as you probably just
guessed, ratio is where it's at in healthcare.
311
00:30:13,230 --> 00:30:18,429
There's not a whole lot of times when we have
interval data, but we do, you know, anytime
312
00:30:18,429 --> 00:30:23,590
you have a time, so you got to keep that in
mind that if you want to split your quantitative
313
00:30:23,590 --> 00:30:29,650
variables into either interval or ratio, you
got to keep this in mind the difference between
314
00:30:29,650 --> 00:30:38,210
the true zero and the no true zero. Okay,
here's our handy dandy diagram. We've just
315
00:30:38,210 --> 00:30:44,170
gone through the tree classifying quantitative
data into interval versus ratio. Now let's
316
00:30:44,170 --> 00:30:48,780
go pay attention to the other side of the
tree, qualitative. So how do we split those?
317
00:30:48,780 --> 00:30:58,080
Um, well, we can split those into nominal
versus ordinal. All right. So nominal applies
318
00:30:58,080 --> 00:31:04,630
to categories, labels, or names that cannot
be ordered from smallest to largest. Okay,
319
00:31:04,630 --> 00:31:08,850
like I kind of think of when they have an
advertisement, they say, for a nominal fee,
320
00:31:08,850 --> 00:31:14,290
you can do this, it means it's small,
like, there's almost no difference. And so
321
00:31:14,290 --> 00:31:18,559
that's why I say, there's no difference, it's
not smallest to largest, it means they must
322
00:31:18,559 --> 00:31:24,710
be equal. That's how I remember it in my mind.
But then ordinal applies to data that can
323
00:31:24,710 --> 00:31:29,000
be arranged in order in categories. But remember
that thing I was saying about quantitative,
324
00:31:29,000 --> 00:31:33,929
it's not quantitative, right? Because the
difference between the data values either
325
00:31:33,929 --> 00:31:39,440
cannot be determined or is meaningless, like
I was talking about with cancer, especially,
326
00:31:39,440 --> 00:31:43,750
you know, if you go from stage three to stage
four, that's materially different than stage
327
00:31:43,750 --> 00:31:48,690
one to stage two. So you really can't determine
those things. So this is where we're gonna
328
00:31:48,690 --> 00:31:54,320
get into that it's ordinal. It's arranged
in categories that can be ordered from smallest
329
00:31:54,320 --> 00:32:01,620
to largest. So remember, our old friends that
I threw up there before of these examples
330
00:32:01,620 --> 00:32:07,710
of qualitative variables in healthcare? Well,
let's just reflect on this. Nominal cannot
331
00:32:07,710 --> 00:32:12,950
be ordered, right. So that would be more like
type of health insurance and country of origin
332
00:32:12,950 --> 00:32:17,919
because they could all be equal. Whereas ordinal
is going to have a natural order, even though
333
00:32:17,919 --> 00:32:24,110
the differences between the levels are meaningless,
which is what makes it so different from a
334
00:32:24,110 --> 00:32:29,330
quantitative variable, which is why it
stays on the qualitative side of the tree,
335
00:32:29,330 --> 00:32:34,450
it just gets labeled ordinal. So what you
want to do is if you think you have a qualitative
336
00:32:34,450 --> 00:32:39,830
variable on your hands, look for a natural
order. If there is one, it's ordinal. And
337
00:32:39,830 --> 00:32:48,490
if not, it's nominal. So all data can be classified
as quantitative or qualitative. So if you
338
00:32:48,490 --> 00:32:53,640
have a variable, that's the first split you
can make as the difference between quantitative
339
00:32:53,640 --> 00:32:58,750
and qualitative, but once you do that, you
can further classify it as interval, ratio,
340
00:32:58,750 --> 00:33:03,890
nominal, or ordinal. And it's really important
to know how to classify data in healthcare,
341
00:33:03,890 --> 00:33:09,200
as you'll find out later. Because depending
on how you classify it, you might be able
342
00:33:09,200 --> 00:33:15,840
to do different things with it in statistics.
All right, so what we went over was the definition
343
00:33:15,840 --> 00:33:20,720
of statistics. And we talked a little about
why you use it and how you use it, especially
344
00:33:20,720 --> 00:33:25,800
in healthcare. We went over what it means
to talk about a population parameter and the
345
00:33:25,800 --> 00:33:31,240
sample statistic, and we went over some examples
about them. And then we talked about classifying
346
00:33:31,240 --> 00:33:38,190
variables into the different levels of measurement,
and even talked about a few examples there.
347
00:33:38,190 --> 00:33:46,840
So I hope you enjoyed my lecture. Greetings,
this is Monica Wahi, lecturer at Labouré College,
348
00:33:46,840 --> 00:33:55,780
bringing you your lecture on section 1.2 on
the topic of sampling.
349
00:33:55,780 --> 00:33:56,780
So here
350
00:33:56,780 --> 00:34:01,871
are your learning objectives for this particular
lecture. At the end of this lecture, the students
351
00:34:01,871 --> 00:34:08,030
should be able to define sampling frame and
sampling error, the student should be also
352
00:34:08,030 --> 00:34:13,389
able to give one example of how to do simple
random sampling. And one example of how to
353
00:34:13,389 --> 00:34:19,599
do systematic sampling. The students should
be able to explain one reason to choose stratified
354
00:34:19,599 --> 00:34:26,270
sampling over other approaches, state two differences
between cluster sampling and convenience sampling,
355
00:34:26,270 --> 00:34:33,029
and give an example of a national survey that
uses multistage sampling. So let's jump right
356
00:34:33,029 --> 00:34:39,268
into it here. So we're going to go over in
this lecture, sampling definitions, and then
357
00:34:39,268 --> 00:34:43,588
those different types of sampling I mentioned
in the learning objectives, simple random
358
00:34:43,589 --> 00:34:49,969
sampling, stratified sampling, systematic
sampling, and then convenience and multistage
359
00:34:49,969 --> 00:34:59,710
sampling. So let's start with some sampling definitions.
What is a sample? Okay, so we're going to revisit
360
00:34:59,710 --> 00:35:05,900
that concept from the previous lecture, we're
also going to talk about sampling frames,
361
00:35:05,900 --> 00:35:11,210
and what errors mean and errors of sampling
frames. And then we're also going to just
362
00:35:11,210 --> 00:35:15,869
go right back over that and make sure you
understand before we go on, and talk about
363
00:35:15,869 --> 00:35:22,499
the different types of sampling. So we take
a sample of a population, because we want
364
00:35:22,499 --> 00:35:28,390
to do inferential statistics, remember that
we want to infer from the sample to the population.
365
00:35:28,390 --> 00:35:34,109
And it's just not necessary to measure the
whole population, it would be impractical.
366
00:35:34,109 --> 00:35:40,789
And it would cost a lot. And actually, what you'll
find is, if you ever do an experiment
367
00:35:40,789 --> 00:35:46,049
where you actually do measure the whole population,
you'll find that if you get, you know, a pretty
368
00:35:46,049 --> 00:35:51,509
good proportion of the population, and you
just take that, that's all you really
369
00:35:51,509 --> 00:35:58,470
needed to talk to. So ultimately, we save
resources, especially in health care, when
370
00:35:58,470 --> 00:36:05,249
we do a good job of sampling, and use that
to infer to the population rather than having
371
00:36:05,249 --> 00:36:12,019
to take a census of the whole population all
the time. So that brings us to the concept
372
00:36:12,019 --> 00:36:18,130
of sampling frame. So the sampling frame is
the list of individuals from which a sample
373
00:36:18,130 --> 00:36:23,170
is actually selected. And the list may be
this physical concrete list, like you could
374
00:36:23,170 --> 00:36:29,260
have a list of students enrolled at a nursing
college, or in my other lecture, I gave an
375
00:36:29,260 --> 00:36:35,420
example of a list of nurses who work at Massachusetts
General Hospital, that could be your list,
376
00:36:35,420 --> 00:36:40,780
you'd go to human resources and get that.
Or it could be a theoretical list. It could
377
00:36:40,780 --> 00:36:46,079
be like the list of patients who present to
the emergency department today. Obviously,
378
00:36:46,079 --> 00:36:51,029
when you go into work, at the beginning of
the shift, you're not going to know who's
379
00:36:51,029 --> 00:36:56,999
on that list yet. But it could be a theoretical
list. But whatever that list is, that is your
380
00:36:56,999 --> 00:37:05,769
sampling frame. So those are the people
who actually could be selected for your study.
381
00:37:05,769 --> 00:37:12,109
So the sampling frame is the part of the population
from which you want to draw the sample. And
382
00:37:12,109 --> 00:37:17,960
you want to work it such that everybody from
your sampling frame has a chance of being
383
00:37:17,960 --> 00:37:22,890
selected for your sample. In other words,
you don't want to leave anyone that should
384
00:37:22,890 --> 00:37:31,059
be in your sampling frame out in the cold.
That leads us to the concept of undercoverage.
385
00:37:31,059 --> 00:37:35,670
So what is it? It's omitting population members
from the sampling frame. They're supposed
386
00:37:35,670 --> 00:37:41,130
to be on the list, but they're not there.
So how can this happen? Well, let's say you
387
00:37:41,130 --> 00:37:45,309
did what I was suggesting in the previous
slide, you got a list of nursing students,
388
00:37:45,309 --> 00:37:50,650
you know, from a college, let's say somebody
signed up that day, or somebody was just admitted
389
00:37:50,650 --> 00:37:54,830
that day, maybe they didn't make it into the
database in time and you're missing them.
390
00:37:54,830 --> 00:37:59,920
Or even like that HR list I talked about
at MGH. Well, you know, I know how nurses
391
00:37:59,920 --> 00:38:04,119
are, sometimes they'll temp in different places,
and maybe they're not on the payroll, maybe
392
00:38:04,119 --> 00:38:09,160
they're through a temp agency. And so then
we would miss those nurses from the sampling
393
00:38:09,160 --> 00:38:15,099
frame. And then, you know, people who present
at the emergency department at night might
394
00:38:15,099 --> 00:38:19,470
be different than those in the day. And so
if you're really trying to sample from people
395
00:38:19,470 --> 00:38:24,330
who present to the emergency department, you
can't just look at like some small period
396
00:38:24,330 --> 00:38:31,970
of time, you'd have to look at, you know,
the whole 24 hour cycle. So if you omit population
397
00:38:31,970 --> 00:38:36,170
members from your sampling frame, they don't
even get a chance to be in it. And that's
398
00:38:36,170 --> 00:38:43,800
called undercoverage. Now, I'm going to shift
around, we're jumping around with a few different
399
00:38:43,800 --> 00:38:44,800
definitions.
400
00:38:44,800 --> 00:38:49,470
And we're going to talk about errors. Now,
this is something that took me a while to
401
00:38:49,470 --> 00:38:54,519
get used to in statistics, there's actually
two kinds of errors in statistics. The first
402
00:38:54,519 --> 00:39:02,030
kind I call, and this is my own terminology,
a fact of life error. It's just an error that
403
00:39:02,030 --> 00:39:08,160
happens when you do statistics. It's not
bad or good. It's just what happens. And in
404
00:39:08,160 --> 00:39:13,349
this case, I'm going to describe one of those.
It's called a sampling error. So the sampling
405
00:39:13,349 --> 00:39:18,900
error just simply says the population mean
will be different from your sample mean, and
406
00:39:18,900 --> 00:39:22,859
the population percentage will be different
from your sample percentage. So what does
407
00:39:22,859 --> 00:39:28,299
that mean? That means that if I cut corners,
like I said I could, right, and just take a
408
00:39:28,299 --> 00:39:33,789
sample to infer to the population. If I actually
do one of those experiments I was telling
409
00:39:33,789 --> 00:39:38,940
you about where I have the population data
and I just take a sample and compare the means
410
00:39:38,940 --> 00:39:43,480
they will be different. Okay, I mean, there
might be this huge coincidence where they're
411
00:39:43,480 --> 00:39:49,359
the same but they're typically different.
Same if you do percentages, and we just
412
00:39:49,359 --> 00:39:53,180
know this is going to happen. In statistics,
we account for it, we have ways of dealing
413
00:39:53,180 --> 00:39:58,479
with it. But we know that there's always going
to be sampling error whenever you take a sample
414
00:39:58,479 --> 00:40:02,770
from a population to try to make a mean or
percentage in the sample, it's just not going
415
00:40:02,770 --> 00:40:06,509
to be exactly what's in the population. That's fine.
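If you want to try that experiment yourself, here is a rough sketch in Python. The population is made-up data (1,000 pretend systolic blood pressures), and the function name sampling_error_demo and its parameters are placeholders for illustration, not anything from the lecture:
```python
import random
import statistics
#
def sampling_error_demo(seed=42, population_size=1000, sample_size=50):
    """Compare a sample mean with the mean of the population it was drawn from."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    # Hypothetical population: pretend we did a census and have everyone's value.
    population = [rng.gauss(120, 15) for _ in range(population_size)]
    # Take a random sample from that population, without replacement.
    sample = rng.sample(population, sample_size)
    population_mean = statistics.mean(population)  # population parameter
    sample_mean = statistics.mean(sample)          # sample statistic
    # Sampling error: how far the sample statistic lands from the parameter.
    return population_mean, sample_mean, sample_mean - population_mean
#
pop_mean, samp_mean, error = sampling_error_demo()
print(f"population mean: {pop_mean:.2f}")
print(f"sample mean:     {samp_mean:.2f}")
print(f"sampling error:  {error:+.2f}")
```
Run it with a few different seeds and the sample mean typically lands near the population mean but not exactly on it, which is the fact of life error described above.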
416
00:40:06,509 --> 00:40:08,219
But then
417
00:40:08,219 --> 00:40:12,839
there are other errors in statistics, which
are actually bad. And it means you made
418
00:40:12,839 --> 00:40:19,529
a mistake. It's like mistakes, literally mistakes.
And so as you go through learning about statistics,
419
00:40:19,529 --> 00:40:23,000
it's almost like you have to sit down and
ask somebody, is this one of those fact of
420
00:40:23,000 --> 00:40:28,069
life errors? Or is this one of those errors
you want to avoid? Well, we just talked about
421
00:40:28,069 --> 00:40:33,869
sampling error. That's just a fact of life
error. But errors you want to avoid? Nonsampling
422
00:40:33,869 --> 00:40:42,200
error. That's basically using a bad list.
I had an example in my life where I wanted
423
00:40:42,200 --> 00:40:48,920
to study a whole bunch of providers, right.
And my friend gave me this list of providers,
424
00:40:48,920 --> 00:40:54,989
and said, this is the entire list of all
these providers in this particular professional
425
00:40:54,989 --> 00:41:01,640
society. But when I sent the email to that
list, I found there were not only duplicates
426
00:41:01,640 --> 00:41:05,729
on this list, but a lot of people emailed
me back and said, Why are you sending this
427
00:41:05,729 --> 00:41:14,529
to me? I'm not a provider. I'm not part of
this professional society. And also, some
428
00:41:14,529 --> 00:41:19,700
people who were in that professional society,
who had heard about the survey emailed me
429
00:41:19,700 --> 00:41:24,390
and said, Why didn't I get the survey. So
this was a bad list. Some people had been
430
00:41:24,390 --> 00:41:32,650
left out of the sampling frame. So people
who were in the society somehow weren't on
431
00:41:32,650 --> 00:41:37,470
my email list. And that's a problem, right?
So you have to pay careful attention. This
432
00:41:37,470 --> 00:41:42,430
was actually a mistake I made, you have to
pay careful attention that everyone in the
433
00:41:42,430 --> 00:41:46,960
population who was supposed to be represented
in your sampling frame is actually there.
434
00:41:46,960 --> 00:41:51,480
So I should have really done a better job
of calling the professional society and making
435
00:41:51,480 --> 00:41:59,719
sure that this list was a good list. So sampling
error is caused by the fact that regardless
436
00:41:59,719 --> 00:42:06,130
of what you do, your sample will not perfectly
represent the population. Whereas nonsampling
437
00:42:06,130 --> 00:42:11,880
error, yeah, I was sloppy. It was
poor sample design, sloppy data collection,
438
00:42:11,880 --> 00:42:16,680
inaccurate measurement instruments. You
can have bias in data collection, other problems
439
00:42:16,680 --> 00:42:22,329
introduced by the researcher. So this is your
fault if there's nonsampling error, but sampling
440
00:42:22,329 --> 00:42:23,880
error is just a
441
00:42:23,880 --> 00:42:27,809
fact of life.
442
00:42:27,809 --> 00:42:33,539
Little whiplash here, we're gonna now move
on to the concept of simulations. So a simulation
443
00:42:33,539 --> 00:42:42,219
is defined technically as a numerical facsimile,
or representation of a real world phenomenon.
444
00:42:42,219 --> 00:42:48,529
So it's like working through a pretend situation,
to see how it would come out in the case that
445
00:42:48,529 --> 00:42:57,900
it was real. And this, you know, when you study
statistics, you end up doing a lot of simulations.
446
00:42:57,900 --> 00:43:03,859
And remember how I've been talking about an
experiment you could do if you somehow did
447
00:43:03,859 --> 00:43:08,569
a census and had a whole bunch of data on
a population, you could do an experiment where
448
00:43:08,569 --> 00:43:13,779
you just took a sample from that population
and looked at their mean to see the sampling
449
00:43:13,779 --> 00:43:23,740
error. That's an example of a simulation.
So to just conclude this little section, it's
450
00:43:23,740 --> 00:43:30,430
really important to do your best to avoid
nonsampling error. And this is achieved by
451
00:43:30,430 --> 00:43:35,219
making sure you do not have undercoverage
when sampling from your sampling frame. So
452
00:43:35,219 --> 00:43:40,719
this puts together some of our vocabulary.
But just remember, sampling error is a fact
453
00:43:40,719 --> 00:43:47,469
of life. Okay, now we're going to specifically
talk about different types of sampling. And
454
00:43:47,469 --> 00:43:55,769
we're going to start with simple random sampling.
Okay, so first, we're gonna start with just
455
00:43:55,769 --> 00:44:00,960
explaining what is meant by simple random
sampling, then we're going to talk about two
456
00:44:00,960 --> 00:44:06,640
different methods of doing simple random sampling,
they work the same way they achieve the same
457
00:44:06,640 --> 00:44:11,059
thing. It's just that depending on how you're
doing your research, one might be more convenient
458
00:44:11,059 --> 00:44:16,819
for you than the other. Finally, we will go
over the limits of simple random sampling,
459
00:44:16,819 --> 00:44:24,519
because all these sampling methods seem perfect.
But then you got to take a look at their limitations.
460
00:44:24,519 --> 00:44:31,890
So let's first define simple random sampling.
So here's a definition. A simple random sample
461
00:44:31,890 --> 00:44:39,159
of n measurements from a population is a subset
of the population selected in such a manner
462
00:44:39,159 --> 00:44:46,269
that every sample of size n from the population
has an equal chance of being selected. Well,
463
00:44:46,269 --> 00:44:52,359
it's kind of complicated, but what it means
is that if you use the proper approach
464
00:44:52,359 --> 00:44:58,450
for simple random sampling, whatever sample
you get, you could have had just as easily
465
00:44:58,450 --> 00:45:06,369
a chance of getting another batch, another
group of people from that population. In other
466
00:45:06,369 --> 00:45:10,750
words, like, let's say you have a list of
the population of students in the class. So
467
00:45:10,750 --> 00:45:16,390
I'm going to define a class as a population.
And you want to take a sample of five students
468
00:45:16,390 --> 00:45:21,190
from this bigger class. If you take a simple
random sample, it means that all the different
469
00:45:21,190 --> 00:45:26,450
groups of five students you could pick from
the list have an equal chance of being the
470
00:45:26,450 --> 00:45:33,200
sample group you actually pick. Now, you can
just imagine that if you race into the class
471
00:45:33,200 --> 00:45:37,470
right at the beginning, and you take your
sample of five and not everybody's in the
472
00:45:37,470 --> 00:45:43,810
class, what does that sound like? Right, a
sampling frame problem, maybe an undercoverage
473
00:45:43,810 --> 00:45:49,480
problem, maybe bias is creeping in there, right?
And so you just got to be careful, if you're
474
00:45:49,480 --> 00:45:55,230
going to do simple random sampling, that you
start with a list with everybody in your sampling
475
00:45:55,230 --> 00:46:01,450
frame, because every single sample that you
could possibly take should have an equal chance
476
00:46:01,450 --> 00:46:10,240
of ending up being your sample. And I'll kind
of explain it by explaining the two different
477
00:46:10,240 --> 00:46:14,359
methods that can be used for obtaining that
478
00:46:14,359 --> 00:46:15,359
sample.
479
00:46:15,359 --> 00:46:22,140
So one of the best things that you can do
is just start with a really good list of all
480
00:46:22,140 --> 00:46:27,529
the people in your population. So maybe, you
know, if I was going to study, I used to work
481
00:46:27,529 --> 00:46:32,489
at the army. So let's say I was going to study
all the people who are active duty in the
482
00:46:32,489 --> 00:46:39,259
US Army, I would like to get a list of all
of those people from an accurate place at
483
00:46:39,259 --> 00:46:48,650
the army. And I would like to have them have
a unique ID. Okay. And that's true in the
484
00:46:48,650 --> 00:46:54,569
army, everybody in the army has a unique numerical
ID. So what I would do, like in here, if you
485
00:46:54,569 --> 00:46:59,650
were looking at students, you'd maybe
take a student ID. So then you take the IDs
486
00:46:59,650 --> 00:47:06,339
from everybody on the list, and you cut them
up, like you print them out, and you cut them
487
00:47:06,339 --> 00:47:11,599
up, and you put them in a hat, right, or a
bag where you can't see in it. And you mix
488
00:47:11,599 --> 00:47:17,109
them all up where you can't see it. And you
draw five of them out, or like in the picture,
489
00:47:17,109 --> 00:47:21,731
you know, what they did was mix up all those
papers, and now they're not looking. And they're
490
00:47:21,731 --> 00:47:27,519
drawing a few out. Okay, so what did you just
do, you just made sure, first of all, that
491
00:47:27,519 --> 00:47:31,799
everybody in the population had an ID number.
And that when you printed it out and cut it
492
00:47:31,799 --> 00:47:35,549
up, you didn't lose any of them. If you
drop them on the floor, or something, that's
493
00:47:35,549 --> 00:47:39,329
not a simple random sample. You got to make
sure you keep all of them, and that you put
494
00:47:39,329 --> 00:47:44,829
them all in the hat, and that you didn't look
and you draw five or whatever, because then
495
00:47:44,829 --> 00:47:49,200
any five of those slips of paper could have
been drawn, and therefore you're meeting the definition of
496
00:47:49,200 --> 00:47:57,880
simple random sampling. Okay, that method
will work, right? Another method that works,
497
00:47:57,880 --> 00:48:02,920
that might work better if you can't do this
ID thing where you cut a paper is where you
498
00:48:02,920 --> 00:48:10,249
simply just make your own list of unique random
numbers, right, you just make your own list.
499
00:48:10,249 --> 00:48:16,110
And then you assign those to the population.
A great example is if you're, you know, kind
500
00:48:16,110 --> 00:48:20,259
of teaching kids and you want to put them
in a random order, maybe you're gonna do a
501
00:48:20,259 --> 00:48:26,710
game or something. Well, all you do is
you get, like, let's say you have 10 kids,
502
00:48:26,710 --> 00:48:31,779
you number one to 10, you put them in the hat,
and then you pull out the first number, let's
503
00:48:31,779 --> 00:48:35,950
say it's five, you give it to the first kid,
right? And then you just keep pulling out
504
00:48:35,950 --> 00:48:40,359
numbers and giving them to the kids and then
tell them to stand in order, right? So you
505
00:48:40,359 --> 00:48:44,410
generate a list of random numbers as long
as the list of the population. So I said,
506
00:48:44,410 --> 00:48:50,069
What if you have 10 kids? Well, if you have,
you know, 500 names, then you get 500 numbers,
507
00:48:50,069 --> 00:48:54,239
and they don't have to be one through 500.
They just have to be unique. Okay, I like
508
00:48:54,239 --> 00:48:59,309
smaller numbers. So I'd say keep them small,
but you can do what you want. And then, in
509
00:48:59,309 --> 00:49:05,099
any case, you randomly assign these numbers
(you can use the hat, I'm big on hats) to this
510
00:49:05,099 --> 00:49:11,329
population. And then, you know, you ask them
to stand in order, or somehow you figure out
511
00:49:11,329 --> 00:49:15,499
it's kind of like a raffle you call out who's
got number one, you know, and whoever says
512
00:49:15,499 --> 00:49:19,729
yes, you're like, you're lucky you get to
be in my study, you know, so you can take
513
00:49:19,729 --> 00:49:26,150
the first five numbers in the order, right.
And that's, that'll achieve the same thing
514
00:49:26,150 --> 00:49:30,440
as the last method, you'll get a simple random
sample, it's just two different ways of doing
515
00:49:30,440 --> 00:49:37,759
it. So ultimately, being in a simple random
sample means that the sample has an equal
516
00:49:37,759 --> 00:49:42,719
chance of being selected out of the
hat, that this group of people or a group of
517
00:49:42,719 --> 00:49:48,609
whatever has an equal chance of being selected.
And you'll see this picture on the left here
518
00:49:48,609 --> 00:49:54,140
is bingo, as some of you may play bingo. You
know, they pull balls out of there and they
519
00:49:54,140 --> 00:49:59,119
call off the names of the balls. Well, each
ball has, actually, a letter and a
520
00:49:59,119 --> 00:50:04,329
number unique on there. And that's how they
make them random. They take a simple
521
00:50:04,329 --> 00:50:11,440
random sample of these bingo balls each time
that they do a bingo game. So I described
522
00:50:11,440 --> 00:50:16,690
to you the first method of doing that using
an old fashioned hat. The second method, you
523
00:50:16,690 --> 00:50:20,349
know, where you generate your own numbers,
and you just make sure they're unique. And
524
00:50:20,349 --> 00:50:25,369
then you assign them to things and put them
in order. Well, that's my electronic hat.
525
00:50:25,369 --> 00:50:30,950
That's how I handle it. If I have, for example,
somebody sends me an Excel sheet with a list
526
00:50:30,950 --> 00:50:36,209
of hospitals on it. I'll just assign each
hospital a random number and sort them in order.
527
00:50:36,209 --> 00:50:41,690
And I'll sample the top few hospitals. That'll
be how I get a simple random sample of hospitals.
528
00:50:41,690 --> 00:50:46,829
That way, I'm not biased, picking out my favorite
hospitals where all my friends work, right?
529
00:50:46,829 --> 00:50:51,470
If I do it that way, the first method or the
second method, all members of the population
530
00:50:51,470 --> 00:50:56,390
have an equal probability of being selected
in the sample. And more importantly, all possible
531
00:50:56,390 --> 00:51:00,700
samples, all possible groups had an equal
chance of being selected. Of course, I only
532
00:51:00,700 --> 00:51:04,489
did it once. So I only got one of them. But
the other ones that weren't selected had an
533
00:51:04,489 --> 00:51:06,729
equal chance of being selected.
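That electronic hat can be sketched in a few lines of Python. This is only an illustration under assumptions: the hospital names are made up, and the helper name simple_random_sample and its seed parameter are hypothetical, not the actual spreadsheet workflow described here:
```python
import random
#
def simple_random_sample(frame, n, seed=None):
    """Assign each item a random number, sort by it, and keep the top n."""
    rng = random.Random(seed)
    # Give every item in the sampling frame its own random number.
    # (Exact ties between random floats are vanishingly unlikely.)
    tagged = [(rng.random(), item) for item in frame]
    # Sort by the random number only, then take the first n items.
    tagged.sort(key=lambda pair: pair[0])
    return [item for _, item in tagged[:n]]
#
# Hypothetical sampling frame: a made-up list of ten hospitals.
hospitals = [f"Hospital {letter}" for letter in "ABCDEFGHIJ"]
chosen = simple_random_sample(hospitals, 3, seed=1)
print(chosen)
```
Python's built-in random.sample(hospitals, 3) draws a sample without replacement in a single call; the longer version above just mirrors the assign-a-number-and-sort steps.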
534
00:51:06,729 --> 00:51:15,180
All right, you probably saw the limits: it's
this whole list. Even if I'm sampling hospitals,
535
00:51:15,180 --> 00:51:21,650
right? I still need a list of hospitals to
sample from. So you may not know who's gonna
536
00:51:21,650 --> 00:51:26,769
show up in the emergency department that day.
If you do, well, you're psychic, because most
537
00:51:26,769 --> 00:51:31,450
people are not. So how would you sample from
them using simple random sampling? So simple
538
00:51:31,450 --> 00:51:36,009
random sampling is okay when you've got a list
like hospitals, but it's not so good when
539
00:51:36,009 --> 00:51:41,979
you don't know who's going to show up that
day. And even if you do simple random sampling,
540
00:51:41,979 --> 00:51:47,940
you need a good list. I made a mistake once,
where I did a survey with a bunch of professionals
541
00:51:47,940 --> 00:51:54,809
using a professional society list. And when
I sent out the survey, I learned that there
542
00:51:54,809 --> 00:51:59,089
were people on the list who were no longer
part of the society; it was an old list.
543
00:51:59,089 --> 00:52:03,009
And more importantly, there were people who
had joined the society that had not made it
544
00:52:03,009 --> 00:52:09,619
onto that list. So I was getting undercoverage.
So like, if you were doing a study with students,
545
00:52:09,619 --> 00:52:13,849
you know, what if they just left off the part
time students, then you'd be missing them.
546
00:52:13,849 --> 00:52:18,079
So this is a great example of nonsampling
error. And so if you're going to do simple
547
00:52:18,079 --> 00:52:21,719
random sampling, you do need a list and you
really want to research it and make sure it's
548
00:52:21,719 --> 00:52:30,150
the best list possible. So I just went over
the characteristics of simple random sampling,
549
00:52:30,150 --> 00:52:35,890
and two different methods you can use
to sample from a list. And I also mentioned
550
00:52:35,890 --> 00:52:44,940
the limits of it. Now we'll talk about a different
kind of sampling, stratified sampling. So
551
00:52:44,940 --> 00:52:50,420
we're gonna go over what it is. And then,
just like simple random sampling had all
552
00:52:50,420 --> 00:52:54,500
these steps to it, there are different steps
in stratified sampling. And I'll give you
553
00:52:54,500 --> 00:53:00,219
some examples. And then of course, just like
simple random sampling, stratified sampling
554
00:53:00,219 --> 00:53:07,469
has limitations, and I'll talk about those.
So I first wanted to just remind you what
555
00:53:07,469 --> 00:53:14,119
the word stratified means, or what strata
are. The singular is stratum, and more
556
00:53:14,119 --> 00:53:19,529
than one is strata. Now you see that rock on
the slide, you see that big, horizontal line
557
00:53:19,529 --> 00:53:25,910
across it? That's a stratum. Those
are strata, right? Those are strata of rock.
558
00:53:25,910 --> 00:53:31,680
If you study geology, the geologists
will explain that where those breaks are,
559
00:53:31,680 --> 00:53:36,440
it means something happened often in the weather
or the environment or whatever. But the reason
560
00:53:36,440 --> 00:53:43,650
why I put this picture up there is I want
you to sort of imagine those layers. Because
561
00:53:43,650 --> 00:53:49,880
that's what we do in stratified sampling is
first, we divide our list, of course, you
562
00:53:49,880 --> 00:53:55,339
know, a list, we divide our list into layers.
Okay, so remember how I was just talking about
563
00:53:55,339 --> 00:53:59,519
simple random sampling? Like, what if I sample
from hospitals? Well, I could take this hospital
564
00:53:59,519 --> 00:54:08,369
list and divide it into layers by, for example,
how close they are to the city, I could say,
565
00:54:08,369 --> 00:54:16,049
urban, suburban, and rural, I could first
put them into those strata. Okay. And if I
566
00:54:16,049 --> 00:54:20,069
was doing that, I'd be doing stratified sampling.
Same with students, like I could put them
567
00:54:20,069 --> 00:54:25,489
in, you know, first year nursing students,
second year students, you know, and I'd have
568
00:54:25,489 --> 00:54:31,319
them divided into strata first. Um, so
why would you do that? Why
569
00:54:31,319 --> 00:54:36,369
not just do simple random sampling? Well,
if you think about it, let's say that you've
570
00:54:36,369 --> 00:54:41,319
got a class like statistics, maybe a lot of
you know, there are not that many first-year
571
00:54:41,319 --> 00:54:47,150
students in it. So let's say a very small
proportion is that way. If you do simple random
572
00:54:47,150 --> 00:54:52,690
sampling, you might just by luck miss all
of them. Right. And so, if you're really concerned
573
00:54:52,690 --> 00:54:59,569
about what a minority thinks, then you can
make sure to get representation from that
574
00:54:59,569 --> 00:55:04,769
stratum by doing stratified sampling, because
the first thing you do is you put that
575
00:55:04,769 --> 00:55:13,809
list into groups. And then you take a simple
random sample from each of the strata. So
576
00:55:13,809 --> 00:55:18,759
here's the steps. So step one, divide the
entire population, the whole list you have
577
00:55:18,759 --> 00:55:23,920
into distinct subgroups called strata. And
remember, each individual has to fit into
578
00:55:23,920 --> 00:55:28,249
one of those categories. So if you have somebody
who's sort of halfway between first
579
00:55:28,249 --> 00:55:32,670
year and second year, or you've got a hospital
that's kind of on the border, you've got to
580
00:55:32,670 --> 00:55:37,609
choose, you got to put it in one of those
groups. Step two, um, well, it's not really
581
00:55:37,609 --> 00:55:41,750
step two, but you've got to think about the
strata like what is it based on, it's got
582
00:55:41,750 --> 00:55:46,670
to be based on one specific characteristic,
such as age, income, education level. You know,
583
00:55:46,670 --> 00:55:51,740
a great example is you could take people of
all different incomes, right, that's a quantitative
584
00:55:51,740 --> 00:55:56,829
variable, but you can put them in strata by
you know, less than a certain amount. And
585
00:55:56,829 --> 00:55:59,549
then this amount to that, that to that; you can make,
586
00:55:59,549 --> 00:56:04,970
you know, four or five strata. And then, um,
you know, you just want to make sure that
587
00:56:04,970 --> 00:56:10,450
all members of the stratum, each stratum,
share the same characteristic. And then you
588
00:56:10,450 --> 00:56:15,549
could do step four, which is draw a simple
random sample from each stratum. So like,
589
00:56:15,549 --> 00:56:20,769
in the case where I was describing, like,
maybe you have a class with very few first
590
00:56:20,769 --> 00:56:27,699
year students, if you take a random sample
of five from each stratum,
591
00:56:27,699 --> 00:56:34,000
then you might be, you know, you're kind of
getting almost like, extra votes from a small
592
00:56:34,000 --> 00:56:38,849
minority, right? Like, you're kind of treating
them fairly, even though there's a way bigger
593
00:56:38,849 --> 00:56:46,549
group of the other people you're taking exactly
five from. But that's
594
00:56:46,549 --> 00:56:52,339
the risk you take, because you want to make
sure you hear from that small group. Because
595
00:56:52,339 --> 00:56:56,729
if you just do simple random sampling with
groups so small, you might just accidentally
596
00:56:56,729 --> 00:57:03,680
miss it. So here are some examples of stratified
sampling. And you'll see this in the Youth
597
00:57:03,680 --> 00:57:09,289
Behavioral Risk Factor Surveillance surveys
that they do in high schools, that they'll
598
00:57:09,289 --> 00:57:14,690
stratify by grade, right, because if they
did a simple random sample, you know, a lot
599
00:57:14,690 --> 00:57:19,229
of students drop out in junior and senior
year, they'd probably get too many freshmen
600
00:57:19,229 --> 00:57:25,400
and sophomores. And so they're gonna want
to look at getting a certain amount of freshman
601
00:57:25,400 --> 00:57:28,589
classes, certain amount of sophomore classes,
certain amount of junior classes, certain
602
00:57:28,589 --> 00:57:35,369
amount of senior classes, so they can have enough
of each to make good estimates, right. And
603
00:57:35,369 --> 00:57:42,279
in hospitals, they often sample providers
from each department, right? Like, they don't
604
00:57:42,279 --> 00:57:48,089
just do a simple random sample of providers,
if they're asking about like provider satisfaction,
605
00:57:48,089 --> 00:57:52,849
or if you know about a policy, they won't
just do that, because they might, for example,
606
00:57:52,849 --> 00:57:59,339
miss everybody in the ICU. Or if you're studying,
you know, ICUs, and you have multiple ICUs
607
00:57:59,339 --> 00:58:00,339
there,
608
00:58:00,339 --> 00:58:01,339
then
609
00:58:01,339 --> 00:58:05,420
you would want to maybe stratify by ICU, just
to make sure even if one of them's smaller,
610
00:58:05,420 --> 00:58:06,869
just to make sure you have
611
00:58:06,869 --> 00:58:07,869
a good,
612
00:58:07,869 --> 00:58:14,869
good solid representation from each ICU. So
those are the reasons that push you to do
613
00:58:14,869 --> 00:58:19,319
stratified sampling. It's not always necessary.
But when you have these situations where you
614
00:58:19,319 --> 00:58:23,380
have these distinct groups, especially with a
little one involved, and you want to hear
615
00:58:23,380 --> 00:58:30,660
from everybody, you really want to consider
the stratified sampling. So of course, there's
616
00:58:30,660 --> 00:58:35,289
limitations. And I've been sort of leading
up to this. What you end up doing is
617
00:58:35,289 --> 00:58:42,109
oversampling one of the groups, usually, you know,
like the smallest group, if you make the same
618
00:58:42,109 --> 00:58:48,969
amount of people you take from that stratum,
the same amount as you take from the big stratum.
619
00:58:48,969 --> 00:58:53,009
It's like the smallest group is having all
these powerful votes and the biggest group
620
00:58:53,009 --> 00:58:58,180
is weaker. You know, they're both equal
when they're not technically equal in the
621
00:58:58,180 --> 00:59:03,690
population. But that's the way it goes, right?
And in higher-level statistics, there's
622
00:59:03,690 --> 00:59:08,930
ways to adjust back for that, to just sort
of say, take a penalty for that and go back
623
00:59:08,930 --> 00:59:14,410
and say, well, you know,
we can extrapolate this back to the population
624
00:59:14,410 --> 00:59:20,900
proportions. It's possible, but it takes
some post-processing; that's just the issue. And
625
00:59:20,900 --> 00:59:27,390
it's also, like simple random sampling, not
really possible to do without a list beforehand.
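Here is a rough Python sketch of the stratified steps and of the "adjust back" idea. Everything in it is an assumption for illustration: the class of 6 first-year and 54 second-year students is invented, the function name `stratified_sample` is hypothetical, and the weight shown (stratum size divided by number sampled) is one common way to adjust, not necessarily the method the lecture has in mind.

```python
import random
from collections import defaultdict

def stratified_sample(population, stratum_of, n_per_stratum, seed=None):
    """Split the list into strata, then take a simple random sample from each stratum."""
    rng = random.Random(seed)
    strata = defaultdict(list)
    for person in population:
        strata[stratum_of(person)].append(person)  # each individual lands in exactly one stratum
    picked = {}
    for name, members in strata.items():
        k = min(n_per_stratum, len(members))  # a tiny stratum cannot give more than it has
        picked[name] = rng.sample(members, k)
    return dict(strata), picked

# Hypothetical class: few first-year students, many second-year students
students = [("first-year", i) for i in range(6)] + [("second-year", i) for i in range(54)]
strata, picked = stratified_sample(students, lambda s: s[0], 5, seed=1)

# One common adjustment (an assumption here): each sampled person "stands for"
# stratum size / number sampled people in the population.
weights = {name: len(strata[name]) / len(picked[name]) for name in picked}
print(weights)
```

With five taken from each stratum, a sampled first-year student stands for about one classmate while a sampled second-year student stands for about eleven, which is the "extra votes" issue in numbers.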
626
00:59:27,390 --> 00:59:33,020
And it's also hard to do, because you actually
have to split the list into groups into these
627
00:59:33,020 --> 00:59:37,150
strata. So let's say I had these hospitals
and I didn't know where they were, I didn't
628
00:59:37,150 --> 00:59:42,440
know exactly if they were urban or rural or
suburban. Well, that adds another level of
629
00:59:42,440 --> 00:59:50,059
complexity to this whole stratified sampling.
So, in summary, I just went over what stratified
630
00:59:50,059 --> 00:59:53,520
means, and it means you know, putting things
in groups and then taking from that, and I
631
00:59:53,520 --> 01:00:00,459
described the steps involved. And a stratified
sample goes a lot more easily
632
01:00:00,459 --> 01:00:04,749
if the strata happen to be equal to
begin with, you know, I gave the example of
633
01:00:04,749 --> 01:00:09,920
high schools, usually there's maybe slightly
fewer people in junior and senior year, but
634
01:00:09,920 --> 01:00:14,930
it's kind of close. And it's always nice.
Like if you're comparing ICUs, for example,
635
01:00:14,930 --> 01:00:18,029
if the ICUs are roughly the same size,
because then you don't have to worry about
636
01:00:18,029 --> 01:00:26,019
this whole, one of them is smaller, but it's
getting an equal vote. All right, now we are
637
01:00:26,019 --> 01:00:34,819
going to move on to talk about systematic
sampling. Okay, well, systematic sampling
638
01:00:34,819 --> 01:00:40,780
actually can be done with or without a list.
So it's a little more flexible than the kind
639
01:00:40,780 --> 01:00:47,680
of sampling we've been talking about. Systematic
sampling, it's easier for me to like, define
640
01:00:47,680 --> 01:00:52,309
it by describing the steps you go through
to do it. So I'm just gonna explain how to
641
01:00:52,309 --> 01:00:57,049
do it. And then you'll understand, in fact,
you'll understand why it's called systematic.
642
01:00:57,049 --> 01:01:03,489
So whether you have a list or not, what you
have to do for step one is arrange all the
643
01:01:03,489 --> 01:01:10,999
individuals of the population in a particular
order. Now, if it's a list, you just make
644
01:01:10,999 --> 01:01:16,699
it in whatever order you want to make it in.
But if we're talking about, for example, patients
645
01:01:16,699 --> 01:01:20,180
coming into the ER, well, they come in, in
the order that they want
646
01:01:20,180 --> 01:01:21,180
to.
647
01:01:21,180 --> 01:01:24,519
So they already are arranged in a list,
right? You just don't know what that list
648
01:01:24,519 --> 01:01:32,180
is. Okay, then step two is pick a random individual
as a start. So let's say I had a list of hospitals,
649
01:01:32,180 --> 01:01:39,650
and let's say it was just sorted by state,
right? Let's say I picked a random individual,
650
01:01:39,650 --> 01:01:44,930
maybe I went down, you know, seven on the
list, and I picked that hospital. Or maybe
651
01:01:44,930 --> 01:01:50,710
you could be at the ER, you start your shift.
And the seventh patient who is admitted to
652
01:01:50,710 --> 01:01:54,701
the ER, you pick that person. I just picked
seven, I mean, you could have picked five,
653
01:01:54,701 --> 01:01:59,999
you could have picked 20, you know, you just
pick a random person. Then the next step,
654
01:01:59,999 --> 01:02:05,789
step three is take every kth member of the
population into the sample. Now, don't try this
655
01:02:05,789 --> 01:02:11,880
in Scrabble; kth is not a word in Scrabble,
okay? It's just a word in statistics-ese,
656
01:02:11,880 --> 01:02:19,859
and what kth means, spelled k-t-h, is
every so many. So let's pick a number and
657
01:02:19,859 --> 01:02:26,539
fill it in for K. So let's pick the number
three. So let's say after you pick your first
658
01:02:26,539 --> 01:02:30,130
hospital from the list, or the first patient
from the ER, it doesn't matter what number
659
01:02:30,130 --> 01:02:36,660
you chose for that, then you take every third
after that. So every third patient that comes
660
01:02:36,660 --> 01:02:41,450
in after that, you ask them if they want to
be in a study, or every third hospital after
661
01:02:41,450 --> 01:02:46,249
that original random one, I pick and I say,
Okay, this is going to be part of my systematic
662
01:02:46,249 --> 01:02:51,049
sample. So as you can see, it's like pretty
simple to do, it's easy to do, if you have
663
01:02:51,049 --> 01:02:56,680
a list, it's easy to do if you don't have a list,
it's just the deal is you have to pick K,
664
01:02:56,680 --> 01:03:01,189
well, first you pick a random place to start,
then you pick K, and then you just keep going
665
01:03:01,189 --> 01:03:08,979
every so many. So you could do this with classes,
you could take out a list of classes available
666
01:03:08,979 --> 01:03:14,920
at your college next semester, you pick a
random number like three, you know, and it's
667
01:03:14,920 --> 01:03:18,900
sorted some way. So you go to the third class
and you circle that, then you pick another
668
01:03:18,900 --> 01:03:24,189
random number like five and then after that
you pick every fifth class. So after the third
669
01:03:24,189 --> 01:03:34,459
one, you go 4, 5, 6, 7, 8, and then 9, 10, 11, 12, 13. And
you keep picking classes. Okay, this is not
670
01:03:34,459 --> 01:03:41,410
career advice. Okay? Do not pick your classes
that way. This was just an example. Alright,
671
01:03:41,410 --> 01:03:45,819
so as you probably guessed, I'm going to be
Negative Nelly again: there are problems
672
01:03:45,819 --> 01:03:52,239
with systematic sampling. If things are already
set up boy, girl, boy, girl, for example,
673
01:03:52,239 --> 01:03:57,490
if you pick like an even number, you're going
to get all boys or all girls, right? And
674
01:03:57,490 --> 01:04:03,589
I noticed this actually, when I was doing
a study in the lab, we wanted to study like
675
01:04:03,589 --> 01:04:08,589
whenever they put the assay through the machines,
we thought some of the assays weren't running,
676
01:04:08,589 --> 01:04:15,470
right. And so we wanted to take a sample.
And I wanted to take a systematic sample.
677
01:04:15,470 --> 01:04:21,279
But I wanted to take a systematic sample,
like every seven days, and that's a week.
678
01:04:21,279 --> 01:04:29,119
And so I asked my colleague, does the lab
vary day by day in what assays it runs? Because
679
01:04:29,119 --> 01:04:34,469
if it always runs the sexually transmitted
disease assays, say, it saves them up and runs
680
01:04:34,469 --> 01:04:40,209
them all on Friday, and I'm sampling from
every Friday, that's all I'm gonna get, right?
681
01:04:40,209 --> 01:04:44,940
That's actually called periodicity. You don't
have to remember that I don't think I've ever
682
01:04:44,940 --> 01:04:50,099
even seen that written. It's just I remember
my lecturer in my class telling us that that's
683
01:04:50,099 --> 01:04:55,339
what you have to worry about with systematic
sampling. It's not a real common problem, though.
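The systematic steps, and the periodicity trap, can be tried out with a short Python sketch. Assumptions are labeled: the eight weeks of lab days are invented, the function name `systematic_sample` is hypothetical, and the random start is restricted to the first k positions, which is a common convention rather than something stated in the lecture.

```python
import random

def systematic_sample(ordered, k, seed=None):
    """Pick a random starting point, then take every kth individual after it."""
    rng = random.Random(seed)
    start = rng.randrange(k)  # assumption: the random start falls within the first k positions
    return ordered[start::k]

# Hypothetical run of lab days, already in order: 8 weeks
days = ["Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun"] * 8

weekly = systematic_sample(days, 7, seed=3)       # k = 7 lines up with the weekly cycle
every_third = systematic_sample(days, 3, seed=3)  # k = 3 does not

print(set(weekly))               # only one day of the week ever shows up
print(sorted(set(every_third)))  # all seven days show up
```

When k matches a cycle in the ordering (every 7 days, or an even k with boy, girl, boy, girl), the sample lands on the same kind of individual every time; a k that does not divide the cycle spreads across it.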
684
01:04:55,339 --> 01:05:00,650
But what's awesome about it is you can do
it in a clinical setting. So you can sample
685
01:05:00,650 --> 01:05:05,549
patients that way, coming into a clinic or
coming to a central lab or like in the emergency
686
01:05:05,549 --> 01:05:10,380
room. And that's why this is a
particularly powerful way to sample
687
01:05:10,380 --> 01:05:17,099
is that if you have an ongoing sort of patient
influx, when you design your research, you
688
01:05:17,099 --> 01:05:21,470
could simply say, once you decide how many
people you need to recruit for your sample,
689
01:05:21,470 --> 01:05:25,170
that you would use systematic sampling, and
just have somebody in the clinic inviting
690
01:05:25,170 --> 01:05:33,680
every kth person who qualifies, every kth
patient who qualifies into your study. So
691
01:05:33,680 --> 01:05:39,739
it's easy to do systematic sampling, it's
easy to do with or without a list. And you
692
01:05:39,739 --> 01:05:47,539
just pick a random starting point, and then
you pick every case individual. Next, we're
693
01:05:47,539 --> 01:05:55,299
gonna move on to cluster sampling. So what
is up with cluster sampling? Why do we need
694
01:05:55,299 --> 01:06:00,089
even other kinds of sampling? I just went
over so many kinds. I mean, you could use
695
01:06:00,089 --> 01:06:05,479
stratified systematic or simple random sampling,
why would you even need another kind? Well,
696
01:06:05,479 --> 01:06:11,030
cluster is very special. It's special, because
it's the kind of sampling you use when you
697
01:06:11,030 --> 01:06:18,240
think there's a problem at a particular geographic
location. Typically, that's how cluster sampling
698
01:06:18,240 --> 01:06:22,079
is used. And, and I'll explain it further.
699
01:06:22,079 --> 01:06:29,420
Imagine, for example, there's a particular
factory that is believed to emit fumes
700
01:06:29,420 --> 01:06:34,439
that cause problems with people's health.
Well, you can't do simple random sampling
701
01:06:34,439 --> 01:06:40,449
all over the nation, right, or you won't even
get people by that factory. You can't really
702
01:06:40,449 --> 01:06:46,130
easily do stratified or systematic sampling
there. Cluster sampling is what's designed
703
01:06:46,130 --> 01:06:51,880
when you want to study something that's coming
from a geographic location. So when you do
704
01:06:51,880 --> 01:06:58,099
cluster sampling, you start by dividing a
map into geographic areas. So I'm from Minnesota,
705
01:06:58,099 --> 01:07:04,829
and I know that there was a mine there with
vermiculite in it. And it was contaminated,
706
01:07:04,829 --> 01:07:10,180
a lot of people got sick from it. But they
didn't know that's what was going on. So they
707
01:07:10,180 --> 01:07:17,739
first I think divided Minnesota into different
geographic areas. After dividing the
708
01:07:17,739 --> 01:07:23,079
area into these different geographic areas,
some with the bad thing in it, and
709
01:07:23,079 --> 01:07:30,230
some without the bad thing in it, you randomly
pick these clusters or areas from the map.
710
01:07:30,230 --> 01:07:38,650
So, like, if you'll see there on the
screen, there's a map of the state of Virginia,
711
01:07:38,650 --> 01:07:45,809
and it's all been divided into different groups.
And then this cluster is highlighted.
712
01:07:45,809 --> 01:07:52,170
You usually probably pick more than one cluster,
sometimes it's only four or five. But the
713
01:07:52,170 --> 01:07:59,049
idea is you try to enroll all of the individuals
in the cluster, it's usually people, although
714
01:07:59,049 --> 01:08:04,300
you can do it with animals, if there's a disease
going around among animals, you know, you
715
01:08:04,300 --> 01:08:09,770
would divide the area up into
clusters, and then you try to measure all
716
01:08:09,770 --> 01:08:17,698
the animals in the cluster. So as you can
imagine, not only is this sort of practically
717
01:08:17,698 --> 01:08:23,588
difficult, but there's reasons why people
live together, right? People live in communities.
718
01:08:23,589 --> 01:08:27,910
I mean, people don't just randomly scatter
themselves. You know, cultural communities
719
01:08:27,910 --> 01:08:33,630
grow. Companies grow around art. You know,
affluent communities have different people
720
01:08:33,630 --> 01:08:39,670
in them than communities that have less money.
So sometimes the people located in the cluster
721
01:08:39,670 --> 01:08:43,849
are all similar in a way that makes the problem
hard to study. And this is, especially if
722
01:08:43,849 --> 01:08:49,880
you're studying some geographic thing, like
maybe a factory or a sewage plant, that you
723
01:08:49,880 --> 01:08:55,770
think might be causing cancer, if you're in
an area where there's a lot of pollution anyway,
724
01:08:55,770 --> 01:09:00,689
from other things, and a lot of low income
people live there. Because if you're high
725
01:09:00,689 --> 01:09:06,869
income you can afford not to, well, they're
already being exposed to higher rates of carcinogens
726
01:09:06,870 --> 01:09:11,109
and probably have a higher cancer rate. It's
hard to tell what the independent effect might
727
01:09:11,109 --> 01:09:16,729
be of that thing in that geographic location
because of the other similarities of the people
728
01:09:16,729 --> 01:09:25,499
around. And so cancer ends up being
a really difficult, tough nut to crack. Because
729
01:09:25,500 --> 01:09:30,960
where we see high rates, there are often a
lot of different geographic issues going on
730
01:09:30,960 --> 01:09:33,859
there, and cluster sampling doesn't really help
tease
731
01:09:33,859 --> 01:09:39,339
that out.
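As a rough Python sketch of the cluster steps (divide the map into areas, randomly pick a few areas, then enroll everyone in the picked areas): the six areas and their residents below are invented, and the function name `cluster_sample` is hypothetical.

```python
import random

def cluster_sample(clusters, n_clusters, seed=None):
    """Randomly pick a few clusters, then take every individual inside them."""
    rng = random.Random(seed)
    chosen = rng.sample(sorted(clusters), n_clusters)  # pick areas, not people
    people = [person for name in chosen for person in clusters[name]]
    return chosen, people

# Hypothetical map already divided into six geographic areas, three residents each
areas = {f"Area {c}": [f"{c}-resident-{i}" for i in range(1, 4)] for c in "ABCDEF"}

chosen, people = cluster_sample(areas, 2, seed=7)
print(chosen)
print(people)
```

The randomness is only in which areas get picked. Inside a picked area everybody is taken, so whatever the residents of that area have in common comes along with them.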
732
01:09:39,339 --> 01:09:45,069
So to wrap this up, cluster sampling is used
when geography is important. So if there is
733
01:09:45,069 --> 01:09:50,359
something geographically located in a certain
spot and you can't move it, then you kind
734
01:09:50,359 --> 01:09:57,329
of are stuck doing cluster sampling. So briefly,
the map around that area is divided into different
735
01:09:57,329 --> 01:10:03,809
subareas, right? Not all the
areas are picked, just a few are randomly
736
01:10:03,809 --> 01:10:09,449
picked. And then all of the people in that
particular area are sampled. And of course,
737
01:10:09,449 --> 01:10:13,219
it's biased towards the people living in the
area. You know, if the area you pick
738
01:10:13,219 --> 01:10:17,619
has a bunch of affluent people, you'll get
affluent people; pick an area with a bunch
739
01:10:17,619 --> 01:10:22,989
of immigrants, you'll get immigrants. And so
cluster sampling is not perfect, but you're
740
01:10:22,989 --> 01:10:27,739
kind of stuck with it when there's a situation
with geography. How I always remember
741
01:10:27,739 --> 01:10:34,099
it is, when I used to live in Florida, we'd
like to drive up to Georgia because they had
742
01:10:34,099 --> 01:10:40,790
the best pecan clusters. That's like a type
of dessert with pecans and caramel and stuff.
743
01:10:40,790 --> 01:10:44,860
So when I think of cluster sampling, I think
of those pecan clusters that they're only
744
01:10:44,860 --> 01:10:50,360
really good in Georgia. So that's my way of
remembering that cluster sampling has to do
745
01:10:50,360 --> 01:10:57,849
with geography. Now I'm finally going to talk
about the last two types of sampling that
746
01:10:57,849 --> 01:11:02,570
I'm going to cover in this lecture, convenience
sampling and multistage sampling. They're
747
01:11:02,570 --> 01:11:07,059
both a little quick, so I'm going to just
cover them quickly. First, we're going to
748
01:11:07,059 --> 01:11:12,440
start by talking about convenience sampling.
And we like that name, right? It's convenient.
749
01:11:12,440 --> 01:11:16,790
Convenience sampling can be used under low
risk circumstances, like if the findings of
750
01:11:16,790 --> 01:11:21,341
what you're doing aren't really that important.
Like, for instance, let's say that you wanted
751
01:11:21,341 --> 01:11:25,540
to know what ice cream is the best from the
restaurant next to the hospital, let's say
752
01:11:25,540 --> 01:11:29,520
a new restaurant opens up, and you're gonna
go off your diet, you're gonna go get some
753
01:11:29,520 --> 01:11:34,059
ice cream, but you don't want to waste it
right. So you want to ask people, what's the
754
01:11:34,059 --> 01:11:40,060
best one, you might ask your coworkers, you
might ask, you know, the people at the restaurant,
755
01:11:40,060 --> 01:11:44,060
hey, what's the best ice cream, but the results
are not so reliable, because you might end
756
01:11:44,060 --> 01:11:51,360
up on Yelp and see that other people disagree.
So convenience sampling is basically using
757
01:11:51,360 --> 01:11:57,880
results or data that are conveniently or readily
obtained. In my master's degree, one of the
758
01:11:57,880 --> 01:12:03,210
things I did was I surveyed people anonymously
who were coming to a health fair, I sat at
759
01:12:03,210 --> 01:12:08,430
a booth, and I gave them a survey with a few
questions in it. That was definitely a convenience
760
01:12:08,430 --> 01:12:13,780
sample, you know, just people showing up for
the health fair. And this can be useful when
761
01:12:13,780 --> 01:12:19,630
there's not a lot of resources allocated to
the study, like, I was a starving master's
762
01:12:19,630 --> 01:12:24,280
student, right, like, I didn't have any money.
So that that was perfect for me convenience
763
01:12:24,280 --> 01:12:30,320
sampling. And also, you know, the questions
I was asking them about were just characteristics
764
01:12:30,320 --> 01:12:34,800
of whether or not they had risk for diabetes.
Well, I'm not a doctor, and I wasn't going
765
01:12:34,800 --> 01:12:39,790
to do anything about it. But it was interesting.
So it wasn't a very high risk survey to fill
766
01:12:39,790 --> 01:12:46,210
out. And convenience sampling is convenient,
because it uses an already assembled group
767
01:12:46,210 --> 01:12:51,949
for surveys like I was doing at the health
fair. An example might be to ask patients
768
01:12:51,949 --> 01:12:55,949
in the waiting room to fill out a survey or
ask students in a class, you know, sometimes
769
01:12:55,949 --> 01:12:59,880
I do when I'm teaching, I'll do a convenience
sample of whoever's sitting there. I'll say,
770
01:12:59,880 --> 01:13:02,230
Hey, is the homework that I assigned you this
week too hard?
771
01:13:02,230 --> 01:13:03,659
Well, it's always too hard. I
772
01:13:03,659 --> 01:13:08,570
don't even know why I do the survey. But anyway,
um, sometimes as a teacher, you'll just want
773
01:13:08,570 --> 01:13:15,010
to do a convenience sample just to get a
gauge on where the class is. But there are problems
774
01:13:15,010 --> 01:13:19,489
with it, right? You can't just use it for
everything, even though it's nice and convenient.
775
01:13:19,489 --> 01:13:23,949
There's bias in every group, right? So if
I let everybody go on break, and then whoever's
776
01:13:23,949 --> 01:13:27,670
still sitting there, I ask them if the homework's
too hard, I might get a totally different
777
01:13:27,670 --> 01:13:32,780
answer than if I waited for everybody to come
back. Right. And, you know, just about any
778
01:13:32,780 --> 01:13:37,219
time you just waltz into a room, like when
I went to the health fair, who do you think
779
01:13:37,219 --> 01:13:40,699
is there? A bunch of sick people? No, there's
a bunch of health minded people there. And
780
01:13:40,699 --> 01:13:47,320
so I'm gonna get a bunch of bias, right. And
also, more importantly, when you do convenience
781
01:13:47,320 --> 01:13:55,329
sampling, you often miss important subpopulations.
So remember, stratified sampling, how sometimes
782
01:13:55,329 --> 01:14:01,870
people don't group evenly into the different
strata? Maybe they do kind of in high schools,
783
01:14:01,870 --> 01:14:07,369
but especially when it comes to job classifications,
they usually have fewer bigwigs than they
784
01:14:07,369 --> 01:14:14,579
do lackeys, right? And if they just have
a few bigwigs, if you do a simple random sample,
785
01:14:14,579 --> 01:14:19,599
you might miss all of them. So maybe you
try a stratified sample. On the other hand,
786
01:14:19,599 --> 01:14:24,909
if you walk into the break room that is used
by the lackeys and you say, hey, I want you to
787
01:14:24,909 --> 01:14:32,239
fill out my, you know, work satisfaction survey.
All of the ones you're going to get are going
788
01:14:32,239 --> 01:14:36,690
to be from the lackeys, you're not going to
get any representation from the upper job
789
01:14:36,690 --> 01:14:42,670
classes because they don't go in that lounge,
so you'd be missing them. So that's the main
790
01:14:42,670 --> 01:14:48,170
problem with a convenience sample: the results
can be so severely biased because you're only
791
01:14:48,170 --> 01:14:56,119
asking a small, biased group of people that
probably are all alike in some way. It's not a
792
01:14:56,119 --> 01:14:58,890
very representative sample.
793
01:14:58,890 --> 01:15:00,570
Next,
794
01:15:00,570 --> 01:15:06,960
I'm going to talk about multi stage sampling.
So, you know, if you have a kid and the kid's
795
01:15:06,960 --> 01:15:12,090
crying and somebody's like, what's up, you say, well,
the kid's going through stages. Well, that's
796
01:15:12,090 --> 01:15:16,160
exactly what you're doing when you're doing
multistage sampling: you're going through
797
01:15:16,160 --> 01:15:23,050
stages. It's basically like mixing and matching,
the different sampling I just talked about,
798
01:15:23,050 --> 01:15:28,699
only you do one stage, and then two stages,
and then three stages, and then four stages,
799
01:15:28,699 --> 01:15:33,340
or maybe even more. And that's how you get
your sample. So if you're imagining, "Wow, I've
800
01:15:33,340 --> 01:15:39,340
got to start with a lot of people," you're
probably right. I'll just give an example I made
801
01:15:39,340 --> 01:15:45,460
up of a way that you could do multistage sampling:
you could start with stage one as a
802
01:15:45,460 --> 01:15:51,150
cluster sample, right? Remember, where you
take out a map, and then you divide into areas?
803
01:15:51,150 --> 01:15:56,770
Well, let's divide into states and take two
census regions of states, like about 10 states
804
01:15:56,770 --> 01:16:01,770
from those clumps. Okay, now, we limited it
to that. Now let's go to stage two of our
805
01:16:01,770 --> 01:16:07,370
multistage sampling. Now, from each of those,
we could take a random sample of counties,
806
01:16:07,370 --> 01:16:13,250
right. So we go and look at all the counties
and then take that random sample. Then after
807
01:16:13,250 --> 01:16:20,030
we get those counties, stage three, we could
take a stratified sample of schools from each
808
01:16:20,030 --> 01:16:26,030
county. So some of the counties will be totally
rural, some will be totally urban, but most
809
01:16:26,030 --> 01:16:31,090
will have some mix. So we'll take a look at
a few schools from the urban, a few schools
810
01:16:31,090 --> 01:16:36,800
from the rural in stage three. That'll
give you a stratified sample of schools
811
01:16:36,800 --> 01:16:41,020
from the simple random sample of counties
from this cluster sample of states. Okay,
812
01:16:41,020 --> 01:16:47,000
now we've got our schools. Stage four could be
a stratified sample of classrooms. So once
813
01:16:47,000 --> 01:16:51,080
we figured out our urban schools or rural
schools, we could go in there and look at
814
01:16:51,080 --> 01:16:56,790
all the classrooms, freshman, sophomore, junior,
senior, and take a stratified sample of those.
815
01:16:56,790 --> 01:17:01,949
So it's basically mixing and matching. But
you're right, you got to start with a lot
816
01:17:01,949 --> 01:17:05,780
to begin with, if you're gonna whittle it
down in a whole bunch of stages. It doesn't
817
01:17:05,780 --> 01:17:11,460
have to be four; I just gave you four. Now I'm
going to give you a real life example. This
818
01:17:11,460 --> 01:17:17,969
is the National Health and Nutrition Examination
Survey, or NHANES. Definitely not a master's
819
01:17:17,969 --> 01:17:23,810
project. This is done by the Centers for Disease
Control and Prevention in the United States,
820
01:17:23,810 --> 01:17:30,610
right. So what I'm kind of hinting towards
is the kinds of places doing multistage sampling
821
01:17:30,610 --> 01:17:38,960
are governments. Not only do you have to start
with a whole bunch of people and things and
822
01:17:38,960 --> 01:17:43,800
individuals, states and schools, and what
have you, right, but it's also a lot of work
823
01:17:43,800 --> 01:17:49,960
to do all the sampling, and it better be for
good reason. And the National Health and Nutrition
824
01:17:49,960 --> 01:17:55,780
Examination Survey is a good reason. That's
a survey that's done by the CDC to
825
01:17:55,780 --> 01:18:02,079
try and measure America's health. Of course,
it's doing inferential statistics, right,
826
01:18:02,079 --> 01:18:08,040
it's taking a sample and trying to extrapolate
that information back to the population. And
827
01:18:08,040 --> 01:18:11,551
so it's got to be really careful about how
it does its sampling. You can't just waltz in
828
01:18:11,551 --> 01:18:17,460
and do a bunch of convenience sampling. So
this is how it does it, just briefly, they
829
01:18:17,460 --> 01:18:24,679
start, in stage one, by sampling counties.
Then from those counties, they sample something
830
01:18:24,679 --> 01:18:31,330
called segments, which are defined in the census;
they're different areas. From those segments,
831
01:18:31,330 --> 01:18:36,800
those areas, they sample households. And that's
what they mean: wherever you live is
832
01:18:36,800 --> 01:18:41,670
a household. Even if you live in a dorm, that's
a household, or if you live in assisted living,
833
01:18:41,670 --> 01:18:47,780
that's a household. An apartment building, a
house. So they sample those, and once they
834
01:18:47,780 --> 01:18:53,400
knock on the door of your household, they
sample individuals from the household. So they
835
01:18:53,400 --> 01:19:02,090
use four stages of sampling. And that's a
real life example of multistage sampling.
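Here is a minimal Python sketch of whittling a big sampling frame down stage by stage. Everything in it is an assumption for illustration: the nested frame (states, counties, schools, classrooms), the function name multistage_sample, and the stage sizes are made up. To keep it short, every stage uses a simple random sample, whereas the made-up example in the lecture mixes cluster, simple random, and stratified sampling across its stages.

```python
import random

def multistage_sample(frame, sizes, rng):
    """Sample a nested frame one stage at a time.

    frame: nested dicts ending in lists,
           e.g. {state: {county: {school: [classrooms]}}}
    sizes: how many units to keep at each stage, outermost stage first
    """
    if not sizes:
        return frame
    units = sorted(frame) if isinstance(frame, dict) else list(frame)
    chosen = rng.sample(units, min(sizes[0], len(units)))
    if isinstance(frame, dict):
        # Recurse into each chosen unit for the next stage.
        return {u: multistage_sample(frame[u], sizes[1:], rng) for u in chosen}
    return chosen  # innermost stage: a plain list of sampled units

# Hypothetical frame: 4 states x 5 counties x 4 schools x 4 classrooms = 320 classrooms.
frame = {
    f"state{s}": {
        f"county{s}.{c}": {
            f"school{s}.{c}.{k}": [f"class{s}.{c}.{k}.{r}" for r in range(4)]
            for k in range(4)
        }
        for c in range(5)
    }
    for s in range(4)
}

rng = random.Random(42)  # fixed seed so the sketch is repeatable
sample = multistage_sample(frame, [2, 3, 2, 1], rng)

classrooms = [
    room
    for counties in sample.values()
    for schools in counties.values()
    for rooms in schools.values()
    for room in rooms
]
print(len(classrooms))  # 12, that is 2 states * 3 counties * 2 schools * 1 classroom
```

Starting from 320 classrooms and ending with 12 shows why you have to begin with a lot if you are going to whittle it down over several stages.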
836
01:19:02,090 --> 01:19:08,989
So in summary, on convenience and multistage
sampling, with respect to convenience sampling,
837
01:19:08,989 --> 01:19:16,199
you want to avoid it unless it's really a
low risk question you're asking about. And
838
01:19:16,199 --> 01:19:20,199
you also want to avoid it unless it's really
the only type of sampling possible under the
839
01:19:20,199 --> 01:19:26,920
circumstances. When you have situations where
you have patients with very rare disease,
840
01:19:26,920 --> 01:19:32,300
probably convenience sampling from your rare
disease clinic is reasonable. It's
841
01:19:32,300 --> 01:19:38,869
also used when resources are low. And so those
are a few good reasons to try to use convenience
842
01:19:38,869 --> 01:19:45,139
sampling. It's really something that you want
to use only if it's the thing
843
01:19:45,139 --> 01:19:50,170
you're stuck with. It's much better to look
towards these other sampling approaches I
844
01:19:50,170 --> 01:19:56,420
described. And then finally, multistage sampling
is usually used in large governmental studies.
845
01:19:56,420 --> 01:20:00,739
So don't expect to actually design anything
alone with multistage sampling. When that
846
01:20:00,739 --> 01:20:06,010
happens, as with those four stages I showed you for
that survey that the CDC does, hundreds of
847
01:20:06,010 --> 01:20:11,480
people work on that. Even just for the sampling, tons
of people work to try and set that up. It's
848
01:20:11,480 --> 01:20:16,929
very difficult. But I wanted you to know about
that kind of sampling, because it's important
849
01:20:16,929 --> 01:20:23,909
in healthcare, and it happens a lot. So in
conclusion, we made it through the sampling
850
01:20:23,909 --> 01:20:29,920
lecture, didn't we? I first started by describing
some definitions you needed to be able to
851
01:20:29,920 --> 01:20:35,400
understand all these different types of sampling.
Then I went into simple random sampling, and
852
01:20:35,400 --> 01:20:41,120
showed you how to do it two different ways
and what it achieves and also its limitations.
853
01:20:41,120 --> 01:20:46,800
We next talked about stratified sampling,
why you do that and how you do that, and the
854
01:20:46,800 --> 01:20:51,800
limitations of that one, too. Then we got
into systematic sampling, which is a little
855
01:20:51,800 --> 01:20:58,719
more flexible, and pretty easy to explain.
Next, we talked about cluster sampling, and
856
01:20:58,719 --> 01:21:04,369
why you might need to pull that tool out of
your sampling toolbox. And then finally, we
857
01:21:04,369 --> 01:21:10,219
covered convenience sampling and multistage
sampling. All righty. Well, I hope you better
858
01:21:10,219 --> 01:21:15,679
understand sampling now and can keep all of
these different types of sampling straight
859
01:21:15,679 --> 01:21:26,679
in your mind. Hello, everybody, it's Monica
Wahi, Labouré College lecturer for statistics.
860
01:21:26,679 --> 01:21:35,489
We're on to Section 1.3. Introduction to experimental
design. And here are your learning objectives.
861
01:21:35,489 --> 01:21:41,460
So at the end of this lecture, you should
be able to first state the steps of conducting
862
01:21:41,460 --> 01:21:47,099
a statistical study, and then select one step
of developing a statistical study and state
863
01:21:47,099 --> 01:21:52,610
the reason for the step, you should be able
to name one common mistake that can introduce
864
01:21:52,610 --> 01:21:59,199
bias into a survey and give an example. You should
be able to explain what a lurking variable
865
01:21:59,199 --> 01:22:05,110
is, and give an example of that. And you should
be able to define what a completely randomized
866
01:22:05,110 --> 01:22:06,110
experiment
867
01:22:06,110 --> 01:22:07,449
is.
868
01:22:07,449 --> 01:22:12,789
So let's get started. This lecture is going to
cover four basic topics. First, we're going
869
01:22:12,789 --> 01:22:19,829
to look at the steps to conducting a statistical
study. You may think there are a lot of steps
870
01:22:19,829 --> 01:22:26,389
to conducting a study; this is from the point
of view of the statistician. Okay? Then we're
871
01:22:26,389 --> 01:22:30,650
gonna go over basic terms and definitions.
And by now, you're probably used to the fact
872
01:22:30,650 --> 01:22:37,350
that in statistics, certain words are reappropriated.
And they mean something specific in statistics.
873
01:22:37,350 --> 01:22:42,420
So we'll talk about that. Then we'll talk
about bias and what that is and how to avoid
874
01:22:42,420 --> 01:22:48,800
it when designing your studies. Finally,
we'll talk about randomization and the particular
875
01:22:48,800 --> 01:22:56,409
topics you need to think about when thinking
about randomization. So let's get started.
876
01:22:56,409 --> 01:23:01,889
We're going to start with, of course, basic
terms and definitions. And so first, we're
877
01:23:01,889 --> 01:23:06,989
going to review these steps that I keep talking
about to conducting a statistical study. But
878
01:23:06,989 --> 01:23:13,130
there's some vocabulary that comes
up. And so we're going to talk about those
879
01:23:13,130 --> 01:23:17,870
vocabulary terms that come up. And then also,
I'm going to give you a few examples from
880
01:23:17,870 --> 01:23:24,829
healthcare. So here are the steps I keep talking
about. So these are the basic guidelines for
881
01:23:24,829 --> 01:23:29,900
planning a statistical study. So the first
thing you want to do is state your hypothesis.
882
01:23:29,900 --> 01:23:34,840
And you know, I've been a scientist a while now.
And I can't tell you how many times I get
883
01:23:34,840 --> 01:23:40,239
in a group of us, and people are all curious,
and they start thinking, "Let's do a study."
884
01:23:40,239 --> 01:23:44,119
And it's only halfway through our conversation
that I suddenly say, Hey, wait a second, we
885
01:23:44,119 --> 01:23:49,110
don't have a hypothesis. What's our hypothesis?
So it's easy, even for scientists, to forget
886
01:23:49,110 --> 01:23:56,429
that that's really step one, is you have to
have a hypothesis. And so whatever hypothesis
887
01:23:56,429 --> 01:24:03,991
you pick, the hypothesis is about some individuals.
If I have a hypothesis about hospitals, those
888
01:24:03,991 --> 01:24:09,000
are the individuals. If I have a hypothesis about
patients, those are the individuals. But it's
889
01:24:09,000 --> 01:24:14,680
important actually, to nail that down. Because
am I talking about patients in the hospitals?
890
01:24:14,680 --> 01:24:19,889
Or am I talking about the hospitals? So make
sure that you understand after you, you know,
891
01:24:19,889 --> 01:24:27,400
percolate and decide on your hypothesis, who
the actual individuals of interest are. And
892
01:24:27,400 --> 01:24:33,369
that's because you're going to have to
measure variables about these individuals.
893
01:24:33,369 --> 01:24:38,900
So step three is to specify all the variables
you're going to need to measure about these
894
01:24:38,900 --> 01:24:41,460
individuals. You know, and of course, they
relate to the
895
01:24:41,460 --> 01:24:43,140
hypothesis.
896
01:24:43,140 --> 01:24:50,010
So it's a good thing that was step one, right?
Step four is to determine whether you want
897
01:24:50,010 --> 01:24:57,610
to use the entire population in your study
or a sample. If you already have a bunch of
898
01:24:57,610 --> 01:25:02,469
data, like you have the census data, you
might as well use the entire population. But
899
01:25:02,469 --> 01:25:06,500
typically, if you don't have the data, you're
going to want to sit down and think about
900
01:25:06,500 --> 01:25:11,340
using a sample. And if you do that, while
you're sitting down, you should probably also
901
01:25:11,340 --> 01:25:19,030
choose the sampling method on the basis of
what I talked about in the sampling lecture.
902
01:25:19,030 --> 01:25:23,670
Now that you've figured out your hypothesis,
you got your individuals, you figured out
903
01:25:23,670 --> 01:25:27,920
your variables, and you figured out whether
you're going to do a census or a sample, if
904
01:25:27,920 --> 01:25:33,889
you're going to do a sample, what type of sample.
Step five is you think about the ethical concerns
905
01:25:33,889 --> 01:25:38,830
before data collection. If you're going to
be asking some sensitive questions, you think
906
01:25:38,830 --> 01:25:44,530
about privacy, if you're going to be doing
some invasive procedures, you think about
907
01:25:44,530 --> 01:25:48,929
how painful that would be, and how hard that
would be on somebody, especially if they're
908
01:25:48,929 --> 01:25:54,199
not even, you know, sick; they're just healthy,
and you're just doing an experiment on healthy
909
01:25:54,199 --> 01:25:58,690
people just to better understand biology.
So you have to really sit down and think about
910
01:25:58,690 --> 01:26:05,679
these ethical concerns. And they may slightly
change your study design. Finally, after
911
01:26:05,679 --> 01:26:11,409
you get steps one through five taken
care of, that's when you actually jump in
912
01:26:11,409 --> 01:26:16,909
and collect the data. And like I was saying,
you know, when I meet with my scientist friends,
913
01:26:16,909 --> 01:26:21,850
we get all excited about an idea. We're often
talking about step six. We're like, oh, we
914
01:26:21,850 --> 01:26:27,690
should do a survey, we should this, we should
that. And I end up saying, hey,
915
01:26:27,690 --> 01:26:32,010
we actually have to go back to step one and
start talking about a hypothesis, because
916
01:26:32,010 --> 01:26:36,670
I suddenly realize I don't even know what
data to collect, right? If you don't go through
917
01:26:36,670 --> 01:26:43,929
the steps in order, you really aren't doing
it right. Step seven is, after you get the
918
01:26:43,929 --> 01:26:50,239
data, you finally use either descriptive or
inferential statistics to answer your hypothesis.
919
01:26:50,239 --> 01:26:57,410
And that's what statistics is about. It's
here for that. And then finally, after you
920
01:26:57,410 --> 01:27:03,330
use the statistics, you have to write up what
you find, even if you're at a workplace and
921
01:27:03,330 --> 01:27:07,130
they ask you to do a little survey. That
happened once when I was working somewhere.
922
01:27:07,130 --> 01:27:13,989
And they wanted us to do a survey. Their hypothesis
was that they didn't have enough leadership
923
01:27:13,989 --> 01:27:18,670
programs, and they weren't building good leaders
they could promote. And so I was on a team
924
01:27:18,670 --> 01:27:24,070
that did the survey, we didn't, you know,
really publish it, like, everywhere. But we
925
01:27:24,070 --> 01:27:30,219
made an internal report, right. And in that
internal report, we had to do step eight,
926
01:27:30,219 --> 01:27:36,630
in which we had to note any concerns about data
collection or analysis, you know, that happened
927
01:27:36,630 --> 01:27:41,800
when we were doing a report. And we also had
to make recommendations for future studies,
928
01:27:41,800 --> 01:27:49,390
or if you wanted to study this in future groups
of employees. So in science, what it usually
929
01:27:49,390 --> 01:27:57,699
ends up being is a peer-reviewed literature
report, right? You do a scientific study,
930
01:27:57,699 --> 01:28:03,050
maybe you get a grant. And then you do all
these steps. And then step eight is where
931
01:28:03,050 --> 01:28:09,119
you actually prepare a journal publication.
And in that, you have to note any concerns
932
01:28:09,119 --> 01:28:13,739
about your data collection or analysis, anything
that might have gone wrong, or not gone exactly
933
01:28:13,739 --> 01:28:22,190
the way you planned, or something you need
to take into account to really properly interpret
934
01:28:22,190 --> 01:28:28,010
what the study found. You also want to make
recommendations for future studies, especially
935
01:28:28,010 --> 01:28:33,039
if you screwed something up, or especially
if you answered a really good question. No
936
01:28:33,039 --> 01:28:39,360
reason to perseverate on that question; why
don't we move forward and ask the next one.
937
01:28:39,360 --> 01:28:45,980
Now, these are a lot of steps to remember.
So I'm going to help you try to remember them
938
01:28:45,980 --> 01:28:51,760
in sort of clumps. So let's look at the first
clump, which are steps one through three,
939
01:28:51,760 --> 01:28:59,650
which are: state a hypothesis, identify the individuals
of interest, and specify the variables to
940
01:28:59,650 --> 01:29:08,000
measure. So let's give an example of that.
So let's say our hypothesis was air pollution
941
01:29:08,000 --> 01:29:13,480
causes asthma in children who live in urban
settings. You know, that's how we'd state it,
942
01:29:13,480 --> 01:29:19,080
or we could say that as a research question,
like does air pollution cause asthma in children
943
01:29:19,080 --> 01:29:24,659
who live in urban settings? And so in that
case, the individuals would be children in
944
01:29:24,659 --> 01:29:30,369
urban settings, and the variables we'd have
to measure are air pollution, at least, and
945
01:29:30,369 --> 01:29:35,639
asthma at least. And of course, we'd want
to know more things about these individuals,
946
01:29:35,639 --> 01:29:40,780
these children. We'd probably measure their
income and where exactly they were living,
947
01:29:40,780 --> 01:29:46,579
and how old they were, and if they're male
or female, and these kinds of things, but
948
01:29:46,579 --> 01:29:52,700
that just kind of helps you think about the
first three steps together. Now let's think
949
01:29:52,700 --> 01:29:58,110
about the second three steps together, four,
five, and six, which is determine if you're
950
01:29:58,110 --> 01:30:03,360
going to use a population or sample. If it's a
sample, pick the sampling method. Look at
951
01:30:03,360 --> 01:30:11,619
the ethical concerns and then actually collect
the data. So, when you do that, you can either
952
01:30:11,619 --> 01:30:18,309
quote unquote, collect data, you know, like,
by using existing data by downloading data
953
01:30:18,309 --> 01:30:24,429
from the census, or like Medicare; they have
data sets available that are de-identified,
954
01:30:24,429 --> 01:30:28,719
so you don't know who exactly is in there.
Or you can collect data yourself, like do
955
01:30:28,719 --> 01:30:37,230
a survey or, you know, get a bunch of patients
that will allow you to measure them. When you
956
01:30:37,230 --> 01:30:43,600
use a government data set, often you can
make population measures out of it. And so
957
01:30:43,600 --> 01:30:49,630
you don't really have to go through a lot
of sampling, or ethics, because they've already
958
01:30:49,630 --> 01:30:57,150
provided it for you. And it's confidential.
And that's kind of your data collection. But
959
01:30:57,150 --> 01:31:02,079
most of the time, what you'll see, especially
for studying patients, and treatments, and
960
01:31:02,079 --> 01:31:07,559
cures, and things like that, those are on
a smaller scale. So you end up collecting
961
01:31:07,559 --> 01:31:14,489
data from a sample for those estimates. And
again, you need to choose a sampling approach.
962
01:31:14,489 --> 01:31:20,809
And then you need consent, if legally found
to be human research. So I just want to share
963
01:31:20,809 --> 01:31:26,789
with you in case you didn't know, if you want
to go do research on humans, if you're a nursing
964
01:31:26,789 --> 01:31:33,250
student, or a medical student, or a dental
student, any student, or you're a dentist,
965
01:31:33,250 --> 01:31:40,619
a physician, whatever, a nurse, you can't
just make up a survey, or study design and
966
01:31:40,619 --> 01:31:47,119
go out and do it, you have to get approval
from an ethical board. And that ethical board
967
01:31:47,119 --> 01:31:53,099
will tell you, if what you're doing is considered
legally human research, that you need to get
968
01:31:53,099 --> 01:32:00,719
consent from the patients or the participants
in your study if they're humans. And if you're
969
01:32:00,719 --> 01:32:05,710
collecting data about children, for example,
you have to get the consent of their parents
970
01:32:05,710 --> 01:32:10,449
and the assent of the children. And in the
United States, the way we have it set up,
971
01:32:10,449 --> 01:32:15,469
it's called an institutional review board
for the protection of human subjects in research
972
01:32:15,469 --> 01:32:22,640
or, for short, IRB. And so I just
want to make sure that if you ever do design
973
01:32:22,640 --> 01:32:27,079
a study that you know about this IRB thing,
and you realize you have to go through this
974
01:32:27,079 --> 01:32:32,980
ethical board and make sure that they're cool
with it. Before you can move on to the next
975
01:32:32,980 --> 01:32:40,219
step of designing a statistical study. All
right, finally, we're on to the last clump
976
01:32:40,219 --> 01:32:47,239
of steps, which is seven, and eight, right?
So that's using descriptive or inferential
977
01:32:47,239 --> 01:32:51,880
statistics to answer your hypothesis. In step
six, you collected the data. Now we're going
978
01:32:51,880 --> 01:32:56,969
to do the statistics. And then step eight
is noting any concerns about your data collection
979
01:32:56,969 --> 01:33:00,349
or analysis and making recommendations for
future studies.
980
01:33:00,349 --> 01:33:05,010
So you can kind of imagine this is where we're
sitting in our offices, and writing up our
981
01:33:05,010 --> 01:33:10,000
research, whether we're writing an internal
report to our bosses, or writing for the
982
01:33:10,000 --> 01:33:17,599
scientific literature to publish for everybody.
So at this point, I just want to remind you
983
01:33:17,599 --> 01:33:23,880
that it matters whether you picked a census
or a sample, for your study design. Because
984
01:33:23,880 --> 01:33:28,309
if you pick the census, you're going to do
a certain kind of analysis. And if you pick
985
01:33:28,309 --> 01:33:33,119
the sample, you're going to do a different
kind of analysis and statistics. So again,
986
01:33:33,119 --> 01:33:40,361
that all kind of cycles back to your study
design. And what's important here is I want
987
01:33:40,361 --> 01:33:48,429
to talk to you about the two different main
types of studies. Now within these two categories,
988
01:33:48,429 --> 01:33:54,039
you have different subtypes. But these are
the two main types that you can have. The
989
01:33:54,039 --> 01:34:00,309
first is called an experiment. An experiment
is where a treatment or intervention is deliberately
990
01:34:00,309 --> 01:34:07,389
assigned to the individuals. So you can kind
of imagine that if you enter a study, and
991
01:34:07,389 --> 01:34:12,119
they assign you to take a drug in the study
that you weren't taking before, that would
992
01:34:12,119 --> 01:34:17,810
be an experiment. But another thing could
happen. I mean, you could do this to individuals,
993
01:34:17,810 --> 01:34:22,520
you could do it to animals, but you could
do it to, I keep giving the example of hospitals,
994
01:34:22,520 --> 01:34:28,730
we could choose some hospitals and say, Hey,
you need to try a new policy as the intervention
995
01:34:28,730 --> 01:34:34,869
and that was assigned by the researcher.
So that makes this an experiment. And the
996
01:34:34,869 --> 01:34:40,290
reason why we have experiments is sometimes
you need them. The purpose is to study the
997
01:34:40,290 --> 01:34:47,159
possible effect of the treatment or the intervention
on the variables measured. And so that's one
998
01:34:47,159 --> 01:34:52,679
option you can do is have an experimental
study where the researcher assigns the individuals
999
01:34:52,679 --> 01:35:01,309
to do certain things in the study. There's
another kind of study, the other kind, which
1000
01:35:01,309 --> 01:35:07,211
is called observational, and the way you can
think about it is in experiments, the researcher
1001
01:35:07,211 --> 01:35:13,130
does something, they intervene, they give
a treatment, right? But in an observational study,
1002
01:35:13,130 --> 01:35:21,699
the researcher doesn't do that; the researcher
just observes. So, if you enroll in the study,
1003
01:35:21,699 --> 01:35:25,270
and you say, Do I have to take a drug? Am
I supposed to eat something? What am I supposed
1004
01:35:25,270 --> 01:35:30,429
to do? And the researcher just says, No, we're
just going to measure you, we're just going
1005
01:35:30,429 --> 01:35:34,030
to ask you questions, and we're going to measure
things about you, we're not going to tell
1006
01:35:34,030 --> 01:35:40,010
you to do anything different, then you're
in an observational study. So no treatment
1007
01:35:40,010 --> 01:35:44,880
or intervention is assigned by the researcher
in an observational study. Now, let's say
1008
01:35:44,880 --> 01:35:48,090
you're taking a drug, you know, just because
maybe you have migraines, you're taking a
1009
01:35:48,090 --> 01:35:51,789
migraine drug, well, you just keep taking
it, or you can stop taking it, you know, they
1010
01:35:51,789 --> 01:35:55,560
don't care, they might ask you about taking
the drug, but they're not going to assign
1011
01:35:55,560 --> 01:36:02,869
you to take it. It's an observational study.
I wanted to give you a couple of real life
1012
01:36:02,869 --> 01:36:11,199
examples. So the Women's Health Initiative, up
on the slide was mainly an experiment, okay.
1013
01:36:11,199 --> 01:36:16,310
This was run by the United States government,
but of course, had the cooperation of many,
1014
01:36:16,310 --> 01:36:24,040
many universities and health care centers,
and most importantly, women. So women in America,
1015
01:36:24,040 --> 01:36:29,560
women who were postmenopausal, volunteered
to be in the study. And the study actually
1016
01:36:29,560 --> 01:36:37,349
had two separate sections, the experiment
section, and the observational study section.
1017
01:36:37,349 --> 01:36:42,310
They really wanted women to qualify for the
experiment, and the purpose of the experiment
1018
01:36:42,310 --> 01:36:48,320
was to study whether hormone replacement therapy,
which is a therapy for symptoms that women
1019
01:36:48,320 --> 01:36:54,630
can get if they're postmenopausal, that are
unpleasant, whether that therapy is good
1020
01:36:54,630 --> 01:37:00,949
for women, or bad for women, because they
thought maybe it helps them with the postmenopausal
1021
01:37:00,949 --> 01:37:08,829
symptoms. But they thought maybe it
causes cancer, right? So they know. So what
1022
01:37:08,829 --> 01:37:14,760
they had to do was get a bunch of
women who were agreeing, you know, that they
1023
01:37:14,760 --> 01:37:20,000
would take whatever was assigned to them.
And they had to assign the drug to some of
1024
01:37:20,000 --> 01:37:25,570
these women. So that's what made it an experiment.
The problem is not all the women qualified
1025
01:37:25,570 --> 01:37:31,270
for the study. So they had a separate observational
study. If the woman did not qualify to
1026
01:37:31,270 --> 01:37:38,599
get the experimental drug assigned to her,
then she could be in the observational study.
1027
01:37:38,599 --> 01:37:43,750
And because these are big government studies,
why not, you know, somebody wants to be in
1028
01:37:43,750 --> 01:37:49,800
a study, why not study them, just put them
in the observational section.
1029
01:37:49,800 --> 01:37:57,789
A very huge, popular, long, ongoing study
that's an observational study, again, run
1030
01:37:57,789 --> 01:38:03,730
by, well, this one actually started out of
Harvard. And that's called the Nurses' Health
1031
01:38:03,730 --> 01:38:10,670
Study. Some really smart person figured out
a long time ago, that nurses are smart
1032
01:38:10,670 --> 01:38:16,280
people, they understand their own health,
they understand other people's health. And
1033
01:38:16,280 --> 01:38:21,820
they're good at filling out surveys about
health. So they started studying nurses and
1034
01:38:21,820 --> 01:38:26,829
regularly sending them surveys, of course,
they didn't tell the nurses what to do. They
1035
01:38:26,829 --> 01:38:31,940
didn't assign the nurses any sort of drug
to take or any diet or intervention or anything.
1036
01:38:31,940 --> 01:38:38,460
They just observe the nurses. They send the
nurses a survey about the nurses' health,
1037
01:38:38,460 --> 01:38:43,320
and then the nurse fills out that information.
I think it's every two years that they do
1038
01:38:43,320 --> 01:38:44,320
that,
1039
01:38:44,320 --> 01:38:46,020
they're still doing it.
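The difference between the two study types comes down to who decides the treatment, and a short Python sketch can make that concrete. The participant IDs, the self-reported values, and both function names are hypothetical. Choosing the treated half at random is also an assumption of this sketch; the definition given here only says the treatment is deliberately assigned by the researcher.

```python
import random

def run_experiment(participants, rng):
    """Experiment: the researcher decides who gets the treatment."""
    treated = set(rng.sample(participants, len(participants) // 2))
    return {p: ("treatment" if p in treated else "no treatment")
            for p in participants}

def run_observational(self_reported):
    """Observational study: record what participants already do; assign nothing."""
    return dict(self_reported)

participants = ["P1", "P2", "P3", "P4", "P5", "P6"]
rng = random.Random(1)  # fixed seed so the sketch is repeatable

# In the experiment, the assignment comes from the researcher.
assigned = run_experiment(participants, rng)

# In the observational study, the values come from the participants themselves.
reported = {"P1": "takes migraine drug", "P2": "no drug", "P3": "no drug"}
observed = run_observational(reported)

print(assigned)
print(observed)
```

In the first dictionary the researcher's code produced every value; in the second, the researcher only copied down what participants were already doing.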
1040
01:38:46,020 --> 01:38:53,989
Also, at this point, I do want to point out
the concept of replication. So just the word
1041
01:38:53,989 --> 01:39:03,030
replication, right, in regular speaking means
to copy, right? Like, if you ever, you know,
1042
01:39:03,030 --> 01:39:08,770
have a new roommate, you might need to replicate
your key. So you have a copy of the key for
1043
01:39:08,770 --> 01:39:16,079
the new roommate. Well, part of the whole
science thing is that studies must be done
1044
01:39:16,079 --> 01:39:20,820
rigorously enough to be replicated. So those
are little keywords in there. A rigorous study
1045
01:39:20,820 --> 01:39:28,659
means one that's done really carefully, like
thinking about sampling very carefully. You
1046
01:39:28,659 --> 01:39:34,309
know, like avoiding, for example, nonsampling
error, not being sloppy, not getting a lot
1047
01:39:34,309 --> 01:39:40,870
of undercoverage, using a good sampling frame.
You know, I'm just giving you examples that
1048
01:39:40,870 --> 01:39:45,980
you might know about. But there's a lot of
things that have to be done in research to
1049
01:39:45,980 --> 01:39:50,599
do it properly. It's just like driving or
anything else. You really have to keep your
1050
01:39:50,599 --> 01:39:55,969
eye on a lot of different things and you want
to try to do them perfectly. And the main
1051
01:39:55,969 --> 01:40:01,309
reason why you want to do that is so if somebody
tries to do the same experiment you did, or
1052
01:40:01,309 --> 01:40:05,829
roughly the same experiment you did. Because
you can't do exactly the same, right? If I
1053
01:40:05,829 --> 01:40:10,420
study this hospital over here, and somebody
wants to study that hospital over there, well,
1054
01:40:10,420 --> 01:40:14,639
they're going to get different people in there,
right? But even so if that person decides
1055
01:40:14,639 --> 01:40:20,130
that they want to study that hospital over
there, if I did my study rigorously, then
1056
01:40:20,130 --> 01:40:27,099
it won't be so hard for that person to replicate
how I did the study. And then we can see if
1057
01:40:27,099 --> 01:40:32,289
that person's study and my study get the same
thing, or if there's something slightly off
1058
01:40:32,289 --> 01:40:38,870
or what's going on. And so replicating the
results of both observational studies and
1059
01:40:38,870 --> 01:40:44,340
experiments, is necessary for science to progress.
So you'll know that a lot of experiments are
1060
01:40:44,340 --> 01:40:50,210
done on drugs, before they can be approved
to be given to everybody, because they can't
1061
01:40:50,210 --> 01:40:55,409
just do one study, they have to replicate
it, to make sure that the findings are all
1062
01:40:55,409 --> 01:41:01,429
sort of coming in about the same and that
we can deduce some information about it, you
1063
01:41:01,429 --> 01:41:09,949
really just don't want to rely on one study
for your findings. So I just went over several
1064
01:41:09,949 --> 01:41:14,699
steps that we need to follow when we're doing
a statistical study, and we actually have
1065
01:41:14,699 --> 01:41:20,320
to follow them in order. And you also have
to determine the type of study you're doing,
1066
01:41:20,320 --> 01:41:25,670
you know, is it an experiment or an observational
study. And there's a ton of study decisions
1067
01:41:25,670 --> 01:41:31,809
you have to make. So you got to keep that
in mind. Now, we're going to talk about avoiding
1068
01:41:31,809 --> 01:41:38,290
bias specifically in survey design. Now, you
can do a lot of different kinds of studies.
1069
01:41:38,290 --> 01:41:44,230
But let's just talk about surveys, because
that happens a lot in nursing. Nurses interact
1070
01:41:44,230 --> 01:41:50,500
with patients a lot, with the community, and
with each other. And often they gather information
1071
01:41:50,500 --> 01:41:56,000
about those interactions or attitudes,
or how the healthcare system functions by
1072
01:41:56,000 --> 01:42:03,809
using a survey. So surveys can provide a lot
of information and useful information. But
1073
01:42:03,809 --> 01:42:08,940
it's important that, in all aspects of survey
design and administration when you're giving
1074
01:42:08,940 --> 01:42:13,980
it, you've got to think about minimizing bias,
and, you know, try to get a representative
1075
01:42:13,980 --> 01:42:21,059
sample and accurate measurements.
And so several considerations should be made.
1076
01:42:21,059 --> 01:42:29,320
You want to think about nonresponse
and also voluntary response. Okay, so I talked
1077
01:42:29,320 --> 01:42:36,940
a lot about sampling in the previous lecture.
But just because you invite someone to participate
1078
01:42:36,940 --> 01:42:41,670
in your study, like maybe you're doing systematic
sampling, and every third patient, you ask,
1079
01:42:41,670 --> 01:42:47,130
Would you like to fill out a survey? That
doesn't mean they're going to, right? And
1080
01:42:47,130 --> 01:42:51,000
so if that person says, "No, thank you," even
though they were sampled, that's called
1081
01:42:51,000 --> 01:42:56,070
nonresponse. So if I was helping you with
a survey, and you said, Hey, I was getting
1082
01:42:56,070 --> 01:43:01,769
a lot of nonresponse, I would look at the
proportion. If you approach 200 people, and
1083
01:43:01,769 --> 01:43:09,650
80% said no, you know, that's only a 20% response
rate and an 80% nonresponse rate. If many
1084
01:43:09,650 --> 01:43:16,079
people are refusing your survey, the few who
actually complete it are likely to have a biased
1085
01:43:16,079 --> 01:43:17,449
opinion.
1086
01:43:17,449 --> 01:43:26,179
I've noticed this in situations where
things are really bad, okay. Like, I remember
1087
01:43:26,179 --> 01:43:34,070
going to a subway station and it was flooded,
and it was really in a bad situation. And
1088
01:43:34,070 --> 01:43:40,639
there was a man handing out surveys from the
Transportation Authority. And he was like,
1089
01:43:40,639 --> 01:43:46,340
please take my survey, please take my survey.
And everybody was waving past him. They didn't
1090
01:43:46,340 --> 01:43:52,190
want to grab a survey. Well, you know me,
I've got a bleeding heart for surveys. So I took
1091
01:43:52,190 --> 01:43:58,039
his survey, and I filled it out. You know,
I think the transportation authority's not
1092
01:43:58,039 --> 01:44:04,730
so bad. Right? I lived in Florida, there's
no transportation there, right? So and here
1093
01:44:04,730 --> 01:44:10,080
in Massachusetts, we got a great transportation
system, even if it's flooded or doesn't work
1094
01:44:10,080 --> 01:44:15,411
half the time, right. It's way better than
not having one. Well, I'm not the only one
1095
01:44:15,411 --> 01:44:21,389
who grabbed a survey. A bunch of nice Pollyannas
like me grabbed a survey. So probably the
1096
01:44:21,389 --> 01:44:27,429
Transit Authority thinks that everybody
loves the subway when everybody was waving
1097
01:44:27,429 --> 01:44:32,650
past this poor guy because they were so disgusted,
because the station was flooded.
1098
01:44:32,650 --> 01:44:39,130
Right? So if so many people are refusing your
survey, a high proportion, the few who will
1099
01:44:39,130 --> 01:44:42,989
actually fill it out are going to be kind
of weird, probably like me. You know, you're
1100
01:44:42,989 --> 01:44:49,269
gonna get a bunch of happy people when most
of the people who said no might be sad people.
1101
01:44:49,269 --> 01:44:54,750
And so, the reason they may not be completing
your survey may have to do with how they
1102
01:44:54,750 --> 01:45:01,140
feel about your topic. This is not just in
terms of satisfaction. Let's say you want
1103
01:45:01,140 --> 01:45:07,481
to talk about how many drinks per night somebody
has. Okay? Do you think a lot of people who
1104
01:45:07,481 --> 01:45:12,090
are struggling with alcoholism are gonna want
to fill out that survey? You know, how about
1105
01:45:12,090 --> 01:45:18,480
illegal drugs or other illegal activity, people
who are into that they don't always feel so
1106
01:45:18,480 --> 01:45:23,690
good about talking about it. And so, you know,
you might get a few people to fill out your
1107
01:45:23,690 --> 01:45:28,330
survey, but those are not necessarily the
people who are engaging in the behaviors.
1108
01:45:28,330 --> 01:45:35,590
So the fact that we have the freedom to choose
whether or not we want to be in a survey is
1109
01:45:35,590 --> 01:45:41,370
great. But from a researcher's standpoint,
you have to be careful. If you get low response
1110
01:45:41,370 --> 01:45:46,300
rates, you need to ask yourself who was not
responding? And, you know, am I missing a
1111
01:45:46,300 --> 01:45:54,989
good share of opinion there? And then, when
you get people who do respond, you got to
1112
01:45:54,989 --> 01:46:02,350
be careful with that too. Respondents may
lie on purpose. If you've got a pretty cool
1113
01:46:02,350 --> 01:46:09,389
survey, but you suddenly ask a question that's
too personal, people might just lie. If you
1114
01:46:09,389 --> 01:46:15,900
ask, maybe, students, you're doing a,
you know, maybe a satisfaction survey about how
1115
01:46:15,900 --> 01:46:22,530
the front desk runs at a dorm or something.
If you, you know, ask a question, have you
1116
01:46:22,530 --> 01:46:28,970
ever cheated on a test? You know, everybody's
probably gonna say no. Also, if you ask a
1117
01:46:28,970 --> 01:46:33,050
question where people don't really know the
answer, offhand, they're not gonna put it.
1118
01:46:33,050 --> 01:46:38,639
Like if you ask somebody, you know,
you ask a kid who's been living
1119
01:46:38,639 --> 01:46:43,760
in the house forever, when your parents bought
the house, how much did it cost? I mean, they're
1120
01:46:43,760 --> 01:46:49,380
not gonna know. Maybe they'll know, but probably
not. And so you want to be careful when you
1121
01:46:49,380 --> 01:46:53,870
design your questions that you're not asking
anything that's so personal everybody's going to
1122
01:46:53,870 --> 01:46:58,480
lie about it, or that you're not asking a
question where, even if people
1123
01:46:58,480 --> 01:47:02,110
try to be accurate, they probably won't even
give you the right answer, because it's just
1124
01:47:02,110 --> 01:47:09,460
too hard to think about. Um, respondents
to surveys may also lie without meaning
1125
01:47:09,460 --> 01:47:15,639
to, like, inadvertently. Again, if you ask
a question about something that happened really
1126
01:47:15,639 --> 01:47:22,060
a long time ago, they're probably not going
to get it right. This is called recall bias,
1127
01:47:22,060 --> 01:47:27,789
like, you know how, like,
you can look back at a time in your life,
1128
01:47:27,789 --> 01:47:31,860
like, especially if you went through something
really harsh, like if you were a part of a
1129
01:47:31,860 --> 01:47:37,099
sports team, and you went to state and it
was really tough, but you don't remember the
1130
01:47:37,099 --> 01:47:42,270
tough part, right? You sit around singing,
you know, your sports songs, and you say,
1131
01:47:42,270 --> 01:47:48,000
Hey, that was awesome. Well, that's recall
bias, right? Because after winning state,
1132
01:47:48,000 --> 01:47:53,650
everything looks rosy. But, you know, on the
bus there, it really wasn't that easy. So people
1133
01:47:53,650 --> 01:47:58,239
tend to have recall bias, it's influenced
by events that have happened since the original
1134
01:47:58,239 --> 01:48:01,900
event. So if you're giving people a survey,
and you're saying, Well, before you applied
1135
01:48:01,900 --> 01:48:07,929
for nursing school, you know, did you
think this? Or did you think that? You know,
1136
01:48:07,929 --> 01:48:11,730
they might just tell you and think they're
telling you the truth, but they're actually
1137
01:48:11,730 --> 01:48:17,010
lying. If you actually managed to go back
in time and ask them, they'd tell you something
1138
01:48:17,010 --> 01:48:23,929
different. So again, you can kind of screw
up your own data by screwing up your own questions.
1139
01:48:23,929 --> 01:48:30,500
So you want to think about how you word your
questions. You can also screw up your questions
1140
01:48:30,500 --> 01:48:37,780
by introducing a hidden bias. Something happened
to me recently, where a company sent me a
1141
01:48:37,780 --> 01:48:44,329
free app. And they said, try our free app,
and I downloaded it, and it was awful. Okay.
1142
01:48:44,329 --> 01:48:51,710
And then about a month later, they sent me
a survey. And these were the questions they asked.
1143
01:48:51,710 --> 01:48:58,599
When do you use the app? You know, what time
of day do you use it? Right? Like, how,
1144
01:48:58,599 --> 01:49:03,239
how do you use it? Do you read scientific
literature? Do you read news? And the problem
1145
01:49:03,239 --> 01:49:07,060
was, I couldn't really answer any of this.
Because from the day I downloaded it, I never
1146
01:49:07,060 --> 01:49:13,010
used it. It was so bad. Right? So question
wording may induce a certain response. They
1147
01:49:13,010 --> 01:49:18,780
were asking me how do you use this, but they
didn't give me a choice of I don't. So I had
1148
01:49:18,780 --> 01:49:23,289
to say something. I don't even know what I
said. I mean, there was nothing I could say
1149
01:49:23,289 --> 01:49:29,330
to be honest, because of that bias. So you
have to be careful that you aren't too rosy
1150
01:49:29,330 --> 01:49:35,690
about whatever your topic is, and assume
everybody loves everything. I mean, you've
1151
01:49:35,690 --> 01:49:39,239
got to put out questions like are you even
using the software? Did you have any problems
1152
01:49:39,239 --> 01:49:45,440
with the software? Right? Just assuming
they're using it and liking it and using it,
1153
01:49:45,440 --> 01:49:52,320
you know, like it's supposed to be used, is
a big assumption. Order of questions and other
1154
01:49:52,320 --> 01:49:56,420
wording may induce a certain response and
you'll see this a lot if you take a public
1155
01:49:56,420 --> 01:50:04,140
opinion poll. I used to do a lot of polling.
We'd ask questions like, how likely are you
1156
01:50:04,140 --> 01:50:10,340
to vote for candidate x? You know, very likely,
somewhat likely, somewhat unlikely, or not
1157
01:50:10,340 --> 01:50:15,440
at all likely? And people say, I don't know,
no, not likely. And then you'd say, Well, what
1158
01:50:15,440 --> 01:50:23,590
if you knew that candidate x supported this
new proposition, Proposition 69? Right, then
1159
01:50:23,590 --> 01:50:30,510
would you be more likely to vote for candidate
x? And so that's why order of questions, other
1160
01:50:30,510 --> 01:50:35,280
wording and stuff. They're trying to see, if
I add this fact or that fact, is that going
1161
01:50:35,280 --> 01:50:41,239
to make the person like the candidate better.
And so you do have to think about the order
1162
01:50:41,239 --> 01:50:46,269
you put the questions. And if you want to
ask about two different subjects, kind of
1163
01:50:46,269 --> 01:50:51,969
think about which subject should come first,
because it might color the respondent's answering
1164
01:50:51,969 --> 01:50:58,000
of the subsequent subject. And also on the
slide, I wanted to point out that the scales
1165
01:50:58,000 --> 01:51:05,039
of questions may not accurately measure responses.
Do your feelings always fit on a scale from
1166
01:51:05,039 --> 01:51:10,420
one to five? Well, you know, Yelp's kind of
figured it out. People's feelings about
1167
01:51:10,420 --> 01:51:15,889
restaurants tend to fit on a scale of one
to five. I'd have a lot of trouble filling
1168
01:51:15,889 --> 01:51:22,140
that out if they gave me a scale of one to
17. Right. But sometimes people have more
1169
01:51:22,140 --> 01:51:28,270
granular feelings about things, maybe they
need a longer scale, one to seven. Um, you'll
1170
01:51:28,270 --> 01:51:34,610
see a lot of pain scales, where they offer
more than just five choices, because probably
1171
01:51:34,610 --> 01:51:41,699
pain can maybe go from one to seven or one
to 10. So think about your scales when you're
1172
01:51:41,699 --> 01:51:51,981
creating these questions, because that's your
choice if you're designing the study. Another
1173
01:51:51,981 --> 01:51:58,659
point to be made is the influence of the interviewer.
Now, we don't have as much interviewing going
1174
01:51:58,659 --> 01:52:03,559
on these days, because we have the internet
where we can do anonymous surveys, and people
1175
01:52:03,559 --> 01:52:11,210
just fill them out by self-report. We have robo
phones that you can call, robocalls. And, using
1176
01:52:11,210 --> 01:52:18,869
an automated voice that's obviously not a
person, you can get survey data. But there's
1177
01:52:18,869 --> 01:52:22,989
always situations where you actually have
to interview people, especially if somebody
1178
01:52:22,989 --> 01:52:28,550
is really sick in bed, and you have to show
up there, you have to talk to them. And so
1179
01:52:28,550 --> 01:52:34,750
even on the phone, you have to interview people,
and they can hear your voice, right. So you
1180
01:52:34,750 --> 01:52:39,480
got to think about when you're pairing up
whoever's being interviewed with whoever's
1181
01:52:39,480 --> 01:52:45,829
interviewing, um, I've found that it's best
to have the interviewer come from the same
1182
01:52:45,829 --> 01:52:52,400
population as the research participant. In
general, the only time that can be a problem
1183
01:52:52,400 --> 01:52:59,159
is if they're from the same community, and there's
a privacy issue. But it can be very helpful,
1184
01:52:59,159 --> 01:53:07,530
for the most part, not always, to have your
interviewers be actually from the population
1185
01:53:07,530 --> 01:53:13,500
that you would be studying, you know, from
the individuals that you would be studying.
1186
01:53:13,500 --> 01:53:19,690
So for instance, if you need to interview
a bunch of young African American, you know,
1187
01:53:19,690 --> 01:53:25,860
like some African American teenage men, like
I recently saw a study on how health care
1188
01:53:25,860 --> 01:53:30,900
in the United States really isn't suited for
them. And it needs to improve and needs to
1189
01:53:30,900 --> 01:53:36,410
better cater to this population. Well, let's
say you wanted to better understand that,
1190
01:53:36,410 --> 01:53:41,330
the best thing would be to hire a young
African American male and train him on how
1191
01:53:41,330 --> 01:53:44,909
to be a good interviewer and a good data
collector, because you'd probably get the best
1192
01:53:44,909 --> 01:53:47,249
data that way.
1193
01:53:47,249 --> 01:53:53,020
On the other hand, let's think of different
ways that that could go, you could take a
1194
01:53:53,020 --> 01:54:02,460
person who was older, who is maybe of a different
race, and maybe that would change how this
1195
01:54:02,460 --> 01:54:08,250
young African American male would respond
to this interviewer. I mean, the interviewer
1196
01:54:08,250 --> 01:54:17,889
could be like, in many ways, like the respondent,
but the respondent's perception might change
1197
01:54:17,889 --> 01:54:25,829
how they answer. All verbal and nonverbal
influences matter, you know, clothing, the
1198
01:54:25,829 --> 01:54:30,670
setting that the person's being interviewed
in. And so I'm not saying there's really a
1199
01:54:30,670 --> 01:54:37,410
solution to all this. I'm just saying, make
some good decisions. Like I remember working
1200
01:54:37,410 --> 01:54:45,250
on a data set where there were some questions
that had been asked about some older men about
1201
01:54:45,250 --> 01:54:51,510
their sexual function. And the
data looked funny to me, and the statistician
1202
01:54:51,510 --> 01:54:57,900
who was there during data collection told
me that they had chosen young, female nursing
1203
01:54:57,900 --> 01:55:04,429
students to interview these elderly men about
their sexual habits. And I just said, you
1204
01:55:04,429 --> 01:55:14,130
know, that might be subject to interviewer
influence. And then you of course have to
1205
01:55:14,130 --> 01:55:19,999
worry about vague wording. Just because it
looks clear to you doesn't mean it looks clear
1206
01:55:19,999 --> 01:55:27,849
to everyone. There are simple ways of avoiding
vague terms in the survey, when you can just
1207
01:55:27,849 --> 01:55:32,619
put a number on it. So instead of asking a
person, if they've waited a long time in the
1208
01:55:32,619 --> 01:55:40,420
waiting room, you can say, more than 10 minutes.
You can say exactly like within the last month,
1209
01:55:40,420 --> 01:55:47,119
have you done a certain activity, or
within the next year? Do you expect to change
1210
01:55:47,119 --> 01:55:54,110
schools, or whatever. And so try, wherever
you can, to use numbers or something very specific,
1211
01:55:54,110 --> 01:55:59,580
you know, instead of go to the clinic, go
to the public health clinic at this particular
1212
01:55:59,580 --> 01:56:05,769
corner, or whatever. And then you're going
to get some pretty accurate information.
1213
01:56:05,769 --> 01:56:06,769
But
1214
01:56:06,769 --> 01:56:11,540
sometimes you're stuck using vague terms,
because you're studying vague terms, right?
1215
01:56:11,540 --> 01:56:18,789
I was doing a study of controllable lifestyle,
attitudes towards controllable lifestyle in
1216
01:56:18,789 --> 01:56:24,110
medical students. So we asked this question,
how important is having a controllable lifestyle
1217
01:56:24,110 --> 01:56:29,000
to you in your future career? Well, what does
that mean? That's pretty vague. So what we
1218
01:56:29,000 --> 01:56:32,909
did is we used this grounding, this anchoring
language,
1219
01:56:32,909 --> 01:56:38,900
we added the sentence, a controllable lifestyle
is defined as one that allows the physician
1220
01:56:38,900 --> 01:56:44,699
to control the number of hours devoted to
practicing his or her specialty. So even though
1221
01:56:44,699 --> 01:56:49,849
we're talking about something kind of waffly
and watery, loosey-goosey, like control of
1222
01:56:49,849 --> 01:56:54,570
a lifestyle, who knows what that means? And
that's not to say that that sentence couldn't
1223
01:56:54,570 --> 01:57:00,090
be interpreted differently by people. It certainly
is. But if you're stuck with vague wording,
1224
01:57:00,090 --> 01:57:04,110
try to put some grounding language in it.
So everybody's at least sort of led in the
1225
01:57:04,110 --> 01:57:11,809
same direction with their thought before they
answer the question. Now, I want to also point
1226
01:57:11,809 --> 01:57:15,560
out, you probably have noticed, there's all
these issues, you have to think about when
1227
01:57:15,560 --> 01:57:22,480
doing surveys. There's this other issue called
the lurking variable. Well, you know, lurk
1228
01:57:22,480 --> 01:57:29,139
means to sneak around behind the scenes, right?
Behind the scenes, a lurking variable is a
1229
01:57:29,139 --> 01:57:35,730
variable that's associated with a condition,
but it may not actually cause it. I remember
1230
01:57:35,730 --> 01:57:43,020
when I was studying epidemiology, they talked
about how a lot of people with motorcycle
1231
01:57:43,020 --> 01:57:49,429
accidents, who unfortunately got in motorcycle
accidents, had tattoos. So therefore,
1232
01:57:49,429 --> 01:57:53,679
they said, Nobody should get a tattoo,
you might get in a motorcycle accident.
1233
01:57:53,679 --> 01:57:59,199
Well, that's a great example of a lurking
variable. Yeah, a lot of people who do get
1234
01:57:59,199 --> 01:58:05,170
into motorcycle accidents, have tattoos, but
the tattoos don't cause that. Um, we
1235
01:58:05,170 --> 01:58:10,489
also know that having more education increases
income, but people who have the same education
1236
01:58:10,489 --> 01:58:14,630
level do not all make the same income, there's
this thing, you know, it's called sexism.
1237
01:58:14,630 --> 01:58:21,370
And it's called racism. So it matters whether
you're a woman or a man, it matters, the color
1238
01:58:21,370 --> 01:58:27,249
of your skin. If, you know, you've got
darker skin, it doesn't matter that you have
1239
01:58:27,249 --> 01:58:32,579
the same education as somebody with lighter
skin, you're still gonna make less money.
1240
01:58:32,579 --> 01:58:37,079
And so you have these lurking variables behind
the scenes. So when people are looking at
1241
01:58:37,079 --> 01:58:41,780
Well, why are people you know, making less
income, because they're less educated, whatever?
1242
01:58:41,780 --> 01:58:50,239
Well, you got to look for also the lurking
variables. So current studies show that why
1243
01:58:50,239 --> 01:58:54,369
women and African Americans make less money
on the whole, it's not explained by fewer
1244
01:58:54,369 --> 01:59:01,380
of them working or fewer of them getting degrees.
It's really these lurking variables. And so
1245
01:59:01,380 --> 01:59:07,639
you got to think critically. And I guess what
I would say is, whenever you do a survey,
1246
01:59:07,639 --> 01:59:12,390
if you're studying something that has a lot
of lurking variables associated with it, make
1247
01:59:12,390 --> 01:59:17,929
sure you measure those variables. Like early
studies where they were looking to see if
1248
01:59:17,929 --> 01:59:24,999
drinking a lot of alcohol causes lung cancer.
Some of them forgot to really study how much
1249
01:59:24,999 --> 01:59:31,519
these people would smoke. Because we know
smoking causes lung cancer. And we know if
1250
01:59:31,519 --> 01:59:36,179
you're hanging out in a place with a lot of
drinking and they allow smoking, you'll see
1251
01:59:36,179 --> 01:59:41,119
a lot of people smoking too. They seem to
go hand in hand. So you don't want to miss
1252
01:59:41,119 --> 01:59:46,630
measuring variables that you think might be
lurking variables. It's no problem to measure
1253
01:59:46,630 --> 01:59:54,570
them and not use them later, but just make
sure they're included. So, as a final note
1254
01:59:54,570 --> 02:00:01,499
on bias, I just want to point out that survey
results are so important for healthcare,
1255
02:00:01,499 --> 02:00:07,170
and for the progression of science, that you
really owe it to even the simplest survey to
1256
02:00:07,170 --> 02:00:12,610
think about all of these things, these possible
things that could go wrong, just with the
1257
02:00:12,610 --> 02:00:17,989
wording of questions or with how you're approaching
things, and just really consider how you can
1258
02:00:17,989 --> 02:00:24,449
improve it. It's really important to pay attention
to avoiding bias when you're designing and
1259
02:00:24,449 --> 02:00:31,750
conducting your survey. So think about all
these things at the design phase. Finally,
1260
02:00:31,750 --> 02:00:37,929
I'll get into the last section of this lecture,
which is about randomization, which I think
1261
02:00:37,929 --> 02:00:44,059
a lot of us have heard about. So I'm going
to explain the steps to a completely randomized
1262
02:00:44,059 --> 02:00:50,409
experiment. And after I go through all that,
I'm going to also talk about the concept of
1263
02:00:50,409 --> 02:00:57,770
a placebo and the placebo effect. Then we're
going to briefly touch on blocked randomization,
1264
02:00:57,770 --> 02:01:08,320
and also define for you what is meant by blinding.
So why ever randomize, right? So what randomizing
1265
02:01:08,320 --> 02:01:16,510
is, is when you take a bunch of respondents
or participants in your study, and you randomly
1266
02:01:16,510 --> 02:01:22,719
choose what group they go in. And if you remember,
like I was talking about experiment versus
1267
02:01:22,719 --> 02:01:28,139
observational study, we can't do that in an observational
study. This is definitely an experiment because
1268
02:01:28,139 --> 02:01:30,310
you're telling them what group to go in,
1269
02:01:30,310 --> 02:01:35,050
right. So randomization is used to assign
individuals to treatment groups. And when
1270
02:01:35,050 --> 02:01:38,940
you do that, when you randomly assign them,
not only you're assigning them, but you're
1271
02:01:38,940 --> 02:01:43,480
randomly assigning them, you're not picking,
you know, you're using like dice or some sort
1272
02:01:43,480 --> 02:01:49,690
of random method, and it helps prevent bias in
selecting members for each group. It distributes
1273
02:01:49,690 --> 02:01:53,869
the lurking variables evenly, even if you
don't know about the lurking variables, even
1274
02:01:53,869 --> 02:02:00,580
if you aren't measuring them. By using this
randomization method, they tend to get about equally allocated
1275
02:02:00,580 --> 02:02:09,060
in each group. So just to remind you, how
you actually do that is, first, remember
1276
02:02:09,060 --> 02:02:15,469
the steps to that statistical study, you have
to follow those. And after you get to the
1277
02:02:15,469 --> 02:02:20,610
point where you have ethical approval, that's
when you start doing the data collection step.
1278
02:02:20,610 --> 02:02:25,610
And that's where you start recruiting sample
or, you know, hanging up signs and saying,
1279
02:02:25,610 --> 02:02:30,260
Be in my study, and people come in, and you
see if they qualify, and if they qualify,
1280
02:02:30,260 --> 02:02:36,289
you've got this group of sample, right. And
what you do with those people is you say thank
1281
02:02:36,289 --> 02:02:40,989
you for being in my study. And you measure
the confounders, which is another word for
1282
02:02:40,989 --> 02:02:46,440
lurking variables. You also measure the outcome,
whatever you're trying to study, if you're
1283
02:02:46,440 --> 02:02:50,869
doing a randomized experiment, I know I've
been involved in a lot of these where they're
1284
02:02:50,869 --> 02:02:57,079
studying drugs for lowering blood pressure.
So they'll often have maybe two groups or
1285
02:02:57,079 --> 02:03:02,289
three groups, where they're randomizing people
into, but they don't do that first, the first
1286
02:03:02,289 --> 02:03:05,760
thing to do is get everybody in there and
measure their blood pressure, right? The outcome,
1287
02:03:05,760 --> 02:03:10,530
you know, because they want to know that before,
they are going to take a picture of that before.
1288
02:03:10,530 --> 02:03:15,059
And they also measure confounders, like smoking,
remember, smoking is not good for your blood
1289
02:03:15,059 --> 02:03:20,010
pressure, you know, other things are not good
for your blood pressure, like not exercising,
1290
02:03:20,010 --> 02:03:25,749
we'll measure all of those things. Okay, now,
here's where we get into things. That's when
1291
02:03:25,749 --> 02:03:31,019
the whole randomization happens. So I showed
this picture of a die, but we usually use
1292
02:03:31,019 --> 02:03:36,869
a computer for it. So we got all these people
together. And now you know, randomly, we put
1293
02:03:36,869 --> 02:03:41,540
them in different groups. And in this example,
on the slide, we're just going to pretend
1294
02:03:41,540 --> 02:03:47,079
that there's two groups. And in fact, we can't
really study blood pressure on the slide.
1295
02:03:47,079 --> 02:03:51,540
Because we're going to give one group treatment
and the other group placebo, which is an inactive
1296
02:03:51,540 --> 02:03:57,440
treatment, it's fake, it doesn't work. Of
course, the treatment and the placebo are
1297
02:03:57,440 --> 02:04:02,070
going to look the same to the people taking
it or, you know, we're going to fool them.
1298
02:04:02,070 --> 02:04:06,300
They don't, they won't know. But the reason
why in real life, you can't do that with a
1299
02:04:06,300 --> 02:04:07,670
blood pressure study
1300
02:04:07,670 --> 02:04:08,699
today
1301
02:04:08,699 --> 02:04:13,300
is we know that high blood pressure is really
bad for you. So it's really unethical to give
1302
02:04:13,300 --> 02:04:17,739
someone a placebo, you got to give them some
sort of drug to lower the blood pressure.
1303
02:04:17,739 --> 02:04:22,479
So usually when we do studies like this on
blood pressure, now, new blood pressure drugs,
1304
02:04:22,479 --> 02:04:29,429
Group A is new treatment and Group B is old treatment,
like they usually take a new treatment and
1305
02:04:29,429 --> 02:04:35,099
give it to Group A and an old treatment to Group
B, to see if they can find just a better treatment.
1306
02:04:35,099 --> 02:04:41,119
But if we were talking about something like
Alzheimer's, especially late-stage Alzheimer's,
1307
02:04:41,119 --> 02:04:46,570
there's no treatment. Okay? And so
what's on the slide here, Group A, that gets
1308
02:04:46,570 --> 02:04:52,239
treatment, and Group B, which gets this sham
pill, this placebo, that would be ethical
1309
02:04:52,239 --> 02:04:56,530
then, but let's just cross our fingers that
someday that's not ethical anymore and that
1310
02:04:56,530 --> 02:05:00,440
we do get a treatment right.
1311
02:05:00,440 --> 02:05:01,440
Okay. So
1312
02:05:01,440 --> 02:05:06,739
after you put them in the two groups, what's
sort of missing from the slide is time passes,
1313
02:05:06,739 --> 02:05:11,420
people in Group A take whatever they're supposed
to take their treatment. And in this example,
1314
02:05:11,420 --> 02:05:15,980
on the slide, people in Group B, take the
fake treatment, the placebo, and neither of
1315
02:05:15,980 --> 02:05:21,960
them, you know, usually knows what's happening.
But it takes a while, right. And in the olden
1316
02:05:21,960 --> 02:05:27,409
days before we knew high blood pressure was
bad. These were the study designs. And this
1317
02:05:27,409 --> 02:05:33,420
is what ended up happening is that you would
see, at the beginning where they measured
1318
02:05:33,420 --> 02:05:37,880
the confounders and the outcome, everybody
had high blood pressure, they all look the
1319
02:05:37,880 --> 02:05:43,999
same. But after treatment, Group A would go
down, whereas Group B would go down
1320
02:05:43,999 --> 02:05:50,139
a little bit from placebo effect, which I'll
explain in the next slide. But that's how
1321
02:05:50,139 --> 02:05:55,659
we learned that you can make blood pressure
go down with these different pills. Finally,
1322
02:05:55,659 --> 02:06:02,749
after that time passed, it could be six weeks,
it could be years, however long that took
1323
02:06:02,749 --> 02:06:08,460
after that passed, when it was over, we'd
measure again, the confounders because they
1324
02:06:08,460 --> 02:06:13,400
could have changed. And the outcome, which
in my example, was blood pressure, or, you
1325
02:06:13,400 --> 02:06:20,869
know, how serious someone's Alzheimer's
disease would be, if we were doing that. So
1326
02:06:20,869 --> 02:06:25,960
I promised you on the last slide that I'd talk
to you more about what a placebo is,
1327
02:06:25,960 --> 02:06:32,080
and the placebo effect. I found this great picture
of old placebos from the National Institutes
1328
02:06:32,080 --> 02:06:37,630
of Health. So a placebo is this fake drug
that's given and it's actually kind of hard
1329
02:06:37,630 --> 02:06:44,429
to make placebos. Just imagine a drug you
may need to take, maybe even Excedrin or something
1330
02:06:44,429 --> 02:06:51,039
like that. Imagine we had to study Excedrin. And
we'd have to make a fake Excedrin that tasted
1331
02:06:51,039 --> 02:06:57,719
like it and looked like it. Because otherwise,
the people who are randomized to the placebo
1332
02:06:57,719 --> 02:07:02,829
group would be able to totally tell that they
were in the placebo group, and that's not
1333
02:07:02,829 --> 02:07:09,389
good. So the reason why you need
a placebo is there's this thing called the
1334
02:07:09,389 --> 02:07:16,059
placebo effect. And that occurs when there
is no treatment, but the participant assumes
1335
02:07:16,059 --> 02:07:24,390
she is receiving treatment and responds favorably.
Now, sometimes I talk about one of my favorite
1336
02:07:24,390 --> 02:07:32,190
epidemiologist comedians, Ben Goldacre.
He reported in, I think, one of his
1337
02:07:32,190 --> 02:07:39,500
TED talks about a study where everybody
they enrolled, um, didn't have a disease,
1338
02:07:39,500 --> 02:07:44,570
right, I guess they had a mild disease. And
they told everybody, either they were going
1339
02:07:44,570 --> 02:07:49,800
to give them nothing, or they were going to
give them a pill, that's a placebo, it doesn't
1340
02:07:49,800 --> 02:07:55,600
do anything. Or they're going to give them
an injection. That's a placebo injection,
1341
02:07:55,600 --> 02:08:00,460
it doesn't do anything. And what they found
is of the three groups, the people who got
1342
02:08:00,460 --> 02:08:05,790
the injection did the best, you know,
the fake injection. The people who got the
1343
02:08:05,790 --> 02:08:10,960
fake pill, the placebo pill, did second
best. The people who didn't get anything did
1344
02:08:10,960 --> 02:08:15,849
the worst. And his point is, that's what
the placebo effect is, for some reason, when
1345
02:08:15,849 --> 02:08:21,389
we're getting injected, even with just saline,
we think we're getting some sort of drug and
1346
02:08:21,389 --> 02:08:28,190
it psychologically, or however, affects our
bodies. The same thing when we're taking a
1347
02:08:28,190 --> 02:08:36,979
pill. I don't know if you've ever seen kids,
you know, saying, Oh, I need medicine, I need medicine.
1348
02:08:36,979 --> 02:08:40,070
And then the parent gives them an M&M,
right, they think it's a pill, they're happy
1349
02:08:40,070 --> 02:08:45,789
with it. But actually, the placebo effect
can cause real effects on your health, it
1350
02:08:45,789 --> 02:08:51,349
can make you feel better just because you
think you're taking a drug. And so that's
1351
02:08:51,349 --> 02:08:57,440
why it's super important to include a placebo
group, if you don't have a comparison group,
1352
02:08:57,440 --> 02:09:03,110
like I described with blood pressure,
in all your studies, because if you just have
1353
02:09:03,110 --> 02:09:07,860
one group where they're taking it, they'll
all say it's good. They would say it's good
1354
02:09:07,860 --> 02:09:14,469
if it was water, right. So the placebo is
given to what's called a control group, and
1355
02:09:14,469 --> 02:09:18,789
they receive the placebo. Now, if you're studying
like acupuncture, you can't really give a
1356
02:09:18,789 --> 02:09:24,499
placebo acupuncture. So what they'll do is
they'll sort of hang up a little curtain
1357
02:09:24,499 --> 02:09:31,940
and kind of tap you and you don't know whether
you're getting real or what's called sham acupuncture.
1358
02:09:31,940 --> 02:09:36,120
Other things have to happen like that when
you're studying these interventions
1359
02:09:36,120 --> 02:09:42,699
that aren't pills. Those are called attention
controls, right? Where we have like a sham
1360
02:09:42,699 --> 02:09:48,190
acupuncture. So in any case, you've got to
think about this because you need a control or
1361
02:09:48,190 --> 02:09:55,690
comparison group that's fair whenever you're
testing, in a randomized experiment,
1362
02:09:55,690 --> 02:10:00,300
a new thing.
1363
02:10:00,300 --> 02:10:05,920
I promised you I'd talk a little bit about blocked
randomization, I won't get much into it. But
1364
02:10:05,920 --> 02:10:11,060
sometimes when you go to randomize, right,
you know, you get this whole group of people,
1365
02:10:11,060 --> 02:10:15,250
they're all about the same, but you're gonna
split them into a group A and Group B, one's
1366
02:10:15,250 --> 02:10:20,199
gonna get maybe a drug and the others maybe
gonna get the placebo. Sometimes you get worried
1367
02:10:20,199 --> 02:10:25,889
that the groups are going to be unbalanced
with respect to a particular lurking variable.
1368
02:10:25,889 --> 02:10:29,789
In blood pressure, we'd always care about
smoking, we want an equal number of smokers
1369
02:10:29,789 --> 02:10:35,920
in each group. You know, a lot of times we
care about gender, we want equal amounts
1370
02:10:35,920 --> 02:10:40,520
of men and women in each group. So if you're
worried about that, with randomization, you
1371
02:10:40,520 --> 02:10:45,059
can't just do it one at a time, because you
might just randomly put too many men in one
1372
02:10:45,059 --> 02:10:52,059
group. So what you have to do is block randomization.
So see, I drew all these blocks on
1373
02:10:52,059 --> 02:10:57,550
the screen, and you'll see that there's nobody
in them, they're just blank, I just put xxx.
1374
02:10:57,550 --> 02:11:03,469
So this is before you do your study, you have
these blank blocks. And what you do is as
1375
02:11:03,469 --> 02:11:06,999
you enroll those people remember you have
to measure them and make sure that they qualify
1376
02:11:06,999 --> 02:11:13,599
for the study, as you get them in, you can
just write them in the blocks, right. So here,
1377
02:11:13,599 --> 02:11:18,909
I just put their fake initials, you know,
so let's say that XYZ came in first, that's
1378
02:11:18,909 --> 02:11:25,420
a woman, and then maybe NSW came in, and that's
another woman, you just keep putting the women
1379
02:11:25,420 --> 02:11:30,239
there. And then when the men come in, you
put them in, and you fill up the blocks, then
1380
02:11:30,239 --> 02:11:37,079
here's a trick, you actually randomize the
entire blocks, right? So block one and block
1381
02:11:37,079 --> 02:11:42,889
three ended up in Group A, and like magic,
you've got equal men and women there. And
1382
02:11:42,889 --> 02:11:49,510
then Group B equal men and women. And so that's
how you do it with blocks. But you know, there's
1383
02:11:49,510 --> 02:11:54,440
some limitation to this, like, if you get
multiple races in your study, maybe, you know,
1384
02:11:54,440 --> 02:11:59,889
four or five racial groups. If you make a
five block, you've got to fill up the whole
1385
02:11:59,889 --> 02:12:05,900
block before you randomize it. And, you know,
sometimes you're in an area where certain
1386
02:12:05,900 --> 02:12:10,869
racial groups are rare. And you might have
trouble filling up your blocks. So there's
1387
02:12:10,869 --> 02:12:14,650
some limitations of this too.
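
The block randomization idea just described can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the participants and initials are made up, each block is assumed to already hold two women and two men, and the function name randomize_blocks is hypothetical, not something from the lecture.

```python
import random

def randomize_blocks(blocks, seed=None):
    """Assign whole blocks (not individuals) to Group A or Group B, half each."""
    rng = random.Random(seed)
    order = list(range(len(blocks)))
    rng.shuffle(order)                      # random order of block indices
    half = len(order) // 2
    groups = {"A": [], "B": []}
    for position, block_index in enumerate(order):
        group = "A" if position < half else "B"
        groups[group].extend(blocks[block_index])
    return groups

# Hypothetical filled-in blocks: (initials, sex), two women and two men per block.
blocks = [
    [("XYZ", "F"), ("NSW", "F"), ("ABC", "M"), ("DEF", "M")],
    [("GHI", "F"), ("JKL", "F"), ("MNO", "M"), ("PQR", "M")],
    [("STU", "F"), ("VWX", "F"), ("YZA", "M"), ("BCD", "M")],
    [("EFG", "F"), ("HIJ", "F"), ("KLM", "M"), ("NOP", "M")],
]

groups = randomize_blocks(blocks, seed=1)
for name, members in groups.items():
    women = sum(1 for _, sex in members if sex == "F")
    men = sum(1 for _, sex in members if sex == "M")
    print(name, "women:", women, "men:", men)
```

Because every block is balanced before it is assigned, each group ends up with equal numbers of women and men no matter which blocks land where. The limitation mentioned above shows up here too: a block cannot be randomized until all of its slots are filled.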
1388
02:12:14,650 --> 02:12:16,070
Now,
1389
02:12:16,070 --> 02:12:21,880
I had mentioned the situation where,
if you're going to do an experiment,
1390
02:12:21,880 --> 02:12:26,249
right, not an observational study, experiment.
And you're going to randomize people either
1391
02:12:26,249 --> 02:12:33,540
to a drug or some sort of intervention versus
placebo, or a drug versus another drug, an
1392
02:12:33,540 --> 02:12:39,540
old drug, you really don't want them to know
what group they're in. I mean, because you
1393
02:12:39,540 --> 02:12:42,429
have to be ethical, before they enter the
study, you have to tell them, you're gonna
1394
02:12:42,429 --> 02:12:46,210
put them in one of two groups,
but you got to tell them, you're not going
1395
02:12:46,210 --> 02:12:52,170
to know what group you're in while it's going
on. So blinding is where any person
1396
02:12:52,170 --> 02:12:58,269
is deliberately not told of the treatment
assignment. So he or she is not biased in
1397
02:12:58,269 --> 02:13:04,020
reporting study information. And it actually
doesn't have to just be the participant in
1398
02:13:04,020 --> 02:13:09,760
the study; it can be researchers. Like, the
most common one is a participant is blinded
1399
02:13:09,760 --> 02:13:16,249
to treatment or placebo. But I've been in
studies, or I've worked on studies, of
1400
02:13:16,249 --> 02:13:22,999
like Alzheimer's disease, right? Where
they want to take the patients, or the participants
1401
02:13:22,999 --> 02:13:29,909
in the study who might have Alzheimer's disease,
and look at their image, the MRI of their
1402
02:13:29,909 --> 02:13:37,790
head. And often, they'll also have a neurologist
interview them; they'll also see a neuropsychologist.
1403
02:13:37,790 --> 02:13:41,989
And they often want those three different
groups, the imaging group, the neuropsychology
1404
02:13:41,989 --> 02:13:48,150
group and the neurology group, not to know
about each other's opinion of this particular
1405
02:13:48,150 --> 02:13:55,469
patient. So they'll blind them to each other's
opinion. So blinding can be much more complicated
1406
02:13:55,469 --> 02:14:00,449
than just blinding the participant to whether
or not they're in placebo, or they're in drug
1407
02:14:00,449 --> 02:14:07,820
group. But double blind is a really important
concept. And that means that both the participant
1408
02:14:07,820 --> 02:14:13,440
and the study staff do not know the treatment
assignment. So everybody who's operating with
1409
02:14:13,440 --> 02:14:18,249
the patient doesn't know it. So you're probably
thinking that's really pretty serious, right?
1410
02:14:18,249 --> 02:14:23,360
Like, what if that person gets sick, and goes
to the emergency room, and they're taking
1411
02:14:23,360 --> 02:14:27,340
an experimental drug or they could be taking
placebo? Who knows what they're taking? Well,
1412
02:14:27,340 --> 02:14:33,280
in that case, what happens is there's an unblinding
procedure, there just has to be as part of
1413
02:14:33,280 --> 02:14:39,460
ethics. It's already set up in the study.
If somebody goes to the emergency room, there's
1414
02:14:39,460 --> 02:14:46,369
a person that can be called to unblind the
participant, who's now a patient,
1415
02:14:46,369 --> 02:14:50,360
and once they're unblinded, they learn what
they were taking. Even if they were taking
1416
02:14:50,360 --> 02:14:55,479
placebo, the whole thing's over. Right? Even
the study staff work. It's just a fact of
1417
02:14:55,479 --> 02:15:00,090
life. It has to happen sometime. But for the
most part, what we try to do is keep the
1418
02:15:00,090 --> 02:15:07,310
study double blind, because it makes things
the least biased and the most fair. So to end
1419
02:15:07,310 --> 02:15:12,010
the session on randomization, the purpose
of randomization, why we go through all this
1420
02:15:12,010 --> 02:15:17,909
when we're testing treatments, especially,
is that it's used to reduce bias. And especially
1421
02:15:17,909 --> 02:15:22,960
if you have a particular variable you're concerned
about like gender, like we were talking about
1422
02:15:22,960 --> 02:15:28,729
race, or smoking, smoking status, you can
use a block randomization to even out each
1423
02:15:28,729 --> 02:15:33,940
group. And then blinding further prevents
bias, right? Because people don't know what
1424
02:15:33,940 --> 02:15:38,530
they're taking, and the study staff don't know
what they're giving them. And the reason why
1425
02:15:38,530 --> 02:15:42,940
you have to really think about blinding is
the placebo effect is necessary to take into
1426
02:15:42,940 --> 02:15:47,510
account, you're always going to get the placebo
effect every time you give somebody something.
1427
02:15:47,510 --> 02:15:54,909
So you've got to account for that in your
study design. So in conclusion, I went over
1428
02:15:54,909 --> 02:15:59,409
the steps to conducting a statistical study
in order and kind of gave you tips on how
1429
02:15:59,409 --> 02:16:04,949
to remember them. We looked at some basic terms
and definitions. And we talked about how to
1430
02:16:04,949 --> 02:16:10,710
avoid bias in survey design, because there's
a lot of different considerations. And finally,
1431
02:16:10,710 --> 02:16:17,360
we talked more in depth, specifically
about randomization in experiments. All right.
1432
02:16:17,360 --> 02:16:22,640
Now, you know, a lot, maybe too much. I hope
you enjoyed my lecture.
1433
02:16:22,640 --> 02:16:31,349
Hi, it's me again, Monica Wahi, your
statistics lecturer from Labouré College.
1434
02:16:31,349 --> 02:16:37,139
Now we're going to go back and cover what
I didn't cover in the last lecture about chapter
1435
02:16:37,139 --> 02:16:45,529
2.1, which are frequency histograms and distributions.
So here are your learning objectives for this
1436
02:16:45,530 --> 02:16:50,110
lecture. So at the end of this lecture, you
should be able to state the steps for drawing
1437
02:16:50,110 --> 02:16:55,330
a frequency histogram, you should also be
able to name two types of distributions and
1438
02:16:55,330 --> 02:17:00,650
explain how they look, you should be able
to define what an outlier is, and say one
1439
02:17:00,650 --> 02:17:07,049
reason why you would make a frequency histogram.
Finally, you should be able to define what
1440
02:17:07,049 --> 02:17:14,309
a relative frequency is and what a cumulative
frequency is. Okay, so let's get started.
1441
02:17:14,309 --> 02:17:19,089
First, we're going to review frequency histograms
and relative frequency histograms. So you'll
1442
02:17:19,090 --> 02:17:24,850
figure out what I'm talking about there. Then
we're going to go over five common distributions
1443
02:17:24,850 --> 02:17:29,751
in statistics, so you know what that's all
about. And then I'm going to talk about outliers.
1444
02:17:29,751 --> 02:17:35,820
Now, you'll notice I have a lot of pictures
in this presentation of skylines. And the
1445
02:17:35,820 --> 02:17:43,730
reason why is they remind me of histograms.
So let's talk about what is a frequency histogram.
1446
02:17:43,730 --> 02:17:51,260
So a frequency histogram is important in statistics,
because, as you'll see, you need to make one
1447
02:17:51,260 --> 02:17:56,299
in order to see what the distribution is.
So I'm going to first explain what one
1448
02:17:56,299 --> 02:18:00,840
is, like, show you what one looks like. And
then I'll explain how to make one. And then
1449
02:18:00,841 --> 02:18:05,450
I'll explain the relative frequency histogram.
And then we'll move on to looking at why do
1450
02:18:05,450 --> 02:18:12,020
we need that for distributions. So here's
another skyline because it looks like a histogram
1451
02:18:12,020 --> 02:18:17,889
to me. So what is a frequency histogram? Well,
it's actually a specific type of bar chart.
1452
02:18:17,889 --> 02:18:23,468
And it's made from data in a frequency table.
So you might see a frequency histogram and
1453
02:18:23,468 --> 02:18:28,029
go, well, that looks like a boring old bar
graph. Well, it's not just any old bar graph,
1454
02:18:28,030 --> 02:18:32,840
it's got specific properties that I'm going
to talk to you about in this lecture. Okay.
1455
02:18:32,840 --> 02:18:38,070
Both frequency histograms and relative frequency
histograms are bar charts, but they're special
1456
02:18:38,070 --> 02:18:43,790
bar charts that have to be done a certain
way. And why? Because if they're done that
1457
02:18:43,790 --> 02:18:48,509
way, and they're histograms, they will reveal
the distribution of the data, which I'll explain
1458
02:18:48,510 --> 02:18:58,020
later. So here is a frequency table, we had
this before. This was of those fake patient
1459
02:18:58,020 --> 02:19:03,710
transport miles, right. So you'll notice here
were the class limits, and then we put in
1460
02:19:03,710 --> 02:19:08,819
the frequency and we even threw in this relative
frequency. Okay, so this is the frequency
1461
02:19:08,820 --> 02:19:13,360
table I'm going to use as a demonstration
for how you make a frequency histogram, you
1462
02:19:13,360 --> 02:19:20,820
first need a frequency table. Okay, now, here's
the histogram version of what's in that frequency
1463
02:19:20,820 --> 02:19:27,650
table. So I'm going to annotate this one image
to explain the order in which you draw it
1464
02:19:27,650 --> 02:19:33,389
basically by hand. So the first thing you
do is draw this vertical line for the y axis,
1465
02:19:33,389 --> 02:19:36,449
okay, you just draw a line.
1466
02:19:36,450 --> 02:19:38,709
Next, you write
1467
02:19:38,709 --> 02:19:46,949
words next to the line, and you always start
with "frequency of", and then whatever. In our
1468
02:19:46,950 --> 02:19:52,080
example, it was patients, okay. And I'm telling
you, you need to do it in this order, or you'll
1469
02:19:52,080 --> 02:19:58,280
get confused. So you start with that first
line, and then you write this frequency. Okay.
1470
02:19:58,280 --> 02:20:02,910
Next, you draw the whole horizontal line for
the x axis,
1471
02:20:02,910 --> 02:20:04,210
okay.
1472
02:20:04,210 --> 02:20:12,200
And then after that you write the classes
below. Remember, like the lowest class is
1473
02:20:12,200 --> 02:20:16,740
one to eight, that's a lower class and an
upper class limit of the lowest class, like
1474
02:20:16,740 --> 02:20:22,300
you literally write those labels in. And why
do I, why am I so freaking out about this
1475
02:20:22,300 --> 02:20:28,050
order is because I totally get confused if
I do not do this y axis first. Because then
1476
02:20:28,050 --> 02:20:32,580
all there's all these numbers. And it's totally
confusing. So just try to do it in this order.
1477
02:20:32,580 --> 02:20:41,510
Okay. Now, number six, I had to flip the slide
here. Okay, at step six, you've drawn, like, the
1478
02:20:41,510 --> 02:20:46,690
basic background, you've got the x and y axis
and those labels. So now you have to start
1479
02:20:46,690 --> 02:20:50,921
drawing in the bars. So for your first bar,
you look at the first class, and you find
1480
02:20:50,921 --> 02:20:56,340
the frequency on the table, which I think
it was 14 or something. And so you look for
1481
02:20:56,340 --> 02:21:04,750
it on the y axis, and you want to label the
y axis so that the maximum one is incorporated
1482
02:21:04,750 --> 02:21:10,990
in it, like you see our maximum is above 20.
So we wouldn't want to end our Y axis at 20,
1483
02:21:10,990 --> 02:21:16,280
or 15, or something, you have to make it bigger,
so you can put everybody on there. But our
1484
02:21:16,280 --> 02:21:22,650
first one was what, 14? So we draw this
horizontal line around the 14, right there,
1485
02:21:22,650 --> 02:21:27,040
that horizontal line, because we're gonna
make that first bar,
1486
02:21:27,040 --> 02:21:28,040
then
1487
02:21:28,040 --> 02:21:31,530
you draw the two vertical lines down, and
you position it over where you labeled the
1488
02:21:31,530 --> 02:21:39,780
class. And that makes the bar. And then you
actually color in the bar, and you
1489
02:21:39,780 --> 02:21:44,561
repeat this for each class, right? So you
go, that's why I labeled the classes first
1490
02:21:44,561 --> 02:21:48,960
on the x axis just to make sure everything
is even. And then I go through and I make
1491
02:21:48,960 --> 02:21:54,370
all the bars. And again, this is why you need
to prepare your frequency table first. So
1492
02:21:54,370 --> 02:22:02,320
you know how to graph it, you know what to
put on this graph. Okay, this is the relative
1493
02:22:02,320 --> 02:22:07,272
frequency histogram, you already understand
what relative frequency is, right? It's that
1494
02:22:07,272 --> 02:22:14,181
proportion, the proportion of your sample
that's in each class. And so the change, if
1495
02:22:14,181 --> 02:22:17,541
you're going to do a relative frequency histogram,
you basically go through the same steps, it's
1496
02:22:17,541 --> 02:22:24,601
just you're changing what's on the y axis,
you change what you label it, okay? But the
1497
02:22:24,601 --> 02:22:30,620
x axis stays the same. And even though
you're charting the relative frequencies,
1498
02:22:30,620 --> 02:22:35,410
like, you'll be like, Okay, this is a totally
different number, what you'll see is the pattern
1499
02:22:35,410 --> 02:22:40,760
ends up being the same. So it takes on the
same pattern, and the pattern is actually
1500
02:22:40,760 --> 02:22:45,681
what we're going after, that's the thing I'm
going to talk about with the distribution.
1501
02:22:45,681 --> 02:22:50,750
And so I tend to prefer since the pattern
is going to come out the same, I tend to prefer
1502
02:22:50,750 --> 02:22:56,710
using a relative frequency histogram, versus
a frequency histogram. Because if I have two
1503
02:22:56,710 --> 02:23:02,351
different groups, like let's say, there were
two hospitals, and I gathered two sets of
1504
02:23:02,351 --> 02:23:09,110
data, and I wanted to compare the miles transported,
then I could use this relative frequency histogram,
1505
02:23:09,110 --> 02:23:16,351
and not only would the patterns be evident,
but I could compare them fairly, like whatever's
1506
02:23:16,351 --> 02:23:23,010
35, you know, 0.35 or 35%, in
this, even if the other hospital maybe had
1507
02:23:23,010 --> 02:23:30,330
tons more transports, I could see it as like
35%. And I could really compare the percent,
1508
02:23:30,330 --> 02:23:34,970
right. So that's why I lean towards relative
frequency histogram. But ultimately, you're
1509
02:23:34,970 --> 02:23:43,771
going to get the same pattern on your histogram,
whether you use frequency or relative frequency.
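
To make the comparison point concrete, here is a minimal Python sketch. The class frequencies for the two hospitals are invented for illustration (they are not the patient transport table from the slide), and the helper name relative_frequencies is an assumption.

```python
def relative_frequencies(frequencies):
    """Convert class frequencies to proportions of the whole sample."""
    total = sum(frequencies)
    return [f / total for f in frequencies]

# Made-up class frequencies for two hospitals using the same five classes.
hospital_small = [14, 21, 11, 8, 6]        # 60 transports in total
hospital_large = [140, 210, 110, 80, 60]   # 600 transports in total

rel_small = relative_frequencies(hospital_small)
rel_large = relative_frequencies(hospital_large)

# Crude text histogram: one '#' per 2% of the sample in each class.
for class_number, (a, b) in enumerate(zip(rel_small, rel_large), start=1):
    print(f"class {class_number}: {'#' * round(a * 50):<20} {'#' * round(b * 50)}")
```

The raw frequencies differ by a factor of ten, but the relative frequencies, and so the bar pattern, come out the same, which is why the relative version is the fairer way to compare groups of different sizes.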
1510
02:23:43,771 --> 02:23:49,630
So again, another picture of a skyline. So
you can see why I think of skylines because
1511
02:23:49,630 --> 02:23:54,500
they look like histograms, right? So after
making a frequency table, which you do with
1512
02:23:54,500 --> 02:23:58,940
quantitative data, right? Because you're trying
to organize it, it's also important to then
1513
02:23:58,940 --> 02:24:04,141
make a frequency histogram and/or relative
frequency histogram. And why? It's because
1514
02:24:04,141 --> 02:24:08,990
it reveals a distribution. And now, that's
what we're going to talk about. We're going
1515
02:24:08,990 --> 02:24:14,421
to talk about distributions. So first, I'm
going to define what I'm talking about with
1516
02:24:14,421 --> 02:24:18,190
the distribution. And now you're gonna see
a lot of other kinds of pictures like this
1517
02:24:18,190 --> 02:24:23,860
on the right, see that that shape? That's
one of our distributions, okay. And so that's
1518
02:24:23,860 --> 02:24:28,480
a little prequel to what I'm going to say.
So first, we're going to talk about what these
1519
02:24:28,480 --> 02:24:34,860
distributions are. Then I'm going to describe
what an outlier is, and how you can detect
1520
02:24:34,860 --> 02:24:40,920
them by using histograms. Finally, I'm going
to wrap it up by explaining what cumulative
1521
02:24:40,920 --> 02:24:44,590
frequency is and what an ogive is.
1522
02:24:44,590 --> 02:24:50,970
Okay, so what is this distribution thing I
keep talking about? Well, it's actually just
1523
02:24:50,970 --> 02:24:57,670
a shape. It's the shape that is made if you
draw a line along the edges of the histogram's
1524
02:24:57,670 --> 02:25:05,830
bars. So on the left, you see I drew the scribbly
shape. But you'll notice you can do it with
1525
02:25:05,830 --> 02:25:10,690
a stem and leaf too. This is not the same
data graphed on the right in the stem and
1526
02:25:10,690 --> 02:25:15,521
leaf. I'm just using, you know, recycling
the old picture that I used before. But you
1527
02:25:15,521 --> 02:25:23,271
see, you can do the same thing, drawing that squiggly
line, you know. And that's actually the distribution.
1528
02:25:23,271 --> 02:25:26,920
I mean, they don't all look exactly like that.
But that's what you do is you draw this line
1529
02:25:26,920 --> 02:25:33,400
thing. I know, it's kind of odd that that's
what a distribution is, is just a shape. But
1530
02:25:33,400 --> 02:25:39,820
there's actually five of them that we use
a lot. There's way more than five, actually,
1531
02:25:39,820 --> 02:25:44,410
in statistics, but you have to get into kind
of higher level statistics to care about those,
1532
02:25:44,410 --> 02:25:50,391
we're only going to concentrate on these five.
Okay. So the first one is called normal distribution.
1533
02:25:50,391 --> 02:25:55,760
And it's called that everywhere, except I
noticed the book calls that a mound-shaped symmetrical
1534
02:25:55,760 --> 02:26:01,740
distribution, but I'm going to call it a normal
distribution. And there's nothing really normal
1535
02:26:01,740 --> 02:26:07,261
about it, it's just named that for some reason.
And then there's a uniform distribution, skewed
1536
02:26:07,261 --> 02:26:12,811
left distribution, skewed right distribution,
and bimodal distribution. So those are the
1537
02:26:12,811 --> 02:26:18,830
five we're going to cover. So let's start
here with the normal distribution. So as you
1538
02:26:18,830 --> 02:26:23,501
can see, on the right, somebody made a histogram.
And then they drew that squiggly line. Well,
1539
02:26:23,501 --> 02:26:27,811
actually, it was me who made this histogram
and drew the squiggly line. And notice the
1540
02:26:27,811 --> 02:26:32,141
squiggly line, what it looks like, it kind
of looks like what the book called it, it's
1541
02:26:32,141 --> 02:26:38,351
mound shaped and symmetrical. But that's the
shape of the normal distribution, it looks
1542
02:26:38,351 --> 02:26:43,990
like that it's got kind of hokey things on
the side, and, and a mound in the middle.
1543
02:26:43,990 --> 02:26:48,170
And if that's what your histogram ends up
looking like, where it's kind of like a little
1544
02:26:48,170 --> 02:26:54,110
mountain like that, then you've got a normal
distribution. Okay, let's look at a different
1545
02:26:54,110 --> 02:26:58,790
histogram. Okay? In this histogram, you'll
notice that like, each of the bars, each of
1546
02:26:58,790 --> 02:27:04,040
the frequencies is almost like the same, right?
It's either five or six. And it doesn't matter
1547
02:27:04,040 --> 02:27:10,331
what class we're talking about. When it's
like that, the little line you draw across,
1548
02:27:10,331 --> 02:27:16,370
it's not squiggly at all, it's straight. I
don't see this very often in healthcare data.
1549
02:27:16,370 --> 02:27:20,830
But it does happen in other kinds of data
more frequently. And this is called the uniform
1550
02:27:20,830 --> 02:27:26,290
distribution, which makes sense: almost
all of these bars are a uniform height. So
1551
02:27:26,290 --> 02:27:32,761
that's what a uniform distribution is. Okay,
now, this is one kind of like the one we were
1552
02:27:32,761 --> 02:27:37,931
looking at before, where it looks kind of
like a slide like at a playground, where,
1553
02:27:37,931 --> 02:27:42,740
you know, like, you climb up the right side,
and then you slide down to the left side.
1554
02:27:42,740 --> 02:27:48,650
Okay? And whenever it's like that, where
it's low on one side and high on the other,
1555
02:27:48,650 --> 02:27:56,650
it's called skewed. The problem is, which
way is it skewed? Right? And how I remember
1556
02:27:56,650 --> 02:28:03,090
which way to say it's skewed is: it's skewed
where it's light, or short. So here, I would
1557
02:28:03,090 --> 02:28:08,650
say it's light on the left. So it's skewed
left, right? Because on the left side, it's
1558
02:28:08,650 --> 02:28:12,400
really the bars are all short. And then you
can just imagine what's going to come next
1559
02:28:12,400 --> 02:28:18,621
here? Well, look at this, this is skewed
right, because it's light on the right. It's
1560
02:28:18,621 --> 02:28:24,660
short on the right. So it's skewed right.
So technically, I mean, both of them are just
1561
02:28:24,660 --> 02:28:29,030
skewed distributions. I just like to
explain them separately. Because sometimes
1562
02:28:29,030 --> 02:28:33,460
people don't know which way to say, left
or right. And this is how I remember: light
1563
02:28:33,460 --> 02:28:43,280
on the left, light on the right. Finally,
we have bimodal. Now, the word mode in some
1564
02:28:43,280 --> 02:28:51,561
areas of statistics, and in engineering
and stuff often means like a high point. And
1565
02:28:51,561 --> 02:29:00,811
bimodal means two high points. So as you
can see, it looks like a camel with two humps.
1566
02:29:00,811 --> 02:29:07,460
And it's a little hard sometimes to tell
bimodal from normal. Because if you remember
1567
02:29:07,460 --> 02:29:12,730
normal, like let's say you have a normal distribution,
but you just have one little
1568
02:29:12,730 --> 02:29:17,791
bar kind of in the middle, you're
like, is this bimodal, or is this normal?
1569
02:29:17,791 --> 02:29:24,610
How I coach people to see if it's
bimodal is if there's a really big space between
1570
02:29:24,610 --> 02:29:30,182
the two humps. That's not so apparent on this
image here. But you'll see class three and
1571
02:29:30,182 --> 02:29:35,230
class four, they're both short. If only one
of them was short, I might have called
1572
02:29:35,230 --> 02:29:40,410
it a normal distribution. But I've really
seen bimodal distributions when it comes
1573
02:29:40,410 --> 02:29:47,550
to like lab data, because my best friend is
a pathologist, and he'll show me you know,
1574
02:29:47,550 --> 02:29:51,990
with situations where people have like really
super high platelet counts, and then like
1575
02:29:51,990 --> 02:29:56,830
no platelets practically and there's nothing
in the middle. And that's where you'll see
1576
02:29:56,830 --> 02:30:04,340
a bimodal distribution. Now we're gonna talk
about outliers. And outliers are data values
1577
02:30:04,340 --> 02:30:09,330
that are, quote very different from other
measurements in the data. What's very different,
1578
02:30:09,330 --> 02:30:15,240
right? Like it's an opinion. But people in
statistics come up with different formulas
1579
02:30:15,240 --> 02:30:19,701
to try and figure out if something is very
different from the other measurements. And
1580
02:30:19,701 --> 02:30:25,160
we'll talk about that, actually, in later
chapters in the class, not so much for identifying
1581
02:30:25,160 --> 02:30:30,610
outliers, but just to better understand
our distributions. But just as a quick and
1582
02:30:30,610 --> 02:30:36,760
dirty representation of what would be an obvious
outlier, like one nobody would disagree on,
1583
02:30:36,760 --> 02:30:41,341
is this histogram here. So you'll notice I
just threw down nine classes, I made up this
1584
02:30:41,341 --> 02:30:45,801
data. But you'll see in class two and class
three, there's just like nothing, and there's
1585
02:30:45,801 --> 02:30:50,240
nothing in class eight. And
then suddenly, there's something in class
1586
02:30:50,240 --> 02:30:53,521
one and something in class nine. And when
you have these big gaps, this is kind of like
1587
02:30:53,521 --> 02:30:57,920
that platelets thing, like I was telling you about,
only this maybe would be, you know, you would
1588
02:30:57,920 --> 02:31:01,061
say this is trimodal, like there's three
modes, but there's not really three modes,
1589
02:31:01,061 --> 02:31:06,240
right? There's a wacky low one and a wacky
high one, and everything else is in the middle.
1590
02:31:06,240 --> 02:31:12,061
So because that one in class one, and that
one in class nine, they're so far away from
1591
02:31:12,061 --> 02:31:18,601
what's in the middle, like just about every
statistician would agree, these are both outliers.
1592
02:31:18,601 --> 02:31:25,580
But you can just imagine how much we argue
about what actually is an outlier. It's especially
1593
02:31:25,580 --> 02:31:32,580
hard when you're getting data on weight of
people. Some people really do weigh 400, 500,
1594
02:31:32,580 --> 02:31:40,080
maybe even 600 pounds, you don't know if they're
really outliers, or data mistakes, or what
1595
02:31:40,080 --> 02:31:44,851
to do with them. They're real people. And
maybe they have really high weights. And unfortunately,
1596
02:31:44,851 --> 02:31:51,480
some of them have really low weights too.
So one of the main points of doing the
1597
02:31:51,480 --> 02:31:58,730
histogram is not only to look for these distributions,
but also to see if you've got any super obvious
1598
02:31:58,730 --> 02:32:03,851
outliers that you're just gonna have to think
about before you proceed with your analysis.
1599
02:32:03,851 --> 02:32:11,710
Now, I'm going to talk to you about what cumulative
frequency means, you know, the word accumulate
1600
02:32:11,710 --> 02:32:16,641
means to just like keep accumulating things
like if you have a gutter on your house, it
1601
02:32:16,641 --> 02:32:20,851
will accumulate leaves, like old leaves will
sit there and new leaves will keep coming
1602
02:32:20,851 --> 02:32:25,450
and the old ones will still be there, until
it like totally clogs your gutter, and you
1603
02:32:25,450 --> 02:32:30,891
have to clean it. So that's what cumulative
frequency is, is where it accumulates all
1604
02:32:30,891 --> 02:32:34,870
the frequencies. So you see on the slide,
you know, in the first class, one to eight,
1605
02:32:34,870 --> 02:32:38,880
we had a frequency of 14. So your cumulative
frequency, those are like the leaves at the
1606
02:32:38,880 --> 02:32:45,081
very beginning of the season, that's all
you got is 14. But when you add on the next
1607
02:32:45,081 --> 02:32:51,280
class, 21, now you add to the cumulative frequency,
it accumulates, you add that 21 to the 14,
1608
02:32:51,280 --> 02:32:57,190
and now you've got 35. And if you can extrapolate
as you walk up all these classes, eventually
1609
02:32:57,190 --> 02:33:03,851
you get to the total, right. And so yeah,
so that's what you got. And the first class's
1610
02:33:03,851 --> 02:33:08,450
cumulative frequency is always the same as its frequency, and each
cumulative frequency is equal to or higher
1611
02:33:08,450 --> 02:33:10,101
than the last one.
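The accumulating described above can be sketched in a few lines of Python. This is only an illustration: the 14 and 21 are the two frequencies mentioned in the lecture, and the remaining four frequencies are assumed values added so the running total has somewhere to go.

```python
from itertools import accumulate

# Class frequencies, lowest class first.
# 14 and 21 come from the lecture; the other values are assumed for illustration.
frequencies = [14, 21, 11, 6, 4, 4]

# Each cumulative frequency is the sum of this class and every class before it.
cumulative = list(accumulate(frequencies))

for freq, cum in zip(frequencies, cumulative):
    print(f"frequency = {freq:>2}   cumulative frequency = {cum:>2}")
```

With these numbers, the first cumulative frequency equals the first frequency, the second is 14 + 21 = 35, and the last one equals the total of all the frequencies.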
1612
02:33:10,101 --> 02:33:15,971
I'll have to say in healthcare, we don't really
use cumulative frequency a whole lot, you'll
1613
02:33:15,971 --> 02:33:22,090
see it but we are really into relative frequency,
I'll just tell you that. But some groups are
1614
02:33:22,090 --> 02:33:28,420
into cumulative frequency and those who are,
they like to plot it in a plot called an ogive.
1615
02:33:28,420 --> 02:33:32,920
And again, I'll be honest, in healthcare,
I've never seen an ogive that was just
1616
02:33:32,920 --> 02:33:37,290
in the scientific literature, which is why
you'll see this is about NFL team salaries,
1617
02:33:37,290 --> 02:33:41,710
because I think they use it a lot more in
economics. But at any rate, what you'll see
1618
02:33:41,710 --> 02:33:46,750
is that the classes are along the x axis,
you know, you're used to that, because that's
1619
02:33:46,750 --> 02:33:52,170
what we do in a frequency histogram. But along
the y axis, you see these numbers called cumulative
1620
02:33:52,170 --> 02:33:58,170
frequency. And you just graph it, right, but
one of the things you'll just notice is that
1621
02:33:58,170 --> 02:34:03,670
it's going to go up, like each one is going
to either, unless you have a class with zero
1622
02:34:03,670 --> 02:34:07,160
in it, it's going to stay the same for that
one. But otherwise, it's just going to keep
1623
02:34:07,160 --> 02:34:11,260
going up. So you'll always see some sort of
shape like this, where it's always going up
1624
02:34:11,260 --> 02:34:21,940
and it hits the top. At the end, it hits the
total cumulative frequency at the end. So,
1625
02:34:21,940 --> 02:34:26,830
just to review, there are five main types
of distributions used in statistics. And I
1626
02:34:26,830 --> 02:34:31,771
emphasize main; there's other ones, but these
are the ones we're going to look at. And so
1627
02:34:31,771 --> 02:34:35,580
that's why we were doing our histograms and
our stem-and-leaf displays is we were looking
1628
02:34:35,580 --> 02:34:40,001
for these distributions. And also we were
looking for outliers. And then finally, I
1629
02:34:40,001 --> 02:34:44,670
just quickly did a shout out for your
ogive here and your cumulative frequency. So
1630
02:34:44,670 --> 02:34:51,420
you know what's up with that. So in
conclusion, the purpose of the histogram is
1631
02:34:51,420 --> 02:34:56,171
to reveal the distribution and also the stem
and leaf displays reveal the distribution.
1632
02:34:56,171 --> 02:35:02,660
And you look then, for outliers. You're probably
wondering, Well, why do we do all this work
1633
02:35:02,660 --> 02:35:06,881
to reveal the distribution? Well, you'll
find in later chapters it matters what kind
1634
02:35:06,881 --> 02:35:13,420
of distribution you have for what kind of statistics
you can do, in a way. You know, like
1635
02:35:13,420 --> 02:35:17,300
I went kind of, on and on about the normal
distribution. Well, we all really like that
1636
02:35:17,300 --> 02:35:20,950
in statistics, we're all really partial to
that, because it allows you to do a whole
1637
02:35:20,950 --> 02:35:26,271
bunch of different statistics, you know, pretty
easily if you get a normal distribution. However,
1638
02:35:26,271 --> 02:35:31,591
what often happens in healthcare, because
I've done it, is you get a skewed distribution
1639
02:35:31,591 --> 02:35:37,260
left skewed, right skewed, and then you have
to make some decisions, that makes it a little
1640
02:35:37,260 --> 02:35:41,931
harder. Also, I've had a bimodal distribution
before, I'm remembering that one day; that
1641
02:35:41,931 --> 02:35:47,080
was kind of an issue, and then I had to figure
that one out. So that's roughly why we have
1642
02:35:47,080 --> 02:35:51,280
to go through this chapter and figure out
how to do these distributions. And then later,
1643
02:35:51,280 --> 02:36:00,300
I'll explain to you what you do with that
knowledge. Hello, there, it's Monica Wahi,
1644
02:36:00,300 --> 02:36:08,040
Labouré College statistics lecturer. We're
going to circle back now to chapter 2.2. And
1645
02:36:08,040 --> 02:36:12,840
talk about these other graphs, I'm doing things
a little out of order, because it makes sense
1646
02:36:12,840 --> 02:36:19,760
to me. I hope it makes sense to you too. Well,
for this lecture, we're going to have these
1647
02:36:19,760 --> 02:36:23,630
learning objectives. So when you're done with
this lecture, you should be able to describe
1648
02:36:23,630 --> 02:36:29,230
a case in which a time series graph would
be appropriate, you should be able to explain
1649
02:36:29,230 --> 02:36:34,580
the difference between what would be graphed
on a bar graph versus a time series graph,
1650
02:36:34,580 --> 02:36:39,190
you should be able to describe the type of
data graphed in a pie chart. And you should
1651
02:36:39,190 --> 02:36:44,961
also be able to list two considerations to
make when choosing what type of chart to develop.
1652
02:36:44,961 --> 02:36:50,351
Alright, so let's get started here. What I'm
going to be doing in this lecture is, first
1653
02:36:50,351 --> 02:36:55,500
I'm going to explain what a time series graph
is. Then I'm going to talk about a bar graph.
1654
02:36:55,500 --> 02:36:59,580
And of course, I'm going to show you roughly
how to make these, I'm gonna explain a pie
1655
02:36:59,580 --> 02:37:05,090
chart and how to make that. And then I'm going
to go over a review of all the graphs I've
1656
02:37:05,090 --> 02:37:13,250
talked about for chapter two. And just summarize
when to use what type of graph. So let's start
1657
02:37:13,250 --> 02:37:19,590
with the time series graph. And actually,
the word time is the key.
1658
02:37:19,590 --> 02:37:20,590
The time
1659
02:37:20,590 --> 02:37:26,460
we're going to talk about this time series
graph and what are time series data, right.
1660
02:37:26,460 --> 02:37:32,500
As you can see, by this little example, time
is across the x axis. And that's kind of a
1661
02:37:32,500 --> 02:37:38,070
hint for where we're going. Okay, so then
I'll show you roughly how to plot one. And
1662
02:37:38,070 --> 02:37:44,380
I'll explain why we have these time series
graphs, like how you interpret them and why
1663
02:37:44,380 --> 02:37:54,540
you even make them. So, of course, I'm an
epidemiologist. So what am I into? M&M: mortality,
1664
02:37:54,540 --> 02:38:01,500
morbidity. So here's a nice time series graph,
wonderful graph of the percentage of visits
1665
02:38:01,500 --> 02:38:08,710
for influenza-like illness reported by the
US Outpatient Influenza-like Illness Surveillance
1666
02:38:08,710 --> 02:38:17,141
Network, by surveillance week, and this is
October 1, 2006, through May 1, 2010. And you're
1667
02:38:17,141 --> 02:38:23,011
like, oh, time? Yeah, that's the deal. Time
series data are made of measurements for the
1668
02:38:23,011 --> 02:38:29,880
same variable, for the same individual taken
in intervals over a period of time. Only,
1669
02:38:29,880 --> 02:38:36,391
in this case, in the example here, the individual
is not a person, right? Because remember,
1670
02:38:36,391 --> 02:38:40,450
individuals are just what you measure, what
you're measuring variables about. Here, the
1671
02:38:40,450 --> 02:38:47,601
individuals are actually weeks, right? Because
every week, they're making a measurement.
1672
02:38:47,601 --> 02:38:52,021
So like I said, time series data are made
of measurements for the same variable, which
1673
02:38:52,021 --> 02:38:58,010
is the percentage of visits for influenza-like
illness. So every week they went to, I
1674
02:38:58,010 --> 02:39:03,980
don't know who is in, like, what clinics are
in this Outpatient Influenza-like Illness
1675
02:39:03,980 --> 02:39:08,420
Surveillance Network, but let's just pretend
there's like 10 clinics in there. So each
1676
02:39:08,420 --> 02:39:13,370
week, these clinics have to go in and say,
Yeah, I had, for example, 100 visits this
1677
02:39:13,370 --> 02:39:19,250
week, and 10 of them were for influenza-like
illness. So then that would be 10% that week,
1678
02:39:19,250 --> 02:39:23,080
for that clinic. Well, they got all the clinics
together, and they found out what the percents
1679
02:39:23,080 --> 02:39:27,870
were. And you can see on the y axis, right,
there's the percentage, and then you see on
1680
02:39:27,870 --> 02:39:34,160
the x axis all the weeks in the year. So um,
so you've seen these before, right? You especially
1681
02:39:34,160 --> 02:39:38,540
see it with stock market, right? You go on
Yahoo, and look at your favorite stock, right?
1682
02:39:38,540 --> 02:39:42,910
You know, we're all so rich, we own so much
stock, and so you track your favorite stock
1683
02:39:42,910 --> 02:39:48,330
that way. Personally, I spend more time
looking at mortality and morbidity, things
1684
02:39:48,330 --> 02:39:53,070
like influenza, but hey, after I get
some money, I'll be looking at stock market
1685
02:39:53,070 --> 02:40:01,080
prices. So when we see these time series data
graphed in these time series graphs, it's often
1686
02:40:01,080 --> 02:40:10,921
about things like influenza rates. Other rates,
you'll see life expectancy, rates of heart
1687
02:40:10,921 --> 02:40:14,681
attack. And that's usually what we see, because
we're trying to affect those rates. And we're
1688
02:40:14,681 --> 02:40:19,880
trying to see if they're going up or down.
So I'm going to just roughly go through how
1689
02:40:19,880 --> 02:40:24,141
you make one. If you ever wanted to make one,
the first thing you need is a table, kind
1690
02:40:24,141 --> 02:40:29,380
of like the one on the right, I just made
up these data, they don't mean anything. But
1691
02:40:29,380 --> 02:40:34,960
roughly what you need is a column that says,
in this case, I put year, the influenza people
1692
02:40:34,960 --> 02:40:39,540
they put week, but you have to put, like,
regular time increments in the first column.
1693
02:40:39,540 --> 02:40:45,311
And then you have to put that variable measured
at that time in the next column. So let's
1694
02:40:45,311 --> 02:40:49,240
say it's today, and you're like, Oh, I want
to measure how many times I went to the gym
1695
02:40:49,240 --> 02:40:53,940
each week, you know, weekly over the last
few months? Well, you're gonna have to reconstruct
1696
02:40:53,940 --> 02:40:58,380
that data, right? Like maybe from your memory
or your calendar. So normally, when you're
1697
02:40:58,380 --> 02:41:03,900
going to go do time series stuff, you start
and you collect the data as you go along.
1698
02:41:03,900 --> 02:41:08,130
And then it's nice and accurate. Okay, so
let's say you did that, and you managed to
1699
02:41:08,130 --> 02:41:12,811
get some time series data together, then how
do you plot? Well, the first thing you do,
1700
02:41:12,811 --> 02:41:18,080
and I'm using this influenza thing as an
example, is you draw a horizontal line and
1701
02:41:18,080 --> 02:41:23,561
you make that your x axis. Now, you gathered
your data based on years or weeks or something.
1702
02:41:23,561 --> 02:41:28,040
So you can label those time periods there,
because you already know those time periods.
1703
02:41:28,040 --> 02:41:33,870
And so you just label that x axis there.
Then you draw the vertical line for your y
1704
02:41:33,870 --> 02:41:41,400
axis. And again, you've done all your measurements,
right? So if you were measuring how many times
1705
02:41:41,400 --> 02:41:47,300
you went to the gym per week, you know, maybe
once a day, you know, that would be seven
1706
02:41:47,300 --> 02:41:52,480
would be the maximum, right? So you'd
want to make sure your y axis is tall enough
1707
02:41:52,480 --> 02:41:58,211
to get that seven, if you had a good week
there. And so that's really what you're looking
1708
02:41:58,211 --> 02:42:02,251
for in the y axis. You don't want it too tall.
Like, you see the highest point that they have?
1709
02:42:02,251 --> 02:42:07,420
Ooh, in 2009, they had an outbreak there,
they needed to make sure that the y axis was
1710
02:42:07,420 --> 02:42:11,000
tall enough so that they could graph that.
But other than that, you don't want it too
1711
02:42:11,000 --> 02:42:15,960
much taller. And then make sure you label
it. I'm big on labeling here, because otherwise
1712
02:42:15,960 --> 02:42:17,420
people get confused.
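The table and the axis choices described so far can be sketched in Python before any plotting happens. This is a rough sketch under stated assumptions: the weekly counts are made up (only the week with 100 visits and 10 influenza-like-illness visits mirrors the example above), and the helper name y_axis_top is hypothetical.

```python
import math

# Hypothetical weekly records: (time period, total visits, influenza-like-illness visits).
weekly_visits = [
    ("week 1", 100, 10),
    ("week 2", 120, 6),
    ("week 3", 90, 18),
    ("week 4", 110, 11),
]

# The time series table: regular time increments in the first column,
# the variable measured at that time (percentage of visits) in the second.
series = [(week, 100 * ili / total) for week, total, ili in weekly_visits]

def y_axis_top(values, step=5):
    """Smallest multiple of `step` that is at least as large as the biggest value."""
    return step * math.ceil(max(values) / step)

top = y_axis_top([pct for _, pct in series])

print("x axis label: surveillance week")
print("y axis label: % of visits for influenza-like illness")
print(f"y axis runs from 0 to {top}")
for week, pct in series:
    print(f"{week}: {pct:.1f}%")
```

The y axis top is kept just tall enough to hold the highest point and no taller, which is the rule of thumb given above; the step of 5 is only a convenient choice for tick marks.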
1713
02:42:17,420 --> 02:42:22,630
Okay, now we're going on to the next step,
then this is where you get into actually putting
1714
02:42:22,630 --> 02:42:28,240
in your data. Now, because there were so many
weeks, like, if you look at 2007, that part of
1715
02:42:28,240 --> 02:42:34,000
the x axis is only about two inches
wide. And all, like, 52 weeks of 2007 were plotted
1716
02:42:34,000 --> 02:42:40,331
in there. So it literally looks like a super
smooth line. But honestly, what they did was
1717
02:42:40,331 --> 02:42:46,790
they went and they put each point in. And
so they put each point in separately, and
1718
02:42:46,790 --> 02:42:53,830
then they connected the dots. And that's why
it looks so smooth. If you only have
1719
02:42:53,830 --> 02:42:54,830
a few
1720
02:42:54,830 --> 02:43:01,320
points, and you have a wider x axis, it'll
be choppier; it'll look
1721
02:43:01,320 --> 02:43:06,460
a little bit more like those stock market
graphs that go up and down, up and down
1722
02:43:06,460 --> 02:43:10,230
and kind of look like a roller coaster and
not so smooth. But if you have a lot of points
1723
02:43:10,230 --> 02:43:15,000
and you mush them together, it ends up looking really
smooth. You also, I just wanted to point out,
1724
02:43:15,000 --> 02:43:19,901
can have more than one line on the graph,
for more than one set of data values. Like
1725
02:43:19,901 --> 02:43:24,450
here, they're comparing, I don't know, some
sort of book performance, how much it was
1726
02:43:24,450 --> 02:43:30,771
sold in the US versus Canada. You just have to
make sure that you have a legend if you do
1727
02:43:30,771 --> 02:43:37,930
that, so people can tell the lines apart.
So to summarize, time series graphs are useful
1728
02:43:37,930 --> 02:43:43,900
for understanding trends over time, like whether
things go up or down like you saw on that
1729
02:43:43,900 --> 02:43:49,120
influenza chart, we could see when there apparently
was kind of an epidemic or an outbreak. So
1730
02:43:49,120 --> 02:43:53,561
graphing more than one set of time series
data, like you saw in the last graph on one
1731
02:43:53,561 --> 02:43:59,471
graph, can help in comparing the differences
between the datasets. I worked for the US
1732
02:43:59,471 --> 02:44:04,320
Army. And there's a lot of problems with people
getting injured in the army. And so I made
1733
02:44:04,320 --> 02:44:09,650
a lot of time series graphs of rates of injury
over the years because we were trying to do
1734
02:44:09,650 --> 02:44:14,330
things to make the rates of injury go down.
And then that way we could see if the trend
1735
02:44:14,330 --> 02:44:19,660
was there that we were actually making them
go down. So that's the main goal of these
1736
02:44:19,660 --> 02:44:26,940
time series graphs. Now, I'm going to move
on to talk about a bar graph, which can display
1737
02:44:26,940 --> 02:44:33,450
quantitative or qualitative data. And I'm
going to first start with the features of
1738
02:44:33,450 --> 02:44:38,940
the bar graph. Here's just an example on the
right here. I'm going to talk about how to
1739
02:44:38,940 --> 02:44:44,750
make one and then we're going to talk about
what happens when you change the scale meaning
1740
02:44:44,750 --> 02:44:51,190
the y axis, like how tall the y axis is,
on a bar chart because it really changes things.
1741
02:44:51,190 --> 02:44:56,110
I call it a bar chart sometimes, or bar graph.
They're really the same thing. I don't know
1742
02:44:56,110 --> 02:45:00,550
why they chose graph in the book. But then
finally, I want to do a little shout
1743
02:45:00,550 --> 02:45:05,490
out to what Pareto charts are; we don't really
use them much in healthcare, but I still wanted
1744
02:45:05,490 --> 02:45:11,570
you to know about them. Alright, so let's
look at the features of a bar graph. The first
1745
02:45:11,570 --> 02:45:16,320
thing you want to know is that the bars
can be vertical or horizontal. So don't, even
1746
02:45:16,320 --> 02:45:20,190
though I'm showing you
this vertical example, don't be thrown off,
1747
02:45:20,190 --> 02:45:25,440
if you see a horizontal example. Regardless
of whether they're vertical or horizontal,
1748
02:45:25,440 --> 02:45:31,510
the bars are supposed to have a uniform width,
and uniform spacing, they can't be wider or
1749
02:45:31,510 --> 02:45:39,650
skinnier. And they have to be spaced apart
at a uniform rate. I'm gonna use, like I said,
1750
02:45:39,650 --> 02:45:46,680
this big one here, as an example, to talk
about bar graphs, I just want you to notice
1751
02:45:46,680 --> 02:45:51,920
what is being graphed here. And this is the
percentage of people in the US not covered
1752
02:45:51,920 --> 02:45:56,811
by health insurance. And it's split up by
race and ethnicity. And it's looking at the
1753
02:45:56,811 --> 02:46:03,080
years 2008 through 2012, which is like bad,
right? Like, you want people to have health
1754
02:46:03,080 --> 02:46:09,410
insurance. Okay, um, so item three here says
the length of the bars represent either the
1755
02:46:09,410 --> 02:46:14,960
variable's frequency or percentage of occurrence.
So if we were looking at instead of percent
1756
02:46:14,960 --> 02:46:18,970
like, I've circled percentage, because
that's what we're looking at in this one,
1757
02:46:18,970 --> 02:46:23,570
we could have looked at, you know, number
of visits at a health care clinic, and that
1758
02:46:23,570 --> 02:46:28,500
would be frequency, right. But we happen
to be looking at percentage here. So
1759
02:46:28,500 --> 02:46:36,670
I just wanted to call that out. So you'll
see then, on the y axis, we have the measurement
1760
02:46:36,670 --> 02:46:42,980
scale. And as long as we write it there, and
we use that same measurement scale, for graphing
1761
02:46:42,980 --> 02:46:46,930
each of the bars, we will be fulfilling the
item four, which is that the same measurement scale
1762
02:46:46,930 --> 02:46:51,430
is used for each bar. I don't know why anybody
would do it any other way. But that's part of the
1763
02:46:51,430 --> 02:46:58,330
features of the bar graph. Now, this is a
feature that really is like my pet peeve,
1764
02:46:58,330 --> 02:47:03,881
I get so irritated when I find a bar graph
or any other graph where things are not labeled,
1765
02:47:03,881 --> 02:47:10,461
I get totally confused. So you really want
to put on a title, you need to put the bar
1766
02:47:10,461 --> 02:47:16,150
labels, at least on the x axis,
right? Like, you have to know. See how it says
1767
02:47:16,150 --> 02:47:20,170
white alone, black alone, like you wouldn't
even know what those bars were unless somebody
1768
02:47:20,170 --> 02:47:26,710
put something there, right. And some people
also add the actual values for each bar, I'll
1769
02:47:26,710 --> 02:47:31,551
do that if there's space, like there was space
here. If it gets too busy, I don't do that.
1770
02:47:31,551 --> 02:47:36,460
But um, because you can kind of see them from
the graph.
1771
02:47:36,460 --> 02:47:41,480
Now, you're probably wondering, um, you're
probably kind of having a flashback, you're
1772
02:47:41,480 --> 02:47:46,540
like, this looks totally like a histogram.
What is the difference? Well, I started by
1773
02:47:46,540 --> 02:47:53,290
talking to you about histograms, they're actually
a special case of a bar graph, right? So bar
1774
02:47:53,290 --> 02:47:59,471
graphs are more general. And the histogram
is a specific type of bar graph. So histograms
1775
02:47:59,471 --> 02:48:06,580
are bar graphs that must have classes of a
quantitative variable on the x axis. So you
1776
02:48:06,580 --> 02:48:14,061
can already see that the bar graph I'm showing
you is not a histogram, because it has categorical,
1777
02:48:14,061 --> 02:48:20,910
qualitative things, it doesn't have a class,
right? Also histograms must have frequency
1778
02:48:20,910 --> 02:48:26,040
or relative frequency on the y axis, which,
as you can see, this is percentage of something.
1779
02:48:26,040 --> 02:48:30,641
So that's not that. So this isn't a histogram.
But whenever you make a histogram, you're
1780
02:48:30,641 --> 02:48:37,200
just making kind of a special bar graph. And
I just wanted to point that out, so you weren't
1781
02:48:37,200 --> 02:48:43,500
confused. Now, I said, I was going to warn
you about what goes wrong when you change
1782
02:48:43,500 --> 02:48:50,220
the scale. And what I mean by changing the
scale is when you look at that y axis, notice
1783
02:48:50,220 --> 02:48:58,300
how the top of it, the way this person made
it, is at 35, or 35%. But notice that the
1784
02:48:58,300 --> 02:49:04,220
highest racial group without health insurance,
which is unfortunately, those of Hispanic
1785
02:49:04,220 --> 02:49:11,870
origin, that's close to 30. But it's
not all the way up to 35. So I'm not exactly
1786
02:49:11,870 --> 02:49:18,300
sure why they made it so high. So I wanted
to see what would happen, how the shape of
1787
02:49:18,300 --> 02:49:22,710
these bars would change, if I actually made the
top 30. So I regenerated this, and then you'll
1788
02:49:22,710 --> 02:49:29,670
see what happens. See, it's the same data.
I just made it and I made the top 30. It's
1789
02:49:29,670 --> 02:49:35,220
kind of subtle, but suddenly all the bars
look bigger, right? So if I were like some
1790
02:49:35,220 --> 02:49:39,930
advocate and running around saying this is
terrible, you know, these people don't have
1791
02:49:39,930 --> 02:49:45,010
insurance. I'd like to look at the one on
the left more than the one on the right. But,
1792
02:49:45,010 --> 02:49:51,960
you know, in a way, that's a little misleading,
right? It's the same data. So the differences
1793
02:49:51,960 --> 02:49:58,040
between bars are more dramatic when we change
the scale to be shorter, a little bit more
1794
02:49:58,040 --> 02:50:04,490
dramatic. But let's go the other way, and
this is where I see people do things a lot.
1795
02:50:04,490 --> 02:50:10,580
Let's see what happens. See how the
top of the y axis is 35 right now? Let's
1796
02:50:10,580 --> 02:50:17,970
double that. Let's just make it 70. And then
let's see what happens. As you can see, the
1797
02:50:17,970 --> 02:50:25,010
differences between the bars look small, right?
Like, the difference between that big Hispanic
1798
02:50:25,010 --> 02:50:33,670
origin one and the lower white and Asian alone
ones isn't really that big anymore. So my
1799
02:50:33,670 --> 02:50:38,530
opponents would rather look at that graph.
In fact, everything looks kind of small on
1800
02:50:38,530 --> 02:50:43,601
that graph. It's like, oh, there's no problems
with insurance. Um, and that's, you know,
1801
02:50:43,601 --> 02:50:47,460
when people talk about lying with statistics,
so to speak, I mean, these are the kind of
1802
02:50:47,460 --> 02:50:54,500
tricks people do to try and change how things
appear. And the best way to do it is to just
1803
02:50:54,500 --> 02:50:59,940
do kind of what I suggested is look at the
next one up from your tallest one. And do
1804
02:50:59,940 --> 02:51:06,590
that; use that as the top of your y axis.
What I would have to do with the army is, I
1805
02:51:06,590 --> 02:51:11,541
was looking at rate of knee injury, and also
rate of ankle injury. But knee injury was
1806
02:51:11,541 --> 02:51:18,901
way more common. And so if I wanted to compare
the two, I always use the same scale, because
1807
02:51:18,901 --> 02:51:25,190
otherwise, people wouldn't be able to see
that the ankle injury was really, really low.
1808
02:51:25,190 --> 02:51:32,511
Compared to the knee injury, even though they're
both important. Um, so overall, with a taller
1809
02:51:32,511 --> 02:51:37,530
y axis, the differences between the bars look
less dramatic, and also the taller you
1810
02:51:37,530 --> 02:51:42,540
make your y axis, the less it looks like you
have of the bars, so you got to be really
1811
02:51:42,540 --> 02:51:47,610
careful. I don't think you would do that.
But you know, other people do that, when they're
1812
02:51:47,610 --> 02:51:52,800
trying to make their points. So just be careful
for that. Also, a term that was mentioned
1813
02:51:52,800 --> 02:51:58,080
in the book is the term clustered, as in clustered
bar graph. It's not that complicated, it just
1814
02:51:58,080 --> 02:52:03,290
means more than one bar is graphed for each
category. You'll see in the last one
1815
02:52:03,290 --> 02:52:10,381
I did, it was just on one topic. And here,
if you look at this one on the right, and
1816
02:52:10,381 --> 02:52:15,830
of course, I mixed it up a little I did the
horizontal version. But this is life expectancy
1817
02:52:15,830 --> 02:52:18,120
at birth.
1818
02:52:18,120 --> 02:52:23,820
And it's separated; you'll see that
there's three sets of bars, right? There's
1819
02:52:23,820 --> 02:52:28,580
both sexes together, and there's a bunch of
bars for that. And you see the legend: Hispanic,
1820
02:52:28,580 --> 02:52:32,580
non-Hispanic Black, non-Hispanic White,
and then they mix them all together, all races
1821
02:52:32,580 --> 02:52:38,160
and origins. And then they also have a separate set
of bars for male and female. And so this would
1822
02:52:38,160 --> 02:52:43,280
be clustered. And if you do that, you really
need a legend so people can tell what's going
1823
02:52:43,280 --> 02:52:49,210
on. You'll also notice that you know, life
expectancy, that's good. If it's high, right,
1824
02:52:49,210 --> 02:52:56,620
you want to live to be 80, 90, 100. But if you
look at the bottom of the slide where we have
1825
02:52:56,620 --> 02:53:01,950
the x axis, I mean, if we started at zero,
and just made it all long, it would not even
1826
02:53:01,950 --> 02:53:06,271
fit on the slide. So what they'll do is they'll
make these little hash marks with this little
1827
02:53:06,271 --> 02:53:13,750
squiggle, and indicate that they just skipped
ahead. But like I said in the first part of
1828
02:53:13,750 --> 02:53:19,840
this, if they skip ahead on the female one,
they have to skip ahead on all of them. Right,
1829
02:53:19,840 --> 02:53:24,280
so everything is skipped ahead there. This
is a fair comparison. It's just like we're
1830
02:53:24,280 --> 02:53:28,960
sort of, it's like, we're fast forwarding
through the movie up to about 50. And then
1831
02:53:28,960 --> 02:53:33,780
looking at the differences there because everything's
the same up to that. So that's just another
1832
02:53:33,780 --> 02:53:39,760
thing about scale is notice whether it's clustered
if you've got a legend, and also look for
1833
02:53:39,760 --> 02:53:47,990
the squiggle. Okay, now I'm going to give
you a shout out to a Pareto chart. And you
1834
02:53:47,990 --> 02:53:51,510
probably already noticed, we don't really
use these much in healthcare, because this
1835
02:53:51,510 --> 02:53:57,960
example is about causes of an engine overheating.
Well, we don't do that a lot in healthcare.
1836
02:53:57,960 --> 02:54:06,650
And you'll see I kind of slapped on a label
on the y axis, the word frequency, okay. So
1837
02:54:06,650 --> 02:54:11,570
in a Pareto chart, you remember how
I was saying a histogram is a special bar
1838
02:54:11,570 --> 02:54:17,920
chart, or bar graph? Well, a Pareto chart
is a different kind of special bar graph.
1839
02:54:17,920 --> 02:54:24,090
Okay. And in that one, the height of the
bar indicates the frequency of an event. Like
1840
02:54:24,090 --> 02:54:31,360
if you look at these events here, like damaged
radiator core, that happened 31 times right?
1841
02:54:31,360 --> 02:54:36,080
And that happened more often than faulty fans,
which only happened 20 times. So what they
1842
02:54:36,080 --> 02:54:40,080
do is they figure out what happened the most
and the second most and least whatever, and
1843
02:54:40,080 --> 02:54:45,500
they deliberately arrange them in order left
to right, according to decreasing height.
1844
02:54:45,500 --> 02:54:51,601
It's a way of sort of zeroing in on what is
the most important problem you're finding.
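The ordering step described above can be sketched in Python. This is only an illustration: the counts for damaged radiator core (31) and faulty fans (20) are the two mentioned in the lecture, and the other two causes and their counts are hypothetical.

```python
# Frequency of each cause of engine overheating.
causes = {
    "faulty fans": 20,             # from the lecture
    "damaged radiator core": 31,   # from the lecture
    "loose fan belt": 11,          # hypothetical
    "coolant leak": 5,             # hypothetical
}

# A Pareto chart arranges the bars left to right by decreasing height,
# so sort the causes by frequency, largest first.
pareto_order = sorted(causes.items(), key=lambda item: item[1], reverse=True)

# Crude text version of the bars, tallest problem first.
for cause, count in pareto_order:
    print(f"{cause:<22} {'#' * count} {count}")
```

Sorting first means the most frequent problem always lands in the leftmost position, whatever order the events were recorded in.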
1845
02:54:51,601 --> 02:54:57,521
So it's really meant to graph frequencies
of problems. I actually only saw one Pareto
1846
02:54:57,521 --> 02:55:02,830
chart, ever, in healthcare so
far; I really looked for one. And what it
1847
02:55:02,830 --> 02:55:09,061
was about was, it was about things that can
happen that are bad in a nursing home. And
1848
02:55:09,061 --> 02:55:15,820
I remember the tallest bar was for falls,
right? Like people fall in a nursing home.
1849
02:55:15,820 --> 02:55:21,970
And then there was a smaller bar for medication
errors that happens. The reason why we don't
1850
02:55:21,970 --> 02:55:27,131
I think the reason why we don't use these
a lot in healthcare is, you know, let's pretend
1851
02:55:27,131 --> 02:55:31,530
that's what this was. Let's pretend
this 31, instead of damaged radiator cores,
1852
02:55:31,530 --> 02:55:35,670
was 31 falls. Well, the first thing you'd
probably ask is, well, how many people are
1853
02:55:35,670 --> 02:55:41,811
in that, that nursing home? You know, and
how long did you collect data for right? 31
1854
02:55:41,811 --> 02:55:47,220
falls is pretty bad. But it's not bad if
you have hundreds of people over 10 years
1855
02:55:47,220 --> 02:55:52,321
and all you get is 31 falls. Then you're doing
pretty well. So I would say that the reason
1856
02:55:52,321 --> 02:55:57,050
why we don't use Pareto charts a lot in healthcare
is that it sort of leaves out some important
1857
02:55:57,050 --> 02:56:02,841
information about these serious events. And
so we like to look at things in different
1858
02:56:02,841 --> 02:56:12,110
ways. So just to summarize, about bar graphs,
bar graphs must be made following a few rules,
1859
02:56:12,110 --> 02:56:17,300
I talked to you about, you know,
how you have to keep the width
1860
02:56:17,300 --> 02:56:22,870
the same and, and how you have to label the
axes. So we know what you're talking about.
1861
02:56:22,870 --> 02:56:27,710
Because you can visualize both quantitative
and qualitative data using a bar chart. So
1862
02:56:27,710 --> 02:56:32,222
these labels become really important, as do
scales, right? Like, I showed you how you
1863
02:56:32,222 --> 02:56:36,271
change the scale, and you can make things
look different. So you want to be careful
1864
02:56:36,271 --> 02:56:42,160
and be cognizant of that. And also, I did
a shout-out to Pareto charts, and I explained
1865
02:56:42,160 --> 02:56:46,980
why I think they're not used that much in
healthcare.
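To make that point about the missing information concrete, here is a small Python sketch that turns a raw count of falls into a rate. Only the count of 31 comes from the example; the two nursing homes, their sizes, and the time periods are hypothetical, and the function assumes the number of residents stays constant over the period.

```python
def falls_per_100_resident_years(falls, residents, years):
    """Rate of falls per 100 resident-years of observation.

    Assumes the number of residents is constant over the whole period.
    """
    return falls / (residents * years) * 100


# Both scenarios are hypothetical; only the count of 31 comes from the example.
small_home = falls_per_100_resident_years(31, residents=20, years=1)
large_home = falls_per_100_resident_years(31, residents=300, years=10)

print(f"Small home, one year:   {small_home:.1f} falls per 100 resident-years")
print(f"Large home, ten years:  {large_home:.1f} falls per 100 resident-years")
```

The bar for 31 falls would be the same height in both cases, but the rates are very different, which is the kind of context a plain frequency chart leaves out.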
1866
02:56:46,980 --> 02:56:51,391
Now we're going to jump into pie charts. You
know, just even the thought of a pie chart
1867
02:56:51,391 --> 02:56:56,110
makes me hungry. Doesn't it make you hungry?
Um, so here's what a pie chart is. They're
1868
02:56:56,110 --> 02:57:02,580
also called circle graphs. They're used with
counts or frequencies that are mutually exclusive.
1869
02:57:02,580 --> 02:57:08,361
And that sounds really fancy. But all it means
is when every individual can only fall in
1870
02:57:08,361 --> 02:57:12,230
one category. So I'm going to give you the
example on the right, which is actually from
1871
02:57:12,230 --> 02:57:16,820
a real report you should probably read. It
was a survey that was done by the Massachusetts
1872
02:57:16,820 --> 02:57:23,160
Nursing Association, and they got 339 nurses
to fill out the survey. One of the questions
1873
02:57:23,160 --> 02:57:30,311
was, do you receive annual bloodborne pathogen
training? Now the answer is only going to
1874
02:57:30,311 --> 02:57:37,681
be yes or no. They can't say yes and no. That
is what mutually exclusive is, is where you
1875
02:57:37,681 --> 02:57:43,881
can only answer one answer. So as you can
see, 234 people said yes, which is good. And
1876
02:57:43,881 --> 02:57:48,271
105 said no, which is bad, I'm worried about
that.
1877
02:57:48,271 --> 02:57:50,530
But these pie charts
1878
02:57:50,530 --> 02:57:54,101
are often made in graphing programs, because
they're a little difficult to do by hand.
1879
02:57:54,101 --> 02:58:00,880
And I'll explain to you why. And unlike Pareto
charts, these are super common in healthcare,
1880
02:58:00,880 --> 02:58:06,760
as you can see right there on the slide. So
let's look at the features of a pie chart.
1881
02:58:06,760 --> 02:58:14,790
Um, I actually just made up this fake pie
chart, I pretended I had a class where I gave
1882
02:58:14,790 --> 02:58:20,430
a five point quiz, right? And the reason why
I did that is I wanted to show you how to
1883
02:58:20,430 --> 02:58:26,630
do it with a quantitative variable. Because
remember, the last one, it was yes or no.
1884
02:58:26,630 --> 02:58:30,590
And that's qualitative. Those are the
answers that the nurses could give to that
1885
02:58:30,590 --> 02:58:35,710
survey question. Well, this is a different
one. This is where I actually put, you know,
1886
02:58:35,710 --> 02:58:41,620
fake students and their points on this
quiz into classes, right? Like you see zero
1887
02:58:41,620 --> 02:58:47,870
points, one to two points, three to four points
and five points, right. So regardless of whether
1888
02:58:47,870 --> 02:58:52,540
you're doing yes/no, you know, qualitative, or,
you know, different categories like that,
1889
02:58:52,540 --> 02:58:58,940
or you're doing classes like this, every individual
in your data must be in only one of the categories,
1890
02:58:58,940 --> 02:59:04,801
only one of the classes kind of like frequency
tables and histograms. Everybody gets
1891
02:59:04,801 --> 02:59:09,040
one vote. And that's really important in a
pie chart, even though it can be used with
1892
02:59:09,040 --> 02:59:14,930
qualitative or quantitative variables. And
you'll see later what I mean by that. And
1893
02:59:14,930 --> 02:59:21,130
so here is just a fake example I made of how
you would then make a pie chart out of a quantitative
1894
02:59:21,130 --> 02:59:24,521
variable.
1895
02:59:24,521 --> 02:59:25,521
So
1896
02:59:25,521 --> 02:59:30,840
I'm just gonna briefly go over how you would
do this by hand and I'm realizing I've never
1897
02:59:30,840 --> 02:59:37,970
done this by hand. I always use Excel as you
probably recognize that lovely purple color,
1898
02:59:37,970 --> 02:59:43,300
which comes out of Excel. But if you were
going to do it by hand, I guess you'd have
1899
02:59:43,300 --> 02:59:48,240
to go buy one of those things in the lower
left, which is a protractor because that helps
1900
02:59:48,240 --> 02:59:54,220
you see the degrees of a circle. Remember,
a whole circle has 360 degrees, right?
1901
02:59:54,220 --> 02:59:58,551
I don't know if you remember all this from
like trigonometry. But then, like, a half
1902
02:59:58,551 --> 03:00:04,130
circle would be 180 degrees. And so that's
how you figure out like how much of the piece
1903
03:00:04,130 --> 03:00:08,271
of the pie you need is using this protractor.
So if you're going to make a pie chart by
1904
03:00:08,271 --> 03:00:13,470
hand, you first have to make a table, you'll
see we make tables constantly in statistics.
1905
03:00:13,470 --> 03:00:20,680
And I put class in the first column, because
I was doing one that required class because
1906
03:00:20,680 --> 03:00:24,910
it's quantitative. If you were doing that
one with the nurses saying yes or no, you
1907
03:00:24,910 --> 03:00:29,700
would put category and you just say yes or
no, right, and then total, then of course,
1908
03:00:29,700 --> 03:00:34,501
next, you put the frequency. And I always
put total to add it up to try and make sure
1909
03:00:34,501 --> 03:00:38,710
you know, my fake class apparently had 37
people in it. So I just want to make sure
1910
03:00:38,710 --> 03:00:44,820
you know, everything adds up, then the next
step will remind you of relative frequency,
1911
03:00:44,820 --> 03:00:49,750
it's where you figure out the proportion of
the circle that that's going to take up, right.
1912
03:00:49,750 --> 03:00:56,830
So see the five points, the seven people
who got five points? Well, if you divide seven
1913
03:00:56,830 --> 03:01:02,490
by 37, you're going to get 0.19.
Well, I like percent. So that's 19%.
1914
03:01:02,490 --> 03:01:09,240
So that would say what proportion of the circle
they get, right. And then finally, in the
1915
03:01:09,240 --> 03:01:14,930
last column, remember how I was telling you
the whole circle is 360 degrees? Well, you
1916
03:01:14,930 --> 03:01:21,320
take that proportion you get, and you multiply
it by 360, to figure out how many degrees,
1917
03:01:21,320 --> 03:01:25,570
you're going to make your circle. And that's
why you need the protractor. And that's also
1918
03:01:25,570 --> 03:01:29,601
why I always use Excel for this because it
makes it so you don't have to worry about
1919
03:01:29,601 --> 03:01:35,391
those things. All you would need for Excel
is actually just the class or the categories,
1920
03:01:35,391 --> 03:01:41,271
and the frequency. And then if you use their
automatic pie graph function, then you can
1921
03:01:41,271 --> 03:01:47,271
get all this other stuff out very quickly.
So I just wanted to make a few notes about
1922
03:01:47,271 --> 03:01:52,851
pie charts. This is the thing I'm coming back
to is this mutually exclusive categories.
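Before moving on, here is a rough Python sketch of the by-hand table described above: class, frequency, proportion, and degrees. The total of 37 and the seven students with five points come from the fake quiz example; the other three frequencies are assumptions chosen only so the total adds up, and `pie_table` is a hypothetical helper.

```python
# Frequencies for the made-up five-point quiz. The total (37) and the
# seven students with five points are from the example; the other three
# counts are assumed here so that everything adds up to 37.
quiz = {
    "0 points": 2,      # assumed
    "1-2 points": 10,   # assumed
    "3-4 points": 18,   # assumed
    "5 points": 7,
}


def pie_table(freqs):
    """Return rows of (class, frequency, proportion, degrees of the circle)."""
    total = sum(freqs.values())
    return [(label, f, f / total, f / total * 360) for label, f in freqs.items()]


rows = pie_table(quiz)
for label, f, proportion, degrees in rows:
    print(f"{label:<11} {f:>3} {proportion:>6.0%} {degrees:>7.1f}")
print(f"{'Total':<11} {sum(quiz.values()):>3}")
```

The degrees column is what you would measure with the protractor; a charting program works it out from the classes and frequencies alone.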
1923
03:01:52,851 --> 03:01:57,851
So I want you to imagine that I do a survey,
right. And I ask the question, what is your
1924
03:01:57,851 --> 03:02:03,631
favorite color? And I give some choices like
red, green, blue, whatever, there's only going
1925
03:02:03,631 --> 03:02:08,370
to be one answer to everybody's question,
right? Because you can only have one favorite,
1926
03:02:08,370 --> 03:02:14,761
right? And that then is eligible to be used
in a pie chart, because everybody gets one
1927
03:02:14,761 --> 03:02:20,330
vote. But a lot of times, I'll see people
who do a different survey question, they'll
1928
03:02:20,330 --> 03:02:25,621
say, check off all of the colors you like.
So if I get that I'm like, Oh, I love red.
1929
03:02:25,621 --> 03:02:29,841
I like orange, I like green, I'm checking
off a bunch. There's some people I know who
1930
03:02:29,841 --> 03:02:33,681
don't really like color, like they just wear
gray and black. So they probably wouldn't
1931
03:02:33,681 --> 03:02:38,040
check off anything. And then there are the
people who just check off one or two. Well,
1932
03:02:38,040 --> 03:02:44,591
as you can see, people can have multiple votes
or no votes or whatever. And if you have that
1933
03:02:44,591 --> 03:02:48,141
situation, like I was telling you, where people
can say multiple things, you've got to go
1934
03:02:48,141 --> 03:02:54,181
into bar graph land, okay? Because a whole
bunch of people can like red, a whole bunch
1935
03:02:54,181 --> 03:02:58,340
of people can like green, a whole bunch of
people can like blue. And you won't get a
1936
03:02:58,340 --> 03:03:04,601
circle out of that. If everybody answers just
one answer. And so therefore, everybody's
1937
03:03:04,601 --> 03:03:10,680
in a mutually exclusive category, then you
can use the pie chart. I also wanted to let
1938
03:03:10,680 --> 03:03:16,561
you know that I find it and I think a lot
of people do more informative to put the percentage
1939
03:03:16,561 --> 03:03:22,630
on the actual chart than the frequency. Some
people put both the frequency and the percentage,
1940
03:03:22,630 --> 03:03:28,710
which is good, it's not so helpful to just
put the frequency as you see that the nursing
1941
03:03:28,710 --> 03:03:35,040
report did on the left. And it's because you
really don't know, you know, 234 seems like
1942
03:03:35,040 --> 03:03:38,610
a lot. But what proportion is that of the
circle, that's what you would kind of want
1943
03:03:38,610 --> 03:03:43,370
to know. Whereas if you look on the right
on mine, you can see like, for instance, only
1944
03:03:43,370 --> 03:03:49,271
5% got zero points. That's a small amount,
right? You know what 5% means? It's just hard
1945
03:03:49,271 --> 03:03:55,391
to tell, you know, if you look at that one
on the left, it looks a little like two thirds,
1946
03:03:55,391 --> 03:03:59,440
which would be 66%. But we don't know what
the percent is, right. And so it's really
1947
03:03:59,440 --> 03:04:05,091
helpful to have that percent. And always include
a title and a legend. Because if you're, if
1948
03:04:05,091 --> 03:04:07,820
you're graphing a pie chart, you're gonna
have more than one category, and so people
1949
03:04:07,820 --> 03:04:12,811
are gonna want to know what that color means.
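As a quick illustration of the two notes above, here is a minimal Python sketch. The first part checks the everybody-gets-one-vote rule on two hypothetical sets of survey answers (the respondents and their choices are made up). The second part works out the percentage for the 234 yes answers out of 339 from the nursing survey, which is the label that chart was missing.

```python
def can_use_pie_chart(responses):
    """True when every respondent picked exactly one category."""
    return all(len(choices) == 1 for choices in responses)


# Hypothetical survey answers, one list of choices per respondent.
favorite_color = [["red"], ["green"], ["blue"], ["red"]]
colors_you_like = [["red", "orange", "green"], [], ["blue"], ["red", "blue"]]

print(can_use_pie_chart(favorite_color))    # one vote each
print(can_use_pie_chart(colors_you_like))   # multiple votes or no votes

# Percentage label for the nursing survey: 234 yes and 105 no out of 339.
yes, no = 234, 105
yes_share = yes / (yes + no)
print(f"Yes: {yes} ({yes_share:.0%})")
```

With the percentage printed next to the count, nobody has to eyeball the slice and guess whether it is two thirds.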
1950
03:04:12,811 --> 03:04:17,021
This looks so good, doesn't it look good? Um,
pie charts are common in healthcare, and they
1951
03:04:17,021 --> 03:04:21,351
graph mutually exclusive categories. Okay,
so so you'll see this all the time. And like
1952
03:04:21,351 --> 03:04:26,440
I said, it's easier to make using software,
I use Excel, it can come out of other software,
1953
03:04:26,440 --> 03:04:31,910
but I just like Excel because you can really
put fancy labels on and you can do that squiggle
1954
03:04:31,910 --> 03:04:38,711
thing. But choosing a graph requires some
consideration, like whether or not you actually
1955
03:04:38,711 --> 03:04:45,271
want to make a pie chart or a bar chart or
whatever, requires some thought. And also,
1956
03:04:45,271 --> 03:04:49,021
regardless of the chart you make, you should
follow these rules. You should always provide
1957
03:04:49,021 --> 03:04:54,550
a title, okay? Even if it's just for your
private use. Trust me, I've done this. I go
1958
03:04:54,550 --> 03:04:59,251
back and I'm like, I don't even know what
I graphed. So take your time, sit down, write
1959
03:04:59,251 --> 03:05:06,650
a little title, so you remember. Also,
label the axes. Because, again, you think
1960
03:05:06,650 --> 03:05:10,061
you're going to remember or maybe you think
it's obvious everybody in the audience is
1961
03:05:10,061 --> 03:05:16,591
going to be able to tell. Don't leave anything to be
assumed, just be absolutely clear about what's
1962
03:05:16,591 --> 03:05:22,480
on each axis. Always identify your units of
measure. So if you're talking about a rate
1963
03:05:22,480 --> 03:05:28,530
per 10,000 people or a percentage, or maybe
you're talking about an average, or you're
1964
03:05:28,530 --> 03:05:32,672
talking about a frequency, it doesn't matter,
just make sure you're clear about what you're
1965
03:05:32,672 --> 03:05:39,920
talking about in the units of measure. Usually,
this ends up on the y axis. So the thought
1966
03:05:39,920 --> 03:05:46,150
is to make the graph as clear as possible,
thinking font size, thinking number of items
1967
03:05:46,150 --> 03:05:51,040
graphed. You know, I've sometimes seen a bunch
of time series graphs where they put so many
1968
03:05:51,040 --> 03:05:57,500
lines on there, I can't even see anything.
Or they'll have these really tiny font sizes.
1969
03:05:57,500 --> 03:06:03,660
Or they'll just try to put too much on one
graph. And it's hard to read. So if you find,
1970
03:06:03,660 --> 03:06:08,102
if you have trouble reading it, probably everybody
else will. So you want to modify it. So I
1971
03:06:08,102 --> 03:06:13,040
just threw this on the right. Can you tell
what's missing from the above graph? The above
1972
03:06:13,040 --> 03:06:17,021
graph is really missing a lot of information.
I mean, we don't even know what it's about.
1973
03:06:17,021 --> 03:06:21,510
We can kind of guess it's a time series
graph because of the time at the bottom. But
1974
03:06:21,510 --> 03:06:25,160
what else right? So the person who made this
really knew what they were talking about,
1975
03:06:25,160 --> 03:06:31,230
but we don't, and you don't want that to happen
to your graph. Okay, so here, what I'm going
1976
03:06:31,230 --> 03:06:37,000
to do is review all the different graphs I've
talked about in chapter two, and talk about
1977
03:06:37,000 --> 03:06:42,400
the cases where that graph is useful. So you
can keep straight in your heads why
1978
03:06:42,400 --> 03:06:47,470
we have all these graphs, right. So first,
there's the frequency histogram. Remember
1979
03:06:47,470 --> 03:06:52,551
that that was only for quantitative data.
And that's what you make when you want to
1980
03:06:52,551 --> 03:06:57,570
see the distribution, right? Remember, the
distribution was a shape. And a frequency
1981
03:06:57,570 --> 03:07:04,330
histogram is a particular type of bar graph
that is meant for showing these distributions.
1982
03:07:04,330 --> 03:07:09,040
I also showed you how to make a relative frequency
histogram, which is almost the same thing,
1983
03:07:09,040 --> 03:07:13,841
only it graphs the relative frequency instead
of the frequency. And that also will show
1984
03:07:13,841 --> 03:07:18,940
you the distribution, right, because the pattern
will be the same. But this one's specifically
1985
03:07:18,940 --> 03:07:24,141
good for comparing to other data. So if you
have two sets of data, maybe from two different
1986
03:07:24,141 --> 03:07:29,200
locations or two different groups, then you
want to use the relative frequency histogram,
1987
03:07:29,200 --> 03:07:35,650
because then it's easier to compare distributions,
right. I also showed you how to make a stem
1988
03:07:35,650 --> 03:07:39,860
and leaf display, I explained what the stem
and leaf is, what the leaves are, and what
1989
03:07:39,860 --> 03:07:46,221
the stem is. And that's also for quantitative
data. And that's also if you want to see the
1990
03:07:46,221 --> 03:07:49,830
distribution, it's also good for organizing
the data, it's a little easier to make by
1991
03:07:49,830 --> 03:07:56,061
hand than a histogram. Because a histogram
makes you make a frequency table first, and
1992
03:07:56,061 --> 03:08:00,750
stem and leaf display, you can kind of skip
that step. So again, these first three were
1993
03:08:00,750 --> 03:08:06,730
just about trying to take quantitative data
and visualize it so you can look at distributions
1994
03:08:06,730 --> 03:08:13,150
and also look for outliers. Next, we went
into the time series graph. And that is really
1995
03:08:13,150 --> 03:08:18,790
about time, right? That's for graphing a variable
that changes over time and is measured at
1996
03:08:18,790 --> 03:08:24,220
regular intervals, mainly to see trends like
is it going up? Is it going down? Was there
1997
03:08:24,220 --> 03:08:29,771
an epidemic? That's what a time series
graph is for. A bar graph: now this is the
1998
03:08:29,771 --> 03:08:37,220
generic bar graph, not the specific histogram,
like I described, but the generic bar graph
1999
03:08:37,220 --> 03:08:43,540
can be used for qualitative data or for quantitative
data. And it can be used for displaying frequency
2000
03:08:43,540 --> 03:08:49,521
or percentage, and we went over some examples.
Then I shouted out to the Pareto chart, which
2001
03:08:49,521 --> 03:08:56,230
is a special bar graph, right. And that special
bar graph graphs frequencies of rare events,
2002
03:08:56,230 --> 03:09:01,990
in descending order, usually bad things, you
know, rare bad things. And again, we don't
2003
03:09:01,990 --> 03:09:07,900
really use this much in healthcare. Finally,
I went over the pie graph. And that's for
2004
03:09:07,900 --> 03:09:12,931
mutually exclusive categories, quantitative
or qualitative. And we use those a lot in
2005
03:09:12,931 --> 03:09:15,230
healthcare.
2006
03:09:15,230 --> 03:09:21,251
So in conclusion, in this particular lecture,
I first went over the time series graphs,
2007
03:09:21,251 --> 03:09:26,440
and explained how they show changes over time.
And then I went over bar graphs and showed
2008
03:09:26,440 --> 03:09:31,061
you how they can display quantitative and
qualitative data. They can be up and down
2009
03:09:31,061 --> 03:09:36,891
or horizontal. I showed you some different
examples. And then we went through pie charts,
2010
03:09:36,891 --> 03:09:40,601
looking at mutually exclusive categories,
which I think are my favorite, like look at
2011
03:09:40,601 --> 03:09:46,561
this pie. This makes me so hungry. Um, but
at the end, it's important to pick the right
2012
03:09:46,561 --> 03:09:52,460
chart. Because you want to have a useful visualization
of your data. If you're trying to look for
2013
03:09:52,460 --> 03:09:57,150
a distribution, choose the right kind of visualization,
the right kind of graph. If you want to instead
2014
03:09:57,150 --> 03:10:02,061
look for trends over time, you've got to choose
the right kind of graph. So I gave you some
2015
03:10:02,061 --> 03:10:08,931
pointers on how to do that. And now my mouth
is watering. So I'm gonna go eat some pie.
2016
03:10:08,931 --> 03:10:18,061
Yoo hoo, it's Monica Wahi again, your statistics
lecturer from Labouré College. I decided to
2017
03:10:18,061 --> 03:10:24,771
chop up chapter two and reconfigure it. So
this first lecture is going to be on part
2018
03:10:24,771 --> 03:10:33,400
of chapter 2.1, frequency tables, and the
entire chapter 2.3, which is stem and leaf
2019
03:10:33,400 --> 03:10:39,931
displays. So here are your learning objectives
for this lecture. At the end of this lecture,
2020
03:10:39,931 --> 03:10:46,051
you should be able to state the steps for
making a frequency table, define class, upper
2021
03:10:46,051 --> 03:10:51,570
class limit and lower class limit, you should
be able to explain what relative frequency
2022
03:10:51,570 --> 03:10:56,601
is and why it's useful for comparing groups.
Also, you should be able to state the steps
2023
03:10:56,601 --> 03:11:02,120
for making a stem and leaf display. And finally,
you should be able to describe the difference
2024
03:11:02,120 --> 03:11:08,540
between an ordered and unordered leaf. And if
all that sounds foreign to you, don't worry,
2025
03:11:08,540 --> 03:11:14,460
you'll understand it all at the end of this
lecture. So just to introduce what I'm going
2026
03:11:14,460 --> 03:11:19,660
to cover, first, I'm going to define for you
what a frequency table actually is. And then
2027
03:11:19,660 --> 03:11:23,540
I'll explain to you how to make one which
will help you understand even better what
2028
03:11:23,540 --> 03:11:29,830
it is. After that I'm jumping right into what
a stem and leaf display is, and how to make
2029
03:11:29,830 --> 03:11:34,780
one of those. And the main reason why I can
combine these is because I feel like stem
2030
03:11:34,780 --> 03:11:40,390
and leaf displays can help you make frequency
tables. That connection was not really made
2031
03:11:40,390 --> 03:11:46,120
in the book. So I'm making it here. So let's
just start with the frequency table. So what
2032
03:11:46,120 --> 03:11:51,391
is one of those? Well, you know, when I think
of frequency, I think of the radio, right?
2033
03:11:51,391 --> 03:11:56,921
Like I think of R.E.M., "What's the Frequency,
Kenneth?" I think that was their last hit. Okay,
2034
03:11:56,921 --> 03:12:01,840
that's not what we're talking about. We're
talking about frequency, like the word frequently,
2035
03:12:01,840 --> 03:12:07,470
like how frequently do you go to work per
week, right. And you would count how many
2036
03:12:07,470 --> 03:12:10,690
times you go to work or go to class per week?
2037
03:12:10,690 --> 03:12:11,690
Well, frequency
2038
03:12:11,690 --> 03:12:17,660
is, like frequently, it's like how frequent
something happens. So first, I'm going to
2039
03:12:17,660 --> 03:12:22,410
explain to you what a frequency table is,
and why you make them, then I'm going to define
2040
03:12:22,410 --> 03:12:26,830
some more terms, I just defined frequency,
I'm going to just define some more that you're
2041
03:12:26,830 --> 03:12:30,761
going to need to know. And then I'm going
to explain the steps for making a frequency
2042
03:12:30,761 --> 03:12:40,780
table and a relative frequency table. So, remember
quantitative data? I'll just remind you: qualitative
2043
03:12:40,780 --> 03:12:47,510
data are categorical. So that's like gender,
race, diagnosis, where you put individuals
2044
03:12:47,510 --> 03:12:53,750
into categories. And quantitative data are
numerical. Remember, like age, heart rate,
2045
03:12:53,750 --> 03:12:58,460
blood pressure. Now, I just want to calibrate
you to the idea that this whole frequency
2046
03:12:58,460 --> 03:13:05,610
table thing, this, this whole thing is about
quantitative data. And so this entire lecture
2047
03:13:05,610 --> 03:13:12,740
actually is focusing only on quantitative
data and not qualitative data. Alrighty. So
2048
03:13:12,740 --> 03:13:17,470
when you have quantitative data, as you probably
noticed, if you've ever had it, right, like,
2049
03:13:17,470 --> 03:13:22,780
let's say that you, let's say you go on Yelp,
you know, I always give that example. And
2050
03:13:22,780 --> 03:13:27,740
you tried to decide whether to go to a restaurant
or not. You have a bunch of fives, and fours
2051
03:13:27,740 --> 03:13:32,480
and threes and twos and one stars. How do
you know? You know, you just have a pile of
2052
03:13:32,480 --> 03:13:37,200
numbers. So how do you organize them? I'm
going to give you like a totally fake example
2053
03:13:37,200 --> 03:13:42,980
I made up. Okay, so I'm pretending that 60
patients were studied for the distance they
2054
03:13:42,980 --> 03:13:47,511
needed to be transported in an ambulance.
So how far they needed to be transported from
2055
03:13:47,511 --> 03:13:52,891
where they called the ambulance and were picked
up and actually got to the hospital. So the
2056
03:13:52,891 --> 03:13:59,090
shortest transport in my fake data, or the
minimum was one mile, which is awesome. That's
2057
03:13:59,090 --> 03:14:02,420
kind of what happens to me because I live
right near a hospital, hopefully, I don't
2058
03:14:02,420 --> 03:14:07,341
need to be in an ambulance very often. But
that's what happens in urban centers, the
2059
03:14:07,341 --> 03:14:12,710
longest transport the maximum was 47 miles,
which would really suck. And I just want to
2060
03:14:12,710 --> 03:14:18,160
point that out that happens to people in the
rural areas because of lack of access. So
2061
03:14:18,160 --> 03:14:22,311
this is kind of realistic, even though it's
fake data. But anyway, it's hard to just look
2062
03:14:22,311 --> 03:14:27,990
at a pile of numbers. So how do we understand
these data? Well, now I'm going to start those
2063
03:14:27,990 --> 03:14:33,910
definitions. The word class means the interval
in the data. So, remember, we're talking
2064
03:14:33,910 --> 03:14:39,490
quantitative data. So let's say I just made
up well, how many people got transported between
2065
03:14:39,490 --> 03:14:47,720
30 and 40 miles, okay. That would be a class
of 30 to 40, right. And the class limit is
2066
03:14:47,720 --> 03:14:52,860
the lowest and highest value that can fit
in the class. So carrying on with my example
2067
03:14:52,860 --> 03:14:58,731
of a class I just randomly picked 30 to 40.
If we made that a class we would say 30 would
2068
03:14:58,731 --> 03:15:05,221
be the lower class limit, and 40 would be
the upper class limit. Make sense? Alrighty.
2069
03:15:05,221 --> 03:15:10,171
So then, of course, you have the width of
the class or the class width. So that's how
2070
03:15:10,171 --> 03:15:15,920
wide the class is. So carrying on with the
example, if the upper class limit was 40,
2071
03:15:15,920 --> 03:15:21,550
and the lower class limit was 30, what you
do is you minus 30 from 40, and you get
2072
03:15:21,550 --> 03:15:26,450
10. And then you add one, and it equals 11.
That's a little formula. But if you're like
2073
03:15:26,450 --> 03:15:32,591
me, and you count on your fingers, you would
go 30, 31, 32, 33, 34, blah, blah, blah, and you'd
2074
03:15:32,591 --> 03:15:39,900
realize that there are 11 numbers in that.
Now we get to frequency, like I sort of quickly
2075
03:15:39,900 --> 03:15:46,640
explained, and that is how many values from
the data fall in the class. So how many patients
2076
03:15:46,640 --> 03:15:52,771
were transported 30 to 40 miles. Or another
way of saying it is, if you look in all the
2077
03:15:52,771 --> 03:15:59,630
data you have, and you find every single person
that either got 30, 31, 32, 33, blah, blah, blah,
2078
03:15:59,630 --> 03:16:06,880
up to 40, count all those people up, then
you will get the frequency for that class.
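Here is a minimal Python sketch of those two definitions, class width and frequency, using the 30 to 40 class. The list of distances is hypothetical (it is not the 60 fake patients), and both function names are just illustrative.

```python
def class_width(lower_limit, upper_limit):
    """Number of whole-number values from the lower to the upper limit, inclusive."""
    return upper_limit - lower_limit + 1


def class_frequency(data, lower_limit, upper_limit):
    """How many data values fall inside the class, limits included."""
    return sum(1 for x in data if lower_limit <= x <= upper_limit)


# Hypothetical transport distances in miles, made up for this sketch.
miles = [1, 5, 12, 30, 33, 40, 41, 47, 29, 36]

print(class_width(30, 40))             # upper minus lower, plus one
print(class_frequency(miles, 30, 40))  # patients transported 30 to 40 miles
```

Both limits count as inside the class, which is why the width needs the plus one.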
2079
03:16:06,880 --> 03:16:14,271
Okay, but you probably realize you do need
to decide on classes before you go counting
2080
03:16:14,271 --> 03:16:19,160
frequencies, because you need to know the
lower and upper class limits. So let's talk
2081
03:16:19,160 --> 03:16:24,521
about some rules about classes. First of all,
classes have to be the same width. You can't
2082
03:16:24,521 --> 03:16:30,761
have 30 to 40, and then 40 to 42, right, or
41 to 42, right? You can't have skinny class,
2083
03:16:30,761 --> 03:16:37,561
fat class, they have to have the same width.
But, um, there are different ways to pick
2084
03:16:37,561 --> 03:16:44,721
it, right? So, class width can be determined
empirically. Isn't that a fancy word? Empirically
2085
03:16:44,721 --> 03:16:50,370
just means you just choose it because you
like it, right. And if you ever look at survey
2086
03:16:50,370 --> 03:16:52,710
data, about just about anything, when they
2087
03:16:52,710 --> 03:16:59,780
look at the quantitative variable of age,
they often put that in classes. And as you'll
2088
03:16:59,780 --> 03:17:06,440
see on the slide, these are the classes we
often see: 18 to 24, 25 to 34, 35 to 44. And you
2089
03:17:06,440 --> 03:17:10,120
can go on, right, like, that's what you normally
see. And that means, empirically, you just
2090
03:17:10,120 --> 03:17:15,990
picked it out of the hat. And already, you're
probably noticing, well, 18 to 25, or 18 to
2091
03:17:15,990 --> 03:17:17,090
24, and 65 and
2092
03:17:17,090 --> 03:17:18,360
older, those classes
2093
03:17:18,360 --> 03:17:22,970
aren't really equal to the ones in the middle,
right? Like, what's the upper class limit
2094
03:17:22,970 --> 03:17:29,140
for 65 and older? Okay, well, that's just
normally what happens in the world, and especially
2095
03:17:29,140 --> 03:17:35,181
in healthcare. In healthcare, when you pick
classes. Even though the classes are technically
2096
03:17:35,181 --> 03:17:39,890
supposed to be the same width, you really
should be guided by the scientific literature.
2097
03:17:39,890 --> 03:17:46,650
And you'll see why later, when I show you
the other videos in this chapter. It's because
2098
03:17:46,650 --> 03:17:52,170
you really want to be able to compare whatever
you find to whatever other people have found
2099
03:17:52,170 --> 03:17:55,610
before you. And therefore you don't want to
cut up your classes in different ways, or
2100
03:17:55,610 --> 03:18:03,641
it's hard to compare them. However, in the
book, they teach this class width formula,
2101
03:18:03,641 --> 03:18:09,830
so I thought I should really show you that,
too. So here's the class width formula that
2102
03:18:09,830 --> 03:18:15,370
I don't really see used much in healthcare
statistics, but I'm going to teach you anyway.
2103
03:18:15,370 --> 03:18:19,521
So this is the formula. First you calculate
this number, you find the maximum in your
2104
03:18:19,521 --> 03:18:24,040
data, and the minimum in your data,
and you subtract the minimum from the maximum.
2105
03:18:24,040 --> 03:18:29,671
So the example I was giving from the fake
data about the transport is 47 was a maximum,
2106
03:18:29,671 --> 03:18:35,021
and the minimum was one. So I did the first
step and got 46. Okay, looking back into the
2107
03:18:35,021 --> 03:18:40,301
formula, you divide whatever you got there
by the number of classes desired. In other
2108
03:18:40,301 --> 03:18:45,841
words, like however many, you know, categories
you want, right. So you never want too
2109
03:18:45,841 --> 03:18:53,230
many, like you don't want 10 or something,
you know, 3, 4, 5, 6, 7, usually something in that
2110
03:18:53,230 --> 03:18:58,601
range is a good number of classes. So let's
pick six just for fun. So we'll take that
2111
03:18:58,601 --> 03:19:05,141
46 number we got, we divide it by six, and we
get 7.7. Then back to the formula slide, how
2112
03:19:05,141 --> 03:19:10,681
you decide your class width is you increase
this number up to the next whole number.
2113
03:19:10,681 --> 03:19:14,080
Now a lot of people are confused by that,
because even if I've gotten something like
2114
03:19:14,080 --> 03:19:20,771
low, like 7.1, I'd still go up to eight. You
have to increase it up to the next whole number.
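If you wanted to try the class width formula in code, a minimal sketch in Python might look like this. The function name `class_width` is made up for illustration, and the numbers are the fake transport data from this example (maximum 47, minimum 1, six classes). One assumption to flag: when the division already lands exactly on a whole number, this sketch still goes up by one, which is one reading of "increase to the next whole number."

```python
import math

def class_width(maximum, minimum, num_classes):
    """Class width formula: (max - min) / number of classes, increased to the next whole number."""
    raw = (maximum - minimum) / num_classes
    # Always go up, even from something low like 7.1.
    # Assumption: an exact whole number (say 8.0) is also bumped up, to 9.
    return math.floor(raw) + 1

# Fake transport data from the lecture: 46 / 6 is about 7.7, so the width is 8.
print(class_width(47, 1, 6))
```

The same function can be pointed at any other maximum, minimum, and number of classes you pick.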
2115
03:19:20,771 --> 03:19:25,400
So you have this integer, you know,
that's a number without any decimals after
2116
03:19:25,400 --> 03:19:30,061
it. So you have this integer for your class
width, so our class width in this example then
2117
03:19:30,061 --> 03:19:40,601
would be eight. So now I've described to
you that whole class width formula, but I'm not going
2118
03:19:40,601 --> 03:19:45,351
to use it in the example because we don't
really do that much in healthcare and it makes
2119
03:19:45,351 --> 03:19:50,220
it actually kind of hard to understand because
you want something that's a little intuitive,
2120
03:19:50,220 --> 03:19:57,110
like if you look on the slide right now, you
know, less than 20 miles, 21 to 29, 30 to 39,
2121
03:19:57,110 --> 03:20:04,101
and then 40 or more, that makes a little more
sense in your head. You know, that's how we
2122
03:20:04,101 --> 03:20:11,570
think of miles. If I had put like 18 to 24,
and 25 to 29, you know, we don't really think
2123
03:20:11,570 --> 03:20:16,340
that way. So this is helpful in healthcare
to boil it down to something like this. And
2124
03:20:16,340 --> 03:20:20,351
by the way, if I was writing a real paper
with this sort of real data, I'd be looking at
2125
03:20:20,351 --> 03:20:24,430
the papers before this that talked about transport
times and looking at those
2126
03:20:24,430 --> 03:20:26,360
class limits. Okay,
2127
03:20:26,360 --> 03:20:31,431
so a frequency table displays each class,
along with the frequency, the number of data
2128
03:20:31,431 --> 03:20:35,250
points in each class, as you can see, the
class limits are on the left side of the simple
2129
03:20:35,250 --> 03:20:40,450
frequency table, you know, the classes, and
then the frequencies on the right side, right.
2130
03:20:40,450 --> 03:20:46,580
And you'll notice that they all add up to
60, because we measured 60 fake patients,
2131
03:20:46,580 --> 03:20:50,860
and it's really good to do that little check.
Because you don't want to double count people
2132
03:20:50,860 --> 03:20:56,591
and put them in two classes; they only get to
be in one, etc. So selecting arbitrary class
2133
03:20:56,591 --> 03:21:00,671
limits can make the frequency table unbalanced.
So in other words, doing this empirical thing
2134
03:21:00,671 --> 03:21:07,480
can make it sort of weird because less than
20 is big, and 40 or more miles is big. And
2135
03:21:07,480 --> 03:21:12,660
they're bigger than the other classes. So it
kind of breaks the rules of class
2136
03:21:12,660 --> 03:21:17,490
width, but not following the scientific literature
can make your results not comparable, and
2137
03:21:17,490 --> 03:21:23,190
can make the science less useful. And so that's
why I sort of rail against the book with
2138
03:21:23,190 --> 03:21:31,790
this class width formula thing. So I'm
going to just give you another example for
2139
03:21:31,790 --> 03:21:37,740
a frequency table. Okay. This one is more,
it's also healthcare-y, you know, glucose
2140
03:21:37,740 --> 03:21:43,050
is measured in the blood and expressed in
milligrams per 100 milliliters, right? So glucose
2141
03:21:43,050 --> 03:21:47,800
is a huge molecule, and it should be cleared
from the blood, especially when fasting. So if
2142
03:21:47,800 --> 03:21:51,790
you're not eating anything, you're not putting
any glucose in, your body's supposed to be like
2143
03:21:51,790 --> 03:21:57,820
metabolizing it. The problem is some people
don't metabolize glucose very well, you know,
2144
03:21:57,820 --> 03:22:02,740
that's what diabetes is. So you care
about how much glucose is sitting around in people's
2145
03:22:02,740 --> 03:22:03,740
blood.
2146
03:22:03,740 --> 03:22:04,740
So blood
2147
03:22:04,740 --> 03:22:09,490
glucose levels for a random sample of 70
women were recorded after a 12-hour fast.
2148
03:22:09,490 --> 03:22:15,420
And this is what they got, they got the minimum
was 45, the maximum was 109. And they picked
2149
03:22:15,420 --> 03:22:26,021
six classes. So this is how they set up their
class limits. And again, this is using a class
2150
03:22:26,021 --> 03:22:30,740
width formula. And just to demonstrate, you
know, it sort of comes out a little weird
2151
03:22:30,740 --> 03:22:38,431
here. But then they got these frequencies,
okay. And this is again, just another example,
2152
03:22:38,431 --> 03:22:42,961
using this time the class width formula to
get our six classes and to make sure that
2153
03:22:42,961 --> 03:22:48,180
they covered everybody. Now, you'll notice
in this, we start with the minimum like 45
2154
03:22:48,180 --> 03:22:52,340
to 55. And we end with the maximum, which
is up to 110. And that's really the clearest
2155
03:22:52,340 --> 03:22:57,870
way to do it. It's just not typically done
that way. If you read, like scientific literature
2156
03:22:57,870 --> 03:23:05,641
in healthcare, you just don't see these frequency
tables labeled like that. So and just to wrap
2157
03:23:05,641 --> 03:23:11,681
up this part, make sure all of your data points
are accounted for only once in one of the
2158
03:23:11,681 --> 03:23:17,580
classes. So whether you use a class width formula,
or you use empirical or arbitrarily picked
2159
03:23:17,580 --> 03:23:24,521
classes, every single data point only gets
one vote, it can only be in one of the classes.
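To make the "one vote" rule concrete, here is a rough sketch of how you might count a frequency table in Python. The function name `frequency_table` and the twelve glucose-style values are invented for this illustration (they are not the lecture's data), and it assumes whole-number data with inclusive class limits that start at 45 and are 11 wide, like the 45 to 55 and 56 to 66 classes on the slide.

```python
def frequency_table(data, lower, width, num_classes):
    """Count how many data points land in each class (inclusive whole-number limits)."""
    classes = [(lower + i * width, lower + (i + 1) * width - 1)
               for i in range(num_classes)]
    table = {limits: 0 for limits in classes}
    for value in data:
        for low, high in classes:
            if low <= value <= high:
                table[(low, high)] += 1
                break  # each data point gets only one vote
    return table

# Hypothetical values, made up for this sketch.
glucose = [45, 52, 60, 66, 71, 77, 80, 85, 88, 93, 101, 109]
table = frequency_table(glucose, lower=45, width=11, num_classes=6)
for (low, high), freq in table.items():
    print(f"{low} to {high}: {freq}")

# The little check: the frequencies have to add up to the number of data points.
assert sum(table.values()) == len(glucose)
```

If a value fell outside every class, the final check would fail, which is exactly the kind of mistake it is there to catch.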
2160
03:23:24,521 --> 03:23:29,311
And also, you don't want to leave any
of the data points out. So you want to make
2161
03:23:29,311 --> 03:23:32,931
sure that happens, that you account for
all of them. And also you need to make sure
2162
03:23:32,931 --> 03:23:37,330
your classes cover all the data, right? In
healthcare, when we do that thing, up to 20,
2163
03:23:37,330 --> 03:23:42,930
and 65 and over, all that stuff, we cause
that to happen. However, if you're going to
2164
03:23:42,930 --> 03:23:47,410
use a class width formula, you really have
to pay attention to where your minimum and
2165
03:23:47,410 --> 03:23:52,680
your maximum are. Because then you want to
make sure all of your classes cover all of
2166
03:23:52,680 --> 03:23:59,300
your data. And like I mentioned, make sure
the total of your classes, of the frequencies
2167
03:23:59,300 --> 03:24:03,240
in your classes adds up to the total number
of data points, it's just a little check,
2168
03:24:03,240 --> 03:24:10,391
make sure you didn't do something wrong. Now
I'm going to talk about what is a relative
2169
03:24:10,391 --> 03:24:15,561
frequency table. And that builds on what you
already just learned about frequency. So we
2170
03:24:15,561 --> 03:24:21,140
all know what our relatives are. They're like
our family, right? We have relationships with
2171
03:24:21,140 --> 03:24:28,370
them. And so what relative means is in relationship
to the rest of the data, okay? So in statistics,
2172
03:24:28,370 --> 03:24:35,330
they often use this fancy F to stand for frequency.
And, as I've mentioned before, the sample
2173
03:24:35,330 --> 03:24:42,120
size, if you have a sample, they use a lowercase
n. So what they use as the formula for relative
2174
03:24:42,120 --> 03:24:49,061
frequency is F divided by n. And if you're
clever with math, you realize what that means
2175
03:24:49,061 --> 03:24:55,220
is, if you take the frequency of any of the
classes, you know, it's just a portion of
2176
03:24:55,220 --> 03:25:00,630
the whole sample, and you divide it by the
total sample, which is that n, you'll get
2177
03:25:00,630 --> 03:25:07,690
the proportion of values that are in that
class, it's not really that fancy. So relative
2178
03:25:07,690 --> 03:25:11,511
frequency is something very useful to put
in a frequency table. So you'll see that I
2179
03:25:11,511 --> 03:25:16,380
kind of crammed it in on the right side;
this is the old frequency table I just showed
2180
03:25:16,380 --> 03:25:23,190
you with glucose, but I crammed in this relative
frequency next to it. So it's super easy to
2181
03:25:23,190 --> 03:25:29,160
calculate, like, for example, for the first
one, see, 45 to 55, the frequency is three,
2182
03:25:29,160 --> 03:25:34,390
what did I do? Pull out the old calculator?
Well, actually I used Excel. And I did three
2183
03:25:34,390 --> 03:25:40,070
divided by 70, because that was the total. And
I got 0.04. And those of you who don't
2184
03:25:40,070 --> 03:25:44,391
really like proportions, you can do that thing
where you move the decimal two places to the
2185
03:25:44,391 --> 03:25:50,061
right, and then put a percent sign. So that
would be like 4% of those 70 people are in
2186
03:25:50,061 --> 03:25:55,261
that first class. And then the same thing
happened with the next one, I took, you know,
2187
03:25:55,261 --> 03:26:03,400
the 56 to 66, I took seven divided by 70,
which came out to 0.10. And those of you into
2188
03:26:03,400 --> 03:26:08,320
percents, I'm really into percents, I like
moving that decimal over, I think of it as
2189
03:26:08,320 --> 03:26:14,150
10%, then. But whatever, as you can see at
the bottom, it all has to equal 1.0 if you
2190
03:26:14,150 --> 03:26:19,590
like proportion land, or 100% if you're
like me and you like percent land. But in
2191
03:26:19,590 --> 03:26:24,271
any case, this is all you have to do to do
the relative frequency table, you just make
2192
03:26:24,271 --> 03:26:30,811
another column and do all those calculations.
And it's super easy to calculate it. And it's
2193
03:26:30,811 --> 03:26:35,720
very helpful. So why did we even do this,
because we had a pile of
2194
03:26:35,720 --> 03:26:41,061
quantitative data, and it was really hard
to organize, right? And the first thing
2195
03:26:41,061 --> 03:26:46,061
we had to do was select class width. And I
talked about the politics behind that. But
2196
03:26:46,061 --> 03:26:49,940
ultimately, whatever you do, the
lower and the upper class limits need to be
2197
03:26:49,940 --> 03:26:55,330
determined and put in the first column of
your frequency table. Then in your second
2198
03:26:55,330 --> 03:27:00,101
column, which are the frequencies, you count
up, how many are in that class, and you fill
2199
03:27:00,101 --> 03:27:05,851
it in. And then if you make that third column,
then you can do that dividing thing and get
2200
03:27:05,851 --> 03:27:10,801
your relative frequencies. And that's great.
That's how you build your frequency table.
2201
03:27:10,801 --> 03:27:17,180
And as I go through future lectures, you'll
see even more why you would make that table
2202
03:27:17,180 --> 03:27:22,190
like how useful that can be. Given that you
have quantitative data, and it kind of gets
2203
03:27:22,190 --> 03:27:27,500
all over the place, it's very helpful to organize
it in that table.
2204
03:27:27,500 --> 03:27:28,500
Now I'm going to
2205
03:27:28,500 --> 03:27:32,311
move on to talk about the stem and leaf. And
the reason why I picked talking about it
2206
03:27:32,311 --> 03:27:38,973
now is because it's on the theme of organizing
quantitative data. So I'm going to talk to
2207
03:27:38,973 --> 03:27:44,750
you about what the stem and leaf plot actually
is. Here's just an example on the slide,
2208
03:27:44,750 --> 03:27:50,920
and how you make one. And why you might
make one of these you'll find it feels a lot
2209
03:27:50,920 --> 03:27:55,910
like making a frequency table. But why do
you make these instead of a frequency table?
2210
03:27:55,910 --> 03:28:03,120
And it's just more food for thought. So first,
one of the things that I got hung up on when
2211
03:28:03,120 --> 03:28:09,280
I took biostatistics is I could not get over
the fact that it was called a stem and leaf.
2212
03:28:09,280 --> 03:28:14,720
So I had to understand that. So this is an
example of a stem and leaf there. So why is
2213
03:28:14,720 --> 03:28:19,960
it called a stem and leaf? Well, there's always
the stem. And so, see these corn stalks,
2214
03:28:19,960 --> 03:28:25,240
I'm from Minnesota, I'm used to seeing them,
you'll notice that there's a stem, right,
2215
03:28:25,240 --> 03:28:30,211
like this big corn stalk has the stem. That
thing you see that vertical line and a bunch
2216
03:28:30,211 --> 03:28:36,181
of numbers on the left, that part of the stem
and leaf plot is called the stem. And then
2217
03:28:36,181 --> 03:28:40,940
leaves are added onto the stem as we tally
up the length of the leaves. And that may
2218
03:28:40,940 --> 03:28:45,061
not make much sense right now, but I'll show
you how to make one. But essentially, what
2219
03:28:45,061 --> 03:28:50,851
you end up doing is adding these leaves. Like
you see under two, there's a little leaf that
2220
03:28:50,851 --> 03:28:55,660
just has a zero on it. But if you see under
five, there's this big long leaf with a whole
2221
03:28:55,660 --> 03:29:02,090
bunch of numbers off of it. So making
one will help you understand this terminology.
2222
03:29:02,090 --> 03:29:07,311
But I first wanted to just show you this picture
because it's actually kind of hard to understand
2223
03:29:07,311 --> 03:29:12,090
what's going on with a stem and leaf unless you
understand that that vertical line and the
2224
03:29:12,090 --> 03:29:16,331
numbers to the left of it is considered a
stem. And then each one of these things we
2225
03:29:16,331 --> 03:29:21,271
build off of, you know, each of
those numbers is called a leaf. So people
2226
03:29:21,271 --> 03:29:29,910
talk about the four leaf and the five leaf.
Alrighty. Okay, so again, I'm just so into
2227
03:29:29,910 --> 03:29:36,811
making up data, right? So I decided to make
up data from 42 patients who visited a primary
2228
03:29:36,811 --> 03:29:41,800
care clinic and were referred to mental health.
Now the reason why I made up data on this subject
2229
03:29:41,800 --> 03:29:45,891
is I'm very upset about this subject. I think
people are waiting too long to get mental
2230
03:29:45,891 --> 03:29:52,180
health treatment. Especially if you've been
following the news about the Veterans Administration
2231
03:29:52,180 --> 03:29:57,061
in the US. A lot of people are put on hold
even for primary care. You know, they're put
2232
03:29:57,061 --> 03:30:01,601
on waiting lists, and I don't like that. So I made
fake data about that as a demonstration, just
2233
03:30:01,601 --> 03:30:07,660
to highlight these issues. Okay, so what
data did I make up? I made up the number
2234
03:30:07,660 --> 03:30:13,950
of days between the referral and their first
mental health appointment. That was what was
2235
03:30:13,950 --> 03:30:19,800
collected. So let's say you go in on January
1, and you get a referral. And then 10 days
2236
03:30:19,800 --> 03:30:24,120
later, you actually show up at the clinic,
then that would be 10. Right? That would be
2237
03:30:24,120 --> 03:30:30,400
your value. So that's quantitative. So let's
take a look at it. So on the right side of
2238
03:30:30,400 --> 03:30:37,071
the slide, you see just this pile of numbers
from all these people that came in and
2239
03:30:37,071 --> 03:30:40,490
then got a referral. So like, you look at
the first person, who had to wait a
2240
03:30:40,490 --> 03:30:41,490
month
2241
03:30:41,490 --> 03:30:47,440
to go see a mental health professional. But if
you look, you know, the third one, and that
2242
03:30:47,440 --> 03:30:52,390
person only needed 12 days. So that's how
you sort of consume this fake data I made.
2243
03:30:52,390 --> 03:30:57,050
And then you'll see over on the left
side, I already made a stem. It's blank, it
2244
03:30:57,050 --> 03:31:00,390
doesn't have any numbers on it, but I knew
I'd need that vertical line. So I just made
2245
03:31:00,390 --> 03:31:04,480
that in preparation. Okay, so let's build
our stem and leaf.
2246
03:31:04,480 --> 03:31:08,830
So what we do is we start with the first number,
and that's what's awesome about this is you
2247
03:31:08,830 --> 03:31:12,521
just start with the first number. And if you
want, you can kind of cross them out as you
2248
03:31:12,521 --> 03:31:13,521
go along
2249
03:31:13,521 --> 03:31:17,580
to keep track. So we start with this first
number. And you'll see what I did, I went
2250
03:31:17,580 --> 03:31:22,900
over to the stem, and I put the three on the
left side of the stem and the zero on the
2251
03:31:22,900 --> 03:31:25,710
right, this begins the three leaf,
2252
03:31:25,710 --> 03:31:26,710
okay.
2253
03:31:26,710 --> 03:31:33,670
Here's the next number. Now, I put the two
above the three because it's like right before
2254
03:31:33,670 --> 03:31:38,200
it and you can kind of imagine we're gonna
walk down like 2, 3, 4, 5, 6. And then I put the seven
2255
03:31:38,200 --> 03:31:45,220
on the right side to start the two leaf.
Alrighty, here we are with the next number,
2256
03:31:45,220 --> 03:31:50,420
which is 12. And as you'll see, I started
the one leaf, you're starting to see the pattern,
2257
03:31:50,420 --> 03:31:54,730
right? And you can probably guess what's going
to happen next, we start the four leaf and
2258
03:31:54,730 --> 03:32:01,462
put the two there. Okay, our next leaf, we've
already started, right, for 35. So what do
2259
03:32:01,462 --> 03:32:07,980
we do there? Well, we just add the five on
to the three leaf, the three leaf was already
2260
03:32:07,980 --> 03:32:15,660
started with that 30 at the beginning,
so we just pile a five on there. Here's 47,
2261
03:32:15,660 --> 03:32:21,510
we just pile a seven on there. Now you'll
notice I tried to line up that seven on the
2262
03:32:21,510 --> 03:32:25,440
four leaf with the five on the three leaf.
When you're doing this by hand, well, even
2263
03:32:25,440 --> 03:32:29,811
when you're not doing it by hand, you really
have to keep those things lined up or you
2264
03:32:29,811 --> 03:32:34,690
won't have a good stem and leaf. Okay.
Now I'm going to just fast forward a little
2265
03:32:34,690 --> 03:32:40,061
because you can probably imagine
how to do the next row, the 38 and 36. You just
2266
03:32:40,061 --> 03:32:44,811
keep piling it on. But I want to show you
what happens when you get to the special case
2267
03:32:44,811 --> 03:32:51,140
here. Okay, well, we'll go with this 29. This
is the last thing before the special case.
2268
03:32:51,140 --> 03:32:57,260
So you'll notice that 38 got put in there,
see that eight on the three leaf; that 36 got put
2269
03:32:57,260 --> 03:33:01,150
in there, you know from the second row, see,
we put everything in there. And now we put
2270
03:33:01,150 --> 03:33:06,840
in the 29. Look at that, we got a three after
that. That's our next one. So where are we
2271
03:33:06,840 --> 03:33:11,200
gonna put that three? And, you know, you
might think on the three leaf, but that's not right,
2272
03:33:11,200 --> 03:33:16,690
right? Because that's 30 something. So where
do you put the three? Well, some of you figured
2273
03:33:16,690 --> 03:33:23,500
this out, you have to add a zero onto your
stem. So look at that, I put that zero there
2274
03:33:23,500 --> 03:33:28,200
and then we put the three in. And then you
can already guess how to do the 21. Next,
2275
03:33:28,200 --> 03:33:33,730
we'll just tack a one on to the two leaf. But
then when we get to the next zero, we just
2276
03:33:33,730 --> 03:33:41,291
add a zero on to the zero leaf.
2277
03:33:41,291 --> 03:33:47,010
So you can probably figure out how to pile
up all of these. But I did want to talk to
2278
03:33:47,010 --> 03:33:53,090
you about something else that happens with
these stem and leafs. As you go on adding to the
2279
03:33:53,090 --> 03:33:58,240
leaf, you got to be careful because you might
end up with a situation where you got something
2280
03:33:58,240 --> 03:34:03,671
big. Now, I really feel sorry for this fake
person. 51 days for a mental health appointment
2281
03:34:03,671 --> 03:34:09,521
that's too long, right? But it causes us later
to have to add a five.
2282
03:34:09,521 --> 03:34:14,010
Now this can cause real estate problems, especially
on a piece of paper. You know, what if
2283
03:34:14,010 --> 03:34:18,080
the four was right at the bottom of the paper,
right, it's kind of hard, maybe you have to
2284
03:34:18,080 --> 03:34:24,540
tape some paper at the bottom. I have this
problem a lot. You'll see here, I
2285
03:34:24,540 --> 03:34:30,280
even had to move this up on the slide when
we got later to the 70 and I'd add the seven leaf.
2286
03:34:30,280 --> 03:34:35,340
Now I just want to show you for some reason
in this data we didn't have any 60s. But you
2287
03:34:35,340 --> 03:34:42,290
still have to put that six leaf placeholder in;
that that's got to be there. So even if you
2288
03:34:42,290 --> 03:34:46,790
know as we go on, if we're missing any leaves
in between, we just need the placeholder there
2289
03:34:46,790 --> 03:34:52,880
because that space has to be there. And here's
an outlier. We're gonna learn about
2290
03:34:52,880 --> 03:34:58,610
outliers pretty soon. This is a really long
time, 105 days. This is kind of like VA status,
2291
03:34:58,610 --> 03:35:04,819
right? And of course,
this is fake data, but unfortunately it reflects
2292
03:35:04,819 --> 03:35:10,950
real data. You'll see when we get to 105,
not only did we skip the eight leaf and the
2293
03:35:10,950 --> 03:35:17,710
nine leaf, and we need to leave a space for
them, but 10 becomes the part of the stem.
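Here is a rough Python sketch of the tallying just described, in case you want to build one without taping paper together. The function name `stem_and_leaf` is made up for this illustration, and the list holds only the fake waiting times I read out loud in this walkthrough, not all 42 values. It assumes the data are non-negative whole numbers, with the last digit as the leaf and everything in front of it as the stem.

```python
from collections import defaultdict

def stem_and_leaf(values):
    """Build an unordered stem and leaf: last digit is the leaf, the rest is the stem."""
    leaves = defaultdict(list)
    for value in values:
        stem, leaf = divmod(value, 10)  # 105 -> stem 10, leaf 5; 3 -> stem 0, leaf 3
        leaves[stem].append(leaf)       # pile the leaf on in the order it shows up
    rows = []
    # Walk every stem from smallest to largest so skipped stems keep a placeholder row.
    for stem in range(min(leaves), max(leaves) + 1):
        digits = " ".join(str(leaf) for leaf in leaves.get(stem, []))
        rows.append(f"{stem:>2} | {digits}".rstrip())
    return rows

# Only the values mentioned in the walkthrough (a subset of the fake data).
waits = [30, 27, 12, 42, 35, 47, 38, 36, 29, 3, 21, 0, 51, 70, 105]
for row in stem_and_leaf(waits):
    print(row)
```

Notice the empty rows for stems 6, 8, and 9: those are the placeholders, and the 105 lands on a stem of 10.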
2294
03:35:17,710 --> 03:35:18,710
So
2295
03:35:18,710 --> 03:35:23,000
if we went on to 200, or 300, I mean, that
would be awful. The wait that long, though,
2296
03:35:23,000 --> 03:35:31,521
the first two digits of it, like if we had
365, the 36 of the 365 would be the part of
2297
03:35:31,521 --> 03:35:38,190
the step. Alright, so I just did a little
demonstration to explain certain nuances of
2298
03:35:38,190 --> 03:35:45,300
the stem and leaf that you might encounter in
your life. So now, I'm going to just reflect
2299
03:35:45,300 --> 03:35:50,530
back on the two ways that I've described in
this lecture for you to organize quantitative
2300
03:35:50,530 --> 03:35:57,021
data. First, I showed you how to make a frequency
table. But what you need to do with that one
2301
03:35:57,021 --> 03:36:02,730
is you need to set up classes and class width
and count the frequencies in there.
2302
03:36:02,730 --> 03:36:06,930
There's a lot of preprocessing, a
lot of precalculations. You really want to
2303
03:36:06,930 --> 03:36:11,090
think when you're doing this, and you don't
want to be distracted. However, if you're
2304
03:36:11,090 --> 03:36:16,470
trying to do a stem and leaf, you really can
do that on the fly, you don't need to set
2305
03:36:16,470 --> 03:36:22,521
up classes or class width. As you noticed,
we just went through the line of that pile
2306
03:36:22,521 --> 03:36:28,050
of numbers, and just crossed them off as we
put them onto the stem and leaf. And there
2307
03:36:28,050 --> 03:36:34,480
was really no need to count, you can tally
the data as you go through the list, you know,
2308
03:36:34,480 --> 03:36:40,630
cross it off. And it's just really quicker
to do. Of course, those of you who are pretty
2309
03:36:40,630 --> 03:36:46,010
clever are saying, well, basically you're forcing
in a stem and leaf everything to be in the
2310
03:36:46,010 --> 03:36:52,431
class of, you know, the 10s, right, you know,
the 20s and the 30s and the 40s. That's like
2311
03:36:52,431 --> 03:36:57,140
the two leaf, the three leaf, and the four leaf.
And yeah, it is kind of like a simplified
2312
03:36:57,140 --> 03:37:02,440
way of making those kinds of classes. But
in any case, I just wanted to alert you to
2313
03:37:02,440 --> 03:37:08,840
this because you might see some similarities
between the two. And I wanted to highlight
2314
03:37:08,840 --> 03:37:16,261
those as well as the differences. Now I'm
going to give you a few tricks here, I want
2315
03:37:16,261 --> 03:37:21,711
to tell you about the concept of an unordered
leaf. So an unordered leaf is what we were
2316
03:37:21,711 --> 03:37:27,271
making before when I was demonstrating, it's
just where the numbers are out of order in
2317
03:37:27,271 --> 03:37:31,290
the leaf. Like you'll see this two leaf, it's
7, 7, 2, 9. Well, if it were
2318
03:37:31,290 --> 03:37:37,110
in order, it would say 2, 7, 7, 9, right? Like the two
would come first, before the sevens and the
2319
03:37:37,110 --> 03:37:41,030
nine. And the same with the three leaf, that's
out of order, because you can see that it's
2320
03:37:41,030 --> 03:37:44,240
zero and five is fine, but eight doesn't come
before six
2321
03:37:44,240 --> 03:37:46,470
and five, right? That's no
2322
03:37:46,470 --> 03:37:53,410
problem to make an unordered leaf. However,
after making an unordered version, you can
2323
03:37:53,410 --> 03:37:58,400
rewrite the stem and leaf in an ordered way.
So you see how I did that I rewrote the two
2324
03:37:58,400 --> 03:38:03,460
leaf and the three leaf. And now all
the leaves are in order. Okay, they don't
2325
03:38:03,460 --> 03:38:08,730
have to be but you can do that. And if you
do that, if you make your stem and leaf first
2326
03:38:08,730 --> 03:38:14,590
unordered the way I was demonstrating, then
you rewrite it into ordered, it is way easier
2327
03:38:14,590 --> 03:38:20,960
to count it up to make a frequency table no
matter what classes you choose. Or you can
2328
03:38:20,960 --> 03:38:25,670
just make each leaf a class. And then it's
super easy to make the frequency table. So
2329
03:38:25,670 --> 03:38:31,061
that's why I combined these two pieces of
the chapter together is because I wanted to
2330
03:38:31,061 --> 03:38:38,891
show you how you can use a stem and leaf to
help you make a frequency table. So a stem
2331
03:38:38,891 --> 03:38:44,021
and leaf, it's just another way to organize quantitative
data. And it's easier to make kind of on the
2332
03:38:44,021 --> 03:38:50,050
fly than a frequency table because it requires
less preparation. And they can help you put
2333
03:38:50,050 --> 03:38:57,000
data in order beforehand, like in preparation for
a frequency table, sort of to help you as a
2334
03:38:57,000 --> 03:39:03,300
first step to make sure that you can organize
everything at the end. Remember, I keep
2335
03:39:03,300 --> 03:39:07,100
emphasizing your frequency table has to reflect
all your data points. And they can only be
2336
03:39:07,100 --> 03:39:12,070
in one class, blah, blah. Well this is one
way to make sure that happens is to first
2337
03:39:12,070 --> 03:39:20,440
do this pre organization using an ordered
stem and leaf. So in conclusion, frequency
2338
03:39:20,440 --> 03:39:26,680
tables and stem and leaf displays organize
data, they organize quantitative data. And
2339
03:39:26,680 --> 03:39:30,850
the stem and leaf may help you make a frequency
table. So you might want to start with that.
2340
03:39:30,850 --> 03:39:36,931
And the purpose of both of these things is
to reveal a thing called a distribution. And
2341
03:39:36,931 --> 03:39:43,271
I'm going to explain that in the next lecture.
Hello, it's Monica Wahi again, your lecturer
2342
03:39:43,271 --> 03:39:48,730
from Labouré College, and we are moving on
to chapter 3.1 which is measures of central
2343
03:39:48,730 --> 03:39:54,511
tendency. And here are your learning objectives.
So at the end of this lecture, you should
2344
03:39:54,511 --> 03:39:59,440
be able to explain how to calculate the mean.
You should also be able to describe what a
2345
03:39:59,440 --> 03:40:04,891
mode is and say how many modes a dataset can
have, you should be able to demonstrate how
2346
03:40:04,891 --> 03:40:09,380
to find the median in a set of data with
an odd number of values, as well as in a set
2347
03:40:09,380 --> 03:40:14,220
of data with an even number of values. And
you should also be able to define trimmed mean
2348
03:40:14,220 --> 03:40:19,900
and weighted average. All right, so what's
this measures of central tendency? I'm going
2349
03:40:19,900 --> 03:40:24,580
to explain that, why we kind of call it that.
And then I'm going to talk about them, of which
2350
03:40:24,580 --> 03:40:28,581
the three biggies are mode, median, and mean.
So I'm going to talk about those and explain
2351
03:40:28,581 --> 03:40:33,910
how to get those. Then, towards the end of
the lecture, I'm going to go into some special
2352
03:40:33,910 --> 03:40:39,760
situations. One is called the trimmed mean.
And the second is a weighted average. So let's
2353
03:40:39,760 --> 03:40:44,851
get started. What is the central tendency
thing? Well, if you think about quantitative
2354
03:40:44,851 --> 03:40:49,040
data, and you can only do this with
quantitative data, not qualitative data. But
2355
03:40:49,040 --> 03:40:52,790
when you think of having a pile of numbers
like this, one of the things you want to know
2356
03:40:52,790 --> 03:40:57,511
is how much they tend towards the center.
Now, of course, you don't know where the center
2357
03:40:57,511 --> 03:41:02,430
is, until you start looking at the data. Some
data are kind of high up in the hundreds,
2358
03:41:02,430 --> 03:41:07,720
like systolic blood pressure. I give a five
point quiz in one of my classes, so those
2359
03:41:07,720 --> 03:41:13,131
numbers are low, like 1, 2, 3, 4, 5. But then the
question becomes, do they group towards the
2360
03:41:13,131 --> 03:41:19,360
center of whatever list of data they're in?
Or don't they? How sort of center-y
2361
03:41:19,360 --> 03:41:20,790
are they?
2362
03:41:20,790 --> 03:41:26,250
You see these distributions on the slide?
You'll see, on the left, you'd probably say,
2363
03:41:26,250 --> 03:41:30,262
well, that looks more center-y than what's
on the right, you know, this normal distribution
2364
03:41:30,262 --> 03:41:35,432
on the left, and the skewed right distribution
on the right. And so intuitively, you kind
2365
03:41:35,432 --> 03:41:39,561
of know what I'm talking about. But what this
lecture is going to be about is how to actually
2366
03:41:39,561 --> 03:41:44,881
put numbers on the difference between what
you see on the left and what you see on the
2367
03:41:44,881 --> 03:41:49,220
right. So these are the numbers, these are
the measures of central tendency, we're going
2368
03:41:49,220 --> 03:41:55,570
to go over mode, median, and mean. And the
median is a little different, depending on
2369
03:41:55,570 --> 03:41:59,180
whether you have an odd number of values or
an even number of values. I mean, it means
2370
03:41:59,180 --> 03:42:03,940
the same thing, but you calculate it slightly
differently. So I'll go over that. And then
2371
03:42:03,940 --> 03:42:08,440
the mean, a lot of you already know what a
mean is, but there's a couple special means
2372
03:42:08,440 --> 03:42:13,311
we can make. One is called a trimmed mean, and
another is called weighted average, which
2373
03:42:13,311 --> 03:42:17,410
is a weighted mean, I don't know why they
chose the word average for that one, because
2374
03:42:17,410 --> 03:42:23,290
mean and average mean the same thing. But I'm
going to go over these things. Okay, well,
2375
03:42:23,290 --> 03:42:27,890
let's start with the mode. The mode is the
number in the data set that occurs the most
2376
03:42:27,890 --> 03:42:34,102
frequently. So I put up this little tiny data
set here of just five numbers. And it's obvious
2377
03:42:34,102 --> 03:42:36,120
that then five is the mode, right, because
it
2378
03:42:36,120 --> 03:42:41,990
repeats once; there are two fives there. But look,
I just changed one of them, I changed it to
2379
03:42:41,990 --> 03:42:44,521
a six. And now there's no mode.
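A minimal Python sketch of finding the mode might look like the following. The function name `modes` and both five-number lists are made up to mirror the slide (the actual numbers are not in the transcript); all I know is that one list repeats a five and the other has that five changed to a six. The function returns a list because ties are possible, and an empty list when nothing repeats.

```python
from collections import Counter

def modes(data):
    """Return the value(s) that occur most often, or an empty list if nothing repeats."""
    counts = Counter(data)
    if not counts:
        return []
    top = max(counts.values())
    if top == 1:
        return []  # no value repeats, so there is no mode
    return sorted(value for value, count in counts.items() if count == top)

# Hypothetical five-number data sets, made up to mirror the slide.
print(modes([3, 5, 5, 7, 9]))  # the five repeats
print(modes([3, 5, 6, 7, 9]))  # one five changed to a six
```

With a real pile of numbers, this counting step is the "organize them and count them up" part done for you.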
2380
03:42:44,521 --> 03:42:48,920
So I just want you to know that a lot of data
sets don't even have a mode, there's just
2381
03:42:48,920 --> 03:42:55,271
no repeat at all in them. And that usually
happens when you have a broad range of numbers,
2382
03:42:55,271 --> 03:42:59,300
like with systolic blood pressure,
I mean, it would be kind of lucky if you just
2383
03:42:59,300 --> 03:43:05,380
got two people with the exact same one. But
that can happen. So don't think there's always
2384
03:43:05,380 --> 03:43:11,061
going to be a mode, there might not be one.
It's also possible to have more than one mode,
2385
03:43:11,061 --> 03:43:15,730
like look at that. So I've got six numbers
up there. And the two repeats once and the
2386
03:43:15,730 --> 03:43:22,261
three repeats once. So you've got two modes,
right? But let's say that the three actually
2387
03:43:22,261 --> 03:43:27,350
repeated three times, then it would only be
one mode, because the three threes would trump
2388
03:43:27,350 --> 03:43:28,521
the two twos,
2389
03:43:28,521 --> 03:43:35,540
right? So you can just imagine how confusing
this gets when you got a ton of numbers. What's
2390
03:43:35,540 --> 03:43:40,272
a little less confusing is, um, if, like
I said, you have a broad range of numbers, it would
2391
03:43:40,272 --> 03:43:44,930
be kind of a coincidence, if two patients
had the exact same systolic blood pressure
2392
03:43:44,930 --> 03:43:48,390
or platelet count, you know, like you get
a repeat in there. And then that would be
2393
03:43:48,390 --> 03:43:52,751
the mode. Of course, if you measure a whole
bunch of people, then eventually you're probably
2394
03:43:52,751 --> 03:43:57,601
going to get one. But I just wanted to say
and also, if you look at the slide all those
2395
03:43:57,601 --> 03:44:01,500
numbers, you'd really have to go through and
organize them and count them up and see if
2396
03:44:01,500 --> 03:44:05,830
there is a mode, there probably is one because
we see a lot of repeats. But then which was
2397
03:44:05,830 --> 03:44:10,061
the one that wins that's repeated the most?
Or are there two that are repeated the most,
2398
03:44:10,061 --> 03:44:17,450
and it becomes kind of political when you really
do it. And it's not worth a lot of work, because
2399
03:44:17,450 --> 03:44:23,010
what does the mode tell you? It doesn't really
tell you much. It does tell you the most popular
2400
03:44:23,010 --> 03:44:29,240
answer. The word mode in French means fashion.
So like I put on the slide, you know, à la
2401
03:44:29,240 --> 03:44:33,820
mode, it's in fashion. So it's the one that's
most popular or the most common result, but
2402
03:44:33,820 --> 03:44:40,101
it's not used a lot in healthcare. And it's
actually not used very often. Once in a while,
2403
03:44:40,101 --> 03:44:45,561
I'll say, oh, the mode in the class for my
five point quiz was five, meaning everybody
2404
03:44:45,561 --> 03:44:50,521
did pretty well, they mostly got a five. That
was the most popular result. But you hardly
2405
03:44:50,521 --> 03:44:52,980
ever have to say that. And so
2406
03:44:52,980 --> 03:44:54,850
remember, we learned
2407
03:44:54,850 --> 03:45:00,180
the word resistant, like if a measure is
resistant, you can't whack it out very easily.
2408
03:45:00,180 --> 03:45:04,030
Well, you can change things pretty easily
with the mode, the mode's not resistant, I
2409
03:45:04,030 --> 03:45:08,561
even just demonstrated that on those slides,
by just changing one number, you can erase
2410
03:45:08,561 --> 03:45:14,021
the mode or add a mode or whatever. And so
it's not stable, it's not resistant. And those
2411
03:45:14,021 --> 03:45:18,190
are the kinds of things we don't really like
in healthcare, so we don't really use them.
2412
03:45:18,190 --> 03:45:23,561
So I'll move on to some cooler measures of
central tendency.
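The mode rules described above (no repeats means no mode, ties mean more than one mode, and the most-repeated value wins) can be sketched in Python. This is only an illustrative sketch: the function name `modes` and the example numbers are made up here, since the transcript does not record the actual values on the slides.

```python
from collections import Counter


def modes(values):
    """Return every value tied for the highest count, or [] if nothing repeats."""
    if not values:
        return []
    counts = Counter(values)
    top = max(counts.values())
    if top == 1:
        # No value repeats, so this data set has no mode.
        return []
    return sorted(v for v, c in counts.items() if c == top)


# Hypothetical data sets mirroring the situations in the lecture.
print(modes([1, 3, 5, 5, 8]))     # two fives: one mode
print(modes([1, 3, 5, 6, 8]))     # one five changed to a six: no mode
print(modes([1, 2, 2, 3, 3, 7]))  # 2 and 3 each repeat once: two modes
print(modes([1, 2, 2, 3, 3, 3]))  # three threes beat two twos: one mode
```

Changing a single number flips the result between these cases, which is the lack of resistance the lecture points out.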
2413
03:45:23,561 --> 03:45:28,690
And here's a really cool one, which is called
the median. And it's the middle of the data.
2414
03:45:28,690 --> 03:45:35,171
And I'll explain that a little bit more what
we mean by the center of the data. Okay, so
2415
03:45:35,171 --> 03:45:39,811
remember, we're talking about quantitative
data. So you've got some pile of numbers,
2416
03:45:39,811 --> 03:45:43,870
it doesn't matter, you can always sort them
in order of lowest to highest. And I keep
2417
03:45:43,870 --> 03:45:47,290
talking about this five point quiz I give
in my class. It's an easy quiz. And most
2418
03:45:47,290 --> 03:45:51,930
people get fives. But even so somebody gets
a four usually, or somebody doesn't show up
2419
03:45:51,930 --> 03:45:55,471
for the quiz, and they get a zero. And so
it doesn't matter, I can have 100 people in
2420
03:45:55,471 --> 03:45:59,830
the class, I still could put all of those
numbers in order of lowest to highest, even
2421
03:45:59,830 --> 03:46:05,010
if most of them were fives. Because you'll
get repeats in your data sometimes, right.
2422
03:46:05,010 --> 03:46:08,420
And also, sometimes you'll get outliers. Like
if I said one person maybe didn't take the
2423
03:46:08,420 --> 03:46:13,771
quiz and they get a zero. But everybody
else gets four and five, it's an easy quiz, well,
2424
03:46:13,771 --> 03:46:18,870
then that zero would be an outlier. So you
don't have to worry about that. And like I
2425
03:46:18,870 --> 03:46:21,990
said, you know, the data values sometimes
are almost the same, like almost everybody
2426
03:46:21,990 --> 03:46:25,900
gets a five on my quiz, because it's so easy.
So it doesn't matter. Even if you have these
2427
03:46:25,900 --> 03:46:30,581
weirdnesses in your data, you can still just
arrange them in order. And that's what we
2428
03:46:30,581 --> 03:46:36,001
mean by the median is the number that is halfway
up, or halfway down, right. So if I've got
2429
03:46:36,001 --> 03:46:40,750
100 people in my class, and I've got the zero
over here on the left, and I put all the,
2430
03:46:40,750 --> 03:46:47,230
you know, fours, and then the fives, you know,
I have to count up, what, 50, right, to see where
2431
03:46:47,230 --> 03:46:51,740
the middle is. And it's probably going to
be in the five range, right. But that's all
2432
03:46:51,740 --> 03:46:58,221
we mean, we say, you'll take however many
values you have, put them in order, even if
2433
03:46:58,221 --> 03:47:02,230
there's repeats and outliers or whatever,
just put them in order, and then count up
2434
03:47:02,230 --> 03:47:07,460
halfway. And that's where the median is going
to be. So I'll demonstrate this here. So how
2435
03:47:07,460 --> 03:47:11,811
to find the median, the first step is to order
the data from the smallest to largest. So
2436
03:47:11,811 --> 03:47:15,690
I'm giving you two demonstrations. And I don't
even know what these data mean, I just totally
2437
03:47:15,690 --> 03:47:20,830
made them up. The one at the top, the data
set at the top that starts with 42, that only
2438
03:47:20,830 --> 03:47:25,000
has five numbers in it. So I'm going to demonstrate
the odd version with them.
2439
03:47:25,000 --> 03:47:26,000
The one
2440
03:47:26,000 --> 03:47:30,040
set at the bottom has actually six numbers
in it. So I'm going to demonstrate the even
2441
03:47:30,040 --> 03:47:33,170
version, because remember, it goes a little
differently, whether you have an odd number
2442
03:47:33,170 --> 03:47:39,240
of numbers or an even number of numbers. Okay,
so those are the numbers. And we still have
2443
03:47:39,240 --> 03:47:43,101
to do the first step, which is order the data
from smallest to largest, because you can
2444
03:47:43,101 --> 03:47:48,230
see they're not in order. So I'm going to
do that here. Okay, there it is. So those
2445
03:47:48,230 --> 03:47:54,131
are the same numbers, they're just in order
from smallest to largest, okay. So we're going
2446
03:47:54,131 --> 03:47:59,180
to get rid of those numbers on the top, and
instead put the position they're in. So let's
2447
03:47:59,180 --> 03:48:05,480
look at the top data set, which is the odd
one. So I'm going to say this is how you find
2448
03:48:05,480 --> 03:48:07,800
the median is you
2449
03:48:07,800 --> 03:48:08,800
number the positions,
2450
03:48:08,800 --> 03:48:14,021
you know, it's 1, 2, 3, 4, 5. And it's the middle
position. So you can imagine, if we had had
2451
03:48:14,021 --> 03:48:20,680
seven data points, we'd go up 1, 2, 3, 4. And we'd
circle that one, and that would be the median.
2452
03:48:20,680 --> 03:48:25,021
So that's what you have to do is you take
these, if you have an odd number of values, you just put
2453
03:48:25,021 --> 03:48:29,771
them in order, and see I numbered them for
you. And then you take the middle number,
2454
03:48:29,771 --> 03:48:34,830
and that's the median. That's what it is.
It's 42 in this one. Okay, we'll do the downstairs
2455
03:48:34,830 --> 03:48:40,850
data set there that has six, as you can see,
the positions are numbered. And then what
2456
03:48:40,850 --> 03:48:46,980
do you do, you go to the third and fourth
position, which is the kind of the middle
2457
03:48:46,980 --> 03:48:53,061
right, and you literally make an average of
them, you add the two, and they happen to
2458
03:48:53,061 --> 03:48:57,260
be seven and eight right next to each other.
But if they had been like eight and 10, then
2459
03:48:57,260 --> 03:49:00,370
the average would have been nine, and that
would have been the median. But because this
2460
03:49:00,370 --> 03:49:05,980
is seven and eight, you do seven plus eight,
divided by two, and it's 7.5. So when you
2461
03:49:05,980 --> 03:49:10,130
do the median with an odd number of values,
you're going to be taking one of the values
2462
03:49:10,130 --> 03:49:16,290
in there. If you're doing the median, on an
even number of values, you might get something
2463
03:49:16,290 --> 03:49:21,610
with like a decimal, because you're looking
for the two values that straddle the middle,
2464
03:49:21,610 --> 03:49:24,650
and you're going to be making an average of
them. And so you might get kind of a wacky
2465
03:49:24,650 --> 03:49:34,040
number like 7.5 that's not in the underlying
data set. So um, this is fine for like, if
2466
03:49:34,040 --> 03:49:38,930
you have five or six numbers or seven. What
if you have like 150 numbers? I mean,
2467
03:49:38,930 --> 03:49:43,580
you do still have to put them all in order
to begin with, you know, like I use Excel,
2468
03:49:43,580 --> 03:49:50,410
I'd probably just sort. But you have to know
how many numbers to go up. It's not obvious.
2469
03:49:50,410 --> 03:49:52,200
So this is how you find the middle number.
They
2470
03:49:52,200 --> 03:49:53,720
have a little
2471
03:49:53,720 --> 03:49:58,980
formula for it. So let's say we have an odd
number of values. And I'm giving you the example
2472
03:49:58,980 --> 03:50:04,080
like 21. Let's say I have 21 students in my
class, and that's how many values I have.
2473
03:50:04,080 --> 03:50:09,230
And I wanted to make a median of their grade,
what I would do is put them all in order.
2474
03:50:09,230 --> 03:50:13,730
And I'd say, Well, I have to go up so many,
and that's the median. But I don't know how
2475
03:50:13,730 --> 03:50:21,150
many to go up. So I would use this calculation.
So I take the end, which in our case is 21.
2476
03:50:21,150 --> 03:50:27,390
And I'd add one to it. And then we get
22. And then I divide by two. So that's just
2477
03:50:27,390 --> 03:50:33,561
how it works. So if you had 41, you would
do 41 plus one, it would be 42, divided by
2478
03:50:33,561 --> 03:50:39,510
two. Or if you had, like, I don't know why
I'm picking on ones like 27, you do 27 plus
2479
03:50:39,510 --> 03:50:45,851
one, and that would be 28. And 28 divided
by two is 14. And so you see, it would just
2480
03:50:45,851 --> 03:50:50,561
force it to be a whole number that you come
out with. And then that's the position you
2481
03:50:50,561 --> 03:50:55,030
got to go up to. So if I had 21 students in my
class, and I took the grades and arranged them
2482
03:50:55,030 --> 03:51:00,631
in order from lowest to highest, like
if they were those quiz grades, you know, most
2483
03:51:00,631 --> 03:51:03,490
of them would probably be four and five, but
it wouldn't matter, what I would do is just
2484
03:51:03,490 --> 03:51:08,410
start with the lowest and count up to the
11th one, the 11th position, and then that would be
2485
03:51:08,410 --> 03:51:14,101
my median. Now, you also have to do that,
you have to find the middle number, even if
2486
03:51:14,101 --> 03:51:20,200
you have an even number of values. So I took
an example, 14. Now you'll notice we use the
2487
03:51:20,200 --> 03:51:26,590
same formula. But if you do use this formula,
you get 7.5. And that doesn't, that's not
2488
03:51:26,590 --> 03:51:31,600
the median. That's just how many positions
you have to go up. Right. And so remember,
2489
03:51:31,600 --> 03:51:37,200
on the earlier slide, we had, we had to go
between the third and fourth position, we
2490
03:51:37,200 --> 03:51:42,470
had to average those two numbers. Well, this
is basically saying, if you get 7.5, you have
2491
03:51:42,470 --> 03:51:46,440
to go to the seventh and the eighth, the ones
that straddle it, and those are the two that
2492
03:51:46,440 --> 03:51:53,561
you average. So say my n is 100, a nice,
even number. So if you have 100 plus one and
2493
03:51:53,561 --> 03:52:00,190
you get 101, then divide by two, you've got, you know, 50.5,
right, and that just is a secret message that
2494
03:52:00,190 --> 03:52:06,000
when you line up all your data, you take the
50th, one in the row and the 51st, one in
2495
03:52:06,000 --> 03:52:10,210
the row, add them together, divide by two,
and that's going to be your median. So I just
2496
03:52:10,210 --> 03:52:14,260
wanted to share with you this little formula,
just in case, you get like a large number
2497
03:52:14,260 --> 03:52:19,030
of numbers thrown at you and putting them
in order is a big pain. And then you have
2498
03:52:19,030 --> 03:52:24,400
to figure out how many to count up, you can
use this formula to get the middle number.
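The median procedure above (sort, compute the middle position as (n + 1) / 2, and average the two straddling values when that position ends in .5) can be sketched in Python. The function name `median` and both example data sets are assumptions for illustration; the lecture only says the five-number set starts with 42 and has median 42, and that the six-number set has 7 and 8 in the middle.

```python
def median(values):
    """Median via the lecture's (n + 1) / 2 middle-position rule."""
    ordered = sorted(values)       # step 1: order from smallest to largest
    n = len(ordered)
    if n == 0:
        raise ValueError("median of an empty data set is undefined")
    position = (n + 1) / 2         # 1-based position of the middle
    if n % 2 == 1:
        # Odd n: the position is a whole number, so take that value directly.
        return ordered[int(position) - 1]
    # Even n: the position ends in .5, so average the two values that straddle it.
    lower = ordered[int(position) - 1]
    upper = ordered[int(position)]
    return (lower + upper) / 2


# Hypothetical data chosen to match what the lecture reports.
print(median([42, 15, 60, 38, 57]))   # odd case: 5 values, position 3
print(median([8, 2, 10, 7, 5, 8]))    # even case: 6 values, positions 3 and 4
```

For 21 values the position works out to 11, and for 100 values it is 50.5, meaning the 50th and 51st values get averaged, as in the lecture.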
2499
03:52:24,400 --> 03:52:28,040
So what does a median tell you, we have a
lot more to talk about here. First of all,
2500
03:52:28,040 --> 03:52:33,601
it's called the 50th percentile of the data,
what it means is 50%, or half of the data
2501
03:52:33,601 --> 03:52:37,801
points are below the median, and the other
half are above. And that intuitively makes
2502
03:52:37,801 --> 03:52:41,890
sense because we just created
this median together. And we could see that
2503
03:52:41,890 --> 03:52:46,811
half of the points are in the bottom, half
on the top. And so it's also known as the middle
2504
03:52:46,811 --> 03:52:51,230
rank of the data. And what's nice about the
median is it doesn't really care much about
2505
03:52:51,230 --> 03:52:57,160
the ends of the data. Like if I gave extra
credit to a few people in my five point quiz,
2506
03:52:57,160 --> 03:53:01,830
and they got a few sixes, probably the median
won't even change because it's in the middle
2507
03:53:01,830 --> 03:53:05,681
where all the action is where we find the
median. And outliers don't really bother it
2508
03:53:05,681 --> 03:53:10,061
because like if one or two people get a zero
on the quiz, it's really, you know, if there's
2509
03:53:10,061 --> 03:53:14,470
21 people in there, or 100 people in there,
it really isn't gonna affect, you know, these
2510
03:53:14,470 --> 03:53:18,360
things happening at the end. So we like the
median because it's very resistant, and it's
2511
03:53:18,360 --> 03:53:25,850
very stable, you can't really whack it out
with some outliers, throwing them on the ends.
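A quick way to see this resistance is to compute the median before and after adding values at the ends. The sketch below uses Python's standard `statistics.median`; the 21 quiz scores are invented for illustration and are not from the lecture.

```python
import statistics

# Hypothetical five point quiz: 21 students, mostly fives, one zero.
scores = [5] * 15 + [4] * 5 + [0]

# Extra credit: a few sixes appear at the top end.
with_sixes = scores + [6, 6, 6]

# A couple more students miss the quiz: extra zeros at the bottom end.
with_zeros = scores + [0, 0]

print(statistics.median(scores))
print(statistics.median(with_sixes))
print(statistics.median(with_zeros))
```

All three medians come out as 5 in this made-up data, because the middle of the sorted list still sits among the fives.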
2512
03:53:25,850 --> 03:53:31,410
Now I'm moving on to the third measure of
central tendency, which is a mean, but I also
2513
03:53:31,410 --> 03:53:36,130
threw in here, trimmed mean and weighted average
because there are other kinds of means. And
2514
03:53:36,130 --> 03:53:40,180
we're going to talk a little bit also about
resistant measures, because like I just mentioned
2515
03:53:40,180 --> 03:53:41,180
that.
2516
03:53:41,180 --> 03:53:44,230
But I'm gonna step back
2517
03:53:44,230 --> 03:53:49,021
and talk a little bit about the Greek letter
sigma here, that's actually capital sigma,
2518
03:53:49,021 --> 03:53:53,370
I do not speak Greek. And I actually have
trouble speaking statistics, because a lot
2519
03:53:53,370 --> 03:53:57,811
of it's in Greek. So I try to avoid that in
my lectures, but sometimes you can't get away
2520
03:53:57,811 --> 03:54:02,681
from it. So I have to really introduce you
to this capital sigma. So in English, we say
2521
03:54:02,681 --> 03:54:07,630
or statistics-ese, I guess, whenever you
see this, you say "sum of" what, like you expect
2522
03:54:07,630 --> 03:54:14,730
something to be right after it. Okay. So if
you see, like the sigma and then x, you would
2523
03:54:14,730 --> 03:54:20,931
say sum of X. That's how you say it. So what
is x? Well, remember how we were just making
2524
03:54:20,931 --> 03:54:26,900
medians. And we were looking at modes, well,
each value there is considered an X, okay,
2525
03:54:26,900 --> 03:54:32,180
so each of the values in those data sets is an
X. So sum of X would mean add these all up
2526
03:54:32,180 --> 03:54:36,751
or add up all the x's. And then I just threw
on another example, let's say somebody came
2527
03:54:36,751 --> 03:54:41,391
to you and said sum of xy, it would mean
you must have some x y's lying around and
2528
03:54:41,391 --> 03:54:46,061
you have to add them together. Or somebody
came up to you and said, you know, sum of
2529
03:54:46,061 --> 03:54:50,820
the prices on your, of the food in your
2530
03:54:50,820 --> 03:54:56,530
basket in the grocery store, right? Somebody
said sum of that, you'd be like, Okay, I
2531
03:54:56,530 --> 03:55:00,551
have to go through all these prices and add
them up. Right. So that's what sum of means.
2532
03:55:00,551 --> 03:55:04,561
Okay, and it's used a lot in statistics, and
we're going to use sum of all the time. So
2533
03:55:04,561 --> 03:55:08,261
I just want you to get in your head that whenever
you see sum of, there's probably going to
2534
03:55:08,261 --> 03:55:13,330
be this thing next to it. And it's gonna be
a batch of numbers that you have to add up.
2535
03:55:13,330 --> 03:55:18,370
And if it's numbers from our data set, it
will be called x, if it's other numbers from
2536
03:55:18,370 --> 03:55:22,070
something else that will be called whatever
they're called. But just know that this means
2537
03:55:22,070 --> 03:55:27,250
some of and I see on the slide, the upper
one is Times New Roman, and the lower ones
2538
03:55:27,250 --> 03:55:30,790
Arial, they look kind of different. But I
just wanted you to get ready to deal with
2539
03:55:30,790 --> 03:55:36,980
this sum of a lot. Okay, so here we are,
I'm hitting you with a sum of. This is the
2540
03:55:36,980 --> 03:55:41,011
formula for the mean. And a lot of you already
know how to calculate the mean. And you just
2541
03:55:41,011 --> 03:55:45,160
kind of do it. And you didn't know this is
how you say it in statistics. But basically,
2542
03:55:45,160 --> 03:55:51,170
it's this ratio. So this is like a fraction.
And on the top of the fraction is a sum of
2543
03:55:51,170 --> 03:55:55,220
X, you add up all your x's. And on the
bottom of the fraction is n, which is however
2544
03:55:55,220 --> 03:55:58,863
many you have. So you add them all up and
divide by however many you have. And you've
2545
03:55:58,863 --> 03:56:04,561
probably been doing this your whole life.
But this is actually the formula. So I just
2546
03:56:04,561 --> 03:56:09,890
thought I'd demonstrate it. Um, see, I put that
sum of. Remember those six data points I was
2547
03:56:09,890 --> 03:56:14,230
using for the median, I just kind of copied
them over here, I add them all up. And so
2548
03:56:14,230 --> 03:56:19,551
I got sum of x is 40, right. And then I counted
them, and that was six; well, I made them
2549
03:56:19,551 --> 03:56:24,550
be six. And so 40 divided by six is 6.7. So
that would be the mean for these data. And
2550
03:56:24,550 --> 03:56:27,750
you probably already knew how to do that.
But I wanted to sort of crosshatch it with
2551
03:56:27,750 --> 03:56:35,760
the actual formula. Okay, now I'm again, going
to take a little break here to just talk about
2552
03:56:35,760 --> 03:56:41,110
means, because remember, we talked about sample
statistics and population parameters. If somebody
2553
03:56:41,110 --> 03:56:47,140
just talks about a mean to you, and they say,
look, the mean of such-and-such is six or something,
2554
03:56:47,140 --> 03:56:50,950
unless you really get into it with them, you're
not going to be able to tell; it's not going to be obvious
2555
03:56:50,950 --> 03:56:57,220
if they did a sample mean, or did a population
mean. But when we write this down, it becomes
2556
03:56:57,220 --> 03:57:03,400
obvious. If I say x bar, see that x with a
line above it, that's pronounced x bar, and
2557
03:57:03,400 --> 03:57:07,160
you'll see I write it on the slides as x bar,
because it's so hard to put that little line
2558
03:57:07,160 --> 03:57:12,511
up there. But that means the same thing, this
x bar. Whenever you hear x bar, or you see that
2559
03:57:12,511 --> 03:57:17,660
x with a line over it, it means that it's
the sample statistic. So if you ever saw
2560
03:57:17,660 --> 03:57:23,610
like x bar equals six, not only do you know
the mean is six, but the secret code says
2561
03:57:23,610 --> 03:57:28,820
this mean comes from a sample, because x bar
is being stated. But if you look on the right
2562
03:57:28,820 --> 03:57:35,600
side, you'll see that it says there's this
μ, and it's pronounced mu, it's a Greek letter
2563
03:57:35,600 --> 03:57:40,400
again, and you'll see on the
left, I put it in Arial. And on the right,
2564
03:57:40,400 --> 03:57:44,970
it's in Times New Roman, looks a little different.
But it's pronounced mu. And so if you saw
2565
03:57:44,970 --> 03:57:51,351
mu equals six, you'd be like, Whoa, that was
a population they measured. And you'd probably
2566
03:57:51,351 --> 03:57:54,320
say that too, because you don't see mu a lot
like people usually don't
2567
03:57:54,320 --> 03:57:55,320
measure the population,
2568
03:57:55,320 --> 03:58:01,720
it's a lot of work, you often see x bar, but
even so I want you to be cognizant of whether
2569
03:58:01,720 --> 03:58:05,881
it says mu or whether it says x bar, because
it's still going to be a mean. But if it's
2570
03:58:05,881 --> 03:58:09,771
mu, they're talking about the population.
And if it's x bar, they're talking about a
2571
03:58:09,771 --> 03:58:15,761
sample. And that might be more important later.
But just keep this in mind. Also, when we
2572
03:58:15,761 --> 03:58:21,450
talk about samples, we use a lowercase n to
mean the number of numbers we have. Whereas
2573
03:58:21,450 --> 03:58:27,751
if we're talking about populations,
we use an uppercase N, a capital N. So you'll
2574
03:58:27,751 --> 03:58:35,080
see that the sample mean formula on the left
side, this x bar equals sum of x divided by
2575
03:58:35,080 --> 03:58:41,740
n, it changes if you're talking about the
population mean, and you're like, come on,
2576
03:58:41,740 --> 03:58:47,910
you add it up the same way. Like mu is basically
the population mean, and capital N is
2577
03:58:47,910 --> 03:58:53,580
just the number in the population, that means
almost the same formula. But the issue is
2578
03:58:53,580 --> 03:58:57,720
you really are supposed to label things what
they are. So if you're doing a population
2579
03:58:57,720 --> 03:59:01,800
mean, you're supposed to call it mu,
and you're supposed to use, you know, write
2580
03:59:01,800 --> 03:59:05,440
it like that on the right side of the slide.
And if you're doing a sample mean, you're
2581
03:59:05,440 --> 03:59:09,430
supposed to call it x bar, and you're supposed
to do it like on the left side of the slide.
2582
03:59:09,430 --> 03:59:14,010
So I just wanted to make that clear to you
as you go through the rest of these lectures.
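Written out, the two mean formulas from the slide look roughly like this. The notation follows the lecture (x bar with lowercase n for a sample, mu with capital N for a population); the last line reuses the six made-up data points from the earlier example, whose sum was 40.

```latex
\[
  \bar{x} = \frac{\sum x}{n}
  \qquad\text{(sample mean)}
\]
\[
  \mu = \frac{\sum x}{N}
  \qquad\text{(population mean)}
\]
\[
  \bar{x} = \frac{40}{6} \approx 6.7
\]
```

The arithmetic is identical in both cases; only the labels change to signal whether the numbers came from a sample or from the whole population.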
2583
03:59:14,010 --> 03:59:20,010
Because when I say mu, I'm gonna mean a mean,
but it's gonna be from a population. And when
2584
03:59:20,010 --> 03:59:27,430
I say x bar, I mean the mean, but it's gonna
mean it's from a sample. Alright, so now we've
2585
03:59:27,430 --> 03:59:32,391
talked about several measures of central tendency,
but I wanted to put means and medians together
2586
03:59:32,391 --> 03:59:37,100
in kind of a cage match because I wanted you
to look at them and see what their differences
2587
03:59:37,100 --> 03:59:43,851
are. Now, I've been sort of giving accolades
to the median, right, because it is very resistant
2588
03:59:43,851 --> 03:59:48,271
to outliers, and it's very stable. Remember
how I pointed out if you throw some outliers
2589
03:59:48,271 --> 03:59:53,521
on either side, it doesn't really affect it
much. Unfortunately, means are not resistant
2590
03:59:53,521 --> 03:59:59,351
to outliers. You could just throw like if
I took my five point quiz, and I just felt
2591
03:59:59,351 --> 04:00:02,900
like favoring a student and then giving
them 10 points, it would totally screw up
2592
04:00:02,900 --> 04:00:09,480
the mean for that class. And so it's
not very stable. So one of the things we can
2593
04:00:09,480 --> 04:00:14,320
do if we've got outliers in our data is to
just use the median. But sometimes we want
2594
04:00:14,320 --> 04:00:19,180
to use the mean. So we got to do different
things with it. So one of the things we can
2595
04:00:19,180 --> 04:00:26,160
do to try and make a more stable mean, or
honest mean is to trim it. So I'm going to
2596
04:00:26,160 --> 04:00:30,120
talk about how you do that. So as you can
see, on the left side of the slide, a very
2597
04:00:30,120 --> 04:00:35,100
high value or a very low value, like an
outlier, or more than one outlier can really
2598
04:00:35,100 --> 04:00:39,710
throw off the mean. And it's not a problem
with median. So if you want to make the mean
2599
04:00:39,710 --> 04:00:46,061
a little resistant, what you can do is trim
data off of each end. So the outliers get
2600
04:00:46,061 --> 04:00:47,061
cut
2601
04:00:47,061 --> 04:00:48,061
off,
2602
04:00:48,061 --> 04:00:51,610
okay? The problem is, you can't look at the
data, when you're doing that, really, you
2603
04:00:51,610 --> 04:00:56,170
would just have to make a rule when you're
not looking and say, Okay, I'm going to trim
2604
04:00:56,170 --> 04:01:00,690
X amount off the top and X amount at the bottom
and it has to be equal, and you just have to look
2605
04:01:00,690 --> 04:01:07,950
away when you're doing it. Okay, so what
some people do is a 5% trimmed mean, which means
2606
04:01:07,950 --> 04:01:13,101
you take 5% of the data at the top and cut
it off, and 5% at the bottom and cut it off.
2607
04:01:13,101 --> 04:01:17,950
So you basically lose 10% of your data. And
in health care, a lot of people get mad about
2608
04:01:17,950 --> 04:01:22,090
that they don't want to lose any data. So
they don't like to use this way of fixing
2609
04:01:22,090 --> 04:01:27,230
the problem of outliers, they use other ways.
But I wanted to show you this as a simple
2610
04:01:27,230 --> 04:01:32,080
way to fix it. So I'm going to imagine we
have 100 data points, because it just makes
2611
04:01:32,080 --> 04:01:38,260
it easier for you to see what's going on.
Um, so if you had 100 data points, 5% of them
2612
04:01:38,260 --> 04:01:45,040
would be five. So basically, you'd be trimming
five off of the top, and five off the bottom.
2613
04:01:45,040 --> 04:01:49,811
So the first step would be is probably you
already made the mean out of this 100. And
2614
04:01:49,811 --> 04:01:53,880
you didn't like it because you saw outliers
at the top and bottom. So what you have to
2615
04:01:53,880 --> 04:01:57,720
do is put the data in order just like you
do for the median, you put them all in order,
2616
04:01:57,720 --> 04:02:01,681
you sort order from, you know, the lowest
to the highest, take all of your 100 and do
2617
04:02:01,681 --> 04:02:07,250
that, then what you would do is you would
like circle the five most bottom ones, and
2618
04:02:07,250 --> 04:02:11,030
they're going to get cut off, and you'd circle
the five topmost ones, and they're going
2619
04:02:11,030 --> 04:02:16,141
to get cut off, they get thrown out. And then
you've got the 90 values left in the
2620
04:02:16,141 --> 04:02:21,200
middle. Now you make a mean out of those.
And then that's a 5% trimmed mean, and you got
2621
04:02:21,200 --> 04:02:25,280
to tell people, if you do that, you can say
here's the original mean, and here's the 5%
2622
04:02:25,280 --> 04:02:29,010
trimmed mean, because then people get an idea
that there must have been some outliers and
2623
04:02:29,010 --> 04:02:34,711
some of your data got hacked off. But then
this might give you sort of a more stable
2624
04:02:34,711 --> 04:02:42,400
estimate of the mean. Now I'm going to move
to something else entirely. It's not about
2625
04:02:42,400 --> 04:02:48,080
trying to make the mean stable, it's just
about trying to make the mean a little different.
2626
04:02:48,080 --> 04:02:54,240
Sometimes certain values in your mean should
count more than others towards the mean. And
2627
04:02:54,240 --> 04:03:00,040
that sounds really esoteric, but the way we
see it all the time is in school. So you might
2628
04:03:00,040 --> 04:03:04,800
get a great grade on your homework, you might
get A's on your homework, right? But if homework's
2629
04:03:04,800 --> 04:03:12,311
only worth 10% of your final grade, it doesn't
help you much. And so what that 10% is,
2630
04:03:12,311 --> 04:03:17,690
when you have a class like that is it's called
a weight. When you move into statistics, you
2631
04:03:17,690 --> 04:03:22,240
say well, I'm going to, you know, I as the
teacher, I'm going to weight your homework grade
2632
04:03:22,240 --> 04:03:26,801
at 10% of your final grade. So it doesn't
matter how awesome your homework grade is,
2633
04:03:26,801 --> 04:03:32,080
or how bad it is, it's really only going to
count for 10% of your final grade. And that's
2634
04:03:32,080 --> 04:03:35,971
why we do weighted averages, you know, I don't
think your homework should be worth like 50%
2635
04:03:35,971 --> 04:03:40,721
of your grade, right? That doesn't make any
sense. And so you might want
2636
04:03:40,721 --> 04:03:46,860
to have different things contribute different
amounts of weight to that final mean. So this
2637
04:03:46,860 --> 04:03:51,140
is a way of messing around with the mean,
and making certain things going into it count
2638
04:03:51,140 --> 04:03:57,301
for more, or have kind of a bigger vote than
the other ones. And so I again, I'm just gonna
2639
04:03:57,301 --> 04:04:01,850
stick with school to give examples because
this is where we normally see it. So I made
2640
04:04:01,850 --> 04:04:06,521
this example where homework is worth 10%
of your final grade and quizzes would be worth
2641
04:04:06,521 --> 04:04:12,190
20%. And the final worth 70%. And I just want
to point out, I've actually seen people do
2642
04:04:12,190 --> 04:04:17,720
this, like cuz I tutor, and like this is horrible
making your final worth, like, over 50% of
2643
04:04:17,720 --> 04:04:20,990
your grade. So this is just a shout out to
any like professors watching this. Don't do
2644
04:04:20,990 --> 04:04:26,021
this. Okay. But anyway, let's say I was mean
and I did it. And let's say you were a pretty
2645
04:04:26,021 --> 04:04:30,980
good student and you got an A on the homework,
right? And so we're gonna say that's a 4.0
2646
04:04:30,980 --> 04:04:37,000
because a lot of schools would say A's 4.0.
Then let's say you got B plus on the quizzes,
2647
04:04:37,000 --> 04:04:41,700
maybe because the lectures weren't very good,
right? Haha. So you got B plus on the quizzes
2648
04:04:41,700 --> 04:04:46,820
that would translate to the number 3.5 on
that four point scale. And let's say you got
2649
04:04:46,820 --> 04:04:51,771
a B on the final. That's too bad, but that's
3.0. So why do I say that's too bad? Well,
2650
04:04:51,771 --> 04:04:56,990
you probably want an A because the final
counts for greater weight, right? It counts for
2651
04:04:56,990 --> 04:05:01,730
70% and you'd want that to be really high.
Great. Now I first wanted to show you the
2652
04:05:01,730 --> 04:05:06,390
non-weighted average, like the normal mean.
The normal mean you would make
2653
04:05:06,390 --> 04:05:10,160
is you just add the 4.0 to the 3.5 to the
3.0 and then divide by three, because you
2654
04:05:10,160 --> 04:05:16,420
have three grades in there, and you'd get 3.5,
you get a B plus in the class, right? But
2655
04:05:16,420 --> 04:05:22,500
let's just look down, or let's look up at
that formula. So this is the weighted average
2656
04:05:22,500 --> 04:05:29,230
formula. It's the sum of x times the weights,
2657
04:05:29,230 --> 04:05:35,120
divided by the sum of the weights. And remember what
I said about the sum of xy, as an example? So
2658
04:05:35,120 --> 04:05:39,460
we have to, instead of just summing x, like
we did in the non weighted average, we have
2659
04:05:39,460 --> 04:05:44,891
to do x times w on all of them and sum it,
and you're like, what's w? Well, remember,
2660
04:05:44,891 --> 04:05:50,780
I told you the homework is worth 10%? That's
the weight for it, right? And so, using
2661
04:05:50,780 --> 04:05:54,680
percent, when we do the weighted average,
you use the decimal version. So you'll see
2662
04:05:54,680 --> 04:06:01,141
under the weighted average, I'm doing that
sum of X w thing by taking the four and timesing
2663
04:06:01,141 --> 04:06:08,230
it by point one for that 10% first, and then
see that B plus that 3.5. That gets multiplied
2664
04:06:08,230 --> 04:06:13,530
by point two, because that's worth 20%, and
then there's that B, you got on the final,
2665
04:06:13,530 --> 04:06:19,890
right, that gets multiplied by point seven.
So that's the sum of X w thing going. And
2666
04:06:19,890 --> 04:06:26,800
what do you get, you get 3.2. Now I don't
even bother to divide this by the sum of
2667
04:06:26,800 --> 04:06:32,450
w, because the sum of w is one in this case,
like if you add up point seven plus point
2668
04:06:32,450 --> 04:06:36,720
two plus point one, you get one. And that
often happens, you just make the weights add
2669
04:06:36,720 --> 04:06:40,480
up to one. But I just wanted to let you know
if for some reason you had goofy weights that
2670
04:06:40,480 --> 04:06:44,800
didn't add up to one, the last thing you have
to do is divide by their sum. So as you can see,
2671
04:06:44,800 --> 04:06:49,680
in the lower part of the slide, the sum of
X W is 3.2. And if we divided it by one, we
2672
04:06:49,680 --> 04:06:56,840
get 3.2. And now you don't get a B plus in the
class, now you get like a B. And that's the
2673
04:06:56,840 --> 04:07:00,590
difference between the non-weighted and the
weighted average: the weighted average weighted
2674
04:07:00,590 --> 04:07:06,690
this final B extra, and then that caused
the grade, the final grade, to be lower. And
2675
04:07:06,690 --> 04:07:13,540
that's what weighting is. Now, I just want to
say a few things. I've gone through all our
2676
04:07:13,540 --> 04:07:18,200
measures of central tendency, but I wanted
to talk about how they relate to the distributions
2677
04:07:18,200 --> 04:07:26,070
we learned recently. So I just put up an example
of a normal distribution. And then I color
2678
04:07:26,070 --> 04:07:34,360
coded these lines. So see, on the right,
there's a blue mean. And then there's
2679
04:07:34,360 --> 04:07:40,490
a green median. And then there's a purple
mode. Technically, they should all be right
2680
04:07:40,490 --> 04:07:44,560
on top of each other. But you couldn't see them
if I did that, so I just squished them up next
2681
04:07:44,560 --> 04:07:48,810
to each other. The point is, if you
have data with a normal distribution, all
2682
04:07:48,810 --> 04:07:54,521
these three things are on top of each other.
And what the magic of this is, is you don't
2683
04:07:54,521 --> 04:08:01,600
even need a histogram to know. So like I use
statistical software, and I'll feed in the
2684
04:08:01,600 --> 04:08:06,350
data, like a quantitative variable. And I'll
say, Tell me the mean, median, and mode. And
2685
04:08:06,350 --> 04:08:12,271
then it will, it'll tell me the mean, median
and mode. And even if I don't look at the
2686
04:08:12,271 --> 04:08:18,220
histogram, if it says almost the same number
for mean, median, and mode, I can pretty much tell
2687
04:08:18,220 --> 04:08:25,120
it's a symmetric, normal-looking distribution. Well, that's not
the case with skewed distributions. So with
2688
04:08:25,120 --> 04:08:30,990
skewed distributions, the measures of central
tendency are not right on top of each other.
2689
04:08:30,990 --> 04:08:37,110
In fact, they're in a different order, depending
on whether we have right skewed or left skewed.
2690
04:08:37,110 --> 04:08:42,521
So at the top of the slide, I've got an example
of a right skewed distribution, right? Because
2691
04:08:42,521 --> 04:08:49,720
it's light on the right. Alright, so what's
happening here? Well, the mean, is getting
2692
04:08:49,720 --> 04:08:58,790
dragged around by that tail, that big tail.
So you can see that the blue mean, is on the
2693
04:08:58,790 --> 04:09:04,080
right side of the median. So the median is
more resistant. So it's sort of hanging out
2694
04:09:04,080 --> 04:09:09,670
closer to the bottom of the data. But
the tail, that right tail, is pulling the mean
2695
04:09:09,670 --> 04:09:16,091
up. And then the mode is the lowest one. So
if I get this print out, and I see that the
2696
04:09:16,091 --> 04:09:21,210
mode is the lowest, the median's in the middle,
and the mean's the highest, I can say without
2697
04:09:21,210 --> 04:09:27,090
even looking at the histogram, this is probably
right skewed. Now let's look at the bottom
2698
04:09:27,090 --> 04:09:30,590
of the slide where we have the left skewed
distribution, you know, because it's light
2699
04:09:30,590 --> 04:09:36,021
on the left, and you see the same phenomenon,
but it's going the other direction. That
2700
04:09:36,021 --> 04:09:42,190
tail, that's towards the low end of the data.
It's dragging the mean down now. And notice
2701
04:09:42,190 --> 04:09:47,790
the median is more resistant; it doesn't get dragged
down as much. And of course, the mode stays
2702
04:09:47,790 --> 04:09:54,610
at the high part of the data where there's
more data, right? So if I get the printout,
2703
04:09:54,610 --> 04:09:58,231
and I see that the mean is the lowest and
the median's in the middle and the mode's the
2704
04:09:58,231 --> 04:10:02,681
highest, I'm like, okay, all right, I don't have to
look at the histogram. I know this is
2705
04:10:02,681 --> 04:10:08,230
left skewed. So this is basically what I wanted
to tell you about the, the distributions,
2706
04:10:08,230 --> 04:10:13,140
and these actual numbers and how they sort
of relate.
2707
04:10:13,140 --> 04:10:17,970
So in conclusion, what this lecture was mainly
about was the measures of central tendency,
2708
04:10:17,970 --> 04:10:25,150
right? mode, median and mean, and how to calculate
those. And, you know, I've been kind of bagging
2709
04:10:25,150 --> 04:10:29,760
on the mean, I'm sorry, but the mean is just
not resistant. It's totally not stable. And
2710
04:10:29,760 --> 04:10:34,811
the median is, so you want to remember these
things. Yeah, you can kind of fix things by
2711
04:10:34,811 --> 04:10:38,700
doing the trimmed mean, but we don't really like
to do that in healthcare, because we lose
2712
04:10:38,700 --> 04:10:44,720
some of our data. We find other ways of fixing
the fact that our mean may be kind of goofy.
2713
04:10:44,720 --> 04:10:49,771
But they're outside of this lecture, how we
do that. I also showed you about weighted
2714
04:10:49,771 --> 04:10:54,620
average, you know, just in case you have to
hand calculate your grade. Actually, I
2715
04:10:54,620 --> 04:10:59,140
had a student in my class once. And this is
back when we had Blackboard. And there was
2716
04:10:59,140 --> 04:11:03,540
something wrong with Blackboard. So she was
really upset because she thought she was getting
2717
04:11:03,540 --> 04:11:09,060
a really bad grade. But she only thought she was getting a
bad grade because she didn't do a good job
2718
04:11:09,060 --> 04:11:13,500
of learning weighted average. When
I showed her how to actually calculate her
2719
04:11:13,500 --> 04:11:17,320
grade, it turned out to be a B. I remember
she was crying. Because she did an unweighted
2720
04:11:17,320 --> 04:11:20,801
average, she was crying in my office. And
then I just showed her how to do the weighted
2721
04:11:20,801 --> 04:11:27,600
average. And she stopped crying, she was getting
a B. So just don't cry. Try the weighted average
2722
04:11:27,600 --> 04:11:33,780
first, okay. And then finally, I went over
distributions and measures of central tendency,
2723
04:11:33,780 --> 04:11:40,221
and just related to you how the distributions,
how the numbers we get from the measures of
2724
04:11:40,221 --> 04:11:46,200
central tendency, how we can put them on distributions
and see some information about the distribution.
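To recap the grade example from this lecture, here is a minimal Python sketch of the weighted average next to the ordinary mean. The grades (4.0, 3.5, 3.0) and weights (10%, 20%, 70%) come from the lecture. The function name `weighted_mean` is an assumption made up for this sketch.

```python
def weighted_mean(values, weights):
    """Sum of x times w, divided by the sum of the weights."""
    total_weight = sum(weights)
    return sum(x * w for x, w in zip(values, weights)) / total_weight


# Homework (A), quizzes (B plus), final (B) on a four-point scale.
grades = [4.0, 3.5, 3.0]
# Weights as decimals: 10%, 20%, 70%.
weights = [0.10, 0.20, 0.70]

unweighted = sum(grades) / len(grades)
weighted = weighted_mean(grades, weights)

print(round(unweighted, 2))  # 3.5, a B plus
print(round(weighted, 2))    # 3.2, more like a B
```

Dividing by the sum of the weights is a no-op here because they add up to one, but it keeps the function correct for weights that do not.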
2725
04:11:46,200 --> 04:11:52,420
All right, well, you made it through the measures
of central tendency, get ready for 3.2 measures
2726
04:11:52,420 --> 04:12:02,310
of variation. Hello, and welcome to chapter
3.2. It's Monica Wahi, Labouré College lecturer.
2727
04:12:02,310 --> 04:12:08,490
And I'm here to go over with you measures
of variation. All right, here are your
2728
04:12:08,490 --> 04:12:12,710
learning objectives. So at the end of this
lecture, the student should be able to state
2729
04:12:12,710 --> 04:12:18,560
three different measures of variation using
statistics, you should also be able to explain
2730
04:12:18,560 --> 04:12:22,930
how to calculate variance and standard deviation,
which I'll give you a hint, those are two
2731
04:12:22,930 --> 04:12:28,120
of the measures. All right, you should also
be able to calculate the coefficient of variation
2732
04:12:28,120 --> 04:12:35,760
and explain its interpretation. And finally,
you should be able to state Chebyshev's theorem.
2733
04:12:35,760 --> 04:12:41,602
So now we're going to be concentrating on
measures of variation. And the first one,
2734
04:12:41,602 --> 04:12:45,931
I'm going to talk about is range. And then
I'm going to talk about variance and standard
2735
04:12:45,931 --> 04:12:48,521
deviation, which are two different ones, but
I'm going to talk about them together. And
2736
04:12:48,521 --> 04:12:53,280
you'll see why. Then we're going to go over
the coefficient of variation, which is abbreviated
2737
04:12:53,280 --> 04:12:58,550
CV. Then we're going to talk about Chebyshev.
Chebyshev came up with a theorem, we're
2738
04:12:58,550 --> 04:13:03,660
gonna talk about his theorem. And then his
theorem leads us to calculate these intervals.
2739
04:13:03,660 --> 04:13:07,850
Remember, intervals are like, have a lower
limit and an upper limit. I'll remind you
2740
04:13:07,850 --> 04:13:12,061
of that, and then we'll calculate Chebyshev
intervals together. All right, let's get
2741
04:13:12,061 --> 04:13:19,510
started. So let's think about variation. Okay.
What does variation even mean? Well, it means
2742
04:13:19,510 --> 04:13:24,640
how much does the data vary? So imagine I
taught two classes, which isn't too hard,
2743
04:13:24,640 --> 04:13:29,550
because I do teach two classes, I teach two
of the same classes, two different sections.
2744
04:13:29,550 --> 04:13:35,880
So imagine that I gave a quiz. And the same
mean grade was in each class. Okay. And if I
2745
04:13:35,880 --> 04:13:41,311
said that, could we tell how internally consistent
those grades were? So for instance, let's
2746
04:13:41,311 --> 04:13:46,990
say that I gave a five point quiz. And the
mean in each class was three. Do we really
2747
04:13:46,990 --> 04:13:52,601
know how many people got something far from
three, like, maybe in one class, people got
2748
04:13:52,601 --> 04:13:58,350
a lot of fives, and ones. And that's how we
got the average of three. And maybe in the
2749
04:13:58,350 --> 04:14:03,021
other class, everybody just got three, like,
we really can't tell from a measure of central
2750
04:14:03,021 --> 04:14:08,880
tendency like median, or mean, or even mode,
we can't tell how internally consistent the
2751
04:14:08,880 --> 04:14:12,900
data are. Especially, we can't tell that
from a mean. Two different classes can have
2752
04:14:12,900 --> 04:14:18,580
the same mean, and a totally different kind
of variation behind the scenes. So when you're
2753
04:14:18,580 --> 04:14:23,790
talking about quantitative data, and you have
a whole data set, and you do the measures
2754
04:14:23,790 --> 04:14:29,030
of central tendency, like mean, median, mode,
it doesn't tell the whole story, you have
2755
04:14:29,030 --> 04:14:34,690
to also add on the information about variation.
And these calculations that we're going to
2756
04:14:34,690 --> 04:14:41,420
learn here in this lecture are about ways
to express how much the data vary in the data
2757
04:14:41,420 --> 04:14:45,240
set. And it's just separate from central tendency.
So central tendency is just about central
2758
04:14:45,240 --> 04:14:51,140
tendency. And then this variation is about
variation. And you need to know both before
2759
04:14:51,140 --> 04:14:55,210
you can really evaluate your data set. So
we'll get started on talking about ways to
2760
04:14:55,210 --> 04:14:59,271
calculate these measures of variation.
2761
04:14:59,271 --> 04:15:03,690
So, as I said, I'm going to go through
range first, then I'm going to talk about variance
2762
04:15:03,690 --> 04:15:07,261
and standard deviation. And I just want to
remind you, you know how I'm always going
2763
04:15:07,261 --> 04:15:12,561
on about sample statistics versus population
parameters. Well, this starts playing in, in
2764
04:15:12,561 --> 04:15:17,190
that the formulas are slightly different
for sample variance and standard deviation
2765
04:15:17,190 --> 04:15:22,080
than for population variance and standard deviation. So we'll
go over those separate,
2766
04:15:22,080 --> 04:15:23,561
different formulas.
2767
04:15:23,561 --> 04:15:27,440
Finally, within the
measures of variation, we're going to talk
2768
04:15:27,440 --> 04:15:32,470
about the coefficient of variation or CV,
but we'll do that after these other ones.
2769
04:15:32,470 --> 04:15:37,680
Okay, so we're going to start with the range,
because it's the simplest to calculate. So
2770
04:15:37,680 --> 04:15:41,641
here's how you do it. So you'll notice on
the right, I just made up five numbers, I
2771
04:15:41,641 --> 04:15:46,761
just totally made them up. I don't know what
they are. Okay, I just did that for a demonstration,
2772
04:15:46,761 --> 04:15:52,530
because the range is the difference between
the maximum and minimum value. So literally,
2773
04:15:52,530 --> 04:15:56,920
it's pretty easy to calculate, you have to
first search around for the highest or the
2774
04:15:56,920 --> 04:16:01,630
maximum, which in this little data set, it's
so cute. It's only got five numbers. So it
2775
04:16:01,630 --> 04:16:07,090
was obvious that 78 was the highest,
right? And it's sort of obvious that 21 is the
2776
04:16:07,090 --> 04:16:11,880
lowest. So how you calculate the range is
you take the highest minus the lowest, and
2777
04:16:11,880 --> 04:16:16,240
then you get a number. And that's the range.
And sometimes my students actually take the
2778
04:16:16,240 --> 04:16:20,431
highest, and then they put minus and then
the lowest. And then they tell me, that's
2779
04:16:20,431 --> 04:16:24,800
the range. And I'm like, no, you actually
have to subtract it out. So you'll see here,
2780
04:16:24,800 --> 04:16:32,080
it says 78 minus 21 equals 57. So it's 57.
That's the range. Okay. So all it's telling
2781
04:16:32,080 --> 04:16:38,630
you is the distance between the top and the
bottom. And I'll just say that, that's not
2782
04:16:38,630 --> 04:16:43,910
very useful. In fact, I had a problem with
that when I was working. I worked for the Army
2783
04:16:43,910 --> 04:16:50,780
on this army database. And I looked at the
range of ages of soldiers when they started.
2784
04:16:50,780 --> 04:16:59,120
And the range was age 4 through 107. All right,
obviously, there was a problem with the data,
2785
04:16:59,120 --> 04:17:03,881
right? Just for some reason, there was a screwed
up record that said somebody got in when
2786
04:17:03,881 --> 04:17:07,641
they were four. And there was another screwed
up record that said, somebody got in when
2787
04:17:07,641 --> 04:17:11,530
they were over 100, they were just screwed
up data, okay. And that caused me to have
2788
04:17:11,530 --> 04:17:17,801
this ridiculous range. And so the range is
not very stable or resistant, right? If we
2789
04:17:17,801 --> 04:17:21,641
just fixed that, you know, record that said
somebody was four when they got in the army,
2790
04:17:21,641 --> 04:17:26,860
then we might have a normal range, you know,
like, a little more like, for a minimum, we might
2791
04:17:26,860 --> 04:17:33,190
see 18, or 17, or 19, or something. But, as
you can see, on the right side of the slide,
2792
04:17:33,190 --> 04:17:37,740
I just picked out the minimum and the
maximum. We could just arbitrarily
2793
04:17:37,740 --> 04:17:43,120
change those numbers. And suddenly, we'd have
something totally different from 57. So as
2794
04:17:43,120 --> 04:17:48,480
you can see, even though this range is a measure
of variation, it's not stable and resistant.
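Here is a minimal Python sketch of the range and of why it is not resistant. Only the 78 and the 21 come from the slide. The three middle values, the enlistment ages, the plausibility cutoffs (17 to 60), and the function name `value_range` are all assumptions made up for illustration.

```python
def value_range(data):
    """Range: the maximum minus the minimum, as a single number."""
    return max(data) - min(data)


# 78 and 21 are from the slide; the middle three values are made up.
numbers = [21, 35, 50, 62, 78]
print(value_range(numbers))  # 78 - 21 = 57

# Hypothetical enlistment ages, including two bad records (4 and 107).
ages = [18, 19, 17, 22, 20, 4, 107]
print(value_range(ages))  # 107 - 4 = 103

# Dropping the implausible records (assumed cutoffs) changes the range a lot.
cleaned = [a for a in ages if 17 <= a <= 60]
print(value_range(cleaned))  # 22 - 17 = 5
```

Two bad records out of seven move the range from 5 to 103, which is what it means for the range to depend entirely on the minimum and the maximum.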
2795
04:17:48,480 --> 04:17:53,750
And it actually kind of doesn't tell you much.
If I say we've got a range of 57, you don't
2796
04:17:53,750 --> 04:17:59,390
know if the minimum is like zero, or like
negative, or like 105, you know, you really
2797
04:17:59,390 --> 04:18:04,561
don't know where that range is. So it's not
very useful. But it's a place to start, because
2798
04:18:04,561 --> 04:18:09,800
that's our first measure of variation. Now
we're going to get into what we really use
2799
04:18:09,800 --> 04:18:15,521
in statistics a lot. You'll sometimes see
articles where they state the range, but
2800
04:18:15,521 --> 04:18:19,830
they usually don't state the actual number
I tell you to calculate, they actually state
2801
04:18:19,830 --> 04:18:24,550
the minimum and the maximum. And sometimes
that's interesting. But variance and standard
2802
04:18:24,550 --> 04:18:28,730
deviation. That's what we really live on in
statistics for measures of variation. And
2803
04:18:28,730 --> 04:18:32,730
you're probably wondering why I'm talking
about them together when they're totally different
2804
04:18:32,730 --> 04:18:37,190
calculations. Well, it's because they're friends.
Okay? And how are they friends? Well, the
2805
04:18:37,190 --> 04:18:41,540
variance calculation's kind of a big formula.
And so you get through that, and then you
2806
04:18:41,540 --> 04:18:46,490
have the variance. And then all you have to
do to get the standard deviation is take the
2807
04:18:46,490 --> 04:18:50,480
square root of the variance. So that's why
they're friends is like you go through all
2808
04:18:50,480 --> 04:18:54,311
this trouble to get the variance. And then
the next step is just take the square root
2809
04:18:54,311 --> 04:18:58,771
of that, and you get the standard deviation.
So before I actually talk about those formulas,
2810
04:18:58,771 --> 04:19:05,040
I wanted to just set in your head, what these
words mean. Because, like, I remember, I worked
2811
04:19:05,040 --> 04:19:09,881
in a mental health place. And I don't know,
we didn't have enough licensed people there.
2812
04:19:09,881 --> 04:19:14,360
And so our leader said, Oh, I'm applying to
the state for a variance, right? Meaning that
2813
04:19:14,360 --> 04:19:19,760
the state would allow us to vary from
the rules. Well, that's what variance is:
2814
04:19:19,760 --> 04:19:25,430
how the data vary. So you think of the spread
of the data, and how well does the mean ever
2815
04:19:25,430 --> 04:19:30,990
represent that spread? It doesn't, right?
So variance is a way of representing how the
2816
04:19:30,990 --> 04:19:36,310
data vary, really, around the mean. Now, you're
probably wondering, well, then why do you
2817
04:19:36,310 --> 04:19:40,910
even have standard deviation? It's the square
root of variance. But let's just think about
2818
04:19:40,910 --> 04:19:46,021
what the word means. You know, standard means
sort of following a standard, or the same.
2819
04:19:46,021 --> 04:19:53,950
So it's just the amount of variation that's
standard in the data set. And you know what
2820
04:19:53,950 --> 04:19:58,360
the word deviation means? Like, you can say,
oh, that person is a social deviant because
2821
04:19:58,360 --> 04:20:03,590
they go do crimes or something. Or like this
guy with a healthy nose, he does not have
2822
04:20:03,590 --> 04:20:08,610
a deviated septum. But you know, some people
do have a deviated septum where it's like
2823
04:20:08,610 --> 04:20:09,610
crooked,
2824
04:20:09,610 --> 04:20:13,290
right and they have trouble like sneezing
and blowing their nose and sometimes even
2825
04:20:13,290 --> 04:20:18,420
breathing. Well, a standard deviation would
simply mean that everybody's deviation is
2826
04:20:18,420 --> 04:20:24,660
about the same. So, variance is a calculation
that says how much things vary. And so does the
2827
04:20:24,660 --> 04:20:27,750
standard deviation, because it's just the
square root of variance. But I just want you
2828
04:20:27,750 --> 04:20:36,110
to imagine in your head, oh, standard deviation,
that means how much the data deviates around
2829
04:20:36,110 --> 04:20:40,650
the mean, because a lot of times students
get confused about the measures of central
2830
04:20:40,650 --> 04:20:45,561
tendency, they try to apply them to variation,
but variation is a totally different thing.
2831
04:20:45,561 --> 04:20:50,271
So just remember what variance literally means,
and what standard deviation literally means.
2832
04:20:50,271 --> 04:20:57,910
And that might help you get through these
formulas and understand the interpretation.
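A minimal Python sketch of how the two are "friends": the standard deviation is just the square root of the variance. The five data values are hypothetical. `statistics.variance` and `statistics.stdev` in Python's standard library compute the sample versions (dividing by n minus one).

```python
import math
from statistics import stdev, variance

# Hypothetical data, invented for this sketch.
data = [2, 4, 4, 6, 9]

s_squared = variance(data)  # sample variance
s = stdev(data)             # sample standard deviation

print("variance:", s_squared)
print("standard deviation:", s)
print("sd equals sqrt(variance):", math.isclose(s, math.sqrt(s_squared)))
```

For these values the mean is 5, the squared deviations add up to 28, and 28 divided by 4 gives a variance of 7, so the standard deviation is the square root of 7.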
2833
04:20:57,910 --> 04:21:03,700
So as I mentioned earlier, the formulas for
variance and standard deviation are different,
2834
04:21:03,700 --> 04:21:11,351
whether you're talking about a sample, or
a population. And, admittedly, we don't use
2835
04:21:11,351 --> 04:21:17,360
the population variance or population standard
deviation calculation very often, because
2836
04:21:17,360 --> 04:21:22,240
we don't measure the population that often.
So we tend to use the sample variance and
2837
04:21:22,240 --> 04:21:26,271
sample standard deviation all the time. So
I'm going to demonstrate those. But you'll
2838
04:21:26,271 --> 04:21:32,460
notice conceptually, they're really similar.
Like, um, you know, if you have population
2839
04:21:32,460 --> 04:21:39,160
parameters like mu, and like population
standard deviations, they tend to behave similarly
2840
04:21:39,160 --> 04:21:45,610
in formulas as the sample versions. It's just
that in statistics, we always want to be really
2841
04:21:45,610 --> 04:21:49,980
clear about what we're talking about. So we
always want to use the right symbol, so we're
2842
04:21:49,980 --> 04:21:56,250
hinting towards, we're analyzing a sample
versus we're analyzing a population even though
2843
04:21:56,250 --> 04:22:00,830
conceptually, like, a mean's a mean, right?
But you want to represent which mean you're
2844
04:22:00,830 --> 04:22:05,851
talking about, one that's a parameter or
one that's a statistic, whenever you write
2845
04:22:05,851 --> 04:22:12,030
out the formula, so I'm just being picky about
that. And then there's also two other things
2846
04:22:12,030 --> 04:22:18,181
you want to know. Um, there's two different
ways of actually doing each of these formulas.
2847
04:22:18,181 --> 04:22:22,780
You know how, like in algebra, you can have
a big equation, and you can express it more
2848
04:22:22,780 --> 04:22:28,431
than one way. So that's all they do is they
put a formula in one way called the defining
2849
04:22:28,431 --> 04:22:34,980
formula. And then they put the formula, same
formula, but rearranged by algebra into the
2850
04:22:34,980 --> 04:22:39,590
computational formula. Now, I always think
that's kind of funny that they call it computational,
2851
04:22:39,590 --> 04:22:43,780
right? I mean, both the formulas give you
the same results, it's just plugging in numbers
2852
04:22:43,780 --> 04:22:47,350
and getting out the answer. And the answer
is gonna be same, whether you use the defining
2853
04:22:47,350 --> 04:22:51,850
formula, or the computational formula. But
what I think is so funny is they call it the
2854
04:22:51,850 --> 04:22:57,031
computational formula, but I cannot compute
it. Like I always get confused when I use
2855
04:22:57,031 --> 04:23:02,920
it. So I pretty much ignore the computational
formula in my entire life. And I just teach
2856
04:23:02,920 --> 04:23:07,680
the defining formula. And I find my students
always remember the defining formula, they
2857
04:23:07,680 --> 04:23:11,670
always can get through it. Although people
who are into the computational formula, they
2858
04:23:11,670 --> 04:23:16,771
tell me that I'm doing things the hard way,
I'm going the long way around. But you know,
2859
04:23:16,771 --> 04:23:22,030
what? Just going the long way around, it helps
you not get confused, and helps you convince
2860
04:23:22,030 --> 04:23:27,001
yourself you actually got the right answer.
So let's just do the defining formula. All
2861
04:23:27,001 --> 04:23:32,440
right. So let's look at the defining formula,
you can look it up, you can look up the computational
2862
04:23:32,440 --> 04:23:36,840
formula, but this is the defining formula.
So let's just get our minds wrapped around
2863
04:23:36,840 --> 04:23:43,340
that. Remember, I told you that variance is
great, because you calculate that, and then
2864
04:23:43,340 --> 04:23:46,851
you just take the square root of that, and
you get the standard deviation. So as you
2865
04:23:46,851 --> 04:23:50,530
can see on the left side of the slide, we
abbreviate the sample variance by just saying
2866
04:23:50,530 --> 04:23:55,950
s, which is the standard deviation to the
second. I know that sounds ridiculous, right?
2867
04:23:55,950 --> 04:23:59,880
Like why don't we have a special thing just
for the variance? Why do we just say it's
2868
04:23:59,880 --> 04:24:04,660
s to the second, and then say sample standard
deviation is just s? Well, actually, to
2869
04:24:04,660 --> 04:24:09,180
be honest with you people use different notation.
I'm just using this because it matches the
2870
04:24:09,180 --> 04:24:15,721
textbook we're using. But people will often
say var for variance. And so in other textbooks,
2871
04:24:15,721 --> 04:24:21,280
they'll do that, and in statistical software too.
But they'll also say s to the second, like
2872
04:24:21,280 --> 04:24:26,940
this, and it's maybe a good way of you remembering
that the standard deviation is just the square
2873
04:24:26,940 --> 04:24:33,550
root of the variance, right? So if you ever
see s to the second, remember, S is the sample
2874
04:24:33,550 --> 04:24:38,121
standard deviation, and s to the second is the
sample variance. And I'll show you the population
2875
04:24:38,121 --> 04:24:44,940
one in a minute. But if you see those, that's
what they're talking about. Okay. Now, let's look
2876
04:24:44,940 --> 04:24:52,101
upstairs at the top formula. See this thing
on the top? It's really kind of scary, but
2877
04:24:52,101 --> 04:24:55,410
we're going to work through this and you're
not going to be scared of it. Okay.
2878
04:24:55,410 --> 04:24:59,771
I know you know that there's a little sum
sign there, that capital sigma, so you know
2879
04:24:59,771 --> 04:25:04,240
something's gonna get summed up. But
that looks kind of scary, that x minus x bar
2880
04:25:04,240 --> 04:25:09,440
to the second thing. We'll handle that, okay.
But n minus one on the bottom, that's not
2881
04:25:09,440 --> 04:25:14,370
so scary, okay. And we'll handle that one
too. And then you'll just notice, all I did
2882
04:25:14,370 --> 04:25:18,710
for the bottom part is I just put this huge
square root sign over that whole thing. So
2883
04:25:18,710 --> 04:25:21,910
that's the only difference between the upstairs
and the downstairs. And then I also wanted
2884
04:25:21,910 --> 04:25:26,630
to show you a picture of a calculator, because
a lot of times, if you haven't really done
2885
04:25:26,630 --> 04:25:31,300
math or statistics for a while, you forget
the whole concept of square root. And I'll
2886
04:25:31,300 --> 04:25:35,620
just remind you, whenever there's a square
root of something, it just means that if you
2887
04:25:35,620 --> 04:25:41,681
times it by itself, you'll get that number.
So remember, like 25, the square root of 25,
2888
04:25:41,681 --> 04:25:45,230
if you put 25 in your calculator, and you
hit that square root thing, you'll get five,
2889
04:25:45,230 --> 04:25:50,440
right, because five times five is 25. However,
if you put in 24, you're gonna get something
2890
04:25:50,440 --> 04:25:54,940
with decimals, right? But whatever it is,
you get, if you times it by itself, you'll
2891
04:25:54,940 --> 04:25:58,980
get 24. So I just want to remind you of that,
because sometimes people forget that if they
2892
04:25:58,980 --> 04:26:03,110
haven't been doing statistics or math for
a while, or they haven't used the calculator
2893
04:26:03,110 --> 04:26:09,341
for a while. All right, I told you I'd talk
to you about this numerator, right? The
2894
04:26:09,341 --> 04:26:13,930
top is the numerator in a fraction, and the
bottom is the denominator. So I'm going to
2895
04:26:13,930 --> 04:26:20,150
talk to you about this numerator. So the sum
of X minus X bar squared, you know, that's
2896
04:26:20,150 --> 04:26:23,820
how I would say it, this is actually called
this little piece of the formula is called
2897
04:26:23,820 --> 04:26:29,350
the sum of squares. And so from now on,
when I say sum of squares, I literally mean
2898
04:26:29,350 --> 04:26:35,641
the top half of this equation. So what you
do when you do the defining formula, is you
2899
04:26:35,641 --> 04:26:39,131
just kind of relax and say, the first thing
I'm going to do is figure out the sum of squares,
2900
04:26:39,131 --> 04:26:43,561
I'm going to figure out the top part. And
then I'm going to just write that down, and
2901
04:26:43,561 --> 04:26:47,980
then later, I'm gonna come back to this formula
and enter it. So this next part is, how do
2902
04:26:47,980 --> 04:26:52,780
we figure out that top part of the equation?
How do we get the sum of squares, and I'll
2903
04:26:52,780 --> 04:26:59,080
show you. Okay, so let's just look at the
slide. On the left, there's this blank
2904
04:26:59,080 --> 04:27:04,410
table. And that's usually what I do first
is I make this blank table. And you don't
2905
04:27:04,410 --> 04:27:08,551
need to say column one, column two, column
three. I just put that there so I could talk
2906
04:27:08,551 --> 04:27:13,150
about the columns and you'd know what I was
talking about. But usually, what I put is
2907
04:27:13,150 --> 04:27:18,100
I put x in the first column, and then I put
x minus x bar in the second. I wrote out minus, but you can
2908
04:27:18,100 --> 04:27:25,160
just use a dash. And then I put in parentheses
in the third column, x minus x bar to the
2909
04:27:25,160 --> 04:27:29,750
second, like that. Remember, when you have
parentheses, you have to do what's inside
2910
04:27:29,750 --> 04:27:35,110
the parentheses first. So this means you literally
have to do X minus X bar before you to the
2911
04:27:35,110 --> 04:27:39,930
second it or square it. And I'm just walking
you through this to get you ready for what
2912
04:27:39,930 --> 04:27:44,230
we're going to do with this table. On the right
of this slide, I'm just reminding you that
2913
04:27:44,230 --> 04:27:48,552
the sum of x minus x bar to the second,
in other words, the sum of whatever is going
2914
04:27:48,552 --> 04:27:54,050
to be in the column three. That's another
way of saying the sum of squares. Okay. So
2915
04:27:54,050 --> 04:27:58,521
an easy way to explain this, what the squares
are, is to just show you how to calculate
2916
04:27:58,521 --> 04:27:59,521
it.
2917
04:27:59,521 --> 04:28:00,771
So I just
2918
04:28:00,771 --> 04:28:05,790
pulled out a data set. Imagine a sample
of six patients presented to the central lab.
2919
04:28:05,790 --> 04:28:09,910
So this happens to me when I go to my doctor,
sometimes she'll say, you know, it's time
2920
04:28:09,910 --> 04:28:15,911
to do a lab panel for you. So she gives me
this slip of paper, and I go downstairs to
2921
04:28:15,911 --> 04:28:20,010
the central lab, and I give them the slip
of paper, and they say, Okay, sit down, and
2922
04:28:20,010 --> 04:28:24,380
then we'll call you up, and we'll draw your
blood or whatever. So we're imagining six
2923
04:28:24,380 --> 04:28:31,140
people did that. And then they got up to have
their blood drawn. We asked them, How long
2924
04:28:31,140 --> 04:28:36,530
did you wait? Okay. And I'm in the central
lab where I literally do wait two minutes,
2925
04:28:36,530 --> 04:28:37,780
that's a really good
2926
04:28:37,780 --> 04:28:38,780
lab. But
2927
04:28:38,780 --> 04:28:42,940
sometimes it's really busy if I go like during
lunch, and I'll wait something like 10 minutes.
2928
04:28:42,940 --> 04:28:48,650
So here are six patients. One of them waited
two minutes, a couple of them waited three
2929
04:28:48,650 --> 04:28:53,021
minutes, probably the other three came in
during lunch because they waited eight minutes,
2930
04:28:53,021 --> 04:28:57,940
10 minutes and 10 minutes. Okay, so that's
our data. It's a little tiny data set,
2931
04:28:57,940 --> 04:29:04,410
but I just wanted to use something small to
show you how to calculate the variance, and
2932
04:29:04,410 --> 04:29:10,390
then the standard deviation with just this
little data set. Okay. So what's the first
2933
04:29:10,390 --> 04:29:15,390
step? After making the table (you have to make
the blank table first), you fill in the
2934
04:29:15,390 --> 04:29:21,600
first column, which is called x. So what is
x? Actually, each of these patients' waiting
2935
04:29:21,600 --> 04:29:29,150
time is an X. Remember sum of x, if we said
sum of x, we would mean add all these x's
2936
04:29:29,150 --> 04:29:34,521
together, right? So that's all I did. I
just put each x in the column. You'll see
2937
04:29:34,521 --> 04:29:40,830
2, 3, 3, 8, 10, 10. It's identical to these
x's. And then at the bottom, I put that
2938
04:29:40,830 --> 04:29:46,190
little fancy sum of X, and it says 36. Okay, and
so that's just the first thing you do. Just
2939
04:29:46,190 --> 04:29:51,960
put them all in and do the sum of X. All right.
Now the next step is don't look at the left
2940
04:29:51,960 --> 04:29:57,990
side of the slide yet, look at the right side.
Before you go and fill in column two, you
2941
04:29:57,990 --> 04:30:03,391
have to do X bar. In other words, you have
to figure out the mean. Now, you can kind
2942
04:30:03,391 --> 04:30:09,340
of cheat because you just figured out the sum
of x. And if you remember the formula, the
2943
04:30:09,340 --> 04:30:15,440
mean, or the x bar of the sample is the sum
of x divided by n. And remember, I told you
2944
04:30:15,440 --> 04:30:22,210
we had six patients, so you just take 36 divided
by six, and you get six. Now you just hold
2945
04:30:22,210 --> 04:30:23,210
that number,
2946
04:30:23,210 --> 04:30:24,650
you hold that.
2947
04:30:24,650 --> 04:30:30,830
So between column one and column two, you
got to calculate x bar, and you hold, right.
2948
04:30:30,830 --> 04:30:35,141
And then while you're holding that, you keep
it off to the side, you realize that this
2949
04:30:35,141 --> 04:30:42,460
is how we're going to fill in column two.
What x minus x bar means is that x bar is just
2950
04:30:42,460 --> 04:30:49,220
six, but we have to go through each x and
subtract x bar from it. It's helpful to order the
2951
04:30:49,220 --> 04:30:54,100
x's before you do this. Notice I put
them in order: 2, 3, 3, 8, 10, 10. It's a good idea
2952
04:30:54,100 --> 04:30:59,400
to just do that, because it helps your brain
think whether or not you're doing the right
2953
04:30:59,400 --> 04:31:05,931
thing. So let's start with the two. So we
do two minus six, which is the x bar. Now
2954
04:31:05,931 --> 04:31:11,200
you can look at column two, two minus six
equals negative four, I hate negative numbers,
2955
04:31:11,200 --> 04:31:16,240
but you just have to deal with them sometimes.
Okay, so it's negative four, so you just deal
2956
04:31:16,240 --> 04:31:21,060
with that, then you go to the next slide,
and it's three minus six, which is negative
2957
04:31:21,060 --> 04:31:24,561
three. So we're still under water here with
the negatives, but you'll notice that the
2958
04:31:24,561 --> 04:31:29,190
next x is also three, so you can kind of copy
what you just did. So you're getting negative
2959
04:31:29,190 --> 04:31:32,771
three. So what you're actually technically
filling in this column, I showed you the equation,
2960
04:31:32,771 --> 04:31:36,590
but you're putting negative four in the first
one, negative three in the second one, negative
2961
04:31:36,590 --> 04:31:42,970
three in the third one. And then now finally,
the fourth x is eight. So eight minus six,
2962
04:31:42,970 --> 04:31:48,070
we got above water, now it's two, right?
And then we have 10 minus six, which is four, and 10 minus
2963
04:31:48,070 --> 04:31:52,810
six, which is four. And when you order them
like that, that's often what happens. In fact,
2964
04:31:52,810 --> 04:31:56,950
that's always what happens is you end up with
a bunch of negative ones at the beginning
2965
04:31:56,950 --> 04:32:01,811
and a bunch of positive ones later. That's
just totally normal. Don't worry about that.
2966
04:32:01,811 --> 04:32:06,551
But you got to be careful, you got to make
sure you get the right mean. I've had people
2967
04:32:06,551 --> 04:32:11,840
on tests actually screw up this mean. So you
can just imagine what a train wreck happens
2968
04:32:11,840 --> 04:32:15,990
after that: you do not get anything right
after that. So make sure your mean is right.
2969
04:32:15,990 --> 04:32:20,650
And then make sure you subtract it from every
single x and put the right answer in column
2970
04:32:20,650 --> 04:32:26,200
two. That's the next step. All right. Okay,
so we're done with that step, what do we do
2971
04:32:26,200 --> 04:32:33,800
next? Now, we just take whatever we got in
column two and square it. So we have the first
2972
04:32:33,800 --> 04:32:39,641
one was negative four. Remember, a
square is just the number times itself.
2973
04:32:39,641 --> 04:32:45,460
So if you don't like to use x to the second
button on your calculator, you can just do
2974
04:32:45,460 --> 04:32:50,970
negative four times negative four, same thing.
And so you'll notice we do negative four times
2975
04:32:50,970 --> 04:32:55,760
negative four, we get 16. Now, it's pretty
easy. Negative three times negative three
2976
04:32:55,760 --> 04:33:01,190
is nine. You know, two times two is four. But
what I want you to really look at is the
2977
04:33:01,190 --> 04:33:07,590
10s. Notice that they get a 16 too, just like
the two did. And that's
2978
04:33:07,590 --> 04:33:12,169
the trick here. Remember, I said I hate negative
numbers? Well, a lot of statisticians feel
2979
04:33:12,169 --> 04:33:13,759
the same way I do.
2980
04:33:13,759 --> 04:33:20,269
And so they often fix it by squaring the number,
because it erases the negative. Just
2981
04:33:20,270 --> 04:33:25,621
remember, negative times negative is positive,
and positive times positive is also positive.
2982
04:33:25,621 --> 04:33:31,551
That's a little trick, you know, when it comes
to multiplying. And so when we do that, we
2983
04:33:31,551 --> 04:33:40,061
are squaring each one in column two. And they're
called squares, right? So we've got 16, 9, 9, 4,
2984
04:33:40,061 --> 04:33:47,520
16, 16. Each of these is a square. So what do you
think we do? We add up that entire column,
2985
04:33:47,520 --> 04:33:52,778
and we get the sum of squares. So look at
that, we add up that entire column, and we
2986
04:33:52,778 --> 04:33:57,849
get that super complicated looking thing at
the bottom, which is the numerator for our
2987
04:33:57,849 --> 04:34:02,339
variance equation, right? Like this wasn't
really that hard. Was it? Okay, so we sum
2988
04:34:02,340 --> 04:34:09,711
that up. And as it turns out, we get the number
70. So 70 is our sum of squares. All right.
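The three-column table just described can be sketched in a few lines of Python. This is only an illustration of the steps from the lecture; the variable names (waits, deviations, squares) are made up for this sketch and are not from the slides.

```python
# Waiting times in minutes for the six patients in the lecture example.
waits = [2, 3, 3, 8, 10, 10]

n = len(waits)
sum_x = sum(waits)                        # bottom of column one: sum of x
x_bar = sum_x / n                         # the mean you "hold" off to the side

deviations = [x - x_bar for x in waits]   # column two: x minus x bar
squares = [d ** 2 for d in deviations]    # column three: (x minus x bar) squared
sum_of_squares = sum(squares)             # numerator of the variance formula

# Print the table row by row, then the totals.
print("  x   x - x bar   (x - x bar)^2")
for x, d, sq in zip(waits, deviations, squares):
    print(f"{x:>3} {d:>10.1f} {sq:>15.1f}")
print("sum of x =", sum_x, " x bar =", x_bar, " sum of squares =", sum_of_squares)
```

Squaring in column three is what removes the negative signs from column two, so the total at the bottom can never be negative.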
2989
04:34:09,711 --> 04:34:15,438
All right. Now we're back at the sample variance
formula. And I'm so excited because look at
2990
04:34:15,438 --> 04:34:21,519
the top of the formula. We have the answer:
it's 70. Okay, so we got that 70. But we still
2991
04:34:21,520 --> 04:34:25,938
have to deal with the bottom of the formula.
Remember, n was six, right? We had six patients,
2992
04:34:25,938 --> 04:34:30,519
and the bottom of the formula is n minus one.
So the bottom of the formula is going to be
2993
04:34:30,520 --> 04:34:36,211
five, right? So let's fill this in. I was
kind of running out of room, so I just filled
2994
04:34:36,211 --> 04:34:40,990
it in upstairs. So you see that 70 divided
by five suddenly this looks super easy, right?
2995
04:34:40,990 --> 04:34:48,528
So 70 divided by five is 14. Okay? That's
the variance. Totally easy, right? Once you
2996
04:34:48,528 --> 04:34:52,269
make that table. I mean, it's tedious,
right? You have to make that whole table and
2997
04:34:52,270 --> 04:34:57,141
add things up and stuff. But here, it's not
really that hard. Now, guess how we're gonna
2998
04:34:57,141 --> 04:35:03,641
make the standard deviation? You've probably
guessed it, we're just going to take a square
2999
04:35:03,641 --> 04:35:08,961
root of 14. So remember that button on your
calculator, you could put in 14, hit that
3000
04:35:08,961 --> 04:35:15,141
button, and you get 3.74 and a bunch of other
stuff, but I just chopped it off at 3.74.
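As a rough check on the arithmetic above, here is a small Python sketch of the defining formula for the same six waiting times. The names are assumptions made for this sketch; the standard library's statistics.variance and statistics.stdev are used only to cross-check the hand calculation, since they also divide by n minus one.

```python
import math
import statistics

# Same six waiting times (minutes) as in the lecture example.
waits = [2, 3, 3, 8, 10, 10]

n = len(waits)
x_bar = sum(waits) / n
sum_of_squares = sum((x - x_bar) ** 2 for x in waits)   # top of the formula

sample_variance = sum_of_squares / (n - 1)   # bottom of the formula is n minus one
sample_sd = math.sqrt(sample_variance)       # standard deviation is the square root

print("variance:", sample_variance)
print("standard deviation:", round(sample_sd, 2))

# Cross-check against the standard library's sample versions.
print("library variance:", statistics.variance(waits))
print("library stdev:", round(statistics.stdev(waits), 2))
```

Note that round() rounds rather than chops, but for the square root of 14 both give 3.74.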
3001
04:35:15,141 --> 04:35:21,938
So that is your sample standard deviation.
Now I promised you I would talk about the
3002
04:35:21,938 --> 04:35:27,779
population formulas for standard deviation
and variance, as well as the sample ones.
3003
04:35:27,779 --> 04:35:34,690
And I told you, they wouldn't really be conceptually
much different. As you can see on the left
3004
04:35:34,690 --> 04:35:39,790
side of the slide, here is how sample variance is expressed.
I made things red so you can see what the
3005
04:35:39,791 --> 04:35:45,391
differences are. Sample variance is s to the
second, but population variance is another
3006
04:35:45,391 --> 04:35:50,750
Greek letter. Remember, I told you that the
sum sign was capital sigma? Like, you know,
3007
04:35:50,750 --> 04:35:55,801
Greek is like English, in the sense they have
capital and lowercase letters? Well, that
3008
04:35:55,801 --> 04:36:00,009
thing that I always think it looks like a
jelly roll, but the Jelly Roll looking thing
3009
04:36:00,009 --> 04:36:06,269
is actually lowercase sigma. So I'm never
going to say lowercase sigma, except for now,
3010
04:36:06,270 --> 04:36:10,660
I'm going to say population variance and population
standard deviation. So you'll see at the bottom
3011
04:36:10,660 --> 04:36:15,070
of the slide, the lowercase sigma alone is
the population standard deviation. And then
3012
04:36:15,070 --> 04:36:21,230
the lowercase sigma to the second is the variance.
So just remember, if you see that Jelly Roll
3013
04:36:21,230 --> 04:36:26,099
thing, we're talking about a population version
of the standard deviation or variance, and not
3014
04:36:26,099 --> 04:36:33,649
the sample. Also, you already know about mu
versus x bar, right, so we have x bar on the
3015
04:36:33,650 --> 04:36:39,750
left. And that's the sample mean, and mu on
the right, which is population mean. And you
3016
04:36:39,750 --> 04:36:45,820
also already know about n, which is the number
in your sample. And this is where there's
3017
04:36:45,820 --> 04:36:52,131
a big difference actually, in the sample,
you have to do n minus one on the bottom,
3018
04:36:52,131 --> 04:36:57,278
and in the population, you just do N, capital
N, the whole population. And if you think
3019
04:36:57,278 --> 04:37:02,060
about it, it kind of makes sense, because
populations are huge, so it won't even matter
3020
04:37:02,061 --> 04:37:08,301
if you like subtracted one. Whereas, you know,
samples are small. So you sometimes have to,
3021
04:37:08,301 --> 04:37:12,539
you know, adjust or something, so you have
to subtract one. But it wouldn't even matter:
3022
04:37:12,539 --> 04:37:17,109
when people make a mistake and accidentally
subtract one in the population formula, they don't
3023
04:37:17,109 --> 04:37:21,291
get much of a different answer. And so that's
why I'm concentrating on the sample ones;
3024
04:37:21,291 --> 04:37:25,150
that's what we normally do. But I wanted to
give a shout out Just so you know, if you
3025
04:37:25,150 --> 04:37:30,278
ever see the formulas on the right side
of the slide, you know their population level
3026
04:37:30,278 --> 04:37:38,259
formulas. Alright, now we're gonna move on,
we made it through range, variance and standard
3027
04:37:38,259 --> 04:37:43,130
deviation. So now we're gonna move on to talk
about the coefficient of variation. And this
3028
04:37:43,131 --> 04:37:50,240
is used a lot for comparisons for comparing
between two different labs often.
3029
04:37:50,240 --> 04:37:54,871
I say that because my friends are pathologists,
and the first time I actually used this in medicine
3030
04:37:54,871 --> 04:38:01,801
was when we were comparing lab values on the same
assay from two different labs. I just wanted
3031
04:38:01,801 --> 04:38:06,340
to explain to you this might be the first
time you've heard the word coefficient. And
3032
04:38:06,340 --> 04:38:11,080
that gets a little confusing for people in
statistics who are new, because the word coefficient
3033
04:38:11,080 --> 04:38:17,980
is actually just kind of a generic term for
certain kinds of numbers. So you'll hear somebody
3034
04:38:17,980 --> 04:38:22,699
say, coefficient of variation. And you'll
say, you'll hear somebody say coefficient
3035
04:38:22,699 --> 04:38:27,340
of something else, or coefficient of something
else. And just the word coefficient, most people
3036
04:38:27,340 --> 04:38:33,449
haven't even heard it. It just means a certain
kind of number. If somebody says, oh,
3037
04:38:33,449 --> 04:38:38,509
the coefficient is not good, or it's high,
or whatever, you need to ask them, What coefficient
3038
04:38:38,509 --> 04:38:43,710
are you talking about, right. So in other
words, coefficient doesn't mean a specific
3039
04:38:43,711 --> 04:38:49,340
thing. It just means a number that comes out
of statistics. And so you have to know which
3040
04:38:49,340 --> 04:38:54,250
coefficient they're talking about. So this
is the first time maybe you've heard the word
3041
04:38:54,250 --> 04:38:58,750
coefficient. And I'm going to talk for the
first time then, to you if you've never heard
3042
04:38:58,750 --> 04:39:04,169
coefficient before, about a specific coefficient
called the coefficient of variation. Now,
3043
04:39:04,169 --> 04:39:10,011
as we go through this textbook, there are
other coefficients in it. So please remember
3044
04:39:10,011 --> 04:39:16,958
this one is coefficient of variation, right?
And a way to remember it is CV for short.
3045
04:39:16,958 --> 04:39:22,999
And so other coefficients have different abbreviations,
but the coefficient of variation is CV. So
3046
04:39:23,000 --> 04:39:30,099
I put on the right side of the slide the
formulas, and nobody seems to have any trouble
3047
04:39:30,099 --> 04:39:34,329
doing the formula, right, because once you
calculate the standard deviation, the sample
3048
04:39:34,330 --> 04:39:38,600
standard deviation or the population one,
as you can see in the formulas, and once you
3049
04:39:38,600 --> 04:39:44,380
calculate x bar, which is a mean for the sample,
it's pretty easy to do the division, and then
3050
04:39:44,380 --> 04:39:49,520
they like it when you do it in percent. And
you'll notice that about statistics is certain
3051
04:39:49,520 --> 04:39:55,282
things they prefer as proportions. And certain
things they prefer as percents. It's just
3052
04:39:55,282 --> 04:40:01,050
like, I don't know, it's just like our culture
in a way. And so the coefficient of variation is always
3053
04:40:01,050 --> 04:40:07,130
expressed as a percent. So you have to times
that by 100. And then put a percent sign after
3054
04:40:07,130 --> 04:40:11,560
it. But really, that's pretty easy to do you
take the standard deviation, you'll see I
3055
04:40:11,560 --> 04:40:15,970
did it for our patients 3.74. It took us all
that work to get there, right? Remember square
3056
04:40:15,970 --> 04:40:22,370
root of 14. And then remember, our x bar was
six. So we needed that remember earlier for
3057
04:40:22,370 --> 04:40:28,872
that column two. So I just dumpster-dove
for those numbers, and then did
3058
04:40:28,872 --> 04:40:34,790
this calculation out and I got 62%. And so
students generally don't have trouble getting
3059
04:40:34,790 --> 04:40:40,070
that number. But the problem is, like,
what does the number even mean? Right? Like,
3060
04:40:40,070 --> 04:40:43,952
what does it mean, if you divide the standard
deviation by the x bar and times by 100%?
3061
04:40:43,952 --> 04:40:51,270
And like, how do you interpret that percent?
So the easiest way to talk about it is to
3062
04:40:51,270 --> 04:40:55,660
actually compare it with something. Because
one thing you'll also notice in statistics
3063
04:40:55,660 --> 04:41:02,800
is if you make ratios of things, they don't
have any units. So if I take your blood pressure,
3064
04:41:02,800 --> 04:41:09,100
like your systolic blood pressure, and I say
it's whatever, 130 mmHg. If I divide that
3065
04:41:09,100 --> 04:41:14,240
by your diastolic blood pressure, or even
by some lab value, or your temperature, or
3066
04:41:14,240 --> 04:41:19,770
whatever, your IQ, suddenly I get a ratio,
and that doesn't have units, right, it doesn't
3067
04:41:19,770 --> 04:41:24,720
have mmHg, or anything like that. And if I
do that to a bunch of people, all of those
3068
04:41:24,720 --> 04:41:26,460
ratios don't have any units.
3069
04:41:26,460 --> 04:41:30,032
And so they technically could be compared
to each other. So you'll see that that's a
3070
04:41:30,032 --> 04:41:35,130
strategy in statistics is they'll make ratios
of things and say all those don't have any
3071
04:41:35,130 --> 04:41:41,602
units. So it's, you know, sort of lacking
in that way. But the power is you can compare
3072
04:41:41,602 --> 04:41:48,790
these ratios. So, I decided to just pull out
other patients, I just made up other patients,
3073
04:41:48,790 --> 04:41:53,510
right. I pretended we went back to the lab,
the next day, and we gathered some data. And
3074
04:41:53,510 --> 04:41:59,940
we came up with (I
just made this up) an x bar of eight, and a
3075
04:41:59,940 --> 04:42:05,852
standard deviation of four. It's a little
close to what we had before, right? Like x
3076
04:42:05,852 --> 04:42:13,220
bar of six and standard deviation of 3.74. But anyway, in
this next sample of patients, the s of four divided
3077
04:42:13,220 --> 04:42:18,842
by the x bar of eight, times 100, equals 50%,
and not 62%, like the other one did. So how
3078
04:42:18,842 --> 04:42:23,730
do you interpret that? Well, the CV is a measure
of the spread of the data relative to the
3079
04:42:23,730 --> 04:42:29,800
average of the data. So in the purple sample, the new one,
the standard deviation is only 50% of the
3080
04:42:29,800 --> 04:42:35,650
mean. But in the red sample, our original six patients, the standard
deviation is 62%
3081
04:42:35,650 --> 04:42:37,122
of the mean.
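The comparison above can be sketched as a tiny Python helper. The function name coefficient_of_variation is made up for this illustration; the inputs are the two sets of numbers from the lecture: the six-patient sample (x bar 6, s 3.74) and the made-up next-day sample (x bar 8, s 4).

```python
def coefficient_of_variation(sd, mean):
    """Return the CV as a percent: standard deviation divided by mean, times 100."""
    return sd / mean * 100


# Six-patient sample from the worked example: s = 3.74, x bar = 6.
cv_six_patients = coefficient_of_variation(3.74, 6)

# Made-up next-day sample: s = 4, x bar = 8.
cv_next_day = coefficient_of_variation(4, 8)

print(f"six-patient sample CV: {cv_six_patients:.0f}%")
print(f"next-day sample CV:    {cv_next_day:.0f}%")

# The sample with the smaller CV has less spread relative to its mean.
more_predictable = "next-day" if cv_next_day < cv_six_patients else "six-patient"
print("more predictable sample:", more_predictable)
```

Because the standard deviation and the mean are in the same units (minutes here), the units cancel in the division, which is why the two percentages can be compared directly.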
3082
04:42:37,122 --> 04:42:47,820
So what I would say is that the red sample,
the one with the 62%, has more standard
3083
04:42:47,820 --> 04:42:53,820
deviation, compared to the mean. And so that
means it's less stable, right? It's got more
3084
04:42:53,820 --> 04:42:57,420
variance compared to its mean, and it's got more
standard deviation compared to its
3085
04:42:57,420 --> 04:43:04,100
mean. So it's less stable. So it moves around
a lot. So if you said to me, if these were
3086
04:43:04,100 --> 04:43:09,750
actually two different labs, I would say,
you know, I prefer the 50% lab, the purple
3087
04:43:09,750 --> 04:43:16,840
lab, because it's more predictable. I know
there's gonna be less variation, because
3088
04:43:16,840 --> 04:43:23,031
it's 50%. And the 62% means that that's less
predictable. It's a little hard to see in
3089
04:43:23,031 --> 04:43:28,522
this example. But what happens is, if you
have two different labs, and you're looking
3090
04:43:28,522 --> 04:43:33,150
at this, like maybe you split a blood sample
or a bunch of blood samples and send half
3091
04:43:33,150 --> 04:43:37,380
to one lab and half to the other, you're
supposed to get the same mean and the same
3092
04:43:37,380 --> 04:43:39,460
standard deviation, right? They're the same
blood,
3093
04:43:39,460 --> 04:43:40,950
you just split it.
3094
04:43:40,950 --> 04:43:45,410
But sometimes you don't sometimes you get
something like this, in which case, if you're
3095
04:43:45,410 --> 04:43:49,880
comparing labs, you would go with the purple
lab and not the red lab because they produce
3096
04:43:49,880 --> 04:43:57,150
a more predictable result. So CV is a little
hard to interpret. But it's easy to calculate.
3097
04:43:57,150 --> 04:44:06,270
So that's one awesome thing about it. Now, we're
gonna move on to Chebyshev and his theorem.
3098
04:44:06,270 --> 04:44:12,260
So Chebyshev figured something out a long
time ago. And this is how he started thinking
3099
04:44:12,260 --> 04:44:16,310
about it. He first started thinking, well,
let's say you have an x bar and an S, like
3100
04:44:16,310 --> 04:44:20,900
we just did with the CV. He noticed something
else about it, he didn't notice the CV, he
3101
04:44:20,900 --> 04:44:26,740
noticed that you can create a lower and upper
limit by subtracting the s from and adding the
3102
04:44:26,740 --> 04:44:33,570
s to the x bar. So remember back when we were
making frequency tables, and I said, Well,
3103
04:44:33,570 --> 04:44:39,200
we need to make class limits, we need to make
a lower class limit and an upper class limit.
3104
04:44:39,200 --> 04:44:43,602
Well, we use that terminology a lot, like
lower limits and upper limits. Well, Chebyshev
3105
04:44:43,602 --> 04:44:49,770
was like, wait a second, I got an idea.
Let's say I take a mean. And I you know, this
3106
04:44:49,770 --> 04:44:54,100
will force the mean to be in the middle of
this. I can subtract one standard deviation
3107
04:44:54,100 --> 04:44:59,220
from it, and I'll get some sort of lower limit
and I'll add a standard deviation to that
3108
04:44:59,220 --> 04:45:02,760
mean and get some sort of upper limit. And
of course, let's pretend the standard deviation
3109
04:45:02,760 --> 04:45:07,340
was one, like you'd subtract one and add one.
And so this would be like totally symmetrically
3110
04:45:07,340 --> 04:45:11,430
in the middle, right, the x bar would be in
the middle, and then it'd be surrounded equally
3111
04:45:11,430 --> 04:45:16,060
by these two standard deviations. And I'm
just saying standard deviation generically,
3112
04:45:16,060 --> 04:45:19,390
because you could do this with a mu, and the
population standard deviation, too. You can
3113
04:45:19,390 --> 04:45:25,280
do the population version. So he just sort of,
like figured out, that's a thing that can
3114
04:45:25,280 --> 04:45:30,180
happen, you can add and subtract a standard
deviation from the mean. And you can get these
3115
04:45:30,180 --> 04:45:34,930
limits. And so, for example, let's say I have a
mu. So I'm gonna pretend I have a population
3116
04:45:34,930 --> 04:45:39,693
mu of 100. I don't know what I measured,
but I got 100 and a population standard deviation
3117
04:45:39,693 --> 04:45:44,911
of five. So Chebyshev was thinking, you know
what I could do, I could take that 100 and
3118
04:45:44,911 --> 04:45:51,650
subtract that five from it, and I get 95,
I could take that 100 and add five to it,
3119
04:45:51,650 --> 04:45:57,022
I get 105. And so he just started, like, working
with this concept, like I could subtract and
3120
04:45:57,022 --> 04:46:01,690
add like a standard deviation. And then he
thought, Wait a second, I could even do this
3121
04:46:01,690 --> 04:46:06,400
with two standard deviations, right? So I
could take like, if it was five, I could take
3122
04:46:06,400 --> 04:46:11,440
that times two, that's 10. And so I could
do 100, subtract 10, and I get 90 for the
3123
04:46:11,440 --> 04:46:17,930
lower limit, and 100 and add 10. And I get
110 for the upper limit. And so I can make
3124
04:46:17,930 --> 04:46:22,442
this range or this interval, right? From
the lower limit to the upper limit, we call
3125
04:46:22,442 --> 04:46:29,120
it an interval, right. And so he just sort
of conceptually realized that if he used some
3126
04:46:29,120 --> 04:46:34,590
rules along with this, there might be some
useful interpretation of these limits, right,
3127
04:46:34,590 --> 04:46:39,660
there might be some way to use these limits to
mean something. So we're going to look at
3128
04:46:39,660 --> 04:46:45,310
how he figured out to be able to use, you
know, one standard deviation on either side
3129
04:46:45,310 --> 04:46:51,320
of the mean, or two, or three, or four multiples
of these standard deviations on either side
3130
04:46:51,320 --> 04:46:58,860
of the mean, to actually come up with some
lower and upper limits, that meant something.
3131
04:46:58,860 --> 04:47:05,600
So he realized that what these lower and
upper limits would mean is that at least some
3132
04:47:05,600 --> 04:47:10,820
percent of the data would be between these
limits. So in other words, some percent of
3133
04:47:10,820 --> 04:47:16,680
the x's would be between the lower
and the upper limit. But that percent would
3134
04:47:16,680 --> 04:47:22,730
depend on how many standard deviations you're
going out, right? Like, is it one, is it two,
3135
04:47:22,730 --> 04:47:29,100
is it three? The more you go out, obviously,
the more percent of your data are covered
3136
04:47:29,100 --> 04:47:33,340
by the limits, because they're just huge,
like, get it? The interval is so big, it
3137
04:47:33,340 --> 04:47:37,830
almost covers the whole thing. So you would
expect that percentage to go up as the number
3138
04:47:37,830 --> 04:47:43,590
of standard deviations you use goes up. So
he was working this out, and he came
3139
04:47:43,590 --> 04:47:48,710
up with this formula, right. And he also,
he was figuring out, he wanted this to work
3140
04:47:48,710 --> 04:47:55,180
for all distributions, like normal, but also
skewed, and also like uniform and bimodal.
3141
04:47:55,180 --> 04:48:00,862
So this was the formula he came up with. Now,
in this formula, see at the bottom, k stands
3142
04:48:00,862 --> 04:48:05,200
for the number of standard deviations, or
the number of population standard deviations
3143
04:48:05,200 --> 04:48:12,640
that he's going to use, right? So let's pretend
that he made k be two, like two standard deviations,
3144
04:48:12,640 --> 04:48:18,820
right? Then you'd see this, it says one minus
one divided by k to the second, which would
3145
04:48:18,820 --> 04:48:25,280
be two to the second. And two to the
second is, what, four. So one divided by four
3146
04:48:25,280 --> 04:48:31,130
is 0.25. And so one minus
0.25 is 0.75. Well,
3147
04:48:31,130 --> 04:48:34,420
you make that a percent, it's 75%. So
3148
04:48:34,420 --> 04:48:38,900
he's like, okay, that's what I'm going to
say. If you go out two standard deviations
3149
04:48:38,900 --> 04:48:46,250
up or down, and you make those upper and lower
limits, at least 75% of the data, of the x's,
3150
04:48:46,250 --> 04:48:53,120
are going to be there, at least, there might
be more, but it'll be at least that. So he
3151
04:48:53,120 --> 04:48:57,850
did this: he used two, and he used three, and
he used four.
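To make the formula concrete, here is a small Python sketch that computes one minus one over k squared for k of two, three and four, and builds the limits for the mu of 100 and standard deviation of five example from earlier. The function names are made up for this illustration, and the check that k is greater than one is an assumption added here, because the formula gives nothing useful at or below one.

```python
def chebyshev_min_fraction(k):
    """Minimum fraction of the data within k standard deviations of the mean."""
    if k <= 1:
        # Assumption for this sketch: only k greater than 1 gives a useful bound.
        raise ValueError("k must be greater than 1")
    return 1 - 1 / k ** 2


def chebyshev_interval(mean, sd, k):
    """Lower and upper limits: the mean minus and plus k standard deviations."""
    return mean - k * sd, mean + k * sd


for k in (2, 3, 4):
    print(f"k = {k}: at least {chebyshev_min_fraction(k):.1%} of the data")

# The earlier example: mean 100, standard deviation 5, going out 2 each way.
lower, upper = chebyshev_interval(100, 5, 2)
print("limits for k = 2:", lower, "to", upper)
```

The percentages are minimums that hold for any shape of distribution, so the actual share of the data inside the limits can be higher.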
3152
04:48:57,850 --> 04:48:58,850
So
3153
04:48:58,850 --> 04:49:03,440
two standard deviations, either way, three
standard deviations either way, or four standard
3154
04:49:03,440 --> 04:49:07,442
deviations either way. Now, students in my
class often think that they have to memorize
3155
04:49:07,442 --> 04:49:13,550
this one minus one over k to the second. You
don't have to memorize it. This was just a story of how
3156
04:49:13,550 --> 04:49:19,420
Chebyshev did this proof. So you can memorize
it for fun, but nobody memorizes it. I mean,
3157
04:49:19,420 --> 04:49:24,510
you know, Chebyshev did the work. I'm just
showing you the proof, right? So he figured
3158
04:49:24,510 --> 04:49:30,020
this all out. So as you can see,
you can do this with two, three and four, and
3159
04:49:30,020 --> 04:49:33,150
you'll get the same answers Chebyshev did.
So it's kind of a waste of time, but you can
3160
04:49:33,150 --> 04:49:37,862
do it just for fun. So he did the two one,
I showed you that on the top. I even talked
3161
04:49:37,862 --> 04:49:43,890
you through it. So if you plug two into
the equation, you'll get 75%. So in that thing
3162
04:49:43,890 --> 04:49:50,410
I was just talking about like imagine I had
100, right? And that was my x bar and my standard
3163
04:49:50,410 --> 04:49:57,190
deviation was five, right? And then two times
that is 10. So I go well my lower limit then
3164
04:49:57,190 --> 04:50:02,780
would be 90, and my upper limit would
be 110. And I would be able to confidently
3165
04:50:02,780 --> 04:50:10,930
say at least 75% of my x's are between 90
and 110. So if I'd measured maybe 100 people,
3166
04:50:10,930 --> 04:50:15,870
right, I'd say at least 75 of them are going
to be between these limits. In fact, it could
3167
04:50:15,870 --> 04:50:21,070
be 80, could be more, but at least 75. So
then Remember, I told you to predict that
3168
04:50:21,070 --> 04:50:25,150
as we made this number bigger, you know, we
go out more standard deviations, we're going
3169
04:50:25,150 --> 04:50:30,910
to cover more of the data, right? So when he did
three, it didn't come out as even; it came
3170
04:50:30,910 --> 04:50:37,240
out to 88.9% of the data. So almost 89% will
be covered if you go out three, or at least
3171
04:50:37,240 --> 04:50:44,230
88.9%. And if you go out four standard
deviations, it's at least 93.8%. Right? And
3172
04:50:44,230 --> 04:50:48,880
just to remind you, you know, when you have
upper and lower limits, you have an interval,
3173
04:50:48,880 --> 04:50:53,020
right? That's just we just call it that. But
this particular interval, if you get it this
3174
04:50:53,020 --> 04:50:57,900
way, it's a Chebyshev interval, because
everybody's so happy he did all this work, right?
3175
04:50:57,900 --> 04:51:04,420
Because I wouldn't have figured it out. So
I just wanted to demonstrate an example of
3176
04:51:04,420 --> 04:51:09,520
Chebyshev's interval, because then you can
know how to interpret them or why anybody
3177
04:51:09,520 --> 04:51:16,070
does them. Okay, so remember our patient sample,
they're in the waiting room at the lab, right?
3178
04:51:16,070 --> 04:51:19,600
So they waited on average, six minutes, and
then the standard deviation of them waiting
3179
04:51:19,600 --> 04:51:25,282
was 3.74. Right? Now, when I gave you this
demonstration of how to calculate the standard
3180
04:51:25,282 --> 04:51:31,561
deviation, I used this patient sample, and
I only had a few patients in the sample
3181
04:51:31,561 --> 04:51:35,750
on purpose, because otherwise your table that
we made with the defining formula would be
3182
04:51:35,750 --> 04:51:41,190
huge, and I'd never finish this video. So
what I'm gonna ask you to do is pretend that
3183
04:51:41,190 --> 04:51:47,420
instead, we had 100 patients in there, right?
Instead, I measured 100, and I got my x bar,
3184
04:51:47,420 --> 04:51:53,070
my 3.74 standard deviation, okay, so if we
measured 100 patients, and we got that, I
3185
04:51:53,070 --> 04:52:00,160
just put these Chebyshev rules
in that table. So if we go out two standard
3186
04:52:00,160 --> 04:52:05,590
deviations from the mean, from the x bar,
either side, whatever limits we get whatever
3187
04:52:05,590 --> 04:52:12,100
interval we get, we know at least, because
I made it so we say we studied 100
3188
04:52:12,100 --> 04:52:18,710
patients. So by law, at least 75 of
those patients will be between those lower
3189
04:52:18,710 --> 04:52:24,610
and upper limits, if we follow Chebyshev's
theorem. And if I do go out three standard deviations,
3190
04:52:24,610 --> 04:52:30,490
at least 88.9 patients will be in there. Okay,
I know that doesn't make any sense, like 88.9
3191
04:52:30,490 --> 04:52:34,490
patients. What's point nine of a patient? But
what they're saying is, I guess it would be
3192
04:52:34,490 --> 04:52:41,780
89. All right, yeah, 89% of the patients or
in other words, 89 patients, at least would
3193
04:52:41,780 --> 04:52:47,970
be in that interval. And of course, if I went
out four, at least, I wouldn't have to say 93.8
3194
04:52:47,970 --> 04:52:52,840
of a patient, but at least 94 patients would
fit in that interval. And if you're thinking
3195
04:52:52,840 --> 04:52:56,920
about if we only start with 100 patients,
that's almost all of them. So the four one
3196
04:52:56,920 --> 04:53:01,920
isn't so useful, right? So you'll see me on
the left side of the slide calculating the
3197
04:53:01,920 --> 04:53:08,290
intervals, right? So let's start with the
first one. The first one is two standard deviations
3198
04:53:08,290 --> 04:53:14,940
on either side of the mean. So the Chebyshev
interval we get is negative 1.48 to
3199
04:53:14,940 --> 04:53:20,520
13.48. And you probably notice you can't wait
negative time. So already, this is kind of
3200
04:53:20,520 --> 04:53:27,282
weird, right? But what this is saying is of
our 100 patients, at least 75 of them because
3201
04:53:27,282 --> 04:53:33,772
this is the 75% Chebyshev interval, waited
between negative 1.48 minutes, so that might
3202
04:53:33,772 --> 04:53:41,870
as well be rounded to zero, so between zero minutes
and 13.48 minutes, right. And so at
3203
04:53:41,870 --> 04:53:52,373
least 75% of them fell in that range.
Now 13.48 minutes is kind of long. So we would
3204
04:53:52,373 --> 04:53:58,000
be happy, I guess, if 75% of them fell in that
range, because then that means
3205
04:53:58,000 --> 04:54:05,430
that they were probably not waiting that long.
But if you go out, then you widen this interval
3206
04:54:05,430 --> 04:54:13,890
like to 88.9%. If you do that, then you say at
least, well, round it to 89, 89% of the patients
3207
04:54:13,890 --> 04:54:18,120
waited between negative 5.22
minutes, which you might as well make zero,
3208
04:54:18,120 --> 04:54:24,830
and 17.22 minutes. So as you see, if we widen
the interval, we're going to get some later
3209
04:54:24,830 --> 04:54:32,260
waiters in there. And so then we'll say, Well,
at least 89% were between there, but at least
3210
04:54:32,260 --> 04:54:38,250
89% were between there, and that means it
wasn't bigger, right. And then again, we go
3211
04:54:38,250 --> 04:54:43,970
out one more, we get 93.8%. So let's just
round it to 94. So at least 94% of the patients
3212
04:54:43,970 --> 04:54:50,400
or if we have 100 patients, at least 94 of
them waited between negative 8.96 minutes,
3213
04:54:50,400 --> 04:54:58,160
which again is nonsensical, up to 20.96. But
then we're starting to get where we'll have
3214
04:54:58,160 --> 04:55:03,080
almost all the patients somewhere between
zero and 20 minutes, we really don't know
3215
04:55:03,080 --> 04:55:07,950
how long they waited. So this is just kind
of to show you what happens when you widen
3216
04:55:07,950 --> 04:55:13,520
that interval: you maybe have less certainty
about what happened to individuals, but sort of
3217
04:55:13,520 --> 04:55:22,150
a better idea of what the range is. So again,
I just put this at the bottom. If we had 100
3218
04:55:22,150 --> 04:55:26,830
patients, this is how you would interpret
it, at least 75 would have waited
3219
04:55:26,830 --> 04:55:33,360
between the lower and upper limit for the
75% Chebyshev interval. And then at least
3220
04:55:33,360 --> 04:55:39,500
88.9 patients, I know, nonsensical. And then
the 93.8. So you see that interpretation in the lower
3221
04:55:39,500 --> 04:55:49,320
part of the slide. So this is a really difficult
concept for a lot of students. And so I'll
3222
04:55:49,320 --> 04:55:55,830
just give you this take home message. First
of all, Chebyshev's interval works for any
3223
04:55:55,830 --> 04:55:59,770
distribution: normal, skewed, whatever. The reason
why that's part of the take-home message is that
3224
04:55:59,770 --> 04:56:04,610
later, we're going to learn about intervals
that only work with normal distributions.
3225
04:56:04,610 --> 04:56:09,390
Okay? So this one is loosey goosey. It works
with all distributions. So that's one of the
3226
04:56:09,390 --> 04:56:15,030
take-home messages for Chebyshev's interval.
Also, Chebyshev's intervals tell you that at
3227
04:56:15,030 --> 04:56:20,460
least a certain percent of the data are in
the interval. Later, we're going to learn
3228
04:56:20,460 --> 04:56:25,282
about intervals where exactly a certain amount
of data are in that interval. And so Chebyshev,
3229
04:56:25,282 --> 04:56:31,690
again, a little loosey goosey, right,
he says at least. Next, Chebyshev intervals
3230
04:56:31,690 --> 04:56:36,640
are sometimes nonsensical, as we just talked
about. Negative time doesn't work, right.
3231
04:56:36,640 --> 04:56:42,940
Sometimes you'll have very high limits, especially
with a four. And so ultimately, they're not
3232
04:56:42,940 --> 04:56:47,520
very useful. And they're not used in health
care. I literally had never heard of Chebyshev's
3233
04:56:47,520 --> 04:56:52,580
interval until I started teaching this
class. So what is the purpose of teaching
3234
04:56:52,580 --> 04:56:57,820
you Chebyshev's interval? The purpose of teaching
this is to point out in statistics, we often
3235
04:56:57,820 --> 04:57:03,040
use the s or the population standard deviation,
you know, just standard deviation. And we
3236
04:57:03,040 --> 04:57:08,510
add or subtract, well, add and subtract, it
from the mean, as a good way of making lower
3237
04:57:08,510 --> 04:57:13,290
and upper limits that have special significance.
That's really the main take home message is
3238
04:57:13,290 --> 04:57:19,200
that you'll see this pattern as we go through
this class, where we get a mean either populations
3239
04:57:19,200 --> 04:57:26,870
or sample, and we have x bar, you know, x
bar or population mean. And then we have a
3240
04:57:26,870 --> 04:57:31,380
standard deviation, right, either from a sample
or a population. And then we take either one
3241
04:57:31,380 --> 04:57:36,342
standard deviation, and we add and subtract it, or
two, or multiples. And those intervals then
3242
04:57:36,342 --> 04:57:40,970
have certain significance. I only taught you
in this one about Chebyshev, but you'll learn
3243
04:57:40,970 --> 04:57:48,480
about other intervals later that are made
similarly. So in conclusion, what did we learn,
3244
04:57:48,480 --> 04:57:51,920
we learned how to calculate the range, we
learned how to calculate the variance and
3245
04:57:51,920 --> 04:57:56,020
standard deviation. We learned about how to
calculate the coefficient of variation, how
3246
04:57:56,020 --> 04:58:02,660
to interpret it. And we talked about the difference
in the formulas from sample versus population.
3247
04:58:02,660 --> 04:58:06,900
And we learned about Chebyshev and his theorem,
how he figured it out, and how we calculate
3248
04:58:06,900 --> 04:58:10,770
these intervals and how you interpret them.
Now I just thought I'd show you this picture
3249
04:58:10,770 --> 04:58:16,350
of Chebyshev here. He's a Russian guy. Well,
the stamp was from the USSR, before the Iron
3250
04:58:16,350 --> 04:58:22,202
Curtain fell. But I just thought I'd show
it to you. So you knew who figured all this
3251
04:58:22,202 --> 04:58:27,250
out. Good job, you've made it through the
measures of variation. And now you're ready
3252
04:58:27,250 --> 04:58:32,138
to do the quiz, the homework, whatever,
right? You're totally knowledgeable.
3253
04:58:32,138 --> 04:58:33,460
Good job.
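As a recap of the Chebyshev part of this lecture, here is a minimal Python sketch of the two calculations, using the waiting-room numbers from the slides (mean of 6 minutes, standard deviation of 3.74). The function names are made up for illustration, and the formula 1 - 1/k^2 is the usual statement of Chebyshev's theorem, which the lecture refers to only as "the equation".

```python
def chebyshev_fraction(k):
    """Minimum fraction of the data within k standard deviations of the mean."""
    if k <= 1:
        raise ValueError("the bound is only informative for k > 1")
    return 1 - 1 / k**2


def chebyshev_interval(mean, sd, k):
    """Lower and upper limits: the mean minus and plus k standard deviations."""
    return (mean - k * sd, mean + k * sd)


# Waiting-room example from the lecture: x bar = 6 minutes, s = 3.74 minutes.
for k in (2, 3, 4):
    low, high = chebyshev_interval(6, 3.74, k)
    share = chebyshev_fraction(k)
    print(f"k={k}: at least {share:.1%} of patients between {low:.2f} and {high:.2f} minutes")
```

A negative lower limit is not clamped to zero here; as discussed in the lecture, you would read it as zero minutes when interpreting waiting times.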
3254
04:58:33,460 --> 04:58:40,990
Well, I'm back. And so are you. Welcome to
Chapter 3.3, percentiles and box and whisker
3255
04:58:40,990 --> 04:58:46,270
plots. It's Monica Wahi, Labouré College lecturer.
And this is what we're going to talk about.
3256
04:58:46,270 --> 04:58:49,610
And this is what you're going to learn. At
the end of this lecture, the students should
3257
04:58:49,610 --> 04:58:55,740
be able to explain what a percentile means,
describe what the interquartile range is,
3258
04:58:55,740 --> 04:59:01,020
and how to calculate it. Explain the steps
to making a box and whisker plot, and also
3259
04:59:01,020 --> 04:59:08,110
state how a box and whisker plot helps a person
evaluate the distribution of the data. So
3260
04:59:08,110 --> 04:59:12,870
let's get started. You know, whenever we talk
about a box and whisker plot, I think of some
3261
04:59:12,870 --> 04:59:15,410
cute little animal with all those whiskers.
3262
04:59:15,410 --> 04:59:19,401
I'll explain what the whiskers really are,
I mean, not on the animal, but on the box
3263
04:59:19,401 --> 04:59:23,750
and whisker plot later. So what are we going
to go over, we're going to go over percentiles,
3264
04:59:23,750 --> 04:59:28,670
and we're going to explain what those are.
Then we're going to talk about quartiles,
3265
04:59:28,670 --> 04:59:32,770
sounds a little similar, it's got the "tiles",
and you'll understand why they're
3266
04:59:32,770 --> 04:59:36,880
similar. Then we're going to compute
quartiles. And then finally, we're going to do
3267
04:59:36,880 --> 04:59:42,700
the box and whisker plot. All right. So let's
go. So percentiles, we're going to have a
3268
04:59:42,700 --> 04:59:45,990
flashback, okay. You're not going to like
this little part because it's going to remind
3269
04:59:45,990 --> 04:59:50,670
you of standardized tests. So maybe not all
of you have been subjected to this, but most
3270
04:59:50,670 --> 04:59:55,460
of us have. If you've gone to high school in
the US, you probably got to deal with these
3271
04:59:55,460 --> 05:00:00,660
standardized tests. So just remember, we're
only talking about quantitative data. All
3272
05:00:00,660 --> 05:00:05,050
right. So if you take a standardized test
or a non standardized test, you usually get
3273
05:00:05,050 --> 05:00:08,730
points. And points are numerical. So that's
quantitative
3274
05:00:08,730 --> 05:00:09,730
data.
3275
05:00:09,730 --> 05:00:15,200
So I remember I used to take the standardized
tests, and I'd be, you know, showing my friends
3276
05:00:15,200 --> 05:00:19,610
what I got, right, because they'd send you
that thing in the mail. Now, I learned pretty
3277
05:00:19,610 --> 05:00:24,790
early on, that it mattered who all was in
the pool of people maybe taking a test with
3278
05:00:24,790 --> 05:00:29,990
you, right. So if you're taking the test with
a lot of stupid people, it's easier to get
3279
05:00:29,990 --> 05:00:35,602
a higher percentile, because what percentile
means is, for example, if you test at the
3280
05:00:35,602 --> 05:00:43,520
77th percentile, it means you did better than
77% of people taking the test. And a lot of
3281
05:00:43,520 --> 05:00:48,190
those standardized tests, they didn't care
how many points you got, what they cared about
3282
05:00:48,190 --> 05:00:54,430
is what percentile you were at. So different
batches of people would have different scores.
3283
05:00:54,430 --> 05:00:59,210
And if you got lucky, got a lot of
stupid people, then your score would be higher
3284
05:00:59,210 --> 05:01:04,000
than theirs. So it didn't really matter what
your absolute score was, it just mattered
3285
05:01:04,000 --> 05:01:08,410
what your percentile was. So just to sort
of remind you, if somebody had come up to
3286
05:01:08,410 --> 05:01:14,730
me in high school and said, I got 77 percentile,
what I'd say is okay, if only 100 people had
3287
05:01:14,730 --> 05:01:19,430
taken the test, you'd have done better than
77 of them. Of course, we were
3288
05:01:19,430 --> 05:01:24,770
all braggy, braggy, you know, I was always in
like the 95th, or the 97th, or the 98th. And
3289
05:01:24,770 --> 05:01:29,770
it happened so often, I wondered if it was
really true. But what I realized is, is that
3290
05:01:29,770 --> 05:01:33,830
there were so many people in the pool, because
you know, I was in public high school in Minnesota,
3291
05:01:33,830 --> 05:01:38,291
well, they were pooling together all the public
high schools in Minnesota, ninth grade, you
3292
05:01:38,291 --> 05:01:41,870
know, I was pooled with them, in 10th grade or
whatever. And when you're taking like nursing
3293
05:01:41,870 --> 05:01:46,600
examinations, sometimes they'll do that they'll
put you on a percentile. So I try to tell
3294
05:01:46,600 --> 05:01:49,872
people, you know, strategize, try to take
it when only stupid people are taking it, which
3295
05:01:49,872 --> 05:01:53,661
of course, makes no sense. How can you tell
when stupid people are taking it, right? You
3296
05:01:53,661 --> 05:01:59,130
don't even know who's taking it. But really,
that's what a percentile is, it's the
3297
05:01:59,130 --> 05:02:05,210
percentage of people that you did better than
if you're at the 77th percentile, then you
3298
05:02:05,210 --> 05:02:13,640
did better than 77%. Okay, so here's just
some rules about percentiles. First of all,
3299
05:02:13,640 --> 05:02:19,640
you know, I gave the example of the 77th percentile,
well, the rule is you have to have one between
3300
05:02:19,640 --> 05:02:24,950
one and 99. Like, you can't have the negative
second percentile, or the 105th percentile.
3301
05:02:24,950 --> 05:02:29,943
So that's the first, then whatever number
you pick, like I was saying, that percent
3302
05:02:29,943 --> 05:02:36,140
of the values would fall below that number.
And 100 minus that number percent of the values
3303
05:02:36,140 --> 05:02:42,353
fall above that number. So like, in my, well,
here, we'll give an example. 20 people take
3304
05:02:42,353 --> 05:02:49,880
a test, just 20, right, let's say there's
a maximum score of five on the test. The 25th
3305
05:02:49,880 --> 05:02:56,110
percentile means that 25% of the scores will
fall below whatever score that is, and 75%
3306
05:02:56,110 --> 05:03:00,880
will fall above that score. So let's say it's
an easy test. And let's say out of my 20
3307
05:03:00,880 --> 05:03:05,510
people, 12 get a four, which is almost the
total, right, and the remaining eight get
3308
05:03:05,510 --> 05:03:12,458
a five, so everybody gets either a four or
five, well, then, you know, the 25th percentile,
3309
05:03:12,458 --> 05:03:18,690
or the score that cuts off the bottom five
tests, right, will be a four, just because
3310
05:03:18,690 --> 05:03:22,560
this was an easy test. And, you know,
the first 12 people got a four and then the
3311
05:03:22,560 --> 05:03:27,950
rest, eight, got a five. So even the 50th percentile,
then would technically be at a four, right?
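The 20-person test example can be checked with a short Python sketch. This assumes the "nearest rank" convention, which matches the idea of the score that cuts off the bottom 25% of tests; it is one of several percentile conventions, and the function name is hypothetical.

```python
def percentile_nearest_rank(values, p):
    """Value at the p-th percentile using the nearest-rank convention.

    p must be a whole number from 1 to 99, as in the rule from the lecture.
    """
    if not 1 <= p <= 99:
        raise ValueError("p must be between 1 and 99")
    ordered = sorted(values)
    # Ceiling of p * n / 100, done in integer arithmetic (1-based rank).
    rank = (p * len(ordered) + 99) // 100
    return ordered[rank - 1]


# Easy test from the lecture: 12 people score a four, 8 people score a five.
scores = [4] * 12 + [5] * 8
for p in (25, 50, 75):
    print(f"{p}th percentile score:", percentile_nearest_rank(scores, p))
```

With these scores the 25th and 50th percentiles both land on a four, which is the point of the example: the percentile tells you a position in the ordered data, not the score itself.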
3312
05:03:27,950 --> 05:03:32,860
Now, this would all come out differently if
it were a hard test, and most people got a
3313
05:03:32,860 --> 05:03:39,440
score below three, right? And so the percentiles
would be shifted down, I just tell you that
3314
05:03:39,440 --> 05:03:44,860
so you can keep in mind the difference between
the actual score and the percentile. So the
3315
05:03:44,860 --> 05:03:50,560
percentile just happens to mean that this
percent of people got the score lower than
3316
05:03:50,560 --> 05:03:56,112
whatever your score is, it doesn't actually
say what your score was, right? So that's
3317
05:03:56,112 --> 05:04:02,140
what you just want to remember as we're going
through percentiles. Okay, now we're going to talk
3318
05:04:02,140 --> 05:04:07,060
about quartiles, and also the interquartile
range. Remember the "tile" thing, so this relates
3319
05:04:07,060 --> 05:04:12,300
to percentiles. So I put a little quarter
up there. So quartiles are a specific set
3320
05:04:12,300 --> 05:04:16,780
of percentiles. And you'll see why I put the
little quarter up there. It's because there's
3321
05:04:16,780 --> 05:04:22,120
technically four quartiles, it's just that
the top quartile doesn't count because it's
3322
05:04:22,120 --> 05:04:27,230
like the 100% one. And remember, it can only
go up to 99, like I was just showing you.
3323
05:04:27,230 --> 05:04:32,710
So we calculate the first second and third
quartile. So we have the 25th percentile is
3324
05:04:32,710 --> 05:04:38,280
the first quartile, the 50th percentile, which
is also known as the median, which you're
3325
05:04:38,280 --> 05:04:43,550
already good at, right? That's known as a
second quartile. And then the third quartile
3326
05:04:43,550 --> 05:04:51,240
is the 75th percentile. So those are your
quartiles: 25th, 50th and 75th. And technically
3327
05:04:51,240 --> 05:04:55,610
a 100th. But we never say that, right? Because
it only goes up to 99. So you have the first
3328
05:04:55,610 --> 05:05:00,800
quartile at the 25th percentile, the second
quartile at the 50th percentile. The third
3329
05:05:00,800 --> 05:05:05,610
quartile at the 75th percentile. And these
are actually not that hard to calculate by
3330
05:05:05,610 --> 05:05:07,630
hand.
3331
05:05:07,630 --> 05:05:13,792
So here's how you do it, sort of an overview.
So first you order the data from smallest
3332
05:05:13,792 --> 05:05:18,080
to largest, because remember, we have quantitative
data, so you can sort them, so you sort them
3333
05:05:18,080 --> 05:05:22,540
smallest to largest. And this is feeling very
median-y, right? Well guess what, that's
3334
05:05:22,540 --> 05:05:27,450
step two is you find the median, because the
median is also the second quartile, which
3335
05:05:27,450 --> 05:05:32,810
is also the 50th percentile. So already, you
know how to do this, right? Because you
3336
05:05:32,810 --> 05:05:38,330
could already do step one, and two. Now, this
is the harder part, this is the new part.
3337
05:05:38,330 --> 05:05:44,050
Step three is where you find the median of
the lower half of the data. Right. And so
3338
05:05:44,050 --> 05:05:49,370
wherever you put your median, you pretend
that's the end, and you look at the smaller
3339
05:05:49,370 --> 05:05:54,510
values, and you find the median of those.
And that would be the first quartile or the
3340
05:05:54,510 --> 05:06:00,140
25th percentile. Then finally, step four,
which you probably guessed, is you find where
3341
05:06:00,140 --> 05:06:03,570
your median was. And then you look at the
upper half of the data between the median
3342
05:06:03,570 --> 05:06:08,180
and the maximum, and you make a median out
of that part of the data, and then that's
3343
05:06:08,180 --> 05:06:13,890
your 75th percentile. Okay, and I'll show
you an example of us doing that. But this
3344
05:06:13,890 --> 05:06:20,442
is an overview of the steps. Now, remember the
range from before, what the range was? Yeah, you
3345
05:06:20,442 --> 05:06:24,793
remember it, that's where we had the maximum
minus the minimum, right? And I told you,
3346
05:06:24,793 --> 05:06:29,262
you have to actually do out the equation and
tell me what number you get. And that's the
3347
05:06:29,262 --> 05:06:35,570
range. Well, we have something new and improved.
In this lecture, here, we have the inter quartile
3348
05:06:35,570 --> 05:06:40,202
range. Okay, so you already know about quartiles,
we were just talking about them. But inter
3349
05:06:40,202 --> 05:06:46,650
quartile sort of means like, within, right.
So once you have the third quartile, and you
3350
05:06:46,650 --> 05:06:52,220
have the first quartile, you can calculate
the interquartile range, or IQR for short.
3351
05:06:52,220 --> 05:06:56,190
So if you see IQR on here, just remember,
that's interquartile range. So that's the
3352
05:06:56,190 --> 05:07:01,050
third quartile minus the first quartile. And
again, I'll show you an example. This
3353
05:07:01,050 --> 05:07:07,720
is just an overview. Okay, here's the example
I promised. On the right side of the slide,
3354
05:07:07,720 --> 05:07:13,880
you will see a sample of data I collected.
I went to AHD.com, that's American Hospital
3355
05:07:13,880 --> 05:07:20,600
Directory dot com, and that provides publicly
available information about American hospitals.
3356
05:07:20,600 --> 05:07:26,862
So I went in, and I took a random sample of
11 Massachusetts hospitals, there's a lot
3357
05:07:26,862 --> 05:07:31,920
more, so I took a random sample. And what
I did was I wrote down how many beds each
3358
05:07:31,920 --> 05:07:36,952
of those hospitals had. Because if a hospital
has several hundred beds, they're considered kind
3359
05:07:36,952 --> 05:07:42,250
of a big hospital. And if they have less than
100 beds, they're considered a smaller hospital.
3360
05:07:42,250 --> 05:07:48,130
So I wrote all those numbers down. And then
I already did step one of making our quartiles,
3361
05:07:48,130 --> 05:07:51,910
which is to order the data from smallest to
largest. So you'll see on the right side of
3362
05:07:51,910 --> 05:07:59,841
the slide, my smallest hospital had only 41
beds, and my largest hospital had 364 beds
3363
05:07:59,841 --> 05:08:04,702
and see I put all of them in order, they're
on the right. And so we already did step one.
3364
05:08:04,702 --> 05:08:12,282
So let's go on to step two. So the Step two
is to find the median, and that's quartile
3365
05:08:12,282 --> 05:08:18,522
two, or the 50th percentile. Now, you're already
good at that, right. And so we have 11 hospitals.
3366
05:08:18,522 --> 05:08:24,550
So we know that the sixth one in the row is
going to be the median, you know, because
3367
05:08:24,550 --> 05:08:29,380
it's an odd number of hospitals that I drew.
And so the sixth one, we'll circle it, that's
3368
05:08:29,380 --> 05:08:34,542
the 50th percentile or the median, so we already
got quartile two, it's, it's funny that you
3369
05:08:34,542 --> 05:08:36,090
have to start with quartile two, but that's
3370
05:08:36,090 --> 05:08:37,990
what you have to do.
3371
05:08:37,990 --> 05:08:43,770
Now, I just re color coded these. So you could
kind of remember what's going on as we do
3372
05:08:43,770 --> 05:08:50,410
the other steps. 126 is the median. That's
kind of not on anybody's side, it's not on
3373
05:08:50,410 --> 05:08:55,750
the lowest side, and it's not on the highest
side. The orange ones then are considered
3374
05:08:55,750 --> 05:09:01,100
below the median. And the blue ones are considered
above the median. And so I just color coded
3375
05:09:01,100 --> 05:09:06,950
them so you can keep track of what's going
on in the next slides. Okay, now we're going
3376
05:09:06,950 --> 05:09:12,550
to do the 25th percentile for step three.
So the goal is to find the median of the lower
3377
05:09:12,550 --> 05:09:16,840
half of the data. So now you see why I color
coded it is because now we're pretending just
3378
05:09:16,840 --> 05:09:21,810
the orange ones exist. And we are just finding
the median of that. And we're not counting
3379
05:09:21,810 --> 05:09:29,112
that 126, because that's already been used.
And so now we find that 90 is the 25th percentile,
3380
05:09:29,112 --> 05:09:32,770
how you remember that it's not the 75th, it's
not the third one is because it's the low
3381
05:09:32,770 --> 05:09:37,350
one, like 25 is a low number. And 75 is a
higher number. So you go to the lower part
3382
05:09:37,350 --> 05:09:40,880
of the data, you find the median of that,
and that's going to be your 25th percentile.
3383
05:09:40,880 --> 05:09:47,122
And so in our case, that's 90 then you probably
guessed it, you go to the blue ones, right
3384
05:09:47,122 --> 05:09:54,410
the upper half and you go get the median out
of that. And so of course ours is 254. So
3385
05:09:54,410 --> 05:10:00,020
that's our 75th percentile. So what we just
did is we calculated our quartiles. We have
3386
05:10:00,020 --> 05:10:05,010
our 50th percentile, our 25th percentile and
our 75th percentile. So that's what I meant
3387
05:10:05,010 --> 05:10:10,760
by that overview slide. This is an example
of how you would do that. And of course, I
3388
05:10:10,760 --> 05:10:16,080
have to give a shout out to the IQR, which
is the interquartile range. Remember, you
3389
05:10:16,080 --> 05:10:22,960
just learned that. So that's the 75th percentile
minus the 25th percentile. So in our case,
3390
05:10:22,960 --> 05:10:31,920
that's going to be 254 minus 90, which equals
164. So that is your IQR. So if I gave you
3391
05:10:31,920 --> 05:10:37,050
a test, and I asked you what is the IQR
for these data, you can't just put 254 minus
3392
05:10:37,050 --> 05:10:42,580
90, you actually have to work it out and put
164. So there you go. So that's our quartile
3393
05:10:42,580 --> 05:10:50,430
example. So I just wanted to step back and
give you some philosophical points on what
3394
05:10:50,430 --> 05:10:58,090
happens with q1 and q3, depending on how many
data points you have. Okay, so remember, the
3395
05:10:58,090 --> 05:11:04,450
first step of this is always to put them in
order from smallest to largest. So let's pretend
3396
05:11:04,450 --> 05:11:11,080
I had only drawn the first six values of my
hospitals. See how I put on the slide, I put
3397
05:11:11,080 --> 05:11:18,930
the position of the number, which is 1, 2, 3, 4, 5, 6.
And I put above the example numbers. So let's
3398
05:11:18,930 --> 05:11:22,772
say I was going to do the median on that,
you know, what I'd have to do is I'd have
3399
05:11:22,772 --> 05:11:30,280
to take 90 plus 97, divided by two. But then
the next question is, what do we do for q1
3400
05:11:30,280 --> 05:11:38,510
and q3? Well, given that in the example of
having six values, the 90 and 97 are mushed,
3401
05:11:38,510 --> 05:11:44,550
together for the median, they
can get reused, or rather they do get reused, when
3402
05:11:44,550 --> 05:11:50,470
looking at the bottom and the top half of
the data. So when we go to do Q1
3403
05:11:50,470 --> 05:11:55,030
in this, we would actually count that 90 in
there. In fact, q one would be 74, because
3404
05:11:55,030 --> 05:12:01,280
that's the median of the three numbers below
the median right below that line. And then
3405
05:12:01,280 --> 05:12:07,292
the Q three would actually be 121, because
we actually count the 97 in there. So in other
3406
05:12:07,292 --> 05:12:11,432
words, when you have like six values, and
the median is made out of mushing together
3407
05:12:11,432 --> 05:12:15,750
two values, like taking the average of those
two values, those two values, they get to
3408
05:12:15,750 --> 05:12:20,330
double dip. That is,
the bottom one gets to be in the bottom,
3409
05:12:20,330 --> 05:12:27,980
and the top one gets to be in the top when
calculating q1 and q3. Now, well, what if
3410
05:12:27,980 --> 05:12:33,100
we had seven values instead of six? Okay,
so I just expanded and pretended we had seven
3411
05:12:33,100 --> 05:12:38,790
hospitals. And you'll see that I have seven
positions there. Well, this was a little like
3412
05:12:38,790 --> 05:12:46,190
the one we did, together with the 11 values,
where the median was clearly this 97. Here,
3413
05:12:46,190 --> 05:12:53,280
in this case, it's 97. So that 97 does not
get reused in the bottom or the top. So you'll
3414
05:12:53,280 --> 05:12:58,890
notice that q one is the middle number of
the three bottom ones, and Q three is the
3415
05:12:58,890 --> 05:13:04,390
middle number of the top three ones. And so
that's what happens when you have seven values.
3416
05:13:04,390 --> 05:13:11,080
And it also happens when you have 11 values,
like I demonstrated with those hospitals.
3417
05:13:11,080 --> 05:13:15,800
But it's not super predictable. Because what
if you had eight values, we suddenly see it
3418
05:13:15,800 --> 05:13:20,702
gets a little complicated. So how would we
do this? Well see the first four are between
3419
05:13:20,702 --> 05:13:26,530
41 and 97, the top four between 121 and 155. Well,
to make our median, we'd have to take the
3420
05:13:26,530 --> 05:13:32,040
mean of 97 and 121. But remember, they don't
get used up. The 97 then gets to double dip
3421
05:13:32,040 --> 05:13:37,830
and be part of the calculation for q1, and
121 gets to double dip and be part of the
3422
05:13:37,830 --> 05:13:42,770
calculation for Q3. But even with this
double dipping, if you go down, you'll see
3423
05:13:42,770 --> 05:13:49,250
that there are then four numbers to contend
with, for q1. So of course, to get q1, you
3424
05:13:49,250 --> 05:13:55,750
actually have to mush together or take an
average of 74 and 90. And if you go up the
3425
05:13:55,750 --> 05:14:00,530
upper part of the data, in order to get q
three, you're going to have to make an average
3426
05:14:00,530 --> 05:14:06,650
of 126 and 142, the ones in position six
and position seven. So if you're unlucky enough
3427
05:14:06,650 --> 05:14:10,450
to get like eight values, then you realize
you're going to have to make your median by
3428
05:14:10,450 --> 05:14:14,990
making an average of two numbers, your q1
by making an average of two numbers, and your
3429
05:14:14,990 --> 05:14:21,190
q3 like that. So it's not super predictable
what's going to happen. You just have to pay
3430
05:14:21,190 --> 05:14:27,820
a lot of attention. Just remember if your
median is made out of two numbers averaged,
3431
05:14:27,820 --> 05:14:34,542
those numbers get to double dip in the downstairs
and the upstairs of calculating q1 and q3.
3432
05:14:34,542 --> 05:14:39,840
If instead your median is just one number,
like because you have an odd number of values,
3433
05:14:39,840 --> 05:14:48,470
then that guy has to just stay there and does
not double dip in q1 and q3 calculations.
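The double-dip rule can be written as a short Python sketch. It follows the by-hand method from this lecture (split at the median, leave the median out when the count is odd), and it uses only bed counts that are read out in the lecture. The function names are hypothetical, and statistical software often uses other quartile conventions that can give slightly different answers.

```python
from statistics import median


def quartiles(values):
    """Return (Q1, median, Q3) using the split-at-the-median method."""
    ordered = sorted(values)
    if len(ordered) < 2:
        raise ValueError("need at least two values")
    half = len(ordered) // 2
    # Even count: the two middle values "double dip", one in each half.
    # Odd count: the single middle value is the median and is left out of both halves.
    lower = ordered[:half]
    upper = ordered[-half:]
    return median(lower), median(ordered), median(upper)


def iqr(values):
    """Interquartile range: third quartile minus first quartile."""
    q1, _, q3 = quartiles(values)
    return q3 - q1


# First eight ordered bed counts mentioned in the lecture.
beds = [41, 74, 90, 97, 121, 126, 142, 155]
for n in (6, 7, 8):
    print(n, "values:", quartiles(beds[:n]), "IQR:", iqr(beds[:n]))
```

For the full sample of 11 hospitals the lecture gives Q1 as 90 and Q3 as 254, so the IQR is 254 minus 90, which is 164. Not every one of the 11 values is read out, so that sample is not reproduced in the code.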
3434
05:14:48,470 --> 05:14:53,420
So we can just see another example of this.
So this is nine values right? Now remember,
3435
05:14:53,420 --> 05:14:58,000
when I had 11 values, it was like having seven
values. I had this median and it was really
3436
05:14:58,000 --> 05:15:03,150
clear like we have here, but even the medians
of the top of the data and the
3437
05:15:03,150 --> 05:15:07,090
bottom of the data, they were just, you know,
it was an odd number. And so it was easy to
3438
05:15:07,090 --> 05:15:12,890
figure that out. Well, you see here, in this
case, our median is the fifth value, and that's
3439
05:15:12,890 --> 05:15:19,670
121. So 121 does not double dip anywhere,
right? So when we go to calculate q1, we only
3440
05:15:19,670 --> 05:15:24,020
have four values, because we're not counting
the 121. And then we're stuck with taking
3441
05:15:24,020 --> 05:15:28,782
an average of the second and third value to
get q1. And then same thing upstairs here,
3442
05:15:28,782 --> 05:15:33,050
between, you know, 142 and 155. You know,
those are the two middle numbers of our four
3443
05:15:33,050 --> 05:15:37,710
numbers at the top. And then we have to take
an average of those to get q3. So I guess
3444
05:15:37,710 --> 05:15:41,400
this is just my long way of saying you got
to be really careful what you're doing. First,
3445
05:15:41,400 --> 05:15:46,760
make sure you've gotten the median, then figure
out if that median is this kind of a median
3446
05:15:46,760 --> 05:15:51,170
where you're just circling it, or it's a
median that came out of an average, because
3447
05:15:51,170 --> 05:15:54,410
if it's a median that came out of an average,
just know that those numbers are going to
3448
05:15:54,410 --> 05:15:59,922
double dip in q1 and q3. And if it's a median
that was because you had an odd number of
3449
05:15:59,922 --> 05:16:06,280
data, it was just like in the middle, that
one doesn't get to double dip. Okay, enough
3450
05:16:06,280 --> 05:16:09,872
double dipping, I'm getting hungry. When I
go to that roller coaster, I'm going to get
3451
05:16:09,872 --> 05:16:14,970
a double dip ice cream cone. Okay, we're gonna
move on to box and whisker plot, which is
3452
05:16:14,970 --> 05:16:20,230
kind of like your percentiles getting graphed,
right. So let's go back to our ingredients,
3453
05:16:20,230 --> 05:16:24,910
we already created our box plot ingredients.
In fact, that's why I trickily went through
3454
05:16:24,910 --> 05:16:30,420
those quartiles first, because now we've created
our ingredients to make a box plot. So I just
3455
05:16:30,420 --> 05:16:36,372
sort of summarized what we have on the left
side of the slide (say that 50 times):
3456
05:16:36,372 --> 05:16:43,092
hospital beds was what we were counting. The
smallest regional hospital had only 41 beds.
3457
05:16:43,092 --> 05:16:50,350
To make it a little easier, I put it in
order: q1 was 90, the median, q2, was 126.
3458
05:16:50,350 --> 05:16:55,550
You know what I mean? I mean quartile, right,
by these q's. Then q3 is 254. And then
3459
05:16:55,550 --> 05:17:00,651
the maximum was 364. Okay, so let's make a
boxplot. And then you remember what the data
3460
05:17:00,651 --> 05:17:03,680
looks like on the right side of the slide.
Okay, well, now I'm going to walk you through
3461
05:17:03,680 --> 05:17:09,660
how you would make this box plot. So first,
you draw this thing? Well, how do you know
3462
05:17:09,660 --> 05:17:14,762
what to draw? Well, I usually just draw a
line, a vertical line, and then put a zero
3463
05:17:14,762 --> 05:17:18,760
at the bottom, and then I cheat: I go look
at the maximum and go, oh, I wonder where that
3464
05:17:18,760 --> 05:17:24,860
is. And see, our maximum was 364, so I
just made it 400 at the top. If our maximum
3465
05:17:24,860 --> 05:17:28,880
had been something like, you know, I think
Massachusetts General Hospital has something
3466
05:17:28,880 --> 05:17:35,750
like 600 or 800 beds. If we had gotten that
one in there, and that was our maximum, I
3467
05:17:35,750 --> 05:17:39,931
would maybe go up to 900, you know, whatever
is a little bit above the maximum, that's
3468
05:17:39,931 --> 05:17:45,400
what I put at the top. So this was 364. So
I put 400, then what I did was I divided it
3469
05:17:45,400 --> 05:17:50,470
in half, like I see where the 200 is, I just
kind of threw that in there. And then I divided
3470
05:17:50,470 --> 05:17:54,980
between the 200 and the 400 in half and put
the 300. And so you can just kind of eyeball
3471
05:17:54,980 --> 05:18:00,000
this and draw it out that way if you want.
Okay, so I got this thing set up. And then
3472
05:18:00,000 --> 05:18:02,460
here we go, we're going to do the first thing.
3473
05:18:02,460 --> 05:18:09,420
Okay, here's the first thing we're going to
draw in: q1, or quartile one. So on the left
3474
05:18:09,420 --> 05:18:14,850
side of the slide, you'll see a circle that's
90. On the right side of the slide, I made
3475
05:18:14,850 --> 05:18:21,820
this horizontal line. Now, how wide do you make
that line? Well, look at how it's proportioned
3476
05:18:21,820 --> 05:18:26,970
to that up-and-down graph thing I
made, you know, with the numbers, you probably
3477
05:18:26,970 --> 05:18:32,880
don't want it too wide, but you don't want
it too skinny. This is just about right, like
3478
05:18:32,880 --> 05:18:38,850
Goldilocks just right. Okay, so you just make
this horizontal line at q1. So that's the
3479
05:18:38,850 --> 05:18:48,240
first. Now you make a copy of that same line
parallel, and you make it at q3. So if you
3480
05:18:48,240 --> 05:18:52,740
look at that (I hope you're not
lost), if you look at that, you know, 100, 200,
3481
05:18:52,740 --> 05:18:58,271
300, 400, you know, q1 is 90, so it's about
10 under 100. So that's how I knew where
3482
05:18:58,271 --> 05:19:04,190
to position that lower one. And then 254,
that's about, you know, a little bit higher
3483
05:19:04,190 --> 05:19:09,050
than halfway between 200 and 300. So that's where
I roughly knew how to position this one. It's
3484
05:19:09,050 --> 05:19:13,670
not perfect. If you do it in statistical software,
they put it out and it's perfect. But for
3485
05:19:13,670 --> 05:19:15,850
demonstration purposes, that's
3486
05:19:15,850 --> 05:19:16,850
what I'm doing.
3487
05:19:16,850 --> 05:19:21,640
Okay, so now what we've done is we put in
q1 and q3 and we put these horizontal lines
3488
05:19:21,640 --> 05:19:28,960
that are parallel. Alright, here's the next
step. We connect them, hence the box. So the
3489
05:19:28,960 --> 05:19:35,990
box gets made, right? You just
connect them. Alright, now I put a little
3490
05:19:35,990 --> 05:19:39,910
circle on the right side of the slide because
I wanted you to make sure you saw what's going
3491
05:19:39,910 --> 05:19:46,170
on there. Okay. That's when we put in q2 or
the median, right? So the median is 126. See
3492
05:19:46,170 --> 05:19:51,230
where 100 is. It's up a little bit, and we
make that parallel. But you see how I made
3493
05:19:51,230 --> 05:19:56,580
q1 and q3, connected the box, and then did
the median. I think this is the easiest order
3494
05:19:56,580 --> 05:20:00,600
to do it in when you're drawing it by hand
and you're not the statistical software, because
3495
05:20:00,600 --> 05:20:05,380
then that way, you know, this box is all nice.
And then your median fits and everything looks
3496
05:20:05,380 --> 05:20:12,690
nice, but we're not done yet. We got the whiskers.
So you're probably wondering this whole time,
3497
05:20:12,690 --> 05:20:16,920
what is this whisker thing? Well, you just
figured out what the box is. The whiskers are
3498
05:20:16,920 --> 05:20:24,602
the markers for the minimum and the maximum.
So you'll see the minimum's at 41. And then
3499
05:20:24,602 --> 05:20:30,110
we have a whisker at 41. So why is it called
a whisker? Well, it's smaller. I don't know
3500
05:20:30,110 --> 05:20:34,350
why it's called the whisker, but it's different
from the other ones. Because it's smaller.
3501
05:20:34,350 --> 05:20:39,030
I guess that's a reason maybe. But notice
how it's like half the size, almost half the
3502
05:20:39,030 --> 05:20:44,040
size. Sometimes they're really, really small,
but it's tiny. And you want to position it,
3503
05:20:44,040 --> 05:20:49,530
like vertically in the middle, like you don't
want it off to the side or anything. And
3504
05:20:49,530 --> 05:20:55,820
you also want these parallel. You'll notice
the maximum's up there way high at 364. So
3505
05:20:55,820 --> 05:20:59,990
I just did both of these on the same slide.
So you draw in the whiskers. And then you
3506
05:20:59,990 --> 05:21:05,060
probably can guess the last step. Yeah, connect
the whiskers to the box. So good job. There
3507
05:21:05,060 --> 05:21:11,362
you went and did it. You made a box plot. And
now let's look at the interquartile
3508
05:21:11,362 --> 05:21:18,080
range. Remember how you calculated this, you
took q3 minus q1? Well, that means
3509
05:21:18,080 --> 05:21:27,770
this boxy thing is 164 beds long, right?
So that's your IQR. This is a visual
3510
05:21:27,770 --> 05:21:34,700
pictorial of your IQR. So very good. We did
our boxplot, we did our interquartile range.
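As a small check on the arithmetic, here is a sketch in Python that takes the five numbers read off the slide and works out the pieces of the box plot. The variable names are invented for this sketch, and rounding the axis up to the next hundred is just one assumed way of picking "a little bit above the maximum".

```python
import math

# Five-number summary for the hospital beds data, as stated in the lecture.
summary = {"min": 41, "q1": 90, "median": 126, "q3": 254, "max": 364}

# The box runs from q1 to q3, so its length is the interquartile range.
iqr = summary["q3"] - summary["q1"]

# Assumed rule for the top of the hand-drawn axis: next hundred above the max.
axis_top = math.ceil(summary["max"] / 100) * 100

# Length of the line connecting each whisker to the box.
lower_whisker = summary["q1"] - summary["min"]
upper_whisker = summary["max"] - summary["q3"]

print(iqr, axis_top, lower_whisker, upper_whisker)
```

The upper connecting line comes out longer than the lower one, which matches how the finished box plot looks on the slide.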
3511
05:21:34,700 --> 05:21:37,940
And you're probably wondering, why do we
even do this?
3512
05:21:37,940 --> 05:21:40,250
I'll explain.
3513
05:21:40,250 --> 05:21:46,091
So why do we do this? Well, one of the main
things that we do is we look at the distribution
3514
05:21:46,091 --> 05:21:50,800
in the data. I know, I know, you guys learned
how to do a histogram already, and you're
3515
05:21:50,800 --> 05:21:55,610
good at a stem and leaf. Those are other ways
of looking at the distribution. And if you
3516
05:21:55,610 --> 05:22:01,690
make a histogram of these data, you'll find
that Well, I mean, these are only 11. But
3517
05:22:01,690 --> 05:22:05,410
you know, if you get a pile of data, and you
make a histogram and the stem and leaf, you'll
3518
05:22:05,410 --> 05:22:11,240
find that those images agree with the boxplot.
And you're probably thinking, Well, how do
3519
05:22:11,240 --> 05:22:14,942
they agree? Well, if you look on the
right side of the slide, I'm just giving you
3520
05:22:14,942 --> 05:22:20,650
an example. So, skewed right: if you had skewed
right data, and you knew it, because you made
3521
05:22:20,650 --> 05:22:26,830
a histogram and you saw a skewed right distribution,
if you took the same data, and you made a
3522
05:22:26,830 --> 05:22:34,110
boxplot, it would be kind of like that skewed
right one that we just did, where the top
3523
05:22:34,110 --> 05:22:38,280
whisker would be really high, and that thing
connecting the whisker to the box, that would
3524
05:22:38,280 --> 05:22:43,150
be like really long, whereas the one on the
bottom is short. As you can see, the skewed
3525
05:22:43,150 --> 05:22:49,971
left is the opposite, right? The bottom one
is long, and the top one short. If you have
3526
05:22:49,971 --> 05:22:56,330
a normal distribution, remember that that's
symmetrical. That's that mound shaped distribution,
3527
05:22:56,330 --> 05:23:00,530
and you have a larger spread. In other words,
you have a bigger standard deviation, you
3528
05:23:00,530 --> 05:23:05,930
have a bigger variance, right? Then you're
going to see a box that's really big like
3529
05:23:05,930 --> 05:23:10,260
that. But if you have a smaller spread, and
it's a normal distribution, you're going to
3530
05:23:10,260 --> 05:23:13,430
see a box that looks like this. And you're
probably wondering, where are you getting
3531
05:23:13,430 --> 05:23:21,610
these shapes? Well, I'll kind of show you
on the last slide here as we wrap up with the conclusion.
3532
05:23:21,610 --> 05:23:27,770
It's because if you fly over a roller coaster,
like see this roller coaster, this roller
3533
05:23:27,770 --> 05:23:32,520
coaster is skewed right? That would make sense,
right? Because you want to go up steeply,
3534
05:23:32,520 --> 05:23:40,530
and then go down really fast. And see how
the boxplot for the roller coaster looks.
3535
05:23:40,530 --> 05:23:47,670
You've got sort of the part where you start
going up really fast. That's kind of near
3536
05:23:47,670 --> 05:23:53,910
the median and kind of near the 25th percentile.
And the part where you start where you're
3537
05:23:53,910 --> 05:23:58,692
just getting on and it's slowly going there.
That's like the bottom whisker. And then you
3538
05:23:58,692 --> 05:24:03,330
go up and you come down. And it's a long tail,
which is good, I guess if you design roller
3539
05:24:03,330 --> 05:24:09,042
coasters. And that long tail, then, is
that right skew. So that's what I mean: if
3540
05:24:09,042 --> 05:24:13,442
in your mind, you're going, how is she getting
this histogram into this box plot? This
3541
05:24:13,442 --> 05:24:19,080
is kind of how I'm doing it. I'm saying,
Well, if you flew over the histogram, or the
3542
05:24:19,080 --> 05:24:24,990
roller coaster, you might see like a shape
of a box plot. So in conclusion, we talked
3543
05:24:24,990 --> 05:24:29,620
about percentiles, in general, like the 77th
percentile, what that all means. And then
3544
05:24:29,620 --> 05:24:35,430
we focused in on quartiles, which are a specific
set of percentiles. And then we
3545
05:24:35,430 --> 05:24:40,810
already did calculate the quartiles.
And the reason why we did that is because
3546
05:24:40,810 --> 05:24:45,800
we first needed to do that in order to make
the interquartile range. And then finally,
3547
05:24:45,800 --> 05:24:51,380
we needed those quartiles in order to make and
interpret a box and whisker plot. Okay, this
3548
05:24:51,380 --> 05:24:56,770
isn't the roller coaster I'm going to, but
I'm going to one and I guarantee you it is
3549
05:24:56,770 --> 05:24:58,080
skewed right.
3550
05:24:58,080 --> 05:25:05,000
Greetings and salutations. Hi, this is Monica
Wahi, your library college lecturer bringing
3551
05:25:05,000 --> 05:25:13,160
to you chapter 4.1, scatter diagrams and linear
correlation. So here's what you're gonna learn.
3552
05:25:13,160 --> 05:25:18,360
At the end of this lecture, you should be
able to explain what a scattergram is and
3553
05:25:18,360 --> 05:25:25,952
how to make one, state what strength and direction
mean with respect to correlations, and compute
3554
05:25:25,952 --> 05:25:31,920
correlation coefficient r using the computational
formula. And finally, you should be able to
3555
05:25:31,920 --> 05:25:38,122
describe why correlation is not necessarily
causation. So let's jump right into it. First,
3556
05:25:38,122 --> 05:25:42,750
we're going to talk about making a scatter
diagram. And the thing on the right side of
3557
05:25:42,750 --> 05:25:47,440
the screen is not a scatter diagram, but it's
kind of scattered. So I put it there, it's
3558
05:25:47,440 --> 05:25:51,390
kind of pretty. And then next, we're going
to talk about the correlation coefficient, r,
3559
05:25:51,390 --> 05:25:56,840
and how to make it. And then finally, we're
gonna do a shout out to causation and lurking
3560
05:25:56,840 --> 05:26:01,100
variables, which remember we talked about
before, but we're going to talk about them
3561
05:26:01,100 --> 05:26:06,980
again, in relationship to r. So let's start
with the scattergram. And I also call it a
3562
05:26:06,980 --> 05:26:10,350
scatter plot, because it's like everything
in statistics, there's got to be about eight
3563
05:26:10,350 --> 05:26:15,840
names for everything. So scattergram and
scatterplot mean the same thing. So let's
3564
05:26:15,840 --> 05:26:23,820
just get with the setup here. So scattergrams,
or scatter plots, are graphs of xy pairs.
3565
05:26:23,820 --> 05:26:31,050
So what's an XY pair? XY pairs are measurements,
two measurements made of the same individual
3566
05:26:31,050 --> 05:26:37,250
or the same unit. So if you measure my height
and my weight, that's an XY pair, if you measure
3567
05:26:37,250 --> 05:26:41,200
my height and my friend's weight, that's
not an XY pair, because that's two different
3568
05:26:41,200 --> 05:26:50,410
people, right? So these xy pairs, the x part
is called the explanatory or independent variable.
3569
05:26:50,410 --> 05:26:55,720
And it's always graphed on the x axis. So
remember, in algebra, you would do these graphs,
3570
05:26:55,720 --> 05:27:00,920
where you have this vertical line, and that
was the y axis, and you have this horizontal
3571
05:27:00,920 --> 05:27:04,730
line, which was the x axis. And I always had
trouble remembering, which is which, but that's
3572
05:27:04,730 --> 05:27:11,040
how it is. And so whichever of
the pairs is x, expect that to be graphed
3573
05:27:11,040 --> 05:27:17,080
along the x axis. And it's also called the
explanatory or independent variable. Remember,
3574
05:27:17,080 --> 05:27:22,870
there's got to be a million names for everything
explanatory or independent variable. So if
3575
05:27:22,870 --> 05:27:27,560
I talked to you and said, here's an XY pair,
and this one is the independent variable,
3576
05:27:27,560 --> 05:27:31,840
or this one is the explanatory variable, you
need to like just secretly know I'm talking
3577
05:27:31,840 --> 05:27:38,680
about the X of the two. And then surprise,
here's the y of the two and the Y is also
3578
05:27:38,680 --> 05:27:44,180
called response variable. It's also called
the dependent variable. And that is graphed
3579
05:27:44,180 --> 05:27:50,070
on the y axis. So again, like I said, I used
to have trouble remembering is the vertical
3580
05:27:50,070 --> 05:27:55,830
one, the y axis or the horizontal one. But
what I did was I remembered, if you take a
3581
05:27:55,830 --> 05:28:01,120
capital Y, and you go grab onto its tail,
and you go pull it straight down, you'll see
3582
05:28:01,120 --> 05:28:05,820
that it's vertical. And that's how I remember
that's the y axis, it doesn't hurt the Y.
3583
05:28:05,820 --> 05:28:11,830
It's used to that. So if you can stretch the
y's tail down, and you get vertical, remember,
3584
05:28:11,830 --> 05:28:16,762
that's the y axis. And then the other one
is the x axis. Okay? And then also, you have
3585
05:28:16,762 --> 05:28:23,370
to find a way to remember which one means
what like, does x mean explanatory and independent?
3586
05:28:23,370 --> 05:28:28,350
Or does it mean response and dependent?
So how I do it is, you know how we sing the
3587
05:28:28,350 --> 05:28:36,622
ABCs, A-B-C-D-E-F-G? Well, if you fast forward, the end
is W, X, Y, Z, right? So the x comes before
3588
05:28:36,622 --> 05:28:45,000
the Y, you know, in the alphabet, so I do
x and then an arrow to y. And then I imagined
3589
05:28:45,000 --> 05:28:49,390
in my head that's saying x causes y, even though
it doesn't necessarily cause y, as you'll see
3590
05:28:49,390 --> 05:28:54,452
at the end of this lecture, but I think about
it that way. Because if that happens, then
3591
05:28:54,452 --> 05:29:01,730
y is dependent on x and x is independent,
it can do whatever it wants, but y is dependent.
3592
05:29:01,730 --> 05:29:08,910
So that's my way of remembering x is the independent
variable, and y is the dependent variable.
3593
05:29:08,910 --> 05:29:15,060
So anyway, that's a long way of saying the
scattergram is a graph of these xy pairs.
3594
05:29:15,060 --> 05:29:21,740
And that's what we're going to do is make
that graph. So we needed some xy pairs, right?
3595
05:29:21,740 --> 05:29:27,380
So I asked the question: does the number of
diagnoses a patient has correlate
3596
05:29:27,380 --> 05:29:32,580
with the number of medications she or he takes?
So if you don't have that many diagnoses,
3597
05:29:32,580 --> 05:29:36,820
you probably aren't on that many meds, right.
But if you have a lot of diagnoses, you should
3598
05:29:36,820 --> 05:29:38,530
be on a lot of meds. But we all know
3599
05:29:38,530 --> 05:29:42,890
people in real life can sort of violate that
just depending, I mean, you could have one
3600
05:29:42,890 --> 05:29:47,170
really bad diagnosis with a lot of meds. Or
you can have a bunch of diagnoses that are
3601
05:29:47,170 --> 05:29:50,780
all taken care of with one med, so it's not
perfect, but this is kind of a reasonable
3602
05:29:50,780 --> 05:29:59,500
thing to think. So what I did was I put up
here just four xy pairs, as you can see,
3603
05:29:59,500 --> 05:30:05,670
so I've got four pretend patients. And you
can see here's the first patient, that person
3604
05:30:05,670 --> 05:30:11,220
has an x of one because they only have one diagnosis,
but, like I was saying, it must be a bad diagnosis
3605
05:30:11,220 --> 05:30:16,920
because that person has a y of three or is
on three meds for it. Right? So that's how
3606
05:30:16,920 --> 05:30:23,770
you read this table. So let's start making
our scattergram out of these data. Okay, so
3607
05:30:23,770 --> 05:30:29,400
here we go. So I labeled the x axis number
of diagnoses, right just to keep things straight,
3608
05:30:29,400 --> 05:30:34,340
and the y axis number of medications, and
then you'll see where I put the dot, right?
3609
05:30:34,340 --> 05:30:41,350
because x is one, I went over to one number
of diagnosis, right? The one diagnosis, and
3610
05:30:41,350 --> 05:30:48,260
then, because why was three, I went up three
to this three, right, and there goes the dot,
3611
05:30:48,260 --> 05:30:52,990
that's where that first person gets a dot,
okay, you put it there. And that's what you're
3612
05:30:52,990 --> 05:30:58,690
going to do with these other ones, too, is
four dots. Okay, I just threw all the dots
3613
05:30:58,690 --> 05:31:02,650
down, so you can kind of see what was going
on. But here's the second person, right? So
3614
05:31:02,650 --> 05:31:08,590
that person had an X of three. So I went over
three. And I just put those green arrows in
3615
05:31:08,590 --> 05:31:12,592
just so you can see what was going on, they're
really not part of the scatterplot, it's just
3616
05:31:12,592 --> 05:31:19,060
more like cheating, you know, to show
you because we're just practicing right? And
3617
05:31:19,060 --> 05:31:24,721
then that person, so had an X of three and
then a y of five, and you see where the dot
3618
05:31:24,721 --> 05:31:30,940
goes right. And then here, you can see where
the fourth dot, or I'm sorry, the third dot
3619
05:31:30,940 --> 05:31:35,080
goes, because there's a four and a four. And
then here we have the fourth dot. So this
3620
05:31:35,080 --> 05:31:40,030
is the scattergram of these four patients.
Of course, a lot of times you have like hundreds
3621
05:31:40,030 --> 05:31:46,830
of patients in there. But I just showed you
the simple example. Okay, now, because we
3622
05:31:46,830 --> 05:31:53,470
did that, I can talk about linear correlation,
and you'll kind of get it, right? Linear correlation,
3623
05:31:53,470 --> 05:31:59,610
that term means that when you make a scatterplot
of xy pairs, it kind of looks like a line.
3624
05:31:59,610 --> 05:32:05,430
Now over here on the right is not like biology.
That's not like statistics. That's like algebra,
3625
05:32:05,430 --> 05:32:09,660
right? Because back in algebra, you'd have
these perfect lines where the dot was right
3626
05:32:09,660 --> 05:32:15,320
on the line and see the x and y. Notice there's
no diagnosis, nothing. That's algebra, right.
3627
05:32:15,320 --> 05:32:21,580
So perfect linear correlation looks like
graphing points in algebra. And if you actually
3628
05:32:21,580 --> 05:32:27,000
make a scatterplot of, like, people's xy pairs,
and you see that, you should suspect there's
3629
05:32:27,000 --> 05:32:31,980
something wrong. It actually happened to me
once, one of our statisticians came to me
3630
05:32:31,980 --> 05:32:32,990
and said, Monica,
3631
05:32:32,990 --> 05:32:33,990
look
3632
05:32:33,990 --> 05:32:39,280
at this, you won't believe this. And I said,
Well, I don't believe this. What are you graphing?
3633
05:32:39,280 --> 05:32:47,530
And he said, on the x axis, he had put the
weight of each person's liver. And
3634
05:32:47,530 --> 05:32:54,120
on the y axis, he put the weight of the whole
person. And I'm like, how do you weigh
3635
05:32:54,120 --> 05:32:59,270
people's livers? Like, that sounds painful.
And he goes, Oh, let me go see. And what he
3636
05:32:59,270 --> 05:33:05,110
learned was that you don't weigh people's
livers, you use an equation to estimate the
3637
05:33:05,110 --> 05:33:09,560
weight of their liver, and guess what's in
the equation: their actual weight. So I'm
3638
05:33:09,560 --> 05:33:14,880
like, that's why it came out, like, on a line
is because you were using the Y to calculate
3639
05:33:14,880 --> 05:33:20,270
the x. And he was like, Oh, you're so smart
for a secretary. So then I became an epidemiologist.
3640
05:33:20,270 --> 05:33:27,040
But anyway, if you ever see this in biology,
just suspect Something's fishy, because really,
3641
05:33:27,040 --> 05:33:32,440
things just don't end up right on the line. But
if they get really close, you can say it's
3642
05:33:32,440 --> 05:33:37,190
close to perfect linear correlation. I just
wanted to let you know that's
3643
05:33:37,190 --> 05:33:43,340
what's going on here with this linear correlation.
Okay, so let's talk about facts about linear
3644
05:33:43,340 --> 05:33:49,240
correlation. So things can be linearly correlated,
without being perfectly on the line, obviously,
3645
05:33:49,240 --> 05:33:56,070
our little thing was. So if, when you make
those dots on your scattergram, you imagine
3646
05:33:56,070 --> 05:34:00,372
a line going through it, if you imagine that
the line is going up, like it kind of looks
3647
05:34:00,372 --> 05:34:07,860
like it's going up, this is called a positive
correlation. But you don't always have a line
3648
05:34:07,860 --> 05:34:13,208
going up. So I want you to look at this. And
I made up these data too. But on the x axis
3649
05:34:13,208 --> 05:34:18,780
is the number of patient complaints. So as
we go on, the patients are madder and madder.
3650
05:34:18,780 --> 05:34:24,030
They're grouchy and gross, and they're making more
complaints. On the y axis, we have the number
3651
05:34:24,030 --> 05:34:30,150
of nurses staffed on the shift, right? And
so as you go up, there's more nurses. Well,
3652
05:34:30,150 --> 05:34:34,430
sure enough, when you got a lot of nurses,
you don't have as many patient complaints,
3653
05:34:34,430 --> 05:34:39,860
right? Because they're being attended to.
So this is what some people
3654
05:34:39,860 --> 05:34:47,110
call an inverse correlation. But in this presentation,
I'm calling it a negative correlation. Because
3655
05:34:47,110 --> 05:34:53,520
as one goes up, the other goes down. And as
one goes down, the other goes up.
3656
05:34:53,520 --> 05:34:59,692
And that's depicted visually with this line
going down. So you see, you can imagine a line
3657
05:34:59,692 --> 05:35:05,570
going down. That's a negative correlation.
Neither is better, you know, positive versus
3658
05:35:05,570 --> 05:35:12,042
negative, it just explains how these things
are behaving together how X and Y behave together.
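One way to make "the line is going up" or "the line is going down" concrete is to look at the sign of the sum of products of deviations from the means, which has the same sign as the slope of the best-fitting line. The Python sketch below is only an illustration. The function name `direction` is made up, the medication pairs are the three pairs actually read out in the lecture (the fourth is not stated), and the nurse staffing numbers are hypothetical.

```python
def direction(pairs):
    """Classify the direction of a linear relationship from (x, y) pairs."""
    n = len(pairs)
    mean_x = sum(x for x, _ in pairs) / n
    mean_y = sum(y for _, y in pairs) / n
    # Sum of products of deviations: positive when x and y tend to rise
    # together, negative when one rises as the other falls.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in pairs)
    if sxy > 0:
        return "positive"
    if sxy < 0:
        return "negative"
    return "none"


# (diagnoses, medications): the three pairs stated in the lecture.
meds = [(1, 3), (3, 5), (4, 4)]
# (patient complaints, nurses on shift): hypothetical numbers.
nurses = [(1, 8), (2, 6), (4, 5), (6, 3), (8, 1)]

print(direction(meds))
print(direction(nurses))
```

Real data almost never give a sum of exactly zero, so in practice "no correlation" is a judgment made from the scattergram rather than from an exact test like this one.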
3659
05:35:12,042 --> 05:35:13,570
But then
3660
05:35:13,570 --> 05:35:17,960
you can have situations where there's really
no correlation, like x and y really don't
3661
05:35:17,960 --> 05:35:22,880
have anything to do with each other. So as
you've seen, you know, when you
3662
05:35:22,880 --> 05:35:27,420
have patients in the hospital, some of them
have really big families, and those families
3663
05:35:27,420 --> 05:35:32,660
come a lot. And some of them don't really
have that many loved ones. So as you can see
3664
05:35:32,660 --> 05:35:39,630
along x, here are total unique visitors,
meaning you just count each person once.
3665
05:35:39,630 --> 05:35:45,730
So you could have, there's a patient who only
has one unique visitor. But if you look at
3666
05:35:45,730 --> 05:35:49,958
y, days spent in the hospital, that person
has been there seven days, and that
3667
05:35:49,958 --> 05:35:57,260
visitor keeps coming, right. And then you
have maybe a patient here, the second one
3668
05:35:57,260 --> 05:36:01,180
has two unique visitors. And that person's only
been in one day, but both those people have
3669
05:36:01,180 --> 05:36:06,200
been there, then you have people like a person
with three unique visitors. And they've been
3670
05:36:06,200 --> 05:36:10,960
in the hospital for days, right. And those
are probably the same three people coming
3671
05:36:10,960 --> 05:36:16,140
back. So it really doesn't matter how long
a person's in the hospital, if they've got
3672
05:36:16,140 --> 05:36:21,792
a lot of loved ones who keep coming, they'll
keep coming or not. Right? Right, according
3673
05:36:21,792 --> 05:36:28,130
to this correlation. So you end up imagining
a straight line. And that's no correlation,
3674
05:36:28,130 --> 05:36:33,190
that's fine, too. Nothing is better or worse,
it's just that you make the scattergram to
3675
05:36:33,190 --> 05:36:41,840
try and understand how x and y are related.
This is always fun. Like in books, they always
3676
05:36:41,840 --> 05:36:48,240
make some sort of goofy picture. I don't know
why they do this, I would never get a goofy
3677
05:36:48,240 --> 05:36:54,140
picture, like they show in books about, you
know, this, I made up the correlation. This
3678
05:36:54,140 --> 05:36:58,820
is in the lobby, the number of the games in
the lobby, and the number of the books in
3679
05:36:58,820 --> 05:37:02,792
the lobby, they should really have nothing
to do with each other. But if you see something
3680
05:37:02,792 --> 05:37:08,122
just way goofy like this, just say it's no
correlation. I don't even know how I get this.
3681
05:37:08,122 --> 05:37:16,420
Hi, there. Alright, so we've been talking
about correlation. And it actually has two
3682
05:37:16,420 --> 05:37:21,830
attributes. So far, we've only talked about
one and that is direction, we talked about
3683
05:37:21,830 --> 05:37:26,850
positive, negative and no correlation. So
whenever you're talking about a correlation,
3684
05:37:26,850 --> 05:37:31,710
you have to say what direction it is. But
you also have to say the other thing, which
3685
05:37:31,710 --> 05:37:35,940
is what strength it is. So now we're going
to talk about how you figure out what the
3686
05:37:35,940 --> 05:37:42,800
strength is. So strength refers to how close
to the line all of the dots fall. If they fall really
3687
05:37:42,800 --> 05:37:49,790
close to the line, it is considered strong.
If they fall kind of close to the line, it's
3688
05:37:49,790 --> 05:37:54,980
called moderate. And if they are not very close
to the line, it's weak. Now remember, that's
3689
05:37:54,980 --> 05:37:59,730
totally different from what direction it is. It
could be positive, strong, or negative, strong,
3690
05:37:59,730 --> 05:38:07,060
right, could be positive, moderate, or negative,
moderate. So this is just a statement, the
3691
05:38:07,060 --> 05:38:12,220
strength is a statement of how close the dots
you make in your scattergram fall to
3692
05:38:12,220 --> 05:38:20,360
the line that you end up drawing. So I thought
I'd just give you a few examples. So look
3693
05:38:20,360 --> 05:38:24,692
at this, I just made this up. This is what
a strong negative one would look like. Notice
3694
05:38:24,692 --> 05:38:32,270
how those pink dots are almost on the line.
And this is a strong positive. Again, even
3695
05:38:32,270 --> 05:38:36,870
one of the dots is on all right, not all of
them, you know, or it'd be perfect, but it's
3696
05:38:36,870 --> 05:38:42,130
never perfect. So this is really close. But
it's strong, positive. So strong just refers
3697
05:38:42,130 --> 05:38:48,790
to the fact that the dots are almost on the
line. Now, this is almost the same correlation,
3698
05:38:48,790 --> 05:38:54,182
but the dots are not really almost on the
line. The line has to be fair and kind of go between
3699
05:38:54,182 --> 05:38:59,020
them, but they're kind of far away. And so
just eyeballing it, you would say this is
3700
05:38:59,020 --> 05:39:06,980
moderate. And here, it gets weak. And mainly
it's because the dots are more all over the
3701
05:39:06,980 --> 05:39:13,350
place. But you'll notice there's one that's
like right on the x axis. And then hey, look
3702
05:39:13,350 --> 05:39:18,920
up there, like in the title, there's one up
there, like way up there. And that's like
3703
05:39:18,920 --> 05:39:26,942
an outlier. And sometimes, when you get outliers,
they can really whack things out. So even
3704
05:39:26,942 --> 05:39:32,708
though this is a weak correlation, that line
looks like so powerful, because it's almost
3705
05:39:32,708 --> 05:39:38,400
basically connecting these two outliers. So
you just got to be careful, and that's part
3706
05:39:38,400 --> 05:39:43,590
of why you make a scattergram first: outliers
can have a really powerful effect on
3707
05:39:43,590 --> 05:39:44,702
the correlation.
3708
05:39:44,702 --> 05:39:50,190
Especially if it's in any of the four corners
of the plot. Like if you get a weird outlier
3709
05:39:50,190 --> 05:39:55,432
kinda in the middle, it's not going to do
as much as if it's in the upper right, upper
3710
05:39:55,432 --> 05:39:59,810
left, lower right or lower left. It can really
affect the direction. Like, you know,
3711
05:39:59,810 --> 05:40:06,120
it's like a seesaw, or a teeter-totter. You
know, an outlier can get on and really change
3712
05:40:06,120 --> 05:40:14,860
the direction of it. And it can also mess
with how strong or weak the correlation is.
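To see the corner-outlier effect with numbers, here is a minimal sketch. The five points and the outlier are made up for illustration, they are not from the slides, and the helper pearson_r is a hypothetical name that uses the computational formula covered later in this lecture.

```python
from math import sqrt

def pearson_r(xs, ys):
    """Sample correlation coefficient r via the computational formula."""
    n = len(xs)
    sum_x, sum_y = sum(xs), sum(ys)
    sum_x2 = sum(x * x for x in xs)
    sum_y2 = sum(y * y for y in ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    numerator = n * sum_xy - sum_x * sum_y
    denominator = sqrt(n * sum_x2 - sum_x ** 2) * sqrt(n * sum_y2 - sum_y ** 2)
    return numerator / denominator

# Made-up points with only a weak upward trend.
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 1, 5, 3]
r_without = pearson_r(xs, ys)

# The same points plus one outlier far out in the upper-right corner.
r_with = pearson_r(xs + [20], ys + [20])

print(f"r without outlier: {r_without:.3f}")  # about 0.300
print(f"r with outlier:    {r_with:.3f}")     # about 0.972
```

One corner point is enough to make a weak correlation look strong, which is the reason to look at the scatterplot before trusting the number.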
3713
05:40:14,860 --> 05:40:19,310
So that's why you really want to start with
a scatterplot. And that's why the way this
3714
05:40:19,310 --> 05:40:24,112
chapter is organized starts with the scatterplot.
You just want to look for outliers.
3715
05:40:24,112 --> 05:40:32,300
And also just see how X and Y look when you
plot them. Now we're going to get on to the correlation
3716
05:40:32,300 --> 05:40:39,350
coefficient, r. We're going to get on to computation
and actually making a number. So you can not
3717
05:40:39,350 --> 05:40:45,840
just use watery terms like direction, you
know, positive, negative, or moderate, strong
3718
05:40:45,840 --> 05:40:53,261
weak to explain it, but you can actually put
a number on how correlated x and y are. So
3719
05:40:53,261 --> 05:41:00,190
remember, the word coefficient, we did it
with coefficient of variation, which is different.
3720
05:41:00,190 --> 05:41:06,430
So the CV, you know, is one kind of coefficient.
But what we're going to talk about is a different
3721
05:41:06,430 --> 05:41:12,590
kind. Our coefficient this time
is called r. And coefficient just means a
3722
05:41:12,590 --> 05:41:18,010
number; we just like to use the word in statistics.
Now, it seems kind of weird, because like,
3723
05:41:18,010 --> 05:41:21,520
I'm talking about correlation, and people
are like, well, why is it r? Why isn't it
3724
05:41:21,520 --> 05:41:26,042
like c for correlation? And I'm like, I don't
know, I didn't invent it. But this is how
3725
05:41:26,042 --> 05:41:35,880
you can remember: you can go co-r-relation, co-r-relation.
So correlation coefficient, r. So just remember,
3726
05:41:35,880 --> 05:41:43,780
r means correlation. And technically r means
the sample correlation; there's also a population correlation
3727
05:41:43,780 --> 05:41:47,780
coefficient, right? Like, you know, imagine
you're correlating, like, height and weight
3728
05:41:47,780 --> 05:41:53,650
in the population, like everybody in a
particular state. You actually need a Greek
3729
05:41:53,650 --> 05:41:57,600
letter for that. And I showed it on the screen,
I don't know it's this fancy p, I don't know
3730
05:41:57,600 --> 05:42:03,690
the right name of it. But we don't actually
cover it in this class. So I just want to
3731
05:42:03,690 --> 05:42:12,942
just show it to you. We're only going to focus
on r, which is the sample correlation coefficient.
3732
05:42:12,942 --> 05:42:19,090
So what is r? Well, it's like I said, it's
the numerical quantification of how correlated
3733
05:42:19,090 --> 05:42:27,000
a set of x y pairs are. And it's actually
calculated by plugging all of the XY pairs
3734
05:42:27,000 --> 05:42:33,370
into the equation, I'll show you how to do
it. And you can see that if you do it by hand,
3735
05:42:33,370 --> 05:42:39,230
if you have a lot of xy pairs that will take
forever. So I tried to limit that. And like,
3736
05:42:39,230 --> 05:42:44,990
remember, standard deviation and variance,
there was like a defining formula and a computational
3737
05:42:44,990 --> 05:42:50,650
formula. This time, I'm only going to show
you the computational formula. It's, in my
3738
05:42:50,650 --> 05:42:55,830
opinion, way easier to do, but it gets you
the same number. Alright. So that's what we're
3739
05:42:55,830 --> 05:42:59,901
going to do is we're going to take a set of
xy pairs, and we're going to calculate
3740
05:42:59,901 --> 05:43:00,901
r.
3741
05:43:00,901 --> 05:43:07,192
Um, but then how do you interpret r? Well,
let me just prepare you mentally for what
3742
05:43:07,192 --> 05:43:12,060
we're going to get out of this calculation.
The r calculation produces a number, and
3743
05:43:12,060 --> 05:43:17,770
the lowest number possible is negative 1.0.
So that's perfect negative correlation. So
3744
05:43:17,770 --> 05:43:23,180
if we were like in algebra, and we had
a line going down, and all the dots were on
3745
05:43:23,180 --> 05:43:28,793
it, then the r would be negative 1.0. But
that never happens. Right? So if you want
3746
05:43:28,793 --> 05:43:33,860
to think about it, it's like this: if you have a negative
correlation, and you get an r that's like
3747
05:43:33,860 --> 05:43:41,610
negative 0.95, or something really
close to negative 1.0, then it's close to
3748
05:43:41,610 --> 05:43:46,480
negative 1.0. So it's close to perfect negative
correlation. That's how you want to think
3749
05:43:46,480 --> 05:43:51,542
about it. And then the opposite is the highest
possible number you can get for r is 1.0.
3750
05:43:51,542 --> 05:43:55,630
But most people never get that, except for
that one mistake I was telling you about.
3751
05:43:55,630 --> 05:44:00,720
And that would be perfect positive correlation.
So if you see that you calculate an r, and
3752
05:44:00,720 --> 05:44:06,910
it gets really close, like 0.95,
like I said, or 0.98 or whatever, then
3753
05:44:06,910 --> 05:44:12,208
you're thinking, whoa, this is really close
to perfect positive correlation, right? And
3754
05:44:12,208 --> 05:44:17,860
then everything else is in between. So like,
you know, 0.5 or negative 0.3
3755
05:44:17,860 --> 05:44:25,820
or 0.02, or negative 0.09, like
all of those are between negative 1.0 and
3756
05:44:25,820 --> 05:44:31,610
1.0. And that's where r should be. So let's
say you calculate r and you get eight. Okay,
3757
05:44:31,610 --> 05:44:37,852
you did it wrong, right? Or you calculate
r and you get negative 2.3. Like, that's not
3758
05:44:37,852 --> 05:44:44,942
right, it's got to be between negative 1.0
and 1.0. And if you make a scattergram, you
3759
05:44:44,942 --> 05:44:48,530
should know whether it should be on the negative
side or the positive side, or it should give
3760
05:44:48,530 --> 05:44:55,420
you a hint. So this is just more to calibrate
what to expect from r, because it's kind
3761
05:44:55,420 --> 05:45:01,550
of a big calculation. So I'm just going to
give you some pictorial examples. Because remember,
3762
05:45:01,550 --> 05:45:08,860
every single time we make r, right, um, we
also have a scatterplot behind it. And I just
3763
05:45:08,860 --> 05:45:12,980
thought, you know, it would be helpful to
see some real-life examples of r. These
3764
05:45:12,980 --> 05:45:18,970
are real life examples, okay, real life, you
don't get this from just anything, right?
3765
05:45:18,970 --> 05:45:22,990
I'm just teasing. But anyway, so I started
with some negative r's because I'm feeling
3766
05:45:22,990 --> 05:45:28,830
negative today. I went into the literature
and I found this article
3767
05:45:28,830 --> 05:45:29,830
about,
3768
05:45:29,830 --> 05:45:38,040
oh, it's from MIT and Harvard. It's about the
evolutionary principles of modular gene regulation
3769
05:45:38,040 --> 05:45:43,170
in yeast. And all I know is, I'm supposed
to cut down on eating bread. So that's all
3770
05:45:43,170 --> 05:45:48,990
I know about this. But they had these really
nice scatter plots. And they calculated
3771
05:45:48,990 --> 05:45:52,760
r for them, and they had a little line
on them. So I thought I'd show them to you.
3772
05:45:52,760 --> 05:45:58,840
So if you look, the one that's labeled D,
see where the dots are, right, and see where
3773
05:45:58,840 --> 05:46:05,860
the line is. And this looks kind of like a
moderate to strong, negative correlation,
3774
05:46:05,860 --> 05:46:10,740
right? Because the dots are kind of close
to the line. And then when the group calculated
3775
05:46:10,740 --> 05:46:16,560
r, they got negative 0.7. And so
that kind of makes sense, because, and then
3776
05:46:16,560 --> 05:46:22,208
I put my opinion in the lower right, these
aren't official cut points or anything, but
3777
05:46:22,208 --> 05:46:27,612
I usually use these as a guide, see how I
said negative 0.4 to negative 0.7
3778
05:46:27,612 --> 05:46:35,530
is moderate. So I would call the D
one moderate. Now let's look at E. So see how
3779
05:46:35,530 --> 05:46:43,310
the dots don't cluster so close to the line,
as they do with the D one, that's going to
3780
05:46:43,310 --> 05:46:48,890
make it a weaker correlation. It's still, it's
still negative, right? So it's negative
3781
05:46:48,890 --> 05:46:54,670
0.44. And when you look at my little
opinion, I still call that moderate, but it's
3782
05:46:54,670 --> 05:47:01,530
on the low end, see that. And then if you
look at F, see how many of them are like
3783
05:47:01,530 --> 05:47:09,160
way far away from that line, and they're dragging
it down. So now it's an even weaker correlation,
3784
05:47:09,160 --> 05:47:15,872
negative 0.25, right. And so then
that's weak. And so this is just some examples
3785
05:47:15,872 --> 05:47:21,270
to give you a pictorial. And now, I
promise to be more positive. Here's some positive
3786
05:47:21,270 --> 05:47:26,880
r's. They didn't draw a line on this one. This
is a different article, right? It says obesity
3787
05:47:26,880 --> 05:47:32,630
is associated with macrophage accumulation
in adipose tissue. So again, try to cut down
3788
05:47:32,630 --> 05:47:41,120
on bread. But anyway, um, if you look on the
left side, you'll see all of these x y pairs
3789
05:47:41,120 --> 05:47:45,208
plotted on the scattergram. And even though
we don't have a line there, we can imagine
3790
05:47:45,208 --> 05:47:50,770
it's going up. So we would expect this to
be positive. But we also would imagine they're
3791
05:47:50,770 --> 05:47:56,542
not really clustering around the line very
tightly. So when we see that the r is
3792
05:47:56,542 --> 05:48:01,730
0.6, we're not surprised. I mean, it's on
the high side of moderate in my world, which
3793
05:48:01,730 --> 05:48:08,192
makes sense. But go look on the right one,
you know, under the B one, look at how those,
3794
05:48:08,192 --> 05:48:12,190
you could almost connect the dots and get
a line out of that. So that's really tightly
3795
05:48:12,190 --> 05:48:17,500
hugging the line. And then we're not surprised
to see that the r is 0.92. So that's
3796
05:48:17,500 --> 05:48:22,450
pretty strong. So I just wanted to give you
these pictorials before we actually went forth,
3797
05:48:22,450 --> 05:48:28,300
and calculated r because that's one thing
you can do: do the scatterplot, have an expectation of
3798
05:48:28,300 --> 05:48:33,710
what r should look like. And then if you calculate
r and it's totally wacky, you know that you
3799
05:48:33,710 --> 05:48:41,800
did something wrong. Okay, let's calculate
r, and let's use the computational formula.
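The slide with the formula is not part of this transcript. Assuming it is the standard computational form, which matches the terms taken apart step by step in the rest of this section (n, sum of xy, sum of x, sum of y, sum of x squared, sum of y squared), it would look like this:

```latex
r = \frac{n\sum xy - \left(\sum x\right)\left(\sum y\right)}
         {\sqrt{n\sum x^{2} - \left(\sum x\right)^{2}}\;
          \sqrt{n\sum y^{2} - \left(\sum y\right)^{2}}}
```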
3800
05:48:41,800 --> 05:48:49,200
Okay, I threw the formula up in the upper
left, and don't feel overwhelmed by it, we're
3801
05:48:49,200 --> 05:48:55,641
going to take that apart very carefully, right.
But before we even do that, I just want you
3802
05:48:55,641 --> 05:49:01,830
to have a flashback to chapter 3.2. See all
those sums, all those capital sigmas,
3803
05:49:01,830 --> 05:49:09,150
in the equation? So we're going to handle
calculating r a lot like we handled
3804
05:49:09,150 --> 05:49:14,730
calculating variance and standard deviation.
We're going to make like a table with columns.
3805
05:49:14,730 --> 05:49:18,290
And then we're going to fill in those columns
with calculations. And then we're going to
3806
05:49:18,290 --> 05:49:23,272
add up the columns to get all those numbers.
So you were already good at that in 3.2;
3807
05:49:23,272 --> 05:49:28,530
you'll be good at this too. And then I made
up a story because it's a lot easier to check
3808
05:49:28,530 --> 05:49:34,990
your work if there's some story behind it
in statistics. So pretend we have seven patients
3809
05:49:34,990 --> 05:49:39,330
that have been going to your clinic for a
year. They're good patients, they keep coming.
3810
05:49:39,330 --> 05:49:46,760
So they came to the clinic over the year.
And at the last visit of the year. You measured
3811
05:49:46,760 --> 05:49:48,390
the diastolic blood
3812
05:49:48,390 --> 05:49:53,522
pressure, and what you predicted was or what
you thought would make sense is that those with
3813
05:49:53,522 --> 05:49:57,890
a higher diastolic blood pressure would have
had more appointments over the year because
3814
05:49:57,890 --> 05:50:01,532
probably they're trying to stabilize their blood
pressure, or maybe they have other problems
3815
05:50:01,532 --> 05:50:06,910
that are driving it up. This makes perfect
sense, right? So what you wanted to do is
3816
05:50:06,910 --> 05:50:11,240
see if you are right, so you're going to take
the diastolic blood pressure at the last appointment
3817
05:50:11,240 --> 05:50:17,650
as your x, you know, because you think that
that's maybe the explanatory variable, or,
3818
05:50:17,650 --> 05:50:22,890
you know, that would be the independent variable
that would have something to do
3819
05:50:22,890 --> 05:50:28,510
with whether or not they had a lot of appointments.
And then you take y as the number of appointments
3820
05:50:28,510 --> 05:50:34,790
over the last year, because you'd say, okay,
high DBP probably means they have more appointments.
3821
05:50:34,790 --> 05:50:40,630
That's just your idea, maybe you're wrong,
but we're gonna do that. Okay. So, um, I put
3822
05:50:40,630 --> 05:50:47,600
in the title, just a reminder, x is DBP,
and y is number of appointments, so you don't
3823
05:50:47,600 --> 05:50:52,930
forget. And then we made up this table. So
look at the first column, it's just the patient
3824
05:50:52,930 --> 05:50:56,190
number, it's nothing, you know, exciting,
we just want to keep track of which patient
3825
05:50:56,190 --> 05:51:05,280
is one, right. And then notice under x, we
just have all of their DBPs. So patient
3826
05:51:05,280 --> 05:51:11,612
one at the last appointment had 70 mmHg,
and patient two had 115 mmHg. That's
3827
05:51:11,612 --> 05:51:12,870
kind of alarming.
3828
05:51:12,870 --> 05:51:17,920
But these are fake data. So don't get worried
about these patients. But anyway, we just
3829
05:51:17,920 --> 05:51:23,720
fill in x. And then also, when you have their
chart out, you can look up how many appointments
3830
05:51:23,720 --> 05:51:28,100
they had over the last year, and patient one
only had three, whereas patient two had like
3831
05:51:28,100 --> 05:51:33,730
45, which you can believe because sometimes
they're coming in all the time to get stuff,
3832
05:51:33,730 --> 05:51:40,300
adjusted. But then, you know, patient three
only had 21 and patient four had seven. So you
3833
05:51:40,300 --> 05:51:44,860
can see these are the XY pairs for each of
these patients, right. And it's pretty simple
3834
05:51:44,860 --> 05:51:51,372
to go to the bottom and sum up each of the
columns. We have sum of x is 678 and sum of
3835
05:51:51,372 --> 05:51:56,960
y is 166. And also, I'm reminding you of the
r calculation. I put that in the upper right,
3836
05:51:56,960 --> 05:52:04,140
just so we can see what we're doing. I just
want to call your attention to one of the
3837
05:52:04,140 --> 05:52:10,210
terms in there, which is sum of X, which I
put in the parentheses here. And that we already
3838
05:52:10,210 --> 05:52:14,352
know, just from making the first part of this
table and adding it up. So we already have
3839
05:52:14,352 --> 05:52:20,930
that thing. And now I just wanted to point
out, if you saw the sum of x over here, it's
3840
05:52:20,930 --> 05:52:27,320
not exactly the sum of x, it's the sum of xy. So
the y is mushed right next to it. That's
3841
05:52:27,320 --> 05:52:32,122
not sum of x, that's sum of xy. And that's
later in the game, we're gonna put the sum
3842
05:52:32,122 --> 05:52:37,070
of x y at the bottom of the last column. So
that first term there, that's not sum
3843
05:52:37,070 --> 05:52:46,880
of x, that's sum of xy. Okay, now downstairs,
we see the sum of x, in parentheses, to the second, right?
3844
05:52:46,880 --> 05:52:53,320
And that looks an awful lot like the one next
to it on the left, which says sum of x to
3845
05:52:53,320 --> 05:52:58,740
the second with no parentheses, right? And so how do you tell
the difference between the kind without the
3846
05:52:58,740 --> 05:53:05,532
parentheses and the kind with the parentheses.
So this is how I do it. The rule is always, regardless
3847
05:53:05,532 --> 05:53:11,880
of what's going on, do what's in the parentheses
first. So that's easy to do. If you have parentheses,
3848
05:53:11,880 --> 05:53:17,870
if you got the parentheses version, you know
that the sum of x to the second with the parentheses
3849
05:53:17,870 --> 05:53:24,120
in it, is you just do the sum of x, and you
do the sum of x, and you times them by each other.
3850
05:53:24,120 --> 05:53:29,790
Right? But what if you don't have any? Well,
what I do is I say, Well, if I did have some,
3851
05:53:29,790 --> 05:53:36,000
I do it this way. But if I don't have any,
then I know I have to do the sum of the x
3852
05:53:36,000 --> 05:53:43,240
squared column, right. So that's where you take
x times x x times x, x times x on each line,
3853
05:53:43,240 --> 05:53:50,458
put it there and sum that. So that's how I
go through it no matter where I am in statistics
3854
05:53:50,458 --> 05:53:57,280
or algebra. If I see that sum symbol and
then the x squared, I first look for the parentheses.
3855
05:53:57,280 --> 05:54:02,590
If they're there, I know what to do. If they're
not there, then I know you don't do the thing
3856
05:54:02,590 --> 05:54:07,680
where you just take the sum of x and square it;
you have to go and look at the bottom of the
3857
05:54:07,680 --> 05:54:14,670
column of the x to the second column and take
the sum of that. I hope this is helpful. All
3858
05:54:14,670 --> 05:54:22,200
right, so as you can see, there's, I've shown
you on the top of the equation is where you
3859
05:54:22,200 --> 05:54:28,890
just take the sum of X and the sum of Y. And
on the bottom, I'm showing you where you take
3860
05:54:28,890 --> 05:54:34,530
those and you take the square of them. And
then in the other term is the one where you
3861
05:54:34,530 --> 05:54:43,362
just take the sum of the column. All right.
And so there you go. So what happened here?
3862
05:54:43,362 --> 05:54:51,310
Well, we filled in x to the second. So if you
go to patient one, 70 times 70 is 4900. That's
3863
05:54:51,310 --> 05:54:57,130
where we're getting that number. So you go
through, and then patient two, 115 times 115
3864
05:54:57,130 --> 05:55:03,730
is 13,225. So you go through all those and then
you sum those up. And that's what goes in
3865
05:55:03,730 --> 05:55:07,630
that first term. And then I'll bet you can
guess what the next
3866
05:55:07,630 --> 05:55:08,790
slide is.
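The two look-alike terms can be checked side by side. This is a small sketch that uses only the two DBP values read out so far (patients one and two), so the totals here are not the full-column totals from the slide.

```python
# DBP values for patients one and two from the lecture's table.
xs = [70, 115]

# No parentheses: square each x first, then add up the x-squared column.
sum_of_squares = sum(x * x for x in xs)   # 4900 + 13225

# Parentheses: add up the x column first, then square the total.
square_of_sum = sum(xs) ** 2              # (70 + 115) squared

print(sum_of_squares)  # 18125
print(square_of_sum)   # 34225
```

The two results differ, so mixing them up in the denominator gives a wrong r.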
3867
05:55:08,790 --> 05:55:13,291
Surprise. Now we do the y one, so don't get
confused because you kinda have to skip a
3868
05:55:13,291 --> 05:55:20,280
column there. So three times three is nine.
And so that's what goes in the y squared. And 45
3869
05:55:20,280 --> 05:55:26,500
times 45 is 2025. That's how we're doing those.
You sum all that up, and then go look up at
3870
05:55:26,500 --> 05:55:35,230
the equation, that's where you put that sum
of Y squared. Now we have x, y. And this reminds
3871
05:55:35,230 --> 05:55:39,890
me of a student I had before. She was really
confused. She's like, Monica, I don't know
3872
05:55:39,890 --> 05:55:45,140
what to do with x, y, the x, y quantity. And
I go, What do you mean? I mean, it's pretty
3873
05:55:45,140 --> 05:55:53,250
obvious. You just take x times y, like here,
70 times three is 210. She goes, x times y,
3874
05:55:53,250 --> 05:55:58,420
where's the times? Like, how do you know it's
supposed to be times like, I don't see any
3875
05:55:58,420 --> 05:56:04,810
times. Right? I don't see any dimes either.
Like there's no like, like, how do you know
3876
05:56:04,810 --> 05:56:10,270
to do that? Well, anyway, I'll just tell you,
I guess, imagine, like a little multiplication
3877
05:56:10,270 --> 05:56:14,690
symbol between x and y. That's what's supposed
to be there. That's what you're supposed to
3878
05:56:14,690 --> 05:56:19,320
imagine. I guess I was so used to looking
at it, I was like, you're right, I guess you're
3879
05:56:19,320 --> 05:56:26,881
just supposed to assume that. So take x times
y. So for patient two, we just took 115 times
3880
05:56:26,881 --> 05:56:33,960
45. And that's how we got 5175. So you go
through each of those, it's a lot of processing.
3881
05:56:33,960 --> 05:56:39,180
And then you sum it up at the bottom, whoo,
that's a big number. And then you see, I circled
3882
05:56:39,180 --> 05:56:45,140
it in the r equation. So I think we figured
out where to put everything. Obviously, n
3883
05:56:45,140 --> 05:56:49,480
is seven, right, because we have seven patients,
you see a bunch of n's in there. So I think
3884
05:56:49,480 --> 05:56:57,910
we have all our ingredients. So let's move
forward. So all I did here was rewrite the
3885
05:56:57,910 --> 05:57:03,872
exact same equation with all the ingredients
in it, right. So like I said, the N is seven.
3886
05:57:03,872 --> 05:57:10,920
And so wherever you see n, you'll see a seven.
See that sum of X, Y on the top, you see where
3887
05:57:10,920 --> 05:57:16,880
that goes, see sum of x and sum of y, and
then downstairs, you'll see I filled in all
3888
05:57:16,880 --> 05:57:23,930
those numbers too. Now, let me just talk to
you a little bit about both levels, the numerator
3889
05:57:23,930 --> 05:57:29,920
and the denominator. In the numerator, because
we have order of operations, you need to do
3890
05:57:29,920 --> 05:57:38,450
out the n times the sum of xy, that's seven
times 18,458, you need to do that out first.
3891
05:57:38,450 --> 05:57:44,570
And then you need to do the other one, you
know the 678 times 166 first, and then after
3892
05:57:44,570 --> 05:57:48,050
you're done with those two things, you have
to subtract the second one from the first
3893
05:57:48,050 --> 05:57:53,350
one, that's the order you have to do that
in to get the numerator right. Now for the
3894
05:57:53,350 --> 05:57:58,510
denominator, it's a little bit the same, but
a little more complicated. You see on the
3895
05:57:58,510 --> 05:58:06,500
left side, you have that seven times 67,892,
you have to do that out. And then you have
3896
05:58:06,500 --> 05:58:14,090
678 squared, you have to do that out, then
you have to take that, subtract it from the
3897
05:58:14,090 --> 05:58:19,460
first one. And after that, after you have
that, you take a square root of all of that,
3898
05:58:19,460 --> 05:58:24,720
and that's your first term. And then you still
have to go over to the other one, you have
3899
05:58:24,720 --> 05:58:34,362
to take seven times 6768. Keep that then take
166 times 166. Keep that. That term,
3900
05:58:34,362 --> 05:58:38,220
you subtract from the first one. And after
you're done with all that, you take the square
3901
05:58:38,220 --> 05:58:43,760
root of that, and then those two things, you
have to multiply together. So that's a lot
3902
05:58:43,760 --> 05:58:50,690
of work, and you have to do it in the right
order. So here, I just wanted you to see how
3903
05:58:50,690 --> 05:58:57,660
you, you probably want to just work out this
term separately first, and then work out this
3904
05:58:57,660 --> 05:59:03,330
term separately. And just like that thing
I was telling you about x y, those two terms,
3905
05:59:03,330 --> 05:59:06,730
once you work them out, you take the square
root of the left one and the square root of
3906
05:59:06,730 --> 05:59:13,622
the right one, you have to multiply them together
to get the denominator. So this slide is to
3907
05:59:13,622 --> 05:59:20,880
help you see. I threw the numerator on; that
was relatively easy. But these are the two
3908
05:59:20,880 --> 05:59:26,300
different numbers you should get from the
left side of the denominator and the right
3909
05:59:26,300 --> 05:59:31,330
side of the denominator just to check your
work. And then of course, once you multiply
3910
05:59:31,330 --> 05:59:40,940
them by each other, you get this number 17,561.3.
So ultimately, what the calculation for r
3911
05:59:40,940 --> 05:59:46,930
comes down to is you're trying to calculate
the numerator and you're trying to calculate
3912
05:59:46,930 --> 05:59:53,150
the denominator. And at the end, you divide
the numerator by the denominator and you get
3913
05:59:53,150 --> 05:59:57,792
the answer, which is r. So we're going to do
that now.
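As a check on the arithmetic, here is a minimal sketch that plugs in the column totals read out above. The variable names are just for illustration, and the formula is assumed to be the standard computational form described in the steps.

```python
from math import sqrt

# Column totals as read out in the lecture (n = 7 patients).
n = 7
sum_x = 678        # sum of x (DBP)
sum_y = 166        # sum of y (appointments)
sum_x2 = 67_892    # sum of the x-squared column
sum_y2 = 6_768     # sum of the y-squared column
sum_xy = 18_458    # sum of the xy column

# Numerator: do both products first, then subtract.
numerator = n * sum_xy - sum_x * sum_y

# Denominator: work out each side, take its square root, then multiply.
left = sqrt(n * sum_x2 - sum_x ** 2)
right = sqrt(n * sum_y2 - sum_y ** 2)
denominator = left * right

r = numerator / denominator

# Sanity check from earlier in the lecture: r must fall between -1.0 and 1.0.
assert -1.0 <= r <= 1.0

print(f"numerator   = {numerator}")        # 16658
print(f"denominator = {denominator:.1f}")  # 17561.3
print(f"r           = {r:.3f}")            # 0.949
```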
3914
05:59:57,792 --> 06:00:06,150
And here's what we got is we got this 0.949.
And because we see that it's positive, then
3915
06:00:06,150 --> 06:00:12,480
we know it's a positive correlation. And then
remember my opinion. And also probably everyone's
3916
06:00:12,480 --> 06:00:17,480
opinion, because if you round that up, you get
0.95. Well, that's getting really
3917
06:00:17,480 --> 06:00:23,670
close to 1.0. So most people would agree that
that's pretty strong. So how you would diagnose
3918
06:00:23,670 --> 06:00:31,920
this correlation is you would say it's positive,
and it's strong. Okay, I just want to wrap
3919
06:00:31,920 --> 06:00:38,481
this up by giving you a few facts about r
that I may not have covered yet. First, r
3920
06:00:38,481 --> 06:00:45,692
requires data with a bivariate normal distribution,
which is something we didn't check before
3921
06:00:45,692 --> 06:00:50,542
doing our r in this class, because I just
don't cover that. But please know, if you
3922
06:00:50,542 --> 06:00:55,958
take another statistics class, and they bring
up r, they might talk about checking for
3923
06:00:55,958 --> 06:01:02,390
the bivariate normal distribution. So just
know about it. Next, please know that r also
3924
06:01:02,390 --> 06:01:06,970
does not have any units. So other things that
don't have units, remember, the coefficient
3925
06:01:06,970 --> 06:01:14,102
of variation didn't have any units, some things
just don't have units, and r is one of them.
3926
06:01:14,102 --> 06:01:21,880
Also, we did talk about how perfect linear
correlation is where r equals negative 1.0.
3927
06:01:21,880 --> 06:01:28,102
That's if it's a negative correlation, or
r equals 1.0, which is a positive correlation.
3928
06:01:28,102 --> 06:01:33,792
But I might not have mentioned that no linear
correlation is r equals zero. Now, you probably
3929
06:01:33,792 --> 06:01:39,500
won't see that in real life. But sometimes
I'll make an r, and the r is either positive
3930
06:01:39,500 --> 06:01:46,890
or negative. But it's 0.0000000. Something
right? Regardless of whether it's positive
3931
06:01:46,890 --> 06:01:52,522
or negative, if it's 0.00000, something, it's
really close to zero. So that means there's
3932
06:01:52,522 --> 06:01:59,420
probably like, no linear correlation. And
then we learned about positive or negative
3933
06:01:59,420 --> 06:02:05,122
r's, but I just wanted to remind you of the
behavior of X and Y when you get those circumstances,
3934
06:02:05,122 --> 06:02:11,990
okay. So if you have a positive r, it means
as x goes up, y goes up. But it also means
3935
06:02:11,990 --> 06:02:18,890
as x goes down, y goes down. So they travel
together. When you get a negative r, it means
3936
06:02:18,890 --> 06:02:26,090
as x goes up, y goes down. But also it means
opposite, as x goes down, y goes up, so they
3937
06:02:26,090 --> 06:02:34,100
travel in the opposite directions. Now, here's
another fact about r, a little factoid: if
3938
06:02:34,100 --> 06:02:40,112
you choose to switch the axes, like let's
say I designate, you give me xy pairs, and
3939
06:02:40,112 --> 06:02:45,450
I designate a certain variable as x and a
certain one as y, and you actually designate
3940
06:02:45,450 --> 06:02:51,510
them the opposite, it really doesn't matter
even in the equation, because you'll end up
3941
06:02:51,510 --> 06:02:59,420
with the same r value. So it doesn't matter
if you call my x y, and I call your,
3942
06:02:59,420 --> 06:03:05,390
you know, y x. Like, we can switch them, but
you'll still end up with the same r with
3943
06:03:05,390 --> 06:03:12,090
the calculation. Then finally, even if you
converted x and y to different units, you get
3944
06:03:12,090 --> 06:03:17,458
the same r. So let's say that you were
in England, and you were doing the correlation
3945
06:03:17,458 --> 06:03:22,140
between height and weight. And you were using
the metric system on the same patients that
3946
06:03:22,140 --> 06:03:28,070
I was using the US system, even though we'd
have different numbers, cuz obviously you
3947
06:03:28,070 --> 06:03:36,130
have to convert them, we'd still get the same
r when we're done. So finally, we get to
3948
06:03:36,130 --> 06:03:41,300
the last subject of this lecture, which is
lurking variables, which you've heard about
3949
06:03:41,300 --> 06:03:46,080
before. But the main point I want to make
is correlation is not causation. So you don't
3950
06:03:46,080 --> 06:03:47,110
want to be misled
3951
06:03:47,110 --> 06:03:50,140
by correlations.
3952
06:03:50,140 --> 06:03:54,202
So beware of lurking variables. So remember,
lurking variables are things lurking behind
3953
06:03:54,202 --> 06:03:59,970
the scenes that cause things, right. And so
you may have realized that selecting x and
3954
06:03:59,970 --> 06:04:04,300
y, like if you have xy pairs, designating
which one is x and which one is y is kind
3955
06:04:04,300 --> 06:04:09,640
of political, because you're implying that
x could cause y. So let's say that you're
3956
06:04:09,640 --> 06:04:15,980
correlating height and weight. Taller people
are heavier. So you would call x height
3957
06:04:15,980 --> 06:04:20,320
and y to be weight. You know, people don't
go, Oh, I'm too short, I should gain weight
3958
06:04:20,320 --> 06:04:24,400
so I can grow taller. You know, that's just
not the way things work. So you have to put
3959
06:04:24,400 --> 06:04:30,660
x as the height, and y as the weight. But
there are, in reality, other causes of
3960
06:04:30,660 --> 06:04:35,042
weight besides height. In fact, there are
things that cause both height and weight,
3961
06:04:35,042 --> 06:04:41,010
like genetics, right? So a genetic profile
that leads to tallness and also obesity could
3962
06:04:41,010 --> 06:04:45,300
be a lurking variable in the relationship
between height and weight. So there could
3963
06:04:45,300 --> 06:04:49,920
be some tall people that are also obese,
and it's not really just because they're tall.
3964
06:04:49,920 --> 06:04:54,910
It could be because they have the genetics
that programmed them to be tall and also obese,
3965
06:04:54,910 --> 06:05:02,240
right? And so here's an example where you
got to be real careful with correlation.
3966
06:05:02,240 --> 06:05:06,708
So there's been this claim that eating ice
cream causes murders, because they noticed
3967
06:05:06,708 --> 06:05:11,880
that in areas where ice cream sales go up,
murder rates rise. And I don't know about
3968
06:05:11,880 --> 06:05:17,122
you, but when I have some really good ice
cream, it just makes me so mad. I'm just kidding.
3969
06:05:17,122 --> 06:05:22,470
I mean, why would this happen, right? Well,
the reality is summer and warm weather are
3970
06:05:22,470 --> 06:05:28,640
lurking variables, because we sell more ice
cream in the summer. You know, the ice cream
3971
06:05:28,640 --> 06:05:34,060
consumption goes up. But also people are outside
more and more murders occur. And you know,
3972
06:05:34,060 --> 06:05:40,670
I'm from Minnesota, where it gets really cold
for periods of the winter, and oh my gosh,
3973
06:05:40,670 --> 06:05:45,510
there are totally no murders then. Like, people
just don't commit murders when it's really
3974
06:05:45,510 --> 06:05:51,160
frigid out. It's just really inconvenient.
So that's a situation where there's a lurking
3975
06:05:51,160 --> 06:05:56,130
variable. And so you don't want to start,
you know, screwing up our ice cream laws and
3976
06:05:56,130 --> 06:06:02,060
making it so we can't have ice cream, just because
you mistakenly conclude that ice cream causes murders,
3977
06:06:02,060 --> 06:06:08,290
right? There's a lurking variable behind it,
that's having something to do with both. Here's
3978
06:06:08,290 --> 06:06:14,260
another one. And this was from my professor in
my biostatistics class. They put
3979
06:06:14,260 --> 06:06:22,330
up a time series chart over
a long time, like since the 1900s. And they
3980
06:06:22,330 --> 06:06:28,270
pointed out that as people purchase more onions,
that is, over time as onion consumption goes up
3981
06:06:28,270 --> 06:06:32,970
and down, the stock market rises and falls, right? So
when the stock market is low, people aren't
3982
06:06:32,970 --> 06:06:39,720
eating as many onions. And this is just true
over generations in the US. So, yeah, we've
3983
06:06:39,720 --> 06:06:43,200
had some problems with our economy in the
US. Do you think we should all start eating
3984
06:06:43,200 --> 06:06:49,780
a bunch of onions? Right. So the healthy economy
is a lurking variable. In a healthy economy,
3985
06:06:49,780 --> 06:06:54,690
people buy more food, including onions,
and also a healthy economy boosts the stock
3986
06:06:54,690 --> 06:06:59,862
market. So you got to be careful about this:
correlation is not causation. You know, and
3987
06:06:59,862 --> 06:07:04,220
so if you want to make the stock market go
up, don't make everybody eat onions. And definitely
3988
06:07:04,220 --> 06:07:11,820
don't make us stop eating ice cream. That would
make me very upset. So at the end of the day,
3989
06:07:11,820 --> 06:07:16,080
you're not going to be able to affect the
murder rate by bringing down the ice cream
3990
06:07:16,080 --> 06:07:20,390
consumption rate. And you're not going to
be able to fix the stock market by making
3991
06:07:20,390 --> 06:07:25,970
people eat onions. And so that's the whole
concept behind lurking variables. And correlation
3992
06:07:25,970 --> 06:07:34,290
is not necessarily causation. So in conclusion,
when you're doing your correlations, first,
3993
06:07:34,290 --> 06:07:38,612
make a scattergram because you want to get
a visual idea of the strength and the
3994
06:07:38,612 --> 06:07:44,390
direction. And you also want to look for outliers.
Then go on and calculate r by hand, but
3995
06:07:44,390 --> 06:07:48,640
be really careful because it's a big hairy
calculation. And you don't want to make any
3996
06:07:48,640 --> 06:07:54,090
mistakes. And then finally, when you go to
interpret r, be careful of lurking variables.
3997
06:07:54,090 --> 06:08:01,580
And remember that correlation is not necessarily
causation. And now, time for some ice cream.
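The ice cream story above can be sketched as a small simulation. This is only an illustration under made-up assumptions: the temperature range, the effect sizes, and the noise levels are all hypothetical, and `pearson_r` is a helper written for this sketch. Temperature drives both made-up outcomes, so they correlate strongly overall, but the correlation mostly disappears once temperature is held roughly constant.

```python
import random

def pearson_r(xs, ys):
    """Pearson correlation coefficient r for paired lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / (sxx * syy) ** 0.5

random.seed(1)  # fixed seed so the sketch is repeatable

# Hypothetical model: temperature (the lurking variable) drives BOTH outcomes.
temps = [random.uniform(0, 30) for _ in range(2000)]
ice_cream = [2.0 * t + random.gauss(0, 5) for t in temps]  # made-up sales
murders = [0.5 * t + random.gauss(0, 2) for t in temps]    # made-up counts

# Correlation across all simulated days.
r_all = pearson_r(ice_cream, murders)

# Hold the lurking variable roughly constant: keep only days at 20 to 22 degrees.
band = [(i, m) for t, i, m in zip(temps, ice_cream, murders) if 20 <= t <= 22]
r_band = pearson_r([i for i, _ in band], [m for _, m in band])

print(f"r across all days: {r_all:.2f}")
print(f"r within a 2-degree band: {r_band:.2f}")
```

Neither simulated outcome has any direct effect on the other here; the overall correlation comes entirely from the shared dependence on temperature.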
3998
06:08:01,580 --> 06:08:10,592
Hello, it's Monica Wahi, your Labouré College
lecturer here to ruin your day with chapter
3999
06:08:10,592 --> 06:08:18,060
4.2 linear regression and the coefficient
of determination. So at the end of this probably
4000
06:08:18,060 --> 06:08:24,070
painstaking lecture, the student should be
able to at least explain what the least squares
4001
06:08:24,070 --> 06:08:30,661
line is. Identify and describe the components
of the least squares line equation, explain
4002
06:08:30,661 --> 06:08:37,480
how to calculate the residuals, and calculate
and interpret the coefficient of determination,
4003
06:08:37,480 --> 06:08:45,870
or CD for short. Alright, so it's really cool
if you have a crystal ball, because then you
4004
06:08:45,870 --> 06:08:50,442
can make predictions, right, you just look
into the crystal ball. It's some nice equipment,
4005
06:08:50,442 --> 06:08:54,542
I've had friends who have them, they're very
nice to put out on your dining room table
4006
06:08:54,542 --> 06:09:00,470
as the centerpiece. Unfortunately, though,
they don't really play much into statistical
4007
06:09:00,470 --> 06:09:05,160
prediction. So what I'm going to show you
in this lecture is how we use statistics for
4008
06:09:05,160 --> 06:09:10,240
prediction instead of this beautiful crystal
ball. So we're going to start by talking about
4009
06:09:10,240 --> 06:09:14,390
what the least squares line is. And then we're
going to talk about the least squares line
4010
06:09:14,390 --> 06:09:19,940
equation, which is the crystal ball thing
we use only in statistics, okay. And then
4011
06:09:19,940 --> 06:09:24,230
we're going to talk about dealing with prediction
using the least squares line. And finally,
4012
06:09:24,230 --> 06:09:29,740
we're going to talk about the coefficient
of determination. So let's get started. And
4013
06:09:29,740 --> 06:09:36,690
let's get started with the term least squares
criterion, right? So remember, criteria is
4014
06:09:36,690 --> 06:09:43,000
plural and criterion is singular. And it means,
well, criteria are stuff you need to meet, right,
4015
06:09:43,000 --> 06:09:48,200
to be eligible, like you have to meet the criteria
for registration for college, right? Well,
4016
06:09:48,200 --> 06:09:53,012
the least squares criterion is just one,
which is awesome, because then you only have
4017
06:09:53,012 --> 06:09:58,820
to meet one thing. So one of the things you
probably wondered when you were watching last
4018
06:09:58,820 --> 06:10:03,208
lecture is how do you know exactly where to
draw this line when you have a scatterplot.
4019
06:10:03,208 --> 06:10:08,060
Like, how do you know where to make the line
the most fair. So in the last chapter, when
4020
06:10:08,060 --> 06:10:12,060
we plotted the scattergrams, I just drew
a line there for demonstration. But there
4021
06:10:12,060 --> 06:10:17,710
actually is an official rule as to where the
line goes. Okay. And basically, the rule is
4022
06:10:17,710 --> 06:10:23,702
that it has to meet the least squares criterion.
Okay? If it meets that criterion, and there's only
4023
06:10:23,702 --> 06:10:27,790
one line that does, then that is where the
line goes. So how do we
4024
06:10:27,790 --> 06:10:29,900
get to that?
4025
06:10:29,900 --> 06:10:37,450
Well, this is roughly what it looks like.
When you draw the line, there is a vertical
4026
06:10:37,450 --> 06:10:44,470
distance from each of the dots to the line.
Now, as you can see, by the slide, sometimes
4027
06:10:44,470 --> 06:10:50,560
the dots are below the line. And sometimes
they're above the line. And so the word square
4028
06:10:50,560 --> 06:10:55,590
indicates that whether it's up or down,
you're going to square it. So it's not going
4029
06:10:55,590 --> 06:10:59,872
to be negative anymore. Because whenever you
square a negative, it becomes positive. So
4030
06:10:59,872 --> 06:11:04,660
first, you're going to have to square all
of these things. Okay? So imagine you were
4031
06:11:04,660 --> 06:11:10,970
just going to try it out, like, maybe draw
this line, and then you calculate the squares,
4032
06:11:10,970 --> 06:11:14,830
and you'd be like, okay, that's how many and
then maybe you tilt the line a little, and
4033
06:11:14,830 --> 06:11:20,000
calculate the squares again. And your goal
would be, when you added up all the
4034
06:11:20,000 --> 06:11:25,450
squares, to have the least sum. So the line
belongs wherever it causes the smallest sum of
4035
06:11:25,450 --> 06:11:32,390
squares for the whole data set. So if you're
software, which you're not, you're a person,
4036
06:11:32,390 --> 06:11:36,640
right, but if you were software, you'd be
figuring that out using your software brain
4037
06:11:36,640 --> 06:11:41,680
as well, how exactly to tilt this line, and
where exactly to put it to minimize these
4038
06:11:41,680 --> 06:11:48,110
squares, but we're people. So I'm going to
go on and explain how people do this. So the
4039
06:11:48,110 --> 06:11:51,540
trick is, if you can figure out where the line
goes, you can draw it on the scatterplot
4040
06:11:51,540 --> 06:11:56,410
and be right. But there is a challenge of
knowing exactly where it belongs on the graph.
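The trial-and-error idea described above (draw a line, add up the squares, tilt the line, add them up again) can be sketched in a few lines. The xy pairs and the three candidate lines below are made up for illustration, and `sum_of_squares` is a helper name chosen for this sketch; none of it comes from the lecture slides.

```python
def sum_of_squares(xs, ys, slope, intercept):
    """Add up the squared vertical distances from each dot to the line."""
    return sum((y - (slope * x + intercept)) ** 2 for x, y in zip(xs, ys))

# Made-up xy pairs, for illustration only.
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]

# Three candidate lines given as (slope, intercept).
candidates = {
    "tilted down a little": (0.4, 2.8),
    "least squares line": (0.6, 2.2),
    "tilted up a little": (0.8, 1.6),
}

# Sum of squares for each candidate line.
results = {
    name: sum_of_squares(xs, ys, slope, intercept)
    for name, (slope, intercept) in candidates.items()
}

for name, total in results.items():
    print(f"{name}: {total:.2f}")

# The candidate with the smallest sum of squares wins.
best = min(results, key=results.get)
print("smallest sum of squares:", best)
```

Tilting the line either way makes the sum of squares larger, which is the least squares criterion in action for these made-up points.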
4041
06:11:56,410 --> 06:12:01,360
And then also, you're probably realizing you
don't always have a graph to draw it on. Like
4042
06:12:01,360 --> 06:12:05,900
maybe you need to talk to somebody about where
the line goes, and you can't draw a picture.
4043
06:12:05,900 --> 06:12:11,362
So how you explain where the line goes is
you use an equation. And some of you may remember
4044
06:12:11,362 --> 06:12:15,730
this, and some of you may not, so I thought
I'd do a little quick review of how lines
4045
06:12:15,730 --> 06:12:22,620
and equations relate. Okay, so we're going
to get into the least squares line equation.
4046
06:12:22,620 --> 06:12:26,430
But first, I'm going to give you a little
flashback about algebra, and I'm sorry, if
4047
06:12:26,430 --> 06:12:30,820
this is painful. This is hard for me,
because I wasn't really that good at algebra.
4048
06:12:30,820 --> 06:12:35,270
But, and this isn't statistics, this
is algebra, but I just wanted you to remember
4049
06:12:35,270 --> 06:12:41,250
this part. Okay. So back in algebra, there
was a chapter, where you were given these
4050
06:12:41,250 --> 06:12:45,630
xy pairs, and it was different from statistics,
because they all lined up on a line. See,
4051
06:12:45,630 --> 06:12:49,990
these pink dots are just perfectly on the
line, okay, and these are the xy pairs. And
4052
06:12:49,990 --> 06:12:54,730
remember, you had to graph this kind of like
we had to do scatter plots. And then you were
4053
06:12:54,730 --> 06:13:02,160
given this equation, y equals b x plus a,
right? And that was the linear equation to
4054
06:13:02,160 --> 06:13:09,120
describe this line. And you were like, okay,
I don't get how to put this equation together
4055
06:13:09,120 --> 06:13:14,192
with this line. And so first, the teacher
would say, well, B stands for the slope of
4056
06:13:14,192 --> 06:13:18,500
the line, right? Because you have to know
the slope, I mean, the line can be tilted,
4057
06:13:18,500 --> 06:13:22,400
any which way. And so if you know the slope,
you already know something about the line.
4058
06:13:22,400 --> 06:13:28,670
And in algebra, how you would make the slope
is you calculate the rise over the run, right.
4059
06:13:28,670 --> 06:13:35,230
And so there, you know, b in algebra was
rise over run, and you'd get the slope. And
4060
06:13:35,230 --> 06:13:40,150
then you'd be like, great. But you always
needed another thing in order to define the
4061
06:13:40,150 --> 06:13:46,320
line. Because if you imagine this line is
in an elevator, it could still have the same
4062
06:13:46,320 --> 06:13:53,970
slope, but go up or down, right, so we need
to anchor it on the y axis somewhere. So a
4063
06:13:53,970 --> 06:14:00,510
stands for the y-intercept, or where it spears
through the y axis. And, as you can see, by
4064
06:14:00,510 --> 06:14:07,032
the drawing, it looks like a is zero comma,
zero, right? But you don't have to look at
4065
06:14:07,032 --> 06:14:13,670
it. What you can do in algebra to
get a is, since you'd
4066
06:14:13,670 --> 06:14:19,942
filled in b, you just go grab an xy pair,
and plug in the x and plug in the y, and then
4067
06:14:19,942 --> 06:14:25,500
plug in the b you just got, and back-calculate
the y intercept, right. And that's how you
4068
06:14:25,500 --> 06:14:30,200
would get the whole linear equation. And so
that's how you would do it in algebra. And
4069
06:14:30,200 --> 06:14:34,630
I just wanted to remind you that because we
do some similar things in statistics, it's
4070
06:14:34,630 --> 06:14:40,390
a little different. But I wanted to remind
you how to connect what a line looks like
4071
06:14:40,390 --> 06:14:46,640
with how this equation works. All right. Well,
welcome to statistics. Look, those pink things
4072
06:14:46,640 --> 06:14:52,300
are not on a line. So we want to make a line
but now you know about the least squares criterion.
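As a quick recap of the algebra flashback, here is a minimal sketch of getting the slope from rise over run and then back-calculating the y-intercept. The points are hypothetical and chosen to sit exactly on one line that passes through the origin, like the pink dots described on the slide.

```python
# Hypothetical points that sit exactly on one line, as in the algebra flashback.
points = [(0, 0), (1, 2), (2, 4), (3, 6)]

# Pick any two points on the line.
(x1, y1), (x2, y2) = points[1], points[3]

# Slope b is rise over run between those two points.
b = (y2 - y1) / (x2 - x1)

# Back-calculate the y-intercept a from y = b*x + a using one xy pair.
a = y1 - b * x1

print(f"y = {b}x + {a}")

# Every point should satisfy the equation exactly, because they all line up.
on_line = all(y == b * x + a for x, y in points)
print("all points on the line:", on_line)
```

In statistics the dots do not line up like this, which is why the slope needs its own formula and the intercept is back-calculated from the means rather than from an arbitrary xy pair.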
4073
06:14:52,300 --> 06:14:58,320
What you're trying to do is make a line that
minimizes the sum of squares, right? So here
4074
06:14:58,320 --> 06:15:03,520
we go. Remember how I was just talking
about this linear equation back in algebra.
4075
06:15:03,520 --> 06:15:09,380
Well notice the difference. The main difference
here is the hat, right? The y is wearing a
4076
06:15:09,380 --> 06:15:15,080
hat. And that's universally in statistics,
whenever you see a letter or a number wearing
4077
06:15:15,080 --> 06:15:20,300
a hat, it means it's an estimate. Okay? So
of course, we're estimating y, because if
4078
06:15:20,300 --> 06:15:23,600
you look on that line, none of these dots
actually falls
4079
06:15:23,600 --> 06:15:24,890
on that line.
4080
06:15:24,890 --> 06:15:29,910
And we don't really expect even an estimate
to fall on that line just close, right? You
4081
06:15:29,910 --> 06:15:35,080
know, because of the least squares, okay.
And so we almost have, in a way, the same
4082
06:15:35,080 --> 06:15:39,980
goal we did back in algebra, we have to get
that b, that slope. And then we have to use
4083
06:15:39,980 --> 06:15:46,440
that to back calculate our a. Okay, so let's
go on with that. Um, so like I said, in the
4084
06:15:46,440 --> 06:15:52,292
software approach, you just feed all the XY
pairs in, and then the software just actually
4085
06:15:52,292 --> 06:15:57,730
prints out the b and the a. It just prints
out the slope and the y intercept, which is
4086
06:15:57,730 --> 06:16:02,380
why I love the software. But we don't get
to use that in our class. In our class, we
4087
06:16:02,380 --> 06:16:06,170
have to do the manual approach just because
it's painful. And I had to do it too. So now
4088
06:16:06,170 --> 06:16:12,120
I'm making you do it, right? Okay, what,
what we'll do is plug all the XY pairs into
4089
06:16:12,120 --> 06:16:17,130
an equation to get the slope, the b. And
I promise you, I won't give you a ton of xy
4090
06:16:17,130 --> 06:16:22,532
pairs, you know, or you'll be there forever.
But this next step, we have to do, we didn't
4091
06:16:22,532 --> 06:16:27,271
have to do in algebra. And that is we're going
to have to go back to all of our x's, calculate
4092
06:16:27,271 --> 06:16:34,470
x bar, and go back to all of our y's and calculate
y bar. Remember, that's the mean of the x's
4093
06:16:34,470 --> 06:16:38,952
and the mean of the y's. And you're probably
wondering, Well, why do we have to do that?
4094
06:16:38,952 --> 06:16:44,862
I'll show you again. But in case you didn't
notice, though, those dots really didn't fall
4095
06:16:44,862 --> 06:16:51,272
on the least squares line, they fell around it, and
you need a dot at least on that line to help back
4096
06:16:51,272 --> 06:16:56,090
calculate that y-intercept. And the rule of
the least squares line, one of the rules of
4097
06:16:56,090 --> 06:17:03,600
it is that x bar comma y bar is on that least
squares line. So you can know if you calculate
4098
06:17:03,600 --> 06:17:08,458
that out that that's actually on the least
squares line. Okay. And so finally, after
4099
06:17:08,458 --> 06:17:13,990
you do x bar and y bar, you plug in B, and
you plug in x bar for the x, and you plug
4100
06:17:13,990 --> 06:17:22,790
in y bar for the Y hat to back calculate the
a. So it's a similar but different process
4101
06:17:22,790 --> 06:17:29,230
as algebra. So the moral of the story is you
need to recycle, right, we got to be good
4102
06:17:29,230 --> 06:17:33,710
to the environment. So what has happened?
Well, you wouldn't be at this point in your
4103
06:17:33,710 --> 06:17:40,380
life of making a least squares line, if you
hadn't already started out by making a scatterplot.
4104
06:17:40,380 --> 06:17:46,022
And then deciding you wanted to do r, and
then making r. And when you make r, you
4105
06:17:46,022 --> 06:17:50,250
end up with that big table, remember, and
you end up with all these calculations, like
4106
06:17:50,250 --> 06:17:56,160
sum of x, sum of y, sum of x squared, and
sum of xy. Now you want to recycle those,
4107
06:17:56,160 --> 06:18:00,910
you want to save those calculations from r
because they fit also into the equation for
4108
06:18:00,910 --> 06:18:06,790
b. So you want to recycle that. Also, you
want to save the r you made, because you're
4109
06:18:06,790 --> 06:18:12,440
going to recycle that into the coefficient
of determination, which I'll explain later.
4110
06:18:12,440 --> 06:18:17,320
And then this is not about recycling, you'll
actually have to make this anew, but you
4111
06:18:17,320 --> 06:18:22,530
need to calculate x bar and y bar. Now you
never needed to do that before now, but now
4112
06:18:22,530 --> 06:18:29,370
you need this. And so yeah, so get together
your old r calculations, and then put your
4113
06:18:29,370 --> 06:18:36,890
x bar and y bar together and you'll be ready
to do the least squares line equation. Alright,
4114
06:18:36,890 --> 06:18:43,840
so here's a flashback. Remember this big table?
Remember our story, we had seven patients,
4115
06:18:43,840 --> 06:18:47,128
right? And x was their diastolic blood pressure
at
4116
06:18:47,128 --> 06:18:48,128
the last
4117
06:18:48,128 --> 06:18:52,980
visit they had of the year. And then y was
the number of appointments they had over the
4118
06:18:52,980 --> 06:18:57,150
year. And we thought, Well, if your diastolic
blood pressure, you know goes up, then maybe
4119
06:18:57,150 --> 06:19:00,860
you need more appointments because it's a marker
of being sick. I don't know. That was my little
4120
06:19:00,860 --> 06:19:06,543
story. Okay, so over on the right now we'll
see that the formula, we have the formula
4121
06:19:06,543 --> 06:19:12,730
we're using for b. The text gives you two formulas;
again, I've always got my favorite, it's the
4122
06:19:12,730 --> 06:19:19,110
one with the table, right? So here's the formula
for B. And then after you calculate B, you'll
4123
06:19:19,110 --> 06:19:25,980
notice in the formula for a, b is in the formula
for a so you got to do B first, right. So
4124
06:19:25,980 --> 06:19:31,070
a lot of times students are a little confused
about what the goal is here. The goal,
4125
06:19:31,070 --> 06:19:35,570
if you look at the bottom of the slide, the
goal is to come up with what B is and what
4126
06:19:35,570 --> 06:19:40,180
A is, and then fill it in. And that's your
least squares line equation. So your least
4127
06:19:40,180 --> 06:19:45,650
squares line equation is always going to have
a y hat in it. That's a variable
4128
06:19:45,650 --> 06:19:50,070
that just gets to stay there. It's always
going to have that equals and then after that,
4129
06:19:50,070 --> 06:19:54,531
whatever your b is, is going to be mushed up next
to that x so it's always gonna have that x
4130
06:19:54,531 --> 06:20:00,300
there. And then plus and then whatever you
get for a. And just as a trick, if a turns
4131
06:20:00,300 --> 06:20:07,140
out to be negative, then it ends up being
minus a, right. But that's the generic equation.
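Written out in symbols, the generic equation and the two formulas described above look like this. The slope formula is reconstructed from the sums the lecture works through afterward (n, the sum of xy, the sum of x, the sum of y, and the sum of x squared), so treat the exact layout as an assumption about what the slide shows rather than a copy of it.

```latex
\hat{y} = bx + a
\qquad
b = \frac{n\sum xy - \left(\sum x\right)\left(\sum y\right)}
         {n\sum x^{2} - \left(\sum x\right)^{2}}
\qquad
a = \bar{y} - b\,\bar{x}
```

Note that b appears inside the formula for a, which is why b has to be calculated first.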
4132
06:20:07,140 --> 06:20:11,581
And our goal is to calculate B and A and fill
them in. And then we will say this is our
4133
06:20:11,581 --> 06:20:19,510
least squares line equation. Oh, remember
how I was saying, you actually need to make
4134
06:20:19,510 --> 06:20:23,910
some new calculations, right. So you need
to make y bar and you need to make x bar.
4135
06:20:23,910 --> 06:20:29,060
And it's a little easier to show when I've
got this column, the columns up. If you look
4136
06:20:29,060 --> 06:20:32,900
at the bottom of the slide, remember how the sum
of x was 678, and remember how
4137
06:20:32,900 --> 06:20:39,890
our n is seven. And remember how the sum of
x divided by n is your x bar. And the same
4138
06:20:39,890 --> 06:20:44,550
goes for y, right, we have the sum of Y divided
by seven, I just wanted to quickly remind
4139
06:20:44,550 --> 06:20:48,970
you of this, that you need to generate these
things before, you can actually completely
4140
06:20:48,970 --> 06:20:56,460
finish the least squares line equation. I
just summarized like that I cut to the chase,
4141
06:20:56,460 --> 06:21:01,790
basically. I just summarized the actual
numbers you're going to need and put them
4142
06:21:01,790 --> 06:21:06,660
over here. So we don't have to look at that
whole big table anymore. Alright, and you'll
4143
06:21:06,660 --> 06:21:11,300
notice that I grayed out the sum of Y squared
because I realized later we don't really use
4144
06:21:11,300 --> 06:21:18,200
that. Okay, so let's look on the left
side under the big list of numbers we have.
4145
06:21:18,200 --> 06:21:22,990
And you'll see the B equation that I filled
in, right, and if you compare that to the
4146
06:21:22,990 --> 06:21:27,750
formula on the right side, you'll see what's
going on, you know that n is seven, right?
4147
06:21:27,750 --> 06:21:32,950
So wherever you see that seven, that's where
n is. Okay, then the top of the equation: remember
4148
06:21:32,950 --> 06:21:36,640
sum of xy? Let's just look that up. Yeah,
that's that big number 18,458,
4149
06:21:36,640 --> 06:21:46,290
I wanted to just be clear, you have to do
out that left side, the seven times the 18,458,
4150
06:21:46,290 --> 06:21:51,730
you have to do that one out, and then do out
the right side, which is that sum of x times
4151
06:21:51,730 --> 06:21:58,290
sum of Y which is 678 times 166, you have
to do that one out. And then after that, you
4152
06:21:58,290 --> 06:22:04,380
have to subtract the right one from the left
one, because of order of operation. Okay,
4153
06:22:04,380 --> 06:22:08,650
so that's how you make the numerator. Now
let's just look downstairs, again, we have
4154
06:22:08,650 --> 06:22:13,100
an n, so we know that's seven, and then that
sum of x squared. And remember, it doesn't
4155
06:22:13,100 --> 06:22:18,651
have the parentheses around the sum of x squared;
if it had the parentheses around it, you'd
4156
06:22:18,651 --> 06:22:23,600
be taking like 678 and squaring that, but
it doesn't have the parentheses. So you have
4157
06:22:23,600 --> 06:22:30,272
to use that big number, 67,892. Okay. And
again, like with the upstairs, you got to
4158
06:22:30,272 --> 06:22:36,520
do out that side of the equation, right, that
term, you've got to multiply that out before
4159
06:22:36,520 --> 06:22:41,750
even looking at the rest of the equation,
right. And then Oh, here we go. On the right
4160
06:22:41,750 --> 06:22:47,372
side of the denominator, we have the sum of x,
squared. That's exactly the example I was
4161
06:22:47,372 --> 06:22:55,290
giving earlier. So you say 678 times 678.
And you have to do that one out, right. And
4162
06:22:55,290 --> 06:22:58,690
then after you do that one out, and you do
the first one out, then you subtract the second
4163
06:22:58,690 --> 06:23:02,981
one from the first one, remember order of
operation. And if you do it right, you should
4164
06:23:02,981 --> 06:23:09,310
get, see below on the left side of the slide,
you should get that for the numerator and that
4165
06:23:09,310 --> 06:23:13,590
for the denominator, and then you divide them
out and you get 1.1. And that's your B, right.
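The slope calculation just described can be checked with a short sketch. The totals are the ones quoted in the lecture for the seven patients; the variable names are chosen for this sketch. Each side of the numerator and denominator is multiplied out before subtracting, following the order of operations spelled out above.

```python
# Column totals quoted in the lecture for the seven patients.
n = 7
sum_x = 678
sum_y = 166
sum_xy = 18_458
sum_x2 = 67_892  # sum of the squared x's, NOT (sum of x) squared

# Numerator: do out each side first, then subtract.
left_top = n * sum_xy        # 7 times 18,458
right_top = sum_x * sum_y    # 678 times 166
numerator = left_top - right_top

# Denominator: n times sum of x squared, minus (sum of x) squared.
left_bottom = n * sum_x2     # 7 times 67,892
right_bottom = sum_x ** 2    # 678 times 678
denominator = left_bottom - right_bottom

# Divide to get the slope b.
b = numerator / denominator

print("numerator:", numerator)
print("denominator:", denominator)
print("b rounded to one decimal:", round(b, 1))
```

The unrounded slope is a little below 1.1; the lecture reports it to one decimal place.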
4166
06:23:13,590 --> 06:23:21,140
So there you go. That's how you do it. And
so now we got to worry about a. So what
4167
06:23:21,140 --> 06:23:28,010
I did was I just wrote B at the top there,
so B is 1.1. And so now we can use B to try
4168
06:23:28,010 --> 06:23:34,042
and figure out a. So remember, look at
my list. Remember, I did x bar and y bar for
4169
06:23:34,042 --> 06:23:40,590
you just so we had that ready. So now we're
going to calculate a by putting in Y bar minus
4170
06:23:40,590 --> 06:23:47,012
and remember order of operation again, we
got to do the B which is 1.1 times x bar.
4171
06:23:47,012 --> 06:23:51,740
So we do that one out first, and then subtract
it from 23.7. And remember, remember, I was
4172
06:23:51,740 --> 06:23:57,700
saying sometimes you get a negative a, well,
we got negative 80 for a. Alright, so we got
4173
06:23:57,700 --> 06:24:04,940
our B, we got our a, and let's go. Now, oh,
if you want to check your work, this should
4174
06:24:04,940 --> 06:24:11,400
work out right. Like you should be able to
take the B times the x bar, right, which is
4175
06:24:11,400 --> 06:24:21,680
1.1 times 96.9 minus 80, you know, the a, and
you should get 23.7. So if that works out,
4176
06:24:21,680 --> 06:24:26,080
then you know you did everything right. But
remember what the goal was, the goal was to
4177
06:24:26,080 --> 06:24:31,510
actually fill in that least squares line equation.
So if you look over on the right, that's what
4178
06:24:31,510 --> 06:24:37,010
we did. So we still have our Y hat, we still
have our equals, now we have a 1.1 where the
4179
06:24:37,010 --> 06:24:43,450
B belongs. We still have that x because those
are variables that we had in the x, and then
4180
06:24:43,450 --> 06:24:48,080
we do minus 80. Because we came out with a
negative one. If it had been just plain 80, we
4181
06:24:48,080 --> 06:24:54,622
would say plus 80, okay. All right, at the
beginning of this presentation, I teased you
4182
06:24:54,622 --> 06:24:58,500
that we were going to do prediction with the
least squares line equation. We weren't going
4183
06:24:58,500 --> 06:25:03,180
to use a crystal ball. We were going to use
this equation. Well, I finally get to that
4184
06:25:03,180 --> 06:25:08,940
exciting part of this presentation. But, and
there's always a big but, I first have to
4185
06:25:08,940 --> 06:25:13,730
warm you up with some rules, right? First
of all, I just want you to reflect on what
4186
06:25:13,730 --> 06:25:19,170
we just did. And realize that we can draw
the least squares line. But unlike algebra,
4187
06:25:19,170 --> 06:25:24,730
our xy pairs probably aren't on it, right?
Like in this example, none of the XY pairs
4188
06:25:24,730 --> 06:25:31,378
are on it. So you need to be sure about at
least one xy pair that's actually going to
4189
06:25:31,378 --> 06:25:35,570
land on the least squares line. And the only
one that you can be sure of is going to land
4190
06:25:35,570 --> 06:25:41,850
on the least squares line is x bar comma y bar.
And if you reflect on it, that's why we had
4191
06:25:41,850 --> 06:25:46,390
to calculate that right, because we had to
use x bar and y bar in the calculation to
4192
06:25:46,390 --> 06:25:52,730
back-calculate a, the y-intercept. Now, you
may be lucky and get a data set where there
4193
06:25:52,730 --> 06:25:58,560
is an x y pair that just happens to fall on
the least squares line, or maybe even a couple
4194
06:25:58,560 --> 06:26:04,830
or maybe more. But you can't trust that. So
if you need to trust that there's a point
4195
06:26:04,830 --> 06:26:10,708
on the least squares line, you know, it's
always going to be x bar comma y bar. All
4196
06:26:10,708 --> 06:26:11,708
right.
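To see why x bar comma y bar is the point you can trust, here is a minimal sketch using the totals quoted in the lecture. It back-calculates a from the means and then confirms that plugging x bar into the finished equation returns y bar. The variable names are chosen for this sketch.

```python
# Totals quoted in the lecture for the seven patients.
n, sum_x, sum_y = 7, 678, 166
sum_xy, sum_x2 = 18_458, 67_892

# The two new calculations: the mean of the x's and the mean of the y's.
x_bar = sum_x / n
y_bar = sum_y / n

# Slope first, because the intercept formula needs it.
b = (n * sum_xy - sum_x * sum_y) / (n * sum_x2 - sum_x ** 2)

# Back-calculate the y-intercept from the means.
a = y_bar - b * x_bar

# Plugging x bar into the line should give back y bar.
y_hat_at_x_bar = b * x_bar + a

print("x bar:", round(x_bar, 1))
print("y bar:", round(y_bar, 1))
print("b:", round(b, 1))
print("a:", round(a))
print("y hat at x bar:", round(y_hat_at_x_bar, 1))
```

One caution about rounding: this sketch keeps the slope unrounded until the end, which gives an intercept of about -80. Plugging the rounded 1.1 and 96.9 into the intercept formula by hand gives roughly -82.9 instead, so the lecture's -80 appears to come from the unrounded slope.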
4197
06:26:11,708 --> 06:26:18,773
And now I want to focus more specifically on
the slope, or b, right. So remember, we
4198
06:26:18,773 --> 06:26:24,321
just in our example, calculated b and we got
1.1 for b, and that's a slope. So I want
4199
06:26:24,321 --> 06:26:29,520
to point out that the slope b of the least
squares line tells us how many units the
4200
06:26:29,520 --> 06:26:35,870
response variable or Y is expected to change
for each one unit of change in the explanatory
4201
06:26:35,870 --> 06:26:41,410
variable or x. So that's a little kind of
a tongue twister. But if you think of our
4202
06:26:41,410 --> 06:26:47,340
example, it's a little easier to understand.
So the fact that that slope was 1.1, in our
4203
06:26:47,340 --> 06:26:52,260
example, and that we were having x be DBP and
y be number of appointments over the last
4204
06:26:52,260 --> 06:26:57,930
year, what we're essentially saying by that
is, for each increase in one mmHg of DBP,
4205
06:26:57,930 --> 06:27:04,720
or the x, for each increase of one of those,
there is a 1.1 increase in the number of appointments
4206
06:27:04,720 --> 06:27:12,450
the patient had over the past year. So as
DBP goes up by one, then the appointments
4207
06:27:12,450 --> 06:27:17,140
goes up by 1.1. Well, I don't know what 1/10
of an appointment is, but you get what I'm
4208
06:27:17,140 --> 06:27:23,560
saying because it's just a Y, okay. And so
the number of units change in the Y for each
4209
06:27:23,560 --> 06:27:30,880
unit change in X is called the marginal change
in the y. So if you sort of think about
4210
06:27:30,880 --> 06:27:37,630
it, that's 1.1. So 1.1 is the slope. But 1.1
is also the marginal change in the Y for each
4211
06:27:37,630 --> 06:27:45,720
unit change in the x. Now, I also want to
just recall for you this concept of influential
4212
06:27:45,720 --> 06:27:51,370
points, right? So like with r, if a point
is an outlier, and remember, we should have
4213
06:27:51,370 --> 06:27:54,720
done a scatterplot and everything before
we got to this point, because we need r,
4214
06:27:54,720 --> 06:27:59,958
we need all those sums of x's and sums of
y's and sums of sums and whatever, right.
4215
06:27:59,958 --> 06:28:04,640
And so like with r, if a point is an outlier,
and you can see it on the scatterplot, it
4216
06:28:04,640 --> 06:28:07,952
can really drastically influence the least
squares line equation, just like it can
4217
06:28:07,952 --> 06:28:14,390
screw up r, right. And so an extremely high
x or an extremely low X can do this. And I
4218
06:28:14,390 --> 06:28:20,210
was just, you know, pointing out a culprit
we have here on the scatterplot. So always
4219
06:28:20,210 --> 06:28:24,640
check your scattergram first for outliers,
because you could end up in a situation where
4220
06:28:24,640 --> 06:28:27,930
you're making a least squares line and there's
a bunch of outliers, you know, whacking it
4221
06:28:27,930 --> 06:28:35,352
out. Okay, now I'm gonna also bring up, you're
probably like, when do we get to the prediction
4222
06:28:35,352 --> 06:28:38,900
part? I'm like, you just have to relax, I
have to get through a few of these issues,
4223
06:28:38,900 --> 06:28:44,680
right? So one of them is the residual. And
you know, the word residual, like it kind
4224
06:28:44,680 --> 06:28:49,080
of sounds like residue, right? Like, say,
you know, somebody comes over and sets
4225
06:28:49,080 --> 06:28:53,180
their cup on your coffee table without using
a coaster that leaves some residue and you
4226
06:28:53,180 --> 06:28:57,500
get all mad, okay, well, that's kind of what
a residual is. It's like kind of like residue,
4227
06:28:57,500 --> 06:29:01,850
it's like something left over, right. So once
the equation is there, once you make the least
4228
06:29:01,850 --> 06:29:07,160
squares line equation, there's something I
just want you to notice. And that is you can
4229
06:29:07,160 --> 06:29:12,570
take each x, remember how we had seven patients,
they each had an X, you can theoretically
4230
06:29:12,570 --> 06:29:19,310
take each x, plug it into the equation and
get the Y hat out, right? So I want to just
4231
06:29:19,310 --> 06:29:25,128
demonstrate doing that. So we have our equation
in the upper right here. So for patient one, I took
4232
06:29:25,128 --> 06:29:31,680
patient one's x, which was 70. And I plugged
it in 70 times 1.1 minus 80. You know, I put
4233
06:29:31,680 --> 06:29:37,150
in the equation and I got negative three.
Now that's why had the real why I put it on
4234
06:29:37,150 --> 06:29:42,870
the screen here is actually three. So as you
can see, you know it's not the same answer,
4235
06:29:42,870 --> 06:29:48,440
right? And then patient two I did it with
patient two also I did 1.1 times 115 because
4236
06:29:48,440 --> 06:29:52,720
that's the x and then minus 80. You know,
because that's the rest of the equation. And
4237
06:29:52,720 --> 06:30:00,280
I got 46.5. Now that was a little closer, because
look at patient two's y. That was 45, so
4238
06:30:00,280 --> 06:30:05,050
it's really close to this 46.5, that's a little
bit better. But the reason I was doing all
4239
06:30:05,050 --> 06:30:12,480
that is I just wanted to tell you the residual
is y minus y hat. So in the first case, we
4240
06:30:12,480 --> 06:30:17,480
have y hat was negative three and y was three.
So for patient one, we did three minus negative
4241
06:30:17,480 --> 06:30:21,850
three, and we got six. So that's the residual;
it's kind of like residue, right? It's like
4242
06:30:21,850 --> 06:30:27,180
the residue leftover between Y hat and y,
right? And then for patient two, we did it again:
4243
06:30:27,180 --> 06:30:35,550
we took y which is 45 minus y hat, which was
bigger, it was 46.5. So we got negative 1.5.
4244
06:30:35,550 --> 06:30:40,660
So that's the residual. So this is how
you calculate the residual. And this is what
4245
06:30:40,660 --> 06:30:46,952
it is, this is how you get it. But the bottom
line is, you don't want big residuals, right?
4246
06:30:46,952 --> 06:30:52,410
Because that would mean the line didn't fit
very well. So you'll find that if you have
4247
06:30:52,410 --> 06:30:56,650
a really good fitting line, you have very
small residuals. And so you're probably like,
4248
06:30:56,650 --> 06:31:01,872
well, what's a good fitting line? Well, we'll
get to the coefficient of determination, and
4249
06:31:01,872 --> 06:31:07,740
that'll help you see what constitutes a good
fitting line.
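The residual arithmetic just described can be sketched in a few lines of Python. This is only an illustration, not something from the lecture slides: the name `predict_y_hat` and the list layout are made up here, and only the two patients whose numbers were read out are included.

```python
# Least squares line from the lecture: y-hat = 1.1 * x - 80
SLOPE = 1.1        # b, the marginal change in y per unit change in x
INTERCEPT = -80.0  # a

def predict_y_hat(x):
    """Plug an x (DBP) into the least squares line to get y-hat."""
    return SLOPE * x + INTERCEPT

# (x, y) for the two patients read out in the lecture
patients = [(70, 3), (115, 45)]

residuals = []
for x, y in patients:
    y_hat = predict_y_hat(x)
    residual = y - y_hat  # residual = y minus y-hat
    residuals.append(round(residual, 2))
    print(f"x={x}: y-hat={y_hat:.1f}, y={y}, residual={residual:.1f}")
```

A positive residual means the real y sits above the line, and a negative one means it sits below.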
4250
06:31:07,740 --> 06:31:15,140
But first, I will get to the prediction part,
okay. So you're done with your least squares
4251
06:31:15,140 --> 06:31:20,490
line equation, and you want to use it for
prediction. So let's say you knew someone's
4252
06:31:20,490 --> 06:31:25,190
DBP, and you wanted to predict how many appointments
she or he would have in the next year. Now,
4253
06:31:25,190 --> 06:31:29,010
what you're not doing is you're not using,
you're not reusing your X's from your data,
4254
06:31:29,010 --> 06:31:33,800
we just did that to make the residuals, what
you're doing is actually imagining a new thing
4255
06:31:33,800 --> 06:31:39,900
out there. And you're gonna use this equation
for prediction. So you could plug in the DBP
4256
06:31:39,900 --> 06:31:46,050
as an X, and get the Y hat out, and say that's
your prediction, right? But you gotta use
4257
06:31:46,050 --> 06:31:51,380
some caution. If you use an X within the range
of the original equation, as you can see,
4258
06:31:51,380 --> 06:31:57,321
I put the x's up here, the range of the original
equation was like 70 to 125. Right, those
4259
06:31:57,321 --> 06:32:03,373
were, you know, the areas covered by x, right?
If you do that, if you pick an X, somewhere
4260
06:32:03,373 --> 06:32:07,110
in there, this type of prediction is called
interpolation. And people feel pretty good
4261
06:32:07,110 --> 06:32:11,190
about it. But if you use an x from outside
the range, like one that's really smaller,
4262
06:32:11,190 --> 06:32:17,330
like 65, or one that's bigger, like 130, then
it's called extrapolation. And then it's not
4263
06:32:17,330 --> 06:32:22,850
such a good idea, because you don't know if
it's really going to work, right. So here,
4264
06:32:22,850 --> 06:32:28,250
I'm going to give you an example of interpolation.
A patient in your study has a DBP of 80.
4265
06:32:28,250 --> 06:32:34,208
Okay, so 80's right in there; it's in that
range. So let's use it right. So we do it.
4266
06:32:34,208 --> 06:32:37,510
Now, this looks familiar to you, because we
just did this when we did residuals, but we're
4267
06:32:37,510 --> 06:32:44,240
using a new person now. So 1.1, times 80,
minus 80, equals eight. So this is how we,
4268
06:32:44,240 --> 06:32:48,458
what we would do is predict that this patient
would come to eight appointments next year.
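Here is a small Python sketch of that prediction step, with a check for whether the x falls inside the range of the original data. The function name `predict_appointments` is hypothetical, and the 70 to 125 range is simply the one quoted in the lecture.

```python
# Least squares line from the lecture: y-hat = 1.1 * x - 80
X_MIN, X_MAX = 70, 125  # range of x (DBP) in the original data, per the lecture

def predict_appointments(dbp):
    """Return (y_hat, kind), where kind says whether dbp is inside the x range."""
    y_hat = 1.1 * dbp - 80
    kind = "interpolation" if X_MIN <= dbp <= X_MAX else "extrapolation"
    return y_hat, kind

for dbp in (80, 65, 130):
    y_hat, kind = predict_appointments(dbp)
    print(f"DBP {dbp}: predicted appointments = {y_hat:.1f} ({kind})")
```

For a DBP of 65 the line gives a negative number of appointments, which is one way to see why extrapolation deserves caution.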
4269
06:32:48,458 --> 06:32:53,420
So there, that's how we use our least squares
line equation, like a crystal ball where we
4270
06:32:53,420 --> 06:33:00,740
can predict right? So is it really this easy,
right? Is this all you have to do to predict
4271
06:33:00,740 --> 06:33:08,020
the future? Well, it's not really that easy.
You can't make a linear equation out of any
4272
06:33:08,020 --> 06:33:14,050
old xy pair. So remember this from our last
lecture, see, the scatterplot. It looks like
4273
06:33:14,050 --> 06:33:19,050
what? A cloud. That's right. It doesn't have
a linear equation, you know, it doesn't look
4274
06:33:19,050 --> 06:33:23,970
like it should make a line. But you know what,
you feed that stuff into the software, or
4275
06:33:23,970 --> 06:33:30,110
you feed that stuff into your b formula and
your a formula, you'll get, you'll get a
4276
06:33:30,110 --> 06:33:35,010
line out of it, even if there's no linear
correlation. And so if you get that line out
4277
06:33:35,010 --> 06:33:40,530
of some scatterplot, that looks like this,
then it's not a very good line, right? And
4278
06:33:40,530 --> 06:33:45,350
it wouldn't work very well for prediction,
right? Because this looks pretty unpredictable.
4279
06:33:45,350 --> 06:33:51,080
So for that reason, we can't just accept any
line that is handed to us. To evaluate if
4280
06:33:51,080 --> 06:33:55,700
our least squares line equation should be
used for interpretation, we need the coefficient
4281
06:33:55,700 --> 06:34:04,020
of determination. So here we are at the coefficient
of determination. And so remember how I said
4282
06:34:04,020 --> 06:34:12,190
you have to recycle, recycle, recycle in this?
Well, get out your r; it's time to recycle. So
4283
06:34:12,190 --> 06:34:19,140
the coefficient of determination is also called
r squared. And it literally means r times
4284
06:34:19,140 --> 06:34:26,240
r. And I just have to add this on. Just like
remember the coefficient of variation. Remember
4285
06:34:26,240 --> 06:34:34,410
that one, we always turn r squared into a
percent, right? And so you times it by 100%.
4286
06:34:34,410 --> 06:34:42,470
So in this example that we did remember, early
on in the last lecture, we did the R for this,
4287
06:34:42,470 --> 06:34:46,910
not the scatterplot I just showed you,
but for the one of DBP and the appointments,
4288
06:34:46,910 --> 06:34:52,820
right? And we got an R that was really, really
strong positive correlation, right, we got
4289
06:34:52,820 --> 06:34:57,510
point nine five. Well, if we want to calculate
r squared, which is the coefficient of determination,
4290
06:34:57,510 --> 06:35:02,730
we take point nine five times point nine
five, and we get point nine oh. But we've got
4291
06:35:02,730 --> 06:35:08,441
to do that percent thing. So we end up with
90%. So this is how you say it, you say that
4292
06:35:08,441 --> 06:35:16,390
90% is the variation that's explained in
y by the linear equation, right? So that's,
4293
06:35:16,390 --> 06:35:21,800
you know, y varies, right? Like how many appointments
they had, you know, it was different for each
4294
06:35:21,800 --> 06:35:28,750
person. Well, 90% of that variation is explained
by the equation. And of course, if you take
4295
06:35:28,750 --> 06:35:34,960
100 minus 90%, there's 10% unexplained variation.
So there's still some variation that could
4296
06:35:34,960 --> 06:35:40,750
be explained by other variables, but not a
lot. And how you actually state it is, you know,
4297
06:35:40,750 --> 06:35:45,830
when you're done with this, if you were writing
a paper, you'd say, 90% of the variation in
4298
06:35:45,830 --> 06:35:51,530
the number of appointments is explained by
DBP. And I know people are like, explain,
4299
06:35:51,530 --> 06:35:55,650
like, it doesn't have a mouth, like, what
is it talking about? You just have to say
4300
06:35:55,650 --> 06:36:00,330
it this way. It's statistics-ese;
this is how you say it. And by
4301
06:36:00,330 --> 06:36:06,291
contrast, or by complement, what you would
say is 10% of the variation in the number
4302
06:36:06,291 --> 06:36:12,120
of appointments is not explained by DBP. Right?
It could be explained by other things. Well,
4303
06:36:12,120 --> 06:36:17,970
we happened to get a nice, I say CD for coefficient
of determination. You know, we got a nice
4304
06:36:17,970 --> 06:36:23,600
high one. But what if it's low? Well, let's
just think about it: CD should be better than
4305
06:36:23,600 --> 06:36:30,101
at least 50%, because that would be random,
right? And the higher the better. So if you're
4306
06:36:30,101 --> 06:36:34,670
on a test, nobody's going to give you a CD
of like 60% and say, "Is this any good?" because
4307
06:36:34,670 --> 06:36:39,442
I don't know, you'd be very conflicted. In
real life, what I use it for is to compare
4308
06:36:39,442 --> 06:36:44,390
models. If one is 60% and the other's 55%,
of course, I'm going to go with the 60% one,
4309
06:36:44,390 --> 06:36:49,430
but it's still not very good, right. And if
it's low, you know, the higher the better,
4310
06:36:49,430 --> 06:36:54,030
basically. And if it's low, it means that
you probably need other variables to help
4311
06:36:54,030 --> 06:37:02,770
the x you use to explain more of the variation
because that x is not doing it. Okay, in summary,
4312
06:37:02,770 --> 06:37:08,080
I just wanted to go over chapter four, so
you realize where we've been. Okay. So we
4313
06:37:08,080 --> 06:37:13,610
started out with a set of quantitative x,
y pairs. First thing we did was we made a
4314
06:37:13,610 --> 06:37:17,458
scatterplot, we wanted to look at the linear
relationship between x and y. And we wanted
4315
06:37:17,458 --> 06:37:23,080
to look at outliers. If we'd seen a lot of
outliers, or no linear relationship, we would
4316
06:37:23,080 --> 06:37:27,810
have stopped there. But because this is a
class and we had to learn, I forced there to be
4317
06:37:27,810 --> 06:37:32,680
a scatterplot with a linear variation, and
not too many outliers. So we could move forward
4318
06:37:32,680 --> 06:37:38,542
and do r. So we calculated r to see if
our correlation was positive or negative,
4319
06:37:38,542 --> 06:37:43,510
and weak, moderate, or strong. So that's what
you do if you find a linear relationship.
4320
06:37:43,510 --> 06:37:50,580
Next, in addition, in this lecture, we calculated
B and A to come up with the least squares
4321
06:37:50,580 --> 06:37:56,580
line equation. And I just wanted you to
notice that the sign on B will always match
4322
06:37:56,580 --> 06:38:02,200
the sign on R. So if you have a positive R,
you'll have a positive slope, if you have
4323
06:38:02,200 --> 06:38:06,890
a negative r, you'll have a negative slope, but
otherwise, the numbers won't match, just the
4324
06:38:06,890 --> 06:38:12,470
sign. And then also, I wanted you to notice
that strong correlations will give you high
4325
06:38:12,470 --> 06:38:16,560
coefficient of determination, even if they're
negative correlations, because remember, it's
4326
06:38:16,560 --> 06:38:22,042
r times r. And so negative times negative
is still positive, right? So if you have
4327
06:38:22,042 --> 06:38:28,810
a strong correlation, like negative point nine,
or point nine, it really doesn't matter what
4328
06:38:28,810 --> 06:38:34,550
direction if it's strong, then you're going
to get a high coefficient of determination.
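A quick Python sketch of the r times r calculation, including the point that the sign of r drops out. The function name is made up for illustration. Note that 0.95 squared is 0.9025, which the lecture rounds to 90%.

```python
def coefficient_of_determination(r):
    """Return r squared expressed as a percent."""
    return r * r * 100

for r in (0.95, -0.9, 0.9):
    cd = coefficient_of_determination(r)
    print(f"r = {r}: CD = {cd:.1f}% explained, {100 - cd:.1f}% unexplained")
```

Both r = -0.9 and r = 0.9 give the same coefficient of determination, because a negative times a negative is positive.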
4329
06:38:34,550 --> 06:38:39,708
So after we did this b and a thing, we used
that linear equation to calculate residuals,
4330
06:38:39,708 --> 06:38:44,810
right? Like, we took the x's from the original
data and put them in, got the y hat, and calculated
4331
06:38:44,810 --> 06:38:51,660
the residuals. After that, we used r to calculate
the coefficient of determination or CD, to
4332
06:38:51,660 --> 06:38:56,320
decide if we wanted to use the linear equation
for prediction. Because if it was bad, we
4333
06:38:56,320 --> 06:39:00,730
weren't going to do that. But we decided it was
good for prediction at 90%. And we decided
4334
06:39:00,730 --> 06:39:06,708
to use it. So that was our journey through
these xy pairs all the way down to the coefficient
4335
06:39:06,708 --> 06:39:12,220
of determination. Good job, you made it. So
in conclusion, the least squares criterion,
4336
06:39:12,220 --> 06:39:15,910
and calculating the least squares line was
the first thing we went over how to do that
4337
06:39:15,910 --> 06:39:21,032
and what it all means. And then I reviewed
some issues with prediction using the least
4338
06:39:21,032 --> 06:39:26,042
squares line, because it looks kind of easy.
It looks kind of, you know, better than sliced
4339
06:39:26,042 --> 06:39:29,550
bread, but there are some things you have
to think about. Finally, we went over the
4340
06:39:29,550 --> 06:39:35,410
coefficient of determination so that you could
figure out how good your least squares line
4341
06:39:35,410 --> 06:39:41,600
equation was. And I just wanted to point out
that CD kind of looks like CDs, you know,
4342
06:39:41,600 --> 06:39:47,952
like we used to have CDs. They were so pretty
and rainbowy like that. But now all CD means
4343
06:39:47,952 --> 06:39:57,530
is coefficient of determination. Hello, and
welcome back to statistics. It's Monica Wahi,
4344
06:39:57,530 --> 06:40:03,600
your Labouré College lecturer, and you've made
it to chapter seven, I broke up chapter seven
4345
06:40:03,600 --> 06:40:08,730
into bite sized pieces. And we're going to
start with chapter 7.1, talking about the
4346
06:40:08,730 --> 06:40:14,650
normal distribution and the empirical rule.
So here are your learning objectives for this
4347
06:40:14,650 --> 06:40:19,950
lecture. At the end of this lecture, you should
be able to state two properties of the normal
4348
06:40:19,950 --> 06:40:26,440
curve, state two differences between Chebyshev
intervals and the empirical rule, and explain
4349
06:40:26,440 --> 06:40:30,090
how to apply the empirical rule to a normal
distribution.
4350
06:40:30,090 --> 06:40:35,690
So, remember, distributions, we learned about
them a while back, but I'll remind you a little
4351
06:40:35,690 --> 06:40:39,920
bit about them. And then we're going to talk
about properties of the normal distribution,
4352
06:40:39,920 --> 06:40:45,530
or specifically the normal curve, that shape
that comes out of making a histogram of normally
4353
06:40:45,530 --> 06:40:50,320
distributed data, then we're going to remember
Chebyshev intervals. We're going to talk
4354
06:40:50,320 --> 06:40:54,800
about what Chebyshev did for us, and what
Chebyshev really didn't do for us. And then
4355
06:40:54,800 --> 06:40:59,730
we're gonna move on to the empirical rule,
which works very well, better than Chebyshev
4356
06:40:59,730 --> 06:41:03,920
intervals, when you have normally distributed
data. And then I'm going to show you an example
4357
06:41:03,920 --> 06:41:10,860
of how to apply the empirical rule to that
normally distributed data. So remember, the
4358
06:41:10,860 --> 06:41:15,740
normal distribution, in fact, remember distributions
at all right? So to get a distribution, and
4359
06:41:15,740 --> 06:41:19,500
a lot of people sort of forget this, by the
time we get to chapter seven, but I just wanted
4360
06:41:19,500 --> 06:41:25,032
to remind you, this is from an earlier lecture,
we had a quantitative variable, which was
4361
06:41:25,032 --> 06:41:30,532
how far patients had been transported.
And we determined classes, and we made a frequency
4362
06:41:30,532 --> 06:41:36,120
table. So remember that. And then after that,
we made a frequency histogram, and then made
4363
06:41:36,120 --> 06:41:40,530
a shape. And as you could see that shape,
which is the distribution, that shape in this
4364
06:41:40,530 --> 06:41:45,780
one was skewed right; see that tail on the
right, okay, but that's an example of something
4365
06:41:45,780 --> 06:41:50,352
we cannot apply the empirical rule to, because
the empirical rule only applies to normally
4366
06:41:50,352 --> 06:41:56,980
distributed data. So I had to give you an
example of that. And here's my example. So
4367
06:41:56,980 --> 06:42:02,000
when I was in my undergraduate in costume
design at the University of Minnesota, they
4368
06:42:02,000 --> 06:42:06,700
made us take a chemistry class in one of
those big lecture halls. So I was in a very
4369
06:42:06,700 --> 06:42:11,420
large class that probably had about 100 people.
And we were given this really difficult test,
4370
06:42:11,420 --> 06:42:17,220
it was a 100-point test, and I was used to getting
like A's. And so when they were done with
4371
06:42:17,220 --> 06:42:22,060
the test, the TAs were handing the tests
back to everybody. So they could see their
4372
06:42:22,060 --> 06:42:26,670
grade, while the professor was writing on
the board, and was reading the frequency of
4373
06:42:26,670 --> 06:42:32,870
all the different scores. And I remember the
TA handed me my test, and it said 73 on it.
4374
06:42:32,870 --> 06:42:39,800
And I'm used to getting like 90s, up to 100.
And I remember stating out loud, saying 73,
4375
06:42:39,800 --> 06:42:44,150
that is an awful score, I can't believe I
did so badly. I was talking like that. But
4376
06:42:44,150 --> 06:42:48,490
at the same time, the professor was writing
the frequencies on the board. And what I realized
4377
06:42:48,490 --> 06:42:57,010
is the top score was in the 80s. And I had
the third top score, which was 73. That's how hard
4378
06:42:57,010 --> 06:43:02,030
the test was. And that's when I shut up, because
I noticed everybody giving me dirty looks
4379
06:43:02,030 --> 06:43:08,500
because they had scored actually below me.
So I wanted you to imagine that class. And
4380
06:43:08,500 --> 06:43:13,910
I imagined what the normal distribution would
look like for that class with the distribution
4381
06:43:13,910 --> 06:43:18,670
of the scores. And the reason why I thought
it would be normal is because we all did badly,
4382
06:43:18,670 --> 06:43:25,380
right. And so nobody got 100. So we were all
below the 100. So I imagined this curve here
4383
06:43:25,380 --> 06:43:30,620
for you. And I imagined my class, I had 100
people just to make it easy. Of course, the
4384
06:43:30,620 --> 06:43:36,200
test was difficult. And nobody got 100 points.
And the mode, the median. And the mean, were
4385
06:43:36,200 --> 06:43:41,600
all near a C grade, because you remember how,
when you have a normal distribution, the mode,
4386
06:43:41,600 --> 06:43:48,800
median, and mean are all on top of each other.
So we all did pretty badly. So I'm going to
4387
06:43:48,800 --> 06:43:56,458
use this example of the fake chemistry test
scores to exemplify these properties
4388
06:43:56,458 --> 06:44:01,220
of the normal curve. So there's five I'm going
to talk about. The first is that the curve
4389
06:44:01,220 --> 06:44:04,990
is bell shaped with the highest point over
the mean. And so you can see I drew a scribbly
4390
06:44:04,990 --> 06:44:08,800
little curve, put a little arrow there to
show you that that's where the mean of the
4391
06:44:08,800 --> 06:44:14,150
scores was. And then I also wanted you to
notice that the curve is symmetrical with
4392
06:44:14,150 --> 06:44:19,452
a vertical line through the mean. So there's
like a mirror image of the curve on either
4393
06:44:19,452 --> 06:44:25,280
side. Now, it's not perfect, obviously. But
it should be roughly like that. And you know,
4394
06:44:25,280 --> 06:44:30,930
this is not true of skewed or bimodal or
these other things we've been talking about.
4395
06:44:30,930 --> 06:44:37,150
Okay, and the third property is that the curve
approaches the horizontal axis but never touches
4396
06:44:37,150 --> 06:44:43,240
it. You don't have to memorize this, but remember,
asymptote, or asymptotically close; that's
4397
06:44:43,240 --> 06:44:46,370
when a line gets really close to another line,
but they never touch.
4398
06:44:46,370 --> 06:44:50,660
It's so romantic. But anyway, that's a very
Bollywood thing to say, by the way, but uh,
4399
06:44:50,660 --> 06:44:57,080
so the curve approaches the horizontal axis
and never touches or crosses and then also
4400
06:44:57,080 --> 06:45:02,320
there's this inflection or these transition
points between cupping upward and downward.
4401
06:45:02,320 --> 06:45:07,022
And these transition points occur at about
the mean, plus one standard deviation and
4402
06:45:07,022 --> 06:45:11,800
about the mean minus one standard deviation.
And this is a little hard to explain. But
4403
06:45:11,800 --> 06:45:17,080
imagine you're on a roller coaster and you're
going up this normal curve. There's this part
4404
06:45:17,080 --> 06:45:21,200
where you're just mainly going up. Well, the
part where it seems to kind of level out and
4405
06:45:21,200 --> 06:45:26,792
you're at the top of the curve, it starts
relaxing. That's that inflection point.
4406
06:45:26,792 --> 06:45:30,040
And so as you're going over in the roller
coaster, and you're in that flat part, and
4407
06:45:30,040 --> 06:45:35,090
then you start kind of going down, that's
the second inflection. So what
4408
06:45:35,090 --> 06:45:38,430
it's saying is that the property of this
curve is that you have these inflection points
4409
06:45:38,430 --> 06:45:43,628
like that. And they roughly occur at plus
or minus one standard deviation above and
4410
06:45:43,628 --> 06:45:49,250
below the mean. Then finally, and I colored
this in just so you could see it, the area
4411
06:45:49,250 --> 06:45:54,490
under the entire curve is one, so think 100%.
So it would be nice if that were a square
4412
06:45:54,490 --> 06:45:59,192
or rectangle, or even a triangle, something
that we're used to in geometry, but it's not,
4413
06:45:59,192 --> 06:46:03,830
it's this goofy shape, right? But still, you
need to get it in your head that that shape
4414
06:46:03,830 --> 06:46:11,510
is worth 1.0 in proportion land, or 100% in
percent land. And what I mean by that is,
4415
06:46:11,510 --> 06:46:17,370
let's say we cut that shape in half:
each side would have 50% or point five on
4416
06:46:17,370 --> 06:46:23,532
it, then let's cut it a different way. So
the part of the curve on the right side of
4417
06:46:23,532 --> 06:46:28,390
that line is a fourth of the curve, or 25%
of the curve, even though it's goofy shaped,
4418
06:46:28,390 --> 06:46:33,208
and the part on the left side is 75%. So that's
what we're trying to get you to think like
4419
06:46:33,208 --> 06:46:38,760
is that, yeah, you can just declare that all
the area under the curve equals one or 100%.
4420
06:46:38,760 --> 06:46:42,140
But the reason why we're declaring that is
because we're gonna cut it up and say different
4421
06:46:42,140 --> 06:46:48,860
amounts of percent of the curve. Now we get
to the empirical rule, since we reviewed this
4422
06:46:48,860 --> 06:46:54,000
whole curve thing, and I'm going to make you
remember Chebyshev. I'm sorry, but you know,
4423
06:46:54,000 --> 06:46:58,680
let's talk about Chebyshev. Chebyshev
helped us get some intervals, right? And intervals
4424
06:46:58,680 --> 06:47:02,690
have boundaries, or limits, they have a lower
limit and an upper limit. That's how you know
4425
06:47:02,690 --> 06:47:07,860
what bounds the interval. So when we were
doing Chebyshev intervals, what we would do
4426
06:47:07,860 --> 06:47:12,080
is we'd figure out a lower limit and upper
limit, and we'd say at least so much percent
4427
06:47:12,080 --> 06:47:17,110
of the data falls in the interval, right?
So when we would choose the lower limit of
4428
06:47:17,110 --> 06:47:22,458
mu minus two times the standard deviation,
and the upper limit was mu plus two times
4429
06:47:22,458 --> 06:47:28,530
the standard deviation, we would say at least
75% of the data were in the interval. So I
4430
06:47:28,530 --> 06:47:33,090
wanted to just show you a demonstration using
my fake class. So remember, there were 100
4431
06:47:33,090 --> 06:47:37,550
students in the class, I actually came up
with a mu for them. And their mu on the test
4432
06:47:37,550 --> 06:47:44,060
was 65.5. So my 73 was better than the mean,
but not much better, right. So the mu for
4433
06:47:44,060 --> 06:47:51,200
that class was 65.5. And the standard deviation
was 14.5. So I calculated this Chebyshev
4434
06:47:51,200 --> 06:47:56,970
interval for 75% of the
data. So I took 65.5 minus two times 14.5.
4435
06:47:56,970 --> 06:48:02,440
And I got 36.5, which is a pretty bad grade.
And then the upper limit was pretty good,
4436
06:48:02,440 --> 06:48:09,570
right? 65.5 plus two times 14.5 equals 94.5.
On a 100-point test, that's a pretty good grade,
4437
06:48:09,570 --> 06:48:14,340
right? So if you had 100 data points, or 100
students, at least 75 would have scored between
4438
06:48:14,340 --> 06:48:20,240
36.5 and 94.5. So you're probably already
realizing, okay, that doesn't really help
4439
06:48:20,240 --> 06:48:27,170
Monica, who scored 73. And this is a really
wide range, we say at least 75% of people
4440
06:48:27,170 --> 06:48:32,040
score there, you could probably guess that
without even knowing about Chebyshev intervals,
4441
06:48:32,040 --> 06:48:36,910
right? So it didn't really help me narrow
down, like how well is this class doing? If
4442
06:48:36,910 --> 06:48:40,820
I had had the mu and the standard deviation,
I could have calculated this and said, Okay,
4443
06:48:40,820 --> 06:48:43,820
I'm no better off.
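The interval arithmetic above can be written as a short Python sketch. The "at least" percentage uses 1 - 1/k^2, which is the standard statement of Chebyshev's theorem rather than something spelled out at this point in the lecture; the function name is made up for illustration.

```python
MU = 65.5      # mean of the imagined chemistry scores
SIGMA = 14.5   # standard deviation of those scores

def chebyshev_interval(mu, sigma, k):
    """Return (lower, upper, at_least_percent) for mu +/- k standard deviations."""
    at_least = (1 - 1 / k**2) * 100  # Chebyshev's guarantee, valid for k > 1
    return mu - k * sigma, mu + k * sigma, at_least

for k in (2, 3, 4):
    lower, upper, pct = chebyshev_interval(MU, SIGMA, k)
    print(f"k={k}: {lower} to {upper}, at least {pct:.1f}% of the data")
```

With k = 2 this reproduces the 36.5 to 94.5 interval and the "at least 75%" figure. Wider intervals run past the 0 to 100 range of the test itself, which shows how loose the bound is.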
4444
06:48:43,820 --> 06:48:49,650
So Chebyshev's theorem, on the left side,
applies to any distribution. You don't
4445
06:48:49,650 --> 06:48:53,680
need a normal distribution, you can use that
skewed distribution. Also, you'll notice it
4446
06:48:53,680 --> 06:48:59,050
says at least. So like this was at least 75%
of the data fell in there. Maybe even 100%
4447
06:48:59,050 --> 06:49:03,500
fell in there. So it doesn't really help us.
And as you go out, you start with two standard
4448
06:49:03,500 --> 06:49:09,640
deviations. If you go out three, it's 88.9%.
And four, it's 93.8%. You know, you might
4449
06:49:09,640 --> 06:49:14,680
as well start at the beginning and say almost
100% of the data falls in this interval. And
4450
06:49:14,680 --> 06:49:19,208
if you're saying that it's not very useful,
right. But it kind of gets stuck doing that
4451
06:49:19,208 --> 06:49:24,240
because Chebyshev's theorem applies to any
distribution. The empirical rule is much more
4452
06:49:24,240 --> 06:49:30,700
elite. It only applies to the normal distribution.
And you'll see why if you are lucky enough
4453
06:49:30,700 --> 06:49:35,458
to get the normal distribution that you want
to use the empirical rule instead of Chebyshev.
4454
06:49:35,458 --> 06:49:39,690
Okay? Because, secondly, the empirical rule
says "approximately." It doesn't say "at least,"
4455
06:49:39,690 --> 06:49:46,042
so it's saying basically, not at least it's
saying about exactly this. So you can trust
4456
06:49:46,042 --> 06:49:52,910
it. Okay, you don't have like this unknown,
like maybe 100%. There's, so it says, This
4457
06:49:52,910 --> 06:49:58,022
is what it says and I'll show you a diagram
of it, but it says that 68% of the data are
4458
06:49:58,022 --> 06:50:03,920
in the interval mu plus or minus
one standard deviation. So mu minus one standard
4459
06:50:03,920 --> 06:50:08,660
deviation all the way up to mu plus one standard
deviation 68% of the data are in there. And
4460
06:50:08,660 --> 06:50:12,140
you'll notice that Chebyshev didn't even
say anything about one standard deviation.
4461
06:50:12,140 --> 06:50:18,690
And so already, we've got something way more
useful if we apply the empirical rule, right.
4462
06:50:18,690 --> 06:50:25,532
So next we go to 95% of the data are in the
interval, mu plus or minus two standard deviations,
4463
06:50:25,532 --> 06:50:31,640
95%, approximately 95% are in there. Now,
if we had brought in Chebyshev, we'd be saying
4464
06:50:31,640 --> 06:50:38,290
about this too, we'd be saying 75%. Okay,
we'd be saying at least 75%, which could be
4465
06:50:38,290 --> 06:50:39,290
95%.
4466
06:50:39,290 --> 06:50:44,730
But here, if we're using the empirical rule,
we're relatively sure that it's 95% between
4467
06:50:44,730 --> 06:50:51,070
mu plus or minus two standard deviations. You
can like that better, right? Finally, if you get
4468
06:50:51,070 --> 06:50:55,840
out to three standard deviations, you're kind
of running out of data, because 99.7%, almost
4469
06:50:55,840 --> 06:51:00,708
all of them fall in that interval. So as you
can see, the empirical rule is going to give
4470
06:51:00,708 --> 06:51:06,890
you a more specific answer. But again, you
can only use it if you have a normal distribution,
4471
06:51:06,890 --> 06:51:11,878
which we do. So let's go look at that.
Okay, this is a diagram that I'm going to
4472
06:51:11,878 --> 06:51:16,872
show you. I made it myself, actually, because I
thought the other diagrams I saw were
4473
06:51:16,872 --> 06:51:21,770
not pretty. And this one is very pretty in
my mind, but let me unpack this diagram for
4474
06:51:21,770 --> 06:51:22,770
you, because there's
4475
06:51:22,770 --> 06:51:23,980
a lot going on. And
4476
06:51:23,980 --> 06:51:28,350
first of all, I want you to notice the shape
of it, it's a normal distribution, okay. And
4477
06:51:28,350 --> 06:51:32,120
then I want you to notice that I put this
black line down the middle, and I put a little
4478
06:51:32,120 --> 06:51:37,261
arrow that says mu. So this is where we want
to imagine mu, no matter what
4479
06:51:37,261 --> 06:51:43,610
your actual numbers are for mu. Like in
our case, this is 65.5 for our points. Just
4480
06:51:43,610 --> 06:51:47,820
imagine whatever your mu is, and whatever
your standard deviation is, this is where
4481
06:51:47,820 --> 06:51:52,940
you would put the mu, right? Then you'll
notice that each of these sections that's
4482
06:51:52,940 --> 06:51:58,850
colored, has a little standard deviation symbol
in it, because that's representing that, that
4483
06:51:58,850 --> 06:52:04,700
the width of that is one standard deviation.
So if your standard deviation was like five,
4484
06:52:04,700 --> 06:52:09,670
then it would be mu plus or minus five,
like the green one would be mu plus one standard
4485
06:52:09,670 --> 06:52:13,958
deviation. So it'd be mean plus five, and
then you draw that parallel line there and
4486
06:52:13,958 --> 06:52:18,060
see that arrow that says mu plus one standard
deviation, that would be there. And of course
4487
06:52:18,060 --> 06:52:21,720
I just had to use the symbols, because
I don't know how big the standard deviation
4488
06:52:21,720 --> 06:52:27,040
really would be, or what the mean really would
be. But whatever it was mu plus one standard
4489
06:52:27,040 --> 06:52:32,452
deviation, if you go up there, you would see
that that green area represents 34% of the
4490
06:52:32,452 --> 06:52:37,850
data. And if you're lucky enough to have exactly
100 people, like I did in my demonstration,
4491
06:52:37,850 --> 06:52:43,220
that would mean that between mu and mu plus
one standard deviation of these test scores
4492
06:52:43,220 --> 06:52:49,170
would be 34 people's scores, right, so you
can really figure that out. Same with the
4493
06:52:49,170 --> 06:52:54,550
yellow section only, that's mu minus one standard
deviation, and 34% of the scores would be
4494
06:52:54,550 --> 06:52:55,550
between those two
4495
06:52:55,550 --> 06:52:57,180
numbers.
4496
06:52:57,180 --> 06:53:03,378
Now you'll see as you get up into the blue,
that's between one and two standard deviations
4497
06:53:03,378 --> 06:53:08,330
above the mu, you'll see that because the
roller coaster's a lot lower to the ground
4498
06:53:08,330 --> 06:53:14,180
there, that section is really small, it's
only 13.5% of the data. And the same with
4499
06:53:14,180 --> 06:53:18,210
the orange one that's on the other side of
the mu. So that's below the mean. And that's
4500
06:53:18,210 --> 06:53:22,750
only 13.5. And then you'll notice that at
three standard deviations, between two and
4501
06:53:22,750 --> 06:53:27,850
three, there's a little tiny piece right,
the purple piece and the red piece, those
4502
06:53:27,850 --> 06:53:35,720
are only worth 2.35% of this shape. And then
I wanted to point out there is some stuff
4503
06:53:35,720 --> 06:53:41,570
at the end, in the little black part beyond
three standard deviations on either side,
4504
06:53:41,570 --> 06:53:46,220
there's 0.15%. And a lot of times people
forget that. But one way you can make sure
4505
06:53:46,220 --> 06:53:50,460
that you remember that it's there
is that if you add up all these percents on
4506
06:53:50,460 --> 06:53:56,790
the slide, you'll get 100% because remember,
I promised you that the whole curve
4507
06:53:56,790 --> 06:54:01,520
is worth 100%. And this is how we split it
up. I also want you to notice that there's
4508
06:54:01,520 --> 06:54:08,290
kind of a cheat, right? If you just add up
the green, blue, purple, and then the little
4509
06:54:08,290 --> 06:54:11,910
black part at the end, if you just add up
those percents, you'll get 50%, right, because
4510
06:54:11,910 --> 06:54:17,192
that's half the curve. And the same, you'll
get the same thing if you do the yellow, orange,
4511
06:54:17,192 --> 06:54:21,640
red, and the little black part at
the bottom. If you add those up, you'll get
4512
06:54:21,640 --> 06:54:26,900
50%. So that's how you want to just conceptualize
this whole empirical rule diagram. But now
4513
06:54:26,900 --> 06:54:35,050
we'll apply it. So I put the empirical rule diagram
on the left, and then I put our class frequency
4514
06:54:35,050 --> 06:54:39,872
histogram on the right and look, I put the
mu and I put the standard deviation so we
4515
06:54:39,872 --> 06:54:44,730
could have it there. Now the first part of
this section, I'm just going to show you how
4516
06:54:44,730 --> 06:54:49,390
to fill in the numbers under the diagram.
Okay, and then after we fill in the numbers,
4517
06:54:49,390 --> 06:54:51,330
I'm going to talk to you about how to interpret
4518
06:54:51,330 --> 06:54:54,120
those numbers.
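As a quick check of the percentages just described, here is a minimal Python sketch that adds up the bands of the empirical rule diagram. The color labels follow the slide as it is described in the lecture; the variable names are made up for this illustration.

```python
# Percent of the area in each band of the empirical rule diagram,
# listed from the far left tail to the far right tail.
bands = [
    ("black, below mu - 3 sd", 0.15),
    ("red, mu - 3 sd to mu - 2 sd", 2.35),
    ("orange, mu - 2 sd to mu - 1 sd", 13.5),
    ("yellow, mu - 1 sd to mu", 34.0),
    ("green, mu to mu + 1 sd", 34.0),
    ("blue, mu + 1 sd to mu + 2 sd", 13.5),
    ("purple, mu + 2 sd to mu + 3 sd", 2.35),
    ("black, above mu + 3 sd", 0.15),
]

# Rounding guards against tiny floating-point error in the sums.
total = round(sum(percent for _, percent in bands), 2)
lower_half = round(sum(percent for _, percent in bands[:4]), 2)
upper_half = round(sum(percent for _, percent in bands[4:]), 2)

print(total)       # 100.0, the whole curve
print(lower_half)  # 50.0, everything below mu
print(upper_half)  # 50.0, everything above mu
```

Leaving out either black tail makes the total come up 0.15 short, which is an easy way to notice a forgotten piece.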
4519
06:54:54,120 --> 06:54:59,550
So let's start easy. Let's write the mu
underneath the symbol for mu, which was 65.5.
4520
06:54:59,550 --> 06:55:07,640
So we just wrote that. That was simple, okay. Now
let's do the plus or minus one standard deviation.
4521
06:55:07,640 --> 06:55:14,810
So you'll see 65.5, which is our mu minus,
and I put one times 14.5. I know I just did
4522
06:55:14,810 --> 06:55:19,042
that for demonstration purpose. So you see,
we're doing one times the standard deviation.
4523
06:55:19,042 --> 06:55:24,202
So if you subtract that from the mu, you
get 51. And so I wrote that 51 underneath
4524
06:55:24,202 --> 06:55:30,220
the mu minus one standard deviation. And if
you go the opposite way, and you add on 14.5,
4525
06:55:30,220 --> 06:55:36,080
you get 80. So I put that up there. So that's
I just labeled those two, you can kind of
4526
06:55:36,080 --> 06:55:38,160
guess what we're going to do on the next
4527
06:55:38,160 --> 06:55:39,160
slide.
4528
06:55:39,160 --> 06:55:43,740
Surprise, we're going to do almost the same
thing. Only we're doing the mu minus two times
4529
06:55:43,740 --> 06:55:49,810
the standard deviation to get the 36.5. And
the mu plus two times the standard deviation
4530
06:55:49,810 --> 06:55:56,680
to get that 94.5. And you're probably already
ahead of me with this one. This is where
4531
06:55:56,680 --> 06:56:02,680
we do 65.5 minus three standard deviations,
and we get 22. And then we add three standard
4532
06:56:02,680 --> 06:56:09,390
deviations, and we get 109. And now we're
all labeled. So what does this all mean? Well,
4533
06:56:09,390 --> 06:56:15,310
remember, our n equals 100, just out of convenience.
So what does this mean? It means that 34%
4534
06:56:15,310 --> 06:56:22,940
of the scores are between 51 and 65.5. So
that's the yellow bar. Right? So 34 scores
4535
06:56:22,940 --> 06:56:27,958
were in there, because I had 100 people in the class.
So I'm standing there in that class, and I've
4536
06:56:27,958 --> 06:56:34,610
got a 73. But I know 34 of those people I'm
looking at have a score between 51 and 65.5.
4537
06:56:34,610 --> 06:56:40,000
I also know that another 34%, or another 34
in this class, because there's 100 have a
4538
06:56:40,000 --> 06:56:45,950
score between 65.5 and 80. And my 73 is somewhere
in there, right? So already, I'm getting an
4539
06:56:45,950 --> 06:56:53,390
idea that 68 people, or 68% of the scores,
are going to be between 51 and 80. Right.
4540
06:56:53,390 --> 06:57:01,230
And so I'm right there with 68% of the class.
So I'm going to go through some fake test
4541
06:57:01,230 --> 06:57:05,600
questions for you to just show you how to
come up with the answer. So let's say the
4542
06:57:05,600 --> 06:57:12,782
question was, what percent of the students'
scores are between 36.5 and 80? So think about
4543
06:57:12,782 --> 06:57:18,050
how you would answer that question. So see
where 36.5 is, it's on the lower limit of
4544
06:57:18,050 --> 06:57:23,210
the orange part, and see where the 80 is,
it's on the upper limit of the green part.
4545
06:57:23,210 --> 06:57:29,872
So what you would do is you would add up the
percents in between right 13.5 plus 34, plus
4546
06:57:29,872 --> 06:57:35,292
34? And the answer to what percent of the
data are between 36.5 and 80? The answer would
4547
06:57:35,292 --> 06:57:44,360
be 81.5%. Here's another question. What
cut point marks the top 16% of the scores.
4548
06:57:44,360 --> 06:57:49,080
So already, you know you're up in that area,
probably where the purple or the blue are,
4549
06:57:49,080 --> 06:57:55,458
right? And so what would make the top 16%?
Well, if you actually add together that
4550
06:57:55,458 --> 06:58:02,272
0.15% from the little black part, the 2.35%
from the purple, and the blue 13.5%, you'll
4551
06:58:02,272 --> 06:58:09,362
get 16%. So the cut point then for that is 80: all
the scores above 80, that would constitute
4552
06:58:09,362 --> 06:58:10,860
the top 16%
4553
06:58:10,860 --> 06:58:14,020
of the scores.
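The labeling steps and the first two quiz questions can be reproduced with a short Python sketch. It uses the lecture's mu of 65.5 and standard deviation of 14.5; the helper `percent_between` and the variable names are hypothetical, introduced only for this illustration.

```python
mu = 65.5      # class mean from the lecture
sigma = 14.5   # class standard deviation from the lecture

# Cut points at mu + k * sigma for k = -3, ..., 3.
cut_points = {k: mu + k * sigma for k in range(-3, 4)}
print(cut_points)
# {-3: 22.0, -2: 36.5, -1: 51.0, 0: 65.5, 1: 80.0, 2: 94.5, 3: 109.0}

# Percent of the area between neighboring cut points, keyed by (k, k + 1).
band_percents = {
    (-3, -2): 2.35, (-2, -1): 13.5, (-1, 0): 34.0,
    (0, 1): 34.0, (1, 2): 13.5, (2, 3): 2.35,
}
tail_percent = 0.15  # the little black part beyond 3 standard deviations


def percent_between(k_low, k_high):
    """Percent of scores between mu + k_low*sigma and mu + k_high*sigma."""
    return round(sum(band_percents[(k, k + 1)] for k in range(k_low, k_high)), 2)


# What percent of the scores are between 36.5 (k = -2) and 80 (k = +1)?
between_36_5_and_80 = percent_between(-2, 1)
print(between_36_5_and_80)  # 81.5

# What percent of the scores are above 80? Blue + purple + upper black tail.
above_80 = round(percent_between(1, 3) + tail_percent, 2)
print(above_80)  # 16.0, so 80 is the cut point for the top 16%
```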
4554
06:58:14,020 --> 06:58:20,920
Here's another quiz question, what percent
of the scores are below 94.5. So we see 94.5
4555
06:58:20,920 --> 06:58:25,230
is at the upper limit of the blue section.
So you could kind of say, well, let's just
4556
06:58:25,230 --> 06:58:30,300
add up everything below. Right, we'll add
up everything below it, and that percent of the
4557
06:58:30,300 --> 06:58:36,292
scores will be below 94.5. And so we do that
we add everything below it. But remember how
4558
06:58:36,292 --> 06:58:41,240
I said that the yellow, orange,
red, and the little black part there that
4559
06:58:41,240 --> 06:58:47,220
that equals 50%? If you just wanted to say
okay, that's 50% plus the green part, plus
4560
06:58:47,220 --> 06:58:53,330
the blue part, you could do that, and then
you get the same answer. So what are the cut
4561
06:58:53,330 --> 06:58:58,100
points for the middle 68% of the data? I
just wanted to show you an example. What if
4562
06:58:58,100 --> 06:59:04,300
they say middle, right? Well, you're gonna
have to be centered around mu, right?
4563
06:59:04,300 --> 06:59:10,220
So the middle 68% means 34% above the mean,
and 34% below the mean. So the cut points
4564
06:59:10,220 --> 06:59:18,180
would be 51 to 80. Okay, now I'm going to
ask a similar question, but I'm going to use
4565
06:59:18,180 --> 06:59:25,700
different words. Okay. What is the probability
that if I select one student from this class,
4566
06:59:25,700 --> 06:59:31,170
that student will have a score less than 80?
Okay, so notice, I'm using totally different
4567
06:59:31,170 --> 06:59:38,260
terminology. I'm saying, what is the probability?
Yet the actual answer is what you
4568
06:59:38,260 --> 06:59:44,470
would probably guess, which is where you add
up all the percents below 80. So the point
4569
06:59:44,470 --> 06:59:49,720
of me giving you these quiz questions is to
point out that percent and probability mean
4570
06:59:49,720 --> 06:59:54,060
the same thing when you talk. So either I'm
gonna say what percent of the data are below
4571
06:59:54,060 --> 06:59:59,740
the score of 80? Or what is the probability
that if I select one student, that student
4572
06:59:59,740 --> 07:00:05,220
will have scored less than 80? That is actually
the same question. So the answer is going
4573
07:00:05,220 --> 07:00:11,522
to be, I use that 50% trick here. That answer
is 50%, which is the whole bottom half of
4574
07:00:11,522 --> 07:00:19,350
that curve plus 34% gets up to 84%. Right?
So the probability that if I select one
4575
07:00:19,350 --> 07:00:24,720
student, that student will have a score less
than 80 is 84%. And that's the same as what
4576
07:00:24,720 --> 07:00:31,980
percent of the data is below 80 is 84%. Okay.
Here's another probability question, what
4577
07:00:31,980 --> 07:00:39,730
is the probability I will select a student
with a score between 36.5 and 51? Well, that's
4578
07:00:39,730 --> 07:00:46,020
as if I was asking, it's the same question
as what percent of the data are between 36.5
4579
07:00:46,020 --> 07:00:51,920
and 51? You would know the answer to that:
that would be 13.5%. That's the orange part,
4580
07:00:51,920 --> 07:00:57,180
right? But even if I say, what is the probability,
I will select a student with a score between
4581
07:00:57,180 --> 07:01:04,520
36.5 and 51, it's 13.5%. So let's say that we were
at a casino, and we were betting, right. And
4582
07:01:04,520 --> 07:01:09,100
I'm like saying, okay, there's 100 students,
I'm going to just grab a score out, and I'm
4583
07:01:09,100 --> 07:01:15,140
betting a lot of money that I'm going to grab
somebody between 36.5 and 51. And you'd probably
4584
07:01:15,140 --> 07:01:21,250
be like, you don't want to bet on that. Because
you only have 13.5% probability of selecting
4585
07:01:21,250 --> 07:01:26,800
one, you probably want to bet if you're going
to bet on something in the yellow section
4586
07:01:26,800 --> 07:01:31,240
or something in the green section, because
they have higher probability. So that's how
4587
07:01:31,240 --> 07:01:36,070
you would think about probability. And percent,
even though they're kind of the same thing.
4588
07:01:36,070 --> 07:01:39,080
I just wanted to show you how they word the
questions differently.
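To make the percent and probability equivalence concrete, here is a small Python sketch that answers the "below" and "between" questions from the same band percentages. The function `percent_below` is a hypothetical helper for this illustration, and it only works at the whole-number cut points k = -3 through 3.

```python
# Percent of the area between neighboring cut points mu + k*sigma,
# from k = -3 up to k = +3, read off the empirical rule diagram.
band_percents = [2.35, 13.5, 34.0, 34.0, 13.5, 2.35]
tail_percent = 0.15  # area below mu - 3 sd (and, separately, above mu + 3 sd)


def percent_below(k):
    """Percent of scores below mu + k*sigma, for whole k from -3 to 3."""
    return round(tail_percent + sum(band_percents[: k + 3]), 2)


# With mu = 65.5 and sd = 14.5: 80 is k = +1, 94.5 is k = +2,
# 51 is k = -1, and 36.5 is k = -2.
below_80 = percent_below(1)
below_94_5 = percent_below(2)
between_36_5_and_51 = round(percent_below(-1) - percent_below(-2), 2)

print(below_80)             # 84.0, the 50% trick plus the green 34%
print(below_94_5)           # 97.5
print(between_36_5_and_51)  # 13.5, the orange part

# The same number read as a probability of selecting one student.
probability_below_80 = below_80 / 100
print(probability_below_80)  # 0.84
```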
4589
07:01:39,080 --> 07:01:45,280
But it means the same thing. So now I want
you to just sit back and think for a second.
4590
07:01:45,280 --> 07:01:50,140
So think about what would happen in a different
class taking the same hard test, meaning nobody's
4591
07:01:50,140 --> 07:01:55,730
getting 100%? What if the mu was the same,
meaning everybody's doing badly. But the standard
4592
07:01:55,730 --> 07:02:00,772
deviation was larger than 14.5? What would
that do to the intervals? So let's just stare
4593
07:02:00,772 --> 07:02:05,560
at this for a second. Let's say the mu was
still 65.5. But the standard deviation was
4594
07:02:05,560 --> 07:02:11,628
like 30. Okay, there was a lot of variation
in the class, that would already mean that
4595
07:02:11,628 --> 07:02:19,410
where the 80 is right now, that would
actually be 95.5. Right? And where that 51
4596
07:02:19,410 --> 07:02:25,040
is there. Now, if we have a standard deviation
of 30, that would actually be 35.5. I mean,
4597
07:02:25,040 --> 07:02:30,470
that'd be a way bigger interval, right. And
so the class I was in in chemistry was an
4598
07:02:30,470 --> 07:02:34,300
undergraduate class; I was in costume design.
This was a whole bunch of different kinds
4599
07:02:34,300 --> 07:02:38,870
of people in chemistry. And that's probably
why we even had kind of a big standard deviation
4600
07:02:38,870 --> 07:02:43,522
of 14.5. Even though I made that up. I mean,
in reality, we probably did have a big standard
4601
07:02:43,522 --> 07:02:49,780
deviation. I knew in the chemical engineering
department, they had chemistry classes for
4602
07:02:49,780 --> 07:02:53,890
chemical engineering majors, I'll tell you,
their standard deviation was probably a lot
4603
07:02:53,890 --> 07:02:59,910
smaller, because they were probably more alike
and got more similar grades as each other.
4604
07:02:59,910 --> 07:03:04,630
But with this diverse class, we probably had
a pretty big standard deviation. So that gets
4605
07:03:04,630 --> 07:03:08,280
to my last question, what if the standard
deviation was actually smaller than 14.5.
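The effect of the standard deviation on the interval can be sketched in a few lines of Python. The mean stays at 65.5 and only the standard deviation changes: 30 for the more varied class, 14.5 for the original class, and 5 as an example of a smaller value. The variable names are invented for this illustration.

```python
mu = 65.5  # same mean in every scenario

# Interval from mu - 1 sd to mu + 1 sd (about the middle 68% of scores)
# for a large, the original, and a small standard deviation.
intervals = {sigma: (mu - sigma, mu + sigma) for sigma in (30.0, 14.5, 5.0)}

for sigma, (low, high) in intervals.items():
    print(f"sd = {sigma}: middle 68% runs from {low} to {high}")
# sd = 30.0: middle 68% runs from 35.5 to 95.5
# sd = 14.5: middle 68% runs from 51.0 to 80.0
# sd = 5.0: middle 68% runs from 60.5 to 70.5
```

The percentages in each band never change; only the scores sitting at the cut points move.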
4606
07:03:08,280 --> 07:03:12,750
So if we were like in the chemical engineering
class, and they were taking chemistry, and
4607
07:03:12,750 --> 07:03:17,390
they had a smaller standard deviation, maybe
they might have had the same mean 65.5. But
4608
07:03:17,390 --> 07:03:24,580
let's say their standard deviation was like
five, then where the 80 is now would be a
4609
07:03:24,580 --> 07:03:34,870
70.5. And where the 51 is, would be a 60.5.
And we'd have way more confidence of where
4610
07:03:34,870 --> 07:03:40,910
we knew the scores fell, like as I was standing
there with my 73. I would be saying like,
4611
07:03:40,910 --> 07:03:46,550
Oh, you know, my 73 is pretty high, if everybody
has a small standard deviation, right? Whereas
4612
07:03:46,550 --> 07:03:50,128
it's not that high here, because we have kind
of a big standard deviation. That's in the
4613
07:03:50,128 --> 07:03:56,320
first though the green part. So the reason
why I want you to think about that is, that's
4614
07:03:56,320 --> 07:03:57,740
why
4615
07:03:57,740 --> 07:03:58,740
this
4616
07:03:58,740 --> 07:04:04,870
shape goes by mu and standard deviation, because
it really matters how big the standard deviation
4617
07:04:04,870 --> 07:04:15,010
is, how big each of those areas are with the
different colors. So I just wanted to remind
4618
07:04:15,010 --> 07:04:20,700
you that percent, area and probability are
all related. The percents literally refer
4619
07:04:20,700 --> 07:04:27,240
to the percent of the area of the shape, okay?
And imagine the whole thing is 100%. So just
4620
07:04:27,240 --> 07:04:33,800
to remind you, the orange part is 13.5% of
the area of the whole shape, but it also is
4621
07:04:33,800 --> 07:04:40,850
the probability that an x, like a student's
x, falls between mu minus one standard deviation
4622
07:04:40,850 --> 07:04:47,120
and mu minus two standard deviations. And
if I select one x from a group, this group,
4623
07:04:47,120 --> 07:04:55,180
then 13.5% is the probability that I will
get an X in that range. And so it means both
4624
07:04:55,180 --> 07:05:02,942
things. So in conclusion, the empirical rule
helps establish intervals that apply to normally
4625
07:05:02,942 --> 07:05:09,390
distributed data. And it's more useful than
Chebyshev, because it's more specific. These
4626
07:05:09,390 --> 07:05:14,330
intervals have a certain percentage of the
data points in them. And they also refer to
4627
07:05:14,330 --> 07:05:20,730
the probability of selecting an X in that
interval. And these intervals depend on the
4628
07:05:20,730 --> 07:05:26,020
mean and the standard deviation of the data
distribution. So if those change then exactly
4629
07:05:26,020 --> 07:05:31,860
where the numbers are on those intervals change.
Well, I hope you enjoyed my explanation of
4630
07:05:31,860 --> 07:05:39,870
the empirical rule. And now you can practice
doing it yourself at home. Good morning, good
4631
07:05:39,870 --> 07:05:45,480
day. And good afternoon. This is Monica wahi,
your library college lecturer here moving
4632
07:05:45,480 --> 07:05:51,532
you through chapter 7.2, and 7.3, z scores
and probabilities, I decided to merge these
4633
07:05:51,532 --> 07:05:56,280
two chapters together, because I thought they
actually kind of belong together, I didn't
4634
07:05:56,280 --> 07:05:59,941
really understand why they were separated.
So at the end of this lecture, you should
4635
07:05:59,941 --> 07:06:05,590
be able to explain how to convert an X to
a z score, show how to look up a z score in
4636
07:06:05,590 --> 07:06:11,300
a Z table. Explain how to find the probability
of an X falling between two values on a normal
4637
07:06:11,300 --> 07:06:16,070
distribution, describe how to use the Z table
to look up a z corresponding to a percentage,
4638
07:06:16,070 --> 07:06:21,560
and describe how to use the formula to calculate
x from a z score. Well, that sounds like a
4639
07:06:21,560 --> 07:06:25,290
lot, but you'll understand that at the end
of this lecture, first, I'm going to go over
4640
07:06:25,290 --> 07:06:29,920
what a z score is and what the standard normal
distribution is. Then I'm going to talk about
4641
07:06:29,920 --> 07:06:34,670
Z score probabilities. And what those are,
I'm going to show you how to use the Z table
4642
07:06:34,670 --> 07:06:39,350
to answer some harder questions besides the
ones I talked about during the z score probabilities
4643
07:06:39,350 --> 07:06:43,170
section, then I'm going to show you how to
use a slightly different formula to calculate
4644
07:06:43,170 --> 07:06:49,350
x from z. Finally, I'm going to just remind
you some tips and tricks about using z scores
4645
07:06:49,350 --> 07:06:55,890
and probabilities correctly. So all this talk
about z scores. So what is the z score? And
4646
07:06:55,890 --> 07:07:01,660
what is the standard normal distribution?
Well, let's take a look at this very, pretty
4647
07:07:01,660 --> 07:07:08,610
thing I made. You may recognize it from the
last lecture, it was my little Empirical Rule
4648
07:07:08,610 --> 07:07:14,400
diagram. So remember, the empirical rule,
remember how it required a normal distribution?
4649
07:07:14,400 --> 07:07:19,830
Well, that worked well for the cut points
available, right? Like mu mu plus or minus
4650
07:07:19,830 --> 07:07:25,030
one standard deviation, mu plus or minus two
standard deviations. If we ask questions that
4651
07:07:25,030 --> 07:07:31,150
were right on those cut points, we had good
answers. But what about in between those cut
4652
07:07:31,150 --> 07:07:37,120
points. So I wanted you to notice, in this
Empirical Rule diagram, these numbers at the
4653
07:07:37,120 --> 07:07:40,640
bottom, like I just circled them, like negative
three, negative two, negative one, and then
4654
07:07:40,640 --> 07:07:46,850
mu doesn't have a number. So pretend there's
a zero there. And then there's one, two and
4655
07:07:46,850 --> 07:07:55,670
three, okay? That is the standard normal distribution.
And that is also called z. So these things
4656
07:07:55,670 --> 07:08:03,190
on the right, those are z scores. So see the
green area, zero is the z score that's on
4657
07:08:03,190 --> 07:08:09,160
the lower limit of that, and one is the z
score at the upper limit of the green area.
4658
07:08:09,160 --> 07:08:13,378
So you can see that this whole curve, the
the standard normal distribution on the right,
4659
07:08:13,378 --> 07:08:18,360
the whole, the mean of the whole curve is
zero. And the standard deviation of the whole
4660
07:08:18,360 --> 07:08:24,270
curve is one. And that is what a z score is.
So I just want you to notice the concept of
4661
07:08:24,270 --> 07:08:31,042
standard. I'm in the US. And in the US,
we use, you know, the US dollar, but one of
4662
07:08:31,042 --> 07:08:36,250
the things I've noticed is that a lot of countries
see it as a standard. So they'll map their
4663
07:08:36,250 --> 07:08:42,490
currency to the US dollar. So maybe the Euro
will map its currency to the US dollar, maybe
4664
07:08:42,490 --> 07:08:43,490
the Egyptian pound
4665
07:08:43,490 --> 07:08:48,470
will also map its currency to the US dollar.
And once it does that, it's a lot easier to
4666
07:08:48,470 --> 07:08:52,780
compare them, right. And so that's the main
reason for the standard normal distribution
4667
07:08:52,780 --> 07:08:58,790
is it helps you compare x's from different
distributions, different normal distributions
4668
07:08:58,790 --> 07:09:03,170
that have different means and different standard
deviations from each other. It helps you map
4669
07:09:03,170 --> 07:09:10,770
them to this standard normal distribution
here, that standard, so you can compare them.
4670
07:09:10,770 --> 07:09:16,150
So let's talk about z scores, every value
on a normal distribution. So every x can be
4671
07:09:16,150 --> 07:09:22,800
converted to a z score, just like I was saying
how you can convert any currency to dollars,
4672
07:09:22,800 --> 07:09:23,800
there's some
4673
07:09:23,800 --> 07:09:25,670
formula for that.
4674
07:09:25,670 --> 07:09:31,120
You can convert every x on a normal distribution
to a z score. But you have to know how to
4675
07:09:31,120 --> 07:09:35,570
use the formula right? And what goes into
that formula. Well, first, you need the X
4676
07:09:35,570 --> 07:09:39,640
that you want to convert to a z score. So
you need to pick one, then you need to know
4677
07:09:39,640 --> 07:09:47,010
the mu of your distribution, your normal distribution,
and the standard deviation of your distribution.
4678
07:09:47,010 --> 07:09:52,040
And here are the two formulas that are used.
The one I was just talking about is on the
4679
07:09:52,040 --> 07:09:56,970
left is the formula for calculating the z
score. And we'll go over the one on the right
4680
07:09:56,970 --> 07:10:05,060
later in this lecture. So remember in the
last lecture, I was talking about a class
4681
07:10:05,060 --> 07:10:10,340
that had 100 people in it. And that all took
a really hard test, it was so hard, nobody
4682
07:10:10,340 --> 07:10:16,920
got 100%. And it was 100 point test. So nobody
got 100. The top score was in the 90s. So
4683
07:10:16,920 --> 07:10:17,920
um,
4684
07:10:17,920 --> 07:10:22,860
and remember, in the upper right there was
there's the mu, the mu was 65.5, which
4685
07:10:22,860 --> 07:10:24,950
is a pretty bad score on a 100
4686
07:10:24,950 --> 07:10:30,380
point test, and the standard deviation was
14.5. So I'm going to give you an example
4687
07:10:30,380 --> 07:10:35,840
of calculating a z score on that particular
distribution. So let's say you got a friend,
4688
07:10:35,840 --> 07:10:40,560
you have a smart friend, and that smart friend
got a 90 in the face of all this. Well, let's
4689
07:10:40,560 --> 07:10:45,730
calculate the z score for 90 on this particular
distribution. Okay, so here's what we're going
4690
07:10:45,730 --> 07:10:50,380
to do is, first we're going to remind ourselves,
you don't have to do this in real life when
4691
07:10:50,380 --> 07:10:54,890
you're doing it. But I'm just doing this for
demonstration purposes, is what our Empirical
4692
07:10:54,890 --> 07:11:02,820
Rule stuff looked like. Remember, mu plus
one standard deviation was 80. And mu plus
4693
07:11:02,820 --> 07:11:08,320
two standard deviations was 94.5. So already,
you know, whatever your answer is going to
4694
07:11:08,320 --> 07:11:14,240
be for 90 is it's going to be between one
and two. Right. But we just don't know exactly
4695
07:11:14,240 --> 07:11:17,800
what it's going to be. So I'm just showing
you this for demonstration purposes to relate
4696
07:11:17,800 --> 07:11:22,692
it to the last lecture. But you don't have
to do this in real life when you calculate.
4697
07:11:22,692 --> 07:11:26,610
Okay, so we know that the Z we're going to
calculate is going to be somewhere between
4698
07:11:26,610 --> 07:11:33,770
one and two. And as you'll see, on the slide
here, I labeled over on the z curve, I labeled
4699
07:11:33,770 --> 07:11:39,050
where z equals zero, which is the mu that's
65.5. So we're going to anticipate we're going
4700
07:11:39,050 --> 07:11:43,510
to get a z score, that's somewhere between
one and two. And you'll see in blue, I listed
4701
07:11:43,510 --> 07:11:50,160
the ingredients, right, so we have the smart friend's
score 90, we have the mu 65.5. And we have
4702
07:11:50,160 --> 07:11:57,420
standard deviation 14.5. And then we have
our z formula. So let's do it. Okay, so x
4703
07:11:57,420 --> 07:12:03,060
minus mu is going to be 90, which is our x
minus 65.5. You do that out first, and then
4704
07:12:03,060 --> 07:12:09,340
you divide it by 14.5. And look, our Z score
is 1.69. And that's exactly where we thought
4705
07:12:09,340 --> 07:12:14,590
it would be, it would be somewhere between
one and two. And so as you can see, you can
4706
07:12:14,590 --> 07:12:21,730
take any x and convert it to Z. Here we'll
do another example, only this friend is not
4707
07:12:21,730 --> 07:12:26,331
so smart. This friend actually got a score
that was kind of low, it was so low, it was
4708
07:12:26,331 --> 07:12:31,952
below the mu of 65.5, this poor friend only
got a 50. So let's try it again, let's do
4709
07:12:31,952 --> 07:12:39,090
a z score for 50. So again, you know this
is just for demonstration purposes. But remember,
4710
07:12:39,090 --> 07:12:47,060
in Empirical Rule land 51 was that mu minus
one standard deviation. So we're going to
4711
07:12:47,060 --> 07:12:52,900
expect that, again, between negative one and
negative two in z is where our x of 50 is going
4712
07:12:52,900 --> 07:13:00,140
to land if we calculate the z score. And so
here we are, we calculate the z score, we
4713
07:13:00,140 --> 07:13:07,220
have 50 minus 65.5 divided by 14.5, and we
get negative 1.07. And the reason why it's
4714
07:13:07,220 --> 07:13:11,452
negative is, as you can see, it's on the left
of the mu,
4715
07:13:11,452 --> 07:13:15,640
so then the z score is gonna be negative.
And so as you can see, it's exactly where
4716
07:13:15,640 --> 07:13:21,500
we thought it would be, it would be a little
bit to the left of negative one.
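The two worked examples above follow the formula z = (x - mu) / sigma. Here is a minimal Python sketch of that calculation, using the lecture's mu of 65.5 and standard deviation of 14.5; the function name `z_score` is just a label chosen for this illustration.

```python
def z_score(x, mu, sigma):
    """Convert a raw value x to a z score: how many standard
    deviations x sits above (positive) or below (negative) mu."""
    return (x - mu) / sigma


mu = 65.5     # class mean from the lecture
sigma = 14.5  # class standard deviation from the lecture

z_smart_friend = round(z_score(90, mu, sigma), 2)
z_not_so_smart_friend = round(z_score(50, mu, sigma), 2)

print(z_smart_friend)         # 1.69, between z = 1 and z = 2
print(z_not_so_smart_friend)  # -1.07, just to the left of z = -1
```

Rounding to two decimals matches how z scores are looked up in a Z table, which lists values by hundredths.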
4717
07:13:21,500 --> 07:13:25,720
So now we're going to get into something that's
a little bit harder, which is the z score
4718
07:13:25,720 --> 07:13:30,000
probability. So you're feeling pretty good
about the z score. But now let's talk about
4719
07:13:30,000 --> 07:13:36,420
the probabilities. Okay, so remember the probability
from the empirical rule, this is just old
4720
07:13:36,420 --> 07:13:40,730
Empirical Rule stuff. So remember, I gave
you a question at the end of that lecture,
4721
07:13:40,730 --> 07:13:46,260
I said, What is the probability I will select
a student with a score between 36.5 and 51?
4722
07:13:46,260 --> 07:13:55,230
And remember, the answer was like this orange
area, which is 13.5%. But what if you have
4723
07:13:55,230 --> 07:14:01,980
z scores like 1.69, the smart friend, and
negative 1.07, which is the not-so-smart
4724
07:14:01,980 --> 07:14:06,070
friend, you know, in other words, you have
x's of 90 and 50, which are not on the
4725
07:14:06,070 --> 07:14:12,128
empirical rule? How do you figure out the
percent or the probability? That's the next
4726
07:14:12,128 --> 07:14:20,390
step with your z scores? Okay, so now let's
ask this question, let's say, what is the
4727
07:14:20,390 --> 07:14:26,780
probability that students scored above the
smart friend. Now, we could also ask for below,
4728
07:14:26,780 --> 07:14:32,310
but I'm just choosing to ask for above this
time. So in other words, what is the area
4729
07:14:32,310 --> 07:14:38,990
under the curve from z equals 1.69? All the
way up. So see, like a little ways through
4730
07:14:38,990 --> 07:14:46,620
that blue edge. We wish we knew the area for
everything up from 1.69 Z, through the purple
4731
07:14:46,620 --> 07:14:51,590
area through the little black thing at the
top. We wish we knew that area. We only know
4732
07:14:51,590 --> 07:14:55,560
from the empirical rule what's on the cut
points of like one and two, but we don't know
4733
07:14:55,560 --> 07:15:02,230
these in-between things. So how do we figure
that out? Well, this is another problem here.
4734
07:15:02,230 --> 07:15:08,140
What is the probability that students scored
below the not-so-smart friend, right? And
4735
07:15:08,140 --> 07:15:14,560
in that case, see the diagram, we'd have to
figure out what is the part of the orange
4736
07:15:14,560 --> 07:15:19,750
that that friend gets plus the red and plus
a little black part of the bottom? What is
4737
07:15:19,750 --> 07:15:26,060
the percent or the proportion of the curve
that represents that. So that's what we're
4738
07:15:26,060 --> 07:15:33,932
getting into now. And that's what we do is
we look these up in a Z table. So what the
4739
07:15:33,932 --> 07:15:41,910
Z table is, is basically, they figured out
every single Z score, you could have between
4740
07:15:41,910 --> 07:15:49,650
negative 3.49. And I'll go into why negative
3.49, between negative 3.49 and positive 3.49.
4741
07:15:49,650 --> 07:15:51,420
And they went like every hundredth.
4742
07:15:51,420 --> 07:15:52,420
So
4743
07:15:52,420 --> 07:15:59,310
they figured out for every single one of those
z scores, what the probability is, and
4744
07:15:59,310 --> 07:16:04,390
they actually fit that all on a table. And
so now, what I'm going to show you how to
4745
07:16:04,390 --> 07:16:09,520
do is how to use that table to look up the
probabilities. And by the way, if you look
4746
07:16:09,520 --> 07:16:14,081
up a probability that happens to be on one
of those Empirical Rule cut points, you'll
4747
07:16:14,081 --> 07:16:19,110
get what the empirical rule says. It's just
that the empirical rule is nice, because
4748
07:16:19,110 --> 07:16:22,628
you don't have to pull out the table. But
if you have something that's not on the empirical
4749
07:16:22,628 --> 07:16:30,570
rule, cut points, get out your Z table. So
how do you use the Z table? Well, the first
4750
07:16:30,570 --> 07:16:36,020
thing is you want to figure out what area
you want, right? So we're going to start and
4751
07:16:36,020 --> 07:16:41,530
do the not so smart friend, because that's
a little bit easier actually to demonstrate.
4752
07:16:41,530 --> 07:16:48,380
Okay, so what is the probability that students
scored below the not so smart friend? So,
4753
07:16:48,380 --> 07:16:53,410
which is a secret way of saying, what is the
area under the curve that makes up most of
4754
07:16:53,410 --> 07:16:59,060
that orange part, all the red and the little
black part at the bottom? What is that proportion.
4755
07:16:59,060 --> 07:17:06,340
And so for areas to the left of a specified z value,
you're supposed to use the table directly.
4756
07:17:06,340 --> 07:17:11,220
So I'm going to show you how to use that table
to look up negative 1.07. And then I'm going
4757
07:17:11,220 --> 07:17:16,720
to come back and tell you what they mean by
use it directly. Hi, there. So here we are
4758
07:17:16,720 --> 07:17:21,990
at the Z table. And if you have the book,
you can look it up in the appendix on page
4759
07:17:21,990 --> 07:17:26,430
eight. But there's also a lot of z tables
on the internet. Sometimes they're arranged
4760
07:17:26,430 --> 07:17:31,010
a little differently. So I'm using this one
because it's from the book. So remember, the
4761
07:17:31,010 --> 07:17:37,830
Z that we're looking up, we're looking up
the Z of negative 1.07. So remember, I said
4762
07:17:37,830 --> 07:17:42,930
they had to somehow calculate all the different
probabilities for every single z between negative
4763
07:17:42,930 --> 07:17:49,830
3.49 through positive 3.49. Every hundredth, they
had to come up with that, well, how did they
4764
07:17:49,830 --> 07:17:53,930
fit it all on their table? Well, this is what
they did. See, this is the beginning of the Z table.
4765
07:17:53,930 --> 07:18:00,840
Remember, I said negative 3.49? Well, this
is negative 3.4. And then to find the z of
4766
07:18:00,840 --> 07:18:06,048
negative 3.49, you have to imagine that the
nine is here, but it's going to be the last
4767
07:18:06,048 --> 07:18:13,230
one here. So see this nine here, this is what
it would be. So just for pretend, if we had
4768
07:18:13,230 --> 07:18:22,360
a z score of negative 2.58, I go 2.5. And
then I have to go over to the eight, one right
4769
07:18:22,360 --> 07:18:31,120
here. Okay. Or if I had one that was negative
2.10, right, or negative, just plain 2.1.
4770
07:18:31,120 --> 07:18:39,140
Right? Then I'd go over just one, to this zero
line. And see these little tiny things
4771
07:18:39,140 --> 07:18:43,780
in here. Those are all probabilities. In fact,
let's go look up our probability, which is
4772
07:18:43,780 --> 07:18:50,958
negative 1.07. So we're going to go down here,
negative, here we are at negative 1.0. And
4773
07:18:50,958 --> 07:18:54,112
then we have to go over to the seven column,
right, so where's the seven? Here's the seven,
4774
07:18:54,112 --> 07:19:00,782
it's three from the left, I guess I could
have guessed that. So we have negative 1.07.
4775
07:19:00,782 --> 07:19:12,140
So this is point 1423, otherwise known as
14.23%. So that's actually what you
4776
07:19:12,140 --> 07:19:17,100
get out of the Z table. That's the probability
that's the percent you're looking for. And
4777
07:19:17,100 --> 07:19:21,420
just in case, you're wondering, these aren't
all negative, the first page is negative.
4778
07:19:21,420 --> 07:19:28,530
The second page is positive, it's all the positive
Z scores all the way up to 3.49. But what
4779
07:19:28,530 --> 07:19:34,570
I want you to hold in your head is what we
just looked at, which was negative 1.07, which
4780
07:19:34,570 --> 07:19:35,900
is point 1423.
4781
07:19:35,900 --> 07:19:38,760
Okay, hold that thought.
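A quick way to double-check a Z table lookup like the one above is to compute the same left-tail area in software. This is only a sketch, assuming Python 3.8 or later and its standard-library statistics.NormalDist class; the helper name left_area is made up for illustration and does not come from the lecture or the book.

```python
from statistics import NormalDist

# Standard normal curve (mean 0, standard deviation 1), which is what a Z table tabulates.
STANDARD_NORMAL = NormalDist(mu=0.0, sigma=1.0)

def left_area(z: float) -> float:
    """Area under the standard normal curve to the left of z, rounded to 4 places like a table entry."""
    return round(STANDARD_NORMAL.cdf(z), 4)

# z for the not so smart friend; the table entry read in the lecture was .1423
print(left_area(-1.07))
```

Because this is the area to the left of z, it plays the same role as reading the table entry directly.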
4782
07:19:38,760 --> 07:19:46,260
Okay, here we are back at our slides. And
so look at that green part where it says for
4783
07:19:46,260 --> 07:19:51,060
areas to the left of a specified Z value,
which we're doing with the not so smart friend,
4784
07:19:51,060 --> 07:19:57,200
use the table entry directly. So here was
our table entry. It was point 1423. So we're
4785
07:19:57,200 --> 07:20:01,990
just going to use that number that we found
and we're gonna say the probability then,
4786
07:20:01,990 --> 07:20:09,180
is 14.23%. And that kind of makes logical
sense knowing the empirical rule. Now, I'm
4787
07:20:09,180 --> 07:20:16,860
going to show you an example of why I
was saying, use it directly. In this next
4788
07:20:16,860 --> 07:20:20,250
example, we're going to look at the smart
friend's probability. In fact, we're going
4789
07:20:20,250 --> 07:20:25,560
to ask what is the probability that the students
scored above the smart friend, and the smart
4790
07:20:25,560 --> 07:20:31,090
friend is at z equals 1.69. So I'm going to
demonstrate now, for areas to the right of
4791
07:20:31,090 --> 07:20:36,390
a specified Z value, you either look them
up in the table, then subtract result from
4792
07:20:36,390 --> 07:20:44,560
one, or you use the opposite z, which is in
this case would be negative 1.69. And you'll
4793
07:20:44,560 --> 07:20:49,430
get the same answer whether you do it the
first way or the second way, but I'm going to
4794
07:20:49,430 --> 07:20:54,490
demonstrate both okay. So first, I'm going
to demonstrate what happens when you look
4795
07:20:54,490 --> 07:20:59,640
up the probability in the table for that
z, and then you subtract that probability
4796
07:20:59,640 --> 07:21:07,020
from one. So let's go look up z equals 1.69.
All right, here we are back at our Z table,
4797
07:21:07,020 --> 07:21:12,190
only this time, we're looking up a positive
z. So we don't want this first one, we want
4798
07:21:12,190 --> 07:21:18,650
the second one. So remember, we're looking
up z equals 1.69. So we're looking under here
4799
07:21:18,650 --> 07:21:25,120
for 1.6. And that's right here. And now we
have to go over to the nine column. So that's
4800
07:21:25,120 --> 07:21:35,200
going to be point 9545. So hold that thought,
point 9545. Okay, we're back with our probability
4801
07:21:35,200 --> 07:21:39,670
that we looked up in the Z table. Now remember,
we were supposed to look it up in the table
4802
07:21:39,670 --> 07:21:44,690
and subtract the result from one. So that's
what we're going to do now. So we found point
4803
07:21:44,690 --> 07:21:55,860
9545 in the table, we're going to take one
minus point 9545. And we get 0.0455, or 4.55%,
4804
07:21:55,860 --> 07:22:00,510
this little tiny piece, which kind of makes
sense, because it's right at the top of the
4805
07:22:00,510 --> 07:22:04,790
distribution, just a little piece of the blue,
and the purple, and then the little black
4806
07:22:04,790 --> 07:22:11,140
at the top. Alright, and so what you want
to imagine is that point 9545, which
4807
07:22:11,140 --> 07:22:20,452
is like 95.45%, that's the whole piece
below z equals 1.69. That's most of the blue,
4808
07:22:20,452 --> 07:22:25,470
the green, the yellow, the orange, the red,
and the little black at the bottom, that's
4809
07:22:25,470 --> 07:22:31,458
all in the point 9545. Okay, so again, we
were looking up the area to the right of
4810
07:22:31,458 --> 07:22:37,340
the specified Z value, and I showed you the
first way of doing it, there's another way
4811
07:22:37,340 --> 07:22:43,640
of doing it, and that's where you just use
the opposite z from the get go. So we're going
4812
07:22:43,640 --> 07:22:50,450
to now use the opposite z, we're going
to look up negative 1.69. All right, here
4813
07:22:50,450 --> 07:22:58,020
we are back at the Z table. Only this time,
we're looking at negative 1.69. So negative
4814
07:22:58,020 --> 07:23:03,430
1.6 is the first thing we need to find in
this column. So here we are negative 1.6.
4815
07:23:03,430 --> 07:23:08,208
And then we know nine is the last column.
I'm learning that. So we'll go over here.
4816
07:23:08,208 --> 07:23:14,430
And so that looks familiar, right? Point
0455. Okay, hold that thought. All right,
4817
07:23:14,430 --> 07:23:20,880
we're back. And so as you know, if you look
it up in the table directly, like the 1.69
4818
07:23:20,880 --> 07:23:25,208
directly, and you take that probability, and
you subtract it from one, which is what we
4819
07:23:25,208 --> 07:23:34,190
did last, we got the same answer we got now,
right? Point 0455, or 4.55%. So it is kind
4820
07:23:34,190 --> 07:23:39,570
of more efficient, to just use the opposite
z, if you're looking for areas to the right
4821
07:23:39,570 --> 07:23:45,430
of the specified Z value. But I always say
when you're done looking it up, compare it
4822
07:23:45,430 --> 07:23:50,590
to the picture. And I always say draw a picture
too. You know, I don't mind if you have normal
4823
07:23:50,590 --> 07:23:57,230
curves drawn, drawn over all of your homework,
or all over the wall, I guess, or maybe a
4824
07:23:57,230 --> 07:24:03,610
whiteboard, that's probably more efficient.
But it's best to draw it out. label on there,
4825
07:24:03,610 --> 07:24:09,950
where your z and your x are, and then just
look at it. Because we know that the little
4826
07:24:09,950 --> 07:24:17,340
piece above z equals 1.69 is not 95% of that
curve. It's just not it, that's over 50%.
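The two ways of getting an area to the right of z, plus the reality check against the picture, can be sketched in a few lines. This assumes Python 3.8 or later with the standard-library statistics.NormalDist standing in for the Z table; the function names are hypothetical and only for illustration.

```python
from statistics import NormalDist

Z = NormalDist()  # standard normal curve: mean 0, standard deviation 1

def right_area_subtract(z: float) -> float:
    """Method one: look up z directly, then subtract the result from one."""
    return 1.0 - Z.cdf(z)

def right_area_opposite(z: float) -> float:
    """Method two: look up the opposite z and use that entry directly."""
    return Z.cdf(-z)

z_smart = 1.69  # the smart friend's z from the lecture

print(round(right_area_subtract(z_smart), 4))
print(round(right_area_opposite(z_smart), 4))

# Reality check: for a z above the mean, the piece to the right
# has to be the little piece, so it must be under 50%.
assert right_area_subtract(z_smart) < 0.5
```

Both functions should agree, which mirrors the lecture's point that either method gives the same answer. If you forget to subtract from one, the final check is what catches it.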
4827
07:24:17,340 --> 07:24:23,800
And we can tell that little tiny piece is under
50%. So if you accidentally do the first way
4828
07:24:23,800 --> 07:24:30,260
and forget to subtract from one, you know,
maybe if you check it against your normal
4829
07:24:30,260 --> 07:24:31,260
curve drawing,
4830
07:24:31,260 --> 07:24:37,878
you'll realize oh, I made a mistake. So even
though there's two different ways to find
4831
07:24:37,878 --> 07:24:44,220
the probability, if it's to the right of the
z value, just try to make sure no matter which
4832
07:24:44,220 --> 07:24:51,401
way you use, that you finally do a reality
check against the drawing you make, just to
4833
07:24:51,401 --> 07:24:54,940
make sure you got the right piece because
there's only two pieces. There's a big piece
4834
07:24:54,940 --> 07:24:59,910
and a little piece of the curve, and we got
4.55% we know that's a little piece and we
4835
07:24:59,910 --> 07:25:03,050
know from our drawing that we were looking
for the little piece. So that's how you do
4836
07:25:03,050 --> 07:25:09,400
your reality check. Okay, you thought that
there weren't any harder questions? Well,
4837
07:25:09,400 --> 07:25:13,070
here are some harder questions. So this is
a little bit more on probabilities in the
4838
07:25:13,070 --> 07:25:19,510
Z table. So here's another question we haven't
handled yet. What if you were looking at a
4839
07:25:19,510 --> 07:25:24,320
probability between two scores, such as the
probability the students will score between
4840
07:25:24,320 --> 07:25:27,560
50 and 90, so it's somewhere in the middle,
4841
07:25:27,560 --> 07:25:28,730
okay.
4842
07:25:28,730 --> 07:25:34,660
Note that in that case, when you have a between
one, you actually have two x's, and we'll
4843
07:25:34,660 --> 07:25:39,860
label them x one and x two, so the not so
smart friend is going to be x one, and the
4844
07:25:39,860 --> 07:25:45,420
smarter friend is going to be x two, just
to keep these x's straight. Okay. So the next
4845
07:25:45,420 --> 07:25:50,060
step is you're going to calculate z one and
z two. And I'm kind of cheating. Because we
4846
07:25:50,060 --> 07:25:53,560
already did these, we already knew the Z one
for the not so smart friend was negative 1.07.
4847
07:25:53,560 --> 07:25:59,208
And we already knew the Z two, for the smarter
friend was 1.69. So I just put them on the
4848
07:25:59,208 --> 07:26:05,150
diagram. Okay, and then here's this beginning
of the strategy, and I'll just explain the
4849
07:26:05,150 --> 07:26:10,330
strategy, and then I'll do the strategy. So
for z one, you find the probability to the
4850
07:26:10,330 --> 07:26:14,920
left of the Z, so you find the little piece
to the left. And remember, you can take the
4851
07:26:14,920 --> 07:26:19,048
direct probability from the Z table. So that's
what direct means is you just get to copy
4852
07:26:19,048 --> 07:26:25,020
it directly out of this table. Then for z
two, you find the probability to the right
4853
07:26:25,020 --> 07:26:30,600
or above z. So you find the little piece there.
And you use one of those two methods I showed
4854
07:26:30,600 --> 07:26:38,480
you, which we did together. And then finally,
imagine like the whole curve, you're subtracting
4855
07:26:38,480 --> 07:26:44,180
the piece at the bottom, the Z, one probability,
and you're subtracting the piece at the top.
4856
07:26:44,180 --> 07:26:49,360
So you're trimming with those two pieces to
get the between probability. So that's the
4857
07:26:49,360 --> 07:26:56,042
strategy is basically you find out the
size, the probability of each of the little
4858
07:26:56,042 --> 07:27:00,452
pieces on the sides, you subtract both of
those from one, and that traps whatever's
4859
07:27:00,452 --> 07:27:07,010
left in the middle. So I'll demonstrate this.
So remember, for z one, the probability to
4860
07:27:07,010 --> 07:27:14,440
the left of Z one was point 1423. We did that
together. And then we use both of those methods.
4861
07:27:14,440 --> 07:27:20,650
And they got the same answer to find the probability
to the right of z two, which was point 0455.
4862
07:27:20,650 --> 07:27:25,220
Okay, so that's a little piece at the top,
and then we got the little piece at the bottom.
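The between strategy just described (start with the whole curve, then trim off the bottom piece and the top piece) can be sketched as follows. This is a minimal sketch, assuming Python 3.8 or later and using the standard-library statistics.NormalDist in place of the Z table; between_area is a hypothetical helper name.

```python
from statistics import NormalDist

Z = NormalDist()  # standard normal curve

def between_area(z1: float, z2: float) -> float:
    """Area between z1 and z2 (with z1 below z2): whole curve minus both tails."""
    bottom_piece = Z.cdf(z1)       # left of z1, used directly
    top_piece = 1.0 - Z.cdf(z2)    # right of z2, via method one
    return 1.0 - bottom_piece - top_piece

# z1 for the not so smart friend, z2 for the smarter friend
print(round(between_area(-1.07, 1.69), 4))
```

With the table values .1423 and .0455, the same subtraction by hand is 1 - .1423 - .0455.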
4863
07:27:25,220 --> 07:27:32,420
And now we'll take one minus the piece at
the bottom minus the piece at the top and
4864
07:27:32,420 --> 07:27:39,250
the total is point 8122, or 81.22%,
which kind of makes sense, that's a big piece
4865
07:27:39,250 --> 07:27:43,730
in the middle. So it wouldn't be surprising
if it was about 80% of the curve. So this
4866
07:27:43,730 --> 07:27:50,660
is how you do a between one. Here's another
question I haven't really handled: what if
4867
07:27:50,660 --> 07:27:55,720
you're looking at a probability more than 50%?
So such as the probability that students will
4868
07:27:55,720 --> 07:28:04,030
score greater than 50? Right? Like, like the
big side? Okay? Well, actually, you just do
4869
07:28:04,030 --> 07:28:08,940
what you normally would do, you say for areas
to the right of the specified Z value, either
4870
07:28:08,940 --> 07:28:13,708
look up in the table and subtract the result
from one, or use the opposite z, which in
4871
07:28:13,708 --> 07:28:19,730
this case would be 1.07. So if we did method
one, we'd end up going one minus point 1423,
4872
07:28:19,730 --> 07:28:26,610
which we already looked at, and we get point
8577. If we use method two, we'd take the z of
4873
07:28:26,610 --> 07:28:32,680
1.07, not negative 1.07, but 1.07. And we could
go look it up in the Z table, and we get point
4874
07:28:32,680 --> 07:28:39,130
8577 again, 85.77%. So this isn't
actually a harder question, I just wanted
4875
07:28:39,130 --> 07:28:42,298
to show you how it works when you're getting
like a bigger piece, bigger than 50% piece
4876
07:28:42,298 --> 07:28:50,780
of the distribution. And here's another sort
of similar example, where we're looking at
4877
07:28:50,780 --> 07:28:57,680
the probability that students will score less
than 90, okay. So that's easy, right for the
4878
07:28:57,680 --> 07:29:02,610
areas to the left of the specified Z value,
just use the table directly. So when we went
4879
07:29:02,610 --> 07:29:09,850
and looked up z equals 1.69, we got point
9545. So that's the answer. It's 95.45% of
4880
07:29:09,850 --> 07:29:18,470
the curve is below z equals 1.69, or below
x equals 90. So as I mentioned before, but
4881
07:29:18,470 --> 07:29:22,890
I'll just mention again, you're supposed to
treat all probabilities to the left of z equals
4882
07:29:22,890 --> 07:29:30,500
negative 3.49 as P equals zero. So I showed
you what negative 3.49 looks like in the Z
4883
07:29:30,500 --> 07:29:36,260
table. It's like point 0002. Well, there's
not much smaller than that. So just, if you
4884
07:29:36,260 --> 07:29:43,910
actually calculate z and you get like negative
four, just say the P is zero, okay. Then the
4885
07:29:43,910 --> 07:29:49,190
second thing is treat the table probabilities
for z above 3.49 as P equals one
4886
07:29:49,190 --> 07:29:56,870
or 100%. So as you can imagine, you know,
3.49, that's at the top of the curve. So if
4887
07:29:56,870 --> 07:30:02,110
you calculate a Z and you got like a five,
you can just assume that's 100%, right or
4888
07:30:02,110 --> 07:30:10,458
one. Okay, um, so we've gone through how to
calculate z. And we've talked about looking
4889
07:30:10,458 --> 07:30:15,290
at probabilities in the Z table. And we've
even talked about manipulating those probabilities
4890
07:30:15,290 --> 07:30:23,798
to get certain probabilities. But we haven't
talked about calculating x when z is given.
4891
07:30:23,798 --> 07:30:30,060
So sometimes you're actually given a z. And
you have to calculate the x back
4892
07:30:30,060 --> 07:30:35,630
from the Z. In fact, sometimes it's even harder.
Sometimes you're given a probability. And
4893
07:30:35,630 --> 07:30:39,628
the probability is not as easy. But you can
use the probability, remember that those little
4894
07:30:39,628 --> 07:30:43,140
percents in the middle of the table, you can
go find it in the middle of the table and
4895
07:30:43,140 --> 07:30:49,230
look up the Z that keys to it, and then put
it into this equation. And so I'm going to
4896
07:30:49,230 --> 07:30:54,620
just give you examples of some real life questions
that you might see, like on a homework or
4897
07:30:54,620 --> 07:31:00,180
on a test, probably not in real, real life,
where you need to calculate x, and you
4898
07:31:00,180 --> 07:31:08,292
need to use that formula in the red circle.
So let's say I was just bored. And I was wondering,
4899
07:31:08,292 --> 07:31:16,770
what is the score, the test score on this
distribution, that is at z equals 1.5? Okay,
4900
07:31:16,770 --> 07:31:20,750
so see where z equals 1.5? We never asked
that question before. So let's say I just
4901
07:31:20,750 --> 07:31:25,200
out of curiosity wanted to know, what would
the test score be of a student who was at
4902
07:31:25,200 --> 07:31:35,180
z equals 1.5. So what I would do is I would
take 1.5 times 14.5, because that's what the
4903
07:31:35,180 --> 07:31:39,900
formula says. It's z times the standard deviation.
And then I do that first because order of
4904
07:31:39,900 --> 07:31:47,120
operation. And then after doing that, I'd
add the mu, which is 65.5. And I get 87.3.
4905
07:31:47,120 --> 07:31:55,370
So the x, the student who got 87.3, that student
got a score that's at z equals 1.5. Now,
4906
07:31:55,370 --> 07:32:00,378
as you probably imagine, people don't go around
asking so much about well, I wonder what that
4907
07:32:00,378 --> 07:32:05,830
person's score is at z equals negative 2.3?
Or whatever. They don't usually phrase it
4908
07:32:05,830 --> 07:32:11,310
like that. Usually, you see more like a question
like this, which is what is the score that
4909
07:32:11,310 --> 07:32:19,140
marks the top 7% of scores? And that's a secret
way of saying, We are looking for the Z at
4910
07:32:19,140 --> 07:32:24,470
p equals point 0700. So it's like
we turn that 7% backwards into probability.
4911
07:32:24,470 --> 07:32:29,890
And we say, we're actually looking for the
Z at p equals point 0700. So how
4912
07:32:29,890 --> 07:32:33,020
do you do that? Well, I'm going to show you.
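The z equals 1.5 calculation above can be checked with a short sketch of the x formula. The mean 65.5 and standard deviation 14.5 are the values quoted in the lecture; the function name x_from_z is an assumption made for illustration.

```python
# Values from the lecture's test-score example.
MU = 65.5      # mean (mu)
SIGMA = 14.5   # standard deviation

def x_from_z(z: float, mu: float = MU, sigma: float = SIGMA) -> float:
    """Turn a z score back into a raw score: multiply z by sigma first, then add mu."""
    return z * sigma + mu

# Prints 87.25, which the lecture rounds to 87.3
print(x_from_z(1.5))
```

A negative z goes through the same formula and lands below the mean, and z equals zero gives back mu itself.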
4913
07:32:33,020 --> 07:32:34,290
Okay,
4914
07:32:34,290 --> 07:32:43,260
so we're on the hunt for probability point
0700. Okay, so let's start at the top of the
4915
07:32:43,260 --> 07:32:47,280
table here. You'll see we're digging around
in the middle of the table, right? And you'll
4916
07:32:47,280 --> 07:32:51,580
see like point oh, that's nowhere near the
ballpark, because we're looking for point
4917
07:32:51,580 --> 07:32:57,460
070. So let's scroll up here, or scroll
down, actually. So now we're more in
4918
07:32:57,460 --> 07:33:02,400
the point 04 neighborhood. Here's point
06. Okay, we're getting close. Well, here
4919
07:33:02,400 --> 07:33:03,670
we have point
4920
07:33:03,670 --> 07:33:09,700
0708. And that's point 0008 more than
we want it to be.
4921
07:33:09,700 --> 07:33:18,170
Well, here next door, we have point 0694.
And that's only point 0006 less than we
4922
07:33:18,170 --> 07:33:24,410
want it to be, right? Because if it had point
0006 more, it would be point 0700,
4923
07:33:24,410 --> 07:33:25,410
so this
4924
07:33:25,410 --> 07:33:30,640
is technically closer than this one, because
this is point 0008 off. And this is
4925
07:33:30,640 --> 07:33:39,101
only off by point 0006. So we're gonna choose
point 0694 as the probability
4926
07:33:39,101 --> 07:33:44,000
of record for this for the top 7%. Only, we're
not going to just choose this, we're going
4927
07:33:44,000 --> 07:33:48,870
to figure out what is z at that score. So
what are we gonna do, we're gonna map back
4928
07:33:48,870 --> 07:33:54,780
here, negative 1.4. And then we got to go
all the way up, which we can guess is eight.
4929
07:33:54,780 --> 07:34:03,340
So it's negative 1.48. So hold that thought.
Okay, we started out looking for the z at p equals
4930
07:34:03,340 --> 07:34:14,480
0.0700, but the closest we got was 0.0694,
and that mapped to z equals negative 1.48. Now,
4931
07:34:14,480 --> 07:34:23,390
what I want you to notice is negative 1.48
is actually on the left side of mu. Okay,
4932
07:34:23,390 --> 07:34:29,990
so that is the z score at the bottom 7% of
the scores. So we're going to use the positive
4933
07:34:29,990 --> 07:34:36,610
version of that z, since we want the top
7%, so we're going to use 1.48, the opposite
4934
07:34:36,610 --> 07:34:44,458
z, and now we're going to plug it into the
equation. So 1.48 times 14.5, which is the
4935
07:34:44,458 --> 07:34:52,420
standard deviation plus 65.5 equals 87. So
now 87 is the score that marks the top
4936
07:34:52,420 --> 07:35:01,740
7% of the scores. I'm going to do another
exercise for you that does, this time,
4937
07:35:01,740 --> 07:35:06,270
the bottom 3% of the scores because this is
often kind of challenging for students. So
4938
07:35:06,270 --> 07:35:10,980
I'll just give you a second demonstration.
So as you can imagine, we're going on the
4939
07:35:10,980 --> 07:35:20,020
hunt now for z at p equals 0.0300. So let's
go over to the Z table. All right, now we're
4940
07:35:20,020 --> 07:35:23,900
getting a little good at this, right? So we're
digging around in the middle, and we're looking
4941
07:35:23,900 --> 07:35:33,620
for 0.0300. Okay, and starting at the top,
we're in the point 00 department. Oh, here's point
4942
07:35:33,620 --> 07:35:45,810
01 something, 02. Okay, we're getting close
to the point 0300. So, oh, point 0301.
4943
07:35:45,810 --> 07:35:50,720
Could you ask for anything closer? Totally
Perfect. Okay, so that's what we're going
4944
07:35:50,720 --> 07:35:59,208
to use for our z, the z at 0.0301. So
let's look up that z. So that z is negative
4945
07:35:59,208 --> 07:36:04,500
1.8. And then we look up eight, so it's negative
1.88.
4946
07:36:04,500 --> 07:36:06,420
Hold that thought.
4947
07:36:06,420 --> 07:36:13,550
All right. Well, we were on the hunt for P
equals 0.0300, and we didn't
4948
07:36:13,550 --> 07:36:20,878
find that. But we did find p equals point
0301 in the table, and that mapped back
4949
07:36:20,878 --> 07:36:28,450
to z equals negative 1.88. Right. And now
we go back to the question, we see that we
4950
07:36:28,450 --> 07:36:35,110
want the bottom 3%, so we keep the negative.
Now if I'd asked about the top 3%, we'd lose
4951
07:36:35,110 --> 07:36:40,440
the negative and use 1.88 in the equation,
but since we want the bottom 3%, we're going
4952
07:36:40,440 --> 07:36:47,030
to keep the negative. Okay, so now let's do
the equation. So x equals and then in the
4953
07:36:47,030 --> 07:36:53,320
parentheses negative 1.88 times 14.5, which
is our standard deviation, then plus our mu,
4954
07:36:53,320 --> 07:37:00,930
which is 65.5. And the score we get is 38.2.
So 38.2 is the score that marks the bottom
4955
07:37:00,930 --> 07:37:09,120
3% of scores, and just be happy your score
is not in there. Okay, now, here's another
4956
07:37:09,120 --> 07:37:15,060
challenging hard question. What if the question
4956
07:37:09,120 --> 07:37:15,060
on the test, or probably not in real life,
but on a test, says what scores mark the middle
20% of the data. And so I put little arrows
4958
07:37:21,430 --> 07:37:25,450
on there just to point out well, when they
say middle, they mean, it's hugging
4959
07:37:25,450 --> 07:37:26,910
the mu,
4960
07:37:26,910 --> 07:37:31,290
it's actually assuming that there's gonna
be 10% on the right side of the mu, and
4961
07:37:31,290 --> 07:37:37,840
10% on the left side of the mu. And so how
you start to do this is you figure out the
4962
07:37:37,840 --> 07:37:47,040
z score for one minus point two, which is
the 20%, divided by two, which equals point four,
4963
07:37:47,040 --> 07:37:53,560
right? So then after that, you know, because
one minus point two is point eight, and point
4964
07:37:53,560 --> 07:37:59,720
eight divided by two is point four. So we
get this point four. So we go find the z score
4965
07:37:59,720 --> 07:38:05,640
at point four, which you're good at using
the Z table now. So I, you know,
4966
07:38:05,640 --> 07:38:11,433
looked around, and I found point 4013,
digging around in the middle of the
4967
07:38:11,433 --> 07:38:19,458
Z table, and that mapped back to z equals
negative point two five, right. And so that
4968
07:38:19,458 --> 07:38:24,840
is then what I would put on for the lower
limit on that one, and then z equals point
4969
07:38:24,840 --> 07:38:30,282
two five, the positive version, goes on the
other side. So once you figured out both of
4970
07:38:30,282 --> 07:38:34,970
the Z's, the Z on the left and the Z on the
right, you just have to put them through the
4971
07:38:34,970 --> 07:38:41,610
equation. So for the left side, we use the
negative z. And for the right side, we use
4972
07:38:41,610 --> 07:38:46,280
the positive Z. And that's how we get our
limits. So what scores mark the middle 20%
4973
07:38:46,280 --> 07:38:54,230
of the data? 61.9 and 69.1. Is that not weird
how that worked out? But anyway, 61.9 and
4974
07:38:54,230 --> 07:38:59,760
69.1 mark the middle 20% of the data. I
totally didn't do that on purpose. It just
4975
07:38:59,760 --> 07:39:06,140
worked out that way. All right, I can't believe
you made it through all this. I'll bet your
4976
07:39:06,140 --> 07:39:11,290
brain is ready to explode. So now is a good
time to talk about just a little review. Just
4977
07:39:11,290 --> 07:39:17,930
help me come down a little bit from this whole
really intense lecture. Okay. So first, I'm
4978
07:39:17,930 --> 07:39:24,330
going to do a little Z score quiz game show
style stuff here, right? So if you ever get
4979
07:39:24,330 --> 07:39:28,050
the question when you're on the test, and
you're like, Oh, my gosh, where is x? Where's
4980
07:39:28,050 --> 07:39:33,750
x? Well, if you can't find x, it's usually
in the question. So usually, the way these
4981
07:39:33,750 --> 07:39:40,370
questions go is somebody like maybe me, we'll
put a mu and a standard deviation at the top
4982
07:39:40,370 --> 07:39:45,820
of the question. And then there'll be like,
maybe five questions about that pertain to
4983
07:39:45,820 --> 07:39:50,570
that mu and that standard deviation, but they
ask about different x's. And when I would
4984
07:39:50,570 --> 07:39:53,730
teach this class in person, you know, people
will come running up to me in the middle of
4985
07:39:53,730 --> 07:39:58,660
a test, which you probably shouldn't do. And
they would say, where's the x? Where's the
4986
07:39:58,660 --> 07:40:02,900
x? You gave me, you know, these pieces of the
equation, but I can't find the x. And I'd be
4987
07:40:02,900 --> 07:40:07,580
like, look in the question. Look in the question,
you know, because I don't want to give it
4988
07:40:07,580 --> 07:40:11,970
away, and then they'd all run back to their
seats and find it. So if you're
4989
07:40:11,970 --> 07:40:17,560
wondering, you're panicking, where's x? Look
in the question, it's usually in the question.
4990
07:40:17,560 --> 07:40:23,410
Okay, so let's say you find an X, and what
do you do with an x? Okay, and you're stuck
4991
07:40:23,410 --> 07:40:28,500
with an X, what do you do? Well, usually, what
you have to do is calculate a z score. So
4992
07:40:28,500 --> 07:40:33,330
remember, if you've got an X, you probably
have a mu and a standard deviation, you can
4993
07:40:33,330 --> 07:40:37,410
calculate a z score on that. So if you're
panicking on a test, and you have an x, a
4994
07:40:37,410 --> 07:40:41,952
mean, a standard deviation, just for fun, calculate
a z score and see if it gets you anywhere.
4995
07:40:41,952 --> 07:40:46,620
Okay, well, let's say you have a z score,
what do you do with a Z score? Well, you always
4996
07:40:46,620 --> 07:40:51,140
look it up, right? I mean, if you're, if you're
going this direction, if you're getting if
4997
07:40:51,140 --> 07:40:56,340
you started with an X, and you get a Z, you
got to go to the Z table with it. Okay, so that's
4998
07:40:56,340 --> 07:41:00,031
your next step. So if you're doing all this
work, calculate a z score. And then you're
4999
07:41:00,031 --> 07:41:05,570
done. You're like, Oh, my gosh, what's my
next step? Go look at the Z table. Well, what
5000
07:41:05,570 --> 07:41:10,792
if the question asks for an x, right? Well,
remember, we have a whole formula for that.
5001
07:41:10,792 --> 07:41:17,320
So use the x formula. So if there's no x anywhere,
and it's asking for an x, then use the other
5002
07:41:17,320 --> 07:41:18,320
formula, use the
5003
07:41:18,320 --> 07:41:20,260
x formula.
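As a recap of the percent questions worked earlier (top 7%, bottom 3%, middle 20%), the table-digging step and the x formula can be sketched together. This is only a sketch under stated assumptions: it needs Python 3.8 or later, it uses statistics.NormalDist.inv_cdf as a software stand-in for hunting through the middle of the Z table, and it rounds z to two decimals to imitate the table. The helper names are hypothetical, and 65.5 and 14.5 are the mean and standard deviation from the lecture's example.

```python
from statistics import NormalDist

Z = NormalDist()          # standard normal curve
MU, SIGMA = 65.5, 14.5    # mean and standard deviation from the lecture's example

def z_for_left_area(p: float) -> float:
    """z whose area to the left is p, rounded to two decimals like a Z table row and column."""
    return round(Z.inv_cdf(p), 2)

def x_from_z(z: float) -> float:
    """The x formula: z times the standard deviation, then add mu."""
    return z * SIGMA + MU

z_bottom3 = z_for_left_area(0.03)         # bottom 3%: keep the negative z
z_top7 = -z_for_left_area(0.07)           # top 7%: find the bottom 7% z, then use the opposite z
z_mid = z_for_left_area((1 - 0.20) / 2)   # middle 20%: 40% of the curve sits below the lower limit

print(f"bottom 3% cutoff: z = {z_bottom3}, x = {x_from_z(z_bottom3):.1f}")
print(f"top 7% cutoff:    z = {z_top7}, x = {x_from_z(z_top7):.1f}")
print(f"middle 20%:       x from {x_from_z(z_mid):.1f} to {x_from_z(-z_mid):.1f}")
```

Without the two-decimal rounding the software z values differ slightly from the table's, so the x values can be off in the last digit compared with a hand calculation.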
5004
07:41:20,260 --> 07:41:26,128
And what if the question gives you a P, or
I just said p for probability, but it could
5005
07:41:26,128 --> 07:41:30,950
be a percentage, like, remember, the top
7%, and the bottom 3%? Well, if they give
5006
07:41:30,950 --> 07:41:37,048
you a percent, just start digging around in
the middle of the Z table, just start digging
5007
07:41:37,048 --> 07:41:41,040
around looking for that percent. Because once
you start digging around, you realize that
5008
07:41:41,040 --> 07:41:45,590
maps back to a z. And then you can get into
the groove of using the x formula, and you'll
5009
07:41:45,590 --> 07:41:52,580
probably get yourself out of this panic. So
here are some final tips and tricks for getting
5010
07:41:52,580 --> 07:41:58,340
z scores and probabilities, right? And I've
said this one before, draw a picture. And
5011
07:41:58,340 --> 07:42:03,140
what do I mean by that? Graph out the question,
draw the curve, draw the line for mu, which
5012
07:42:03,140 --> 07:42:08,330
goes in the middle. And where the X goes above
or below the mu. Just start with that. It doesn't
5013
07:42:08,330 --> 07:42:13,170
have to be to scale. But mainly, you want
to get those elements in there. If there's one x,
5014
07:42:13,170 --> 07:42:18,040
shade the part of the curve wanted, either
above the X or below the x, you know, just
5015
07:42:18,040 --> 07:42:22,760
color it in. So that you get an idea of Do
you want the big part, the one that's greater
5016
07:42:22,760 --> 07:42:28,378
than 50%, or the little part, the one that's
less than 50%? If there are two x's, then
5017
07:42:28,378 --> 07:42:33,700
shade in the area wanted, which is usually
in between them. If it's a calculate the x
5018
07:42:33,700 --> 07:42:39,900
question, put where the Z or the P is. So
if it was like the top 7%, you could shade
5019
07:42:39,900 --> 07:42:44,792
in the top little part of the curve. If it
was the bottom 3%, you could shade in the
5020
07:42:44,792 --> 07:42:50,010
bottom little part of the curve. So make this
picture and do it at the beginning. Okay,
5021
07:42:50,010 --> 07:42:54,720
then, note that x is usually in the question.
If you can't find x, and you're trying to
5022
07:42:54,720 --> 07:42:58,660
do the Z formula, and you're saying, Okay,
I'm trying to make a z score. That's what
5023
07:42:58,660 --> 07:43:02,890
it asks for. I'm trying to find a probability.
That's what it asks for. Look in the question,
5024
07:43:02,890 --> 07:43:08,650
and you'll probably find the x in there.
A big problem that I see is people mistake
5025
07:43:08,650 --> 07:43:15,590
little z's for p's. Now, obviously, if you've
got a Z, that's like negative, you know, a,
5026
07:43:15,590 --> 07:43:18,542
p can't be negative, a probability can't be
negative. So you won't make that mistake.
5027
07:43:18,542 --> 07:43:25,510
Even if it's like negative point two, five,
right? You won't make that mistake. And if
5028
07:43:25,510 --> 07:43:30,100
the Z is bigger than one, you won't make that
mistake. So if you see a z equals 2.5, you're
5029
07:43:30,100 --> 07:43:34,900
like, obviously, that's not a probability.
But when you have a little baby z score that's
5030
07:43:34,900 --> 07:43:41,700
between zero and one, like point zero two three,
it looks a lot like a P, but it's still a
5031
07:43:41,700 --> 07:43:45,440
z. So a lot of times people get a little lazy,
like they hate using the Z table, and
5032
07:43:45,440 --> 07:43:49,030
then they calculate the z score, and it's
really little, so they don't look it up. Don't
5033
07:43:49,030 --> 07:43:51,490
be fooled. You still have to look it up. So
5034
07:43:51,490 --> 07:43:56,030
if you're calculating z, and you get a little
baby z like that, it's still a z. Still go look
5035
07:43:56,030 --> 07:44:00,890
it up. Okay. Then finally, remember how step
one was draw a picture. And I went on and
5036
07:44:00,890 --> 07:44:06,450
on about that. Step 99. Or the last step before
you're done with the question is check your
5037
07:44:06,450 --> 07:44:11,202
logic against that picture. So if you shaded
a big part of your picture, your probability
5038
07:44:11,202 --> 07:44:17,490
should be bigger than point five, or 50%.
If you shaded a little tiny part of your picture,
5039
07:44:17,490 --> 07:44:21,570
and you're getting like point nine, five,
something, you know that that's wrong. So
5040
07:44:21,570 --> 07:44:26,050
please check your logic against the picture.
Before you say that you're done with your
5041
07:44:26,050 --> 07:44:35,160
question. Okay. So you made it through this
long lecture about z, and about probabilities.
5042
07:44:35,160 --> 07:44:39,872
So I gave you an introduction to the standard
normal curve and to those two z score formulas.
5043
07:44:39,872 --> 07:44:45,570
I showed you how to calculate z scores, and
how to look at probabilities. And I also showed
5044
07:44:45,570 --> 07:44:51,400
you at the end, how to calculate x if given
a z score or a probability. Okay, and all
5045
07:44:51,400 --> 07:44:56,410
I want to say is, unfortunately, those students
those pretend students on that distribution,
5046
07:44:56,410 --> 07:45:02,378
well, none of them got 100%, okay? That's
not the case in our class, a lot of times
5047
07:45:02,378 --> 07:45:08,820
people get 100% on the quizzes. That's why
I can't use your grades as examples. Okay,
5048
07:45:08,820 --> 07:45:16,700
so good luck on the quiz. Well, hello, it's
time for statistics. It's Monica Wahi, your
5049
07:45:16,700 --> 07:45:24,870
Labouré College lecturer, back with chapter
7.4 and 7.5 sampling distributions and the
5050
07:45:24,870 --> 07:45:31,040
central limit theorem. So at the end of this
lecture, you should be able to state the new
5051
07:45:31,040 --> 07:45:36,840
statistical notation for parameters and statistics,
for two measures of variation.
5052
07:45:36,840 --> 07:45:38,792
Name one type
5053
07:45:38,792 --> 07:45:44,510
of inference and describe it. Explain the
difference between a frequency distribution
5054
07:45:44,510 --> 07:45:50,970
and a sampling distribution, describe the
central limit theorem in either words or formulas,
5055
07:45:50,970 --> 07:45:57,490
and also describe how to calculate the standard
error. So, here's your introduction to this
5056
07:45:57,490 --> 07:46:03,798
lecture. And as you can see, I mushed 7.4 and
7.5 together. Again, they felt like a natural
5057
07:46:03,798 --> 07:46:09,960
fit. First, we're going to review and maybe
overview on parameters, statistics, and also
5058
07:46:09,960 --> 07:46:15,860
inferences, we're going to just talk about
those ideas, because that will sort of ease
5059
07:46:15,860 --> 07:46:21,270
into the next part, which is where we start
talking about sampling distribution, which
5060
07:46:21,270 --> 07:46:26,650
is the new concept here. Okay. And then we'll
go on to talk about the central limit theorem.
5061
07:46:26,650 --> 07:46:32,202
And finally, I'll do a little demonstration
of how to find probabilities regarding x
5062
07:46:32,202 --> 07:46:33,202
bar.
5063
07:46:33,202 --> 07:46:35,690
So if you're not really sure about what that
means, don't worry, you should be able to
5064
07:46:35,690 --> 07:46:43,160
understand it at the end of this lecture.
All right, here's the first part, parameters,
5065
07:46:43,160 --> 07:46:49,270
statistics and inferences. And this is the
review and overview I promised you. So if
5066
07:46:49,270 --> 07:46:54,730
you remember from a long time ago, a statistic
is a numerical measure describing a sample.
5067
07:46:54,730 --> 07:47:01,820
And a parameter is a numerical measure describing
a population. Remember: s, s, sample statistic;
5068
07:47:01,820 --> 07:47:09,150
p, p, population parameter. You probably remember
that. Okay, so we have different ways of notating
5069
07:47:09,150 --> 07:47:14,872
these. So if you look under measure, like
you see mean, right, and if it's a statistic,
5070
07:47:14,872 --> 07:47:20,130
it's x bar, and I say "x bar" on the
slide sometimes because it's hard to make
5071
07:47:20,130 --> 07:47:25,240
that little line always be positioned above
the x. So I'm just lazy to say x bar. And
5072
07:47:25,240 --> 07:47:30,940
then under parameter, it's that mu symbol,
so it's pronounced "mu," but it looks like
5073
07:47:30,940 --> 07:47:36,230
that thing on the slide. All right, um, the
next two, variance and standard deviation,
5074
07:47:36,230 --> 07:47:43,000
remember how they're friends? And so the statistic
version is the s. For variance, it's the s
5075
07:47:43,000 --> 07:47:50,220
with the little two up there, the exponent,
because you know, it's standard deviation
5076
07:47:50,220 --> 07:47:54,510
to the second is variance, and the square root
of variance is a standard deviation.
5077
07:47:54,510 --> 07:48:01,130
So that's why they have s and then S to the
second for the statistic, okay. For the parameter,
5078
07:48:01,130 --> 07:48:06,970
it's that lowercase sigma symbol. And that's
it's that to the second when it's variance,
5079
07:48:06,970 --> 07:48:15,000
and it's just without the exponent, when it's
just the regular parameter of standard deviation,
5080
07:48:15,000 --> 07:48:16,000
right.
5081
07:48:16,000 --> 07:48:19,490
And you're used to seeing these on the slides.
This is just review. Um, also, as mentioned
5082
07:48:19,490 --> 07:48:26,282
in the book proportion is p hat, and then
the parameter is P. But I don't really go
5083
07:48:26,282 --> 07:48:32,810
into that. I just wanted to do a little shout
out to it. Okay, let's think about the word
5084
07:48:32,810 --> 07:48:38,990
inference, like infer, like, if somebody implies
something, maybe you'll infer it. Like, he
5085
07:48:38,990 --> 07:48:44,180
implied, it would be hard if I came over late
that night. So I inferred that I shouldn't
5086
07:48:44,180 --> 07:48:50,110
come over late then. So like here, you know,
you may have heard the term where there's
5087
07:48:50,110 --> 07:48:56,160
smoke, there's fire. And so you see this on
the slide, there's a lot of smoke. Is there
5088
07:48:56,160 --> 07:49:01,700
fire, though, is that smoke coming from fire?
Because if you look at it, it probably could
5089
07:49:01,700 --> 07:49:08,660
be coming from fire. But there's sort of this
outside chance. It's not what we think it
5090
07:49:08,660 --> 07:49:13,070
is, like maybe, you know, if you've
ever used a fire extinguisher, they make all
5091
07:49:13,070 --> 07:49:18,850
this foam come out. Maybe it's that, you
know, or maybe it's like, if you've ever had
5092
07:49:18,850 --> 07:49:24,840
dry ice, and then that makes a bunch of smoke.
Maybe it's not fire, right? So where there's
5093
07:49:24,840 --> 07:49:28,692
smoke, there's fire. That's an inference.
Well, let's see
5094
07:49:28,692 --> 07:49:30,420
if it's actually fire,
5095
07:49:30,420 --> 07:49:35,500
right. But we weren't sure we thought it was
likely to be fire. But we weren't sure. And
5096
07:49:35,500 --> 07:49:41,200
so inference is something that you
do in statistics, because you use probability
5097
07:49:41,200 --> 07:49:45,130
to make these inferences because you can't
see the fire. You can just see the smoke and
5098
07:49:45,130 --> 07:49:49,890
you're not sure, right? So there's three different
kinds I'm going to talk about. The first kind
5099
07:49:49,890 --> 07:49:55,114
is estimation, where we estimate the value
of a parameter using a sample. So the sample
5100
07:49:55,114 --> 07:50:00,010
is kind of like the smoke, and the parameter's
the fire we can't see. So we estimate
5101
07:50:00,010 --> 07:50:07,440
Okay, and we're going to talk about that in
chapter eight more. A second type of
5102
07:50:07,440 --> 07:50:12,160
inference we do is testing, where we do a
test to help us make a decision about a population
5103
07:50:12,160 --> 07:50:17,130
parameter. In other words, we don't know one,
but we want to make a decision about it. So
5104
07:50:17,130 --> 07:50:22,860
we do a statistical test. And we're not going
to get into that, that's in chapter nine.
5105
07:50:22,860 --> 07:50:28,200
Finally, there's regression, where we make
predictions or forecasts about a statistic,
5106
07:50:28,200 --> 07:50:34,560
that's a third kind of inference. And we actually
already did this in chapter 4.2. So the reason
5107
07:50:34,560 --> 07:50:42,260
why I bring up all of this is that estimation,
which is going to be in chapter eight, and
5108
07:50:42,260 --> 07:50:45,510
testing, which is going to be in chapter nine,
but we're not going over chapter nine in this
5109
07:50:45,510 --> 07:50:52,360
class. But um, but if we were, you know, you'd
have to know this because in this lecture,
5110
07:50:52,360 --> 07:50:57,180
I'm going to talk about sampling distributions
and the central limit theorem. And you need
5111
07:50:57,180 --> 07:51:01,708
to grasp those things in order to do
these two things on the slide with the
5112
07:51:01,708 --> 07:51:07,372
box around them, estimation, and testing.
And so that's why I'm bringing this up now.
5113
07:51:07,372 --> 07:51:13,360
Okay, so now we're going to move on to talking
about sampling distribution, and how it's
5114
07:51:13,360 --> 07:51:20,830
different from a frequency distribution. Alright,
so let's just remind ourselves what a frequency
5115
07:51:20,830 --> 07:51:26,470
distribution actually is. Okay? So remember
that from a long time ago, what you would
5116
07:51:26,470 --> 07:51:33,680
have is a quantitative variable, you'd make
a frequency table. And then you use that to
5117
07:51:33,680 --> 07:51:39,260
graph the histogram, right. And here, I made
an example down there of a frequency histogram
5118
07:51:39,260 --> 07:51:43,200
that shows a normal distribution. And so that's
what you would do, you know, step two would
5119
07:51:43,200 --> 07:51:50,080
be draw it. And then you see the shape and
figure out what the distribution was of that
5120
07:51:50,080 --> 07:51:58,362
quantitative variable, or that x, okay, because
each one of these is an X, like the middle
5121
07:51:58,362 --> 07:52:04,100
one, it's almost 30 X's that are in that frequency.
Okay, now we're going to talk about sampling
5122
07:52:04,100 --> 07:52:09,730
distribution, it's a little more complicated.
In a sampling distribution, you start out
5123
07:52:09,730 --> 07:52:14,230
with a population, that's the first thing
is you're dealing with population, then you
5124
07:52:14,230 --> 07:52:20,050
pick an n of a certain size, like you pick
a number, that you're going to have your sample
5125
07:52:20,050 --> 07:52:28,160
size be. And then you take as many samples
of that size as possible from the population.
5126
07:52:28,160 --> 07:52:34,500
And then you make an x bar from each of the
samples. So there's a ton of samples, right?
5127
07:52:34,500 --> 07:52:38,110
Because and I'll show you a little demonstration.
So you can really wrap your mind around how
5128
07:52:38,110 --> 07:52:43,630
many different samples that can be. But each
one is going to have an x bar. And then you
5129
07:52:43,630 --> 07:52:47,930
make a histogram of all those x bars. So like
I said, I'm going to just kind of show you
5130
07:52:47,930 --> 07:52:53,202
what I'm talking about. So we're going to
imagine this is a population of people. And
5131
07:52:53,202 --> 07:52:57,490
we're going to imagine we're going to talk
about BMI or body mass index, just so you
5132
07:52:57,490 --> 07:53:01,878
can wrap your mind around this. So you start
with this population. Let's decide on an n.
5133
07:53:01,878 --> 07:53:08,320
How about five? Five is good, right? So now
what the deal is, is I'm trying to take as
5134
07:53:08,320 --> 07:53:15,000
many samples of n as possible from all of
these people on the slide. So here's our first
5135
07:53:15,000 --> 07:53:21,030
sample we took, and we got an x bar for BMI
of 23. From these five people. Well, let's
5136
07:53:21,030 --> 07:53:25,590
try these five people. Now, look, we double
dipped with that first one, okay, but we get
5137
07:53:25,590 --> 07:53:32,090
this x bar of 21. And we can keep going. And
actually, there's gonna be a ton of these,
5138
07:53:32,090 --> 07:53:37,160
right, there's a ton of different ones. But
it's finite. I mean, at the end of the day,
5139
07:53:37,160 --> 07:53:42,600
there's only so many groups of five, I can
get out of this population on the slide, and
5140
07:53:42,600 --> 07:53:48,910
each group of five is going to have its own
x bar. So I could write down every single
5141
07:53:48,910 --> 07:53:53,730
one of those x bars I get for every single
group of five I can make out of this. And
5142
07:53:53,730 --> 07:53:59,740
then I can make a histogram of all the x bars.
And, of course, I'd start with a frequency
5143
07:53:59,740 --> 07:54:05,150
table. But look at the frequencies, they're
huge. That's because you can get just a ton
5144
07:54:05,150 --> 07:54:12,292
of samples out of one population. And so what
you'll see is if you make a histogram out
5145
07:54:12,292 --> 07:54:17,692
of that, it looks normally distributed, it's
just that the frequencies are really high,
5146
07:54:17,692 --> 07:54:21,910
because there's a whole bunch of different
samples you can take. And remember, this is
5147
07:54:21,910 --> 07:54:29,690
a frequency histogram of x bars. This is each
one of these frequencies is an x bar that
5148
07:54:29,690 --> 07:54:35,870
you got out of a group of five you could take.
And so that's what the sampling distribution
5149
07:54:35,870 --> 07:54:41,730
is, it ends up looking like a histogram, but
it's a histogram of all the possible x bars
5150
07:54:41,730 --> 07:54:47,540
you could get from all the possible samples
of whatever n size you picked from the population
5151
07:54:47,540 --> 07:54:49,890
that you
5152
07:54:49,890 --> 07:54:51,060
have.
5153
07:54:51,060 --> 07:54:57,010
So uh, so this is the fancy way, the official
statistical way of saying it is a sampling
5154
07:54:57,010 --> 07:55:03,850
distribution is a probability distribution
of a sample statistic, in this case x bar,
5155
07:55:03,850 --> 07:55:10,690
based on all possible simple random samples
of the same size from the same population.
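A minimal sketch of the procedure just described, assuming a made-up population of 12 BMI values (the numbers are invented for illustration and are not from the slide). It lists every distinct group of five, computes one x bar per group, and builds the frequency table of x bars that the histogram would be drawn from.

```python
from collections import Counter
from itertools import combinations
from math import comb
from statistics import mean

# Hypothetical BMI values for a population of 12 people (invented numbers).
population = [18, 19, 21, 22, 22, 23, 24, 25, 26, 28, 30, 33]
n = 5  # the sample size picked in the lecture's demonstration

# Every distinct group of n people that can be formed from the population.
samples = list(combinations(population, n))
print(len(samples), comb(len(population), n))  # the two counts agree

# One x bar per sample, rounded to a whole number to make table rows.
x_bars = [mean(s) for s in samples]
freq_table = Counter(round(xb) for xb in x_bars)

# Frequency table of x bars: this is what the histogram is drawn from.
for value in sorted(freq_table):
    print(value, freq_table[value])
```

Even with only 12 people there are hundreds of possible samples of five, which is why the frequencies in a table like this get large so quickly.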
5156
07:55:10,690 --> 07:55:15,792
So that's what makes it the sampling distribution
and not a frequency distribution. And so in
5157
07:55:15,792 --> 07:55:19,900
the next section, so you're probably like,
Okay, great, that's wonderful. You just explained
5158
07:55:19,900 --> 07:55:23,610
that. But in the next section, we're going
to talk about the central limit theorem, here
5159
07:55:23,610 --> 07:55:28,390
comes a theorem, right. And there's a proof
for the theorem. And you need to understand
5160
07:55:28,390 --> 07:55:34,042
this concept of sampling distribution for
inference in order to understand this proof,
5161
07:55:34,042 --> 07:55:40,900
so I just had to go through this. Okay, now
we're on to the central limit theorem, and
5162
07:55:40,900 --> 07:55:48,542
how it's used for statistical inference. So
I'm gonna start by explaining it in words
5163
07:55:48,542 --> 07:55:54,110
and see that sampling distributions over there.
So this is the words around the central limit
5164
07:55:54,110 --> 07:55:58,970
theorem. It says: for any normal distribution,
and remember, we're talking about a normal
5165
07:55:58,970 --> 07:56:04,270
distribution here, the sampling distribution,
meaning the distributions of the x bars from
5166
07:56:04,270 --> 07:56:09,272
all possible samples, like we just talked
about, is a normal distribution, meaning it's
5167
07:56:09,272 --> 07:56:14,600
not skewed, it's not bimodal, whatever; it
looks kinda like what is on the slide. Okay.
5168
07:56:14,600 --> 07:56:23,590
And then two, this is important: the mean of
the x bars is actually mu. So I had a student
5169
07:56:23,590 --> 07:56:31,260
who would say, Oh, the x bar of the x bars,
is mu. And that's actually true. If you actually
5170
07:56:31,260 --> 07:56:35,560
did the thing I described, which don't try
it at home, because you'll be up all night
5171
07:56:35,560 --> 07:56:41,700
taking samples, okay. But if you did, if you
actually got all samples of five from a population,
5172
07:56:41,700 --> 07:56:49,090
and got all their x bars, and you made a mean
of all those x bars, you'd get mu and how
5173
07:56:49,090 --> 07:56:53,240
you could check it is, of course, just easily
taking a mean of the entire population like
5174
07:56:53,240 --> 07:56:57,080
that would have been the easy way to do it.
But no, if you do it this way, where you get
5175
07:56:57,080 --> 07:57:00,863
every possible x bar for a particular sample
size, and then you make an x bar, those x
5176
07:57:00,863 --> 07:57:05,850
bars, you'll get mu. So that's, you know,
it's a proof. So that sounds like a thing,
5177
07:57:05,850 --> 07:57:10,840
that would be in a proof, right? Now, here's
the next part, three: the standard deviation
5178
07:57:10,840 --> 07:57:17,798
of all those x bars is actually the population
standard deviation divided by the square root
5179
07:57:17,798 --> 07:57:23,110
of whatever n you picked. So in other words,
if you have the whole population data, and
5180
07:57:23,110 --> 07:57:27,000
you just found out the standard deviation,
you just have the standard deviation. But
5181
07:57:27,000 --> 07:57:30,890
if you did this thing with the x bar, where
you took all those x bars, and you found the
5182
07:57:30,890 --> 07:57:36,840
standard deviation of those x bars, that would
equal the population standard deviation divided
5183
07:57:36,840 --> 07:57:43,192
by the square root of whatever n, you use
to get all those x bars, again, sounds really
5184
07:57:43,192 --> 07:57:47,370
proofy in theory, but that's the third part
of the central limit theorem
5185
07:57:47,370 --> 07:57:48,780
in words.
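The second and third statements can be checked by brute force on a tiny example. This is a sketch under stated assumptions: the six population values are invented, and the samples are drawn with replacement (an assumption made here so that the sigma over square root of n relationship holds exactly; when groups are drawn without replacement from a small finite population, a finite-population correction makes the spread of the x bars somewhat smaller).

```python
from itertools import product
from math import sqrt
from statistics import mean, pstdev

# Hypothetical population values (invented for illustration).
population = [19, 21, 22, 23, 25, 28]
n = 2  # kept small so that every possible sample can be listed

# Every possible ordered sample of size n, drawn with replacement.
samples = list(product(population, repeat=n))
x_bars = [mean(s) for s in samples]

mu = mean(population)       # population mean
sigma = pstdev(population)  # population standard deviation

# Statement two: the mean of all the x bars equals mu.
print(mu, mean(x_bars))

# Statement three: the SD of all the x bars equals sigma / sqrt(n).
print(sigma / sqrt(n), pstdev(x_bars))
```

Each printed pair should agree up to floating-point rounding, which is the "x bar of the x bars is mu" idea in runnable form.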
5186
07:57:48,780 --> 07:57:54,770
And so, some people like to look at
it from a formula standpoint. So you'll see
5187
07:57:54,770 --> 07:57:58,792
on the right side of the slide, in this little,
these little formulas, that n means the sample
5188
07:57:58,792 --> 07:58:03,670
size. And remember, I picked five, you could
pick a different one, right? And mu is the
5189
07:58:03,670 --> 07:58:09,452
mean of the x distribution, meaning the population
mean, right. And then that population standard
5190
07:58:09,452 --> 07:58:13,048
deviation symbol is the standard deviation
of the x distribution, meaning the population
5191
07:58:13,048 --> 07:58:18,480
standard deviation. So we look on the left.
Now this is just a formula version of what
5192
07:58:18,480 --> 07:58:24,540
I just said: the mu of all the x bars that you could
get from a particular sample size in a particular
5193
07:58:24,540 --> 07:58:28,960
population is going to equal the mean of the
population. And the standard deviation of
5194
07:58:28,960 --> 07:58:33,530
all those x bars is going to equal the population
standard deviation divided by the square root
5195
07:58:33,530 --> 07:58:41,042
of whatever n you picked. So now, I just want
to point out the Z thing. We've been doing
5196
07:58:41,042 --> 07:58:47,480
this z thing, right, but we've been doing
it with one x. Now, if you imagine grabbing a
5197
07:58:47,480 --> 07:58:53,430
bunch of x's, in other words, a sample, this
is the formula you're going to be using, which
5198
07:58:53,430 --> 07:59:01,820
is x bar minus mu over the standard deviation
divided by the square root of n, right? And
5199
07:59:01,820 --> 07:59:07,620
so that's kind of what we're moving into here
is what happens if you get a sample and you're
5200
07:59:07,620 --> 07:59:15,640
looking at x bar, not if you just grab one x.
And you're looking at that. So I wanted to
5201
07:59:15,640 --> 07:59:21,510
point out, first of all, that this whole thing
is only supposed to happen if your n is greater
5202
07:59:21,510 --> 07:59:28,170
than 30. Okay? Otherwise, you shouldn't really
be doing this. Then the second thing I wanted
5203
07:59:28,170 --> 07:59:35,202
to point out is that this piece underneath,
in the lower part of the equation, that's
5204
07:59:35,202 --> 07:59:41,440
called the standard error, they named that
piece. And part of the reason why I like that
5205
07:59:41,440 --> 07:59:47,670
they named that piece separately, is I usually
make that piece before I even do the equation.
5206
07:59:47,670 --> 07:59:52,270
So I just have that number sitting around
because, you know, there's a square root underneath
5207
07:59:52,270 --> 07:59:57,862
this standard deviation, and that whole thing
is underneath another thing so it's hard to
5208
07:59:57,862 --> 08:00:03,530
do all that dividing. So I usually just make
that standard error first, by taking the standard
5209
08:00:03,530 --> 08:00:07,250
population standard deviation divided by the
square root of n and just have that number
5210
08:00:07,250 --> 08:00:12,470
and then later I use it in this z equation.
So that's two things I wanted you to notice.
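Since the slide itself is not visible in the transcript, here is a reconstruction of the formulas as they were just described in words. The subscripted symbols for the mean and standard deviation of the x bars are common textbook notation and are an assumption here, not necessarily what the slide shows.

```latex
\mu_{\bar{x}} = \mu
\qquad
\sigma_{\bar{x}} = \frac{\sigma}{\sqrt{n}} \quad \text{(the standard error)}
\qquad
z = \frac{\bar{x} - \mu}{\sigma / \sqrt{n}}
```

Computing the standard error first, as described, just means evaluating the denominator of the z formula once and then reusing that number.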
5211
08:00:12,470 --> 08:00:18,622
So I brought that out on the slide. Okay,
here's more on the central limit theorem.
5212
08:00:18,622 --> 08:00:24,770
So if the distribution of X is normal, then
the distribution of x bar is also normal.
5213
08:00:24,770 --> 08:00:29,580
So we look at the top, that's an example of
just an X distribution. And then if you go
5214
08:00:29,580 --> 08:00:33,950
do that thing where you take all those samples,
and you get all those x bars. And then you
5215
08:00:33,950 --> 08:00:38,590
make the histogram, you'll see the pink one
down lower, an x bar distribution;
5216
08:00:38,590 --> 08:00:42,340
this is just a pictorial example.
5217
08:00:42,340 --> 08:00:50,208
But even if the distribution of X is not normal,
as long as there's more than 30, as long as n is more
5218
08:00:50,208 --> 08:00:56,580
than 30, the central limit theorem says that
the x bar distribution is approximately normal.
5219
08:00:56,580 --> 08:01:03,970
So remember, a lot of that hospital data we've
been looking at, like a hospital beds in a
5220
08:01:03,970 --> 08:01:10,890
state, often you'll see a skewed distribution.
But if you have more than 30, hospitals, then
5221
08:01:10,890 --> 08:01:18,390
what you could do is you could pick an n,
and take n bigger than 30. And take a bunch
5222
08:01:18,390 --> 08:01:22,730
of samples and get a bunch of x bars. It's
not just a bunch: get all of them, all of the
5223
08:01:22,730 --> 08:01:27,710
possible ones. And then, if you made
that x bar distribution, even though the hospital
5224
08:01:27,710 --> 08:01:33,792
beds would be skewed, just as an X distribution,
their x bar distribution would be normal.
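A rough simulation of this point, assuming a right-skewed made-up distribution standing in for hospital beds (an exponential distribution with a mean of 200, chosen only for illustration). Instead of listing every possible sample, it draws 5,000 random samples of n = 36 as a stand-in, then compares the skewness of the single x values with the skewness of the x bars.

```python
import random

def skewness(values):
    """Moment-based skewness: about 0 for a symmetric distribution."""
    count = len(values)
    center = sum(values) / count
    m2 = sum((v - center) ** 2 for v in values) / count
    m3 = sum((v - center) ** 3 for v in values) / count
    return m3 / m2 ** 1.5

rng = random.Random(0)  # fixed seed so the run is repeatable
n = 36                  # sample size, bigger than 30
num_samples = 5000      # random samples standing in for "all possible"

def draw_x():
    # Hypothetical skewed variable with a mean of 200 (invented).
    return rng.expovariate(1 / 200)

# Individual x values: the skewed x distribution.
xs = [draw_x() for _ in range(num_samples)]

# One x bar per sample of size n: the x bar distribution.
x_bars = [sum(draw_x() for _ in range(n)) / n for _ in range(num_samples)]

print(skewness(xs))      # strongly right-skewed
print(skewness(x_bars))  # much closer to 0, roughly symmetric
```

The x bars are not perfectly normal at n = 36, only approximately so, which matches the "approximately normal" wording of the theorem.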
5225
08:01:33,792 --> 08:01:38,730
And that's one other important piece of the
central limit theorem. That's one important
5226
08:01:38,730 --> 08:01:45,190
piece of that proof is that all of those x
bars that you get, will end up on a normal
5227
08:01:45,190 --> 08:01:50,290
distribution, even if your underlying distribution
is not normal, so long as the n you're picking
5228
08:01:50,290 --> 08:01:57,060
is greater than 30. And finally, that leads
to, you know, proofs, they build on each
5229
08:01:57,060 --> 08:02:01,860
other, that leads us to the concept that a
sample statistic is considered unbiased, just
5230
08:02:01,860 --> 08:02:10,190
unbiased, right? It's not perfect, but it's
unbiased. If the mean of its sampling distribution,
5231
08:02:10,190 --> 08:02:16,380
equals the parameter being estimated, in other
words, the fact that the x bar of the x bars
5232
08:02:16,380 --> 08:02:23,628
is mu means that an x bar is going to
be unbiased. It might not be mu, it might
5233
08:02:23,628 --> 08:02:29,841
not be exactly the same as the population
mean. But it will be unbiased. It's not a
5234
08:02:29,841 --> 08:02:38,280
biased representative of mu. All right, now
let's move on to finding probabilities regarding
5235
08:02:38,280 --> 08:02:42,230
x bar. So for those of you who want to actually
do something and apply something and stop
5236
08:02:42,230 --> 08:02:48,930
thinking about theory, let's go. Okay, but
let's remind ourselves, what are we doing?
5237
08:02:48,930 --> 08:02:54,910
Right? What are we doing? Well, what were
we doing in chapters 7.1 through 7.3, we were
5238
08:02:54,910 --> 08:03:01,470
looking at having a normally distributed x.
So we have this population of quantitative
5239
08:03:01,470 --> 08:03:06,360
values that were normally distributed. And
we had a population mean, a mu, and we had the
5240
08:03:06,360 --> 08:03:11,810
population standard deviation. And we kept
doing these exercises, where we were finding
5241
08:03:11,810 --> 08:03:17,298
the probability of selecting a value from
that population and x from that population
5242
08:03:17,298 --> 08:03:22,542
above or below a certain value of x, right.
And so we were looking at the probabilities,
5243
08:03:22,542 --> 08:03:27,060
and we'd look up the z score in the Z table
probabilities. And so basically, what we would
5244
08:03:27,060 --> 08:03:35,070
be doing is converting an x to z, right. And
we use this formula here to convert x to z.
5245
08:03:35,070 --> 08:03:39,650
So whenever we had an x, we could put it on
the Z distribution, and we could figure out
5246
08:03:39,650 --> 08:03:46,060
the probability. So here's what's different.
Now, you'll notice the first thing has not
5247
08:03:46,060 --> 08:03:49,920
changed, we're still talking about normally
distributed x's, we're still talking about
5248
08:03:49,920 --> 08:03:55,090
a population where we have a mu and a population
standard deviation. But now we're not just
5249
08:03:55,090 --> 08:04:02,370
grabbing one x from that population; we're grabbing
a sample. And because we're grabbing a sample,
5250
08:04:02,370 --> 08:04:07,622
we have to pick an n. So the n is going to
be different each time, right? So we're grabbing
5251
08:04:07,622 --> 08:04:11,470
a sample of the population. Well, how do we
boil that down to one number? Well, we're
5252
08:04:11,470 --> 08:04:18,640
taking the x bar, or the mean value, from that
sample. And that's what we're doing. The Z
5253
08:04:18,640 --> 08:04:26,378
score is that x bar instead of the x, because
we're taking a sample, so when you see the
5254
08:04:26,378 --> 08:04:33,230
formula below, you'll notice that the other
one just had x in it, because we only had
5255
08:04:33,230 --> 08:04:41,112
one, this one has x bar, and because we have
a sample, you also notice that downstairs,
5256
08:04:41,112 --> 08:04:45,160
what we had before was the population standard
deviation, but now
5257
08:04:45,160 --> 08:04:49,522
we have the standard error. Remember I talked
about that the population standard deviation
5258
08:04:49,522 --> 08:04:55,230
divided by the square root of n, that's where
n comes in, because it's going to matter which
5259
08:04:55,230 --> 08:05:04,160
n you have to make the z come out right.
Alright, so now that we're reminded of what
5260
08:05:04,160 --> 08:05:11,500
we're doing, we'll just explain how to do
it, right. So let's say you do have an n, right,
5261
08:05:11,500 --> 08:05:16,730
and you have an x bar, like you grabbed your
n and you got an x bar, you can convert that
5262
08:05:16,730 --> 08:05:23,030
x bar to a z score using this formula, where,
of course, you have to be told the population
5263
08:05:23,030 --> 08:05:27,311
mean and the population standard deviation,
but then you'll have your x bar and you'll
5264
08:05:27,311 --> 08:05:31,970
have your n. So you can do the whole equation.
And then you'll get a z, and guess what
5265
08:05:31,970 --> 08:05:35,030
you do. What do you do with a z? You look
it up. So you look up the probability for
5266
08:05:35,030 --> 08:05:42,260
the z score in the Z table. Like in chapter
7.2, and 7.3. Only, this is just about x bar,
5267
08:05:42,260 --> 08:05:49,650
basically. So um, and then I thought, what
I would do is walk you through two examples.
5268
08:05:49,650 --> 08:05:56,240
You're already kind of good at this, because
this is not too different from 7.2, and 7.3.
5269
08:05:56,240 --> 08:06:01,340
But I just want to walk you through it, because
it is a little different when you have a sample
5270
08:06:01,340 --> 08:06:07,470
versus just one x. Okay, so remember our poor
chemistry class that I was in when I got a
5271
08:06:07,470 --> 08:06:12,050
73? Well, remember, we were assuming it was
a 100-student class. So there were 100 students
5272
08:06:12,050 --> 08:06:17,530
in the class; N equals 100, capital N,
right, because they're the population. And
5273
08:06:17,530 --> 08:06:22,420
then if you look on the slide, you'll see
the mu of their scores was pretty bad. It
5274
08:06:22,420 --> 08:06:29,950
was 65.5 on a 100-point test, and the population
standard deviation was 14.5. So this was the
5275
08:06:29,950 --> 08:06:35,660
population of this 100-student class. So I'm
going to do some exercises here, let's say
5276
08:06:35,660 --> 08:06:40,480
we're going to pick, we have to pick, an
n bigger than 30. So we're going to pick an
5277
08:06:40,480 --> 08:06:47,220
n of 49. Right? Now, I'm coming up with a
little scenario here. To pass the class students
5278
08:06:47,220 --> 08:06:52,690
have to get at least 70, which is a C. So
let's pretend this is the question, what is
5279
08:06:52,690 --> 08:07:00,890
the probability of me selecting a sample of
49 students with an x bar greater than 70?
5280
08:07:00,890 --> 08:07:04,810
Notice how we ask the question a little bit
differently. What's the probability of me
5281
08:07:04,810 --> 08:07:10,390
getting a set of 49 students such that their
x bar is greater than 70? Does that not kind of
5282
08:07:10,390 --> 08:07:15,680
remind you of the central limit theorem, where
we had to go back and get, like, an n of five,
5283
08:07:15,680 --> 08:07:20,798
we got different samples of five? What's
the probability of me getting one of those
5284
08:07:20,798 --> 08:07:29,050
samples that has an x bar greater than
70? That's the question, right. And I drew
5285
08:07:29,050 --> 08:07:35,880
this out here, remember our old z distribution
along with our x distribution, and I kind
5286
08:07:35,880 --> 08:07:39,230
of drew where somebody is. But I wanted
to point out for you, the
5287
08:07:39,230 --> 08:07:43,798
probability for an x bar is going to be smaller
than for x, because you're going to have to
5288
08:07:43,798 --> 08:07:51,798
do a lot of work to get that x bar to be above
70. Right? So here we go. So I'm just going
5289
08:07:51,798 --> 08:07:57,980
to remind you that the equation at the top
and the equation at the bottom are the same
5290
08:07:57,980 --> 08:08:03,560
equation. I'm just using the term SE for
the standard error. And I like to calculate
5291
08:08:03,560 --> 08:08:08,780
that separately, like I told you, so I like
to do that first. So we're going to do that.
5292
08:08:08,780 --> 08:08:15,280
And how do we do that? Well, the n was 49,
right? And the population standard deviation
5293
08:08:15,280 --> 08:08:22,250
is 14.5. So that's where we get this, this
number, the standard error of 2.1. So now,
5294
08:08:22,250 --> 08:08:29,351
let's calculate the Z. All right, here's z.
So z is our x, which is our x bar, which is
5295
08:08:29,351 --> 08:08:39,360
70 minus 65.5, which is our mu, divided by
our pre-cooked standard error, which is 2.1.
5296
08:08:39,360 --> 08:08:45,710
And we get a Z of 2.17. So we're tempted to
look that up. But let's look at our picture.
5297
08:08:45,710 --> 08:08:51,770
So here's our z distribution. And what we're
going for is this little piece at the top
5298
08:08:51,770 --> 08:08:57,350
right above 2.17. So that's a little piece.
So we got to look for that right? Let's go
5299
08:08:57,350 --> 08:09:03,970
look. So because we're going to go for the
piece at the top, we're going to use the opposite
5300
08:09:03,970 --> 08:09:10,920
z. Remember, there's two ways of doing this.
But everybody seems to prefer the way where
5301
08:09:10,920 --> 08:09:15,010
you use the opposite z if you're looking for
something to the right. So we're going to
5302
08:09:15,010 --> 08:09:21,440
use negative 2.17 to get a little piece, right?
Because when you look that up, I'm not going
5303
08:09:21,440 --> 08:09:28,810
to demonstrate you guys are good at this now.
You get P equals 0.0150. If you were to look
5304
08:09:28,810 --> 08:09:35,351
up 2.17, then you'd get the big piece. So
that's why we do this. And so then the answer
5305
08:09:35,351 --> 08:09:40,852
is, remember the question was what is the
probability of me selecting a sample or a
5306
08:09:40,852 --> 08:09:47,450
set of 49 students with an x bar that's greater
than 70. And remember how this real test really
5307
08:09:47,450 --> 08:09:48,450
sucked. I mean,
5308
08:09:48,450 --> 08:09:55,370
people, that mu was 65.5. So it was pretty
hard to get a high score. So the probability
5309
08:09:55,370 --> 08:10:06,280
was pretty low, at 0.0150. Or if you
do the percent version, 1.5%. Okay, now we're
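The x bar calculation above can be sketched in a few lines of Python. This is only an illustration, not part of the original slides: the variable names are made up, and the standard library's NormalDist stands in for the z table. The z of 2.17 comes from the unrounded standard error; dividing by the rounded 2.1 would give about 2.14 instead. The last line assumes individual scores are normally distributed, purely to show that the probability for x bar is smaller than for a single x.

```python
from math import sqrt
from statistics import NormalDist

mu = 65.5      # population mean from the lecture
sigma = 14.5   # population standard deviation from the lecture
n = 49         # sample size

# Standard error first, as the lecture recommends (shown as 2.1 on the slide)
se = sigma / sqrt(n)

# z for x bar = 70, rounded to two decimals the way a z table is used
z = round((70 - mu) / se, 2)

# Area to the right of z (the "little piece at the top")
p_xbar = 1 - NormalDist().cdf(z)

# Comparison only: one student scoring above 70,
# ASSUMING individual scores are normally distributed
p_x = 1 - NormalDist(mu, sigma).cdf(70)

print(f"SE = {se:.2f}, z = {z:.2f}")
print(f"P(x bar > 70) = {p_xbar:.4f}, P(x > 70) = {p_x:.4f}")
```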
5310
08:10:06,280 --> 08:10:11,860
going to try a different one. That one was
asking what is the probability of me selecting
5311
08:10:11,860 --> 08:10:17,440
a sample with an x bar greater than a certain
number? Now we're going to talk about the
5312
08:10:17,440 --> 08:10:23,680
probability of selecting a sample with the
x bar between two numbers, right? So again,
5313
08:10:23,680 --> 08:10:30,140
we're back with our poor student class
with this terrible chemistry test. This time
5314
08:10:30,140 --> 08:10:35,150
I decided to choose an n of 36. You'll
notice that I always choose perfect squares
5315
08:10:35,150 --> 08:10:41,372
for n's because you have to take the square
root, and I'm just lazy. So okay, here's our
5316
08:10:41,372 --> 08:10:48,010
question, what is the probability of me selecting
a sample of 36 students with an x bar between
5317
08:10:48,010 --> 08:10:55,070
60 and 65? And I just drew this picture up
here to remind you that, that's gonna be on
5318
08:10:55,070 --> 08:11:00,710
the left side of mu, you know, we're going
to be dealing with negative z's, right. And
5319
08:11:00,710 --> 08:11:07,048
so we have to remember when we would have
two x's, back in 7.2 and 7.3. Well, this
5320
08:11:07,048 --> 08:11:12,250
is now a situation where we have two x bars,
so you just got to name them x bar one and
5321
08:11:12,250 --> 08:11:18,250
x bar two. And, again, I show you this demonstration,
you know, these red arrows, but the probability
5322
08:11:18,250 --> 08:11:21,960
for x bar will be smaller than for x, because
it's harder to get a whole group of people
5323
08:11:21,960 --> 08:11:28,650
together to give you an x bar in between a
certain place. Alright, so this is not new,
5324
08:11:28,650 --> 08:11:33,530
these are the same formulas I showed you before,
I just want to emphasize that making your
5325
08:11:33,530 --> 08:11:39,140
standard error first, can really help you
as you move along through these problems,
5326
08:11:39,140 --> 08:11:42,400
it just makes it a little easier to calculate,
especially in this case, where we're going
5327
08:11:42,400 --> 08:11:50,740
to use the standard error twice. So again,
what we do is we take, this would look exactly
5328
08:11:50,740 --> 08:11:55,610
like the last standard error, but it's different
because our n is different. So this time,
5329
08:11:55,610 --> 08:12:01,950
our standard error comes out as 2.4. And what
I just want to remind you is that the more
5330
08:12:01,950 --> 08:12:08,260
n you get, the bigger that square root of
n gets, I mean, n gets bigger, the square
5331
08:12:08,260 --> 08:12:14,490
root of n gets bigger. And that's then the
smaller the standard error gets. So you can
5332
08:12:14,490 --> 08:12:22,810
make the standard error really small, if you
just get a lot of n, right. So here's z one
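To see the standard error shrink as n grows, here is a small sketch. The helper name standard_error and the list of sample sizes are made up for illustration; only the 14.5 and the n values of 36 and 49 come from the lecture.

```python
from math import sqrt

def standard_error(sigma, n):
    """Standard error of x bar: sigma divided by the square root of n."""
    return sigma / sqrt(n)

sigma = 14.5                      # population standard deviation from the lecture
sample_sizes = [36, 49, 64, 81]   # perfect squares, so the square roots are easy

errors = [standard_error(sigma, n) for n in sample_sizes]
for n, se in zip(sample_sizes, errors):
    print(f"n = {n}: SE = {se:.2f}")
```

Each bigger n gives a smaller standard error, which is the pattern described above.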
5333
08:12:22,810 --> 08:12:27,420
and z two, I put them both up there. But we
can just walk through this, you know, x bar
5334
08:12:27,420 --> 08:12:34,250
one is 60. And x bar two is 65. Because it's
between 60 and 65. So you see that, um, you
5335
08:12:34,250 --> 08:12:38,750
see what's going on in the slide. And like
I told you, you know, both of these
5336
08:12:38,750 --> 08:12:43,890
x bars are below the mu. So they're both kind
of negative Z's. And so we've got our negative
5337
08:12:43,890 --> 08:12:50,430
z's. And now we have to just remind ourselves,
well, what are we doing, right? And so you
5338
08:12:50,430 --> 08:12:56,840
see, z one is at negative 2.28. So that's
a little piece at the bottom, we're going
5339
08:12:56,840 --> 08:13:03,520
to want to trim off. And then the big piece
at the top for z two, that starts at negative
5340
08:13:03,520 --> 08:13:09,520
0.21. So just remember, the
picture is really helpful. So now we're going
5341
08:13:09,520 --> 08:13:14,612
to go deal with the probabilities, right?
So for z one, we're looking at something to
5342
08:13:14,612 --> 08:13:22,150
the left, so we just leave the Z alone and
go look it up. And that's p equals 0.0113.
5343
08:13:22,150 --> 08:13:26,730
For z two, we got to flip the sign because
we have to use the opposite z, because we're
5344
08:13:26,730 --> 08:13:31,350
going for the right. So that's probability
two, and we can check that, see, because we
5345
08:13:31,350 --> 08:13:37,720
can see that's more than 50% of that shape.
So it's 0.5832. Okay, so we got our probabilities
5346
08:13:37,720 --> 08:13:44,298
now. And just like last time, we got
to take one minus both of those pieces, right?
5347
08:13:44,298 --> 08:13:48,540
And then we get the probability in the middle.
And that's the probability of drawing a sample
5348
08:13:48,540 --> 08:13:56,300
of 36 students with an x bar between 60 and
65. And just to translate that to the answer,
5349
08:13:56,300 --> 08:14:04,650
the probability is 0.4055. Or if you rounded
it, you know, if you like percents, you
5350
08:14:04,650 --> 08:14:12,610
could say 41%. So in conclusion, we reviewed
the parameters, and the statistics, and those
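The between-two-values problem (n of 36, x bar between 60 and 65) can be checked the same way. Again, this is just a sketch: the variable names are made up, and NormalDist from Python's standard library is used in place of the z table. The z values are rounded to two decimals to mirror the table lookup.

```python
from math import sqrt
from statistics import NormalDist

mu, sigma, n = 65.5, 14.5, 36   # values from the lecture
se = sigma / sqrt(n)            # standard error, about 2.4

# Two x bars, so two z values (both negative, since both are below mu)
z1 = round((60 - mu) / se, 2)
z2 = round((65 - mu) / se, 2)

std_normal = NormalDist()
p1 = std_normal.cdf(z1)         # little piece to the left of z1
p2 = 1 - std_normal.cdf(z2)     # big piece to the right of z2

# One minus both pieces leaves the probability in the middle
p_between = 1 - p1 - p2

print(f"z1 = {z1:.2f}, z2 = {z2:.2f}")
print(f"p1 = {p1:.4f}, p2 = {p2:.4f}, P(60 < x bar < 65) = {p_between:.4f}")
```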
5351
08:14:12,610 --> 08:14:17,310
notations. And we talked about inferences
and what we're doing with inference. Next,
5352
08:14:17,310 --> 08:14:20,540
we talked about what a sampling distribution
is, and how that's different from
5353
08:14:20,540 --> 08:14:25,160
a frequency distribution. So you can tell
you know what's going on with that. Then I
5354
08:14:25,160 --> 08:14:29,290
presented to you the central limit theorem,
which may have been kind of confusing, because
5355
08:14:29,290 --> 08:14:33,650
you know, theorems always are, they're always
about different principles and about different
5356
08:14:33,650 --> 08:14:39,240
things equaling each other. But because of
the central limit theorem, we then have permission
5357
08:14:39,240 --> 08:14:44,220
to do the operations we're doing after that,
which is finding probabilities regarding x
5358
08:14:44,220 --> 08:14:48,900
bar. The central limit theorem says that,
you know, this is how the world works. So
5359
08:14:48,900 --> 08:14:53,170
you get to use the standard error, and you
get to do these kinds of calculations. So
5360
08:14:53,170 --> 08:14:59,550
now, you know, in addition to finding
probabilities regarding x, you can find probabilities,
5361
08:14:59,550 --> 08:15:02,270
regarding x bar. Don't you feel smart?