Statistics - A Full University Course on Data Science Basics

Good day, everyone. This is your lecturer, Monika Wahi, and we're going to start now with Section 1.1: What Is Statistics? So here are our learning objectives for this lecture. At the end of this lecture, the student should be able to state at least one definition of statistics (yes, there's more than one), give one example of a population parameter, and give one example of a sample statistic. Also, the student should be able to classify a variable as quantitative or qualitative, and as nominal, ordinal, interval, or ratio.
So what we're going to cover in this lecture is, first, some definitions of statistics. Like I said, there's more than one, but they all sort of relate to the basic concept of why you're doing statistics, and especially not math. So what's the difference, right? Then we're going to go over population parameters and sample statistics, and you'll know what those mean at the end of the lecture. And finally, we're going to go over classifying levels of measurement.
So let's start with the definition of statistics. We're going to go over what it is, and I'm also going to define for you the concept of individuals versus variables. You may know definitions for those words already, but I'm going to give them to you in statistics-ese.
And then I'm going to give you examples of statistics, individuals, and variables in healthcare.
So here are the definitions. What is statistics? Statistics is the study of how to collect, organize, analyze, and interpret numerical information and data. Well, that sounds pretty esoteric, right? But if you actually think about it, even if you did a simple survey, like, say you just look on Yelp, right? You look on Yelp, and you see the restaurant you want to go to. Some people say five stars or four stars, but there are a few two stars and one stars. Well, do you go? I mean, there's a whole bunch of different answers. So how do you decide? You kind of have to analyze it, you kind of have to interpret it. So it's not that easy.
So statistics is both the science of uncertainty and the technology of extracting information from data. In other words, if you've got a bunch of data about, like, a restaurant, you don't know how it's going to be if you actually go there, right? You don't know for sure. So it's the science of uncertainty. If you look on Yelp and you see that almost everybody is giving it four or five stars, maybe it's going to be good for you, right? But you don't know. Maybe there's new management. That's the uncertainty.
So statistics is used to help us make decisions, not just whether to go to the restaurant or not, but important decisions, such as in healthcare and public health. Well, I guess if it's an expensive restaurant, maybe it's important. But anyway, in healthcare and public health, you really need these statistics, because they really guide you.
For example, let's think of the Centers for Disease Control and Prevention in the United States. So what do they do? They spend the whole year studying the different flu viruses that go around, because there's more than one. They spend the whole year doing that. They organize, analyze, and interpret numerical information and data about the different influenza viruses that are going around. They extract that information. And you know what decisions they make? They make the decisions about which viruses to include in the next year's vaccine.
Are they always right? Sure enough, they're not. I mean, have you ever had a year where you're like, "Oh my gosh, everybody I know got vaccinated, and they're still getting sick"? Well, you know, give them a break. It's the science of uncertainty; it just didn't work out that time. However, this is probably better than just randomly guessing, right? So that's statistics for you.
Now, I promised you I'd tell you the statistics-ese version of individuals and variables.
Now, if you're outside statistics, you know that individuals are people, right? And you know that a variable is a factor, like a factor that can vary. You know, like, "The only variable is I don't know what time something's going to happen." But when you enter the land of statistics, there are specific meanings to these two words.
Individuals are the people or objects included in a study. So if you're going to do an animal study with some mice in it, those would be the individuals. If you do a randomized clinical trial and you include people who have Alzheimer's in it, then patients are your individuals. But we do a lot of different things in healthcare. We sometimes study hospitals, like the rate of nosocomial infections, in which case, if you're looking at a whole bunch of hospitals, those would be the individuals. Sometimes we look at states, at rates of infant mortality, for example, in different states. In that case, states would be the individuals.
So as you can see at the bottom of the slide, a variable, then, is a characteristic of the individual to be measured or observed. I give some examples on the slide.
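The individual/variable distinction described above can be sketched as a small table in which each row is an individual and each column is a variable. This is only an illustration: the hospital names, the numbers, and the field names are all made up for the example.

```python
from dataclasses import dataclass, fields


@dataclass
class Hospital:
    # One individual in the study: here a hospital, not a person.
    name: str
    nosocomial_infection_rate: float   # variable 1 (made-up value)
    in_hospital_mortality_rate: float  # variable 2 (made-up value)


# Three hypothetical individuals, each measured on the same two variables.
hospitals = [
    Hospital("Hospital A", 2.1, 1.8),
    Hospital("Hospital B", 3.4, 2.5),
    Hospital("Hospital C", 1.7, 1.2),
]

individuals = [h.name for h in hospitals]
variables = [f.name for f in fields(Hospital) if f.name != "name"]

print(len(individuals), "individuals:", individuals)
print(len(variables), "variables:", variables)
```

Adding another hospital adds an individual (a row); adding another measured characteristic adds a variable (a column).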
But like I was saying, if you wanted to study hospitals, for example, I gave you the example of a variable, the rate of nosocomial infections. You could also have other variables about that individual, or hospital, like the rate of in-hospital mortality. And so, as you can see, one of the things we do in statistics is we sit down and we decide: who are going to be the individuals that we're going to measure, and what variables are we going to measure?
So I just threw up here a few examples of different kinds of individuals that we use a lot in healthcare and public health, and an example of just one variable about each of those example individuals. But there would theoretically be many variables about them. And I just want you to notice that a lot of times the individuals are geographic locations. Other times they might be institutions, like I said, like hospitals, or clinics, or programs. There are other things that they can be, but these are just kind of the big ones.
So, as I was describing, and just to review what I went over: statistics is used in healthcare and other disciplines to aid in decision making, like in the example I gave of the CDC and their vaccine for influenza. And so it's really important to understand statistics, because you need to understand these processes in healthcare, like how do we figure out what to do?
Like, not only what do we do, but how do we figure out what to do? And that's really important, because we use statistics a lot in healthcare.
Now we're going to move on to talk about what a population parameter is and what a sample statistic is. So first we're going to go over the definition of a population and the definition of a sample, so you're sure about what those mean. And we're going to talk about the data about a population and the data about a sample, and how those are different. And then we're going to get into what I was just describing, parameters and statistics, and I'll give you a few examples.
So let's start with: what is a population? Again, this is another case where you have a normal word, but it has a special meaning in statistics. Well, it's a group of people or objects with a common theme, and when every member of that group is considered, it's a population. So here's just one example. The theme would be, like, nurses who work at Massachusetts General Hospital. The population, then, if that was your theme, would be the list from human resources of every nurse currently employed at MGH.
Now, it really does depend on how you define that theme. Like, I could have said nurses who belong to the American Nurses Association, right? And then we'd be looking at a different list.
I could say nurses who live in New Orleans, within the city limits of New Orleans, right? Then we'd be looking at a different population. So it really has to do with the details of how you describe the theme around that population. But the point is, once you describe that theme, the population is every single individual in there.
So then, what is a sample? Well, it's a small portion of that population. It can be a representative sample, but it can also be a biased sample, and we're going to get into that. So let's just go back to MGH. Let's say we were going to survey a sample of the population of nurses at MGH, and let's say we only surveyed nurses in the intensive care unit. That would be a sample, but not a representative sample. It would be a small portion of that population, but not a representative one. Probably more representative would be if we asked at least one nurse from each department.
And so I just want to get into your head that the whole concept of a sample is that it's just a small portion of the population. And it's not a portion of some other population. It's just that one. But the problem is that you can get a biased one or a representative one, so you have to think about that.
So when you think about it, if you've got a whole population, then you would get variables about each individual in that population. And those variables would be your data.
But if you chose a sample, you know, just a portion, it would be a lot less work, right? You'd still have to get variables about those individuals, but there are way fewer individuals, so it would probably be easier.
So in population data, data from every single individual in the population is available. And that's called a census. I knew a person who decided to do a survey of every single professor at a college. She didn't take just some professors from each department; she sent the survey to every single professor. So she did not use a sample, she used a census.
But in sample data, the data are only available from some of the individuals in the population. So if we go back to the researcher I described, if she had only taken some of the list, the email list of the professors at that college, then she would have been surveying a sample. And that's actually very commonly used in research studies, especially of patients. Why would you need to go get every, for example, kidney dialysis patient and study every single one? You only need a sample. And why is that? Because we have statistics.
So I'm going to give you a few examples of real population data in healthcare. You're probably familiar with Medicare. Medicare is the public insurance program in the United States for elders. Even my grandma was on Medicare when she was alive, and she was not a US citizen; she was from India.
So we really do a good job of covering our elders in the US with Medicare. In fact, I even read a statistic that said almost 100% of people aged 65 and over are in Medicare. And so, if you download data from Medicare, they make it confidential; they just replace all the personal identifiers. But there's this thing called the Medicare claims data set, which covers every single transaction that happens. Like, if you're in Medicare and you go get some treatment, that's in there. So it has all the insurance claims filed by the Medicare population. Because it has everybody and everything, that is population data.
Also, in the United States, every 10 years the government hires a bunch of people to go out and survey a bunch of people, and they also send out a bunch of surveys. The idea is to try to get every single person in the United States to fill out that survey. And that's called the United States Census.
So now I'm going to give you sort of a mirror image: the sample data. Okay. Remember how I was just talking about Medicare? People who are enrolled in Medicare are called Medicare beneficiaries, and Medicare cares what they think. So they do a survey of a sample of individuals on Medicare. And they do this kind of often. I think they do it once a year, I'm not sure. It's a phone survey.
They only do a sample because they're going to use statistics to try to extrapolate that knowledge back to the population of Medicare beneficiaries.
Also, in case you noticed, the United States Census only takes place every 10 years. Do you think changes happen in between? Yep, lots of changes. Like, just think about Hurricane Katrina. That's very sad. It changed the population distribution in Louisiana very, very dramatically, and also in other states around there. So how did they keep up? Well, they used the American Community Survey. The government does this, the United States Census Bureau, and that, again, is done by phone. It's conducted yearly, and it's a sample. So the US doesn't know exactly how many people are in Louisiana or anywhere else, but they can use statistics to extrapolate that from the sample in the American Community Survey.
I want to just do a shout-out to statistical notation. From now on, when we see a capital N, like, let's say you saw capital N equals 25, then you can assume that the 25 refers to a population. That's just kind of a secret code we use in statistics. However, if you saw a lowercase n, n equals 25, and it was lowercase, then you could assume that this was a sample of the population. And again, it's just kind of like a secret code; you have to pay attention. When I'm talking and I say "n," and you can't see whether it's uppercase or lowercase, you don't know if I'm talking about a population or a sample.
Now I'm going to get into the concept of parameter versus statistic. I want you to notice that the word parameter starts with P, just like population. A parameter is a measure that describes the entire population. So, for instance, anything that would come out of that whole Medicare claims data set, or that whole United States Census, would be a parameter. On the other hand, statistic starts with S, just like sample, and a statistic is a measure that describes only a sample of a population.
Here we have, again, a situation where the word statistic is used daily on the news. In fact, sometimes I hear on the news something like, "Oh, look at the rate of HIV in Africa, it's going up. That's a terrible statistic." I agree, it's terrible. But they mean parameter, because they're talking about all of Africa, every single person in Africa. If the rate of HIV is going up in Africa, they mean a parameter; they don't mean a statistic.
So here's an example of a parameter and a statistic that are based on the same population. The mean age of every American on Medicare is a parameter; that's every single person. However, remember the Medicare beneficiary survey? That's just a sample. So if we took the mean age of those people, we would just have a statistic.
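The parameter/statistic contrast above, along with the capital N and lowercase n notation, can be sketched in a few lines. This is a minimal sketch under stated assumptions: the ages are randomly generated stand-ins, not real Medicare data, and the sizes 1,000 and 25 are arbitrary.

```python
import random
from statistics import mean

random.seed(0)  # fixed seed so the sketch is repeatable

# Hypothetical population: the age of every member of an insured group.
population_ages = [random.randint(65, 100) for _ in range(1000)]
N = len(population_ages)           # capital N: population size

# A sample: only some of the individuals from that same population.
sample_ages = random.sample(population_ages, 25)
n = len(sample_ages)               # lowercase n: sample size

parameter = mean(population_ages)  # describes the entire population
statistic = mean(sample_ages)      # describes only the sample

print(f"N = {N}, mean age of the population (parameter) = {parameter:.1f}")
print(f"n = {n}, mean age of the sample (statistic) = {statistic:.1f}")
```

The two means are computed in exactly the same way; what makes one a parameter and the other a statistic is only whether every individual in the population went into it.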
And again, you just have to pay attention, because if you listen to the news, you'll hear them use the word statistic to mean both parameter and statistic. But when you're practicing in the field of statistics, it's very important to point out when the number you're talking about comes from a population versus when it comes from a sample. So you should really use the terms: "this is a parameter" if it's from a population, or "this is a statistic" if it's from a sample.
And so again, don't get confused. If you're listening to someone talk in a lecture or in a video, you might want to look for clues as to whether a number is a population parameter or a sample statistic. One clue for a parameter is if you hear that the data set they used encompasses an entire population. Usually that's the kind of stuff done by governments. Remember when I was talking about the rate of HIV in Africa? That would probably be done by governments, or the United Nations, or the World Health Organization. So when you're talking about numbers that might have come out of an entire population, usually collected by a government, that's probably a population parameter.
A clue that someone's talking about a sample statistic is if you hear them talking about a study that recruited volunteers. Well, if it's volunteers, they didn't get everybody in the population.
So it's going to be a sample. Also surveys, for instance, surveys about who people are going to vote for, public opinion surveys. They're never going to ask every single person in the state, "Who are you going to vote for?" They'll just ask a sample. So if you hear about a survey, they might even tell you, say, n equals maybe a few thousand people, because that's all they surveyed. And so that's a clue that we're talking about a sample statistic rather than a population parameter.
Now I'm going to talk about the difference between descriptive statistics and inferential statistics. But first I'm going to remind you what the word infer means. To infer means to kind of get a hint from something indirectly. It's kind of the complement to imply. So if my friend implied that I should not call after 9 p.m., and I figured that out, I would say I inferred that I should not call my friend after 9 p.m. Okay. So inferential is what I'm going to talk about next.
But first I'm going to talk about descriptive. Descriptive statistics is pretty easy, because you can do it to samples and you can do it to populations. Well, to variables from samples and populations, right? And so descriptive statistics involves methods of organizing, picturing, and summarizing information from samples and populations. It's basically just making pictures of it, right? Like, look at that bar chart. That's just a simple picture.
And that can be made with just about any data. You get data from surveying people at work, or you get data from surveying your friends about what they're going to bring to the potluck. Any of that can be used. You can go download the census data, and you can make descriptive statistics out of that.
But there's something very special about inferential statistics. It involves methods of using information from a sample to draw conclusions regarding the population. Therefore, inferential statistics can only be done on a sample. And that's why it's called inferential, right? Because the sample is going to give a hint about what the population is like. It's not going to say it directly, which is annoying, right? But that's that uncertainty thing I was telling you about. So the sample is going to imply something, and we're going to infer something from the sample about the population. So that's what inferential statistics is: you take a sample, and you infer something about the population. Whereas descriptive statistics is more loosey-goosey. You can just do that to samples and populations, kind of like making pictures out of them, right?
So in statistics, it's really important to properly identify measures as either population parameters or sample statistics, because, as you can see, you can only do inferential statistics on samples.
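To make "using information from a sample to draw conclusions regarding the population" concrete, here is one common way of doing it: an approximate 95% confidence interval for a population mean. This is a sketch under stated assumptions, not something the lecture has introduced at this point. The population is randomly generated, the sample size of 100 is arbitrary, and the normal approximation is simply the technique chosen for the illustration.

```python
import random
from statistics import NormalDist, mean, stdev

random.seed(1)  # fixed seed so the sketch is repeatable

# Made-up population that we pretend we cannot observe in full.
population = [random.gauss(130, 15) for _ in range(10_000)]

# What we actually have: a random sample of n individuals.
sample = random.sample(population, 100)
n = len(sample)

sample_mean = mean(sample)              # the sample statistic
std_error = stdev(sample) / n ** 0.5    # how uncertain that statistic is
z = NormalDist().inv_cdf(0.975)         # about 1.96 for a 95% interval

# The inference: a range of plausible values for the population mean.
lower = sample_mean - z * std_error
upper = sample_mean + z * std_error

print(f"sample mean = {sample_mean:.1f}")
print(f"approximate 95% interval for the population mean: {lower:.1f} to {upper:.1f}")
```

The interval is the "hint" the sample gives about the population: it never states the population mean directly, and it carries the uncertainty along with it.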
And so you really have to know what you're doing when you're doing statistics, and what you're talking about, because different types of data are used for parameters versus statistics.
Alrighty, now we're going to get into classifying variables into different levels of measurement. So remember our variables, right? We have individuals, and then we have variables about them. Those variables can actually only fall into two groups: quantitative versus qualitative. And then, depending on which group they fall into, you can further classify them as interval versus ratio, or nominal versus ordinal. And I'm going to give you some examples of how to classify a few healthcare types of variables.
Alrighty, so I like to draw this picture. It's a four-level data classification, and I'll draw it slowly here for you. We start with human research data; that's what I like to start with. Then we're going to split that into two. Remember, I said that we're going to start by talking about quantitative. Another word that's often used for that is continuous, but we're going to use the word quantitative. So what does that mean? It's a numerical measurement of something. This gives an example of temperature. So, something with a number in it. I always think, if I can make a mean out of it, it must be a quantitative variable, right?
And so here are some examples of quantitative variables. So, time of admission, right?
So imagine that you work a shift in the ER, from maybe 8 p.m. to 12, like midnight, right? So you have these four hours. And you could say what the average time of admission would be for those who got admitted to the hospital. You know, somebody got admitted at, like, eight o'clock, and then somebody at 8:15, and whatever. You could put that together and say what the average time was. Also, if you were doing a study of patients with a particular condition, like Alzheimer's disease, you could ask them their year of diagnosis, and then you could make an average of that. And so you know that that is quantitative. Systolic blood pressure is also numerical, and so is platelet count. These are variables we run into all the time in healthcare.
So we've said that this is quantitative. Now we'll get back to our picture. That's one side. So what if it's not quantitative? What else could it be? Well, the only other category it could be is categorical, or qualitative. I use the term qualitative, but some people use the term categorical. That's kind of what it is: it's a quality of something, or a characteristic of something, like sex or race. So here are some qualitative variables in healthcare. You can have type of health insurance, like whether you're on Medicare or Medicaid or different types of private insurance. Those are all just categorical, right?
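The rule of thumb above (if you can make a mean out of it, it's quantitative) can be sketched with the two kinds of variable side by side. The admission times and insurance types below are made-up values for illustration only.

```python
from collections import Counter
from statistics import mean

# Quantitative: admission times during an 8 p.m. to midnight shift,
# stored as minutes after 8:00 p.m. (made-up values).
admit_minutes = [0, 15, 50, 95, 140]
avg = mean(admit_minutes)
hours, minutes = divmod(int(avg), 60)
print(f"average time of admission: {8 + hours}:{minutes:02d} p.m.")

# Qualitative: type of health insurance (made-up values).
# A mean has no meaning here, but counts and percentages do.
insurance = ["Medicare", "Medicaid", "Private", "Medicare", "Medicare"]
counts = Counter(insurance)
for kind, count in counts.items():
    print(f"{kind}: {100 * count / len(insurance):.0f}%")
```

With these values, the average admission falls 60 minutes into the shift, at 9:00 p.m., while the insurance types can only be summarized as shares of the group.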
You can't make a mean out of that. Also country of origin. If you're in a group of students and there are international students in there, well, what countries are they from? Right? You can't make a mean out of that.
You also have situations where you do have numbers involved, like the stage of cancer, right? That's depressing. Stage one cancer, stage two cancer, stage three. Well, you can never make a mean out of the stage of cancer. You wouldn't say, "Well, the mean stage is 1.4," or something like that. It's just a category. And of course, stage four is a lot worse than stage one. You know, they're not just equal categories, but they're categories. Same with trauma center level, like a Level IV trauma center. You wouldn't make a mean out of the number after the term "trauma center," right, like what level it is. But you could say, "Well, in the state, maybe so many percent of our trauma centers are Level IV trauma centers." So it's really just a categorical variable, even though there's a number involved.
Alright, so let's get back to our diagram. We've figured out how to take any variable and first split it into one of two categories: it's either quantitative, if it's numerical, or qualitative, if it's a characteristic. Now we're going to concentrate on quantitative, because we're going to separate those variables into two categories. And the first one we're going to look at is interval.
And the second 270 00:26:19,410 --> 00:26:26,309 one we're going to look at is ratio. So if you happen to classify a variable as quantitative, 271 00:26:26,309 --> 00:26:31,710 then it could be interval or ratio, but not if it's qualitative. Okay, if it's qualitative, 272 00:26:31,710 --> 00:26:38,640 it doesn't get to do that. So let's look at interval versus ratio. So on the left side 273 00:26:38,640 --> 00:26:43,810 of the slide, we have interval, which is where it's quantitative, and the differences between 274 00:26:43,810 --> 00:26:46,580 data values are meaningful. 275 00:26:46,580 --> 00:26:47,580 And 276 00:26:47,580 --> 00:26:51,210 ratio has the same thing: the differences between the data values are meaningful. What 277 00:26:51,210 --> 00:26:56,630 do I mean by that? Well, remember how I was talking before about how level one trauma 278 00:26:56,630 --> 00:27:01,700 center and level two trauma center are really categories, and not quantitative 279 00:27:01,700 --> 00:27:08,570 variables, because the difference between them is actually not equal? Especially if you 280 00:27:08,570 --> 00:27:16,010 think of job classifications that might go 1, 2, 3, 4, like nurse one, nurse two, nurse three, 281 00:27:16,010 --> 00:27:21,471 nurse four. Or I worked at a job where we had office specialist one, office specialist 282 00:27:21,471 --> 00:27:23,860 two, office specialist three. 283 00:27:23,860 --> 00:27:26,850 And you know what? The deal 284 00:27:26,850 --> 00:27:32,950 for going from office specialist two to office specialist three was really hard. You really 285 00:27:32,950 --> 00:27:39,360 had to do a lot there. But to go from one to two wasn't that hard. So that was a categorical 286 00:27:39,360 --> 00:27:46,580 variable, right? Because the differences between the values were meaningless. Okay. Like the 287 00:27:46,580 --> 00:27:52,529 difference between OS one and OS two versus OS two and OS three, they weren't equal. 
288 00:27:52,529 --> 00:27:56,440 Whereas when you're dealing with a quantitative variable, regardless of whether it's interval 289 00:27:56,440 --> 00:28:03,010 or ratio, you're talking like years, or systolic blood pressure. One year for you is one year 290 00:28:03,010 --> 00:28:10,100 for me. So that's fine, right? But here's where the difference comes in between interval 291 00:28:10,100 --> 00:28:16,880 and ratio. So all quantitative variables have meaningful differences between their data 292 00:28:16,880 --> 00:28:24,920 values, but this hairsplitting thing here is that in interval, there is no true zero, 293 00:28:24,920 --> 00:28:32,010 and in ratio, there is a true zero. And this is how I try to think about it. An interval 294 00:28:32,010 --> 00:28:38,740 means kind of like a space between two things. Like if you think of the word intermission, 295 00:28:38,740 --> 00:28:43,540 that's kind of like an interval. It's like an interval of time during a show where you get 296 00:28:43,540 --> 00:28:48,610 to get up and go to the bathroom and get some coffee. So that's interval. And so if you 297 00:28:48,610 --> 00:28:53,230 have something that's a space in between, that's not going to have a zero. It doesn't 298 00:28:53,230 --> 00:28:59,150 really start anywhere or end anywhere. It's in between. Whereas ratio, how I remember 299 00:28:59,150 --> 00:29:05,130 that is, I don't know if you remember from, like, high school, but you can't have a zero 300 00:29:05,130 --> 00:29:11,290 on the bottom of a ratio or a fraction. So that's the way I use a mnemonic: ratio 301 00:29:11,290 --> 00:29:18,930 means that you can have a true zero. But how does this work out literally? Well, I'll 302 00:29:18,930 --> 00:29:25,690 show you. So let's go back to those examples I showed you of quantitative variables, right? 
303 00:29:25,690 --> 00:29:30,120 Because those are the only ones we have to make this decision about, whether they are 304 00:29:30,120 --> 00:29:37,010 interval or ratio. So these are the examples. Now I'm going to remind you that ratio has 305 00:29:37,010 --> 00:29:42,059 a true zero. Remember that little mnemonic I said, like, don't divide by zero? And so, 306 00:29:42,059 --> 00:29:47,110 you know, like in a ratio, they have a true zero. Well, let's think about it. It's 307 00:29:47,110 --> 00:29:53,031 not very pleasant to have a zero systolic blood pressure, because you'd be dead. Same 308 00:29:53,031 --> 00:29:58,980 with the platelet count. But it is possible, right? But now when we go on to interval, 309 00:29:58,980 --> 00:30:05,799 we can't have, like, zero time, like time of admit, or year of diagnosis. There's 310 00:30:05,799 --> 00:30:13,230 no, like, year zero. So as you probably just guessed, ratio is where it's at in healthcare. 311 00:30:13,230 --> 00:30:18,429 There's not a whole lot of times when we have interval data, but we do, you know, anytime 312 00:30:18,429 --> 00:30:23,590 you have a time. So you've got to keep that in mind: if you want to split your quantitative 313 00:30:23,590 --> 00:30:29,650 variables into either interval or ratio, you've got to keep in mind the difference between 314 00:30:29,650 --> 00:30:38,210 the true zero and the no true zero. Okay, here's our handy-dandy diagram. We've just 315 00:30:38,210 --> 00:30:44,170 gone through the tree classifying quantitative data into interval versus ratio. Now let's 316 00:30:44,170 --> 00:30:48,780 go pay attention to the other side of the tree, qualitative. So how do we split those? 317 00:30:48,780 --> 00:30:58,080 Um, well, we can split those into nominal versus ordinal. All right. So nominal applies 318 00:30:58,080 --> 00:31:04,630 to categories, labels, or names that cannot be ordered from smallest to largest. 
Okay, 319 00:31:04,630 --> 00:31:08,850 like, I kind of think of when they have an advertisement and they say, for a nominal fee 320 00:31:08,850 --> 00:31:14,290 you can do this. It means it's small, like there's almost no difference. And so 321 00:31:14,290 --> 00:31:18,559 that's why I say there's no difference; it's not smallest to largest, it means they must 322 00:31:18,559 --> 00:31:24,710 be equal. That's how I remember it in my mind. But then ordinal applies to data that can 323 00:31:24,710 --> 00:31:29,000 be arranged in order in categories. But remember that thing I was saying about quantitative: 324 00:31:29,000 --> 00:31:33,929 it's not quantitative, right? Because the difference between the data values either 325 00:31:33,929 --> 00:31:39,440 cannot be determined or is meaningless, like I was talking about with cancer especially. 326 00:31:39,440 --> 00:31:43,750 You know, if you go from stage three to stage four, that's materially different than stage 327 00:31:43,750 --> 00:31:48,690 one to stage two. So you really can't determine those things. So this is where we're gonna 328 00:31:48,690 --> 00:31:54,320 get into that it's ordinal. It's arranged in categories that can be ordered from smallest 329 00:31:54,320 --> 00:32:01,620 to largest. So remember our old friends that I threw up there before, these examples 330 00:32:01,620 --> 00:32:07,710 of qualitative variables in healthcare? Well, let's just reflect on this. Nominal cannot 331 00:32:07,710 --> 00:32:12,950 be ordered, right? So that would be more like type of health insurance and country of origin, 332 00:32:12,950 --> 00:32:17,919 because they could all be equal. Whereas ordinal is going to have a natural order, even though 333 00:32:17,919 --> 00:32:24,110 the differences between the levels are meaningless, which is what makes it so different from a 334 00:32:24,110 --> 00:32:29,330 quantitative variable. 
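
To make the tree concrete, here is a minimal Python sketch of the classification described so far. The function name and its three yes/no questions are illustrative assumptions, not something from the lecture slides; the four example variables are the ones mentioned in the lecture.

```python
def level_of_measurement(is_quantitative, has_true_zero=False, has_natural_order=False):
    """Walk the tree: quantitative vs. qualitative first, then one more split.

    is_quantitative   -- numerical, and differences between values are meaningful
    has_true_zero     -- only asked on the quantitative side
    has_natural_order -- only asked on the qualitative side
    """
    if is_quantitative:
        # Quantitative side of the tree: interval (no true zero) or ratio (true zero).
        return "ratio" if has_true_zero else "interval"
    # Qualitative side of the tree: nominal (no order) or ordinal (natural order).
    return "ordinal" if has_natural_order else "nominal"


# Examples taken from the lecture; the answers to each question are hand-coded here.
examples = {
    "systolic blood pressure": level_of_measurement(True, has_true_zero=True),
    "year of diagnosis": level_of_measurement(True, has_true_zero=False),
    "stage of cancer": level_of_measurement(False, has_natural_order=True),
    "type of health insurance": level_of_measurement(False, has_natural_order=False),
}

for variable, level in examples.items():
    print(f"{variable}: {level}")
```

Note that the code cannot decide for you whether a variable is quantitative. Stage of cancer has numbers in it, but the person classifying still has to judge that the steps between stages are not equal.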
So that's why an ordinal variable stays on the qualitative side of the tree; 335 00:32:29,330 --> 00:32:34,450 it just gets labeled ordinal. So what you want to do is, if you think you have a qualitative 336 00:32:34,450 --> 00:32:39,830 variable on your hands, look for a natural order. If there is one, it's ordinal. And 337 00:32:39,830 --> 00:32:48,490 if not, it's nominal. So all data can be classified as quantitative or qualitative. So if you 338 00:32:48,490 --> 00:32:53,640 have a variable, that's the first split you can make, the difference between quantitative 339 00:32:53,640 --> 00:32:58,750 and qualitative. But once you do that, you can further classify it as interval, ratio, 340 00:32:58,750 --> 00:33:03,890 nominal, or ordinal. And it's really important to know how to classify data in healthcare, 341 00:33:03,890 --> 00:33:09,200 as you'll find out later, because depending on how you classify it, you might be able 342 00:33:09,200 --> 00:33:15,840 to do different things with it in statistics. All right, so what we went over was the definition 343 00:33:15,840 --> 00:33:20,720 of statistics. And we talked a little about why you use it and how you use it, especially 344 00:33:20,720 --> 00:33:25,800 in healthcare. We went over what it means to talk about a population parameter and a 345 00:33:25,800 --> 00:33:31,240 sample statistic, and we went over some examples of them. And then we talked about classifying 346 00:33:31,240 --> 00:33:38,190 variables into the different levels of measurement, and even talked about a few examples there. 347 00:33:38,190 --> 00:33:46,840 So I hope you enjoyed my lecture. Greetings, this is Monica Wahi, lecturer at Labouré College, 348 00:33:46,840 --> 00:33:55,780 bringing you your lecture on section 1.2 on the topic of sampling. 349 00:33:55,780 --> 00:33:56,780 So here 350 00:33:56,780 --> 00:34:01,871 are your learning objectives for this particular lecture. 
At the end of this lecture, the students 351 00:34:01,871 --> 00:34:08,030 should be able to define sampling frame and sampling error. The student should also be 352 00:34:08,030 --> 00:34:13,389 able to give one example of how to do simple random sampling, and one example of how to 353 00:34:13,389 --> 00:34:19,599 do systematic sampling. The students should be able to explain one reason to choose stratified 354 00:34:19,599 --> 00:34:26,270 sampling over other approaches, state two differences between cluster sampling and convenience sampling, 355 00:34:26,270 --> 00:34:33,029 and give an example of a national survey that uses multistage sampling. So let's jump right 356 00:34:33,029 --> 00:34:39,268 into it here. So we're going to go over in this lecture sampling definitions, and then 357 00:34:39,268 --> 00:34:43,588 those different types of sampling I mentioned in the learning objectives: simple random 358 00:34:43,589 --> 00:34:49,969 sampling, stratified sampling, systematic sampling, and then convenience and multistage 359 00:34:49,969 --> 00:34:59,710 sampling. So let's start with some sampling definitions. What is a sample? Okay, so we're going to revisit 360 00:34:59,710 --> 00:35:05,900 that concept from the previous lecture. We're also going to talk about sampling frames, 361 00:35:05,900 --> 00:35:11,210 and what errors mean, and errors of sampling frames. And then we're also going to just 362 00:35:11,210 --> 00:35:15,869 go right back over that and make sure you understand before we go on and talk about 363 00:35:15,869 --> 00:35:22,499 the different types of sampling. So we take a sample of a population because we want 364 00:35:22,499 --> 00:35:28,390 to do inferential statistics. Remember that? We want to infer from the sample to the population. 365 00:35:28,390 --> 00:35:34,109 And it's just not necessary to measure the whole population. It would be impractical, 366 00:35:34,109 --> 00:35:40,789 and it would cost a lot. 
And actually, what you'll find is, if you ever do an experiment 367 00:35:40,789 --> 00:35:46,049 where you actually do measure the whole population, you'll find that if you get, you know, a pretty 368 00:35:46,049 --> 00:35:51,509 good proportion of the population, and you just take that, that's all you really 369 00:35:51,509 --> 00:35:58,470 needed to talk to. So ultimately, we save resources, especially in healthcare, when 370 00:35:58,470 --> 00:36:05,249 we do a good job of sampling and use that to infer to the population, rather than having 371 00:36:05,249 --> 00:36:12,019 to take a census of the whole population all the time. So that brings us to the concept 372 00:36:12,019 --> 00:36:18,130 of sampling frame. So the sampling frame is the list of individuals from which a sample 373 00:36:18,130 --> 00:36:23,170 is actually selected. And the list may be a physical, concrete list. Like, you could 374 00:36:23,170 --> 00:36:29,260 have a list of students enrolled at a nursing college, or, in my other lecture, I gave an 375 00:36:29,260 --> 00:36:35,420 example of a list of nurses who work at Massachusetts General Hospital. That could be your list; 376 00:36:35,420 --> 00:36:40,780 you'd go to human resources and get that. Or it could be a theoretical list. It could 377 00:36:40,780 --> 00:36:46,079 be, like, the list of patients who present to the emergency department today. Obviously, 378 00:36:46,079 --> 00:36:51,029 when you go into work at the beginning of the shift, you're not going to know who's 379 00:36:51,029 --> 00:36:56,999 on that list yet. But it could be a theoretical list. But whatever that list is, that is your 380 00:36:56,999 --> 00:37:05,769 sampling frame. So those are the people who actually could be selected for your study. 381 00:37:05,769 --> 00:37:12,109 So the sampling frame is the part of the population from which you want to draw the sample. 
And 382 00:37:12,109 --> 00:37:17,960 you want to work it such that everybody from your sampling frame has a chance of being 383 00:37:17,960 --> 00:37:22,890 selected for your sample. In other words, you don't want to leave anyone that should 384 00:37:22,890 --> 00:37:31,059 be in your sampling frame out in the cold. That leads us to the concept of undercoverage. 385 00:37:31,059 --> 00:37:35,670 So what is it? It's omitting population members from the sampling frame. They're supposed 386 00:37:35,670 --> 00:37:41,130 to be on the list, but they're not there. So how can this happen? Well, let's say you 387 00:37:41,130 --> 00:37:45,309 did what I was suggesting in the previous slide: you got a list of nursing students, 388 00:37:45,309 --> 00:37:50,650 you know, from a college. Let's say somebody signed up that day, or somebody was just admitted 389 00:37:50,650 --> 00:37:54,830 that day. Maybe they didn't make it into the database in time, and you're missing them. 390 00:37:54,830 --> 00:37:59,920 Or even like that HR list I talked about at MGH. Well, you know, I know how nurses 391 00:37:59,920 --> 00:38:04,119 are. Sometimes they'll temp in different places, and maybe they're not on the payroll, maybe 392 00:38:04,119 --> 00:38:09,160 they're through a temp agency. And so then we would miss those nurses from the sampling 393 00:38:09,160 --> 00:38:15,099 frame. And then, you know, people who present at the emergency department at night might 394 00:38:15,099 --> 00:38:19,470 be different than those in the day. And so if you're really trying to sample from people 395 00:38:19,470 --> 00:38:24,330 who present to the emergency department, you can't just look at, like, some small period 396 00:38:24,330 --> 00:38:31,970 of time. You'd have to look at, you know, the whole 24-hour cycle. So if you omit population 397 00:38:31,970 --> 00:38:36,170 members from your sampling frame, they don't even get a chance to be in your sample. 
And that's 398 00:38:36,170 --> 00:38:43,800 called undercoverage. Now I'm going to shift around; we're jumping around with a few different 399 00:38:43,800 --> 00:38:44,800 definitions. 400 00:38:44,800 --> 00:38:49,470 And we're going to talk about errors. Now, this is something that took me a while to 401 00:38:49,470 --> 00:38:54,519 get used to in statistics: there are actually two kinds of errors in statistics. The first 402 00:38:54,519 --> 00:39:02,030 kind I call, and this is my own terminology, a fact of life error. It's just an error that 403 00:39:02,030 --> 00:39:08,160 happens when you do statistics. It's not bad or good. It's just what happens. And in 404 00:39:08,160 --> 00:39:13,349 this case, I'm going to describe one of those. It's called sampling error. So sampling 405 00:39:13,349 --> 00:39:18,900 error just simply says the population mean will be different from your sample mean, and 406 00:39:18,900 --> 00:39:22,859 the population percentage will be different from your sample percentage. So what does 407 00:39:22,859 --> 00:39:28,299 that mean? That means that if I cut corners, like I said I could, right, and just take a 408 00:39:28,299 --> 00:39:33,789 sample to infer to the population, and if I actually do one of those experiments I was telling 409 00:39:33,789 --> 00:39:38,940 you about, where I have the population data and I just take a sample and compare the means, 410 00:39:38,940 --> 00:39:43,480 they will be different. Okay, I mean, there might be this huge coincidence where they're 411 00:39:43,480 --> 00:39:49,359 the same, but they're typically different. Same if you do percentages. And we just 412 00:39:49,359 --> 00:39:53,180 know this is going to happen. In statistics we account for it; we have ways of dealing 413 00:39:53,180 --> 00:39:58,479 with it. 
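
The experiment described here can be tried as a small simulation. This is only a sketch: the population of 10,000 systolic blood pressure values is made up (drawn from a normal distribution chosen purely for illustration), and the sample size of 50 is an arbitrary assumption.

```python
import random
import statistics

rng = random.Random(0)  # fixed seed so the run is repeatable

# Hypothetical "census": pretend we measured everyone in the population.
population = [rng.gauss(120, 15) for _ in range(10_000)]
population_mean = statistics.mean(population)

# Cut corners: take a sample of 50 instead of using everyone.
sample = rng.sample(population, 50)
sample_mean = statistics.mean(sample)

# Sampling error is simply the gap between the two means.
sampling_error = sample_mean - population_mean

print(f"population mean: {population_mean:.2f}")
print(f"sample mean:     {sample_mean:.2f}")
print(f"sampling error:  {sampling_error:+.2f}")
```

Changing the seed gives a different sample and a different gap, but the gap is almost never exactly zero. Nothing was done wrong here; the difference comes only from using part of the population.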
But we know that there's always going to be sampling error. Whenever you take a sample 414 00:39:58,479 --> 00:40:02,770 from a population to try to make a mean or percentage in the sample, it's just not going 415 00:40:02,770 --> 00:40:06,509 to be exactly what's in the population. That's fine. 416 00:40:06,509 --> 00:40:08,219 But then 417 00:40:08,219 --> 00:40:12,839 there are other errors in statistics which are actually bad, and it means you made 418 00:40:12,839 --> 00:40:19,529 a mistake. They're like mistakes, literally mistakes. And so as you go through learning about statistics, 419 00:40:19,529 --> 00:40:23,000 it's almost like you have to sit down and ask somebody, is this one of those fact of 420 00:40:23,000 --> 00:40:28,069 life errors? Or is this one of those errors you want to avoid? Well, we just talked about 421 00:40:28,069 --> 00:40:33,869 sampling error. That's just a fact of life error. But an error you want to avoid is non-sampling 422 00:40:33,869 --> 00:40:42,200 error. That's basically using a bad list. I had an example in my life where I wanted 423 00:40:42,200 --> 00:40:48,920 to study a whole bunch of providers, right? And my friend gave me this list of providers 424 00:40:48,920 --> 00:40:54,989 and said, this is the entire list of all the providers in this particular professional 425 00:40:54,989 --> 00:41:01,640 society. But when I sent the email to that list, I found there were not only duplicates 426 00:41:01,640 --> 00:41:05,729 on this list, but a lot of people emailed me back and said, why are you sending this 427 00:41:05,729 --> 00:41:14,529 to me? I'm not a provider. I'm not part of this professional society. And also, some 428 00:41:14,529 --> 00:41:19,700 people who were in that professional society, who had heard about the survey, emailed me 429 00:41:19,700 --> 00:41:24,390 and said, why didn't I get the survey? So this was a bad list. Some people had been 430 00:41:24,390 --> 00:41:32,650 left out of the sampling frame. 
So people who were in the society somehow weren't on 431 00:41:32,650 --> 00:41:37,470 my email list. And that's a problem, right? So you have to pay careful attention. This 432 00:41:37,470 --> 00:41:42,430 was actually a mistake I made. You have to pay careful attention that everyone in the 433 00:41:42,430 --> 00:41:46,960 population who is supposed to be represented in your sampling frame is actually there. 434 00:41:46,960 --> 00:41:51,480 So I should have really done a better job of calling the professional society and making 435 00:41:51,480 --> 00:41:59,719 sure that this list was a good list. So sampling error is caused by the fact that, regardless 436 00:41:59,719 --> 00:42:06,130 of what you do, your sample will not perfectly represent the population. Whereas 437 00:42:06,130 --> 00:42:11,880 non-sampling error, yeah, I was sloppy. It's poor sample design, sloppy data collection, 438 00:42:11,880 --> 00:42:16,680 inaccurate measurement instruments. You can have bias in data collection, or other problems 439 00:42:16,680 --> 00:42:22,329 introduced by the researcher. So it's your fault if there's non-sampling error, but sampling 440 00:42:22,329 --> 00:42:23,880 error is just a 441 00:42:23,880 --> 00:42:27,809 fact of life. 442 00:42:27,809 --> 00:42:33,539 A little whiplash here: we're gonna now move on to the concept of simulations. So a simulation 443 00:42:33,539 --> 00:42:42,219 is defined technically as a numerical facsimile, or representation, of a real-world phenomenon. 444 00:42:42,219 --> 00:42:48,529 So it's like working through a pretend situation to see how it would come out if it 445 00:42:48,529 --> 00:42:57,900 were real. And, you know, when you study statistics, you end up doing a lot of simulations. 
446 00:42:57,900 --> 00:43:03,859 And remember how I've been talking about an experiment you could do? If you somehow did 447 00:43:03,859 --> 00:43:08,569 a census and had a whole bunch of data on a population, you could do an experiment where 448 00:43:08,569 --> 00:43:13,779 you just took a sample from that population and looked at its mean to see the sampling 449 00:43:13,779 --> 00:43:23,740 error. That's an example of a simulation. So to just conclude this little section, it's 450 00:43:23,740 --> 00:43:30,430 really important to do your best to avoid non-sampling error. And one way this is achieved is by 451 00:43:30,430 --> 00:43:35,219 making sure you do not have undercoverage when sampling from your sampling frame. So 452 00:43:35,219 --> 00:43:40,719 this puts together some of our vocabulary. But just remember, sampling error is a fact 453 00:43:40,719 --> 00:43:47,469 of life. Okay, now we're going to specifically talk about different types of sampling. And 454 00:43:47,469 --> 00:43:55,769 we're going to start with simple random sampling. Okay, so first, we're gonna start with just 455 00:43:55,769 --> 00:44:00,960 explaining what is meant by simple random sampling. Then we're going to talk about two 456 00:44:00,960 --> 00:44:06,640 different methods of doing simple random sampling. They work the same way; they achieve the same 457 00:44:06,640 --> 00:44:11,059 thing. It's just that depending on how you're doing your research, one might be more convenient 458 00:44:11,059 --> 00:44:16,819 for you than the other. Finally, we will go over the limits of simple random sampling, 459 00:44:16,819 --> 00:44:24,519 because all these sampling methods seem perfect, but then you've got to take a look at their limitations. 460 00:44:24,519 --> 00:44:31,890 So let's first define simple random sampling. So here's a definition. 
A simple random sample 461 00:44:31,890 --> 00:44:39,159 of n measurements from a population is a subset of the population selected in such a manner 462 00:44:39,159 --> 00:44:46,269 that every sample of size n from the population has an equal chance of being selected. Well, 463 00:44:46,269 --> 00:44:52,359 it's kind of complicated, but what it means is that if you use the proper approach 464 00:44:52,359 --> 00:44:58,450 for simple random sampling, whatever sample you get, you would have had just as good 465 00:44:58,450 --> 00:45:06,369 a chance of getting another batch, another group of people, from that population. In other 466 00:45:06,369 --> 00:45:10,750 words, like, let's say you have a list of the population of students in the class. So 467 00:45:10,750 --> 00:45:16,390 I'm going to define a class as a population. And you want to take a sample of five students 468 00:45:16,390 --> 00:45:21,190 from this bigger class. If you take a simple random sample, it means that each of the different 469 00:45:21,190 --> 00:45:26,450 groups of five students you could pick from the list has an equal chance of being the 470 00:45:26,450 --> 00:45:33,200 sample group you actually pick. Now, you can just imagine that if you race into the class 471 00:45:33,200 --> 00:45:37,470 right at the beginning, and you take your sample of five and not everybody's in the 472 00:45:37,470 --> 00:45:43,810 class, what does that sound like? Right, a sampling frame problem, maybe an undercoverage 473 00:45:43,810 --> 00:45:49,480 problem, maybe bias is creeping in there, right? And so you've just got to be careful, if you're 474 00:45:49,480 --> 00:45:55,230 going to do simple random sampling, that you start with a list with everybody in your sampling 475 00:45:55,230 --> 00:46:01,450 frame, because every single sample that you could possibly take should have an equal chance 476 00:46:01,450 --> 00:46:10,240 of ending up being your sample. 
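
The definition can be made concrete by listing every possible sample. The sketch below assumes a hypothetical class of 10 students (the lecture does not give a class size) and a sample of five. It enumerates all the groups of five and then picks one group uniformly at random, which gives each group the same chance of being selected.

```python
import random
from itertools import combinations
from math import comb

# Hypothetical population: a class of 10 students (class size is an assumption).
students = [f"student_{i}" for i in range(1, 11)]
n = 5

# Every possible group of five that could be drawn from the class list.
all_samples = list(combinations(students, n))
number_of_samples = len(all_samples)  # equals comb(10, 5)

# Under simple random sampling each of these groups is equally likely.
chance_per_sample = 1 / number_of_samples

# Picking one group uniformly at random is one way to get a simple random sample.
rng = random.Random(1)
chosen_sample = rng.choice(all_samples)

print(f"possible samples of size {n}: {number_of_samples}")
print(f"chance of any one sample: {chance_per_sample:.4f}")
print(f"the sample drawn this time: {chosen_sample}")
```

Listing every group only works for tiny populations, since the count grows very quickly. In practice you draw the sample directly, as in the two methods described next, and the equal-chance property still holds.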
And I'll kind of explain it by explaining the two different 477 00:46:10,240 --> 00:46:14,359 methods that can be used for obtaining that 478 00:46:14,359 --> 00:46:15,359 sample. 479 00:46:15,359 --> 00:46:22,140 So one of the best things that you can do is just start with a really good list of all 480 00:46:22,140 --> 00:46:27,529 the people in your population. So maybe, you know, if I was going to study, I used to work 481 00:46:27,529 --> 00:46:32,489 at the army, so let's say I was going to study all the people who are active duty in the 482 00:46:32,489 --> 00:46:39,259 US Army. I would like to get a list of all of those people from an accurate place at 483 00:46:39,259 --> 00:46:48,650 the army. And I would like to have each of them have a unique ID. Okay. And that's true in the 484 00:46:48,650 --> 00:46:54,569 army; everybody in the army has a unique numerical ID. So what I would do, like, here, if you 485 00:46:54,569 --> 00:46:59,650 were looking at students, you'd maybe take a student ID. So then you take the IDs 486 00:46:59,650 --> 00:47:06,339 from everybody on the list, and you cut them up. Like, you print them out, and you cut them 487 00:47:06,339 --> 00:47:11,599 up, and you put them in a hat, right, or a bag where you can't see in it. And you mix 488 00:47:11,599 --> 00:47:17,109 them all up where you can't see them. And you draw five of them out, or, like in the picture, 489 00:47:17,109 --> 00:47:21,731 you know, what they did was mix up all those papers, and now they're not looking, and they're 490 00:47:21,731 --> 00:47:27,519 drawing a few out. Okay, so what did you just do? You just made sure, first of all, that 491 00:47:27,519 --> 00:47:31,799 everybody in the population had an ID number. 
And that when you printed it out and cut it 492 00:47:31,799 --> 00:47:35,549 up, you didn't lose any of them. If you drop them on the floor or something, that's 493 00:47:35,549 --> 00:47:39,329 not simple random sampling. You've got to make sure you keep all of them, and that you put 494 00:47:39,329 --> 00:47:44,829 them all in the hat, and that you didn't look when you drew five or whatever, because then 495 00:47:44,829 --> 00:47:49,200 any five of those slips of paper could have been drawn, and therefore you're meeting the definition of 496 00:47:49,200 --> 00:47:57,880 simple random sampling. Okay, that method will work, right? Another method that works, 497 00:47:57,880 --> 00:48:02,920 that might work better if you can't do this ID thing where you cut up paper, is where you 498 00:48:02,920 --> 00:48:10,249 simply just make your own list of unique random numbers, right, you just make your own list. 499 00:48:10,249 --> 00:48:16,110 And then you assign those to the population. A great example is if you're, you know, kind 500 00:48:16,110 --> 00:48:20,259 of teaching kids and you want to put them in a random order, maybe you're gonna do a 501 00:48:20,259 --> 00:48:26,710 game or something. Well, all you do is, like, let's say you have 10 kids, 502 00:48:26,710 --> 00:48:31,779 you number one to 10, you put the numbers in the hat, and then you pull out the first number, let's 503 00:48:31,779 --> 00:48:35,950 say it's five, and you give it to the first kid, right? And then you just keep pulling out 504 00:48:35,950 --> 00:48:40,359 numbers and giving them to the kids, and then tell them to stand in order, right? So you 505 00:48:40,359 --> 00:48:44,410 generate a list of random numbers as long as the list of the population. So I said, 506 00:48:44,410 --> 00:48:50,069 what if you have 10 kids? Well, if you have, you know, 500 names, then you get 500 numbers, 507 00:48:50,069 --> 00:48:54,239 and they don't have to be one through 500. They just have to be unique. 
Okay, I like 508 00:48:54,239 --> 00:48:59,309 smaller numbers, so I'd say keep them small, but you can do what you want. And then, in 509 00:48:59,309 --> 00:49:05,099 any case, you randomly assign these numbers (you can use the hat, I'm big on hats) to this 510 00:49:05,099 --> 00:49:11,329 population. And then, you know, you ask them to stand in order, or somehow you figure it out. 511 00:49:11,329 --> 00:49:15,499 It's kind of like a raffle: you call out, who's got number one? You know, and whoever says 512 00:49:15,499 --> 00:49:19,729 yes, you're like, you're lucky, you get to be in my study. You know, so you can take 513 00:49:19,729 --> 00:49:26,150 the first five numbers in the order, right? And that'll achieve the same thing 514 00:49:26,150 --> 00:49:30,440 as the last method. You'll get a simple random sample. It's just two different ways of doing 515 00:49:30,440 --> 00:49:37,759 it. So ultimately, being a simple random sample means that the sample has an equal 516 00:49:37,759 --> 00:49:42,719 chance of being selected out of the hat, that this group of people, or group of 517 00:49:42,719 --> 00:49:48,609 whatever, has an equal chance of being selected. And you'll see this picture on the left here 518 00:49:48,609 --> 00:49:54,140 is bingo, as some of you may play bingo. You know, they pull balls out of there and they 519 00:49:54,140 --> 00:49:59,119 call off the names of the balls. Well, each ball has a unique, actually, letter and 520 00:49:59,119 --> 00:50:04,329 number on there. And that's how they make them random. They take a simple 521 00:50:04,329 --> 00:50:11,440 random sample of these bingo balls each time that they do a bingo game. So I described 522 00:50:11,440 --> 00:50:16,690 to you the first method of doing that using an old-fashioned hat. The second method, you 523 00:50:16,690 --> 00:50:20,349 know, where you generate your own numbers, and you just make sure they're unique. 
And 524 00:50:20,349 --> 00:50:25,369 then you assign them to things and put them in order. Well, that's my electronic hat. 525 00:50:25,369 --> 00:50:30,950 That's how I handle it. If, for example, somebody sends me an Excel sheet with a list 526 00:50:30,950 --> 00:50:36,209 of hospitals on it, I'll just assign each hospital a random number and sort them in order. 527 00:50:36,209 --> 00:50:41,690 And I'll sample the top few hospitals. That'll be how I get a simple random sample of hospitals. 528 00:50:41,690 --> 00:50:46,829 That way, I'm not biased, picking out my favorite hospitals where all my friends work, right? 529 00:50:46,829 --> 00:50:51,470 If I do it that way, the first method or the second method, all members of the population 530 00:50:51,470 --> 00:50:56,390 have an equal probability of being selected in the sample. And more importantly, all possible 531 00:50:56,390 --> 00:51:00,700 samples, all possible groups, had an equal chance of being selected. Of course, I only 532 00:51:00,700 --> 00:51:04,489 did it once, so I only got one of them. But the other ones that weren't selected had an 533 00:51:04,489 --> 00:51:06,729 equal chance of being selected. 534 00:51:06,729 --> 00:51:15,180 All right, you probably saw the limits: it's this whole list thing. Even if I'm sampling hospitals, 535 00:51:15,180 --> 00:51:21,650 right, I still need a list of hospitals to sample from. So you may not know who's gonna 536 00:51:21,650 --> 00:51:26,769 show up in the emergency department that day. If you do, well, you're psychic, because most 537 00:51:26,769 --> 00:51:31,450 people are not. So how would you sample from them using simple random sampling? So simple 538 00:51:31,450 --> 00:51:36,009 random sampling is okay when you've got a list, like hospitals, but it's not so good when 539 00:51:36,009 --> 00:51:41,979 you don't know who's going to show up that day. And even if you do simple random sampling, 540 00:51:41,979 --> 00:51:47,940 you need a good list. 
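
The two methods described above, the paper hat and the "electronic hat", can be sketched in Python roughly as follows. The function names, the list of eight hospitals, and the sample size of three are all made up for illustration; the only library calls used are from Python's standard `random` module.

```python
import random


def draw_from_hat(ids, n, rng):
    """Method 1: every ID goes in the hat, mix, and draw n without looking."""
    return rng.sample(ids, n)


def electronic_hat(items, n, rng):
    """Method 2: give each item a unique random number, sort, take the first n."""
    # Unique numbers 1..N in a random order, one per item on the list.
    numbers = rng.sample(range(1, len(items) + 1), len(items))
    # Sort the items by their assigned number, like sorting a spreadsheet column.
    numbered = sorted(zip(numbers, items))
    return [item for _, item in numbered[:n]]


rng = random.Random(42)  # fixed seed so the example is repeatable

# Hypothetical sampling frame: a list of eight hospitals.
hospitals = [f"Hospital {letter}" for letter in "ABCDEFGH"]

sample_one = draw_from_hat(hospitals, 3, rng)
sample_two = electronic_hat(hospitals, 3, rng)

print("drawn from the hat:", sample_one)
print("electronic hat:    ", sample_two)
```

Both functions start from the full list, which is the point made in the lecture: if a hospital is missing from `hospitals`, neither method can ever select it.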
I made a mistake once, where I did a survey with a bunch of professionals 541 00:51:47,940 --> 00:51:54,809 using a professional society list. And when I sent out the survey, I learned that there 542 00:51:54,809 --> 00:51:59,089 were people on the list who were no longer part of the society; it was an old list. 543 00:51:59,089 --> 00:52:03,009 And more importantly, there were people who had joined the society that had not made it 544 00:52:03,009 --> 00:52:09,619 onto that list. So I was getting undercoverage. So like, if you were doing a study with students, 545 00:52:09,619 --> 00:52:13,849 you know, what if they just left off the part-time students? Then you'd be missing them. 546 00:52:13,849 --> 00:52:18,079 So this is a great example of non-sampling error. And so if you're going to do simple 547 00:52:18,079 --> 00:52:21,719 random sampling, you do need a list, and you really want to research it and make sure it's 548 00:52:21,719 --> 00:52:30,150 the best list possible. So I just went over the characteristics of simple random sampling, 549 00:52:30,150 --> 00:52:35,890 and two different methods you can use to sample from a list. And I also mentioned 550 00:52:35,890 --> 00:52:44,940 the limits of it. Now we'll talk about a different kind of sampling, stratified sampling. So 551 00:52:44,940 --> 00:52:50,420 we're gonna go over what it is. And then, just like simple random sampling had all 552 00:52:50,420 --> 00:52:54,500 these steps to it, there are different steps in stratified sampling. And I'll give you 553 00:52:54,500 --> 00:53:00,219 some examples. And then of course, just like simple random sampling, this stratified sampling 554 00:53:00,219 --> 00:53:07,469 has limitations, and I'll talk about those. So I first wanted to just remind you what 555 00:53:07,469 --> 00:53:14,119 the word stratified means, or what strata are. The singular word is stratum, and more 556 00:53:14,119 --> 00:53:19,529 than one is strata. 
Now you see that rock on the slide, you see that big, horizontal line 557 00:53:19,529 --> 00:53:25,910 across it? That's a stratum; they are strata, right? Those are strata of rock. 558 00:53:25,910 --> 00:53:31,680 If you study geology, the geologists will explain that where those breaks are, 559 00:53:31,680 --> 00:53:36,440 it means something happened, often in the weather or the environment or whatever. But the reason 560 00:53:36,440 --> 00:53:43,650 why I put this picture up there is I want you to sort of imagine those layers. Because 561 00:53:43,650 --> 00:53:49,880 what we do in stratified sampling is, first, we divide our list (of course, you 562 00:53:49,880 --> 00:53:55,339 need a list), we divide our list into layers. Okay, so remember how I was just talking about 563 00:53:55,339 --> 00:53:59,519 simple random sampling? Like, what if I sample from hospitals? Well, I could take this hospital 564 00:53:59,519 --> 00:54:08,369 list and divide it into layers by, for example, how close they are to the city. I could say 565 00:54:08,369 --> 00:54:16,049 urban, suburban, and rural; I could first put them into those strata. Okay. And if I 566 00:54:16,049 --> 00:54:20,069 was doing that, I'd be doing stratified sampling. Same with students, like I could put them 567 00:54:20,069 --> 00:54:25,489 in, you know, first-year nursing students, second-year students, you know, and I'd have 568 00:54:25,489 --> 00:54:31,319 them divided into strata first. Um, so why would you do that? Why 569 00:54:31,319 --> 00:54:36,369 not just do simple random sampling? Well, if you think about it, let's say that you've 570 00:54:36,369 --> 00:54:41,319 got a class like statistics. Maybe a lot of you know, there are not that many first-year 571 00:54:41,319 --> 00:54:47,150 students in it. So let's say a very small proportion is that way. 
If you do simple random 572 00:54:47,150 --> 00:54:52,690 sampling, you might just by luck miss all of them. Right. And so, if you're really concerned 573 00:54:52,690 --> 00:54:59,569 about what a minority thinks, then you can make sure to get representation from that 574 00:54:59,569 --> 00:55:04,769 stratum by doing stratified sampling, because the first thing you do is you put that 575 00:55:04,769 --> 00:55:13,809 list into groups. And then you take a simple random sample from each of the strata. So 576 00:55:13,809 --> 00:55:18,759 here's the steps. So step one, divide the entire population, the whole list you have, 577 00:55:18,759 --> 00:55:23,920 into distinct subgroups called strata. And remember, each individual has to fit into 578 00:55:23,920 --> 00:55:28,249 one of those categories. So if you have somebody who's sort of halfway between first 579 00:55:28,249 --> 00:55:32,670 year and second year, or you've got a hospital that's kind of on the border, you've got to 580 00:55:32,670 --> 00:55:37,609 choose, you've got to put it in one of those groups. Step two, um, well, it's not really 581 00:55:37,609 --> 00:55:41,750 step two, but you've got to think about the strata, like what is it based on? It's got 582 00:55:41,750 --> 00:55:46,670 to be based on one specific characteristic, such as age, income, education level. You know, 583 00:55:46,670 --> 00:55:51,740 a great example is you could take people of all different incomes, right, that's a quantitative 584 00:55:51,740 --> 00:55:56,829 variable, but you can put them in strata by, you know, less than a certain amount, and 585 00:55:56,829 --> 00:55:59,549 then that to that, that to that. You can make, 586 00:55:59,549 --> 00:56:04,970 you know, four or five strata. And then, um, you know, you just want to make sure that 587 00:56:04,970 --> 00:56:10,450 all members of the stratum, each stratum, share the same characteristic. 
And then you 588 00:56:10,450 --> 00:56:15,549 could do step four, which is draw a simple random sample from each stratum. So like, 589 00:56:15,549 --> 00:56:20,769 in the case I was describing, like, maybe you have a class with very few first-year 590 00:56:20,769 --> 00:56:27,699 students, if you take a random sample of five from each strata, you know, each stratum, 591 00:56:27,699 --> 00:56:34,000 then you might be, you know, you're kind of getting almost like extra votes from a small 592 00:56:34,000 --> 00:56:38,849 minority, right? Like, you're kind of treating them fairly, even though there's a way bigger 593 00:56:38,849 --> 00:56:46,549 group of the other people you're taking exactly five from. But that's 594 00:56:46,549 --> 00:56:52,339 the risk you take, because you want to make sure you hear from that small group. Because 595 00:56:52,339 --> 00:56:56,729 if you just do simple random sampling with a group so small, you might just accidentally 596 00:56:56,729 --> 00:57:03,680 miss it. So here are some examples of stratified sampling. And you'll see this in the Youth 597 00:57:03,680 --> 00:57:09,289 Behavioral Risk Factor Surveillance surveys that they do in high schools, that they'll 598 00:57:09,289 --> 00:57:14,690 stratify by grade, right, because if they did a simple random sample, you know, a lot 599 00:57:14,690 --> 00:57:19,229 of students drop out of junior and senior year, so they'd probably get too many freshmen 600 00:57:19,229 --> 00:57:25,400 and sophomores. And so they're gonna want to look at getting a certain amount of freshman 601 00:57:25,400 --> 00:57:28,589 classes, certain amount of sophomore classes, certain amount of junior classes, certain 602 00:57:28,589 --> 00:57:35,369 amount of senior classes, so they can have enough of each to make good estimates, right. And 603 00:57:35,369 --> 00:57:42,279 in hospitals, they often sample providers from each department, right? 
Like, they don't 604 00:57:42,279 --> 00:57:48,089 just do a simple random sample of providers. If they're asking about, like, provider satisfaction, 605 00:57:48,089 --> 00:57:52,849 or, you know, about a policy, they won't just do that, because they might, for example, 606 00:57:52,849 --> 00:57:59,339 miss everybody in the ICU. Or if you're studying, you know, ICUs, if you have multiple ICUs 607 00:57:59,339 --> 00:58:00,339 there, 608 00:58:00,339 --> 00:58:01,339 then 609 00:58:01,339 --> 00:58:05,420 you would want to maybe stratify by ICU, just to make sure, even if one of them's smaller, 610 00:58:05,420 --> 00:58:06,869 just to make sure you have 611 00:58:06,869 --> 00:58:07,869 a good, 612 00:58:07,869 --> 00:58:14,869 good solid representation from each ICU. So those are the reasons that push you to do 613 00:58:14,869 --> 00:58:19,319 stratified sampling. It's not always necessary. But when you have these situations where you 614 00:58:19,319 --> 00:58:23,380 have these distinct groups, especially with a little one involved, and you want to hear 615 00:58:23,380 --> 00:58:30,660 from everybody, you really want to consider stratified sampling. So of course, there's 616 00:58:30,660 --> 00:58:35,289 limitations. And I've been sort of leading up to this. What you end up doing is 617 00:58:35,289 --> 00:58:42,109 oversampling one of the groups, usually, you know, like the smallest group, if you make the 618 00:58:42,109 --> 00:58:48,969 number of people you take from that stratum the same as the number you take from the big stratum. 619 00:58:48,969 --> 00:58:53,009 It's like the smallest group is having all these powerful votes and the biggest group 620 00:58:53,009 --> 00:58:58,180 is weaker. You know, they're both equal when they're not technically equal in the 621 00:58:58,180 --> 00:59:03,690 population. But that's the way it goes, right? 
And in higher-level statistics, there's 622 00:59:03,690 --> 00:59:08,930 ways to adjust back for that, to just sort of say, take a penalty for that and go back 623 00:59:08,930 --> 00:59:14,410 and say, well, what if the real pop, you know, we can extrapolate this back to the population 624 00:59:14,410 --> 00:59:20,900 proportions? It's possible, but it takes some post-processing, is just the issue. And 625 00:59:20,900 --> 00:59:27,390 it's also, like simple random sampling, not really possible to do without a list beforehand. 626 00:59:27,390 --> 00:59:33,020 And it's also hard to do, because you actually have to split the list into groups, into these 627 00:59:33,020 --> 00:59:37,150 strata. So let's say I had these hospitals and I didn't know where they were, I didn't 628 00:59:37,150 --> 00:59:42,440 know exactly if they were urban or rural or suburban. Well, that adds another level of 629 00:59:42,440 --> 00:59:50,059 complexity to this whole stratified sampling. So, in summary, I just went over what stratified 630 00:59:50,059 --> 00:59:53,520 means, and it means, you know, putting things in groups and then taking from that, and I 631 00:59:53,520 --> 01:00:00,459 described the steps involved. And a stratified sample goes a lot more easily 632 01:00:00,459 --> 01:00:04,749 if the strata happen to be equal to begin with. You know, I gave the example of 633 01:00:04,749 --> 01:00:09,920 high schools; usually there's maybe slightly fewer people in junior and senior year, but 634 01:00:09,920 --> 01:00:14,930 it's kind of close. And it's always nice, like if you're comparing ICUs, for example, 635 01:00:14,930 --> 01:00:18,029 if the ICUs are roughly the same size, because then you don't have to worry about 636 01:00:18,029 --> 01:00:26,019 this whole, one of them is smaller, but it's getting an equal vote. Alright, now we are 637 01:00:26,019 --> 01:00:34,819 going to move on to talk about systematic sampling. 
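To make the stratified steps above concrete, here is a minimal Python sketch. The class of 4 first-year and 46 second-year students is made up, `stratified_sample` is a hypothetical helper, and the weights at the end (stratum size divided by number sampled) are one common way to "adjust back" to population proportions. The lecture does not say which adjustment it has in mind, so treat that part as an assumption.

```python
import random
from collections import defaultdict

def stratified_sample(population, stratum_of, per_stratum, seed=None):
    """Split the list into strata, then draw a simple random sample from each."""
    rng = random.Random(seed)
    strata = defaultdict(list)
    for individual in population:
        # Each individual goes into exactly one stratum.
        strata[stratum_of(individual)].append(individual)
    sample = {}
    for name, members in strata.items():
        k = min(per_stratum, len(members))  # cannot take more than exist
        sample[name] = rng.sample(members, k)
    sizes = {name: len(members) for name, members in strata.items()}
    return sample, sizes

# Hypothetical class: very few first-year students, many second-year students.
students = [("first-year", i) for i in range(4)] + \
           [("second-year", i) for i in range(46)]
sample, sizes = stratified_sample(students, lambda s: s[0], per_stratum=3, seed=0)

# Assumed adjustment: each sampled person "stands for" this many people.
weights = {name: sizes[name] / len(chosen) for name, chosen in sample.items()}
print(weights)
```

With equal sample sizes per stratum, the small group's members carry a small weight and the big group's members carry a large one, which is the "equal vote" issue described above.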
Okay, well, systematic sampling 638 01:00:34,819 --> 01:00:40,780 actually can be done with or without a list. So it's a little more flexible than the kind 639 01:00:40,780 --> 01:00:47,680 of sampling we've been talking about. Systematic sampling, it's easier for me to, like, define 640 01:00:47,680 --> 01:00:52,309 it by describing the steps you go through to do it. So I'm just gonna explain how to 641 01:00:52,309 --> 01:00:57,049 do it. And then you'll understand, in fact, you'll understand why it's called systematic. 642 01:00:57,049 --> 01:01:03,489 So whether you have a list or not, what you have to do for step one is arrange all the 643 01:01:03,489 --> 01:01:10,999 individuals of the population in a particular order. Now, if it's a list, you just make 644 01:01:10,999 --> 01:01:16,699 it in whatever order you want to make it in. But if we're talking about, for example, patients 645 01:01:16,699 --> 01:01:20,180 coming into the ER, well, they come in in the order that they want 646 01:01:20,180 --> 01:01:21,180 to. 647 01:01:21,180 --> 01:01:24,519 So they already are arranged in a list, right? You just don't know what that list 648 01:01:24,519 --> 01:01:32,180 is. Okay, then step two is pick a random individual as a start. So let's say I had a list of hospitals, 649 01:01:32,180 --> 01:01:39,650 and let's say it was just sorted by state, right? Let's say I picked a random individual, 650 01:01:39,650 --> 01:01:44,930 maybe I went down, you know, seven on the list, and I picked that hospital. Or maybe 651 01:01:44,930 --> 01:01:50,710 you could be at the ER, you start your shift, and the seventh patient who is admitted to 652 01:01:50,710 --> 01:01:54,701 the ER, you pick that person. I just picked seven; I mean, you could have picked five, 653 01:01:54,701 --> 01:01:59,999 you could have picked 20, you know, you just pick a random person. 
Then the next step, 654 01:01:59,999 --> 01:02:05,789 step three, is take every kth member of the population into the sample. Now, don't try this 655 01:02:05,789 --> 01:02:11,880 in Scrabble; kth is not a word in Scrabble, okay? It's just a word in statistics-ese, 656 01:02:11,880 --> 01:02:19,859 and what kth means, spelled k-t-h, is every so many. So let's pick a number and 657 01:02:19,859 --> 01:02:26,539 fill it in for k. So let's pick the number three. So let's say after you pick your first 658 01:02:26,539 --> 01:02:30,130 hospital from the list, or the first patient from the ER, it doesn't matter what number 659 01:02:30,130 --> 01:02:36,660 you chose for that, then you take every third after that. So every third patient that comes 660 01:02:36,660 --> 01:02:41,450 in after that, you ask them if they want to be in a study, or every third hospital after 661 01:02:41,450 --> 01:02:46,249 that original random one, I pick and I say, okay, this is going to be part of my systematic 662 01:02:46,249 --> 01:02:51,049 sample. So as you can see, it's, like, pretty simple to do. It's easy to do if you have 663 01:02:51,049 --> 01:02:56,680 a list, it's easy to do if you don't have a list. The deal is just that you have to pick k. 664 01:02:56,680 --> 01:03:01,189 Well, first you pick a random place to start, then you pick k, and then you just keep going 665 01:03:01,189 --> 01:03:08,979 every so many. So you could do this with classes. You could take out a list of classes available 666 01:03:08,979 --> 01:03:14,920 at your college next semester, you pick a random number like three, you know, and it's 667 01:03:14,920 --> 01:03:18,900 sorted some way. So you go to the third class and you circle that, then you pick another 668 01:03:18,900 --> 01:03:24,189 random number like five, and then after that you pick every fifth class. So after the third 669 01:03:24,189 --> 01:03:34,459 one, you go 4, 5, 6, 7, 8, and then 9, 10, 11, 12, 13. And you keep picking classes. 
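The class-list example above (start at the third class, then take every fifth) can be sketched as follows. The helper name `systematic_sample` and the numbered class list are assumptions made for illustration. It uses `itertools.islice` so it works on a list or on a stream of arrivals, matching the "with or without a list" point.

```python
from itertools import islice

def systematic_sample(arrivals, k, start):
    """Take the item at position `start` (0-based), then every k-th item after it."""
    if k < 1 or start < 0:
        raise ValueError("k must be >= 1 and start must be >= 0")
    # islice works on lists and on one-at-a-time streams (e.g. patients arriving).
    return list(islice(arrivals, start, None, k))

# Hypothetical course list: classes numbered 1..20 in some sorted order.
classes = list(range(1, 21))
picked = systematic_sample(classes, k=5, start=2)  # index 2 is the third class
print(picked)  # [3, 8, 13, 18]

# If the order repeats boy, girl, boy, girl, an even k returns only one of them.
roster = ["boy", "girl"] * 10
print(set(systematic_sample(roster, k=2, start=0)))  # {'boy'}
```

In practice the starting position would itself be chosen at random, for example with `random.randrange`; it is fixed here so the output matches the lecture's 3, 8, 13 walk-through.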
Okay, this is not 670 01:03:34,459 --> 01:03:41,410 career advice. Okay? Do not pick your classes that way. This was just an example. Alright, 671 01:03:41,410 --> 01:03:45,819 so as you probably guessed, I'm going to be Negative Nelly again: there are problems 672 01:03:45,819 --> 01:03:52,239 with systematic sampling. If things are already set up boy, girl, boy, girl, for example, 673 01:03:52,239 --> 01:03:57,490 if you pick, like, an even number, you're going to get all boys or all girls, right? And 674 01:03:57,490 --> 01:04:03,589 I noticed this, actually, when I was doing a study in the lab. We wanted to study, like, 675 01:04:03,589 --> 01:04:08,589 whenever they put the assay through the machines, we thought some of the assays weren't running 676 01:04:08,589 --> 01:04:15,470 right. And so we wanted to take a sample. And I wanted to take a systematic sample. 677 01:04:15,470 --> 01:04:21,279 But I wanted to take a systematic sample, like, every seven days, and that's a week. 678 01:04:21,279 --> 01:04:29,119 And so I asked my colleague, does the lab vary day by day in what assays it runs? Because 679 01:04:29,119 --> 01:04:34,469 if it always runs the sexually transmitted disease assays, it saves them up and runs 680 01:04:34,469 --> 01:04:40,209 them all on Friday, and I'm sampling from every Friday, that's all I'm gonna get, right? 681 01:04:40,209 --> 01:04:44,940 That's actually called periodicity. You don't have to remember that; I don't think I've ever 682 01:04:44,940 --> 01:04:50,099 even seen that written. It's just I remember my lecturer in my class telling us that that's 683 01:04:50,099 --> 01:04:55,339 what you have to worry about with systematic sampling. It's not a real common problem, though. 684 01:04:55,339 --> 01:05:00,650 But what's awesome about it is you can do it in a clinical setting. 
So you can sample 685 01:05:00,650 --> 01:05:05,549 patients that way, coming into a clinic or coming to a central lab or, like, in the emergency 686 01:05:05,549 --> 01:05:10,380 room. And that's why this is a particularly powerful way to sample: 687 01:05:10,380 --> 01:05:17,099 if you have an ongoing sort of patient influx, when you design your research, you 688 01:05:17,099 --> 01:05:21,470 could simply say, once you decide how many people you need to recruit for your sample, 689 01:05:21,470 --> 01:05:25,170 that you would use systematic sampling, and just have somebody in the clinic inviting 690 01:05:25,170 --> 01:05:33,680 every kth person who qualifies, every kth patient who qualifies, into your study. So 691 01:05:33,680 --> 01:05:39,739 it's easy to do systematic sampling, it's easy to do with or without a list. And you 692 01:05:39,739 --> 01:05:47,539 just pick a random starting point, and then you pick every kth individual. Next, we're 693 01:05:47,539 --> 01:05:55,299 gonna move on to cluster sampling. So what is up with cluster sampling? Why do we 694 01:05:55,299 --> 01:06:00,089 even need other kinds of sampling? I just went over so many kinds. I mean, you could use 695 01:06:00,089 --> 01:06:05,479 stratified, systematic, or simple random sampling; why would you even need another kind? Well, 696 01:06:05,479 --> 01:06:11,030 cluster is very special. It's special because it's the kind of sampling you use when you 697 01:06:11,030 --> 01:06:18,240 think there's a problem at a particular geographic location. Typically, that's how cluster sampling 698 01:06:18,240 --> 01:06:22,079 is used. And I'll explain it further. 699 01:06:22,079 --> 01:06:29,420 Imagine, for example, there's a particular factory that is believed to emit fumes 700 01:06:29,420 --> 01:06:34,439 that cause problems with people's health. 
Well, you can't do simple random sampling 701 01:06:34,439 --> 01:06:40,449 all over the nation, right, or you won't even get people by that factory. You can't really 702 01:06:40,449 --> 01:06:46,130 easily do stratified or systematic sampling there. Cluster sampling is what's designed for 703 01:06:46,130 --> 01:06:51,880 when you want to study something that's coming from a geographic location. So when you do 704 01:06:51,880 --> 01:06:58,099 cluster sampling, you start by dividing a map into geographic areas. So I'm from Minnesota, 705 01:06:58,099 --> 01:07:04,829 and I know that there was a mine there with vermiculite in it. And it was contaminated; 706 01:07:04,829 --> 01:07:10,180 a lot of people got sick from it. But they didn't know that's what was going on. So they 707 01:07:10,180 --> 01:07:17,739 first, I think, divided Minnesota into different geographic areas. After dividing the 708 01:07:17,739 --> 01:07:23,079 area into these different geographic areas, some with the bad thing in it, and 709 01:07:23,079 --> 01:07:30,230 some without the bad thing in it, you randomly pick these clusters or areas from the map. 710 01:07:30,230 --> 01:07:38,650 So the map, like you'll see there on the screen, there's a map of the state of Virginia, 711 01:07:38,650 --> 01:07:45,809 and it's all been divided into different groups. And then this cluster is highlighted. 712 01:07:45,809 --> 01:07:52,170 You usually probably pick more than one cluster; sometimes it's only four or five. But the 713 01:07:52,170 --> 01:07:59,049 idea is you try to enroll all of the individuals in the cluster. It's usually people, although 714 01:07:59,049 --> 01:08:04,300 you can do it with animals. If there's a disease going around among animals, you know, you 715 01:08:04,300 --> 01:08:09,770 would divide the area up into clusters, and then you try to measure all 716 01:08:09,770 --> 01:08:17,698 the animals in the cluster. 
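The cluster procedure just described (divide the map into areas, randomly pick a few areas, enroll everyone in them) can be sketched like this. The area names, residents, and the helper `cluster_sample` are invented for illustration only.

```python
import random

def cluster_sample(clusters, n_clusters, seed=None):
    """Randomly pick whole areas, then enroll every individual in each one."""
    rng = random.Random(seed)
    # Sample area names, not people; sorting just fixes the order for repeatability.
    chosen = rng.sample(sorted(clusters), n_clusters)
    return {area: list(clusters[area]) for area in chosen}

# Hypothetical map: each geographic area lists everyone who lives there.
areas = {
    "north": ["n1", "n2", "n3"],
    "south": ["s1", "s2"],
    "east": ["e1", "e2", "e3", "e4"],
    "west": ["w1"],
}
sample = cluster_sample(areas, n_clusters=2, seed=0)
print(sample)
```

The randomness is only in which areas get picked; inside a picked area nobody is left out, which is what separates this from stratified sampling, where every group is visited but only some members are taken.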
So as you can imagine, not only is this sort of practically 717 01:08:17,698 --> 01:08:23,588 difficult, but there's reasons why people live together, right? People live in communities. 718 01:08:23,589 --> 01:08:27,910 I mean, people don't just randomly scatter themselves. You know, cultural communities 719 01:08:27,910 --> 01:08:33,630 grow. Companies grow around art. You know, affluent communities have different people 720 01:08:33,630 --> 01:08:39,670 in them than communities that have less money. So sometimes the people located in the cluster 721 01:08:39,670 --> 01:08:43,849 are all similar in a way that makes the problem hard to study. And this is especially true if 722 01:08:43,849 --> 01:08:49,880 you're studying some geographic thing, like maybe a factory or a sewage plant, that you 723 01:08:49,880 --> 01:08:55,770 think might be causing cancer. If you're in an area where there's a lot of pollution anyway, 724 01:08:55,770 --> 01:09:00,689 from other things, and a lot of low-income people live there (because if you're high 725 01:09:00,689 --> 01:09:06,869 income you can afford not to), well, they're already being exposed to higher rates of carcinogens 726 01:09:06,870 --> 01:09:11,109 and probably have a higher cancer rate. It's hard to tell what the independent effect might 727 01:09:11,109 --> 01:09:16,729 be of that thing in that geographic location because of the other similarities of the people 728 01:09:16,729 --> 01:09:25,499 around. And so cancer ends up being a really difficult, tough nut to crack. Because 729 01:09:25,500 --> 01:09:30,960 where we see high rates, there are often a lot of different geographic issues going on 730 01:09:30,960 --> 01:09:33,859 there, and cluster sampling doesn't really help tease 731 01:09:33,859 --> 01:09:39,339 that out. 732 01:09:39,339 --> 01:09:45,069 So to wrap this up, cluster sampling is used when geography is important. 
So if there is 733 01:09:45,069 --> 01:09:50,359 something geographically located in a certain spot and you can't move it, then you kind 734 01:09:50,359 --> 01:09:57,329 of are stuck doing cluster sampling. So briefly, the map around that area is divided into different 735 01:09:57,329 --> 01:10:03,809 subareas, right. And not all the areas are picked; just a few are randomly 736 01:10:03,809 --> 01:10:09,449 picked. And then all of the people in that particular area are sampled. And of course, 737 01:10:09,449 --> 01:10:13,219 it's biased towards the people living in the area. If you, you know, pick an area 738 01:10:13,219 --> 01:10:17,619 with a bunch of affluent people, you'll get affluent people; pick an area with a bunch 739 01:10:17,619 --> 01:10:22,989 of immigrants, you'll get immigrants. And so cluster sampling is not perfect, but you're 740 01:10:22,989 --> 01:10:27,739 kind of stuck with it when there's a situation with geography. How I always remember 741 01:10:27,739 --> 01:10:34,099 it is, when I used to live in Florida, we'd like to drive up to Georgia because they had 742 01:10:34,099 --> 01:10:40,790 the best pecan clusters. That's like a type of dessert with pecans and caramel and stuff. 743 01:10:40,790 --> 01:10:44,860 So when I think of cluster sampling, I think of those pecan clusters that are only 744 01:10:44,860 --> 01:10:50,360 really good in Georgia. So that's my way of remembering that cluster sampling has to do 745 01:10:50,360 --> 01:10:57,849 with geography. Now I'm finally going to talk about the last two types of sampling that 746 01:10:57,849 --> 01:11:02,570 I'm going to cover in this lecture, convenience sampling and multistage sampling. They're 747 01:11:02,570 --> 01:11:07,059 both a little quick, so I'm going to just cover them quickly. First, we're going to 748 01:11:07,059 --> 01:11:12,440 start by talking about convenience sampling. And we like that name, right? It's convenient. 
749 01:11:12,440 --> 01:11:16,790 Convenience sampling can be used under low-risk circumstances, like if the findings of 750 01:11:16,790 --> 01:11:21,341 what you're doing aren't really that important. Like, for instance, let's say that you wanted 751 01:11:21,341 --> 01:11:25,540 to know what ice cream is the best from the restaurant next to the hospital. Let's say 752 01:11:25,540 --> 01:11:29,520 a new restaurant opens up, and you're gonna go off your diet, you're gonna go get some 753 01:11:29,520 --> 01:11:34,059 ice cream, but you don't want to waste it, right. So you want to ask people, what's the 754 01:11:34,059 --> 01:11:40,060 best one? You might ask your coworkers, you might ask, you know, the people at the restaurant, 755 01:11:40,060 --> 01:11:44,060 hey, what's the best ice cream? But the results are not so reliable, because you might end 756 01:11:44,060 --> 01:11:51,360 up on Yelp and see that other people disagree. So convenience sampling is basically using 757 01:11:51,360 --> 01:11:57,880 results or data that are conveniently or readily obtained. In my master's degree, one of the 758 01:11:57,880 --> 01:12:03,210 things I did was I surveyed people anonymously who were coming to a health fair. I sat at 759 01:12:03,210 --> 01:12:08,430 a booth, and I gave them the survey, with a few questions in it. That was definitely a convenience 760 01:12:08,430 --> 01:12:13,780 sample, you know, just people showing up for the health fair. And this can be useful when 761 01:12:13,780 --> 01:12:19,630 there's not a lot of resources allocated to the study. Like, I was a starving master's 762 01:12:19,630 --> 01:12:24,280 student, right, like, I didn't have any money. So that was perfect for me, convenience 763 01:12:24,280 --> 01:12:30,320 sampling. And also, you know, the questions I was asking them about were just characteristics 764 01:12:30,320 --> 01:12:34,800 of whether or not they had risk for diabetes. 
Well, I'm not a doctor, and I wasn't going 765 01:12:34,800 --> 01:12:39,790 to do anything about it. But it was interesting. So it wasn't a very high-risk survey to fill 766 01:12:39,790 --> 01:12:46,210 out. And convenience sampling is convenient, because it uses an already assembled group 767 01:12:46,210 --> 01:12:51,949 for surveys, like I was doing at the health fair. An example might be to ask patients 768 01:12:51,949 --> 01:12:55,949 in the waiting room to fill out a survey, or ask students in a class. You know, sometimes 769 01:12:55,949 --> 01:12:59,880 I do when I'm teaching, I'll do a convenience sample of whoever's sitting there. I'll say, 770 01:12:59,880 --> 01:13:02,230 hey, is the homework that I assigned you this week too hard? 771 01:13:02,230 --> 01:13:03,659 Well, it's always too hard. I 772 01:13:03,659 --> 01:13:08,570 don't even know why I do the survey. But anyway, um, sometimes as a teacher, you'll just want 773 01:13:08,570 --> 01:13:15,010 to do a convenience sample just to get a gauge on where the class is. But there are problems 774 01:13:15,010 --> 01:13:19,489 with it, right? You can't just use it for everything, even though it's nice and convenient. 775 01:13:19,489 --> 01:13:23,949 There's bias in every group, right? So if I let everybody go on break, and then whoever's 776 01:13:23,949 --> 01:13:27,670 still sitting there, I ask them if the homework's too hard, I might get a totally different 777 01:13:27,670 --> 01:13:32,780 answer than if I waited for everybody to come back. Right. And, you know, just about any 778 01:13:32,780 --> 01:13:37,219 time you just waltz into a room, like when I went to the health fair, who do you think 779 01:13:37,219 --> 01:13:40,699 is there? A bunch of sick people? No, there's a bunch of health-minded people there. And 780 01:13:40,699 --> 01:13:47,320 so I'm gonna get a bunch of bias, right. 
And also, more importantly, when you do convenience 781 01:13:47,320 --> 01:13:55,329 sampling, you often miss important subpopulations. So remember, with stratified sampling, how sometimes 782 01:13:55,329 --> 01:14:01,870 people don't group evenly into the different strata? Maybe they do, kind of, in high schools, 783 01:14:01,870 --> 01:14:07,369 but especially when it comes to job classifications, they usually have fewer bigwigs than they 784 01:14:07,369 --> 01:14:14,579 do lackeys, right. And if they just have a few bigwigs, if you do a simple random sample, 785 01:14:14,579 --> 01:14:19,599 you might miss all of them. So maybe you try a stratified sample. On the other hand, 786 01:14:19,599 --> 01:14:24,909 if you walk into the break room that is used by the lackeys and you say, hey, I want you to 787 01:14:24,909 --> 01:14:32,239 fill out my, you know, work satisfaction survey, all of the ones you're going to get are going 788 01:14:32,239 --> 01:14:36,690 to be from the lackeys. You're not going to get any representation from the upper job 789 01:14:36,690 --> 01:14:42,670 classes because they don't go in that lounge, so you'd be missing them. So that's the main 790 01:14:42,670 --> 01:14:48,170 problem with a convenience sample: the results can be so severely biased because you're only 791 01:14:48,170 --> 01:14:56,119 asking the small, biased group of people that probably are all alike in some way. It's not 792 01:14:56,119 --> 01:14:58,890 a very representative sample. 793 01:14:58,890 --> 01:15:00,570 Next, 794 01:15:00,570 --> 01:15:06,960 I'm going to talk about multistage sampling. So, you know, if you have a kid and the kid's 795 01:15:06,960 --> 01:15:12,090 crying and somebody's like, what's up? You say, well, the kid's going through stages. Well, that's 796 01:15:12,090 --> 01:15:16,160 exactly what you're doing when you're doing multistage sampling: you're going through 797 01:15:16,160 --> 01:15:23,050 stages. 
It's basically like mixing and matching the different sampling I just talked about, 798 01:15:23,050 --> 01:15:28,699 only you do one stage, and then two stages, and then three stages, and then four stages, 799 01:15:28,699 --> 01:15:33,340 or maybe even more. And that's how you get your sample. So if you're imagining, well, I've 800 01:15:33,340 --> 01:15:39,340 got to start with a lot of people, you're probably right. I'll just give an example I made 801 01:15:39,340 --> 01:15:45,460 up of a way that you could do multistage sampling: you could start with stage one as a 802 01:15:45,460 --> 01:15:51,150 cluster sample, right? Remember, where you take out a map, and then you divide into areas? 803 01:15:51,150 --> 01:15:56,770 Well, let's divide into states and take two census regions of states, like about 10 states 804 01:15:56,770 --> 01:16:01,770 from those clumps. Okay, now, we limited it to that. Now let's go to stage two of our 805 01:16:01,770 --> 01:16:07,370 multistage sampling. Now, from each of those, we could take a random sample of counties, 806 01:16:07,370 --> 01:16:13,250 right. So we go and look at all the counties and then take that random sample. Then after 807 01:16:13,250 --> 01:16:20,030 we get those counties, stage three, we could take a stratified sample of schools from each 808 01:16:20,030 --> 01:16:26,030 county. So some of the counties will be totally rural, some will be totally urban, but most 809 01:16:26,030 --> 01:16:31,090 will have some mix. So we'll take a look at a few schools from the urban, a few schools 810 01:16:31,090 --> 01:16:36,800 from the rural. So in stage three, we'll take a stratified sample of schools 811 01:16:36,800 --> 01:16:41,020 from the simple random sample of counties from this cluster sample of states. Okay, 812 01:16:41,020 --> 01:16:47,000 now we've got our schools; stage four could be a stratified sample of classrooms. 
So once 813 01:16:47,000 --> 01:16:51,080 we figured out our urban schools or rural schools, we could go in there and look at 814 01:16:51,080 --> 01:16:56,790 all the classrooms, freshman, sophomore, junior, senior, and take a stratified sample of those. 815 01:16:56,790 --> 01:17:01,949 So it's basically mixing and matching. But you're right, you got to start with a lot 816 01:17:01,949 --> 01:17:05,780 to begin with, if you're gonna whittle it down in a whole bunch of stages. It doesn't 817 01:17:05,780 --> 01:17:11,460 have to be four; I just gave you four. Now I'm going to give you a real life example. This 818 01:17:11,460 --> 01:17:17,969 is the National Health and Nutrition Examination Survey. NHANES is definitely not a master's 819 01:17:17,969 --> 01:17:23,810 project. This is done by the Centers for Disease Control and Prevention in the United States, 820 01:17:23,810 --> 01:17:30,610 right? So what I'm kind of hinting towards is the kinds of places doing multistage sampling 821 01:17:30,610 --> 01:17:38,960 are governments. Not only do you have to start with a whole bunch of people and things and 822 01:17:38,960 --> 01:17:43,800 individuals, states and schools, and what have you, right, but it's a lot of work 823 01:17:43,800 --> 01:17:49,960 to do all the sampling, and it better be for good reason. And the National Health and Nutrition 824 01:17:49,960 --> 01:17:55,780 Examination Survey is a good reason. That's a survey that's done by the CDC to 825 01:17:55,780 --> 01:18:02,079 try and measure America's health. Of course, it's doing inferential statistics, right, 826 01:18:02,079 --> 01:18:08,040 it's taking a sample and trying to extrapolate that information back to the population. And 827 01:18:08,040 --> 01:18:11,551 so it's got to be really careful about how it does its sampling. You can't just waltz in 828 01:18:11,551 --> 01:18:17,460 and do a bunch of convenience sampling. 
So this is how it does it, just briefly, they 829 01:18:17,460 --> 01:18:24,679 start by in stage one, sampling counties. Then from those counties, they sample something 830 01:18:24,679 --> 01:18:31,330 called segments, which is defined in the census, it's their different areas, from those segments, 831 01:18:31,330 --> 01:18:36,800 those areas, they sample households. And that's what they mean, like, wherever you live as 832 01:18:36,800 --> 01:18:41,670 a household. Even if you live in a dorm, that's a household or you live in assisted living, 833 01:18:41,670 --> 01:18:47,780 that's a household. I'm an apartment building house. So they sample those and once they 834 01:18:47,780 --> 01:18:53,400 knock on your door of your household, they sample individuals from the house. So they 835 01:18:53,400 --> 01:19:02,090 use four stages of sampling. And that's a real life example of multi stage sampling. 836 01:19:02,090 --> 01:19:08,989 So in summary, convenience and multi stage sampling, with respect to convenience sampling, 837 01:19:08,989 --> 01:19:16,199 you want to avoid it unless it's really a low risk question you're asking about. And 838 01:19:16,199 --> 01:19:20,199 you also want to avoid it unless it's really the only type of sampling possible under the 839 01:19:20,199 --> 01:19:26,920 circumstances. When you have situations where you have patients with very rare disease, 840 01:19:26,920 --> 01:19:32,300 probably convenience sampling from your Rare Disease clinic is reasonable. There, it's 841 01:19:32,300 --> 01:19:38,869 also used when resources are low. And so those are a few good reasons to try to use convenient 842 01:19:38,869 --> 01:19:45,139 sampling. It's really something that you want to use only if it's the thing 843 01:19:45,139 --> 01:19:50,170 you're stuck with. It's much better to look towards these other sampling approaches I 844 01:19:50,170 --> 01:19:56,420 described. 
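The four stages just described for the CDC survey (counties, then segments, then households, then individuals) can be pictured as sampling nested inside sampling. What follows is only a minimal sketch under stated assumptions: the toy sampling frame, the function name `multistage_sample`, and the use of a plain simple random sample at every stage are all made up for illustration, and the real survey's selection rules are more involved than this.

```python
import random


def multistage_sample(frame, sizes, seed=0):
    """Draw a nested sample: counties -> segments -> households -> individuals.

    frame: nested dict {county: {segment: {household: [person, ...]}}}
           (a hypothetical toy frame, not real survey data)
    sizes: how many units to keep at each of the four stages
    """
    rng = random.Random(seed)
    n_counties, n_segments, n_households, n_people = sizes
    selected = []
    # Stage 1: sample counties from the whole frame.
    for county in rng.sample(sorted(frame), min(n_counties, len(frame))):
        segments = frame[county]
        # Stage 2: sample segments only inside the chosen counties.
        for segment in rng.sample(sorted(segments), min(n_segments, len(segments))):
            households = segments[segment]
            # Stage 3: sample households only inside the chosen segments.
            for household in rng.sample(sorted(households), min(n_households, len(households))):
                people = households[household]
                # Stage 4: sample individuals only inside the chosen households.
                for person in rng.sample(people, min(n_people, len(people))):
                    selected.append((county, segment, household, person))
    return selected


# Hypothetical toy frame: 5 counties x 3 segments x 4 households x 3 people.
frame = {
    f"county{c}": {
        f"segment{c}.{s}": {
            f"household{c}.{s}.{h}": [f"person{c}.{s}.{h}.{p}" for p in range(3)]
            for h in range(4)
        }
        for s in range(3)
    }
    for c in range(5)
}

sample = multistage_sample(frame, sizes=(2, 2, 2, 1))
for row in sample:
    print(row)
```

Notice how the frame has to be large at the top for anything to be left at the bottom: 180 people in the toy frame shrink to 2 x 2 x 2 x 1 = 8 sampled individuals.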
And then finally, multistage sampling is usually used in large governmental studies. 845 01:19:56,420 --> 01:20:00,739 So don't expect to actually design anything alone with multistage sampling when that 846 01:20:00,739 --> 01:20:06,010 happens. I showed you those four things for that survey that the CDC does; hundreds of 847 01:20:06,010 --> 01:20:11,480 people work on that. Even just the sampling, tons of people work to try and set that up. It's 848 01:20:11,480 --> 01:20:16,929 very difficult. But I wanted you to know about that kind of sampling, because it's important 849 01:20:16,929 --> 01:20:23,909 in healthcare, and it happens a lot. So in conclusion, we made it through the sampling 850 01:20:23,909 --> 01:20:29,920 lecture, didn't we? I first started by describing some definitions you needed to be able to 851 01:20:29,920 --> 01:20:35,400 understand all these different types of sampling. Then I went into simple random sampling, and 852 01:20:35,400 --> 01:20:41,120 showed you how to do it two different ways and what it achieves and also its limitations. 853 01:20:41,120 --> 01:20:46,800 We next talked about stratified sampling, why you do that and how you do that, and the 854 01:20:46,800 --> 01:20:51,800 limitations of that one, too. Then we got into systematic sampling, which is a little 855 01:20:51,800 --> 01:20:58,719 more flexible, and pretty easy to explain. Next, we talked about cluster sampling, and 856 01:20:58,719 --> 01:21:04,369 why you might need to pull that tool out of your sampling toolbox. And then finally, we 857 01:21:04,369 --> 01:21:10,219 covered convenience sampling and multistage sampling. All right. Well, I hope you better 858 01:21:10,219 --> 01:21:15,679 understand sampling now and can keep all of these different types of sampling straight 859 01:21:15,679 --> 01:21:26,679 in your mind. Hello, everybody, it's Monica Wahi, Laboure College lecturer for statistics. 860 01:21:26,679 --> 01:21:35,489 We're on to Section 1.3. 
Introduction to experimental design. And here are your learning objectives. 861 01:21:35,489 --> 01:21:41,460 So at the end of this lecture, you should be able to first state the steps of conducting 862 01:21:41,460 --> 01:21:47,099 a statistical study, and then select one step of developing a statistical study and state 863 01:21:47,099 --> 01:21:52,610 the reason for the step, you should be able to name one common mistake that can introduce 864 01:21:52,610 --> 01:21:59,199 bias into a survey and give an example should be able to explain what a lurking variable 865 01:21:59,199 --> 01:22:05,110 is, and give an example of that. And you should be able to define what a completely randomized 866 01:22:05,110 --> 01:22:06,110 experiment 867 01:22:06,110 --> 01:22:07,449 is. 868 01:22:07,449 --> 01:22:12,789 So let's get started. This lecture is in a cover four basic topics. First, we're going 869 01:22:12,789 --> 01:22:19,829 to look at the steps to conducting a statistical study, you may think there's a lot of steps 870 01:22:19,829 --> 01:22:26,389 to conducting a study, this is from the point of view of the statistician. Okay? Then we're 871 01:22:26,389 --> 01:22:30,650 gonna go over basic terms and definitions. And by now, you're probably used to the fact 872 01:22:30,650 --> 01:22:37,350 that in statistics, certain words are reappropriated. And they mean something specific in statistics. 873 01:22:37,350 --> 01:22:42,420 So we'll talk about that. Then we'll talk about bias and what that is and how to avoid 874 01:22:42,420 --> 01:22:48,800 it in when designing your studies. Finally, we'll talk about randomization in particular 875 01:22:48,800 --> 01:22:56,409 topics you need to think about when thinking about randomization. So let's get started. 876 01:22:56,409 --> 01:23:01,889 We're going to start with, of course, basic terms and definitions. 
And so first, we're 877 01:23:01,889 --> 01:23:06,989 going to review these steps that I keep talking about for conducting a statistical study. But 878 01:23:06,989 --> 01:23:13,130 there's some vocabulary that comes up. And so we're going to talk about those 879 01:23:13,130 --> 01:23:17,870 vocabulary terms that come up. And then also, I'm going to give you a few examples from 880 01:23:17,870 --> 01:23:24,829 healthcare. So here are the steps I keep talking about. So these are the basic guidelines for 881 01:23:24,829 --> 01:23:29,900 planning a statistical study. So the first thing you want to do is state your hypothesis. 882 01:23:29,900 --> 01:23:34,840 And you know, I've been a scientist a while now. And I can't tell you how many times I get 883 01:23:34,840 --> 01:23:40,239 in a group of us, and people are all curious, and they start thinking about, let's do a study. 884 01:23:40,239 --> 01:23:44,119 And it's only halfway through our conversation that I suddenly say, Hey, wait a second, we 885 01:23:44,119 --> 01:23:49,110 don't have a hypothesis, what's our hypothesis? So it's easy, even for scientists, to forget 886 01:23:49,110 --> 01:23:56,429 that that's really step one, is you have to have a hypothesis. And so whatever hypothesis 887 01:23:56,429 --> 01:24:03,991 you pick, the hypothesis is about some individuals. If I have a hypothesis about hospitals, those 888 01:24:03,991 --> 01:24:09,000 are the individuals. If I have a hypothesis about patients, those are the individuals. But it's 889 01:24:09,000 --> 01:24:14,680 important, actually, to nail that down. Because am I talking about patients in the hospitals? 890 01:24:14,680 --> 01:24:19,889 Or am I talking about the hospitals? So make sure that you understand, after you, you know, 891 01:24:19,889 --> 01:24:27,400 percolate and decide on your hypothesis, who the actual individuals of interest are. 
And 892 01:24:27,400 --> 01:24:33,369 that's because you're going to have to measure variables about these individuals. 893 01:24:33,369 --> 01:24:38,900 So step three is to specify all the variables you're going to need to measure about these 894 01:24:38,900 --> 01:24:41,460 individuals. You know, and of course, they relate to the 895 01:24:41,460 --> 01:24:43,140 hypothesis. 896 01:24:43,140 --> 01:24:50,010 So it's a good thing that was step one, right? Step four is to determine whether you want 897 01:24:50,010 --> 01:24:57,610 to use the entire population in your study or a sample. If you already have a bunch of 898 01:24:57,610 --> 01:25:02,469 data, like you have the census data, you might as well use the entire population. But 899 01:25:02,469 --> 01:25:06,500 typically, if you don't have the data, you're going to want to sit down and think about 900 01:25:06,500 --> 01:25:11,340 using a sample. And if you do that, while you're sitting down, you should probably also 901 01:25:11,340 --> 01:25:19,030 choose the sampling method on the basis of what I talked about in the sampling lecture. 902 01:25:19,030 --> 01:25:23,670 Now that you've figured out your hypothesis, you got your individuals, you figured out 903 01:25:23,670 --> 01:25:27,920 your variables, and you figured out whether you're going to do a census or a sample, and if 904 01:25:27,920 --> 01:25:33,889 you're going to do a sample, what type of sample, step five is you think about the ethical concerns 905 01:25:33,889 --> 01:25:38,830 before data collection. If you're going to be asking some sensitive questions, you think 906 01:25:38,830 --> 01:25:44,530 about privacy. If you're going to be doing some invasive procedures, you think about 907 01:25:44,530 --> 01:25:48,929 how painful that would be, and how hard that would be on somebody, especially if they're 908 01:25:48,929 --> 01:25:54,199 not even, you know, sick, they're just healthy. 
And you're just doing an experiment on healthy 909 01:25:54,199 --> 01:25:58,690 people just to better understand biology. So you have to really sit down and think about 910 01:25:58,690 --> 01:26:05,679 these ethical concerns. And they may slightly change your study design. Finally, after 911 01:26:05,679 --> 01:26:11,409 you get steps one through five taken care of, that's when you actually jump in 912 01:26:11,409 --> 01:26:16,909 and collect the data. And like I was saying, you know, when I meet with my scientist friends, 913 01:26:16,909 --> 01:26:21,850 we get all excited about an idea. We're often talking about step six, we're like, oh, we 914 01:26:21,850 --> 01:26:27,690 should do a survey, we should do this, we should do that. And I end up saying, Hey, 915 01:26:27,690 --> 01:26:32,010 we actually have to go back to step one and start talking about a hypothesis, because 916 01:26:32,010 --> 01:26:36,670 I suddenly realize, I don't even know what data to collect, right? If you don't go through 917 01:26:36,670 --> 01:26:43,929 the steps in order, you really aren't doing it right. Step seven is, after you get the 918 01:26:43,929 --> 01:26:50,239 data, you finally use either descriptive or inferential statistics to answer your hypothesis. 919 01:26:50,239 --> 01:26:57,410 And that's what statistics is about. It's here for that. And then finally, after you 920 01:26:57,410 --> 01:27:03,330 use the statistics, you have to write up what you find, even if you're at a workplace and 921 01:27:03,330 --> 01:27:07,130 they ask you to do a little survey. That happened once when I was working somewhere, 922 01:27:07,130 --> 01:27:13,989 and they wanted us to do a survey. Their hypothesis was that they didn't have enough leadership 923 01:27:13,989 --> 01:27:18,670 programs, and they weren't building good leaders they could promote. 
And so I was on a team 924 01:27:18,670 --> 01:27:24,070 that did the survey, we didn't, you know, really publish it, like, everywhere. But we 925 01:27:24,070 --> 01:27:30,219 made an internal report, right. And in that internal report, we had to do step eight, 926 01:27:30,219 --> 01:27:36,630 which we had to note any concerns about data collection or analysis, you know, that happened 927 01:27:36,630 --> 01:27:41,800 when we were doing a report. And we also had to make recommendations for future studies, 928 01:27:41,800 --> 01:27:49,390 or if you wanted to study this in future groups of employees. So in science, what it usually 929 01:27:49,390 --> 01:27:57,699 ends up being is a peer reviewed literature report, right? is you do a scientific study, 930 01:27:57,699 --> 01:28:03,050 maybe you get a grant. And then you do all these steps. And then step eight is where 931 01:28:03,050 --> 01:28:09,119 you actually prepare a journal publication. And in that, you have to note any concerns 932 01:28:09,119 --> 01:28:13,739 about your data collection or analysis, anything that might have gone wrong, or not gone exactly 933 01:28:13,739 --> 01:28:22,190 the way you planned, or something you need to take into account to really properly interpret 934 01:28:22,190 --> 01:28:28,010 what the study found. You also want to make recommendations for future studies, especially 935 01:28:28,010 --> 01:28:33,039 if you screwed something up, or especially if you answered a really good question. No 936 01:28:33,039 --> 01:28:39,360 reason to per separate on that question, why don't we move forward and ask the next one. 937 01:28:39,360 --> 01:28:45,980 Now, these are a lot of steps to remember. So I'm going to help you try to remember them 938 01:28:45,980 --> 01:28:51,760 in sort of clumps. 
So let's look at the first clump, which is steps one through three, 939 01:28:51,760 --> 01:28:59,650 which is state a hypothesis, identify the individuals of interest, and specify the variables to 940 01:28:59,650 --> 01:29:08,000 measure. So let's give an example of that. So let's say our hypothesis was air pollution 941 01:29:08,000 --> 01:29:13,480 causes asthma in children who live in urban settings. You know, that's how we'd state it, 942 01:29:13,480 --> 01:29:19,080 or we could say that as a research question, like, does air pollution cause asthma in children 943 01:29:19,080 --> 01:29:24,659 who live in urban settings? And so in that case, the individuals would be children in 944 01:29:24,659 --> 01:29:30,369 urban settings, and the variables we'd have to measure are air pollution at least, and 945 01:29:30,369 --> 01:29:35,639 asthma at least. And of course, we'd want to know more things about these individuals, 946 01:29:35,639 --> 01:29:40,780 these children. We'd probably measure their income and where exactly they were living, 947 01:29:40,780 --> 01:29:46,579 and how old they were, and if they're male or female, and these kinds of things, but 948 01:29:46,579 --> 01:29:52,700 that just kind of helps you think about the first three steps together. Now let's think 949 01:29:52,700 --> 01:29:58,110 about the second three steps together, four, five, and six, which is determine if you're 950 01:29:58,110 --> 01:30:03,360 going to use a population or sample, and if it's a sample, pick the sampling method, look at 951 01:30:03,360 --> 01:30:11,619 the ethical concerns, and then actually collect the data. So, when you do that, you can either 952 01:30:11,619 --> 01:30:18,309 quote unquote collect data, you know, like, by using existing data, by downloading data 953 01:30:18,309 --> 01:30:24,429 from the census, or like Medicare, they have data sets available that are de-identified, 954 01:30:24,429 --> 01:30:28,719 so you don't know who exactly is in there. 
Or you can collect data yourself, like do 955 01:30:28,719 --> 01:30:37,230 a survey or, you know, get a bunch of patients that will allow you to measure them. When you 956 01:30:37,230 --> 01:30:43,600 use a government data set, often you can make population measures out of it. And so 957 01:30:43,600 --> 01:30:49,630 you don't really have to go through a lot of sampling, or ethics, because they've already 958 01:30:49,630 --> 01:30:57,150 provided it for you. And it's confidential. And that's kind of your data collection. But 959 01:30:57,150 --> 01:31:02,079 most of the time, what you'll see, especially for studying patients, and treatments, and 960 01:31:02,079 --> 01:31:07,559 cures, and things like that, those are on a smaller scale. So you end up collecting 961 01:31:07,559 --> 01:31:14,489 data from a sample for those estimates. And again, you need to choose a sampling approach. 962 01:31:14,489 --> 01:31:20,809 And then you need consent, if legally found to be human research. So I just want to share 963 01:31:20,809 --> 01:31:26,789 with you, in case you didn't know, if you want to go do research on humans, you're a nursing 964 01:31:26,789 --> 01:31:33,250 student, or you're a medical student or a dental student, any student, or you're a dentist, 965 01:31:33,250 --> 01:31:40,619 a physician, whatever, a nurse, you can't just make up a survey or study design and 966 01:31:40,619 --> 01:31:47,119 go out and do it. You have to get approval from an ethical board. And that ethical board 967 01:31:47,119 --> 01:31:53,099 will tell you, if what you're doing is considered legally human research, that you need to get 968 01:31:53,099 --> 01:32:00,719 consent from the patients or the participants in your study if they're humans. And if you're 969 01:32:00,719 --> 01:32:05,710 collecting data about children, for example, you have to get the consent of their parents 970 01:32:05,710 --> 01:32:10,449 and the assent of the children. 
And in the United States, that way, we have a setup, 971 01:32:10,449 --> 01:32:15,469 it's called an institutional review board for the protection of human subjects and research 972 01:32:15,469 --> 01:32:22,640 or the short answer is IRB. And so I just want to make sure that if you ever do design 973 01:32:22,640 --> 01:32:27,079 a study that you know about this IRB thing, and you realize you have to go through this 974 01:32:27,079 --> 01:32:32,980 ethical board and make sure that they're cool with it. Before you can move on to the next 975 01:32:32,980 --> 01:32:40,219 step of designing a statistical study. All right, finally, we're on to the last clump 976 01:32:40,219 --> 01:32:47,239 of steps, which is seven, and eight, right? So that's using descriptive or inferential 977 01:32:47,239 --> 01:32:51,880 statistics to answer your hypothesis you in six, you collected the data. Now we're going 978 01:32:51,880 --> 01:32:56,969 to do the statistics. And then step eight is noting any concerns about your data collection 979 01:32:56,969 --> 01:33:00,349 or analysis and making recommendations for future studies. 980 01:33:00,349 --> 01:33:05,010 So you can kind of imagine this is where we're sitting in our offices, and writing up our 981 01:33:05,010 --> 01:33:10,000 research, whether we're writing an internal report to our bosses, over writing for the 982 01:33:10,000 --> 01:33:17,599 scientific literature to publish for everybody. So at this point, I just want to remind you 983 01:33:17,599 --> 01:33:23,880 that it matters whether you picked a census or a sample, for your study design. Because 984 01:33:23,880 --> 01:33:28,309 if you pick the census, you're going to do a certain kind of analysis. And if you pick 985 01:33:28,309 --> 01:33:33,119 the sample, you're going to do a different kind of analysis and statistics. So again, 986 01:33:33,119 --> 01:33:40,361 that's all kind of cycles back to your study design. 
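The eight steps and the three clumps used to remember them can be written down as a small checklist. This is only a sketch for keeping the order straight: the step wording is paraphrased from the lecture, and the clump labels and the helper names `steps_in_clump` and `can_start` are made up for illustration.

```python
# The eight steps of a statistical study, in the order given in the lecture.
STEPS = [
    "State the hypothesis",
    "Identify the individuals of interest",
    "Specify the variables to measure",
    "Decide between the entire population and a sample (and pick a sampling method)",
    "Consider ethical concerns before data collection",
    "Collect the data",
    "Use descriptive or inferential statistics to answer the hypothesis",
    "Note concerns about data collection or analysis and recommend future studies",
]

# The three clumps used to remember the steps (labels are illustrative only).
CLUMPS = {
    "what and who": (1, 3),
    "how to collect": (4, 6),
    "analyze and report": (7, 8),
}


def steps_in_clump(name):
    """Return the (number, description) pairs belonging to one clump."""
    first, last = CLUMPS[name]
    return [(n, STEPS[n - 1]) for n in range(first, last + 1)]


def can_start(step, completed):
    """A step may start only when every earlier step has been completed."""
    return all(earlier in completed for earlier in range(1, step))


for name in CLUMPS:
    print(name)
    for number, description in steps_in_clump(name):
        print(f"  step {number}: {description}")

# Jumping straight to data collection (step six) without a hypothesis is not allowed.
print(can_start(6, completed=set()))            # False
print(can_start(6, completed={1, 2, 3, 4, 5}))  # True
```

The `can_start` check mirrors the point about order: talking about a survey (step six) before stating a hypothesis (step one) means you do not yet know what data to collect.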
And what's important here is I want 987 01:33:40,361 --> 01:33:48,429 to talk to you about the two different main types of studies. Now within these two categories, 988 01:33:48,429 --> 01:33:54,039 you have different subtypes. But these are the two main types that you can have. The 989 01:33:54,039 --> 01:34:00,309 first is called an experiment. experiment is where a treatment or intervention is deliberately 990 01:34:00,309 --> 01:34:07,389 assigned to the individuals. So you can kind of imagine that if you enter a study, and 991 01:34:07,389 --> 01:34:12,119 they assign you to take a drug in the study that you weren't taking before, that would 992 01:34:12,119 --> 01:34:17,810 be an experiment. But another thing could happen. I mean, you could do this to individuals, 993 01:34:17,810 --> 01:34:22,520 you could do it to animals, but you could do it, I keep getting the example of hospitals, 994 01:34:22,520 --> 01:34:28,730 we could choose some hospitals and say, Hey, you need to try a new policy as the intervention 995 01:34:28,730 --> 01:34:34,869 and and that was assigned by the researcher. So that makes this an experiment. And the 996 01:34:34,869 --> 01:34:40,290 reason why we have experiments is sometimes you need them. The purpose is to study the 997 01:34:40,290 --> 01:34:47,159 possible effect of the treatment or the intervention on the variables measured. And so that's one 998 01:34:47,159 --> 01:34:52,679 option you can do is have an experimental study where the researcher assigns the individuals 999 01:34:52,679 --> 01:35:01,309 to do certain things in the study. There's another kind of study The other kind, which 1000 01:35:01,309 --> 01:35:07,211 is called observational, and the way you can think about it is in experiments, the researcher 1001 01:35:07,211 --> 01:35:13,130 does something, they intervene, they give a treatment, right? But an observational, 1002 01:35:13,130 --> 01:35:21,699 the researcher doesn't do that the researchers just observes. 
So, if you enroll in the study, 1003 01:35:21,699 --> 01:35:25,270 and you say, Do I have to take a drug? Am I supposed to eat something? What am I supposed 1004 01:35:25,270 --> 01:35:30,429 to do? And the researcher just says, No, we're just going to measure you, we're just going 1005 01:35:30,429 --> 01:35:34,030 to ask you questions, and we're going to measure things about you, we're not going to tell 1006 01:35:34,030 --> 01:35:40,010 you to do anything different, then you're in an observational study. So no treatment 1007 01:35:40,010 --> 01:35:44,880 or intervention is assigned by the researcher in an observational study. Now, let's say 1008 01:35:44,880 --> 01:35:48,090 you're taking a drug, you know, just because maybe you have migraines, you're taking a 1009 01:35:48,090 --> 01:35:51,789 migraine drug, well, you just keep taking it, or you can stop taking it, you know, they 1010 01:35:51,789 --> 01:35:55,560 don't care, they might ask you about taking the drug, but they're not going to assign 1011 01:35:55,560 --> 01:36:02,869 you to take it. It's an observational study. I wanted to give you a couple of real life 1012 01:36:02,869 --> 01:36:11,199 examples. So Women's Health Initiative up on the slide was mainly an experiment, okay. 1013 01:36:11,199 --> 01:36:16,310 This is was run by the United States government, but of course, had the cooperation of many, 1014 01:36:16,310 --> 01:36:24,040 many universities and, and health care centers, and most importantly, women. So women in America, 1015 01:36:24,040 --> 01:36:29,560 women who were postmenopausal, volunteered to be in the study. And the study actually 1016 01:36:29,560 --> 01:36:37,349 had two separate sections, the experiment section, and the observational study section. 
1017 01:36:37,349 --> 01:36:42,310 They really wanted women to qualify for the experiment, and that the purpose of the experiment 1018 01:36:42,310 --> 01:36:48,320 was to study whether hormone replacement therapy, which is a therapy for symptoms that women 1019 01:36:48,320 --> 01:36:54,630 can get if they're postmenopausal, that are unpleasant. What whether that therapy is good 1020 01:36:54,630 --> 01:37:00,949 for women, or bad for women, because they thought maybe it helps them the post menopause 1021 01:37:00,949 --> 01:37:08,829 system symptoms. But they thought maybe it causes cancer, right? So they know. So what 1022 01:37:08,829 --> 01:37:14,760 they had to do was assign, get a bunch of women who were agreeing, you know that they 1023 01:37:14,760 --> 01:37:20,000 would take whatever was assigned to them. And they had to assign the drug to some of 1024 01:37:20,000 --> 01:37:25,570 these women. So that's what made an experiment. The problem is not all the women qualified 1025 01:37:25,570 --> 01:37:31,270 for the study. So they had a separate observational study, if if the woman did not qualify to 1026 01:37:31,270 --> 01:37:38,599 get the experimental drug assigned to her, then she could be in the observational study. 1027 01:37:38,599 --> 01:37:43,750 And because this is these big government studies, why not, you know, somebody wants to be in 1028 01:37:43,750 --> 01:37:49,800 a study, why not study them, just put them in the observational section. 1029 01:37:49,800 --> 01:37:57,789 A very huge, popular long, ongoing study. That's an observational study, again, run 1030 01:37:57,789 --> 01:38:03,730 by Well, this one actually started out of Harvard. And that's called the nurses Health 1031 01:38:03,730 --> 01:38:10,670 Study. Some really smart person figured out a long time ago, that nurses are, are smart 1032 01:38:10,670 --> 01:38:16,280 people, they understand their own health, they understand other people's health. 
And 1033 01:38:16,280 --> 01:38:21,820 they're good at filling out surveys about health. So they started studying nurses and 1034 01:38:21,820 --> 01:38:26,829 regularly sending them surveys, of course, they didn't tell the nurses what to do. They 1035 01:38:26,829 --> 01:38:31,940 didn't assign the nurses any sort of drug to take or any diet or intervention or anything. 1036 01:38:31,940 --> 01:38:38,460 They just observe the nurses, they send the nurses a survey, and about the nurses health, 1037 01:38:38,460 --> 01:38:43,320 and then the nurse vault fills out that information. I think it's every two years that they do 1038 01:38:43,320 --> 01:38:44,320 that, 1039 01:38:44,320 --> 01:38:46,020 they're still doing it. 1040 01:38:46,020 --> 01:38:53,989 Also, at this point, I do want to point out the concept of replication. So just the word 1041 01:38:53,989 --> 01:39:03,030 replication, right, regular speaking means to copy, right? Like, if you ever, you know, 1042 01:39:03,030 --> 01:39:08,770 have a new roommate, you might need to replicate your key. So you have a copy of the key for 1043 01:39:08,770 --> 01:39:16,079 the new roommate? Well, part of the whole science thing is that studies must be done 1044 01:39:16,079 --> 01:39:20,820 rigorously enough to be replicated. So those are little keywords in there. A rigorous study 1045 01:39:20,820 --> 01:39:28,659 means one that's done really carefully, like thinking about sampling very carefully. You 1046 01:39:28,659 --> 01:39:34,309 know, like avoiding, for example, non sampling error not being sloppy, not getting a lot 1047 01:39:34,309 --> 01:39:40,870 of under coverage, using a good sampling frame. You know, I'm just giving you examples that 1048 01:39:40,870 --> 01:39:45,980 you might know about. But there's a lot of things that have to be done in research to 1049 01:39:45,980 --> 01:39:50,599 do it properly. It's just like driving or anything else. 
You really have to keep your 1050 01:39:50,599 --> 01:39:55,969 eye on a lot of different things and you want to try to do them perfectly. And the main 1051 01:39:55,969 --> 01:40:01,309 reason why you want to do that is so if somebody tries to do this same experiment you did or 1052 01:40:01,309 --> 01:40:05,829 roughly the same experiment you did. Because you can't do exactly the same, right? If I 1053 01:40:05,829 --> 01:40:10,420 study this hospital over here, and somebody wants to study that hospital over there, well, 1054 01:40:10,420 --> 01:40:14,639 they're going to get different people in there, right? But even so if that person decides 1055 01:40:14,639 --> 01:40:20,130 that they want to study that hospital over there, if I did my study rigorously, then 1056 01:40:20,130 --> 01:40:27,099 it won't be so hard for that person to replicate how I did the study. And then we can see if 1057 01:40:27,099 --> 01:40:32,289 that person and my study if we get the same thing, or if there's something slightly off 1058 01:40:32,289 --> 01:40:38,870 or what's going on. And so replicating the results of both observational studies and 1059 01:40:38,870 --> 01:40:44,340 experiments, is necessary for science to progress. So you'll know that a lot of experiments are 1060 01:40:44,340 --> 01:40:50,210 done on drugs, before they can be approved to be given to everybody, because they can't 1061 01:40:50,210 --> 01:40:55,409 just do one study, they have to replicate it, to make sure that the findings are all 1062 01:40:55,409 --> 01:41:01,429 sort of coming in about the same and that we can deduce some information about it, you 1063 01:41:01,429 --> 01:41:09,949 really just don't want to rely on one study for your findings. So I just went over several 1064 01:41:09,949 --> 01:41:14,699 steps that we need to follow when we're doing a statistical study, and we actually have 1065 01:41:14,699 --> 01:41:20,320 to follow them in order. 
And you also have to determine the type of study you're doing, 1066 01:41:20,320 --> 01:41:25,670 you know, is an experiment, or observational study. And there's a ton of study decisions 1067 01:41:25,670 --> 01:41:31,809 you have to make. So you got to keep that in mind. Now, we're going to talk about avoiding 1068 01:41:31,809 --> 01:41:38,290 bias in specifically survey design. Now, you can do a lot of different kinds of studies. 1069 01:41:38,290 --> 01:41:44,230 But let's just talk about surveys, because that happens a lot in nursing. Nurses interact 1070 01:41:44,230 --> 01:41:50,500 with patients a lot, and with the community with each other. And often they gather information 1071 01:41:50,500 --> 01:41:56,000 about those interactions or attitudes or, or how the healthcare system functions by 1072 01:41:56,000 --> 01:42:03,809 using a survey. So surveys can provide a lot of information and useful information. But 1073 01:42:03,809 --> 01:42:08,940 it's important that all aspects of survey design and administration when you're giving 1074 01:42:08,940 --> 01:42:13,980 it, you got to think about minimizing bias and try you know, try to get a representative 1075 01:42:13,980 --> 01:42:21,059 sample trying to get accurate measurements. And so several considerations should be made. 1076 01:42:21,059 --> 01:42:29,320 When you want to think about non response and also voluntary response, okay, so I talked 1077 01:42:29,320 --> 01:42:36,940 a lot about sampling in the previous lecture. But just because you invite someone to participate 1078 01:42:36,940 --> 01:42:41,670 in your study, like maybe you're doing systematic sampling, and every third patient, you asked, 1079 01:42:41,670 --> 01:42:47,130 Would you like to fill out a survey? That doesn't mean they're going to, right? And 1080 01:42:47,130 --> 01:42:51,000 so if that person says no, thank you, even though there were a sample, that's called 1081 01:42:51,000 --> 01:42:56,070 non response. 
So if I was helping you with a survey, and you said, Hey, I was getting 1082 01:42:56,070 --> 01:43:01,769 a lot of non-response, I would look at the proportion. If you approach 100 people, and 1083 01:43:01,769 --> 01:43:09,650 80 said no, you know, that's only a 20% response rate and an 80% non-response rate. If many 1084 01:43:09,650 --> 01:43:16,079 people are refusing your survey, the few who actually complete it are likely to have a biased 1085 01:43:16,079 --> 01:43:17,449 opinion. 1086 01:43:17,449 --> 01:43:26,179 I've noticed this in situations where things are really bad, okay. Like, I remember 1087 01:43:26,179 --> 01:43:34,070 going to a subway station and it was flooded, and it was really in a bad situation. And 1088 01:43:34,070 --> 01:43:40,639 there was a man handing out surveys from the Transportation Authority. And he was like, 1089 01:43:40,639 --> 01:43:46,340 please take my survey, please take my survey. And everybody was waving past him. They didn't 1090 01:43:46,340 --> 01:43:52,190 want to grab a survey. Well, you know me, I've got a bleeding heart for surveys. So I took 1091 01:43:52,190 --> 01:43:58,039 his survey, and I filled it out. You know, I think the Transportation Authority's not 1092 01:43:58,039 --> 01:44:04,730 so bad. Right? I lived in Florida, there's no transportation there, right? And here 1093 01:44:04,730 --> 01:44:10,080 in Massachusetts, we've got a great transportation system, even if it's flooded or doesn't work 1094 01:44:10,080 --> 01:44:15,411 half the time, right. It's way better than not having one. Well, I'm not the only one 1095 01:44:15,411 --> 01:44:21,389 who grabbed a survey. A bunch of nice Pollyannas like me grabbed a survey. So probably the 1096 01:44:21,389 --> 01:44:27,429 Transit Authority thinks that everybody loves the subway, when everybody was waving 1097 01:44:27,429 --> 01:44:32,650 past this poor guy because they were so disgusted that the station was flooded.
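The response-rate arithmetic above can be sketched in a few lines of Python. This is only an illustration: the function name `response_rates` and the counts (100 people approached, 80 refusals) are assumptions chosen so the result matches the 20% response and 80% non-response split mentioned in the lecture.

```python
def response_rates(approached, refused):
    """Return (response_rate, nonresponse_rate) as fractions of people approached.

    Hypothetical helper for illustration; not part of the lecture materials.
    """
    if approached <= 0:
        raise ValueError("approached must be positive")
    if not 0 <= refused <= approached:
        raise ValueError("refused must be between 0 and approached")
    responded = approached - refused
    return responded / approached, refused / approached


# Illustrative numbers (assumed): 100 people approached, 80 said no.
response, nonresponse = response_rates(approached=100, refused=80)
print(f"response rate: {response:.0%}, non-response rate: {nonresponse:.0%}")
```

A low response rate like this does not by itself prove the answers are biased, but it is the signal to ask who was not responding.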
1098 01:44:32,650 --> 01:44:39,130 Right? So if so many people are refusing your survey, a high proportion, the few who will 1099 01:44:39,130 --> 01:44:42,989 actually fill it out are going to be kind of weird, probably like me. You know, you're 1100 01:44:42,989 --> 01:44:49,269 gonna get a bunch of happy people when most of the people who said no might be sad people. 1101 01:44:49,269 --> 01:44:54,750 And so, the reason they may not be completing your survey may have to do with how they 1102 01:44:54,750 --> 01:45:01,140 feel about your topic. This is not just in terms of satisfaction. Let's say you want 1103 01:45:01,140 --> 01:45:07,481 to ask about how many drinks per night somebody has. Okay? Do you think a lot of people who 1104 01:45:07,481 --> 01:45:12,090 are struggling with alcoholism are gonna want to fill out that survey? You know, how about 1105 01:45:12,090 --> 01:45:18,480 illegal drugs or other illegal activity? People who are into that don't always feel so 1106 01:45:18,480 --> 01:45:23,690 good about talking about it. And so, you know, you might get a few people to fill out your 1107 01:45:23,690 --> 01:45:28,330 survey, but those are not necessarily the people who are engaging in the behaviors. 1108 01:45:28,330 --> 01:45:35,590 So the fact that we have the freedom to choose whether or not we want to be in a survey is 1109 01:45:35,590 --> 01:45:41,370 great. But from a researcher's standpoint, you have to be careful. If you get low response 1110 01:45:41,370 --> 01:45:46,300 rates, you need to ask yourself, who was not responding? And, you know, am I missing a 1111 01:45:46,300 --> 01:45:54,989 good share of opinion there? And then, when you get people who do respond, you've got to 1112 01:45:54,989 --> 01:46:02,350 be careful with that too. Respondents may lie on purpose. If you've got a pretty cool 1113 01:46:02,350 --> 01:46:09,389 survey, but you suddenly ask a question that's too personal, people might just lie.
If you 1114 01:46:09,389 --> 01:46:15,900 ask, maybe, students, and you're doing, you know, maybe a satisfaction survey about how 1115 01:46:15,900 --> 01:46:22,530 the front desk runs at a dorm or something, and you, you know, ask a question like, have you 1116 01:46:22,530 --> 01:46:28,970 ever cheated on a test? You know, everybody's probably gonna say no. Also, if you ask a 1117 01:46:28,970 --> 01:46:33,050 question where people don't really know the answer offhand, they're not gonna be able to answer it. 1118 01:46:33,050 --> 01:46:38,639 Like if you, you know, ask a kid who's been living 1119 01:46:38,639 --> 01:46:43,760 in the house forever, when your parents bought the house, how much did it cost? I mean, they're 1120 01:46:43,760 --> 01:46:49,380 not gonna know. Maybe they'll know, but probably not. And so you want to be careful when you 1121 01:46:49,380 --> 01:46:53,870 design your questions that you're not asking anything that's so personal everybody's going to 1122 01:46:53,870 --> 01:46:58,480 lie about it, or that you're not asking a question where, even if people 1123 01:46:58,480 --> 01:47:02,110 try to be accurate, they're probably not going to give you the right answer, because it's just 1124 01:47:02,110 --> 01:47:09,460 too hard to think about. Um, respondents to surveys may also lie without meaning 1125 01:47:09,460 --> 01:47:15,639 to, like, inadvertently. Again, if you ask a question about something that happened 1126 01:47:15,639 --> 01:47:22,060 a really long time ago, they're probably not going to get it right.
This is called recall bias, 1127 01:47:22,060 --> 01:47:27,789 like you can have you can you know how, like, you can look back at a time in your life, 1128 01:47:27,789 --> 01:47:31,860 like, especially if you went through something really harsh, like if you were a part of a 1129 01:47:31,860 --> 01:47:37,099 sports team, and you went to state and it was really tough that you don't remember the 1130 01:47:37,099 --> 01:47:42,270 tough part, right? You sit around singing, you know, your sports songs, and you say, 1131 01:47:42,270 --> 01:47:48,000 Hey, that was awesome. Well, that's recall bias, right? Because after winning state, 1132 01:47:48,000 --> 01:47:53,650 everything looks rosy. But, you know, on the bus, there really wasn't that easy. So people 1133 01:47:53,650 --> 01:47:58,239 tend to have recall bias, it's influenced by events that have happened since the original 1134 01:47:58,239 --> 01:48:01,900 event. So if you're giving people a survey, and you're saying, Well, before you applied 1135 01:48:01,900 --> 01:48:07,929 for nursing school, you know, what did you think this? Or did you think that, you know, 1136 01:48:07,929 --> 01:48:11,730 they might just tell you and think they're telling you the truth, but they're actually 1137 01:48:11,730 --> 01:48:17,010 lying. If you actually managed to go back in time and ask them, then they tell you something 1138 01:48:17,010 --> 01:48:23,929 different. So again, you can kind of screw up your own data by screwing up your own questions. 1139 01:48:23,929 --> 01:48:30,500 So you want to think about how you word your questions. You can also screw up your questions 1140 01:48:30,500 --> 01:48:37,780 by introducing a hidden bias. Something happened to me recently, where a company sent me a 1141 01:48:37,780 --> 01:48:44,329 free app. And they said, try our free app, and I downloaded it, and it was awful. Okay. 1142 01:48:44,329 --> 01:48:51,710 And then about a month later, they sent me a survey. 
And these were the questions they asked. 1143 01:48:51,710 --> 01:48:58,599 When do you use the app? You know, what time of day do you use it? Right? Like, 1144 01:48:58,599 --> 01:49:03,239 how do you use it? Do you read scientific literature? Do you read news? And the problem 1145 01:49:03,239 --> 01:49:07,060 was, I couldn't really answer any of this. Because from the day I downloaded it, I never 1146 01:49:07,060 --> 01:49:13,010 used it. It was so bad. Right? So question wording may induce a certain response. They 1147 01:49:13,010 --> 01:49:18,780 were asking me how do you use this, but they didn't give me a choice of "I don't". So I had 1148 01:49:18,780 --> 01:49:23,289 to say something. I don't even know what I said. I mean, there was nothing I could say 1149 01:49:23,289 --> 01:49:29,330 to be honest, because of that bias. So you have to be careful that you aren't too rosy 1150 01:49:29,330 --> 01:49:35,690 about whatever your topic is, and don't assume everybody loves everything. I mean, you've 1151 01:49:35,690 --> 01:49:39,239 got to put out questions like, are you even using the software? Did you have any problems 1152 01:49:39,239 --> 01:49:45,440 with the software? Right? Just assuming they're using it and liking it and using it, 1153 01:49:45,440 --> 01:49:52,320 you know, like it's supposed to be used, is a big assumption. Order of questions and other 1154 01:49:52,320 --> 01:49:56,420 wording may induce a certain response, and you'll see this a lot if you take a public 1155 01:49:56,420 --> 01:50:04,140 opinion poll. I used to do a lot of polling. We'd ask questions like, how likely are you 1156 01:50:04,140 --> 01:50:10,340 to vote for candidate X? You know, very likely, somewhat likely, somewhat unlikely, or not 1157 01:50:10,340 --> 01:50:15,440 at all likely? And people say, I don't know, not likely. And then you'd say, Well, what 1158 01:50:15,440 --> 01:50:23,590 if you knew that candidate X supported this new proposition, Proposition
69. Right, then 1159 01:50:23,590 --> 01:50:30,510 would you be more likely to vote for candidate x? And so that's why order of questions other 1160 01:50:30,510 --> 01:50:35,280 wording and stuff. They're trying to see if I add this fact that that fact is that going 1161 01:50:35,280 --> 01:50:41,239 to make the person like the candidate better. And so you do have to think about the order 1162 01:50:41,239 --> 01:50:46,269 you put the questions. And if you want to ask about two different subjects, kind of 1163 01:50:46,269 --> 01:50:51,969 think about which subject should come first, because it might color the respondents answering 1164 01:50:51,969 --> 01:50:58,000 of the subsequent subject. And also on the slide, I wanted to point out that the scales 1165 01:50:58,000 --> 01:51:05,039 of questions may not accurately measure responses. Do your feelings always fit on a scale from 1166 01:51:05,039 --> 01:51:10,420 one to five? Well, you know, yelps kind of figured it out. If people's feelings about 1167 01:51:10,420 --> 01:51:15,889 restaurants tend to fit on a scale of one to five, I'd have a lot of trouble filling 1168 01:51:15,889 --> 01:51:22,140 that out if they gave me a scale of one to 17. Right. But sometimes people have more 1169 01:51:22,140 --> 01:51:28,270 granular feelings about things, maybe they need a longer scale one to seven. Um, you'll 1170 01:51:28,270 --> 01:51:34,610 see a lot of pain scales, where they offer more than just five choices, because probably 1171 01:51:34,610 --> 01:51:41,699 pain can maybe go from one to seven or one to 10. So think about your scales when you're 1172 01:51:41,699 --> 01:51:51,981 creating these questions, because that's your choice if you're designing the study. Another 1173 01:51:51,981 --> 01:51:58,659 point to be made is the influence of the interviewer. 
Now, we don't have as much interviewing going 1174 01:51:58,659 --> 01:52:03,559 on these days, because we have the internet, where we can do anonymous surveys, and people 1175 01:52:03,559 --> 01:52:11,210 just fill them out by self-report. We have robo phones, so you can robocall. And using 1176 01:52:11,210 --> 01:52:18,869 an automated voice that's obviously not a person, you can get survey data. But there's 1177 01:52:18,869 --> 01:52:22,989 always situations where you actually have to interview people, especially if somebody 1178 01:52:22,989 --> 01:52:28,550 is really sick in bed, and you have to show up there, you have to talk to them. And so 1179 01:52:28,550 --> 01:52:34,750 even on the phone, you have to interview people, and they can hear your voice, right. So you've 1180 01:52:34,750 --> 01:52:39,480 got to think about how you're pairing up whoever's being interviewed with whoever's 1181 01:52:39,480 --> 01:52:45,829 interviewing. Um, I've found that it's best to have the interviewer come from the same 1182 01:52:45,829 --> 01:52:52,400 population as the research participant, in general. The only time that can be a problem 1183 01:52:52,400 --> 01:52:59,159 is if they're from the same community and there's a privacy issue. But it can be very helpful, 1184 01:52:59,159 --> 01:53:07,530 for the most part, not always, to have your interviewers actually be from the population 1185 01:53:07,530 --> 01:53:13,500 that you would be studying, you know, from the individuals that you would be studying. 1186 01:53:13,500 --> 01:53:19,690 So for instance, say you need to interview a bunch of young African American, you know, 1187 01:53:19,690 --> 01:53:25,860 like some African American teenage men. I recently saw a study on how health care 1188 01:53:25,860 --> 01:53:30,900 in the United States really isn't suited for them. And it needs to improve and needs to 1189 01:53:30,900 --> 01:53:36,410 better cater to this population.
Well, let's say you wanted to better understand that. 1190 01:53:36,410 --> 01:53:41,330 The best thing would be to hire a young African American male and train him on how 1191 01:53:41,330 --> 01:53:44,909 to be a good interviewer and a good data collector, because you'd probably get the best 1192 01:53:44,909 --> 01:53:47,249 data that way. 1193 01:53:47,249 --> 01:53:53,020 On the other hand, let's think of different ways that that could go. You could take a 1194 01:53:53,020 --> 01:54:02,460 person who is older, who is maybe of a different race, and maybe that would change how this 1195 01:54:02,460 --> 01:54:08,250 young African American male would respond to this interviewer. I mean, the interviewer 1196 01:54:08,250 --> 01:54:17,889 could be, in many ways, like the respondent, but the respondent's perception might change 1197 01:54:17,889 --> 01:54:25,829 how they answer. All verbal and nonverbal influences matter, you know, clothing, the 1198 01:54:25,829 --> 01:54:30,670 setting that the person's being interviewed in. And so I'm not saying there's really a 1199 01:54:30,670 --> 01:54:37,410 solution to all this. I'm just saying, make some good decisions. Like I remember working 1200 01:54:37,410 --> 01:54:45,250 on a data set where there were some questions that had been asked of some older men about 1201 01:54:45,250 --> 01:54:51,510 their sexual function. And the data looked funny to me, and the statistician 1202 01:54:51,510 --> 01:54:57,900 who was there during data collection told me that they had chosen young, female nursing 1203 01:54:57,900 --> 01:55:04,429 students to interview these elderly men about their sexual habits. And I just said, you 1204 01:55:04,429 --> 01:55:14,130 know, that might be subject to interviewer influence. And then you of course have to 1205 01:55:14,130 --> 01:55:19,999 worry about vague wording.
Just because it looks clear to you doesn't mean it looks clear 1206 01:55:19,999 --> 01:55:27,849 to everyone. There are simple ways of avoiding vague terms in the survey, when you can just 1207 01:55:27,849 --> 01:55:32,619 put a number on it. So instead of asking a person, if they've waited a long time in the 1208 01:55:32,619 --> 01:55:40,420 waiting room, you can say, more than 10 minutes. You can say exactly like within the last month, 1209 01:55:40,420 --> 01:55:47,119 have you done certain a certain activity or within the next year? Do you expect to change 1210 01:55:47,119 --> 01:55:54,110 schools or whatever. And so try to wherever you can use numbers or something very specific, 1211 01:55:54,110 --> 01:55:59,580 you know, instead of go to the clinic, go to the public health clinic at this particular 1212 01:55:59,580 --> 01:56:05,769 corner, or whatever. And then you're going to get some pretty accurate information. 1213 01:56:05,769 --> 01:56:06,769 But 1214 01:56:06,769 --> 01:56:11,540 sometimes you're stuck using vague terms, because you're studying vague terms, right? 1215 01:56:11,540 --> 01:56:18,789 I was doing a study of controllable lifestyle attitudes towards controllable lifestyle in 1216 01:56:18,789 --> 01:56:24,110 medical students. So we asked this question, how important is having a controllable lifestyle 1217 01:56:24,110 --> 01:56:29,000 to you in your future career? Well, what does that mean? That's pretty vague. So what we 1218 01:56:29,000 --> 01:56:32,909 did is we use this grounding this anchoring language, 1219 01:56:32,909 --> 01:56:38,900 we added the sentence, a controllable lifestyle is defined as one that allows the physician 1220 01:56:38,900 --> 01:56:44,699 to control the number of hours devoted to practicing his or her specialty. 
So 1221 01:56:44,699 --> 01:56:49,849 we're talking about something kind of waffly and watery, loosey-goosey, like a controllable 1222 01:56:49,849 --> 01:56:54,570 lifestyle. Who knows what that means? And that's not to say that that sentence couldn't 1223 01:56:54,570 --> 01:57:00,090 be interpreted differently by people; it certainly is. But if you're stuck with vague wording, 1224 01:57:00,090 --> 01:57:04,110 try to put some grounding language in it, so everybody's at least sort of led in the 1225 01:57:04,110 --> 01:57:11,809 same direction with their thought before they answer the question. Now, I want to also point 1226 01:57:11,809 --> 01:57:15,560 out, you probably have noticed there's all these issues you have to think about when 1227 01:57:15,560 --> 01:57:22,480 doing surveys. There's this other issue called the lurking variable. Well, you know, lurk 1228 01:57:22,480 --> 01:57:29,139 means to sneak around behind the scenes, right? Behind the scenes, a lurking variable is a 1229 01:57:29,139 --> 01:57:35,730 variable that's associated with a condition, but it may not actually cause it. I remember 1230 01:57:35,730 --> 01:57:43,020 when I was studying epidemiology, they talked about how a lot of the people who, 1231 01:57:43,020 --> 01:57:49,429 unfortunately, got in motorcycle accidents had tattoos. So therefore, 1232 01:57:49,429 --> 01:57:53,679 they said, nobody should get a tattoo, you might get in a motorcycle accident. 1233 01:57:53,679 --> 01:57:59,199 Well, that's a great example of a lurking variable. Yeah, a lot of people who do get 1234 01:57:59,199 --> 01:58:05,170 into motorcycle accidents have tattoos, but the tattoos don't cause that.
Um, we 1235 01:58:05,170 --> 01:58:10,489 also know that having more education increases income, but people have the same education 1236 01:58:10,489 --> 01:58:14,630 level do not all make the same income, there's this thing, you know, called, it's sexism. 1237 01:58:14,630 --> 01:58:21,370 And it's called racism. So it matters whether you're a woman or a man, it matters, the color 1238 01:58:21,370 --> 01:58:27,249 of your skin. If the you know, if you've got a darker skin, doesn't matter, that you have 1239 01:58:27,249 --> 01:58:32,579 the same education as somebody with lighter skin, you're still gonna make less money. 1240 01:58:32,579 --> 01:58:37,079 And so you have these lurking variables behind the scenes. So when people are looking at 1241 01:58:37,079 --> 01:58:41,780 Well, why are people you know, making less income, because they're less educated, whatever? 1242 01:58:41,780 --> 01:58:50,239 Well, you got to look for also the lurking variables. So current studies show that why 1243 01:58:50,239 --> 01:58:54,369 women and African Americans make less money on the whole, it's not explained by fewer 1244 01:58:54,369 --> 01:59:01,380 of them working or fewer of them getting degrees. It's really these lurking variables. And so 1245 01:59:01,380 --> 01:59:07,639 you got to think critically. And I guess what I would say is, whenever you do a survey, 1246 01:59:07,639 --> 01:59:12,390 if you're studying something that has a lot of lurking variables associated with it, make 1247 01:59:12,390 --> 01:59:17,929 sure you measure those variables. Like early studies where they were looking to see if 1248 01:59:17,929 --> 01:59:24,999 drinking a lot of alcohol causes lung cancer. Some of them forgot to really study how much 1249 01:59:24,999 --> 01:59:31,519 these people would smoke. Because we know smoking causes lung cancer. 
And we know if 1250 01:59:31,519 --> 01:59:36,179 you're hanging out in a place with a lot of drinking and they allow smoking, you'll see 1251 01:59:36,179 --> 01:59:41,119 a lot of people smoking too. They seem to go hand in hand. So you don't want to miss 1252 01:59:41,119 --> 01:59:46,630 measuring variables that you think might be lurking variables. It's no problem to measure 1253 01:59:46,630 --> 01:59:54,570 them and not use them later, but just make sure they're included. So, as a final note 1254 01:59:54,570 --> 02:00:01,499 on bias, I just want to point out that survey results are so important. for healthcare, 1255 02:00:01,499 --> 02:00:07,170 and for the progression of science, that you really owe it to even a simplest survey, to 1256 02:00:07,170 --> 02:00:12,610 think about all of these things, these possible things that could go wrong, just with the 1257 02:00:12,610 --> 02:00:17,989 wording of questions or with how you're approaching things, and just really consider how you can 1258 02:00:17,989 --> 02:00:24,449 improve it. It's really important to pay attention to avoiding bias when you're designing and 1259 02:00:24,449 --> 02:00:31,750 conducting your survey. So think about all these things at the design phase. Finally, 1260 02:00:31,750 --> 02:00:37,929 I'll get into the last section of this lecture, which is about randomization, which I think 1261 02:00:37,929 --> 02:00:44,059 a lot of us have heard about. So I'm going to explain the steps to a completely randomized 1262 02:00:44,059 --> 02:00:50,409 experiment. And after I go through all that, I'm going to also talk about the concept of 1263 02:00:50,409 --> 02:00:57,770 a placebo and the placebo effect. Then we're going to briefly touch on blocked randomization, 1264 02:00:57,770 --> 02:01:08,320 and also define for you what is meant by blinding. So why ever randomize, right? 
So what randomizing 1265 02:01:08,320 --> 02:01:16,510 is, is when you take a bunch of respondents or participants in your study, and you randomly 1266 02:01:16,510 --> 02:01:22,719 choose what group they go in. And if you remember, like I was talking about experiment versus 1267 02:01:22,719 --> 02:01:28,139 observational study, we can't do that in an observational study. This is definitely an experiment because 1268 02:01:28,139 --> 02:01:30,310 you're telling them what group to go in, 1269 02:01:30,310 --> 02:01:35,050 right. So randomization is used to assign individuals to treatment groups. And when 1270 02:01:35,050 --> 02:01:38,940 you do that, when you randomly assign them, not only are you assigning them, but you're 1271 02:01:38,940 --> 02:01:43,480 randomly assigning them. You're not picking, you know, you're using like dice or some sort 1272 02:01:43,480 --> 02:01:49,690 of random method, and it helps prevent bias in selecting members for each group. It tends to distribute 1273 02:01:49,690 --> 02:01:53,869 the lurking variables evenly, even if you don't know about the lurking variables, even 1274 02:01:53,869 --> 02:02:00,580 if you aren't measuring them. By using this randomization method, they get allocated roughly equally, on average, 1275 02:02:00,580 --> 02:02:09,060 in each group. So just to remind you, how you actually do that is, first, remember 1276 02:02:09,060 --> 02:02:15,469 the steps to that statistical study, you have to follow those. And after you get to the 1277 02:02:15,469 --> 02:02:20,610 point where you have ethical approval, that's when you start doing the data collection step. 1278 02:02:20,610 --> 02:02:25,610 And that's where you start recruiting your sample or, you know, hanging up signs and saying, 1279 02:02:25,610 --> 02:02:30,260 Be in my study, and people come in, and you see if they qualify, and if they qualify, 1280 02:02:30,260 --> 02:02:36,289 you've got this group, your sample, right.
And what you do with those people is you say thank 1281 02:02:36,289 --> 02:02:40,989 you for being in my study. And you measure the confounders, which is another word for 1282 02:02:40,989 --> 02:02:46,440 lurking variables. You also measure the outcome, whatever you're trying to study, if you're 1283 02:02:46,440 --> 02:02:50,869 doing a randomized experiment. I know I've been involved in a lot of these where they're 1284 02:02:50,869 --> 02:02:57,079 studying drugs for lowering blood pressure. So they'll often have maybe two groups or 1285 02:02:57,079 --> 02:03:02,289 three groups that they're randomizing people into, but they don't do that first. The first 1286 02:03:02,289 --> 02:03:05,760 thing to do is get everybody in there and measure their blood pressure, right? The outcome, 1287 02:03:05,760 --> 02:03:10,530 you know, because they want to know that before. They are going to take a picture of that before. 1288 02:03:10,530 --> 02:03:15,059 And they also measure confounders, like smoking. Remember, smoking is not good for your blood 1289 02:03:15,059 --> 02:03:20,010 pressure. You know, other things are not good for your blood pressure, like not exercising. 1290 02:03:20,010 --> 02:03:25,749 We'll measure all of those things. Okay, now, here's where we get into things. That's when 1291 02:03:25,749 --> 02:03:31,019 the whole randomization happens. So I showed this picture of a die, but we usually use 1292 02:03:31,019 --> 02:03:36,869 a computer for it. So we got all these people together. And now, you know, randomly, we put 1293 02:03:36,869 --> 02:03:41,540 them in different groups. And in this example on the slide, we're just going to pretend 1294 02:03:41,540 --> 02:03:47,079 that there's two groups. And in fact, we can't really study blood pressure the way it's shown on the slide.
1295 02:03:47,079 --> 02:03:51,540 Because we're going to give one group treatment and the other group placebo, which is an inactive 1296 02:03:51,540 --> 02:03:57,440 treatment, it's fake, it doesn't work. Of course, the treatment and the placebo are 1297 02:03:57,440 --> 02:04:02,070 going to look the same to the people taking them, so, you know, we're going to fool them. 1298 02:04:02,070 --> 02:04:06,300 They won't know. But the reason why in real life you can't do that with a 1299 02:04:06,300 --> 02:04:07,670 blood pressure study 1300 02:04:07,670 --> 02:04:08,699 today 1301 02:04:08,699 --> 02:04:13,300 is we know that high blood pressure is really bad for you. So it's really unethical to give 1302 02:04:13,300 --> 02:04:17,739 someone a placebo, you've got to give them some sort of drug to lower the blood pressure. 1303 02:04:17,739 --> 02:04:22,479 So usually when we do studies like this on blood pressure now, on new blood pressure drugs, 1304 02:04:22,479 --> 02:04:29,429 Group A is new treatment and Group B is old treatment. Like, they usually take a new treatment and 1305 02:04:29,429 --> 02:04:35,099 give it to Group A and an old treatment to Group B, to see if they can find just a better treatment. 1306 02:04:35,099 --> 02:04:41,119 But if we were talking about something like Alzheimer's, especially late-stage Alzheimer's, 1307 02:04:41,119 --> 02:04:46,570 there's no treatment. Okay? And so what's on the slide here, Group A, which gets 1308 02:04:46,570 --> 02:04:52,239 treatment, and Group B, which gets this sham pill, this placebo, that would be ethical 1309 02:04:52,239 --> 02:04:56,530 then. But let's just cross our fingers that someday that's not ethical anymore and that 1310 02:04:56,530 --> 02:05:00,440 we do get a treatment, right. 1311 02:05:00,440 --> 02:05:01,440 Okay.
So 1312 02:05:01,440 --> 02:05:06,739 after you put them in the two groups, what's sort of missing from the slide is that time passes. 1313 02:05:06,739 --> 02:05:11,420 People in Group A take whatever they're supposed to take, their treatment. And in this example 1314 02:05:11,420 --> 02:05:15,980 on the slide, people in Group B take the fake treatment, the placebo, and neither of 1315 02:05:15,980 --> 02:05:21,960 them, you know, usually knows what's happening. But it takes a while, right. And in the olden 1316 02:05:21,960 --> 02:05:27,409 days, before we knew high blood pressure was bad, these were the study designs. And 1317 02:05:27,409 --> 02:05:33,420 what ended up happening is that you would see, at the beginning where they measured 1318 02:05:33,420 --> 02:05:37,880 the confounders and the outcome, everybody had high blood pressure, they all looked the 1319 02:05:37,880 --> 02:05:43,999 same. But after treatment, Group A would go down, whereas Group B would go down 1320 02:05:43,999 --> 02:05:50,139 a little bit from the placebo effect, which I'll explain in the next slide. But that's how 1321 02:05:50,139 --> 02:05:55,659 we learned that you can make blood pressure go down with these different pills. Finally, 1322 02:05:55,659 --> 02:06:02,749 after that time passed, it could be six weeks, it could be years, however long that took, 1323 02:06:02,749 --> 02:06:08,460 after that passed, when it was over, we'd measure again the confounders, because they 1324 02:06:08,460 --> 02:06:13,400 could have changed, and the outcome, which in my example was blood pressure, or, you 1325 02:06:13,400 --> 02:06:20,869 know, how serious the Alzheimer's disease would be, if we were doing that.
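The random assignment step of a completely randomized experiment, done by computer rather than with a die, might look like the following minimal Python sketch. The function name `completely_randomize`, the participant IDs, and the seed are all hypothetical; a real trial would normally use dedicated allocation software and keep the assignment concealed from the people enrolling participants.

```python
import random


def completely_randomize(participant_ids, groups=("A", "B"), seed=None):
    """Randomly assign each participant to one group (hypothetical helper).

    Shuffles the enrolled participants, then deals them out in turn so
    that group sizes differ by at most one.
    """
    rng = random.Random(seed)  # seeded generator so the example is reproducible
    shuffled = list(participant_ids)
    rng.shuffle(shuffled)
    return {pid: groups[i % len(groups)] for i, pid in enumerate(shuffled)}


# Twenty made-up participants who already qualified and had baseline
# confounders and outcome measured before this step.
participants = [f"P{n:02d}" for n in range(1, 21)]
assignment = completely_randomize(participants, seed=42)

group_sizes = {g: sum(1 for v in assignment.values() if v == g) for g in ("A", "B")}
print(group_sizes)  # {'A': 10, 'B': 10}
```

Because nobody chooses who goes where, measured and unmeasured lurking variables tend to be spread across the groups, although in a small sample a single random split can still come out unbalanced.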
So 1326 02:06:20,869 --> 02:06:25,960 I promised you on the last slide that I'd talk to you more about what a placebo is, 1327 02:06:25,960 --> 02:06:32,080 and the placebo effect. I found this great picture of old placebos from the National Institutes 1328 02:06:32,080 --> 02:06:37,630 of Health. So a placebo is this fake drug that's given, and it's actually kind of hard 1329 02:06:37,630 --> 02:06:44,429 to make placebos. Just imagine a drug you may have taken, maybe even Excedrin or something 1330 02:06:44,429 --> 02:06:51,039 like that. Imagine we had to study Excedrin. We'd have to make a fake Excedrin that tasted 1331 02:06:51,039 --> 02:06:57,719 like it and looked like it. Because otherwise, the people who are randomized to the placebo 1332 02:06:57,719 --> 02:07:02,829 group would be able to totally tell that they were in the placebo group, and that's not 1333 02:07:02,829 --> 02:07:09,389 good. So, the reason why you need a placebo is there's this thing called the 1334 02:07:09,389 --> 02:07:16,059 placebo effect. And that occurs when there is no treatment, but the participant assumes 1335 02:07:16,059 --> 02:07:24,390 he or she is receiving treatment and responds favorably. Now, sometimes I talk about one of my favorite 1336 02:07:24,390 --> 02:07:32,190 epidemiologist-comedians, Ben Goldacre. He reported in, I think, one of his 1337 02:07:32,190 --> 02:07:39,500 TED talks about a study where everybody they enrolled, um, they didn't have a disease, 1338 02:07:39,500 --> 02:07:44,570 right, or I guess they had a mild disease. And they told everybody, either they were going 1339 02:07:44,570 --> 02:07:49,800 to give them nothing, or they were going to give them a pill that's a placebo, it doesn't 1340 02:07:49,800 --> 02:07:55,600 do anything, or they were going to give them an injection that's a placebo injection, 1341 02:07:55,600 --> 02:08:00,460 it doesn't do anything.
And what they found is, of the three groups, the people who got 1342 02:08:00,460 --> 02:08:05,790 the injection did the best, you know, the fake injection. The people who got the 1343 02:08:05,790 --> 02:08:10,960 fake pill, the placebo pill, did second best. And the people who didn't get anything did 1344 02:08:10,960 --> 02:08:15,849 the worst. And his point is, that's what the placebo effect is. For some reason, when 1345 02:08:15,849 --> 02:08:21,389 we're getting injected, even with just saline, we think we're getting some sort of drug, and 1346 02:08:21,389 --> 02:08:28,190 it psychologically, or however, affects our bodies. The same thing when we're taking a 1347 02:08:28,190 --> 02:08:36,979 pill. I don't know if you've ever seen kids, you know, saying, Oh, I need medicine, I need medicine. 1348 02:08:36,979 --> 02:08:40,070 And then the parent gives them an M&M, right, they think it's a pill, they're happy 1349 02:08:40,070 --> 02:08:45,789 with it. But actually, the placebo effect can cause real effects on your health. It 1350 02:08:45,789 --> 02:08:51,349 can make you feel better just because you think you're taking a drug. And so that's 1351 02:08:51,349 --> 02:08:57,440 why it's super important to include a placebo group in all your studies, if you don't have a comparison group 1352 02:08:57,440 --> 02:09:03,110 like I described with blood pressure, because if you just have 1353 02:09:03,110 --> 02:09:07,860 one group where they're taking it, they'll all say it's good. They would say it's good 1354 02:09:07,860 --> 02:09:14,469 if it was water, right. So the placebo is given to what's called a control group, and 1355 02:09:14,469 --> 02:09:18,789 they receive the placebo. Now, if you're studying, like, acupuncture, you can't really give a 1356 02:09:18,789 --> 02:09:24,499 placebo acupuncture.
So what they'll do is they'll sort of hang up a little curtain 1357 02:09:24,499 --> 02:09:31,940 and kind of tap you, and you don't know whether you're getting real or what's called sham acupuncture. 1358 02:09:31,940 --> 02:09:36,120 Other things have to happen like that when you're studying these interventions 1359 02:09:36,120 --> 02:09:42,699 that aren't pills. Those are called attention controls, right? Where we have, like, a sham 1360 02:09:42,699 --> 02:09:48,190 acupuncture. So in any case, you've got to think about this, because you need a control or 1361 02:09:48,190 --> 02:09:55,690 comparison group that's fair whenever you're testing, in an experiment, in a randomized experiment, 1362 02:09:55,690 --> 02:10:00,300 a new thing. 1363 02:10:00,300 --> 02:10:05,920 I promised you I'd talk a little bit about blocked randomization. I won't get much into it. But 1364 02:10:05,920 --> 02:10:11,060 sometimes when you go to randomize, right, you know, you get this whole group of people, 1365 02:10:11,060 --> 02:10:15,250 they're all about the same, but you're gonna split them into a Group A and Group B. One's 1366 02:10:15,250 --> 02:10:20,199 gonna get maybe a drug and the other's maybe gonna get the placebo. Sometimes you get worried 1367 02:10:20,199 --> 02:10:25,889 that the groups are going to be unbalanced with respect to a particular lurking variable. 1368 02:10:25,889 --> 02:10:29,789 In blood pressure, we'd always care about smoking; we want an equal number of smokers 1369 02:10:29,789 --> 02:10:35,920 in each group. You know, a lot of times we care about gender; we want equal numbers 1370 02:10:35,920 --> 02:10:40,520 of men and women in each group. So if you're worried about that, with randomization, you 1371 02:10:40,520 --> 02:10:45,059 can't just do it one at a time, because you might just randomly put too many men in one 1372 02:10:45,059 --> 02:10:52,059 group. So what you have to do is block randomization. 
So see, I drew all these blocks on 1373 02:10:52,059 --> 02:10:57,550 the screen, and you'll see that there's nobody in them, they're just blank, I just put xxx. 1374 02:10:57,550 --> 02:11:03,469 So this is before you do your study, you have these blank blocks. And what you do is, as 1375 02:11:03,469 --> 02:11:06,999 you enroll those people (remember, you have to measure them and make sure that they qualify 1376 02:11:06,999 --> 02:11:13,599 for the study), as you get them in, you can just write them in the blocks, right. So here, 1377 02:11:13,599 --> 02:11:18,909 I just put their fake initials, you know. So let's say that XYZ came in first, that's 1378 02:11:18,909 --> 02:11:25,420 a woman, and then maybe NSW came in, and that's another woman. You just keep putting the women 1379 02:11:25,420 --> 02:11:30,239 there. And then when the men come in, you put them in, and you fill up the blocks. Then 1380 02:11:30,239 --> 02:11:37,079 here's a trick: you actually randomize the entire blocks, right? So block one and block 1381 02:11:37,079 --> 02:11:42,889 three ended up in Group A, and, by magic, you've got equal men and women there. And 1382 02:11:42,889 --> 02:11:49,510 then Group B, equal men and women. And so that's how you do it with blocks. But, you know, there's 1383 02:11:49,510 --> 02:11:54,440 some limitation to this. Like, if you get multiple races in your study, maybe, you know, 1384 02:11:54,440 --> 02:11:59,889 four or five racial groups, if you make a five block, you've got to fill up the whole 1385 02:11:59,889 --> 02:12:05,900 block before you randomize it. And, you know, sometimes you're in an area where certain 1386 02:12:05,900 --> 02:12:10,869 racial groups are rare, and you might have trouble filling up your blocks. So there's 1387 02:12:10,869 --> 02:12:14,650 some limitations of this too. 
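The blocking idea described here can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the procedure of any real trial: the function name `block_randomize`, the blocks of one woman and one man each, and every set of initials other than XYZ and NSW are invented for the example.

```python
import random


def block_randomize(women, men, block_size=2, seed=None):
    """Fill blocks with equal numbers of women and men, then randomize whole blocks.

    Hypothetical helper for illustration only. Half of the blocks go to
    Group A and half to Group B, so both groups end up balanced by sex.
    """
    rng = random.Random(seed)
    per_sex = block_size // 2
    # Only completely filled blocks can be randomized.
    n_blocks = min(len(women), len(men)) // per_sex
    n_blocks -= n_blocks % 2  # keep an even number of blocks so the groups match in size
    blocks = [
        women[i * per_sex:(i + 1) * per_sex] + men[i * per_sex:(i + 1) * per_sex]
        for i in range(n_blocks)
    ]
    # Shuffle the block order, not the individual people.
    order = list(range(n_blocks))
    rng.shuffle(order)
    half = n_blocks // 2
    group_a = [person for b in order[:half] for person in blocks[b]]
    group_b = [person for b in order[half:] for person in blocks[b]]
    return group_a, group_b


# XYZ and NSW are the fake initials from the lecture; the rest are made up.
women = ["XYZ", "NSW", "ABC", "DEF"]
men = ["GHI", "JKL", "MNO", "PQR"]
group_a, group_b = block_randomize(women, men, block_size=2, seed=1)
print("Group A:", group_a)
print("Group B:", group_b)
```

Because whole blocks are shuffled rather than individual people, each group gets the same number of women and men no matter how the shuffle comes out. People who do not fit into a complete block are left unassigned, which is the limitation mentioned above about blocks that are hard to fill.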
1388 02:12:14,650 --> 02:12:16,070 Now, 1389 02:12:16,070 --> 02:12:21,880 I had mentioned the situation where, if you're going to do an experiment, 1390 02:12:21,880 --> 02:12:26,249 right, not an observational study, an experiment, and you're going to randomize people either 1391 02:12:26,249 --> 02:12:33,540 to a drug or some sort of intervention versus placebo, or a drug versus another drug, an 1392 02:12:33,540 --> 02:12:39,540 old drug, you really don't want them to know what group they're in. I mean, because you 1393 02:12:39,540 --> 02:12:42,429 have to be ethical, before they enter the study, you have to tell them you're gonna 1394 02:12:42,429 --> 02:12:46,210 put them in one of two groups, but you've got to tell them, you're not going 1395 02:12:46,210 --> 02:12:52,170 to know what group you're in while it's going on. So blinding is where any person 1396 02:12:52,170 --> 02:12:58,269 is deliberately not told of the treatment assignment, so he or she is not biased in 1397 02:12:58,269 --> 02:13:04,020 reporting study information. And it actually doesn't have to just be the participant in 1398 02:13:04,020 --> 02:13:09,760 the study, it can be researchers. Like, the most common one is a participant is blinded 1399 02:13:09,760 --> 02:13:16,249 to treatment or placebo. But I've worked on studies of, 1400 02:13:16,249 --> 02:13:22,999 like, Alzheimer's disease, right? Well, they want to take the patients, or the participants 1401 02:13:22,999 --> 02:13:29,909 in the study, who might have Alzheimer's disease, and look at their image, the MRI of their 1402 02:13:29,909 --> 02:13:37,790 head. And often, they'll also have a neurologist interview them, and they'll also see a neuropsychologist. 
1403 02:13:37,790 --> 02:13:41,989 And they often want those three different groups, the imaging group, the neuropsychology 1404 02:13:41,989 --> 02:13:48,150 group and the neurology group, not to know about each other's opinion of this particular 1405 02:13:48,150 --> 02:13:55,469 patient. So they'll blind them to each other's opinion. So blinding can be much more complicated 1406 02:13:55,469 --> 02:14:00,449 than just blinding the participant to whether they're in the placebo or the drug 1407 02:14:00,449 --> 02:14:07,820 group. But double blind is a really important concept. And that means that both the participant 1408 02:14:07,820 --> 02:14:13,440 and the study staff do not know the treatment assignment. So everybody who's operating with 1409 02:14:13,440 --> 02:14:18,249 the patient doesn't know it. So you're probably thinking that's really pretty serious, right? 1410 02:14:18,249 --> 02:14:23,360 Like, what if that person gets sick, and goes to the emergency room, and they're taking 1411 02:14:23,360 --> 02:14:27,340 an experimental drug or they could be taking placebo? Who knows what they're taking? Well, 1412 02:14:27,340 --> 02:14:33,280 in that case, what happens is there's an unblinding procedure. There just has to be, as part of 1413 02:14:33,280 --> 02:14:39,460 ethics. It's already set up in the study. If somebody goes to the emergency room, there's 1414 02:14:39,460 --> 02:14:46,369 a person that can be called to unblind the participant, who's now a patient, 1415 02:14:46,369 --> 02:14:50,360 and once they're unblinded, they learn what they were taking. Even if they were taking 1416 02:14:50,360 --> 02:14:55,479 placebo, the whole thing's over. Right? Even the study staff work. It's just a fact of 1417 02:14:55,479 --> 02:15:00,090 life. It has to happen sometimes. But for the most part, what we try to do is keep the 1418 02:15:00,090 --> 02:15:07,310 study 
double blind, because it makes things the least biased and the most fair. So, to end 1419 02:15:07,310 --> 02:15:12,010 the section on randomization: the purpose of randomization, why we go through all this 1420 02:15:12,010 --> 02:15:17,909 when we're testing treatments, especially, is that it's used to reduce bias. And especially 1421 02:15:17,909 --> 02:15:22,960 if you have a particular variable you're concerned about, like gender, like we were talking about, 1422 02:15:22,960 --> 02:15:28,729 race, or smoking status, you can use a block randomization to even out each 1423 02:15:28,729 --> 02:15:33,940 group. And then blinding further prevents bias, right? Because people don't know what 1424 02:15:33,940 --> 02:15:38,530 they're taking and the study staff don't know what they're giving them. And the reason why 1425 02:15:38,530 --> 02:15:42,940 you have to really think about blinding is the placebo effect is necessary to take into 1426 02:15:42,940 --> 02:15:47,510 account. You're always going to get the placebo effect every time you give somebody something. 1427 02:15:47,510 --> 02:15:54,909 So you've got to account for that in your study design. So in conclusion, I went over 1428 02:15:54,909 --> 02:15:59,409 the steps to conducting a statistical study in order and kind of gave you tips on how 1429 02:15:59,409 --> 02:16:04,949 to remember that. We looked at some basic terms and definitions. And we talked about how to 1430 02:16:04,949 --> 02:16:10,710 avoid bias in survey design, because there's a lot of different considerations. And finally, 1431 02:16:10,710 --> 02:16:17,360 we talked more in depth specifically about randomization in experiments. All right. 1432 02:16:17,360 --> 02:16:22,640 Now you know a lot, maybe too much. I hope you enjoyed my lecture. 1433 02:16:22,640 --> 02:16:31,349 Hi, whoa, it's me again, Monica Wahi, your statistics lecturer from Labouré College. 
1434 02:16:31,349 --> 02:16:37,139 Now we're going to go back and cover what I didn't cover in the last lecture about chapter 1435 02:16:37,139 --> 02:16:45,529 2.1, which are frequency histograms and distributions. So here are your learning objectives for this 1436 02:16:45,530 --> 02:16:50,110 lecture. So at the end of this lecture, you should be able to state the steps for drawing 1437 02:16:50,110 --> 02:16:55,330 a frequency histogram, you should also be able to name two types of distributions and 1438 02:16:55,330 --> 02:17:00,650 explain how they look, you should be able to define what an outlier is, and say one 1439 02:17:00,650 --> 02:17:07,049 reason why you would make a frequency histogram. Finally, you should be able to define what 1440 02:17:07,049 --> 02:17:14,309 a relative frequency is and what a cumulative frequency is. Okay, so let's get started. 1441 02:17:14,309 --> 02:17:19,089 First, we're going to review frequency histograms and relative frequency histograms. So you'll 1442 02:17:19,090 --> 02:17:24,850 figure out what I'm talking about there. Then we're going to go over five common distributions 1443 02:17:24,850 --> 02:17:29,751 in statistics, so you know what that's all about. And then I'm going to talk about outliers. 1444 02:17:29,751 --> 02:17:35,820 Now, you'll notice I have a lot of pictures in this presentation of skylines. And the 1445 02:17:35,820 --> 02:17:43,730 reason why is they remind me of histograms. So let's talk about what a frequency histogram is. 1446 02:17:43,730 --> 02:17:51,260 So a frequency histogram is important in statistics, because, as you'll see, you need to make one 1447 02:17:51,260 --> 02:17:56,299 in order to see what the distribution is. So I'm going to first explain what one 1448 02:17:56,299 --> 02:18:00,840 is, like, show you what one looks like. And then I'll explain how to make one. And then 1449 02:18:00,841 --> 02:18:05,450 I'll explain the relative frequency histogram. 
And then we'll move on to looking at why 1450 02:18:05,450 --> 02:18:12,020 we need that for distributions. So here's another skyline, because it looks like a histogram 1451 02:18:12,020 --> 02:18:17,889 to me. So what is a frequency histogram? Well, it's actually a specific type of bar chart. 1452 02:18:17,889 --> 02:18:23,468 And it's made from data in a frequency table. So you might see a frequency histogram and 1453 02:18:23,468 --> 02:18:28,029 go, well, that looks like a boring old bar graph. Well, it's not just any old bar graph, 1454 02:18:28,030 --> 02:18:32,840 it's got specific properties that I'm going to talk to you about in this lecture. Okay. 1455 02:18:32,840 --> 02:18:38,070 Both frequency histograms and relative frequency histograms are bar charts, but they're special 1456 02:18:38,070 --> 02:18:43,790 bar charts that have to be done a certain way. And why? Because if they're done that 1457 02:18:43,790 --> 02:18:48,509 way, and they're histograms, they will reveal the distribution of the data, which I'll explain 1458 02:18:48,510 --> 02:18:58,020 later. So here is a frequency table. We had this before. This was of those fake patient 1459 02:18:58,020 --> 02:19:03,710 transport miles, right. So you'll notice here were the class limits, and then we put in 1460 02:19:03,710 --> 02:19:08,819 the frequency and we even threw in this relative frequency. Okay, so this is the frequency 1461 02:19:08,820 --> 02:19:13,360 table I'm going to use as a demonstration for how you make a frequency histogram. You 1462 02:19:13,360 --> 02:19:20,820 first need a frequency table. Okay, now, here's the histogram version of what's in that frequency 1463 02:19:20,820 --> 02:19:27,650 table. So I'm going to annotate this one image to explain the order in which you draw it, 1464 02:19:27,650 --> 02:19:33,389 basically by hand. So the first thing you do is draw this vertical line for the y axis, 1465 02:19:33,389 --> 02:19:36,449 okay, you just draw a line. 
1466 02:19:36,450 --> 02:19:38,709 Next, you write 1467 02:19:38,709 --> 02:19:46,949 words next to the line, and you always start with frequency of, and then whatever. In our 1468 02:19:46,950 --> 02:19:52,080 example, it was patients, okay. And I'm telling you, you need to do it in this order, or you'll 1469 02:19:52,080 --> 02:19:58,280 get confused. So you start with that first line, and then you write this frequency. Okay. 1470 02:19:58,280 --> 02:20:02,910 Next, you draw the whole horizontal line for the x axis, 1471 02:20:02,910 --> 02:20:04,210 okay. 1472 02:20:04,210 --> 02:20:12,200 And then after that you write the classes below. Remember, like, the lowest class is 1473 02:20:12,200 --> 02:20:16,740 one to eight, that's the lower class limit and the upper class limit of the lowest class. Like, 1474 02:20:16,740 --> 02:20:22,300 you literally write those labels in. And why am I freaking out so much about this 1475 02:20:22,300 --> 02:20:28,050 order? It's because I totally get confused if I do not do this y axis first. Because then 1476 02:20:28,050 --> 02:20:32,580 there's all these numbers, and it's totally confusing. So just try to do it in this order. 1477 02:20:32,580 --> 02:20:41,510 Okay. Now, number six, I had to flip the slide here. Okay, at step six, you've drawn, like, the 1478 02:20:41,510 --> 02:20:46,690 basic background, you've got the x and y axis and those labels. So now you have to start 1479 02:20:46,690 --> 02:20:50,921 drawing in the bars. So for your first bar, you look at the first class, and you find 1480 02:20:50,921 --> 02:20:56,340 the frequency on the table, which I think was 14 or something. And so you look for 1481 02:20:56,340 --> 02:21:04,750 it on the y axis, and you want to label the y axis so that the maximum one is incorporated 1482 02:21:04,750 --> 02:21:10,990 in it. Like, you see our maximum is above 20. 
So we wouldn't want to end our y axis at 20, 1483 02:21:10,990 --> 02:21:16,280 or 15, or something. You have to make it bigger, so you can put everybody on there. But our 1484 02:21:16,280 --> 02:21:22,650 first one was, what, 14, so we draw this horizontal line around the 14, right there, 1485 02:21:22,650 --> 02:21:27,040 that horizontal line, because we're gonna make that first bar. 1486 02:21:27,040 --> 02:21:28,040 Then 1487 02:21:28,040 --> 02:21:31,530 you draw the two vertical lines down, and you position it over where you labeled the 1488 02:21:31,530 --> 02:21:39,780 class. And that makes the bar, and then you actually color in the bars, and you 1489 02:21:39,780 --> 02:21:44,561 repeat this for each class, right? So that's why I labeled the classes first 1490 02:21:44,561 --> 02:21:48,960 on the x axis, just to make sure everything is even. And then I go through and I make 1491 02:21:48,960 --> 02:21:54,370 all the bars. And again, this is why you need to prepare your frequency table first, so 1492 02:21:54,370 --> 02:22:02,320 you know how to graph it, you know what to put on this graph. Okay, this is the relative 1493 02:22:02,320 --> 02:22:07,272 frequency histogram. You already understand what relative frequency is, right? It's that 1494 02:22:07,272 --> 02:22:14,181 proportion, the proportion of your sample that's in each class. And so, if 1495 02:22:14,181 --> 02:22:17,541 you're going to do a relative frequency histogram, you basically go through the same steps, it's 1496 02:22:17,541 --> 02:22:24,601 just you're changing what's on the y axis, you change what you label it, okay? But the 1497 02:22:24,601 --> 02:22:30,620 x axis stays the same. And even though you're charting the relative frequencies, 1498 02:22:30,620 --> 02:22:35,410 and you'll be like, Okay, this is a totally different number, what you'll see is the pattern 1499 02:22:35,410 --> 02:22:40,760 ends up being the same. 
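The drawing steps above, and the point that relative frequency only changes the y axis, can be mirrored in a small text sketch. The table is partly assumed: only the first class (1 to 8, frequency 14) and the second frequency (21) are stated in the lecture; the other class limits and counts are placeholders chosen for illustration, and the helper names are hypothetical.

```python
# Frequency table rows: (lower class limit, upper class limit, frequency).
# Only the 1-8 class with frequency 14 and the second frequency of 21 come
# from the lecture; every other value is an illustrative assumption.
FREQ_TABLE = [
    (1, 8, 14),
    (9, 16, 21),
    (17, 24, 11),
    (25, 32, 6),
    (33, 40, 4),
    (41, 48, 4),
]


def relative_frequencies(table):
    """Proportion of the whole sample that falls in each class."""
    total = sum(freq for _, _, freq in table)
    return [freq / total for _, _, freq in table]


def text_histogram(table, relative=False):
    """Draw one bar per class: '#' per patient, or '#' per percent if relative."""
    rows = []
    for (low, high, freq), rel in zip(table, relative_frequencies(table)):
        height = round(rel * 100) if relative else freq
        label = f"{rel:.2f}" if relative else str(freq)
        rows.append(f"{low:>2}-{high:<2} | {'#' * height} {label}")
    return "\n".join(rows)


print("Frequency of patients")
print(text_histogram(FREQ_TABLE))
print()
print("Relative frequency of patients")
print(text_histogram(FREQ_TABLE, relative=True))
```

The bars are drawn sideways here, one row per class, but the idea is the same: the class labels stay fixed and only the bar heights are rescaled, so the tallest class and the overall shape match in both versions.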
So it takes on a similar pattern, and the pattern is actually 1500 02:22:40,760 --> 02:22:45,681 what we're going after. That's the thing I'm going to talk about with the distribution. 1501 02:22:45,681 --> 02:22:50,750 And so, since the pattern is going to come out the same, I tend to prefer 1502 02:22:50,750 --> 02:22:56,710 using a relative frequency histogram versus a frequency histogram. Because if I have two 1503 02:22:56,710 --> 02:23:02,351 different groups, like let's say there were two hospitals, and I gathered two sets of 1504 02:23:02,351 --> 02:23:09,110 data, and I wanted to compare the miles transported, then I could use this relative frequency histogram, 1505 02:23:09,110 --> 02:23:16,351 and not only would the patterns be evident, but I could compare them fairly. Like, whatever's, 1506 02:23:16,351 --> 02:23:23,010 you know, 0.35 or 35% in this one, even if the other hospital maybe had 1507 02:23:23,010 --> 02:23:30,330 tons more transports, I could see it as, like, 35%. And I could really compare the percent, 1508 02:23:30,330 --> 02:23:34,970 right. So that's why I lean towards the relative frequency histogram. But ultimately, you're 1509 02:23:34,970 --> 02:23:43,771 going to get the same pattern on your histogram, whether you use frequency or relative frequency. 1510 02:23:43,771 --> 02:23:49,630 So again, another picture of a skyline. So you can see why I think of skylines, because 1511 02:23:49,630 --> 02:23:54,500 they look like histograms, right? So after making a frequency table, which you do with 1512 02:23:54,500 --> 02:23:58,940 quantitative data, right, because you're trying to organize it, it's also important to then 1513 02:23:58,940 --> 02:24:04,141 make a frequency histogram and/or relative frequency histogram. And why? It's because 1514 02:24:04,141 --> 02:24:08,990 it reveals a distribution. And now, that's what we're going to talk about. 
We're going 1515 02:24:08,990 --> 02:24:14,421 to talk about distributions. So first, I'm going to define what I'm talking about with 1516 02:24:14,421 --> 02:24:18,190 the distribution. And now you're gonna see a lot of other kinds of pictures like this 1517 02:24:18,190 --> 02:24:23,860 on the right. See that shape? That's one of our distributions, okay. And so that's 1518 02:24:23,860 --> 02:24:28,480 a little prequel to what I'm going to say. So first, we're going to talk about what these 1519 02:24:28,480 --> 02:24:34,860 distributions are. Then I'm going to describe what an outlier is, and how you can detect 1520 02:24:34,860 --> 02:24:40,920 them by using histograms. Finally, I'm going to wrap it up by explaining what cumulative 1521 02:24:40,920 --> 02:24:44,590 frequency is and what an ogive is. 1522 02:24:44,590 --> 02:24:50,970 Okay, so what is this distribution thing I keep talking about? Well, it's actually just 1523 02:24:50,970 --> 02:24:57,670 a shape. It's the shape that is made if you draw a line along the edges of the histogram's 1524 02:24:57,670 --> 02:25:05,830 bars. So on the left, you see I drew the scribbly shape. But you'll notice you can do it with 1525 02:25:05,830 --> 02:25:10,690 a stem and leaf too. This is not the same data graphed on the right in the stem and 1526 02:25:10,690 --> 02:25:15,521 leaf. I'm just, you know, recycling the old picture that I used before. But you 1527 02:25:15,521 --> 02:25:23,271 see, you can do the same drawing, that squiggly line, you know. And that's actually the distribution. 1528 02:25:23,271 --> 02:25:26,920 I mean, they don't all look exactly like that. But that's what you do, you draw this line 1529 02:25:26,920 --> 02:25:33,400 thing. I know, it's kind of odd that that's what a distribution is, just a shape. But 1530 02:25:33,400 --> 02:25:39,820 there's actually five of them that we use a lot. 
There's way more than five, actually, 1531 02:25:39,820 --> 02:25:44,410 in statistics, but you have to get into kind of higher level statistics to care about those. 1532 02:25:44,410 --> 02:25:50,391 We're only going to concentrate on these five. Okay. So the first one is called the normal distribution. 1533 02:25:50,391 --> 02:25:55,760 And it's called that everywhere, except I noticed the book calls that a mound-shaped symmetrical 1534 02:25:55,760 --> 02:26:01,740 distribution, but I'm going to call it a normal distribution. And there's nothing really normal 1535 02:26:01,740 --> 02:26:07,261 about it, it's just named that for some reason. And then there's a uniform distribution, skewed 1536 02:26:07,261 --> 02:26:12,811 left distribution, skewed right distribution, and bimodal distribution. So those are the 1537 02:26:12,811 --> 02:26:18,830 five we're going to cover. So let's start here with the normal distribution. So as you 1538 02:26:18,830 --> 02:26:23,501 can see, on the right, somebody made a histogram. And then they drew that squiggly line. Well, 1539 02:26:23,501 --> 02:26:27,811 actually, it was me who made this histogram and drew the squiggly line. And notice the 1540 02:26:27,811 --> 02:26:32,141 squiggly line, what it looks like. It kind of looks like what the book called it: it's 1541 02:26:32,141 --> 02:26:38,351 mound shaped and symmetrical. That's the shape of the normal distribution. It looks 1542 02:26:38,351 --> 02:26:43,990 like that, it's got kind of hokey things on the side, and a mound in the middle. 1543 02:26:43,990 --> 02:26:48,170 And if that's what your histogram ends up looking like, where it's kind of like a little 1544 02:26:48,170 --> 02:26:54,110 mountain like that, then you've got a normal distribution. Okay, let's look at a different 1545 02:26:54,110 --> 02:26:58,790 histogram. Okay? 
In this histogram, you'll notice that, like, each of the bars, each of 1546 02:26:58,790 --> 02:27:04,040 the frequencies, is almost the same, right? It's either five or six. And it doesn't matter 1547 02:27:04,040 --> 02:27:10,331 what class we're talking about. When it's like that, the little line you draw across, 1548 02:27:10,331 --> 02:27:16,370 it's not squiggly at all, it's straight. I don't see this very often in healthcare data. 1549 02:27:16,370 --> 02:27:20,830 But it does happen in other kinds of data more frequently. And this is called the uniform 1550 02:27:20,830 --> 02:27:26,290 distribution, which makes sense: almost all of these bars are a uniform height. So 1551 02:27:26,290 --> 02:27:32,761 that's what a uniform distribution is. Okay, now, this is one kind of like the one we were 1552 02:27:32,761 --> 02:27:37,931 looking at before, where it looks kind of like a slide, like at a playground, where, 1553 02:27:37,931 --> 02:27:42,740 you know, you climb up the right side, and then you slide down to the left side. 1554 02:27:42,740 --> 02:27:48,650 Okay? And whenever it's like that, where it's low on one side and high on the other, 1555 02:27:48,650 --> 02:27:56,650 it's called skewed. The problem is, which way is it skewed? Right? And how I remember 1556 02:27:56,650 --> 02:28:03,090 which way to say it's skewed is: it's skewed where it's light, or short. So here, I would 1557 02:28:03,090 --> 02:28:08,650 say it's light on the left. So it's skewed left, right? Because on the left side, 1558 02:28:08,650 --> 02:28:12,400 the bars are all short. And then you can just imagine what's going to come next 1559 02:28:12,400 --> 02:28:18,621 here. Well, look at this, this is skewed right, because it's light on the right. It's 1560 02:28:18,621 --> 02:28:24,660 short on the right. So it's skewed right. So technically, I mean, both of them are just 1561 02:28:24,660 --> 02:28:29,030 skewed distributions. 
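The rule "skewed where it's light" is a visual one. As a cross-check, the sketch below swaps in a standard numeric measure that the lecture does not use: the sign of the third central moment, computed from class positions and frequencies. The function name and all bar heights are made up for illustration.

```python
def skew_direction(positions, freqs, tol=1e-9):
    """Classify a frequency table by the sign of its third central moment.

    positions: class midpoints (or simply class numbers 1, 2, 3, ...)
    freqs:     the frequency in each class
    A negative moment means the long, short-barred tail is on the left.
    """
    n = sum(freqs)
    mean = sum(p * f for p, f in zip(positions, freqs)) / n
    third = sum(f * (p - mean) ** 3 for p, f in zip(positions, freqs)) / n
    if third < -tol:
        return "skewed left"
    if third > tol:
        return "skewed right"
    # Normal, uniform, and evenly bimodal shapes all land here.
    return "symmetric"


classes = [1, 2, 3, 4, 5]
print(skew_direction(classes, [1, 2, 4, 8, 16]))  # short bars on the left
print(skew_direction(classes, [16, 8, 4, 2, 1]))  # short bars on the right
print(skew_direction(classes, [1, 3, 5, 3, 1]))   # mound in the middle
```

This agrees with the visual rule on clear-cut cases, but it cannot tell the symmetric shapes apart, so looking at the histogram is still needed.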
I just like to explain them separately, because sometimes 1562 02:28:29,030 --> 02:28:33,460 people don't know which way to say, left or right. And this is how I remember: light 1563 02:28:33,460 --> 02:28:43,280 on the left, light on the right. Finally, we have bimodal. Now, the word mode in some 1564 02:28:43,280 --> 02:28:51,561 areas of statistics, and in engineering and stuff, often means, like, a high point. And 1565 02:28:51,561 --> 02:29:00,811 bimodal means two high points. So as you can see, it looks like a camel with two humps. 1566 02:29:00,811 --> 02:29:07,460 And it's a little hard sometimes to tell bimodal from normal. Because if you remember 1567 02:29:07,460 --> 02:29:12,730 normal, like let's say you have a normal distribution, but you just have 1568 02:29:12,730 --> 02:29:17,791 one little bar kind of in the middle, you're like, is this bimodal, or is this normal? 1569 02:29:17,791 --> 02:29:24,610 How I coach people to see if it's bimodal is if there's a really big space between 1570 02:29:24,610 --> 02:29:30,182 the two humps. That's not so apparent on this image here. But you'll see class three and 1571 02:29:30,182 --> 02:29:35,230 class four, they're both short. If only one of them was short, I might have called 1572 02:29:35,230 --> 02:29:40,410 it a normal distribution. But I've really seen bimodal distributions when it comes 1573 02:29:40,410 --> 02:29:47,550 to, like, lab data, because my best friend is a pathologist, and he'll show me, you know, 1574 02:29:47,550 --> 02:29:51,990 situations where people have, like, really super high platelet counts, and then, like, 1575 02:29:51,990 --> 02:29:56,830 no platelets practically, and there's nothing in the middle. And that's where you'll see 1576 02:29:56,830 --> 02:30:04,340 a bimodal distribution. Now we're gonna talk about outliers. 
And outliers are data values 1577 02:30:04,340 --> 02:30:09,330 that are, quote, very different from other measurements in the data. What's very different, 1578 02:30:09,330 --> 02:30:15,240 right? Like, it's an opinion. But people in statistics come up with different formulas 1579 02:30:15,240 --> 02:30:19,701 to try and figure out if something is very different from the other measurements. And 1580 02:30:19,701 --> 02:30:25,160 we'll talk about that, actually, in later chapters in the class, not so much for identifying 1581 02:30:25,160 --> 02:30:30,610 outliers, but just to better understand our distributions. But just as a quick and 1582 02:30:30,610 --> 02:30:36,760 dirty representation of what would be an obvious outlier, like, one nobody would disagree on, 1583 02:30:36,760 --> 02:30:41,341 is this histogram here. So you'll notice I just threw down nine classes, I made up this 1584 02:30:41,341 --> 02:30:45,801 data. But you'll see in class two and class three, there's just, like, nothing, and there's 1585 02:30:45,801 --> 02:30:50,240 nothing in class eight. And then suddenly, there's something in class 1586 02:30:50,240 --> 02:30:53,521 one and something in class nine. And when you have these big gaps, this is kind of like 1587 02:30:53,521 --> 02:30:57,920 the platelets I was telling you about, only this maybe would be, you know, you would 1588 02:30:57,920 --> 02:31:01,061 say this is trimodal, like there's three modes. But there's not really three modes, 1589 02:31:01,061 --> 02:31:06,240 right? There's a wacky low one and a wacky high one, and everything else is in the middle. 1590 02:31:06,240 --> 02:31:12,061 So because that one in class one and that one in class nine, they're so far away from 1591 02:31:12,061 --> 02:31:18,601 what's in the middle, just about every statistician would agree these are both outliers. 
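The nine-class picture can be turned into a crude gap check. This is only a sketch of the "big gap" intuition, not one of the formal outlier formulas that later chapters cover. The function name is hypothetical and the counts are invented; only the layout (classes two, three, and eight empty, something in classes one and nine) follows the description above.

```python
def gap_outlier_classes(freqs):
    """Return 1-based class numbers cut off from the main body by empty classes.

    Classes are grouped into runs of neighboring non-empty classes. The run
    holding the most observations is treated as the main body, and every
    class in any other run is flagged as a possible outlier.
    """
    runs, current = [], []
    for index, freq in enumerate(freqs):
        if freq > 0:
            current.append(index)
        elif current:
            runs.append(current)
            current = []
    if current:
        runs.append(current)
    if not runs:
        return []
    main_body = max(runs, key=lambda run: sum(freqs[i] for i in run))
    return [i + 1 for run in runs if run is not main_body for i in run]


# Nine classes; the counts themselves are made up.
made_up_freqs = [1, 0, 0, 5, 9, 12, 7, 0, 1]
print(gap_outlier_classes(made_up_freqs))
```

A heuristic like this would also flag the smaller hump of a truly bimodal data set with nothing in the middle, so it only points out values to think about before the analysis; it does not decide what to do with them.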
1592 02:31:18,601 --> 02:31:25,580 But you can just imagine how much we argue about what actually is an outlier. It's especially 1593 02:31:25,580 --> 02:31:32,580 hard when you're getting data on weight of people. Some people really do weigh 400, 500, 1594 02:31:32,580 --> 02:31:40,080 maybe even 600 pounds. You don't know if they're really outliers, or data mistakes, or what 1595 02:31:40,080 --> 02:31:44,851 to do with them. They're real people. And maybe they have really high weights. And unfortunately, 1596 02:31:44,851 --> 02:31:51,480 some of them have really low weights too. So one of the main points of doing the 1597 02:31:51,480 --> 02:31:58,730 histogram is not only to look for these distributions, but also to see if you've got any super obvious 1598 02:31:58,730 --> 02:32:03,851 outliers that you're just gonna have to think about before you proceed with your analysis. 1599 02:32:03,851 --> 02:32:11,710 Now, I'm going to talk to you about what cumulative frequency means. You know, the word accumulate 1600 02:32:11,710 --> 02:32:16,641 means to just, like, keep accumulating things. Like if you have a gutter on your house, it 1601 02:32:16,641 --> 02:32:20,851 will accumulate leaves. Like, old leaves will sit there and new leaves will keep coming 1602 02:32:20,851 --> 02:32:25,450 and the old ones will still be there, until it, like, totally clogs your gutter, and you 1603 02:32:25,450 --> 02:32:30,891 have to clean it. So that's what cumulative frequency is: it accumulates all 1604 02:32:30,891 --> 02:32:34,870 the frequencies. So you see on the slide, you know, in the first class, one to eight, 1605 02:32:34,870 --> 02:32:38,880 we had a frequency of 14. So your cumulative frequency, those are like the leaves at the 1606 02:32:38,880 --> 02:32:45,081 beginning of the season, that's all you've got, is 14. But when you add on the next 1607 02:32:45,081 --> 02:32:51,280 class, 21. 
Now you add to the cumulative frequency, it accumulates: you add that 21 to the 14, 1608 02:32:51,280 --> 02:32:57,190 and now you've got 35. And you can extrapolate: as you walk up all these classes, eventually 1609 02:32:57,190 --> 02:33:03,851 you get to the total, right. And so, yeah, that's what you get. And the first class's cumulative frequency 1610 02:33:03,851 --> 02:33:08,450 is always the same as its frequency, and each cumulative frequency is equal to or higher 1611 02:33:08,450 --> 02:33:10,101 than the last one. 1612 02:33:10,101 --> 02:33:15,971 I'll have to say, in healthcare, we don't really use cumulative frequency a whole lot. You'll 1613 02:33:15,971 --> 02:33:22,090 see it, but we are really into relative frequency, I'll just tell you that. But some groups are 1614 02:33:22,090 --> 02:33:28,420 into cumulative frequency, and those who are, they like to plot it in a plot called an ogive. 1615 02:33:28,420 --> 02:33:32,920 And again, I'll be honest, in healthcare, I've never seen an ogive just 1616 02:33:32,920 --> 02:33:37,290 in the scientific literature, which is why you'll see this is about NFL team salaries, 1617 02:33:37,290 --> 02:33:41,710 because I think they use it a lot more in economics. But at any rate, what you'll see 1618 02:33:41,710 --> 02:33:46,750 is that the classes are along the x axis. You know, you're used to that, because that's 1619 02:33:46,750 --> 02:33:52,170 what we do in a frequency histogram. But along the y axis, you see these numbers called cumulative 1620 02:33:52,170 --> 02:33:58,170 frequency. And you just graph it, right. But one of the things you'll notice is that 1621 02:33:58,170 --> 02:34:03,670 it's going to go up. Like, each one is going to go up, unless you have a class with zero 1622 02:34:03,670 --> 02:34:07,160 in it; then it's going to stay the same for that one. But otherwise, it's just going to keep 1623 02:34:07,160 --> 02:34:11,260 going up. 
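The gutter analogy amounts to a running total, which can be sketched with `itertools.accumulate` from the Python standard library. Only the first two frequencies (14 and 21, giving 35) come from the lecture; the remaining frequencies are assumed for illustration, and the function name is hypothetical.

```python
from itertools import accumulate


def cumulative_frequencies(freqs):
    """Running total of class frequencies, in class order."""
    return list(accumulate(freqs))


# 14 and 21 are from the lecture; the other four counts are assumptions.
freqs = [14, 21, 11, 6, 4, 4]
cumulative = cumulative_frequencies(freqs)

for class_number, (freq, running_total) in enumerate(zip(freqs, cumulative), start=1):
    print(f"class {class_number}: frequency {freq:>2}, cumulative {running_total:>2}")

# The properties described above:
assert cumulative[0] == freqs[0]                                  # first class matches its frequency
assert all(a <= b for a, b in zip(cumulative, cumulative[1:]))    # never goes down
assert cumulative[-1] == sum(freqs)                               # ends at the total
```

Plotting these running totals against the classes gives the ogive: a line that only rises or stays flat and ends at the total number of observations.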
So you'll always see some sort of shape like this, where it's always going up 1624 02:34:11,260 --> 02:34:21,940 and it hits the top. At the end, it hits the total cumulative frequency. So, 1625 02:34:21,940 --> 02:34:26,830 just to review, there are five main types of distributions used in statistics. And I 1626 02:34:26,830 --> 02:34:31,771 emphasize main; there are other ones, but these are the ones we're going to look at. And so 1627 02:34:31,771 --> 02:34:35,580 that's why we were doing our histograms and our stem-and-leaf displays: we were looking 1628 02:34:35,580 --> 02:34:40,001 for these distributions. And also we were looking for outliers. And then finally, I 1629 02:34:40,001 --> 02:34:44,670 just quickly did a shout out for your ogive here and your cumulative frequency. So 1630 02:34:44,670 --> 02:34:51,420 you know what's up with that. So in conclusion, the purpose of the histogram is 1631 02:34:51,420 --> 02:34:56,171 to reveal the distribution, and the stem-and-leaf displays also reveal the distribution. 1632 02:34:56,171 --> 02:35:02,660 And you look, then, for outliers. You're probably wondering, well, why do we do all this work 1633 02:35:02,660 --> 02:35:06,881 to reveal the distribution? Well, you'll find in later chapters that it matters what kind 1634 02:35:06,881 --> 02:35:13,420 of distribution you have for what kind of statistics you can do. You know, like 1635 02:35:13,420 --> 02:35:17,300 I went kind of on and on about the normal distribution. Well, we all really like that 1636 02:35:17,300 --> 02:35:20,950 in statistics, we're all really partial to that, because it allows you to do a whole 1637 02:35:20,950 --> 02:35:26,271 bunch of different statistics, you know, pretty easily if you get a normal distribution. 
However, 1638 02:35:26,271 --> 02:35:31,591 what often happens in healthcare, because I've done it, is you get a skewed distribution, 1639 02:35:31,591 --> 02:35:37,260 left skewed or right skewed, and then you have to make some decisions, and that makes it a little 1640 02:35:37,260 --> 02:35:41,931 harder. Also, I've had a bimodal distribution before, I'm remembering that one day, and that 1641 02:35:41,931 --> 02:35:47,080 was kind of an issue, and then I had to figure that one out. So that's roughly why we have 1642 02:35:47,080 --> 02:35:51,280 to go through this chapter and figure out how to do these distributions. And then later, 1643 02:35:51,280 --> 02:36:00,300 I'll explain to you what you do with that knowledge. Hello, there, it's Monica Wahi, 1644 02:36:00,300 --> 02:36:08,040 labarre College statistics lecturer. We're going to circle back now to chapter 2.2 and 1645 02:36:08,040 --> 02:36:12,840 talk about these other graphs. I'm doing things a little out of order, because it makes sense 1646 02:36:12,840 --> 02:36:19,760 to me. I hope it makes sense to you too. Well, for this lecture, we're going to have these 1647 02:36:19,760 --> 02:36:23,630 learning objectives. So when you're done with this lecture, you should be able to describe 1648 02:36:23,630 --> 02:36:29,230 a case in which a time series graph would be appropriate, you should be able to explain 1649 02:36:29,230 --> 02:36:34,580 the difference between what would be graphed on a bar graph versus a time series graph, 1650 02:36:34,580 --> 02:36:39,190 you should be able to describe the type of data graphed in a pie chart. And you should 1651 02:36:39,190 --> 02:36:44,961 also be able to list two considerations to make when choosing what type of chart to develop. 1652 02:36:44,961 --> 02:36:50,351 Alright, so let's get started here. What I'm going to be doing in this lecture is, first 1653 02:36:50,351 --> 02:36:55,500 I'm going to explain what a time series graph is. 
Then I'm going to talk about a bar graph. 1654 02:36:55,500 --> 02:36:59,580 And of course, I'm going to show you roughly how to make these, I'm gonna explain a pie 1655 02:36:59,580 --> 02:37:05,090 chart and how to make that. And then I'm going to go over a review of all the graphs I've 1656 02:37:05,090 --> 02:37:13,250 talked about for chapter two. And just summarize when to use what type of graph. So let's start 1657 02:37:13,250 --> 02:37:19,590 with the time series graph. And actually, the word time is the key. 1658 02:37:19,590 --> 02:37:20,590 The time 1659 02:37:20,590 --> 02:37:26,460 we're going to talk about this time series graph and what our time series data, right. 1660 02:37:26,460 --> 02:37:32,500 As you can see, by this little example, time is across the x axis. And that's kind of a 1661 02:37:32,500 --> 02:37:38,070 hint for where we're going. Okay, so then I'll show you roughly how to plot one. And 1662 02:37:38,070 --> 02:37:44,380 I'll explain why we have these time series graphs, like how you interpret them and why 1663 02:37:44,380 --> 02:37:54,540 you even make them. So, of course, I'm an epidemiologist. So what am i into m&m mortality, 1664 02:37:54,540 --> 02:38:01,500 morbidity. So here's a nice time series graph, wonderful graph of the percentage of visits 1665 02:38:01,500 --> 02:38:08,710 for influenza like illness reported by the US outpatient influenza like illness surveillance 1666 02:38:08,710 --> 02:38:17,141 network, by surveillance week, and this is October 1 2006, through May 1 2010. And you're 1667 02:38:17,141 --> 02:38:23,011 like, oh, time? Yeah, that's the deal. time series data are made of measurements for the 1668 02:38:23,011 --> 02:38:29,880 same variable, for the same individual taken in intervals over a period of time. Only. 1669 02:38:29,880 --> 02:38:36,391 In this case, in the example here, the individual is not a person, right? 
Because remember, 1670 02:38:36,391 --> 02:38:40,450 individuals are just what you measure what you're measuring variables about. Here, the 1671 02:38:40,450 --> 02:38:47,601 individuals are actually weeks, right? Because every week, they're making a measurement. 1672 02:38:47,601 --> 02:38:52,021 So like I said, time series data are made of measurements for the same variable, which 1673 02:38:52,021 --> 02:38:58,010 is what percentage of visits for influenza like illness. So every week they went to I 1674 02:38:58,010 --> 02:39:03,980 don't know who is in like, what clinics are in this outpatient influenza like illness 1675 02:39:03,980 --> 02:39:08,420 surveillance network, but let's just pretend there's like 10 clinics in there. So each 1676 02:39:08,420 --> 02:39:13,370 week, these clinics have to go in and say, Yeah, I had, for example, 100 visits this 1677 02:39:13,370 --> 02:39:19,250 week, and 10 of them were for influenza, like illness. So then that would be 10%. That week, 1678 02:39:19,250 --> 02:39:23,080 for that clinic. Well, they got all the clinics together, and they found out what the percents 1679 02:39:23,080 --> 02:39:27,870 were. And you can see on the y axis, right, there's the percentage, and then you see on 1680 02:39:27,870 --> 02:39:34,160 the x axis all the weeks in the year. So um, so you've seen these before, right? You especially 1681 02:39:34,160 --> 02:39:38,540 see it with stock market, right? You go on Yahoo, and look at your favorite stock, right? 1682 02:39:38,540 --> 02:39:42,910 You know, we're also rich, we own so much stock, and so you track your favorite stock 1683 02:39:42,910 --> 02:39:48,330 that way. Personally, I'm spend more time looking at mortality and morbidity, things 1684 02:39:48,330 --> 02:39:53,070 like influenza, but hey, there after I get some money, I'll be looking at stock market 1685 02:39:53,070 --> 02:40:01,080 prices. 
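The weekly percentage described above can be sketched in a few lines of Python. The week labels and the second and third rows are hypothetical; only the "100 visits, 10 of them for influenza-like illness, so 10%" case mirrors the lecture's own example. The result is the two-column table (time period, measured value) that a time series graph is drawn from.

```python
# Hypothetical weekly reports: (week label, total visits, visits for
# influenza-like illness). The first row mirrors the lecture's example;
# the other rows are made up.
weekly_reports = [
    ("week 1", 100, 10),
    ("week 2", 120, 6),
    ("week 3", 80, 12),
]

def percent_ili(total_visits, ili_visits):
    """Percentage of visits that were for influenza-like illness."""
    return 100 * ili_visits / total_visits

# One measurement of the same variable per time period:
# this is the table behind a time series graph.
time_series = [
    (week, percent_ili(total, ili)) for week, total, ili in weekly_reports
]

for week, pct in time_series:
    print(f"{week}: {pct:.1f}% of visits")
```

Here the individuals are the weeks, and the single variable measured for each one is the percentage of visits.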
So when we see these time series data graphed in these time series graphs It's often 1686 02:40:01,080 --> 02:40:10,921 about things like influenza rates. Other rates, you'll see life expectancy, rates of heart 1687 02:40:10,921 --> 02:40:14,681 attack. And that's usually what we see, because we're trying to affect those rates. And we're 1688 02:40:14,681 --> 02:40:19,880 trying to see if they're going up or down. So I'm going to just roughly go through how 1689 02:40:19,880 --> 02:40:24,141 you make one, if you ever wanted to make one, the first thing you need is a table, kind 1690 02:40:24,141 --> 02:40:29,380 of like the one on the right, I just made up these data, they don't mean anything. But 1691 02:40:29,380 --> 02:40:34,960 roughly what you need is a column that says, in this case, I put year, the influenza people 1692 02:40:34,960 --> 02:40:39,540 they put a week, but you have to put like regular time increments in the first column. 1693 02:40:39,540 --> 02:40:45,311 And then you have to put that variable measured at that time in the next column. So let's 1694 02:40:45,311 --> 02:40:49,240 say it's today, and you're like, Oh, I want to measure how many times I went to the gym 1695 02:40:49,240 --> 02:40:53,940 each week, you know, weekly over the last few months? Well, you're gonna have to reconstruct 1696 02:40:53,940 --> 02:40:58,380 that data, right? Like maybe from your memory or your calendar. So normally, when you're 1697 02:40:58,380 --> 02:41:03,900 going to go do time series stuff, you start and you collect the data as you go along. 1698 02:41:03,900 --> 02:41:08,130 And then it's nice and accurate. Okay, so let's say you did that, and you managed to 1699 02:41:08,130 --> 02:41:12,811 get some time series data together, then how do you plot? 
Well, the first thing you do, 1700 02:41:12,811 --> 02:41:18,080 and I'm using this influential thing, as an example, is you draw a horizontal line and 1701 02:41:18,080 --> 02:41:23,561 you make that your x axis, now you gathered your data based on years or weeks or something. 1702 02:41:23,561 --> 02:41:28,040 So you can label those time periods there, because you already know those time periods. 1703 02:41:28,040 --> 02:41:33,870 And so you just label that x axis. There, then you draw the vertical line for your y 1704 02:41:33,870 --> 02:41:41,400 axis. And again, you've done all your measurements, right? So if you were measuring how many times 1705 02:41:41,400 --> 02:41:47,300 you went to the gym per week, you know, maybe once a day, you know, that would be seven 1706 02:41:47,300 --> 02:41:52,480 would be the maximum, right? So you didn't want to make sure your y axis is tall enough 1707 02:41:52,480 --> 02:41:58,211 to get that seven. And if you had a good week there. And so that's really what you're looking 1708 02:41:58,211 --> 02:42:02,251 for in the y axis, you don't want to too tall, like you see the highest point that they have 1709 02:42:02,251 --> 02:42:07,420 Ooh, in 2009, they had an outbreak there, they needed to make sure that the y axis was 1710 02:42:07,420 --> 02:42:11,000 tall enough so that they could graph that. But other than that, you don't want to too 1711 02:42:11,000 --> 02:42:15,960 much taller. And then make sure you label it. I'm big on labeling here, because otherwise 1712 02:42:15,960 --> 02:42:17,420 people get confused. 1713 02:42:17,420 --> 02:42:22,630 Okay, now we're going on to the next step, then this is where you get into actually putting 1714 02:42:22,630 --> 02:42:28,240 in your data. Now, because there were so many weeks, like if you look at like 2007 is only 1715 02:42:28,240 --> 02:42:34,000 about like the x axis is only about two inches wide. 
And all like 52 weeks of 2007 were plotted 1716 02:42:34,000 --> 02:42:40,331 in there. So it literally looks like a super smooth line. But honestly, what they did was 1717 02:42:40,331 --> 02:42:46,790 they went and they put each point in. And so they put each point in separately, and 1718 02:42:46,790 --> 02:42:53,830 then they connected the dots. And that's why it looks so smooth. If you only have 1719 02:42:53,830 --> 02:42:54,830 a few 1720 02:42:54,830 --> 02:43:01,320 points, and you have a wider x axis, it'll be a more choppier, it will be, it'll look 1721 02:43:01,320 --> 02:43:06,460 a little bit more like they'll stock market. Graphs like that go up and down, up and down 1722 02:43:06,460 --> 02:43:10,230 and kind of look like a roller coaster and not so smooth. But if you have a lot of points 1723 02:43:10,230 --> 02:43:15,000 and you mission together ends up looking really smooth. You also I just wanted to point out 1724 02:43:15,000 --> 02:43:19,901 can have more than one line on the graph. For more than one set of data values. Like 1725 02:43:19,901 --> 02:43:24,450 here, they're comparing, I don't know some sort of book performance, how much it was 1726 02:43:24,450 --> 02:43:30,771 sold. In US versus Canada, you just have to make sure that you have a legend if you do 1727 02:43:30,771 --> 02:43:37,930 that, so people can tell the lines apart. So to summarize, time series graphs are useful 1728 02:43:37,930 --> 02:43:43,900 for understanding trends over time, like whether things go up or down like you saw on that 1729 02:43:43,900 --> 02:43:49,120 influenza chart, we could see when there apparently was kind of an epidemic or an outbreak. So 1730 02:43:49,120 --> 02:43:53,561 graphing more than one set of time series data, like you saw in the last graph on one 1731 02:43:53,561 --> 02:43:59,471 graph can help and comparing the differences between the datasets I worked at for the US 1732 02:43:59,471 --> 02:44:04,320 Army. 
And there's a lot of problems with people getting injured in the army. And so I made 1733 02:44:04,320 --> 02:44:09,650 a lot of time series graphs of rates of injury over the years because we were trying to do 1734 02:44:09,650 --> 02:44:14,330 things to make the rates of injury go down. And then that way we could see if the trend 1735 02:44:14,330 --> 02:44:19,660 was there that we were actually making them go down. So that's the main goal of these 1736 02:44:19,660 --> 02:44:26,940 time series graphs. Now, I'm going to move on to talk about a bar graph, which can display 1737 02:44:26,940 --> 02:44:33,450 quantitative or qualitative data. And I'm going to first start with the features of 1738 02:44:33,450 --> 02:44:38,940 the bar graph. Here's just an example on the right here. I'm going to talk about how to 1739 02:44:38,940 --> 02:44:44,750 make one, and then we're going to talk about what happens when you change the scale, meaning 1740 02:44:44,750 --> 02:44:51,190 the y axis, like how tall the y axis is on a bar chart, because it really changes things. 1741 02:44:51,190 --> 02:44:56,110 I call it a bar chart sometimes, or bar graph. They're really the same thing. I don't know 1742 02:44:56,110 --> 02:45:00,550 why they chose graph in the book. But then finally, I want to do a little shout 1743 02:45:00,550 --> 02:45:05,490 out to what Pareto charts are. We don't really use them much in healthcare, but I still wanted 1744 02:45:05,490 --> 02:45:11,570 you to know about them. Alright, so let's look at the features of a bar graph. The first 1745 02:45:11,570 --> 02:45:16,320 thing you want to know is that the bars can be vertical or horizontal. So even 1746 02:45:16,320 --> 02:45:20,190 though I'm showing you this vertical example, don't be thrown off 1747 02:45:20,190 --> 02:45:25,440 if you see a horizontal example. 
Regardless of whether they're vertical or horizontal, 1748 02:45:25,440 --> 02:45:31,510 the bars are supposed to have a uniform width, and uniform spacing, they can't be wider or 1749 02:45:31,510 --> 02:45:39,650 skinnier. And they have to be spaced apart at a uniform rate. I'm gonna use, like I said, 1750 02:45:39,650 --> 02:45:46,680 this big one here, as an example, to talk about bar graphs, I just want you to notice 1751 02:45:46,680 --> 02:45:51,920 what is being graphed here. And this is the percentage of people in the US not covered 1752 02:45:51,920 --> 02:45:56,811 by health insurance. And it's split up by race and ethnicity. And it's looking at the 1753 02:45:56,811 --> 02:46:03,080 years 2008 through 2012, which is like bad, right? Like, you want people to have health 1754 02:46:03,080 --> 02:46:09,410 insurance. Okay, um, so item three here says the length of the bars represent either the 1755 02:46:09,410 --> 02:46:14,960 variables frequency or percentage of occurrence. So if we were looking at instead of percent 1756 02:46:14,960 --> 02:46:18,970 like it's I've circled percentage, because that's what we're looking at in this one, 1757 02:46:18,970 --> 02:46:23,570 we could have looked at, you know, number of visits at a health care clinic, and that 1758 02:46:23,570 --> 02:46:28,500 would be frequency, right. But we haven't been looking at percentage here. So I, so 1759 02:46:28,500 --> 02:46:36,670 I just wanted to call that out. So you'll see then, on the y axis, we have the measurement 1760 02:46:36,670 --> 02:46:42,980 scale. And as long as we write it there, and we use that same measurement scale, for graphing 1761 02:46:42,980 --> 02:46:46,930 each of the bars, we will be fulfilling the item for which is the same measurement scale 1762 02:46:46,930 --> 02:46:51,430 is used for each mark. I don't know why anybody do it any other way. But that's part of the 1763 02:46:51,430 --> 02:46:58,330 features of the bar graph. 
Now, this is a feature that really is like my pet peeve, 1764 02:46:58,330 --> 02:47:03,881 I get so irritated when I find a bar graph or any other graph where things are not labeled, 1765 02:47:03,881 --> 02:47:10,461 I get totally confused. So you really want to put on a title, you need to put the bar 1766 02:47:10,461 --> 02:47:16,150 labels, at least on the app on the x axis, right? Like you have to know see how it says 1767 02:47:16,150 --> 02:47:20,170 white alone, black alone, like you wouldn't even know what those bars were unless somebody 1768 02:47:20,170 --> 02:47:26,710 put something there, right. And some people also add the actual values for each bar, I'll 1769 02:47:26,710 --> 02:47:31,551 do that if there's space, like there was space here. If it gets too busy, I don't do that. 1770 02:47:31,551 --> 02:47:36,460 But um, because you can kind of see them from the graph. 1771 02:47:36,460 --> 02:47:41,480 Now, you're probably wondering, um, you're probably kind of having a flashback, you're 1772 02:47:41,480 --> 02:47:46,540 like, this looks totally like a histogram. What is the difference? Well, I started by 1773 02:47:46,540 --> 02:47:53,290 talking to you about histograms, they're actually a special case of a bar graph, right? So bar 1774 02:47:53,290 --> 02:47:59,471 graphs are more general. And the histogram is a specific type of bar graph. So histograms 1775 02:47:59,471 --> 02:48:06,580 are bar graphs that must have classes of a quantitative variable on the x axis. So you 1776 02:48:06,580 --> 02:48:14,061 can already see that the bar graph I'm showing you is not a histogram, because it says categorical, 1777 02:48:14,061 --> 02:48:20,910 qualitative things, it doesn't have a class, right? Also histograms must have frequency 1778 02:48:20,910 --> 02:48:26,040 or relative frequency on the y axis, which as you can see this as percentage of something. 1779 02:48:26,040 --> 02:48:30,641 So that's not that. So this isn't a histogram. 
But whenever you make a histogram, you're 1780 02:48:30,641 --> 02:48:37,200 just making kind of a special bar graph. And I just wanted to point that out, so you weren't 1781 02:48:37,200 --> 02:48:43,500 confused. Now, I said, I was going to warn you about what goes wrong when you change 1782 02:48:43,500 --> 02:48:50,220 the scale. And what I mean by changing the scale is when you look at that y axis, notice 1783 02:48:50,220 --> 02:48:58,300 how it the top of it the way this person made, it, is at 35, or 35%. But notice that the 1784 02:48:58,300 --> 02:49:04,220 highest racial group without health insurance, which is unfortunately, those of Hispanic 1785 02:49:04,220 --> 02:49:11,870 origin, that that's close to 30. But it's not all the way up to 35. So I'm not exactly 1786 02:49:11,870 --> 02:49:18,300 sure why they made it so high. So I wanted to see what would happen, what the shape would 1787 02:49:18,300 --> 02:49:22,710 change these bars, if I actually made the top 30. So I regenerated this, and then you'll 1788 02:49:22,710 --> 02:49:29,670 see what happens. See, it's the same data. I just made it and I made the top 30. It's 1789 02:49:29,670 --> 02:49:35,220 kind of subtle, but suddenly all the bars look bigger, right? So if I were like some 1790 02:49:35,220 --> 02:49:39,930 advocate and running around saying this is terrible, you know, these people don't have 1791 02:49:39,930 --> 02:49:45,010 insurance. I'd like to look at the one on the left more than the one on the right. But, 1792 02:49:45,010 --> 02:49:51,960 you know, in a way, that's a little misleading, right? It's the same data. So the differences 1793 02:49:51,960 --> 02:49:58,040 between bars are more dramatic when we change the scale to be shorter, a little bit more 1794 02:49:58,040 --> 02:50:04,490 dramatic. But let's go The other way, and this is where I see people do things a lot. 1795 02:50:04,490 --> 02:50:10,580 Let's see what happens, see how that the the top of the y axis is 35. 
Right now, let's 1796 02:50:10,580 --> 02:50:17,970 double that. Let's just make it 70. And then let's see what happens. As you can see, the 1797 02:50:17,970 --> 02:50:25,010 differences between the bars look small, right? Like, the difference between that big Hispanic 1798 02:50:25,010 --> 02:50:33,670 origin one and the lower white alone and Asian alone ones isn't really that big anymore. So my 1799 02:50:33,670 --> 02:50:38,530 opponents would rather look at that graph. In fact, everything looks kind of small on 1800 02:50:38,530 --> 02:50:43,601 that graph. It's like, oh, there's no problems with insurance. Um, and that's, you know, 1801 02:50:43,601 --> 02:50:47,460 when people talk about lying with statistics, so to speak, I mean, these are the kind of 1802 02:50:47,460 --> 02:50:54,500 tricks people do to try and change how things appear. And the best way to do it is to just 1803 02:50:54,500 --> 02:50:59,940 do kind of what I suggested, which is look at the next value up on the scale from your tallest bar, and 1804 02:50:59,940 --> 02:51:06,590 use that as the top of your y axis. What I would have to do with the army is I 1805 02:51:06,590 --> 02:51:11,541 was looking at rate of knee injury, and also rate of ankle injury. But knee injury was 1806 02:51:11,541 --> 02:51:18,901 way more common. And so if I wanted to compare the two, I always used the same scale, because 1807 02:51:18,901 --> 02:51:25,190 otherwise, people wouldn't be able to see that the ankle injury was really, really low 1808 02:51:25,190 --> 02:51:32,511 compared to the knee injury, even though they're both important. Um, so with a taller 1809 02:51:32,511 --> 02:51:37,530 y axis, the differences between the bars look less dramatic, and also the taller you 1810 02:51:37,530 --> 02:51:42,540 make your y axis, the less it looks like you have in each of the bars, so you've got to be really 1811 02:51:42,540 --> 02:51:47,610 careful. I don't think you would do that. 
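To put rough numbers on the scale effect described above, here is a small Python sketch. The two bar values are assumptions (the lecture only says the tallest bar is close to 30 percent), and `drawn_fraction` and `visual_gap` are hypothetical helper names. The data never change; only the top of the y axis does.

```python
# Hypothetical bar values in percent. The lecture only says the tallest
# bar is close to 30; both numbers here are assumptions for illustration.
bars = {"lower group": 15.0, "tallest group": 29.0}

def drawn_fraction(value, axis_top):
    """Fraction of the plot height a bar fills when the y axis runs 0..axis_top."""
    return value / axis_top

def visual_gap(axis_top):
    """Difference in drawn height between the two bars, as a fraction of plot height."""
    return (drawn_fraction(bars["tallest group"], axis_top)
            - drawn_fraction(bars["lower group"], axis_top))

# Same data, three different tops for the y axis (30, 35 and 70 as in the lecture).
for axis_top in (30, 35, 70):
    print(f"axis top {axis_top}: tallest bar fills "
          f"{drawn_fraction(bars['tallest group'], axis_top):.0%} of the height, "
          f"gap between bars is {visual_gap(axis_top):.0%}")
```

Doubling the axis top from 35 to 70 halves both the drawn height of every bar and the visible gap between bars, which is why the taller axis makes the differences look less dramatic.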
But you know, other people do that, when they're 1812 02:51:47,610 --> 02:51:52,800 trying to make their points. So just be careful about that. Also, a term that was mentioned 1813 02:51:52,800 --> 02:51:58,080 in the book is the term clustered, as in clustered bar graph. It's not that complicated, it just 1814 02:51:58,080 --> 02:52:03,290 means more than one bar is graphed for each category. You'll see in the last one 1815 02:52:03,290 --> 02:52:10,381 I did, it was just on one topic. And here, if you look at this one on the right, and 1816 02:52:10,381 --> 02:52:15,830 of course, I mixed it up a little, I did the horizontal version. But this is life expectancy 1817 02:52:15,830 --> 02:52:18,120 at birth. 1818 02:52:18,120 --> 02:52:23,820 And you'll see that there's three sets of bars, right? There's 1819 02:52:23,820 --> 02:52:28,580 both sexes together, and there's a bunch of bars for that. And you see the legend: Hispanic, 1820 02:52:28,580 --> 02:52:32,580 non-Hispanic black, non-Hispanic white, and then they mix them all together, all races 1821 02:52:32,580 --> 02:52:38,160 and origins. And then they also have separate sets of bars for male and female. And so this would 1822 02:52:38,160 --> 02:52:43,280 be clustered. And if you do that, you really need a legend so people can tell what's going 1823 02:52:43,280 --> 02:52:49,210 on. You'll also notice that, you know, life expectancy, that's good if it's high, right, 1824 02:52:49,210 --> 02:52:56,620 you want to live to be 80, 90, 100. But if you look at the bottom of the slide where we have 1825 02:52:56,620 --> 02:53:01,950 the x axis, if we started at zero, and just made it all long, it would not even 1826 02:53:01,950 --> 02:53:06,271 fit on the slide. So what they'll do is they'll make these little hash marks with this little 1827 02:53:06,271 --> 02:53:13,750 squiggle, and indicate that they just skipped ahead. 
But like I said in the first part of 1828 02:53:13,750 --> 02:53:19,840 this, if they skip ahead on the female one, they have to skip ahead on all of them. Right, 1829 02:53:19,840 --> 02:53:24,280 so everything is skipped ahead there. This is a fair comparison. It's just like we're 1830 02:53:24,280 --> 02:53:28,960 sort of, it's like, we're fast forwarding through the movie up to about 50. And then 1831 02:53:28,960 --> 02:53:33,780 looking at the differences there, because everything's the same up to that. So that's just another 1832 02:53:33,780 --> 02:53:39,760 thing about scale: notice whether it's clustered, whether you've got a legend, and also look for 1833 02:53:39,760 --> 02:53:47,990 the squiggle. Okay, now I'm going to give you a shout out to a Pareto chart. And you 1834 02:53:47,990 --> 02:53:51,510 probably already noticed, we don't really use these much in healthcare, because this 1835 02:53:51,510 --> 02:53:57,960 example is about causes of an engine overheating. Well, we don't do that a lot in healthcare. 1836 02:53:57,960 --> 02:54:06,650 And you'll see I kind of slapped on a label on the y axis, the word frequency, okay. So 1837 02:54:06,650 --> 02:54:11,570 in a Pareto chart, you remember how I was saying the histogram is a special bar 1838 02:54:11,570 --> 02:54:17,920 chart, or bar graph? Well, a Pareto chart is a different kind of special bar graph. 1839 02:54:17,920 --> 02:54:24,090 Okay. And in that one, the height of the bar indicates the frequency of an event. Like 1840 02:54:24,090 --> 02:54:31,360 if you look at these events here, like damaged radiator core, that happened 31 times, right? 1841 02:54:31,360 --> 02:54:36,080 And that happened more often than faulty fans, which only happened 20 times. 
So what they 1842 02:54:36,080 --> 02:54:40,080 do is they figure out what happened the most and the second most and least, whatever, and 1843 02:54:40,080 --> 02:54:45,500 they deliberately arrange them in order left to right, according to decreasing height. 1844 02:54:45,500 --> 02:54:51,601 It's a way of sort of zeroing in on what is the most important problem you're finding. 1845 02:54:51,601 --> 02:54:57,521 So it's really meant to graph frequencies of problems. I've actually only ever seen one Pareto 1846 02:54:57,521 --> 02:55:02,830 chart in healthcare so far, and I really looked for one. And what it 1847 02:55:02,830 --> 02:55:09,061 was about was things that can happen that are bad in a nursing home. And 1848 02:55:09,061 --> 02:55:15,820 I remember the tallest bar was for falls, right? Like people fall in a nursing home. 1849 02:55:15,820 --> 02:55:21,970 And then there was a smaller bar for medication errors; that happens. 1850 02:55:21,970 --> 02:55:27,131 I think the reason why we don't use these a lot in healthcare is, you know, let's pretend 1851 02:55:27,131 --> 02:55:31,530 that's what this was: let's pretend this 31, instead of damaged radiator cores, 1852 02:55:31,530 --> 02:55:35,670 is 31 falls. Well, the first thing you'd probably ask is, well, how many people are 1853 02:55:35,670 --> 02:55:41,811 in that nursing home? You know, and how long did you collect data for, right? 31 1854 02:55:41,811 --> 02:55:47,220 falls is pretty bad. But it's not bad if you have hundreds of people over 10 years; 1855 02:55:47,220 --> 02:55:52,321 if all you get is 31 falls, you're doing pretty well. So I would say that the reason 1856 02:55:52,321 --> 02:55:57,050 why we don't use Pareto charts a lot in healthcare is that it sort of leaves out some important 1857 02:55:57,050 --> 02:56:02,841 information about these serious events. 
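The ordering rule for a Pareto chart (bars arranged left to right by decreasing frequency) amounts to a sort. A minimal Python sketch follows; the counts for damaged radiator core (31) and faulty fans (20) are from the lecture, while the other two causes and their counts are made up for illustration.

```python
# Frequencies of each cause of engine overheating.
causes = {
    "faulty fans": 20,            # from the lecture
    "damaged radiator core": 31,  # from the lecture
    "loose fan belt": 8,          # hypothetical
    "coolant leak": 12,           # hypothetical
}

# A Pareto chart draws the bars left to right in order of decreasing frequency.
pareto_order = sorted(causes.items(), key=lambda item: item[1], reverse=True)

for cause, count in pareto_order:
    # A crude text bar, just to show the decreasing heights.
    print(f"{cause:<22} {'#' * count} ({count})")
```

Note that the raw counts say nothing about how many individuals or how much time they cover, which is the limitation raised above for healthcare data.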
And so we like to look at things in different 1858 02:56:02,841 --> 02:56:12,110 ways. So just to summarize about bar graphs: bar graphs must be made following a few rules. 1859 02:56:12,110 --> 02:56:17,300 I talked to you about, you know, how you have to keep the width 1860 02:56:17,300 --> 02:56:22,870 the same, and how you have to label the axes, so we know what you're talking about, 1861 02:56:22,870 --> 02:56:27,710 because you can visualize both quantitative and qualitative data using a bar chart. So 1862 02:56:27,710 --> 02:56:32,222 these labels become really important, as do scales, right? Like, I showed you how you 1863 02:56:32,222 --> 02:56:36,271 change the scale, and you can make things look different. So you want to be careful 1864 02:56:36,271 --> 02:56:42,160 and be cognizant of that. And also, I did a shout out to Pareto charts, and I explained 1865 02:56:42,160 --> 02:56:46,980 why I think they're not used that much in healthcare. 1866 02:56:46,980 --> 02:56:51,391 Now we're going to jump into pie charts. You know, just even the thought of a pie chart 1867 02:56:51,391 --> 02:56:56,110 makes me hungry, doesn't it make you hungry? Um, so here's what a pie chart is. They're 1868 02:56:56,110 --> 02:57:02,580 also called circle graphs. They're used with counts or frequencies that are mutually exclusive. 1869 02:57:02,580 --> 02:57:08,361 And that sounds really fancy. But all it means is that every individual can only fall in 1870 02:57:08,361 --> 02:57:12,230 one category. So I'm going to give you the example on the right, which is actually from 1871 02:57:12,230 --> 02:57:16,820 a real report you should probably read. It was a survey that was done by the Massachusetts 1872 02:57:16,820 --> 02:57:23,160 Nurses Association, and they got 339 nurses to fill out the survey. One of the questions 1873 02:57:23,160 --> 02:57:30,311 was, do you receive annual blood borne pathogen training? 
Now the answer is only going to 1874 02:57:30,311 --> 02:57:37,681 be yes or no. They can't say yes and no. That is what mutually exclusive is: it's where you 1875 02:57:37,681 --> 02:57:43,881 can only give one answer. So as you can see, 234 people said yes, which is good. And 1876 02:57:43,881 --> 02:57:48,271 105 said no, which is bad, I'm worried about that. 1877 02:57:48,271 --> 02:57:50,530 But these pie charts 1878 02:57:50,530 --> 02:57:54,101 are often made in graphing programs, because they're a little difficult to do by hand. 1879 02:57:54,101 --> 02:58:00,880 And I'll explain to you why. And unlike Pareto charts, these are super common in healthcare, 1880 02:58:00,880 --> 02:58:06,760 as you can see right there on the slide. So let's look at the features of a pie chart. 1881 02:58:06,760 --> 02:58:14,790 Um, I actually just made up this fake pie chart, I pretended I had a class where I gave 1882 02:58:14,790 --> 02:58:20,430 a five point quiz, right? And the reason why I did that is I wanted to show you how to 1883 02:58:20,430 --> 02:58:26,630 do it with a quantitative variable. Because remember, the last one, it was yes or no. 1884 02:58:26,630 --> 02:58:30,590 And that's qualitative. Those are the answers that the nurses could give to that 1885 02:58:30,590 --> 02:58:35,710 survey question. Well, this is a different one. This is where I actually put, you know, 1886 02:58:35,710 --> 02:58:41,620 fake students and their points on this quiz into classes, right? Like you see zero 1887 02:58:41,620 --> 02:58:47,870 points, one to two points, three to four points and five points, right. 
So regardless of whether 1888 02:58:47,870 --> 02:58:52,540 you're doing yes/no, you know, qualitative, or, you know, different categories like that, 1889 02:58:52,540 --> 02:58:58,940 or you're doing classes like this, every individual in your data must be in only one of the categories, 1890 02:58:58,940 --> 02:59:04,801 only one of the classes, kind of like frequency tables and histograms. You know, everybody gets 1891 02:59:04,801 --> 02:59:09,040 one vote. And that's really important in a pie chart, even though it can be used with 1892 02:59:09,040 --> 02:59:14,930 qualitative or quantitative variables. And you'll see later what I mean by that. And 1893 02:59:14,930 --> 02:59:21,130 so here is just a fake example I made of how you would then make a pie chart out of a quantitative 1894 02:59:21,130 --> 02:59:24,521 variable. 1895 02:59:24,521 --> 02:59:25,521 So 1896 02:59:25,521 --> 02:59:30,840 I'm just gonna briefly go over how you would do this by hand, and I'm realizing I've never 1897 02:59:30,840 --> 02:59:37,970 done this by hand. I always use Excel, as you probably recognize from that lovely purple color, 1898 02:59:37,970 --> 02:59:43,300 which comes out of Excel. But if you were going to do it by hand, I guess you'd have 1899 02:59:43,300 --> 02:59:48,240 to go buy one of those things in the lower left, which is a protractor, because that helps 1900 02:59:48,240 --> 02:59:54,220 you see the degrees of a circle. Remember, a whole circle has 360 degrees, right? 1901 02:59:54,220 --> 02:59:58,551 I don't know if you remember all this from, like, trigonometry. And then, like, a half 1902 02:59:58,551 --> 03:00:04,130 circle would be 180 degrees. And so that's how you figure out how much of the piece 1903 03:00:04,130 --> 03:00:08,271 of the pie you need: using this protractor. So if you're going to make a pie chart by 1904 03:00:08,271 --> 03:00:13,470 hand, you first have to make a table. You'll see we make tables constantly in statistics. 
1905 03:00:13,470 --> 03:00:20,680 And I put class in the first column, because I was doing one that required class because 1906 03:00:20,680 --> 03:00:24,910 it's quantitative. If you were doing that one with the nurses saying yes or no, you 1907 03:00:24,910 --> 03:00:29,700 would put category, and you'd just say yes or no, right, and then total. Then of course, 1908 03:00:29,700 --> 03:00:34,501 next, you put the frequency. And I always put total to add it up, to try and make sure, 1909 03:00:34,501 --> 03:00:38,710 you know. My fake class apparently had 37 people in it. So I just want to make sure, 1910 03:00:38,710 --> 03:00:44,820 you know, everything adds up. Then the next step will remind you of relative frequency: 1911 03:00:44,820 --> 03:00:49,750 it's where you figure out the proportion of the circle that each class is going to take up, right. 1912 03:00:49,750 --> 03:00:56,830 So see the five points, the seven people who got five points? Well, if you divide seven 1913 03:00:56,830 --> 03:01:02,490 by 37, you're going to get 0.19. Well, I like percents, so that's 19%. 1914 03:01:02,490 --> 03:01:09,240 So that would say what proportion of the circle they get, right. And then finally, in the 1915 03:01:09,240 --> 03:01:14,930 last column, remember how I was telling you the whole circle is 360 degrees? Well, you 1916 03:01:14,930 --> 03:01:21,320 take that proportion you got, and you multiply it by 360 to figure out how many degrees 1917 03:01:21,320 --> 03:01:25,570 of the circle that piece gets. And that's why you need the protractor. And that's also 1918 03:01:25,570 --> 03:01:29,601 why I always use Excel for this, because it makes it so you don't have to worry about 1919 03:01:29,601 --> 03:01:35,391 those things. All you would need for Excel is actually just the class or the categories, 1920 03:01:35,391 --> 03:01:41,271 and the frequency. 
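The by-hand table described above (class, frequency, proportion, degrees) can be sketched in a few lines of Python. This is only an illustration: the total of 37 students and the 7 students with five points come from the lecture, while the other three frequencies are made-up placeholders chosen so the column adds up to 37.

```python
# Frequency table for the fake five-point quiz.
# ASSUMPTION: only the 7 (five points) and the total of 37 are from the
# lecture; the other three counts are hypothetical fillers.
quiz_classes = {
    "0 points": 2,
    "1-2 points": 10,
    "3-4 points": 18,
    "5 points": 7,
}

# The "total" row: a quick check that everything adds up.
total = sum(quiz_classes.values())


def pie_slices(frequencies):
    """Return {class: (proportion, degrees)} for a pie chart."""
    n = sum(frequencies.values())
    return {label: (f / n, f / n * 360) for label, f in frequencies.items()}


slices = pie_slices(quiz_classes)
for label, (proportion, degrees) in slices.items():
    # Columns: class, frequency, proportion of the circle, degrees of the circle
    print(f"{label:<12} {quiz_classes[label]:>3} {proportion:6.2f} {degrees:7.1f}")
```

The degrees column is what you would measure off with the protractor; the slices add back up to the full 360 degrees.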
And then if you use their automatic pie graph function, then you can 1921 03:01:41,271 --> 03:01:47,271 get all this other stuff out very quickly. So I just wanted to make a few notes about 1922 03:01:47,271 --> 03:01:52,851 pie charts. The thing I'm coming back to is these mutually exclusive categories. 1923 03:01:52,851 --> 03:01:57,851 So I want you to imagine that I do a survey, right. And I ask the question, what is your 1924 03:01:57,851 --> 03:02:03,631 favorite color? And I give some choices like red, green, blue, whatever. There's only going 1925 03:02:03,631 --> 03:02:08,370 to be one answer from everybody to that question, right? Because you can only have one favorite, 1926 03:02:08,370 --> 03:02:14,761 right? And that then is eligible to be used in a pie chart, because everybody gets one 1927 03:02:14,761 --> 03:02:20,330 vote. But a lot of times, I'll see people who do a different survey question. They'll 1928 03:02:20,330 --> 03:02:25,621 say, check off all of the colors you like. So if I get that, I'm like, oh, I love red, 1929 03:02:25,621 --> 03:02:29,841 I like orange, I like green, I'm checking off a bunch. There's some people I know who 1930 03:02:29,841 --> 03:02:33,681 don't really like color, like they just wear gray and black. So they probably wouldn't 1931 03:02:33,681 --> 03:02:38,040 check off anything. And then there are the people who just check off one or two. Well, 1932 03:02:38,040 --> 03:02:44,591 as you can see, people can have multiple votes or no votes or whatever. And if you have that 1933 03:02:44,591 --> 03:02:48,141 situation, like I was telling you, where people can say multiple things, you've got to go 1934 03:02:48,141 --> 03:02:54,181 into bar graph land, okay? Because a whole bunch of people can like red, a whole bunch 1935 03:02:54,181 --> 03:02:58,340 of people can like green, a whole bunch of people can like blue. And you won't get a 1936 03:02:58,340 --> 03:03:04,601 circle out of that. 
If everybody gives just one answer, and so therefore everybody's 1937 03:03:04,601 --> 03:03:10,680 in a mutually exclusive category, then you can use the pie chart. I also wanted to let 1938 03:03:10,680 --> 03:03:16,561 you know that I find it, and I think a lot of people do, more informative to put the percentage 1939 03:03:16,561 --> 03:03:22,630 on the actual chart than the frequency. Some people put both the frequency and the percentage, 1940 03:03:22,630 --> 03:03:28,710 which is good. It's not so helpful to just put the frequency, as you see that the nursing 1941 03:03:28,710 --> 03:03:35,040 report did on the left. And it's because you really don't know. You know, 234 seems like 1942 03:03:35,040 --> 03:03:38,610 a lot. But what proportion is that of the circle? That's what you would kind of want 1943 03:03:38,610 --> 03:03:43,370 to know. Whereas if you look on the right, on mine, you can see, like, for instance, only 1944 03:03:43,370 --> 03:03:49,271 5% got zero points. That's a small amount, right? You know what 5% means. It's just hard 1945 03:03:49,271 --> 03:03:55,391 to tell, you know, if you look at that one on the left. It looks a little like two thirds, 1946 03:03:55,391 --> 03:03:59,440 which would be 66%. But we don't know what the percent is, right. And so it's really 1947 03:03:59,440 --> 03:04:05,091 helpful to have that percent. And always include a title and a legend. Because if 1948 03:04:05,091 --> 03:04:07,820 you're graphing a pie chart, you're gonna have more than one category, and so people 1949 03:04:07,820 --> 03:04:12,811 are gonna want to know what that color means. 1950 03:04:12,811 --> 03:04:17,021 This looks so good, doesn't it look good? Um, pie charts are common in healthcare, and they 1951 03:04:17,021 --> 03:04:21,351 graph mutually exclusive categories. Okay, so you'll see this all the time. 
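As a small illustration of the point about labeling slices with percentages, here is a sketch that turns the nurse survey counts (234 yes, 105 no, out of 339 nurses, all from the lecture) into labels showing both the frequency and the percentage. The label format itself is just an assumption about what a readable label might look like.

```python
# Counts from the nurse survey pie chart: 234 yes, 105 no (339 nurses total).
responses = {"Yes": 234, "No": 105}

n = sum(responses.values())

# Build one label per slice showing the frequency and the percentage.
labels = {
    answer: f"{answer}: {count} ({count / n:.0%})"
    for answer, count in responses.items()
}

for text in labels.values():
    print(text)
```

With these counts, the "yes" slice works out to about 69%, a bit more than the two thirds it looks like by eye, which is exactly the kind of thing the frequency alone does not tell you.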
And like 1952 03:04:21,351 --> 03:04:26,440 I said, it's easier to make using software. I use Excel. It can come out of other software, 1953 03:04:26,440 --> 03:04:31,910 but I just like Excel because you can really put fancy labels on and you can do that squiggle 1954 03:04:31,910 --> 03:04:38,711 thing. But choosing a graph requires some consideration, like whether or not you actually 1955 03:04:38,711 --> 03:04:45,271 want to make a pie chart or a bar chart or whatever. It requires some thought. And also, 1956 03:04:45,271 --> 03:04:49,021 regardless of the chart you make, you should follow these rules. You should always provide 1957 03:04:49,021 --> 03:04:54,550 a title, okay? Even if it's just for your private use. Trust me, I've done this. I go 1958 03:04:54,550 --> 03:04:59,251 back and I'm like, I don't even know what I graphed. So take your time, sit down, write 1959 03:04:59,251 --> 03:05:06,650 a little title, so you remember what you graphed. Also, label the axes. Because, again, you think 1960 03:05:06,650 --> 03:05:10,061 you're going to remember, or maybe you think it's obvious and everybody in the audience is 1961 03:05:10,061 --> 03:05:16,591 going to be able to tell. Don't leave anything to be assumed. Just be absolutely clear about what's 1962 03:05:16,591 --> 03:05:22,480 on each axis. Always identify your units of measure. So if you're talking about a rate 1963 03:05:22,480 --> 03:05:28,530 per 10,000 people or a percentage, or maybe you're talking about an average, or you're 1964 03:05:28,530 --> 03:05:32,672 talking about a frequency, it doesn't matter, just make sure you're clear about what you're 1965 03:05:32,672 --> 03:05:39,920 talking about in the units of measure. Usually, this ends up on the y axis. 
So the thought 1966 03:05:39,920 --> 03:05:46,150 is to make the graph as clear as possible, thinking font size, thinking number of items 1967 03:05:46,150 --> 03:05:51,040 graphed. You know, I've sometimes seen a bunch of time series graphs where they put so many 1968 03:05:51,040 --> 03:05:57,500 lines on there, I can't even see anything. Or they'll have these really tiny font sizes. 1969 03:05:57,500 --> 03:06:03,660 Or they'll just try to put too much on one graph, and it's hard to read. So 1970 03:06:03,660 --> 03:06:08,102 if you have trouble reading it, probably everybody else will. So you want to modify it. So I 1971 03:06:08,102 --> 03:06:13,040 just threw this on the right. Can you tell what's missing from the above graph? The above 1972 03:06:13,040 --> 03:06:17,021 graph is really missing a lot of information. I mean, we don't even know what it's about. 1973 03:06:17,021 --> 03:06:21,510 We can kind of guess it's a time series graph because of the time at the bottom. But 1974 03:06:21,510 --> 03:06:25,160 what else, right? So the person who made this really knew what they were talking about, 1975 03:06:25,160 --> 03:06:31,230 but we don't, and you don't want that to happen to your graph. Okay, so here, what I'm going 1976 03:06:31,230 --> 03:06:37,000 to do is review all the different graphs I've talked about in chapter two, and talk about 1977 03:06:37,000 --> 03:06:42,400 the cases where that graph is useful, so you can keep straight in your heads why 1978 03:06:42,400 --> 03:06:47,470 we have all these graphs, right. So first, there's the frequency histogram. Remember 1979 03:06:47,470 --> 03:06:52,551 that that was only for quantitative data. And that's what you make when you want to 1980 03:06:52,551 --> 03:06:57,570 see the distribution, right? Remember, the distribution was a shape. 
And a frequency 1981 03:06:57,570 --> 03:07:04,330 histogram is a particular type of bar graph that is meant for showing these distributions. 1982 03:07:04,330 --> 03:07:09,040 I also showed you how to make a relative frequency histogram, which is almost the same thing, 1983 03:07:09,040 --> 03:07:13,841 only it graphs the relative frequency instead of the frequency. And that also will show 1984 03:07:13,841 --> 03:07:18,940 you the distribution, right, because the pattern will be the same. But this one's specifically 1985 03:07:18,940 --> 03:07:24,141 good for comparing to other data. So if you have two sets of data, maybe from two different 1986 03:07:24,141 --> 03:07:29,200 locations or two different groups, then you want to use the relative frequency histogram, 1987 03:07:29,200 --> 03:07:35,650 because then it's easier to compare distributions, right. I also showed you how to make a stem 1988 03:07:35,650 --> 03:07:39,860 and leaf display. I explained what the stem and leaf is, what the leaves are, and what 1989 03:07:39,860 --> 03:07:46,221 the stem is. And that's also for quantitative data. And that's also if you want to see the 1990 03:07:46,221 --> 03:07:49,830 distribution. It's also good for organizing the data. It's a little easier to make by 1991 03:07:49,830 --> 03:07:56,061 hand than a histogram, because a histogram makes you make a frequency table first, and 1992 03:07:56,061 --> 03:08:00,750 with a stem and leaf display, you can kind of skip that step. So again, these first three were 1993 03:08:00,750 --> 03:08:06,730 just about trying to take quantitative data and visualize it so you can look at distributions 1994 03:08:06,730 --> 03:08:13,150 and also look for outliers. Next, we went into the time series graph. And that is really 1995 03:08:13,150 --> 03:08:18,790 about time, right? That's for graphing a variable that changes over time. 
And it's measured at 1996 03:08:18,790 --> 03:08:24,220 regular intervals, mainly to see trends, like: is it going up? Is it going down? Was there 1997 03:08:24,220 --> 03:08:29,771 an epidemic? And that's what a time series graph is for. The bar graph: now this is the 1998 03:08:29,771 --> 03:08:37,220 generic bar graph, not the specific histogram like I described, but the generic bar graph 1999 03:08:37,220 --> 03:08:43,540 can be used for qualitative data or for quantitative data. And it can be used for displaying frequency 2000 03:08:43,540 --> 03:08:49,521 or percentage, and we went over some examples. Then I shouted out to the Pareto chart, which 2001 03:08:49,521 --> 03:08:56,230 is a special bar graph, right. And that special bar graph graphs frequencies of rare events 2002 03:08:56,230 --> 03:09:01,990 in descending order, usually bad things, you know, rare bad things. And again, we don't 2003 03:09:01,990 --> 03:09:07,900 really use this much in healthcare. Finally, I went over the pie graph. And that's for 2004 03:09:07,900 --> 03:09:12,931 mutually exclusive categories, quantitative or qualitative. And we use those a lot in 2005 03:09:12,931 --> 03:09:15,230 healthcare. 2006 03:09:15,230 --> 03:09:21,251 So in conclusion, in this particular lecture, I first went over the time series graphs, 2007 03:09:21,251 --> 03:09:26,440 and explained how they show changes over time. And then I went over bar graphs and showed 2008 03:09:26,440 --> 03:09:31,061 you how they can display quantitative and qualitative data. They can be up and down 2009 03:09:31,061 --> 03:09:36,891 or horizontal. I showed you some different examples. And then we went through pie charts, 2010 03:09:36,891 --> 03:09:40,601 looking at mutually exclusive categories, which I think are my favorite. Like, look at 2011 03:09:40,601 --> 03:09:46,561 this pie. This makes me so hungry. Um, but at the end, it's important to pick the right 2012 03:09:46,561 --> 03:09:52,460 chart. 
Because you want to have a useful visualization of your data. If you're trying to look for 2013 03:09:52,460 --> 03:09:57,150 a distribution, choose the right kind of visualizations, the right kind of graphs. If you want to instead 2014 03:09:57,150 --> 03:10:02,061 look for trends over time, you've got to choose the right kind of graph for that. So I gave you some 2015 03:10:02,061 --> 03:10:08,931 pointers on how to do that. And now my mouth is watering. So I'm gonna go eat some pie. 2016 03:10:08,931 --> 03:10:18,061 Yoo hoo, it's Monica Wahi again, your statistics lecturer from Labouré College. I decided to 2017 03:10:18,061 --> 03:10:24,771 chop up chapter two and reconfigure it. So this first lecture is going to be on part 2018 03:10:24,771 --> 03:10:33,400 of chapter 2.1, frequency tables, and the entire chapter 2.3, which is stem and leaf 2019 03:10:33,400 --> 03:10:39,931 displays. So here are your learning objectives for this lecture. At the end of this lecture, 2020 03:10:39,931 --> 03:10:46,051 you should be able to state the steps for making a frequency table, and define class, upper 2021 03:10:46,051 --> 03:10:51,570 class limit, and lower class limit. You should be able to explain what relative frequency 2022 03:10:51,570 --> 03:10:56,601 is and why it's useful for comparing groups. Also, you should be able to state the steps 2023 03:10:56,601 --> 03:11:02,120 for making a stem and leaf display. And finally, you should be able to describe the difference 2024 03:11:02,120 --> 03:11:08,540 between an ordered and an unordered leaf. And if all that sounds foreign to you, don't worry, 2025 03:11:08,540 --> 03:11:14,460 you'll understand it all at the end of this lecture. So just to introduce what I'm going 2026 03:11:14,460 --> 03:11:19,660 to cover, first, I'm going to define for you what a frequency table actually is. 
And then 2027 03:11:19,660 --> 03:11:23,540 I'll explain to you how to make one, which will help you understand even better what 2028 03:11:23,540 --> 03:11:29,830 it is. After that I'm jumping right into what a stem and leaf display is, and how to make 2029 03:11:29,830 --> 03:11:34,780 one of those. And the main reason why I combined these is because I feel like stem 2030 03:11:34,780 --> 03:11:40,390 and leaf displays can help you make frequency tables. That connection was not really made 2031 03:11:40,390 --> 03:11:46,120 in the book, so I'm making it here. So let's just start with the frequency table. So what 2032 03:11:46,120 --> 03:11:51,391 is one of those? Well, you know, when I think of frequency, I think of the radio, right? 2033 03:11:51,391 --> 03:11:56,921 Like I think of R.E.M., "What's the Frequency, Kenneth?" I think that was their last hit. Okay, 2034 03:11:56,921 --> 03:12:01,840 that's not what we're talking about. We're talking about frequency like the word frequently, 2035 03:12:01,840 --> 03:12:07,470 like how frequently do you go to work per week, right. And you would count how many 2036 03:12:07,470 --> 03:12:10,690 times you go to work or go to class per week. 2037 03:12:10,690 --> 03:12:11,690 Well, frequency 2038 03:12:11,690 --> 03:12:17,660 is like frequently. It's how frequently something happens. So first, I'm going to 2039 03:12:17,660 --> 03:12:22,410 explain to you what a frequency table is, and why you make them. Then I'm going to define 2040 03:12:22,410 --> 03:12:26,830 some more terms. I just defined frequency; I'm going to define some more that you're 2041 03:12:26,830 --> 03:12:30,761 going to need to know. And then I'm going to explain the steps for making a frequency 2042 03:12:30,761 --> 03:12:40,780 table and a relative frequency table. So remember quantitative data? I'll just remind you: qualitative 2043 03:12:40,780 --> 03:12:47,510 data are categorical. 
So that's like gender, race, diagnosis, where you put individuals 2044 03:12:47,510 --> 03:12:53,750 into categories. And quantitative data are numerical. Remember, like age, heart rate, 2045 03:12:53,750 --> 03:12:58,460 blood pressure. Now, I just want to calibrate you to the idea that this whole frequency 2046 03:12:58,460 --> 03:13:05,610 table thing, this whole thing is about quantitative data. And so this entire lecture 2047 03:13:05,610 --> 03:13:12,740 actually is focusing only on quantitative data and not qualitative data. Alrighty. So 2048 03:13:12,740 --> 03:13:17,470 when you have quantitative data, as you probably noticed, if you've ever had it, right, like, 2049 03:13:17,470 --> 03:13:22,780 let's say you go on Yelp, you know, I always give that example, and 2050 03:13:22,780 --> 03:13:27,740 you try to decide whether to go to a restaurant or not. You have a bunch of fives and fours 2051 03:13:27,740 --> 03:13:32,480 and threes and twos and one stars. How do you know? You know, you just have a pile of 2052 03:13:32,480 --> 03:13:37,200 numbers. So how do you organize them? I'm going to give you, like, a totally fake example 2053 03:13:37,200 --> 03:13:42,980 I made up. Okay, so I'm pretending that 60 patients were studied for the distance they 2054 03:13:42,980 --> 03:13:47,511 needed to be transported in an ambulance. So how far they needed to be transported from 2055 03:13:47,511 --> 03:13:52,891 where they called the ambulance and were picked up to where they actually got to the hospital. So the 2056 03:13:52,891 --> 03:13:59,090 shortest transport in my fake data, or the minimum, was one mile, which is awesome. That's 2057 03:13:59,090 --> 03:14:02,420 kind of what happens to me because I live right near a hospital. Hopefully, I don't 2058 03:14:02,420 --> 03:14:07,341 need to be in an ambulance very often. 
But that's what happens in urban centers. The 2059 03:14:07,341 --> 03:14:12,710 longest transport, the maximum, was 47 miles, which would really suck. And I just want to 2060 03:14:12,710 --> 03:14:18,160 point out that that happens to people in rural areas because of lack of access. So 2061 03:14:18,160 --> 03:14:22,311 this is kind of realistic, even though it's fake data. But anyway, it's hard to just look 2062 03:14:22,311 --> 03:14:27,990 at a pile of numbers. So how do we understand these data? Well, now I'm going to start those 2063 03:14:27,990 --> 03:14:33,910 definitions. The word class means the interval in the data. So remember, we're talking 2064 03:14:33,910 --> 03:14:39,490 quantitative data. So let's say I just made up, well, how many people got transported between 2065 03:14:39,490 --> 03:14:47,720 30 and 40 miles, okay. That would be a class of 30 to 40, right. And the class limits are 2066 03:14:47,720 --> 03:14:52,860 the lowest and highest values that can fit in the class. So carrying on with my example 2067 03:14:52,860 --> 03:14:58,731 of a class I just randomly picked, 30 to 40: if we made that a class, we would say 30 would 2068 03:14:58,731 --> 03:15:05,221 be the lower class limit, and 40 would be the upper class limit. Make sense? Alrighty. 2069 03:15:05,221 --> 03:15:10,171 So then, of course, you have the width of the class, or the class width. So that's how 2070 03:15:10,171 --> 03:15:15,920 wide the class is. So carrying on with the example, if the upper class limit was 40, 2071 03:15:15,920 --> 03:15:21,550 and the lower class limit was 30, what you do is you subtract 30 from 40, which gets you 2072 03:15:21,550 --> 03:15:26,450 10. And then you add one, and it equals 11. That's a little formula. But if you're like 2073 03:15:26,450 --> 03:15:32,591 me, and you count on your fingers, you would go 30, 31, 32, 33, 34, blah, blah, blah, and you'd 2074 03:15:32,591 --> 03:15:39,900 realize that there are 11 numbers in that. 
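The "subtract, then add one" rule for class width can be checked with a tiny sketch. The function name `class_width` is just an illustrative choice, and the sketch assumes whole-number data, as in the miles example.

```python
def class_width(lower_limit, upper_limit):
    """Count how many whole-number values fit in a class, limits included."""
    return upper_limit - lower_limit + 1


# The lecture's example class: 30 to 40 miles.
width = class_width(30, 40)

# Finger-counting check: list every whole number from 30 through 40.
values_in_class = list(range(30, 40 + 1))

print(width, len(values_in_class))
```

Both ways of counting agree: the formula and the finger count give the same class width.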
Now we get to frequency, like I sort of quickly 2075 03:15:39,900 --> 03:15:46,640 explained, and that is how many values from the data fall in the class. So how many patients 2076 03:15:46,640 --> 03:15:52,771 were transported 30 to 40 miles. Or another way of saying it is, if you look in all the 2077 03:15:52,771 --> 03:15:59,630 data you have, and you find every single person that got either 30, 31, 32, 33, blah, blah, blah, 2078 03:15:59,630 --> 03:16:06,880 up to 40, and count all those people up, then you will get the frequency for that class. 2079 03:16:06,880 --> 03:16:14,271 Okay, but you probably realize you do need to decide on classes before you go counting 2080 03:16:14,271 --> 03:16:19,160 frequencies, because you need to know the lower and upper class limits. So let's talk 2081 03:16:19,160 --> 03:16:24,521 about some rules about classes. First of all, classes have to be the same width. You can't 2082 03:16:24,521 --> 03:16:30,761 have 30 to 40, and then 40 to 42, right, or 41 to 42, right? You can't have skinny class, 2083 03:16:30,761 --> 03:16:37,561 fat class. They have to have the same width. But, um, there are different ways to pick 2084 03:16:37,561 --> 03:16:44,721 it, right? So, class width can be determined empirically. Isn't that a fancy word? Empirically 2085 03:16:44,721 --> 03:16:50,370 just means you just choose it because you like it, right. And if you ever look at survey 2086 03:16:50,370 --> 03:16:52,710 data about just about anything, when they 2087 03:16:52,710 --> 03:16:59,780 look at the quantitative variable of age, they often put that in classes. And as you'll 2088 03:16:59,780 --> 03:17:06,440 see on the slide, these are the classes we often see: 18 to 24, 25 to 34, 35 to 44. And you 2089 03:17:06,440 --> 03:17:10,120 can go on, right, like, that's what you normally see. And that means, empirically, you just 2090 03:17:10,120 --> 03:17:15,990 picked it out of the hat. 
And already, you're probably noticing, well, 18 to 25, or 18 to 2091 03:17:15,990 --> 03:17:17,090 24, and 65 and 2092 03:17:17,090 --> 03:17:18,360 older, those classes 2093 03:17:18,360 --> 03:17:22,970 aren't really equal to the ones in the middle, right? Like, what's the upper class limit 2094 03:17:22,970 --> 03:17:29,140 for 65 and older? Okay, well, that's just normally what happens in the world, and especially 2095 03:17:29,140 --> 03:17:35,181 in healthcare. In healthcare, when you pick classes, even though the classes are technically 2096 03:17:35,181 --> 03:17:39,890 supposed to be the same width, you really should be guided by the scientific literature. 2097 03:17:39,890 --> 03:17:46,650 And you'll see why later, when I show you the other videos in this chapter. It's because 2098 03:17:46,650 --> 03:17:52,170 you really want to be able to compare whatever you find to whatever other people have found 2099 03:17:52,170 --> 03:17:55,610 before you. And therefore you don't want to cut up your classes in different ways, or 2100 03:17:55,610 --> 03:18:03,641 it's hard to compare them. However, in the book, they teach this class width formula, 2101 03:18:03,641 --> 03:18:09,830 so I thought I should really show you that, too. So here's the class width formula that 2102 03:18:09,830 --> 03:18:15,370 I don't really see used much in healthcare statistics, but I'm going to teach you anyway. 2103 03:18:15,370 --> 03:18:19,521 So this is the formula. First you calculate this number: you find the maximum in your 2104 03:18:19,521 --> 03:18:24,040 data, and the minimum in your data, and you subtract the minimum from the maximum. 2105 03:18:24,040 --> 03:18:29,671 So in the example I was giving from the fake data about the transport, 47 was the maximum, 2106 03:18:29,671 --> 03:18:35,021 and the minimum was one. So I did the first step and got 46. 
Okay, looking back at the 2107 03:18:35,021 --> 03:18:40,301 formula, you divide whatever you got there by the number of classes desired. In other 2108 03:18:40,301 --> 03:18:45,841 words, like, however many, you know, categories you want, right. So you never want too 2109 03:18:45,841 --> 03:18:53,230 many, like you don't want 10 or something. You know, 3, 4, 5, 6, 7, usually something in that 2110 03:18:53,230 --> 03:18:58,601 range is a good number of classes. So let's pick six just for fun. So we'll take that 2111 03:18:58,601 --> 03:19:05,141 46 number we got, we divide it by six, and we get 7.7. Then back to the formula slide, how 2112 03:19:05,141 --> 03:19:10,681 you decide then your class width is you increase this number to get to the next whole number. 2113 03:19:10,681 --> 03:19:14,080 Now a lot of people are confused by that, because even if I'd gotten something 2114 03:19:14,080 --> 03:19:20,771 low, like 7.1, I'd still go up to eight. You have to increase it up to the next whole number. 2115 03:19:20,771 --> 03:19:25,400 So you have, like, this integer, you know, that's a number without any decimals after 2116 03:19:25,400 --> 03:19:30,061 it. So you have this integer for your class width, so our class width in this example then 2117 03:19:30,061 --> 03:19:40,601 would be eight. So, um, now I described to you that whole class width formula, but I'm not going 2118 03:19:40,601 --> 03:19:45,351 to use it in the example, because we don't really do that much in healthcare, and it makes 2119 03:19:45,351 --> 03:19:50,220 it actually kind of hard to understand, because you want something that's a little intuitive. 2120 03:19:50,220 --> 03:19:57,110 Like if you look on the slide right now, you know, less than 20 miles, 21 to 29, 30 to 39, 2121 03:19:57,110 --> 03:20:04,101 and then 40 or more, that makes a little more sense in your head. You know, that's how we 2122 03:20:04,101 --> 03:20:11,570 think of miles. 
If I had put, like, 18 to 24, and 25 to 29, you know, we don't really think 2123 03:20:11,570 --> 03:20:16,340 that way. So this is helpful in healthcare, to boil it down to something like this. And 2124 03:20:16,340 --> 03:20:20,351 by the way, if I was writing a real paper on this sort of real data, I'd be looking at 2125 03:20:20,351 --> 03:20:24,430 the papers before this that talked about transport times and looking at those 2126 03:20:24,430 --> 03:20:26,360 class limits. Okay, 2127 03:20:26,360 --> 03:20:31,431 so a frequency table displays each class, along with the frequency, the number of data 2128 03:20:31,431 --> 03:20:35,250 points in each class. As you can see, the class limits are on the left side of the simple 2129 03:20:35,250 --> 03:20:40,450 frequency table, you know, the classes, and then the frequencies are on the right side, right. 2130 03:20:40,450 --> 03:20:46,580 And you'll notice that they all add up to 60, because we measured 60 fake patients, 2131 03:20:46,580 --> 03:20:50,860 and it's really good to do that little check. Because you don't want to double count people, 2132 03:20:50,860 --> 03:20:56,591 put them in two classes. They only get to be in one, etc. So selecting arbitrary class 2133 03:20:56,591 --> 03:21:00,671 limits can make the frequency table unbalanced. So in other words, doing this empirical thing 2134 03:21:00,671 --> 03:21:07,480 can make it sort of weird, because less than 20 is big, and 40 or more miles is big. 2135 03:21:07,480 --> 03:21:12,660 They're bigger than the other classes. So it does kind of break the rules of class 2136 03:21:12,660 --> 03:21:17,490 width. But not following the scientific literature can make your results not comparable, and 2137 03:21:17,490 --> 03:21:23,190 can make the science less useful. And so that's why I sort of rail against the book with 2138 03:21:23,190 --> 03:21:31,790 this class width formula thing. 
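To make the class width formula and the "does it add up to 60" check concrete, here is a minimal sketch. The minimum (1), maximum (47), six classes, and n = 60 come from the lecture; the individual distances are a hypothetical stand-in, since the lecture's fake data are not listed, and the function names are illustrative choices only.

```python
import math


def class_width_formula(minimum, maximum, num_classes):
    """(max - min) / number of classes, increased to the next whole number."""
    return math.ceil((maximum - minimum) / num_classes)


def class_limits(minimum, width, num_classes):
    """Lower and upper limits of equal-width classes, starting at the minimum."""
    return [(minimum + i * width, minimum + (i + 1) * width - 1)
            for i in range(num_classes)]


# ASSUMPTION: hypothetical stand-in for the 60 fake transport distances;
# only the minimum (1), maximum (47), and n = 60 come from the lecture.
distances = [1 + (i * 46) // 59 for i in range(60)]

width = class_width_formula(min(distances), max(distances), 6)
limits = class_limits(min(distances), width, 6)

# Tally the frequency of each class; every data point lands in exactly one.
frequency_table = {
    (low, high): sum(low <= d <= high for d in distances)
    for low, high in limits
}

for (low, high), f in frequency_table.items():
    print(f"{low:>2}-{high:<2} {f}")

# The little check: frequencies must add up to the number of data points.
assert sum(frequency_table.values()) == len(distances)
```

One caveat on the rounding: `math.ceil` leaves a result that is already a whole number unchanged, and the lecture does not say what to do in that case, so treat that edge as an open choice rather than part of the formula.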
So I'm going to just give you another example of 2139 03:21:31,790 --> 03:21:37,740 a frequency table. Okay. This one is also healthcare-y. You know, glucose 2140 03:21:37,740 --> 03:21:43,050 is measured in the blood and expressed in milligrams per 100 milliliters, right? So glucose 2141 03:21:43,050 --> 03:21:47,800 is a huge molecule, and it should be cleared from the blood, especially when fasting. So if 2142 03:21:47,800 --> 03:21:51,790 you're not eating anything, you're not putting any glucose in, and your body's supposed to be, like, 2143 03:21:51,790 --> 03:21:57,820 metabolizing it. The problem is some people don't metabolize glucose very well. You know, 2144 03:21:57,820 --> 03:22:02,740 that's what diabetes is. So you care about how much glucose is sitting around in people's 2145 03:22:02,740 --> 03:22:03,740 blood. 2146 03:22:03,740 --> 03:22:04,740 So blood 2147 03:22:04,740 --> 03:22:09,490 glucose levels for a random sample of 70 women were recorded after a 12-hour fast. 2148 03:22:09,490 --> 03:22:15,420 And this is what they got: the minimum was 45, the maximum was 109. And they picked 2149 03:22:15,420 --> 03:22:26,021 six classes. So this is how they set up their class limits. And again, this is using the class 2150 03:22:26,021 --> 03:22:30,740 width formula. And just to demonstrate, you know, it sort of comes out a little weird 2151 03:22:30,740 --> 03:22:38,431 here. But then they got these frequencies, okay. And this is again just another example, 2152 03:22:38,431 --> 03:22:42,961 using this time the class width formula to get our six classes and to make sure that 2153 03:22:42,961 --> 03:22:48,180 they covered everybody. Now, you'll notice in this, we start with the minimum, like 45 2154 03:22:48,180 --> 03:22:52,340 to 55. And we end with the maximum, which is up to 110. And that's really the clearest 2155 03:22:52,340 --> 03:22:57,870 way to do it. It's just not typically done that way. 
If you read, like, scientific literature 2156 03:22:57,870 --> 03:23:05,641 in healthcare, you just don't see these frequency tables labeled like that. So just to wrap 2157 03:23:05,641 --> 03:23:11,681 up this part, make sure all of your data points are accounted for only once in one of the 2158 03:23:11,681 --> 03:23:17,580 classes. So whether you use a class width formula, or you use empirical or arbitrarily picked 2159 03:23:17,580 --> 03:23:24,521 classes, every single data point only gets one vote, it can only be in one of the classes. 2160 03:23:24,521 --> 03:23:29,311 And also, you don't want to leave any of the data points out. So you want to make 2161 03:23:29,311 --> 03:23:32,931 sure that that happens, that you account for all of them. And also you need to make sure 2162 03:23:32,931 --> 03:23:37,330 your classes cover all the data, right. In healthcare, when we do that thing, up to 20, 2163 03:23:37,330 --> 03:23:42,930 and 65 and over, all that stuff, we cause that to happen. However, if you're going to 2164 03:23:42,930 --> 03:23:47,410 use a class width formula, you really have to pay attention to where your minimum and 2165 03:23:47,410 --> 03:23:52,680 your maximum are. Because then you want to make sure all of your classes cover all of 2166 03:23:52,680 --> 03:23:59,300 your data. And like I mentioned, make sure the total of the frequencies 2167 03:23:59,300 --> 03:24:03,240 in your classes adds up to the total number of data points. It's just a little check 2168 03:24:03,240 --> 03:24:10,391 to make sure you didn't do something wrong. Now I'm going to talk about what is a relative 2169 03:24:10,391 --> 03:24:15,561 frequency table. And that builds on what you already just learned about frequency. So we 2170 03:24:15,561 --> 03:24:21,140 all know what our relatives are. They're like our family, right? We have relationships with 2171 03:24:21,140 --> 03:24:28,370 them.
And so what relative means is in relationship to the rest of the data, okay? So in statistics, 2172 03:24:28,370 --> 03:24:35,330 they often use this fancy F to stand for frequency. And, as I've mentioned before, the sample 2173 03:24:35,330 --> 03:24:42,120 size, if you have a sample, they use a lowercase n. So what they use as the formula for relative 2174 03:24:42,120 --> 03:24:49,061 frequency is F divided by n. And if you're clever with math, you realize what that means 2175 03:24:49,061 --> 03:24:55,220 is if you take a frequency of any of the classes, you know, it's just a portion of 2176 03:24:55,220 --> 03:25:00,630 the whole sample, and you divide it by the total sample, which is that n, you'll get 2177 03:25:00,630 --> 03:25:07,690 the proportion of values that are in that class. It's not really that fancy. So relative 2178 03:25:07,690 --> 03:25:11,511 frequency is something very useful to put in a frequency table. So you'll see that I 2179 03:25:11,511 --> 03:25:16,380 kind of crammed it in onto the right side. This is the old frequency table I just showed 2180 03:25:16,380 --> 03:25:23,190 you with glucose, but I crammed in this relative frequency next to it. So it's super easy to 2181 03:25:23,190 --> 03:25:29,160 calculate. Like, for example, for the first one, see, 45 to 55, the frequency is three. 2182 03:25:29,160 --> 03:25:34,390 What did I do? Pull out the old calculator? Well, actually I use Excel. And I did three 2183 03:25:34,390 --> 03:25:40,070 divided by 70, because that was the total. And I got 0.04. And those of you who don't 2184 03:25:40,070 --> 03:25:44,391 really like proportions, you can do that thing where you move the decimal two places to the 2185 03:25:44,391 --> 03:25:50,061 right, and then put a percent sign. So that would be like 4% of those 70 people are in 2186 03:25:50,061 --> 03:25:55,261 that first class.
And then the same thing happened with the next one, I took, you know, 2187 03:25:55,261 --> 03:26:03,400 the 56 to 66, I took seven divided by 70, which came out to 0.10. And those of you into 2188 03:26:03,400 --> 03:26:08,320 percents, I'm really into percents, I like moving that decimal over, I think of it as 2189 03:26:08,320 --> 03:26:14,150 10%, then. But whatever. As you can see at the bottom, it all has to equal 1.0 if you 2190 03:26:14,150 --> 03:26:19,590 like proportion land, or 100% if you're like me and you like percent land. But in 2191 03:26:19,590 --> 03:26:24,271 any case, this is all you have to do to do the relative frequency table, you just make 2192 03:26:24,271 --> 03:26:30,811 another column and do all those calculations. And it's super easy to calculate it. And it's 2193 03:26:30,811 --> 03:26:35,720 very helpful. So why did we even do this? Because we had a pile of 2194 03:26:35,720 --> 03:26:41,061 quantitative data, and it was really hard to organize, right. And the first thing 2195 03:26:41,061 --> 03:26:46,061 we had to do was select class width. And I talked about the politics behind that. But 2196 03:26:46,061 --> 03:26:49,940 ultimately, whatever you do, the lower and the upper class limits need to be 2197 03:26:49,940 --> 03:26:55,330 determined and put in the first column of your frequency table. Then in your second 2198 03:26:55,330 --> 03:27:00,101 column, which are the frequencies, you count up how many are in that class, and you fill 2199 03:27:00,101 --> 03:27:05,851 it in. And then if you make that third column, then you can do that dividing thing and get 2200 03:27:05,851 --> 03:27:10,801 your relative frequencies. And that's great. That's how you build your frequency table. 2201 03:27:10,801 --> 03:27:17,180 And as I go through future lectures, you'll see even more why you would make that table, 2202 03:27:17,180 --> 03:27:22,190 like how useful that can be.
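The "dividing thing" for the third column can be sketched in a few lines of Python. Only the first two frequencies (3 and 7) and the total of 70 are stated in the lecture; the other four counts below are placeholders picked only so the column adds up to 70, and the function name `relative_frequencies` is just for illustration.

```python
def relative_frequencies(frequencies):
    """Divide each class frequency f by the sample size n."""
    n = sum(frequencies)
    return [f / n for f in frequencies]


# First two counts are from the lecture; the remaining four are placeholders
# chosen only so that the total is 70.
frequencies = [3, 7, 22, 26, 9, 3]
rel = relative_frequencies(frequencies)

for f, r in zip(frequencies, rel):
    # Show the proportion and the same value as a percent.
    print(f"f = {f:>2}   f/n = {r:.2f}   ({r:.0%})")

# The little check: frequencies add up to n, proportions add up to 1.0.
print("total frequency:", sum(frequencies))
print("total proportion:", round(sum(rel), 2))
```

The two printed totals are the same sanity check described above: every data point counted once, and the relative frequencies summing to 1.0 (or 100%).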
Given that you have quantitative data, and it kind of gets 2203 03:27:22,190 --> 03:27:27,500 all over the place, it's very helpful to organize it in that table. 2204 03:27:27,500 --> 03:27:28,500 Now I'm going to 2205 03:27:28,500 --> 03:27:32,311 move on to talk about the stem and leaf. And the reason why I picked talking about it 2206 03:27:32,311 --> 03:27:38,973 now is because it's on the theme of organizing quantitative data. So I'm going to talk to 2207 03:27:38,973 --> 03:27:44,750 you about what the stem and leaf plot actually is. Here's just an example on the slide, 2208 03:27:44,750 --> 03:27:50,920 and how you make one. And why you might make one of these. You'll find it feels a lot 2209 03:27:50,920 --> 03:27:55,910 like making a frequency table. But why do you make these instead of a frequency table? 2210 03:27:55,910 --> 03:28:03,120 And it's just more food for thought. So first, one of the things that I got hung up on when 2211 03:28:03,120 --> 03:28:09,280 I took biostatistics is I could not get over the fact that it was called a stem and leaf. 2212 03:28:09,280 --> 03:28:14,720 So I had to understand that. So this is an example of a stem and leaf there. So why is 2213 03:28:14,720 --> 03:28:19,960 it called a stem and leaf? Well, there's always the stem. So see these corn stalks, 2214 03:28:19,960 --> 03:28:25,240 I'm from Minnesota, I'm used to seeing them. You'll notice that there's a stem, right, 2215 03:28:25,240 --> 03:28:30,211 like this big corn stalk has the stem. That thing you see, that vertical line and a bunch 2216 03:28:30,211 --> 03:28:36,181 of numbers on the left, that part of the stem and leaf plot is called the stem. And then 2217 03:28:36,181 --> 03:28:40,940 leaves are added onto the stem as we tally up the length of the leaves. And that may 2218 03:28:40,940 --> 03:28:45,061 not make much sense right now, but I'll show you how to make one.
But essentially, what 2219 03:28:45,061 --> 03:28:50,851 you end up doing is adding these leaves, like you see under two, there's a little leaf that 2220 03:28:50,851 --> 03:28:55,660 just has a zero on it. But if you see under five, there's this big long leaf with a whole 2221 03:28:55,660 --> 03:29:02,090 bunch of numbers off of it. So making one will help you understand this terminology. 2222 03:29:02,090 --> 03:29:07,311 But I first wanted to just show you this picture because it's actually kind of hard to understand 2223 03:29:07,311 --> 03:29:12,090 what's going on with a stem and leaf unless you understand that that vertical line and the 2224 03:29:12,090 --> 03:29:16,331 numbers to the left of it is considered a stem. And then each one of these things we 2225 03:29:16,331 --> 03:29:21,271 build off of each of those numbers is called a leaf. So people 2226 03:29:21,271 --> 03:29:29,910 talk about the four leaf and the five leaf. Alrighty. Okay, so again, I'm just so into 2227 03:29:29,910 --> 03:29:36,811 making up data, right? So I decided to make up data from 42 patients who visited a primary 2228 03:29:36,811 --> 03:29:41,800 care clinic and were referred to mental health. Now the reason why I made up data on this subject 2229 03:29:41,800 --> 03:29:45,891 is I'm very upset about this subject. I think people are waiting too long to get mental 2230 03:29:45,891 --> 03:29:52,180 health treatment. Especially if you've been following the news about the Veterans Administration 2231 03:29:52,180 --> 03:29:57,061 in the US. A lot of people are put on hold even for primary care. You know, they're put 2232 03:29:57,061 --> 03:30:01,601 on waiting lists and I don't like that. So I made fake data about that as a demonstration just 2233 03:30:01,601 --> 03:30:07,660 to highlight these issues. Okay, so what data did I make up? I made up the number 2234 03:30:07,660 --> 03:30:13,950 of days between the referral and their first mental health appointment.
That was what was 2235 03:30:13,950 --> 03:30:19,800 collected. So let's say you go in on January 1, and you get a referral. And then 10 days 2236 03:30:19,800 --> 03:30:24,120 later, you actually show up at the clinic, then that would be 10. Right? That would be 2237 03:30:24,120 --> 03:30:30,400 your value. So that's quantitative. So let's take a look at it. So on the right side of 2238 03:30:30,400 --> 03:30:37,071 the slide, you see just this pile of numbers from all these people that came in and 2239 03:30:37,071 --> 03:30:40,490 then got a referral. So like, you look at the first person, who had to wait a 2240 03:30:40,490 --> 03:30:41,490 month 2241 03:30:41,490 --> 03:30:47,440 to go see a mental health professional. But if you look, you know, at the third one, that 2242 03:30:47,440 --> 03:30:52,390 person only needed 12 days. So that's how you sort of consume this fake data I made. 2243 03:30:52,390 --> 03:30:57,050 And then you'll see over on the left side, I already made a stem. It's blank, it 2244 03:30:57,050 --> 03:31:00,390 doesn't have any numbers on it, but I knew I'd need that vertical line. So I just made 2245 03:31:00,390 --> 03:31:04,480 that in preparation. Okay, so let's build our stem and leaf. 2246 03:31:04,480 --> 03:31:08,830 So what we do is we start with the first number, and that's what's awesome about this, is you 2247 03:31:08,830 --> 03:31:12,521 just start with the first number. And if you want, you can kind of cross them out as you 2248 03:31:12,521 --> 03:31:13,521 go along 2249 03:31:13,521 --> 03:31:17,580 to keep track. So we start with this first number. And you'll see what I did, I went 2250 03:31:17,580 --> 03:31:22,900 over to the stem, and I put the three on the left side of the stem and the zero on the 2251 03:31:22,900 --> 03:31:25,710 right. This begins the three leaf, 2252 03:31:25,710 --> 03:31:26,710 okay. 2253 03:31:26,710 --> 03:31:33,670 Here's the next number.
Now, I put the two above the three because it's like right before 2254 03:31:33,670 --> 03:31:38,200 it and you can kind of imagine we're gonna walk down like 2, 3, 4, 5, 6. And then I put the seven 2255 03:31:38,200 --> 03:31:45,220 on the right side to start the two leaf. Alrighty, here we are with the next number, 2256 03:31:45,220 --> 03:31:50,420 which is 12. And as you'll see, I started the one leaf. You're starting to see the pattern, 2257 03:31:50,420 --> 03:31:54,730 right? And you can probably guess what's going to happen next, we start the four leaf and 2258 03:31:54,730 --> 03:32:01,462 put the two there. Okay, our next leaf we've already started, right, for 35. So what do 2259 03:32:01,462 --> 03:32:07,980 we do there? Well, we just add the five on to the three leaf. The three leaf was already 2260 03:32:07,980 --> 03:32:15,660 started with that 30 at the beginning, so we just pile a five on there. Here's 47, 2261 03:32:15,660 --> 03:32:21,510 we just pile a seven on there. Now you'll notice I tried to line up that seven on the 2262 03:32:21,510 --> 03:32:25,440 four leaf with the five on the three leaf. When you're doing this by hand, well, even 2263 03:32:25,440 --> 03:32:29,811 when you're not doing it by hand, you really have to keep those things lined up or 2264 03:32:29,811 --> 03:32:34,690 you won't have a good stem and leaf. Okay. Now I'm going to just fast forward 2265 03:32:34,690 --> 03:32:40,061 a little because you can probably imagine how to do the next row, the 38, 36. You just 2266 03:32:40,061 --> 03:32:44,811 keep piling it on. But I want to show you what happens when you get to the special case 2267 03:32:44,811 --> 03:32:51,140 here. Okay, well, we'll go with this 29. This is the last thing before the special case.
2268 03:32:51,140 --> 03:32:57,260 So you'll notice that 38 got put in there, see that eight on the three leaf. That 36 got put 2269 03:32:57,260 --> 03:33:01,150 in there, you know, from the second row. See, we put everything in there. And now we put 2270 03:33:01,150 --> 03:33:06,840 in the 29. Look at that, we got a three after that. That's our next one. So where are we 2271 03:33:06,840 --> 03:33:11,200 gonna put that three? You know, you might think on the three leaf, but that's not right, 2272 03:33:11,200 --> 03:33:16,690 right? Because that's 30 something. So where do you put the three? Well, some of you figured 2273 03:33:16,690 --> 03:33:23,500 this out, you have to add a zero onto your stem. So look at that, I put that zero there 2274 03:33:23,500 --> 03:33:28,200 and then we put the three in. And then you can already guess how to do the 21 next, 2275 03:33:28,200 --> 03:33:33,730 we'll just tack a one on to the two leaf. But then when we get to the next zero, we just 2276 03:33:33,730 --> 03:33:41,291 add a zero on to the zero leaf. 2277 03:33:41,291 --> 03:33:47,010 So you can probably figure out how to pile up all of these. But I did want to talk to 2278 03:33:47,010 --> 03:33:53,090 you about something else that happens with these stem and leaf plots. As you go on adding to the 2279 03:33:53,090 --> 03:33:58,240 leaf, you got to be careful because you might end up with a situation where you got something 2280 03:33:58,240 --> 03:34:03,671 big. Now I really feel sorry for this fake person. 51 days for a mental health appointment, 2281 03:34:03,671 --> 03:34:09,521 that's too long, right? But it causes us later to have to add a five.
2282 03:34:09,521 --> 03:34:14,010 Now this can cause real estate problems, especially on a piece of paper. You know, what if 2283 03:34:14,010 --> 03:34:18,080 the four was right at the bottom of the paper, right? It's kind of hard, maybe you have to 2284 03:34:18,080 --> 03:34:24,540 tape some paper at the bottom. I have this problem a lot. Um, you'll see here, I 2285 03:34:24,540 --> 03:34:30,280 even had to move this up on the slide when we got later to the 70, I had to add the seven leaf. 2286 03:34:30,280 --> 03:34:35,340 Now I just want to show you, for some reason in this data we didn't have any 60s. But you 2287 03:34:35,340 --> 03:34:42,290 still have to put that six leaf placeholder in, that's got to be there. So even, you 2288 03:34:42,290 --> 03:34:46,790 know, as we go on, if we're missing any leaves in between, we just need the placeholder there 2289 03:34:46,790 --> 03:34:52,880 because that space has to be there. And here's an outlier. We're gonna learn about 2290 03:34:52,880 --> 03:34:58,610 outliers pretty soon. This is a really long time. 105 days, this is kind of like VA status, 2291 03:34:58,610 --> 03:35:04,819 right? And of course, this is fake data, but unfortunately it reflects 2292 03:35:04,819 --> 03:35:10,950 real data. You'll see when we get to 105, not only did we skip the eight leaf and the 2293 03:35:10,950 --> 03:35:17,710 nine leaf, and we need to leave a space for them, but 10 becomes the part of the stem. 2294 03:35:17,710 --> 03:35:18,710 So 2295 03:35:18,710 --> 03:35:23,000 if we went on to 200, or 300, I mean, that would be awful, to wait that long, the 2296 03:35:23,000 --> 03:35:31,521 first two digits of it, like if we had 365, the 36 of the 365 would be the part of 2297 03:35:31,521 --> 03:35:38,190 the stem. Alright, so I just did a little demonstration to explain certain nuances of 2298 03:35:38,190 --> 03:35:45,300 the stem and leaf that you might encounter in your life.
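The demonstration above can be mimicked in a short Python sketch. It uses only the wait times that were said out loud during the walk-through (not the full list of 42 on the slide), and the function name `stem_and_leaf` is just an illustrative choice. The last digit is the leaf, everything in front of it is the stem, and empty stems are kept as placeholders.

```python
def stem_and_leaf(values):
    """Build an unordered stem and leaf display for whole numbers.

    Returns a list of (stem, leaves) rows, including empty placeholder
    rows for any stem between the smallest and largest one.
    """
    leaves_by_stem = {}
    for value in values:
        # The last digit is the leaf; all digits before it form the stem.
        stem, leaf = divmod(value, 10)
        leaves_by_stem.setdefault(stem, []).append(leaf)

    rows = []
    for stem in range(min(leaves_by_stem), max(leaves_by_stem) + 1):
        # Missing stems (like the 60s) still get a row, just with no leaves.
        rows.append((stem, leaves_by_stem.get(stem, [])))
    return rows


# Only the values named aloud in the demonstration, in the order they came up.
waits = [30, 27, 12, 42, 35, 47, 38, 36, 29, 3, 21, 0, 51, 70, 105]

rows = stem_and_leaf(waits)
for stem, leaves in rows:
    print(f"{stem:>2} | {' '.join(str(leaf) for leaf in leaves)}")
```

With these values the 6, 8, and 9 rows print with nothing after the bar, and 105 lands on a stem of 10 with a leaf of 5, which matches the special cases pointed out in the demonstration.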
So now, I'm going to just reflect 2299 03:35:45,300 --> 03:35:50,530 back on the two ways that I've described in this lecture for you to organize quantitative 2300 03:35:50,530 --> 03:35:57,021 data. First, I showed you how to make a frequency table. But what you need to do with that one 2301 03:35:57,021 --> 03:36:02,730 is you need to set up classes and class width and count the frequencies in there. 2302 03:36:02,730 --> 03:36:06,930 There's a lot of pre-processing, a lot of pre-calculations. You really want to 2303 03:36:06,930 --> 03:36:11,090 think when you're doing this, and you don't want to be distracted. However, if you're 2304 03:36:11,090 --> 03:36:16,470 trying to do a stem and leaf, you really can do that on the fly. You don't need to set 2305 03:36:16,470 --> 03:36:22,521 up classes or class width. As you noticed, we just went through that pile 2306 03:36:22,521 --> 03:36:28,050 of numbers, and just crossed them off as we put them onto the stem and leaf. And there 2307 03:36:28,050 --> 03:36:34,480 was really no need to count, you can tally the data as you go through the list, you know, 2308 03:36:34,480 --> 03:36:40,630 cross it off. And it's just really quicker to do. Of course, those of you who are pretty 2309 03:36:40,630 --> 03:36:46,010 clever are saying, well, basically you're forcing everything in a stem and leaf to be in the 2310 03:36:46,010 --> 03:36:52,431 class of, you know, the 10s, right, you know, the 20s and the 30s and the 40s. That's like 2311 03:36:52,431 --> 03:36:57,140 the two leaf, the three leaf and the four leaf. And yeah, it is kind of like a simplified 2312 03:36:57,140 --> 03:37:02,440 way of making those kinds of classes. But in any case, I just wanted to alert you to 2313 03:37:02,440 --> 03:37:08,840 this because you might see some similarities between the two. And I wanted to highlight 2314 03:37:08,840 --> 03:37:16,261 those as well as the differences.
Now I'm going to give you a few tricks here. I want 2315 03:37:16,261 --> 03:37:21,711 to tell you about the concept of an unordered leaf. So an unordered leaf is what we were 2316 03:37:21,711 --> 03:37:27,271 making before when I was demonstrating. It's just where the numbers are out of order in 2317 03:37:27,271 --> 03:37:31,290 the leaf. Like you'll see this two leaf, it's seven, seven, two, nine. Well, if they were 2318 03:37:31,290 --> 03:37:37,110 in order it would say 2, 7, 7, 9, right, like the two would come first before the sevens and the 2319 03:37:37,110 --> 03:37:41,030 nine. And the same with the three leaf, that's out of order, because you can see that 2320 03:37:41,030 --> 03:37:44,240 zero and five is fine, but eight doesn't come before six 2321 03:37:44,240 --> 03:37:46,470 and five, right? There's no 2322 03:37:46,470 --> 03:37:53,410 problem making an unordered leaf. However, after making an unordered version, you can 2323 03:37:53,410 --> 03:37:58,400 rewrite the stem and leaf in an ordered way. So you see how I did that, I rewrote the two 2324 03:37:58,400 --> 03:38:03,460 leaf and the three leaf. And now all the leaves are in order. Okay, they don't 2325 03:38:03,460 --> 03:38:08,730 have to be, but you can do that. And if you do that, if you make your stem and leaf first 2326 03:38:08,730 --> 03:38:14,590 unordered the way I was demonstrating, then you rewrite it into ordered, it is way easier 2327 03:38:14,590 --> 03:38:20,960 to count it up to make a frequency table no matter what classes you choose. Or you can 2328 03:38:20,960 --> 03:38:25,670 just make each leaf a class. And then it's super easy to make the frequency table. So 2329 03:38:25,670 --> 03:38:31,061 that's why I combined these two pieces of the chapter together, because I wanted to 2330 03:38:31,061 --> 03:38:38,891 show you how you can use a stem and leaf to help you make a frequency table.
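As a small illustration of that trick, the Python sketch below takes the two unordered leaves read out above (the two leaf as 7, 7, 2, 9 and the three leaf as 0, 5, 8, 6, 5), puts each in order, and then treats each leaf as its own class to get frequencies. The variable names and the "20 to 29" style labels are assumptions made for the example, not something shown on the slide.

```python
# Unordered leaves as read out in the lecture: stem -> leaves in arrival order.
unordered = {
    2: [7, 7, 2, 9],
    3: [0, 5, 8, 6, 5],
}

# Rewrite each leaf in order, smallest digit first.
ordered = {stem: sorted(leaves) for stem, leaves in unordered.items()}

# Treat each leaf as one class (the 20s, the 30s, ...) and count its values.
frequency_table = {
    f"{stem * 10} to {stem * 10 + 9}": len(leaves)
    for stem, leaves in ordered.items()
}

for stem, leaves in ordered.items():
    print(f"{stem} | {' '.join(str(leaf) for leaf in leaves)}")
for label, count in frequency_table.items():
    print(f"{label}: {count}")
```

Because every value sits on exactly one leaf, counting leaves this way also guarantees that each data point is counted once and only once.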
So a stem 2331 03:38:38,891 --> 03:38:44,021 and leaf, it's just another way to organize quantitative data. And it's easier to make kind of on the 2332 03:38:44,021 --> 03:38:50,050 fly than a frequency table because it requires less preparation. And it can help you put 2333 03:38:50,050 --> 03:38:57,000 data in order in preparation for a frequency table, so it helps you as a 2334 03:38:57,000 --> 03:39:03,300 first step to make sure that you can organize everything. And at the end, remember, I keep 2335 03:39:03,300 --> 03:39:07,100 emphasizing your frequency table has to reflect all your data points. And they can only be 2336 03:39:07,100 --> 03:39:12,070 in one class, blah, blah. Well, one way to make sure that happens is to first 2337 03:39:12,070 --> 03:39:20,440 do this pre-organization using an ordered stem and leaf. So in conclusion, frequency 2338 03:39:20,440 --> 03:39:26,680 tables and stem and leaf displays organize data, they organize quantitative data. And 2339 03:39:26,680 --> 03:39:30,850 the stem and leaf may help you make a frequency table. So you might want to start with that. 2340 03:39:30,850 --> 03:39:36,931 And the purpose of both of these things is to reveal a thing called a distribution. And 2341 03:39:36,931 --> 03:39:43,271 I'm going to explain that in the next lecture. Hello, it's Monica Wahi again, your lecturer 2342 03:39:43,271 --> 03:39:48,730 from library college and we are moving on to chapter 3.1 which is measures of central 2343 03:39:48,730 --> 03:39:54,511 tendency. And here are your learning objectives. So at the end of this lecture, you should 2344 03:39:54,511 --> 03:39:59,440 be able to explain how to calculate the mean.
You should also be able to describe what a 2345 03:39:59,440 --> 03:40:04,891 mode is and say how many modes a dataset can have. You should be able to demonstrate how 2346 03:40:04,891 --> 03:40:09,380 to find the median in a set of data with an odd number of values, as well as in a set 2347 03:40:09,380 --> 03:40:14,220 of data with an even number of values. And you should also be able to define trimmed mean 2348 03:40:14,220 --> 03:40:19,900 and weighted average. All right, so what are these measures of central tendency? I'm going 2349 03:40:19,900 --> 03:40:24,580 to explain why we kind of call it that. And then I'm going to talk about them. 2350 03:40:24,580 --> 03:40:28,581 The three biggies are mode, median, and mean. So I'm going to talk about those and explain 2351 03:40:28,581 --> 03:40:33,910 how to get those. Then, towards the end of the lecture, I'm going to go into some special 2352 03:40:33,910 --> 03:40:39,760 situations. One is called the trimmed mean. And the second is a weighted average. So let's 2353 03:40:39,760 --> 03:40:44,851 get started. What is the central tendency thing? Well, if you think about quantitative 2354 03:40:44,851 --> 03:40:49,040 data, and you can only do this with quantitative data, not qualitative data. But 2355 03:40:49,040 --> 03:40:52,790 when you think of having a pile of numbers like this, one of the things you want to know 2356 03:40:52,790 --> 03:40:57,511 is how much they tend towards the center. Now, of course, you don't know where the center 2357 03:40:57,511 --> 03:41:02,430 is until you start looking at the data. Some data are kind of high up in the hundreds, 2358 03:41:02,430 --> 03:41:07,720 like systolic blood pressure. I give a five-point quiz in one of my classes, so those 2359 03:41:07,720 --> 03:41:13,131 numbers are low, like 1, 2, 3, 4, 5. But then the question becomes, do they group towards the 2360 03:41:13,131 --> 03:41:19,360 center of whatever list of data they're in? Or don't they?
How sort of central 2361 03:41:19,360 --> 03:41:20,790 are they? 2362 03:41:20,790 --> 03:41:26,250 You see these distributions on the slide? You'll see, on the left, you'd probably say, 2363 03:41:26,250 --> 03:41:30,262 well, that looks more central than what's on the right, you know, this normal distribution 2364 03:41:30,262 --> 03:41:35,432 on the left, and the skewed right distribution on the right. And so intuitively, you kind 2365 03:41:35,432 --> 03:41:39,561 of know what I'm talking about. But what this lecture is going to be about is how to actually 2366 03:41:39,561 --> 03:41:44,881 put numbers on the difference between what you see on the left and what you see on the 2367 03:41:44,881 --> 03:41:49,220 right. So these are the numbers, these are the measures of central tendency, we're going 2368 03:41:49,220 --> 03:41:55,570 to go over mode, median, and mean. And the median is a little different, depending on 2369 03:41:55,570 --> 03:41:59,180 whether you have an odd number of values or an even number of values. I mean, it means 2370 03:41:59,180 --> 03:42:03,940 the same thing, but you calculate it slightly differently. So I'll go over that. And then 2371 03:42:03,940 --> 03:42:08,440 the mean, a lot of you already know what a mean is, but there's a couple special means 2372 03:42:08,440 --> 03:42:13,311 we can make. One is called a trimmed mean, and another is called weighted average, which 2373 03:42:13,311 --> 03:42:17,410 is a weighted mean. I don't know why they chose the word average for that one, because 2374 03:42:17,410 --> 03:42:23,290 mean and average mean the same thing. But I'm going to go over these things. Okay, well, 2375 03:42:23,290 --> 03:42:27,890 let's start with the mode. The mode is the number in the data set that occurs the most 2376 03:42:27,890 --> 03:42:34,102 frequently. So I put up this little tiny data set here of just five numbers.
And it's obvious 2377 03:42:34,102 --> 03:42:36,120 that five is the mode, right, because it 2378 03:42:36,120 --> 03:42:41,990 repeats once, there are two fives there. But look, I just changed one of them, I changed it to 2379 03:42:41,990 --> 03:42:44,521 a six. And now there's no mode. 2380 03:42:44,521 --> 03:42:48,920 So I just want you to know that a lot of data sets don't even have a mode, there's just 2381 03:42:48,920 --> 03:42:55,271 no repeat at all in them. And that usually happens when you have a broad range of numbers 2382 03:42:55,271 --> 03:42:59,300 they can have, like systolic blood pressure. I mean, it would be kind of lucky if you just 2383 03:42:59,300 --> 03:43:05,380 got two people with the exact same one. But that can happen. So don't think there's always 2384 03:43:05,380 --> 03:43:11,061 going to be a mode, there might not be one. It's also possible to have more than one mode, 2385 03:43:11,061 --> 03:43:15,730 like look at that. So I've got six numbers up there. And the two repeats once and the 2386 03:43:15,730 --> 03:43:22,261 three repeats once. So you've got two modes, right? But let's say that the three actually 2387 03:43:22,261 --> 03:43:27,350 repeated three times, then it would only be one mode, because the three threes would trump 2388 03:43:27,350 --> 03:43:28,521 the two twos, 2389 03:43:28,521 --> 03:43:35,540 right? So you can just imagine how confusing this gets when you got a ton of numbers. What's 2390 03:43:35,540 --> 03:43:40,272 a little less confusing is, um, if you, like I said, have a broad range of numbers, it would 2391 03:43:40,272 --> 03:43:44,930 be kind of a coincidence if two patients had the exact same systolic blood pressure 2392 03:43:44,930 --> 03:43:48,390 or platelet count, you know, like you get a repeat in there. And then that would be 2393 03:43:48,390 --> 03:43:52,751 the mode.
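The three situations just described (one mode, no mode, two modes) can be checked with a small Python sketch. The exact numbers on the slides are not given in the transcript, so the little data sets below are hypothetical stand-ins that just follow the same pattern, and `modes` is an illustrative function name.

```python
from collections import Counter


def modes(values):
    """Return every value tied for the most repeats, or [] if nothing repeats."""
    if not values:
        return []
    counts = Counter(values)
    top = max(counts.values())
    if top == 1:
        # No value occurs more than once, so there is no mode.
        return []
    return sorted(value for value, count in counts.items() if count == top)


# Hypothetical data sets that follow the patterns described in the lecture.
print(modes([2, 5, 5, 7, 9]))     # two fives
print(modes([2, 5, 6, 7, 9]))     # one five changed to a six
print(modes([1, 2, 2, 3, 3, 4]))  # two twos and two threes
print(modes([1, 2, 2, 3, 3, 3]))  # three threes beat two twos
```

Notice how changing a single value flips the answer from one mode to none, which is the point being made about how easy the mode is to disturb.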
Of course, if you measure a whole bunch of people, then eventually you're probably 2394 03:43:52,751 --> 03:43:57,601 going to get one. But I just wanted to say, also, if you look at the slide, with all those 2395 03:43:57,601 --> 03:44:01,500 numbers, you'd really have to go through and organize them and count them up and see if 2396 03:44:01,500 --> 03:44:05,830 there is a mode. There probably is one because we see a lot of repeats. But then which is 2397 03:44:05,830 --> 03:44:10,061 the one that wins, that's repeated the most? Or are there two that are repeated the most? 2398 03:44:10,061 --> 03:44:17,450 It becomes kind of political when you really do it. And it's not worth a lot of work, because 2399 03:44:17,450 --> 03:44:23,010 what does the mode tell you? It doesn't really tell you much. It does tell you the most popular 2400 03:44:23,010 --> 03:44:29,240 answer. The word mode in French means fashion. So like I put on the slide, you know, à la 2401 03:44:29,240 --> 03:44:33,820 mode, it's in fashion. So it's the one that's most popular or the most common result, but 2402 03:44:33,820 --> 03:44:40,101 it's not used a lot in healthcare. And it's actually not used very often. Once in a while, 2403 03:44:40,101 --> 03:44:45,561 I'll say, oh, the mode in the class for my five-point quiz was five, meaning everybody 2404 03:44:45,561 --> 03:44:50,521 did pretty well, they mostly got a five. That was the most popular result. But you hardly 2405 03:44:50,521 --> 03:44:52,980 ever have to say that. And so 2406 03:44:52,980 --> 03:44:54,850 remember, we learned 2407 03:44:54,850 --> 03:45:00,180 the word resistant, like if a measure is resistant, you can't whack it out very easily.
2408 03:45:00,180 --> 03:45:04,030 Well, you can change things pretty easily with the mode, the mode's not resistant. I 2409 03:45:04,030 --> 03:45:08,561 even just demonstrated that on those slides, by just changing one number, you can erase 2410 03:45:08,561 --> 03:45:14,021 the mode or add a mode or whatever. And so it's not stable, it's not resistant. And those 2411 03:45:14,021 --> 03:45:18,190 are the kinds of things we don't really like in healthcare, so we don't really use them. 2412 03:45:18,190 --> 03:45:23,561 So I'll move on to some cooler measures of central tendency. 2413 03:45:23,561 --> 03:45:28,690 And here's a really cool one, which is called the median. And it's the middle of the data. 2414 03:45:28,690 --> 03:45:35,171 And I'll explain a little bit more what we mean by the center of the data. Okay, so 2415 03:45:35,171 --> 03:45:39,811 remember, we're talking about quantitative data. So you've got some pile of numbers, 2416 03:45:39,811 --> 03:45:43,870 it doesn't matter, you can always sort them in order of lowest to highest. And I keep 2417 03:45:43,870 --> 03:45:47,290 talking about this five-point quiz I give in my class. It's an easy quiz. And most 2418 03:45:47,290 --> 03:45:51,930 people get fives. But even so somebody gets a four usually, or somebody doesn't show up 2419 03:45:51,930 --> 03:45:55,471 for the quiz, and they get a zero. And so it doesn't matter, I can have 100 people in 2420 03:45:55,471 --> 03:45:59,830 the class, I still could put all of those numbers in order of lowest to highest, even 2421 03:45:59,830 --> 03:46:05,010 if most of them were fives. Because you'll get repeats in your data sometimes, right. 2422 03:46:05,010 --> 03:46:08,420 And also, sometimes you'll get outliers. Like I said, one person maybe didn't take the 2423 03:46:08,420 --> 03:46:13,771 quiz and they get a zero.
But everybody gets else gets four and five is an easy quiz, well, 2424 03:46:13,771 --> 03:46:18,870 then that zero would be an outlier. So you don't have to worry about that. And like I 2425 03:46:18,870 --> 03:46:21,990 said, you know, the data values sometimes are almost the same, like almost everybody 2426 03:46:21,990 --> 03:46:25,900 gets a five on my quiz, because it's so easy. So it doesn't matter. Even if you have these 2427 03:46:25,900 --> 03:46:30,581 weirdnesses in your data, you can still just arrange them in order. And that's what we 2428 03:46:30,581 --> 03:46:36,001 mean by the median is the number that is halfway up, or halfway down, right. So if I've got 2429 03:46:36,001 --> 03:46:40,750 100 people in my class, and I've got the zero over here on the left, and I put all the, 2430 03:46:40,750 --> 03:46:47,230 you know, fours, and then the fives, you know, I have to count up what 50, right to see where 2431 03:46:47,230 --> 03:46:51,740 the middle is. And it's probably going to be in the five range, right. But that's all 2432 03:46:51,740 --> 03:46:58,221 we mean, we say, you'll take however many values you have, put them in order, even if 2433 03:46:58,221 --> 03:47:02,230 there's repeats and outliers or whatever, just put them in order, and then count up 2434 03:47:02,230 --> 03:47:07,460 halfway. And that's where the median is going to be. So I'll demonstrate this here. So how 2435 03:47:07,460 --> 03:47:11,811 to find the median, the first step is to order the data from the smallest to largest. So 2436 03:47:11,811 --> 03:47:15,690 I'm giving you two demonstrations. And I don't even know what these data mean, I just totally 2437 03:47:15,690 --> 03:47:20,830 made them up. The one at the top, the data set the top that starts with 42, that only 2438 03:47:20,830 --> 03:47:25,000 has five numbers in it. So I'm going to demonstrate the odd version with them. 
2439 03:47:25,000 --> 03:47:26,000 The one 2440 03:47:26,000 --> 03:47:30,040 set at the bottom has actually six numbers in it. So I'm going to demonstrate the even 2441 03:47:30,040 --> 03:47:33,170 version, because remember, it goes a little differently, whether you have an odd number 2442 03:47:33,170 --> 03:47:39,240 of numbers or an even number of numbers. Okay, so those are the numbers. And we still have 2443 03:47:39,240 --> 03:47:43,101 to do the first step, which is order the data from smallest to largest, because you can 2444 03:47:43,101 --> 03:47:48,230 see they're not in order. So I'm going to do that here. Okay, there it is. So those 2445 03:47:48,230 --> 03:47:54,131 are the same numbers, they're just in order from smallest to largest, okay. So we're going 2446 03:47:54,131 --> 03:47:59,180 to get rid of those numbers on the top, and instead put the position they're in. So let's 2447 03:47:59,180 --> 03:48:05,480 look at the top data set, which is the odd one. So I'm going to say this is how you find 2448 03:48:05,480 --> 03:48:07,800 the median is you 2449 03:48:07,800 --> 03:48:08,800 number the positions, 2450 03:48:08,800 --> 03:48:14,021 you know, it's 12345. And it's the middle position. So you can imagine, if we had had 2451 03:48:14,021 --> 03:48:20,680 seven data points, we'd go out 1234. And we'd circle that one, and that would be the median. 2452 03:48:20,680 --> 03:48:25,021 So that's what you have to do is you take these, if you have odd values, you just put 2453 03:48:25,021 --> 03:48:29,771 them in order, and see I numbered them for you. And then you take the middle number, 2454 03:48:29,771 --> 03:48:34,830 and that's the median. That's what it is. It's 42 in this one. Okay, we'll do the downstairs 2455 03:48:34,830 --> 03:48:40,850 data set there that has six, as you can see, the positions are numbered. 
And then what 2456 03:48:40,850 --> 03:48:46,980 do you do, you go to the third and fourth position, which is the kind of the middle 2457 03:48:46,980 --> 03:48:53,061 right, and you literally make an average of them, you add the two, and they happen to 2458 03:48:53,061 --> 03:48:57,260 be seven and eight right next to each other. But if they had been like eight and 10, then 2459 03:48:57,260 --> 03:49:00,370 the average would have been nine, and that would have been the median. But because this 2460 03:49:00,370 --> 03:49:05,980 is seven and eight, you do seven plus eight, divided by two, and it's 7.5. So when you 2461 03:49:05,980 --> 03:49:10,130 do the median with an odd number of values, you're going to be taking one of the values 2462 03:49:10,130 --> 03:49:16,290 in there. If you're doing the median, on an even number of values, you might get something 2463 03:49:16,290 --> 03:49:21,610 with like a decimal, because you're looking for the two values that straddle the middle, 2464 03:49:21,610 --> 03:49:24,650 and you're going to be making an average of them. And so you might get kind of a wacky 2465 03:49:24,650 --> 03:49:34,040 number like 7.5 that's not in the underlying data set. So um, this is fine for like, if 2466 03:49:34,040 --> 03:49:38,930 you have five or six numbers or seven. What What if you have like 150 numbers, I mean, 2467 03:49:38,930 --> 03:49:43,580 you do still have to put them all in order to begin with, you know, like I use Excel, 2468 03:49:43,580 --> 03:49:50,410 I probably just soared. But you have to know how many numbers to go up. It's not obvious. 2469 03:49:50,410 --> 03:49:52,200 So this is how you find the middle number. They 2470 03:49:52,200 --> 03:49:53,720 have a little 2471 03:49:53,720 --> 03:49:58,980 formula for it. So let's say we have an odd number of values. And I'm giving you the example 2472 03:49:58,980 --> 03:50:04,080 like 21 love Let's say at 21 students in my class, and that's how many values I have. 
2473 03:50:04,080 --> 03:50:09,230 And I wanted to make a median of their grades, what I would do is put them all in order. 2474 03:50:09,230 --> 03:50:13,730 And I'd say, well, I have to go up so many, and that's the median. But I don't know how 2475 03:50:13,730 --> 03:50:21,150 many to go up. So I would use this calculation. So I take the n, which in our case is 21. 2476 03:50:21,150 --> 03:50:27,390 And I'd add one to it. And then we get 22. And then I divide by two, and I get 11. So that's just 2477 03:50:27,390 --> 03:50:33,561 how it works. So if you had 41, you would do 41 plus one, it would be 42, divided by 2478 03:50:33,561 --> 03:50:39,510 two, which is 21. Or if you had, like, I don't know why I'm picking on ones, like 27, you do 27 plus 2479 03:50:39,510 --> 03:50:45,851 one, and that would be 28. And 28 divided by two is 14. And so you see, with an odd n it would just 2480 03:50:45,851 --> 03:50:50,561 force it to be a whole number that you come out with. And then that's the position you 2481 03:50:50,561 --> 03:50:55,030 go up to. So if I had 21 students in my class, and I took the grades and arranged them 2482 03:50:55,030 --> 03:51:00,631 in order from lowest to highest, like if they were those quiz grades, you know, most 2483 03:51:00,631 --> 03:51:03,490 of them would probably be four and five, but it wouldn't matter, what I would do is just 2484 03:51:03,490 --> 03:51:08,410 start with the lowest and count up to the 11th position, and then that would be 2485 03:51:08,410 --> 03:51:14,101 my median. Now, you also have to do that, you have to find the middle number, even if 2486 03:51:14,101 --> 03:51:20,200 you have an even number of values. So I took an example of 14. Now you'll notice we use the 2487 03:51:20,200 --> 03:51:26,590 same formula. But if you do use this formula, you get 7.5. And that's not 2488 03:51:26,590 --> 03:51:31,600 the median. That's just how many positions you have to go up. Right.
And so remember, 2489 03:51:31,600 --> 03:51:37,200 on the earlier slide, we had, we had to go between the third and fourth position, we 2490 03:51:37,200 --> 03:51:42,470 had to average those two numbers. Well, this is basically saying, if you get 7.5, you have 2491 03:51:42,470 --> 03:51:46,440 to go to the seventh and the eighth, the one that straddles it, and those are the two that 2492 03:51:46,440 --> 03:51:53,561 you average. So if my n like 100 is a nice, even number. So if you have 100 plus one and 2493 03:51:53,561 --> 03:52:00,190 you get 101, then you've got, you know, 50.5, right, and that just is a secret message that 2494 03:52:00,190 --> 03:52:06,000 when you line up all your data, you take the 50th, one in the row and the 51st, one in 2495 03:52:06,000 --> 03:52:10,210 the row, add them together, divide by two, and that's going to be your median. So I just 2496 03:52:10,210 --> 03:52:14,260 wanted to share with you this little formula, just in case, you get like a large number 2497 03:52:14,260 --> 03:52:19,030 of numbers thrown at you and putting them in order is a big pain. And then you have 2498 03:52:19,030 --> 03:52:24,400 to figure out how many to count up, you can use this formula to get the middle number. 2499 03:52:24,400 --> 03:52:28,040 So what does a median tell you, we have a lot more to talk about here. First of all, 2500 03:52:28,040 --> 03:52:33,601 it's called the 50th percentile of the data, what it means is 50%, or half of the data 2501 03:52:33,601 --> 03:52:37,801 points are below the median, and the other half are above. And that intuitively makes 2502 03:52:37,801 --> 03:52:41,890 sense because you just created we created this median together. And we could see that 2503 03:52:41,890 --> 03:52:46,811 half of the points are in the bottom half on the top. And so it's also known as a middle 2504 03:52:46,811 --> 03:52:51,230 rank of the data. 
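The position rule described above can be sketched in Python. This is only an illustration: the function names and the two small data sets are made up here (the lecture's slide values are not fully given in the transcript), chosen so that the odd set has a median of 42 and the even set has 7 and 8 straddling the middle.

```python
def median_position(n):
    """1-based position of the median among n sorted values: (n + 1) / 2."""
    return (n + 1) / 2


def median(values):
    """Sort the values, then take the middle one (odd n) or average the
    two values that straddle the middle (even n)."""
    ordered = sorted(values)
    n = len(ordered)
    if n == 0:
        raise ValueError("median of an empty data set is undefined")
    if n % 2 == 1:
        # Odd n: (n + 1) / 2 is a whole number; subtract 1 for 0-based indexing.
        return ordered[(n + 1) // 2 - 1]
    # Even n: (n + 1) / 2 ends in .5, so average positions n/2 and n/2 + 1.
    return (ordered[n // 2 - 1] + ordered[n // 2]) / 2


# Hypothetical data, not the slide's actual numbers.
odd_data = [42, 33, 21, 78, 62]      # five values
even_data = [8, 2, 10, 7, 5, 8]      # six values

print(median_position(21))   # 11.0 -> count up to the 11th value
print(median_position(14))   # 7.5  -> average the 7th and 8th values
print(median_position(100))  # 50.5 -> average the 50th and 51st values
print(median(odd_data))      # 42
print(median(even_data))     # 7.5
```

Note that the position 7.5 for 14 values is not the median itself, only a signal to average the 7th and 8th sorted values.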
And what's nice about the median is it doesn't really care much about 2505 03:52:51,230 --> 03:52:57,160 the ends of the data. Like if I gave extra credit to a few people in my five point quiz, 2506 03:52:57,160 --> 03:53:01,830 and they got a few sixes, probably the median won't even change because it's in the middle 2507 03:53:01,830 --> 03:53:05,681 where all the action is where we find the median. And outliers don't really bother it 2508 03:53:05,681 --> 03:53:10,061 because like if one or two people get a zero on the quiz, it's really, you know, if there's 2509 03:53:10,061 --> 03:53:14,470 21 people in there, or 100 people in there, it really isn't gonna affect, you know, these 2510 03:53:14,470 --> 03:53:18,360 things happening at the end. So we like the median because it's very resistant, and it's 2511 03:53:18,360 --> 03:53:25,850 very stable, you can't really whack it out with some outliers, throwing them on the ends. 2512 03:53:25,850 --> 03:53:31,410 Now I'm moving on to the third measure of central tendency, which is a mean, but I also 2513 03:53:31,410 --> 03:53:36,130 threw in here, trimmed mean and weighted average because there are other kinds of means. And 2514 03:53:36,130 --> 03:53:40,180 we're going to talk a little bit also about resistant measures, because like I just mentioned 2515 03:53:40,180 --> 03:53:41,180 that. 2516 03:53:41,180 --> 03:53:44,230 But I'm gonna step back 2517 03:53:44,230 --> 03:53:49,021 and talk a little bit about the Greek letter sigma here, that's actually capital sigma, 2518 03:53:49,021 --> 03:53:53,370 I do not speak Greek. And I actually have trouble speaking statistics, because a lot 2519 03:53:53,370 --> 03:53:57,811 of it's in Greek. So I try to avoid that and my lectures, but sometimes you can't get away 2520 03:53:57,811 --> 03:54:02,681 from it. So I have to really introduce you to this capital sigma. 
So in English, we say, 2521 03:54:02,681 --> 03:54:07,630 or statistics-ese, I guess, is whenever you see this, you say "sum of", like you expect 2522 03:54:07,630 --> 03:54:14,730 something to be right after it. Okay. So if you see, like, the sigma and then x, you would 2523 03:54:14,730 --> 03:54:20,931 say sum of x. That's how you say it. So what is x? Well, remember how we were just making 2524 03:54:20,931 --> 03:54:26,900 medians. And we were looking at modes. Well, each value there is considered an x, okay, 2525 03:54:26,900 --> 03:54:32,180 so each of the values in those data sets is an x. So sum of x would mean add these all up, 2526 03:54:32,180 --> 03:54:36,751 or add up all the x's. And then I just threw on another example, let's say somebody came 2527 03:54:36,751 --> 03:54:41,391 to you and said sum of xy, it would mean you must have some xy's lying around and 2528 03:54:41,391 --> 03:54:46,061 you have to add them together. Or somebody came up to you and said, you know, sum of 2529 03:54:46,061 --> 03:54:50,820 the prices of the food in your 2530 03:54:50,820 --> 03:54:56,530 basket at the grocery store, right? Somebody said sum of that, you'd be like, okay, I 2531 03:54:56,530 --> 03:55:00,551 have to go through all these prices and add them up. Right. So that's what sum of means. 2532 03:55:00,551 --> 03:55:04,561 Okay, and it's used a lot in statistics, and we're going to use sum of all the time. So 2533 03:55:04,561 --> 03:55:08,261 I just want you to get in your head that whenever you see sum of, there's probably going to 2534 03:55:08,261 --> 03:55:13,330 be this thing next to it. And it's gonna be a batch of numbers that you have to add up. 2535 03:55:13,330 --> 03:55:18,370 And if it's numbers from our data set, it will be called x, if it's other numbers from 2536 03:55:18,370 --> 03:55:22,070 something else, they will be called whatever they're called.
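As a small sketch of what "sum of" means in practice, here is the capital sigma written as Python's built-in `sum`. All of the numbers are invented for illustration, and "sum of xy" is read here in the usual way, as adding up the products of paired x and y values.

```python
# Hypothetical values; none of these come from the lecture slides.
x = [1, 2, 3]
y = [4, 5, 6]
prices = [2.50, 1.25, 4.00]   # prices of the food in a grocery basket

sum_x = sum(x)                                    # "sum of x": add up all the x's
sum_xy = sum(xi * yi for xi, yi in zip(x, y))     # "sum of xy": add up each x times its y
sum_prices = sum(prices)                          # "sum of the prices"

print(sum_x)       # 6
print(sum_xy)      # 32
print(sum_prices)  # 7.75
```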
But just know that this means 2537 03:55:22,070 --> 03:55:27,250 sum of. And I see on the slide, the upper one is Times New Roman, and the lower one's 2538 03:55:27,250 --> 03:55:30,790 Arial, they look kind of different. But I just wanted you to get ready to deal with 2539 03:55:30,790 --> 03:55:36,980 this sum of a lot. Okay, so here we are, I'm hitting you with a sum of. This is the 2540 03:55:36,980 --> 03:55:41,011 formula for the mean. And a lot of you already know how to calculate the mean. And you just 2541 03:55:41,011 --> 03:55:45,160 kind of do it. And you didn't know this is how you say it in statistics. But basically, 2542 03:55:45,160 --> 03:55:51,170 it's this ratio. So this is like a fraction. And on the top of the fraction is a sum of 2543 03:55:51,170 --> 03:55:55,220 x, you add up all your x's. And on the bottom of the fraction is n, which is however 2544 03:55:55,220 --> 03:55:58,863 many you have. So you add them all up and divide by however many you have. And you've 2545 03:55:58,863 --> 03:56:04,561 probably been doing this your whole life. But this is actually the formula. So I just 2546 03:56:04,561 --> 03:56:09,890 thought I'd demonstrate it. Um, see, I put that sum of. Remember those six data points I was 2547 03:56:09,890 --> 03:56:14,230 using for the median, I just kind of copied them over here, I add them all up. And so 2548 03:56:14,230 --> 03:56:19,551 I got sum of x is 40, right. And then I counted them, and that was six, well, I made them 2549 03:56:19,551 --> 03:56:24,550 be six. And so 40 divided by six is about 6.7. So that would be the mean for these data. And 2550 03:56:24,550 --> 03:56:27,750 you probably already knew how to do that. But I wanted to sort of cross-reference it with 2551 03:56:27,750 --> 03:56:35,760 the actual formula. Okay, now I'm again going to take a little break here to just talk about 2552 03:56:35,760 --> 03:56:41,110 means, because remember, we talked about sample statistics and population parameters.
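The mean calculation just described (sum of x on top, n on the bottom) can be sketched like this. The six values are hypothetical stand-ins chosen only so that they add up to 40, matching the total mentioned in the lecture; the real slide values are not in the transcript.

```python
def mean(values):
    """Mean = (sum of x) / n."""
    if not values:
        raise ValueError("mean of an empty data set is undefined")
    return sum(values) / len(values)


# Hypothetical six data points whose sum is 40.
data = [2, 5, 7, 8, 8, 10]

print(sum(data))             # 40  (sum of x)
print(len(data))             # 6   (n)
print(round(mean(data), 1))  # 6.7
```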
If somebody 2553 03:56:41,110 --> 03:56:47,140 just talks about a mean to you, and they say, look, the mean of such and such is six or something, 2554 03:56:47,140 --> 03:56:50,950 unless you really get into it with them, you're not going to be able to tell, it's not going to be obvious 2555 03:56:50,950 --> 03:56:57,220 if they did a sample mean or did a population mean. But when we write this down, it becomes 2556 03:56:57,220 --> 03:57:03,400 obvious. If I say x bar, see that x with the line above it, that's pronounced x bar, and 2557 03:57:03,400 --> 03:57:07,160 you'll see I write it on the slides as x bar, because it's so hard to put that little line 2558 03:57:07,160 --> 03:57:12,511 up there. But that means the same thing, this x bar. Whenever you hear x bar, or you see that 2559 03:57:12,511 --> 03:57:17,660 x with a line over it, it means that it's the sample statistic. So if you ever saw 2560 03:57:17,660 --> 03:57:23,610 like x bar equals six, not only do you know the mean is six, but the secret code says 2561 03:57:23,610 --> 03:57:28,820 this mean comes from a sample, because x bar is being stated. But if you look on the right 2562 03:57:28,820 --> 03:57:35,600 side, you'll see that there's this symbol μ, and it's pronounced mu, it's a Greek letter 2563 03:57:35,600 --> 03:57:40,400 again, and you'll see on the left, I put it in Arial. And on the right, 2564 03:57:40,400 --> 03:57:44,970 it's in Times New Roman, it looks a little different. But it's pronounced mu. And so if you saw 2565 03:57:44,970 --> 03:57:51,351 mu equals six, you'd be like, whoa, that was a population they measured.
And the you probably 2566 03:57:51,351 --> 03:57:54,320 say that too, because you don't see mu a lot like people usually don't 2567 03:57:54,320 --> 03:57:55,320 measure the population, 2568 03:57:55,320 --> 03:58:01,720 it's a lot of work, you often see x bar, but even so I want you to be cognizant of whether 2569 03:58:01,720 --> 03:58:05,881 it says mute or whether it says x bar, because it's still going to be a mean. But if it's 2570 03:58:05,881 --> 03:58:09,771 mu, they're talking about the population. And if it's x bar, they're talking about a 2571 03:58:09,771 --> 03:58:15,761 sample. And that might be more important later. But just keep this in mind. Also, when we 2572 03:58:15,761 --> 03:58:21,450 talk about samples, we use a lowercase n to mean the number of numbers we have. Whereas 2573 03:58:21,450 --> 03:58:27,751 if we use, we're talking about populations, we use an uppercase n a capital N. So you'll 2574 03:58:27,751 --> 03:58:35,080 see that the sample mean formula on the left side, this x bar equals sum of x divided by 2575 03:58:35,080 --> 03:58:41,740 n, it changes if you're talking about the population mean, and you're like, come on, 2576 03:58:41,740 --> 03:58:47,910 you add it up the same way. Like mu is basically the population mean, and capital and it's 2577 03:58:47,910 --> 03:58:53,580 just the number in the population, that means almost the same formula. But the issue is 2578 03:58:53,580 --> 03:58:57,720 you really are supposed to label things what they are. So if you're doing a population 2579 03:58:57,720 --> 03:59:01,800 mean, mean, you're supposed to call it mu, and you're supposed to use, you know, write 2580 03:59:01,800 --> 03:59:05,440 it like that on the right side of the slide. And if you're doing a sample mean, you're 2581 03:59:05,440 --> 03:59:09,430 supposed to call it x bar, and you're supposed to do it like on the left side of the slide. 
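Written out, the two formulas being contrasted here would look as follows. This is a reconstruction from the spoken description (x bar and lowercase n for a sample, mu and capital N for a population), not a copy of the slide.

```latex
\bar{x} = \frac{\sum x}{n}
\qquad\text{(sample mean)}
\qquad\qquad
\mu = \frac{\sum x}{N}
\qquad\text{(population mean)}
```

The arithmetic is the same in both cases; only the labels change to show where the numbers came from.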
2582 03:59:09,430 --> 03:59:14,010 So I just wanted to make that clear to you as you go through the rest of these lectures. 2583 03:59:14,010 --> 03:59:20,010 Because when I say mu, I'm gonna mean a mean, but it's gonna be from a population. And when 2584 03:59:20,010 --> 03:59:27,430 I say x bar, I'm gonna mean the mean, but it's gonna mean it's from a sample. Alright, so now we've 2585 03:59:27,430 --> 03:59:32,391 talked about several measures of central tendency, but I wanted to put means and medians together 2586 03:59:32,391 --> 03:59:37,100 in kind of a cage match because I wanted you to look at them and see what their differences 2587 03:59:37,100 --> 03:59:43,851 are. Now, I've been sort of giving accolades to the median, right, because it is very resistant 2588 03:59:43,851 --> 03:59:48,271 to outliers, and it's very stable. Remember how I pointed out if you throw some outliers 2589 03:59:48,271 --> 03:59:53,521 on either side, it doesn't really affect it much. Unfortunately, means are not resistant 2590 03:59:53,521 --> 03:59:59,351 to outliers. Like if I took my five point quiz, and I just felt 2591 03:59:59,351 --> 04:00:02,900 like favoring a student and then giving them 10 points, it would totally screw up 2592 04:00:02,900 --> 04:00:09,480 the mean for that class. And so it's not very stable. So one of the things we can 2593 04:00:09,480 --> 04:00:14,320 do if we've got outliers in our data is to just use the median. But sometimes we want 2594 04:00:14,320 --> 04:00:19,180 to use the mean. So we've got to do different things with it. So one of the things we can 2595 04:00:19,180 --> 04:00:26,160 do to try and make a more stable mean, or honest mean, is to trim it. So I'm going to 2596 04:00:26,160 --> 04:00:30,120 talk about how you do that.
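To see the "cage match" in numbers, here is a rough sketch using Python's standard `statistics` module. The class of 21 quiz scores is invented for illustration, and the single change (one student bumped from 5 to 10 points) mirrors the example described above.

```python
import statistics

# Hypothetical five-point quiz scores for 21 students:
# one no-show (0), five 4s, and fifteen 5s.
scores = [0] + [4] * 5 + [5] * 15

# Same class, but one student is given 10 points instead of 5.
favored = scores.copy()
favored[-1] = 10

mean_before = statistics.mean(scores)
mean_after = statistics.mean(favored)
median_before = statistics.median(scores)
median_after = statistics.median(favored)

print(round(mean_before, 2), round(mean_after, 2))  # the mean moves
print(median_before, median_after)                  # the median stays at 5
```

With more students, a single outlier moves the mean less, but it always moves it somewhat, while the median only changes if the middle of the ordered data changes.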
So as you can see, on the left side of the slide, a very 2597 04:00:30,120 --> 04:00:35,100 high value, a very low low value, like an outlier, or more than one outlier can really 2598 04:00:35,100 --> 04:00:39,710 throw off the mean. And it's not a problem with median. So if you want to make the meal 2599 04:00:39,710 --> 04:00:46,061 a little resistant, what you can do is trim data off of each end. So the outliers get 2600 04:00:46,061 --> 04:00:47,061 cut 2601 04:00:47,061 --> 04:00:48,061 off, 2602 04:00:48,061 --> 04:00:51,610 okay? The problem is, you can't look at the data, when you're doing that, really, you 2603 04:00:51,610 --> 04:00:56,170 would just have to make a rule when you're not looking and say, Okay, I'm going to trim 2604 04:00:56,170 --> 04:01:00,690 X amount off the top and X amount at the bottom and as to be equal, and you just have to look 2605 04:01:00,690 --> 04:01:07,950 away when you're doing. Okay, so what I'm some people do is a 5%, trim mean, which means 2606 04:01:07,950 --> 04:01:13,101 you take 5% of the data at the top and cut it off, and 5% at the bottom and cut it off. 2607 04:01:13,101 --> 04:01:17,950 So you basically lose 10% of your data. And in health care, a lot of people get mad about 2608 04:01:17,950 --> 04:01:22,090 that they don't want to lose any data. So they don't like to use this way of fixing 2609 04:01:22,090 --> 04:01:27,230 the problem of outliers, they use other ways. But I wanted to show you this as a simple 2610 04:01:27,230 --> 04:01:32,080 way to fix it. So I'm going to imagine we have 100 data points, because it just makes 2611 04:01:32,080 --> 04:01:38,260 it easier for you to see what's going on. Um, so if you had 100 data points, 5% of them 2612 04:01:38,260 --> 04:01:45,040 would be five. So basically, you'd be trimming five off of the top, and five off the bottom. 2613 04:01:45,040 --> 04:01:49,811 So the first step would be is probably you already made the mean out of this 100. 
And 2614 04:01:49,811 --> 04:01:53,880 you didn't like it because you saw outliers at the top and bottom. So what you have to 2615 04:01:53,880 --> 04:01:57,720 do is put the data in order just like you do for the median, you put them all in order, 2616 04:01:57,720 --> 04:02:01,681 you sort order from, you know, the lowest to the highest, take all of your 100 and do 2617 04:02:01,681 --> 04:02:07,250 that, then what you would do is you would like circle the five most bottom ones, and 2618 04:02:07,250 --> 04:02:11,030 they're going to get cut off, and you'd circle the five top most one of them, they're going 2619 04:02:11,030 --> 04:02:16,141 to get cut off, they get thrown out. And then you're you've got the 90 values left in the 2620 04:02:16,141 --> 04:02:21,200 middle. Now you make a mean out of those. And then that's a 5% trim mean, and you got 2621 04:02:21,200 --> 04:02:25,280 to tell people, if you do that, you can say here's the original mean, and here's the 5% 2622 04:02:25,280 --> 04:02:29,010 trimmed mean, because then people get an idea that there must have been some outliers and 2623 04:02:29,010 --> 04:02:34,711 some of your data got hacked off. But then this might give you sort of a more stable 2624 04:02:34,711 --> 04:02:42,400 estimate of the mean. Now I'm going to move to something else entirely. It's not about 2625 04:02:42,400 --> 04:02:48,080 trying to make the mean stable, it's just about trying to make the mean a little different. 2626 04:02:48,080 --> 04:02:54,240 Sometimes certain values in your mean should count more than others towards the mean. And 2627 04:02:54,240 --> 04:03:00,040 that sounds really esoteric, but the way we see it all the time is in school. So you might 2628 04:03:00,040 --> 04:03:04,800 get a great grade on your homework, you might get A's on your homework, right? But if homeworks 2629 04:03:04,800 --> 04:03:12,311 only worth 10% of your final grade, it doesn't help you much. 
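The 5% trimmed mean steps described above (sort, cut the same share off each end, average what is left) can be sketched as below. The function name and the 100 data points are assumptions made up for this illustration; they are not from the lecture.

```python
def trimmed_mean(values, percent=5):
    """Sort the data, drop `percent`% of the points from each end,
    and return the mean of what is left."""
    ordered = sorted(values)
    k = len(ordered) * percent // 100      # how many points to cut from each end
    kept = ordered[k:len(ordered) - k]
    if not kept:
        raise ValueError("nothing left after trimming")
    return sum(kept) / len(kept)


# Hypothetical data: the values 1..95 plus five large outliers (100 points total).
data = list(range(1, 96)) + [1000] * 5

original_mean = sum(data) / len(data)
five_percent_trimmed = trimmed_mean(data, percent=5)

print(original_mean)          # 95.6, pulled up by the outliers
print(five_percent_trimmed)   # 50.5, mean of the 90 values left in the middle
```

Reporting both numbers side by side, as suggested in the lecture, lets readers see that some data were cut. Note that the trim is symmetric, so in this made-up example the five lowest ordinary values are dropped along with the five outliers.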
And so what that 10% is it 2630 04:03:12,311 --> 04:03:17,690 when you have a class like that is it's called a weight. When you move into statistics, you 2631 04:03:17,690 --> 04:03:22,240 say well, I'm going to, you know, I as the teacher, I'm going to wait your homework grade 2632 04:03:22,240 --> 04:03:26,801 at 10% of your final grade. So it doesn't matter how awesome your homework grade is, 2633 04:03:26,801 --> 04:03:32,080 or how bad it is, it's really only going to count for 10% of your final grade. And that's 2634 04:03:32,080 --> 04:03:35,971 why we do weighted averages, you know, I don't think your homework should be worth like 50% 2635 04:03:35,971 --> 04:03:40,721 of your grade, right? That doesn't make any sense. And so even though, so you might want 2636 04:03:40,721 --> 04:03:46,860 to have different things contribute a different amounts of weight to that final mean. So this 2637 04:03:46,860 --> 04:03:51,140 is a way of messing around with the mean, and making certain things going into it count 2638 04:03:51,140 --> 04:03:57,301 for more, or have kind of a bigger vote than the other ones. And so I again, I'm just gonna 2639 04:03:57,301 --> 04:04:01,850 stick with school to give examples because this is where we normally see it. So I mean, 2640 04:04:01,850 --> 04:04:06,521 if this example where homework is worth 10% of your final grade and quizzes would be worth 2641 04:04:06,521 --> 04:04:12,190 20%. And the final worth 70%. And I just want to point out, I've actually seen people do 2642 04:04:12,190 --> 04:04:17,720 this, like cuz I tutor, and like this is horrible making your final worth, like, over 50% of 2643 04:04:17,720 --> 04:04:20,990 your grade. So this is just a shout out to any like professors watching this. Don't do 2644 04:04:20,990 --> 04:04:26,021 this. Okay. But anyway, let's say I was mean and I did it. And let's say you were pretty 2645 04:04:26,021 --> 04:04:30,980 good student and you got an A on the homework, right? 
And so we're gonna say that's a 4.0 2646 04:04:30,980 --> 04:04:37,000 because a lot of schools would say an A is 4.0. Then let's say you got B plus on the quizzes, 2647 04:04:37,000 --> 04:04:41,700 maybe because the lectures weren't very good, right? Haha. So you got B plus on the quizzes, 2648 04:04:41,700 --> 04:04:46,820 that would translate to the number 3.5 on that four point scale. And let's say you got 2649 04:04:46,820 --> 04:04:51,771 a B on the final. That's too bad, but that's 3.0. So why do I say that's too bad? Well, 2650 04:04:51,771 --> 04:04:56,990 you probably want an A because the final counts for greater weight, right, it counts for 2651 04:04:56,990 --> 04:05:01,730 70% and you'd want that to be a really high grade. Now I first wanted to show you the 2652 04:05:01,730 --> 04:05:06,390 non weighted average, like the normal mean. The normal mean you would make 2653 04:05:06,390 --> 04:05:10,160 is you just add the four to the 3.5 to the three and then divide by three, because you 2654 04:05:10,160 --> 04:05:16,420 have three in there, and you'd get 3.5, you get a B plus in the class, right? But 2655 04:05:16,420 --> 04:05:22,500 let's just look down, or let's look up at that formula. So this is the weighted average 2656 04:05:22,500 --> 04:05:29,230 formula. It's the sum of x times the weights, 2657 04:05:29,230 --> 04:05:35,120 divided by the sum of the weights. And remember when I said sum of xy, like as an example. So 2658 04:05:35,120 --> 04:05:39,460 we have to, instead of just summing x, like we did in the non weighted average, we have 2659 04:05:39,460 --> 04:05:44,891 to do x times w on all of them and sum it, and you're like, what's w? Well, remember, 2660 04:05:44,891 --> 04:05:50,780 I told you the homework was worth 10%, that's the weight for it, right? And so, instead of using 2661 04:05:50,780 --> 04:05:54,680 percent, when we do the weighted average, you use the decimal version.
So you'll see 2662 04:05:54,680 --> 04:06:01,141 under the weighted average, I'm doing that sum of X w thing by taking the four and timesing 2663 04:06:01,141 --> 04:06:08,230 it by point one for that 10% first, and then see that B plus that 3.5. That gets multiplied 2664 04:06:08,230 --> 04:06:13,530 by point two, because that's where 20% and then there's that B, you got on the final, 2665 04:06:13,530 --> 04:06:19,890 right, that gets multiplied by point seven. So that's the sum of X w thing going. And 2666 04:06:19,890 --> 04:06:26,800 what do you get, you get 3.2. Now I don't even bother to, to divide this by some of 2667 04:06:26,800 --> 04:06:32,450 W, because some of W is one in this case, like if you add up point seven plus point 2668 04:06:32,450 --> 04:06:36,720 two plus point one, you get one. And that often happens, you just make the weights add 2669 04:06:36,720 --> 04:06:40,480 up to one. But I just wanted to let you know if for some reason you had goofy weights that 2670 04:06:40,480 --> 04:06:44,800 didn't add up to one, the last thing you have to do is divide by them. So as you can see, 2671 04:06:44,800 --> 04:06:49,680 in the lower part of the slide, the sum of X W is 3.2. And if we divided it by one, we 2672 04:06:49,680 --> 04:06:56,840 get 3.2. And now you don't get b plus in the class, now you get like a B. And that's the 2673 04:06:56,840 --> 04:07:00,590 difference between the non weight and the weighted average is the weighted average weighted 2674 04:07:00,590 --> 04:07:06,690 this final be extra, and then that caused the grade, the final grade to be lower. And 2675 04:07:06,690 --> 04:07:13,540 that's what waiting is. Now, I just want to say a few things. I've gone through all our 2676 04:07:13,540 --> 04:07:18,200 measures of central tendencies, but I wanted to talk about how they relate to the distributions 2677 04:07:18,200 --> 04:07:26,070 we learned recently. So I just put up an example of a normal distribution. 
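The grade example above can be checked with a short sketch. The function name is made up here; the grades (4.0, 3.5, 3.0) and weights (10%, 20%, 70%) are the ones from the lecture, with the weights written as decimals.

```python
def weighted_mean(values, weights):
    """Weighted average = (sum of x * w) / (sum of w)."""
    total_weight = sum(weights)
    if total_weight == 0:
        raise ValueError("weights must not sum to zero")
    return sum(x * w for x, w in zip(values, weights)) / total_weight


grades = [4.0, 3.5, 3.0]      # homework (A), quizzes (B+), final (B)
weights = [0.10, 0.20, 0.70]  # homework 10%, quizzes 20%, final 70%

unweighted = sum(grades) / len(grades)
weighted = weighted_mean(grades, weights)

print(unweighted)          # 3.5 -> B plus
print(round(weighted, 2))  # 3.2 -> about a B
```

Dividing by the sum of the weights is kept in the function even though it equals one here, so the same code still works for weights that do not add up to one.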
And then I color 2678 04:07:26,070 --> 04:07:34,360 coded these lines. So see, there's a blue mean. And then there's 2679 04:07:34,360 --> 04:07:40,490 a green median. And then there's a purple mode. Technically, they should all be right 2680 04:07:40,490 --> 04:07:44,560 on top of each other. But you couldn't see them if I did that, so I just smushed them up next 2681 04:07:44,560 --> 04:07:48,810 to each other. What the point is, is if you have data with a normal distribution, all 2682 04:07:48,810 --> 04:07:54,521 these three things are on top of each other. And what the magic of this is, is you don't 2683 04:07:54,521 --> 04:08:01,600 even need a histogram to know. So like I use statistical software, and I'll feed in the 2684 04:08:01,600 --> 04:08:06,350 data, like a quantitative variable. And I'll say, tell me the mean, median, and mode. And 2685 04:08:06,350 --> 04:08:12,271 then it'll tell me the mean, median and mode. And even if I don't look at the 2686 04:08:12,271 --> 04:08:18,220 histogram, if it says almost the same number for mean, median and mode, I have a pretty good idea 2687 04:08:18,220 --> 04:08:25,120 it's a roughly symmetric distribution, like the normal. Well, that's not the case with skewed distributions. So with 2688 04:08:25,120 --> 04:08:30,990 skewed distributions, the measures of central tendency are not right on top of each other. 2689 04:08:30,990 --> 04:08:37,110 In fact, they're in a different order, depending on whether we have right skewed or left skewed. 2690 04:08:37,110 --> 04:08:42,521 So at the top of the slide, I've got an example of a right skewed distribution, right? Because 2691 04:08:42,521 --> 04:08:49,720 it's light on the right. Alright, so what's happening here? Well, the mean is getting 2692 04:08:49,720 --> 04:08:58,790 dragged around by that tail, that big tail. So you can see that the blue mean is on the 2693 04:08:58,790 --> 04:09:04,080 right side of the median. So the median is more resistant.
So it's sort of hanging out 2694 04:09:04,080 --> 04:09:09,670 closer to the bottom of the data. But the the tail, that right tail is pulling the mean 2695 04:09:09,670 --> 04:09:16,091 up. And then the mode is the lowest one. So if I get this print out, and I see that the 2696 04:09:16,091 --> 04:09:21,210 mode is the lowest the medians in the middle, and the means the highest, I can say without 2697 04:09:21,210 --> 04:09:27,090 even looking at the histogram, this is probably right skewed. Now let's look at the bottom 2698 04:09:27,090 --> 04:09:30,590 of the slide where we have the left skewed distribution, you know, because it's light 2699 04:09:30,590 --> 04:09:36,021 on the left, and you see the same phenomenon, but it's going the other direction, that that 2700 04:09:36,021 --> 04:09:42,190 tail, that's towards the low end of the data. It's dragging the mean down now. And notice 2701 04:09:42,190 --> 04:09:47,790 the median is more resistant doesn't get dragged down as much. And of course, the mode stays 2702 04:09:47,790 --> 04:09:54,610 at the high part of the data where there's more data, right? So if I get the printout, 2703 04:09:54,610 --> 04:09:58,231 and I see that the mean is the lowest and the medians in the middle and the most the 2704 04:09:58,231 --> 04:10:02,681 highest I'm like Okay, all right. have to look at the histogram. And I know this is 2705 04:10:02,681 --> 04:10:08,230 left skewed. So this is basically what I wanted to tell you about the, the distributions, 2706 04:10:08,230 --> 04:10:13,140 and these actual numbers and how they sort of relate. 2707 04:10:13,140 --> 04:10:17,970 So in conclusion, what this lecture was mainly about was the measures of central tendency, 2708 04:10:17,970 --> 04:10:25,150 right? mode, median and mean, and how to calculate those. And, you know, I've been kind of bagging 2709 04:10:25,150 --> 04:10:29,760 on the mean, I'm sorry, but the mean is just not resistance is totally not stable. 
And 2710 04:10:29,760 --> 04:10:34,811 the median is, so you want to remember these things? Yeah, you can kind of fix things by 2711 04:10:34,811 --> 04:10:38,700 doing the trimmed mean, we don't really like to do that in healthcare. Because we lose 2712 04:10:38,700 --> 04:10:44,720 some of our data, we find other ways of fixing the fact that our mean, maybe kind of goofy. 2713 04:10:44,720 --> 04:10:49,771 But they're outside of this lecture, how we do that. I also showed you about weighted 2714 04:10:49,771 --> 04:10:54,620 average, you know, just in case you have to hand calculate your grade. I'm actually I 2715 04:10:54,620 --> 04:10:59,140 had a student in my class once. And this is back when we had Blackboard. And there was 2716 04:10:59,140 --> 04:11:03,540 something wrong with Blackboard. So she was really upset because she thought she was getting 2717 04:11:03,540 --> 04:11:09,060 a really bad grade. But she was getting a bad grade because she didn't do a good job 2718 04:11:09,060 --> 04:11:13,500 of learning weighted average, because when I showed her how to actually calculate her 2719 04:11:13,500 --> 04:11:17,320 grade, it turned out to be a B, I remember she was crying. Because she did an unweighted 2720 04:11:17,320 --> 04:11:20,801 average, she was crying in my office. And then I just showed her how to do the weighted 2721 04:11:20,801 --> 04:11:27,600 average. And she stopped crying, she was getting a B. So just don't cry. Try the weighted average 2722 04:11:27,600 --> 04:11:33,780 first, okay. And then finally, I went over distributions and measures of central tendency, 2723 04:11:33,780 --> 04:11:40,221 and just related to you how the distributions, how the numbers we get from the measures of 2724 04:11:40,221 --> 04:11:46,200 central tendency, how we can put them on distributions and see some information about the distribution. 
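The "read the printout instead of the histogram" idea summarized above can be sketched as a small Python check. Everything here is illustrative: the function `guess_skew`, its `tolerance` parameter, and the three data sets are hypothetical, and the ordering of mean, median, and mode is a rule of thumb rather than a guarantee about the shape of the data.

```python
import statistics


def guess_skew(data, tolerance=0.1):
    """Guess the shape of a distribution from the order of mode, median, and mean.

    `tolerance` (a made-up knob) is the fraction of the data's range within
    which the three measures count as being "on top of each other".
    """
    mean = statistics.mean(data)
    median = statistics.median(data)
    mode = statistics.mode(data)
    spread = max(data) - min(data)
    gap = max(mean, median, mode) - min(mean, median, mode)

    if spread == 0 or gap <= tolerance * spread:
        return "roughly symmetric"
    if mode <= median <= mean:
        return "probably right skewed"  # tail drags the mean up
    if mean <= median <= mode:
        return "probably left skewed"   # tail drags the mean down
    return "unclear"


# Hypothetical data sets, chosen only to show each case.
symmetric = [1, 2, 2, 3, 3, 3, 4, 4, 5]
right_tail = [1, 2, 2, 2, 3, 3, 4, 5, 9, 15]
left_tail = [1, 7, 11, 12, 13, 13, 14, 14, 14, 15]

for name, data in (("symmetric", symmetric),
                   ("right_tail", right_tail),
                   ("left_tail", left_tail)):
    print(name, "->", guess_skew(data))
```

In the right-tailed example the mode is 2, the median is 3, and the mean is 4.6, which is the lowest-to-highest order the lecture describes for a right skewed distribution.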
2725 04:11:46,200 --> 04:11:52,420 All right, well, you made it through the measures of central tendency, get ready for 3.2 measures 2726 04:11:52,420 --> 04:12:02,310 of variation. Hello, and welcome to chapter 3.2. It's Monica Wahi. Library college lecture. 2727 04:12:02,310 --> 04:12:08,490 And I'm here to go over with you measures of variation. All right, here are your 2728 04:12:08,490 --> 04:12:12,710 learning objectives. So at the end of this lecture, the student should be able to state 2729 04:12:12,710 --> 04:12:18,560 three different measures of variation used in statistics. You should also be able to explain 2730 04:12:18,560 --> 04:12:22,930 how to calculate variance and standard deviation, which, I'll give you a hint, those are two 2731 04:12:22,930 --> 04:12:28,120 of the measures. All right, you should also be able to calculate the coefficient of variation 2732 04:12:28,120 --> 04:12:35,760 and explain its interpretation. And finally, you should be able to state Chebyshev's theorem. 2733 04:12:35,760 --> 04:12:41,602 So now we're going to be concentrating on measures of variation. And the first one 2734 04:12:41,602 --> 04:12:45,931 I'm going to talk about is range. And then I'm going to talk about variance and standard 2735 04:12:45,931 --> 04:12:48,521 deviation, which are two different ones, but I'm going to talk about them together. And 2736 04:12:48,521 --> 04:12:53,280 you'll see why. Then we're going to go over the coefficient of variation, which is abbreviated 2737 04:12:53,280 --> 04:12:58,550 CV. Then we're going to talk about Chebyshev. Chebyshev came up with a theorem, we're 2738 04:12:58,550 --> 04:13:03,660 gonna talk about his theorem. And then his theorem leads us to calculate these intervals. 2739 04:13:03,660 --> 04:13:07,850 Remember, intervals have a lower limit and an upper limit. I'll remind you 2740 04:13:07,850 --> 04:13:12,061 of that, and we'll calculate Chebyshev intervals together. 
Alright, let's get 2741 04:13:12,061 --> 04:13:19,510 started. So let's think about variation. Okay. What is variation even mean? Well, it means 2742 04:13:19,510 --> 04:13:24,640 how much does the data vary? So imagine I taught two classes, which isn't too hard, 2743 04:13:24,640 --> 04:13:29,550 because I do teach two classes, I teach two of the same classes, two different sections. 2744 04:13:29,550 --> 04:13:35,880 So imagine that I gave a quiz. And the same mean grade was in each class. Okay. And I 2745 04:13:35,880 --> 04:13:41,311 said that, could we tell how internally consistent those grades were? So for instance, let's 2746 04:13:41,311 --> 04:13:46,990 say that I gave a five point quiz. And the mean, in each class was three? Do we really 2747 04:13:46,990 --> 04:13:52,601 know how many people got something far from three, like, maybe in one class, people got 2748 04:13:52,601 --> 04:13:58,350 a lot of fives, and ones. And that's how we got the average of three. And maybe in the 2749 04:13:58,350 --> 04:14:03,021 other class, everybody just got three, like, we really can't tell from a measure of central 2750 04:14:03,021 --> 04:14:08,880 tendency like median, or mean, or even mode, we can't tell how internally consistent the 2751 04:14:08,880 --> 04:14:12,900 data are, especially, we can't even tell that from a mean, two different classes can have 2752 04:14:12,900 --> 04:14:18,580 the same mean, and a totally different kind of variation behind the scenes. So when you're 2753 04:14:18,580 --> 04:14:23,790 talking about quantitative data, and you have a whole data set, and you do the measures 2754 04:14:23,790 --> 04:14:29,030 of central tendency, like Mean, Median mode, it doesn't tell the whole story, you have 2755 04:14:29,030 --> 04:14:34,690 to also add on the information about variation. 
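The two-classes example can be made concrete with a short Python sketch. The individual quiz scores below are hypothetical; the lecture only says that one class got a lot of fives and ones, the other class all got threes, and both means were three.

```python
from statistics import mean

# Hypothetical five-point quiz scores for the two sections.
class_a = [5, 5, 1, 1, 3]  # lots of fives and ones
class_b = [3, 3, 3, 3, 3]  # everybody got a three

for name, scores in (("class A", class_a), ("class B", class_b)):
    center = mean(scores)
    # How far each score sits from its class mean.
    distances = [abs(score - center) for score in scores]
    print(name, "mean:", center, "distances from the mean:", distances)
```

Both classes report the same mean, so the mean alone cannot tell them apart; only the distances from the mean show that class A varies and class B does not.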
And these calculations that we're going to 2756 04:14:34,690 --> 04:14:41,420 learn here in this lecture are about ways to express how much the data vary in the data 2757 04:14:41,420 --> 04:14:45,240 set. And it's just separate from central tendency. So central tendency is just about central 2758 04:14:45,240 --> 04:14:51,140 tendency. And then this variation is about variation. And you need to know both before 2759 04:14:51,140 --> 04:14:55,210 you can really evaluate your data set. So we'll get started on talking about ways to 2760 04:14:55,210 --> 04:14:59,271 calculate these measures of variation. 2761 04:14:59,271 --> 04:15:03,690 So um, As I said, I'm going to go through range. First, I'm going to talk about variance 2762 04:15:03,690 --> 04:15:07,261 and standard deviation. And I just want to remind you, you know how I'm always going 2763 04:15:07,261 --> 04:15:12,561 on about sample statistics versus population parameters. Well, this starts playing in in 2764 04:15:12,561 --> 04:15:17,190 that the formulas are slightly different than for sample variance, the standard deviation 2765 04:15:17,190 --> 04:15:22,080 and population standard deviation. So we'll go over those separate 2766 04:15:22,080 --> 04:15:23,561 different formulas. 2767 04:15:23,561 --> 04:15:27,440 Finally, we're going to talk about in the measures of variation, we're going to talk 2768 04:15:27,440 --> 04:15:32,470 about the coefficient of variation or CV, but we'll do that after these other ones. 2769 04:15:32,470 --> 04:15:37,680 Okay, so we're going to start with the range, because it's the simplest to calculate. So 2770 04:15:37,680 --> 04:15:41,641 here's how you do it. So you'll notice on the right, I just made up five numbers, I 2771 04:15:41,641 --> 04:15:46,761 just totally made them up. I don't know what they are. Okay, I just did that for a demonstration, 2772 04:15:46,761 --> 04:15:52,530 because the range is the difference between the maximum and minimum value. 
So literally, 2773 04:15:52,530 --> 04:15:56,920 it's pretty easy to calculate. You have to first search around for the highest or the 2774 04:15:56,920 --> 04:16:01,630 maximum, which in this little data set, it's so cute. It's only got five numbers. So it 2775 04:16:01,630 --> 04:16:07,090 was obvious that 78 was the highest, right? And it's sort of obvious that 21 is the 2776 04:16:07,090 --> 04:16:11,880 lowest. So how you calculate the range is you take the highest minus the lowest, and 2777 04:16:11,880 --> 04:16:16,240 then you get a number. And that's the range. And sometimes my students actually take the 2778 04:16:16,240 --> 04:16:20,431 highest, and then they put minus and then the lowest. And then they tell me, that's 2779 04:16:20,431 --> 04:16:24,800 the range. And I'm like, no, you actually have to subtract it out. So you'll see here, 2780 04:16:24,800 --> 04:16:32,080 it says 78 minus 21 equals 57. So it's 57. That's the range. Okay. So all it's telling 2781 04:16:32,080 --> 04:16:38,630 you is the distance between the top and the bottom. And I'll just say that, that's not 2782 04:16:38,630 --> 04:16:43,910 very useful. In fact, I had a problem with that when I was working, I worked at the army 2783 04:16:43,910 --> 04:16:50,780 on this army database. And I looked at the range of ages of soldiers when they started. 2784 04:16:50,780 --> 04:16:59,120 And the range was age four through 107. All right, obviously, there was a problem with the data, 2785 04:16:59,120 --> 04:17:03,881 right? Just for some reason, there was a screwed up record that said somebody got in when 2786 04:17:03,881 --> 04:17:07,641 they were four. And there was another screwed up record that said somebody got in when 2787 04:17:07,641 --> 04:17:11,530 they were over 100. There was just screwed up data, okay. And that caused me to have 2788 04:17:11,530 --> 04:17:17,801 this ridiculous range. And so the range is not very stable or resistant, right? 
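A minimal Python sketch of the range, and of how two bad records can wreck it, might look like this. Only the values 78, 21, 4, and 107 come from the lecture; the other numbers in each list and the helper name `value_range` are made up for illustration.

```python
def value_range(data):
    """Range = maximum minus minimum (a single number, not the pair)."""
    return max(data) - min(data)


# Only 21 and 78 are from the slide; the middle three values are invented.
demo = [21, 35, 46, 60, 78]
print(value_range(demo))  # 57

# Hypothetical enlistment ages; 4 and 107 stand in for the two bad records.
ages_with_errors = [4, 18, 19, 21, 24, 107]
ages_cleaned = [18, 19, 21, 24]

print(value_range(ages_with_errors))  # 103
print(value_range(ages_cleaned))      # 6
```

Removing just two records changes the range from 103 to 6, which is what it means for the range not to be resistant.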
If we 2789 04:17:17,801 --> 04:17:21,641 just fixed that, you know, record that said somebody was four when they got in the army, 2790 04:17:21,641 --> 04:17:26,860 then we might have a normal range, you know, like little more like a minimum, we might 2791 04:17:26,860 --> 04:17:33,190 see 18, or 17, or 19, or something. But, as you can see, on the right side of the slide, 2792 04:17:33,190 --> 04:17:37,740 I just picked out that the minimum and the maximum, we could just change arbitrarily 2793 04:17:37,740 --> 04:17:43,120 change those numbers. And suddenly, we'd have something totally different from 57. So as 2794 04:17:43,120 --> 04:17:48,480 you can see, even though this range is a measure of variation, it's not stable and resistant. 2795 04:17:48,480 --> 04:17:53,750 And it actually kind of doesn't tell you much. If I say we've got a range of 57, you don't 2796 04:17:53,750 --> 04:17:59,390 know if the minimum is like zero, or like negative, or like 105, you know, you really 2797 04:17:59,390 --> 04:18:04,561 don't know where that ranges in. So it's not very useful. But it's a place to start, because 2798 04:18:04,561 --> 04:18:09,800 that's our first measure of variation. Now we're going to get into what we really use 2799 04:18:09,800 --> 04:18:15,521 in statistics a lot, you'll sometimes see in articles where they state with the ranges, 2800 04:18:15,521 --> 04:18:19,830 they usually don't state the actual number I tell you to calculate, they actually state 2801 04:18:19,830 --> 04:18:24,550 the minimum and the maximum. And sometimes that's interesting. But variance and standard 2802 04:18:24,550 --> 04:18:28,730 deviation. That's what we really live on in statistics for measures of variation. And 2803 04:18:28,730 --> 04:18:32,730 you're probably wondering why I'm talking about them together when they're totally different 2804 04:18:32,730 --> 04:18:37,190 calculations. Well, it's because they're friends. Okay? And how are they friends? 
Well, the 2805 04:18:37,190 --> 04:18:41,540 variance calculations, kind of a big formula. And so you get through that, and then you 2806 04:18:41,540 --> 04:18:46,490 have the variance. And then all you have to do to get the standard deviation is take the 2807 04:18:46,490 --> 04:18:50,480 square root of the variance. So that's why they're friends is like you go through all 2808 04:18:50,480 --> 04:18:54,311 this trouble to get the variance. And then the next step is just take the square root 2809 04:18:54,311 --> 04:18:58,771 of that, and you get the standard deviation. So before I actually talk about those formulas, 2810 04:18:58,771 --> 04:19:05,040 I wanted to just set in your head, what these words mean. Because, like, I remember, I worked 2811 04:19:05,040 --> 04:19:09,881 in a mental health place. And I don't know, we didn't have enough licensed people there. 2812 04:19:09,881 --> 04:19:14,360 And so our leader said, Oh, I'm applying to the state for a variance, right? Meaning that 2813 04:19:14,360 --> 04:19:19,760 the state would give us allow us to vary from the rules. Well, that's what variances is 2814 04:19:19,760 --> 04:19:25,430 how the data vary. So you think of the spread of the data and how well does the mean every 2815 04:19:25,430 --> 04:19:30,990 represent that spread? It doesn't, right. So variance is a way of representing how the 2816 04:19:30,990 --> 04:19:36,310 data vary really around the meet. Now, you're probably wondering, well, then why do you 2817 04:19:36,310 --> 04:19:40,910 even have standard deviation? It's the square root of variance. But let's just think about 2818 04:19:40,910 --> 04:19:46,021 what the word means. You know, standard means sort of following a standard are the same. 2819 04:19:46,021 --> 04:19:53,950 So it's just the amount of variation, that standard in the data set. And you know what 2820 04:19:53,950 --> 04:19:58,360 the word deviation means? 
Like, you can say, oh, that person is a social deviant because 2821 04:19:58,360 --> 04:20:03,590 they go to crimes or something. Or like this guy with a healthy nose, he does not have 2822 04:20:03,590 --> 04:20:08,610 a deviated septum. But you know, some people do have a deviated septum where it's like 2823 04:20:08,610 --> 04:20:09,610 crooked, 2824 04:20:09,610 --> 04:20:13,290 right and they have trouble like sneezing and blowing their nose and sometimes even 2825 04:20:13,290 --> 04:20:18,420 breathing. Well, a standard deviation would simply mean that everybody's deviation is 2826 04:20:18,420 --> 04:20:24,660 about the same. So, variance is a calculation that says how much things vary. And so the 2827 04:20:24,660 --> 04:20:27,750 standard deviation, because it's just the square root of variance, but I just want you 2828 04:20:27,750 --> 04:20:36,110 to imagine in your head, oh, standard deviation, that means how much the data deviates around 2829 04:20:36,110 --> 04:20:40,650 the mean, because a lot of times students get confused about the measures of central 2830 04:20:40,650 --> 04:20:45,561 tendency, they try to apply them to variation, but variation is totally different thing. 2831 04:20:45,561 --> 04:20:50,271 So just remember what variance literally means, and what standard deviation literally means. 2832 04:20:50,271 --> 04:20:57,910 And that might help you get through these formulas and understand the interpretation. 2833 04:20:57,910 --> 04:21:03,700 So as I mentioned earlier, the formulas for variance and standard deviation are different, 2834 04:21:03,700 --> 04:21:11,351 whether you're talking about a sample, or a population. And, admittedly, we don't use 2835 04:21:11,351 --> 04:21:17,360 the population variance or population standard deviation calculation very often, because 2836 04:21:17,360 --> 04:21:22,240 we don't measure the population that often. 
So we tend to use the sample variance and 2837 04:21:22,240 --> 04:21:26,271 sample standard deviation all the time. So I'm going to demonstrate those. But you'll 2838 04:21:26,271 --> 04:21:32,460 notice conceptually, they're really similar. Like, um, you know, if you have population 2839 04:21:32,460 --> 04:21:39,160 parameters like Meuse, and like population standard deviations, they tend to behave similarly 2840 04:21:39,160 --> 04:21:45,610 in formulas, as sample versions, it's just that in statistics, we always want to be really 2841 04:21:45,610 --> 04:21:49,980 clear about what we're talking about. So we always want to use the right symbol, so we're 2842 04:21:49,980 --> 04:21:56,250 hinting towards, we're analyzing a sample versus we're analyzing a population even though 2843 04:21:56,250 --> 04:22:00,830 conceptually like means or a mean, right? But you want to represent which mean you're 2844 04:22:00,830 --> 04:22:05,851 talking about one, that's a parameter, or one, that's a statistic, whenever you write 2845 04:22:05,851 --> 04:22:12,030 out the formula, so I'm just being picky about that. And then there's also two other things 2846 04:22:12,030 --> 04:22:18,181 you want to know. Um, there's two different ways of actually doing each of these formulas. 2847 04:22:18,181 --> 04:22:22,780 You know how like an algebra, you can have a big equation, and you can express it more 2848 04:22:22,780 --> 04:22:28,431 than one way. So that's all they do is they put a formula in one way called the defining 2849 04:22:28,431 --> 04:22:34,980 formula. And then they put the formula, same formula, but rearranged by algebra into the 2850 04:22:34,980 --> 04:22:39,590 computational formula. Now, I always think that's kind of funny that they call the computation, 2851 04:22:39,590 --> 04:22:43,780 right? I mean, both the formulas give you the same results, it's just plugging in numbers 2852 04:22:43,780 --> 04:22:47,350 and getting out the answer. 
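To see that the two forms agree, here is a hedged Python sketch. The lecture does not show the computational formula, so the rearrangement used below (sum of x squared, minus the square of the sum of x divided by n, all over n minus one) is the common textbook form and is an assumption about which version is meant. The data are the six waiting times used later in this lecture, and both function names are invented for this sketch.

```python
def variance_defining(data):
    """Sample variance via the defining formula: sum((x - x_bar)^2) / (n - 1)."""
    n = len(data)
    x_bar = sum(data) / n
    return sum((x - x_bar) ** 2 for x in data) / (n - 1)


def variance_computational(data):
    """Sample variance via the assumed computational form:
    (sum(x^2) - (sum(x))^2 / n) / (n - 1)."""
    n = len(data)
    sum_x = sum(data)
    sum_x_squared = sum(x * x for x in data)
    return (sum_x_squared - sum_x ** 2 / n) / (n - 1)


waits = [2, 3, 3, 8, 10, 10]  # waiting times in minutes

print(variance_defining(waits))       # 14.0
print(variance_computational(waits))  # 14.0
```

The defining form shows each deviation from the mean explicitly, which is why it is easier to check by hand; the computational form skips the deviations and works from two running totals.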
And the answer is gonna be same, whether you use the defining 2853 04:22:47,350 --> 04:22:51,850 formula, or the computational formula. But what I think is so funny is they call it the 2854 04:22:51,850 --> 04:22:57,031 computational formula, but I cannot compute it. Like I always get confused when I use 2855 04:22:57,031 --> 04:23:02,920 it. So I pretty much ignore the computational formula in my entire life. And I just teach 2856 04:23:02,920 --> 04:23:07,680 the defining formula. And I find my students always remember the defining formula, they 2857 04:23:07,680 --> 04:23:11,670 always can get through it. Although people who are into the computational formula, they 2858 04:23:11,670 --> 04:23:16,771 tell me that I'm doing things the hard way, I'm going the long way around. But you know, 2859 04:23:16,771 --> 04:23:22,030 what just goes a long way around, it helps you not get confused, and helps you convince 2860 04:23:22,030 --> 04:23:27,001 yourself you actually got the right answer. So let's just do the defining formula. All 2861 04:23:27,001 --> 04:23:32,440 right. So let's look at the defining formula, you can look it up, you can look up the computational 2862 04:23:32,440 --> 04:23:36,840 formula, but this is the defining formula. So let's just get get our minds wrapped around 2863 04:23:36,840 --> 04:23:43,340 that. Remember, I told you that variance is great, because you calculate that, and then 2864 04:23:43,340 --> 04:23:46,851 you just take the square root of that, and you get the standard deviation. So as you 2865 04:23:46,851 --> 04:23:50,530 can see on the left side of the slide, we abbreviate the sample variance by just saying 2866 04:23:50,530 --> 04:23:55,950 s, which is the standard deviation to the second. I know that sounds ridiculous, right? 2867 04:23:55,950 --> 04:23:59,880 Like why don't we have a special thing just for the variance? 
Why do we just say it's 2868 04:23:59,880 --> 04:24:04,660 so the second and then say sample standard deviation is just as as well actually, to 2869 04:24:04,660 --> 04:24:09,180 be honest with you people use different notation. I'm just using this because it matches the 2870 04:24:09,180 --> 04:24:15,721 textbook we're using. But people will often say var for variance. And so in other textbooks, 2871 04:24:15,721 --> 04:24:21,280 they'll do that, and then statistical software, but they'll also say s to the second like 2872 04:24:21,280 --> 04:24:26,940 this, and it's maybe a good way of you remembering that the standard deviation is just the square 2873 04:24:26,940 --> 04:24:33,550 root of the variance, right? So if you ever see s to the second, remember, S is the sample 2874 04:24:33,550 --> 04:24:38,121 standard deviation, and s The second is the sample variance. And I'll show you the population 2875 04:24:38,121 --> 04:24:44,940 one in a minute. But if you see those, that's what they're talking. Okay. Now, let's look 2876 04:24:44,940 --> 04:24:52,101 upstairs at the top formula. See this thing on the top? It's really kind of scary, but 2877 04:24:52,101 --> 04:24:55,410 we're going to work through this and you're not going to be scared of it. Okay. 2878 04:24:55,410 --> 04:24:59,771 I know you know that there's a little some sign there that capital sigma so you know, 2879 04:24:59,771 --> 04:25:04,240 something's gonna They get summed up. But that looks kind of scary that x minus x bar 2880 04:25:04,240 --> 04:25:09,440 to the second thing will handle that, okay. But n minus one on the bottom, that's not 2881 04:25:09,440 --> 04:25:14,370 so scary, okay. And we'll handle that one too. And then you'll just notice, all I did 2882 04:25:14,370 --> 04:25:18,710 for the bottom part is I just put this huge square root sign over that whole thing. So 2883 04:25:18,710 --> 04:25:21,910 that's the only difference between the upstairs and the downstairs. 
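Written out, the two defining formulas described on the slide are the following, using the lecture's own symbols: s squared for the sample variance, s for the sample standard deviation, x bar for the sample mean, and n for the sample size.

```latex
s^2 = \frac{\sum (x - \bar{x})^2}{n - 1}
\qquad
s = \sqrt{\frac{\sum (x - \bar{x})^2}{n - 1}}
```

The second formula is the first one with a square root over the whole fraction, which is the only difference between the "upstairs" and the "downstairs" formula.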
And then I also wanted 2884 04:25:21,910 --> 04:25:26,630 to show you a picture of a calculator, because a lot of times, if you haven't really done 2885 04:25:26,630 --> 04:25:31,300 math or statistics for a while, you forget the whole concept of square root. And I'll 2886 04:25:31,300 --> 04:25:35,620 just remind you, whenever there's a square root of something, it just means that if you 2887 04:25:35,620 --> 04:25:41,681 times it by itself, you'll get that number. So remember, like 25, the square root of 25, 2888 04:25:41,681 --> 04:25:45,230 if you put 25 in your calculator, and you hit that square root thing, you'll get five, 2889 04:25:45,230 --> 04:25:50,440 right, because five times five is 25. However, if you put in 24, you're gonna get something 2890 04:25:50,440 --> 04:25:54,940 with decimals, right? But whatever it is, you get, if you times it by itself, you'll 2891 04:25:54,940 --> 04:25:58,980 get 24. So I just want to remind you of that, because sometimes people forget that if they 2892 04:25:58,980 --> 04:26:03,110 haven't been doing statistics or math for a while, or they haven't used the calculator 2893 04:26:03,110 --> 04:26:09,341 for a while. All right, I told you, I talked to you about this numerator, right that the 2894 04:26:09,341 --> 04:26:13,930 top is the numerator in a fraction, and the bottom is the denominator. So I'm going to 2895 04:26:13,930 --> 04:26:20,150 talk to you about this numerator. So the sum of X minus X bar squared, you know, that's 2896 04:26:20,150 --> 04:26:23,820 how I would say it, this is actually called this little piece of the formula is called 2897 04:26:23,820 --> 04:26:29,350 the sum of squares. And so when From now on, when I say sum of squares, I literally mean 2898 04:26:29,350 --> 04:26:35,641 the top half of this equation. 
So what you do when you do the defining formula, is you 2899 04:26:35,641 --> 04:26:39,131 just kind of relax and say, the first thing I'm going to do is figure out the sum of squares, 2900 04:26:39,131 --> 04:26:43,561 I'm going to figure out the top part. And then I'm going to just write that down, and 2901 04:26:43,561 --> 04:26:47,980 then later, I'm gonna come back to this formula and enter it. So this next part is, how do 2902 04:26:47,980 --> 04:26:52,780 we figure out that top part of the equation? How do we get the sum of squares? And I'll 2903 04:26:52,780 --> 04:26:59,080 show you. Okay, so let's just look at the slide. On the left, there's this blank 2904 04:26:59,080 --> 04:27:04,410 table. And that's usually what I do first, is I make this blank table. And you don't 2905 04:27:04,410 --> 04:27:08,551 want to say column one, column two, column three, I just put that there so I could talk 2906 04:27:08,551 --> 04:27:13,150 about the columns, and then you'd know what I was talking about. But usually, what I put is 2907 04:27:13,150 --> 04:27:18,100 I put x in the first column, and then I put x minus x bar. I wrote out minus, but you can 2908 04:27:18,100 --> 04:27:25,160 just use a dash. And then I put in parentheses in the third column, x minus x bar to the 2909 04:27:25,160 --> 04:27:29,750 second, like that. Remember, when you have parentheses, you have to do what's inside 2910 04:27:29,750 --> 04:27:35,110 the parentheses first. So this means you literally have to do x minus x bar before you to the 2911 04:27:35,110 --> 04:27:39,930 second it, or square it. And I'm just walking you through this to get you ready for what 2912 04:27:39,930 --> 04:27:44,230 we're going to do with this table. On the right side of this slide, I'm just reminding you that 2913 04:27:44,230 --> 04:27:48,552 the sum of x minus x bar to the second, in other words, the sum of whatever is going 2914 04:27:48,552 --> 04:27:54,050 to be in column three. 
That's another way of saying the sum of squares. Okay. So 2915 04:27:54,050 --> 04:27:58,521 an easy way to explain this, what the squares are, is to just show you how to calculate 2916 04:27:58,521 --> 04:27:59,521 it. 2917 04:27:59,521 --> 04:28:00,771 So I just 2918 04:28:00,771 --> 04:28:05,790 pulled out some data set, imagine a sample of six patients presented to Central lab. 2919 04:28:05,790 --> 04:28:09,910 So this happens to me when I go to my doctor, sometimes she'll say, you know, it's time 2920 04:28:09,910 --> 04:28:15,911 to do a lab panel for you. So she gives me this slip of paper, and I go downstairs to 2921 04:28:15,911 --> 04:28:20,010 the central lab, and I give them the slip of paper, and they say, Okay, sit down, and 2922 04:28:20,010 --> 04:28:24,380 then we'll call you up, and we'll draw your blood or whatever. So we're imagining six 2923 04:28:24,380 --> 04:28:31,140 people did that. And then they got up to have their blood drawn. We asked them, How long 2924 04:28:31,140 --> 04:28:36,530 did you wait? Okay. And I'm in the central lab where I literally do wait two minutes, 2925 04:28:36,530 --> 04:28:37,780 that's a really good 2926 04:28:37,780 --> 04:28:38,780 lap. But 2927 04:28:38,780 --> 04:28:42,940 sometimes it's really busy if I go like during lunch, and I'll wait something like 10 minutes. 2928 04:28:42,940 --> 04:28:48,650 So here are six patients. One of them waited two minutes, a couple of them waited three 2929 04:28:48,650 --> 04:28:53,021 minutes, probably the other three came in during lunch because they waited eight minutes, 2930 04:28:53,021 --> 04:28:57,940 10 minutes and 10 minutes. Okay, so that's our data, that it's a little tiny data set, 2931 04:28:57,940 --> 04:29:04,410 but I just wanted to use something small to show you how to calculate the variance, and 2932 04:29:04,410 --> 04:29:10,390 then the standard deviation with just this little data set. Okay. 
So what's the first 2933 04:29:10,390 --> 04:29:15,390 step? After making the table (you have to make the blank table first), you fill in the 2934 04:29:15,390 --> 04:29:21,600 first column, which is called x. So what is x? Actually, each of these patient's waiting 2935 04:29:21,600 --> 04:29:29,150 times is an x. Remember sum of x? If we said sum of x, we would mean add all these x's 2936 04:29:29,150 --> 04:29:34,521 together, right? So that's all I did, I just put each x in the column. You'll see 2937 04:29:34,521 --> 04:29:40,830 2, 3, 3, 8, 10, 10. It's just identical to these x's. And then at the bottom, I put that 2938 04:29:40,830 --> 04:29:46,190 little fancy sum of x, and it said 36. Okay, and so that's just the first thing you do. Just 2939 04:29:46,190 --> 04:29:51,960 put them all in and do the sum of x. All right. Now the next step is, don't look at the left 2940 04:29:51,960 --> 04:29:57,990 side of the slide yet, look at the right side. Before you go and fill in column two, you 2941 04:29:57,990 --> 04:30:03,391 have to do x bar. In other words, you have to figure out the mean. Now, you can kind 2942 04:30:03,391 --> 04:30:09,340 of cheat because you just figured out the sum of x. And if you remember the formula, the 2943 04:30:09,340 --> 04:30:15,440 mean, or the x bar of the sample, is the sum of x divided by n. And remember, I told you 2944 04:30:15,440 --> 04:30:22,210 it's six patients, so you just take 36 divided by six, and you get six. Now you just hold 2945 04:30:22,210 --> 04:30:23,210 that number, 2946 04:30:23,210 --> 04:30:24,650 you hold that. 2947 04:30:24,650 --> 04:30:30,830 So between column one and column two, you got to calculate x bar, and you hold it, right. 
2948 04:30:30,830 --> 04:30:35,141 And then while you're holding that, you keep it off to the side, you realize that this 2949 04:30:35,141 --> 04:30:42,460 is how we're going to fill in column two. What x minus x bar means is, the x bar is just 2950 04:30:42,460 --> 04:30:49,220 six, but we have to go through each x and subtract x bar from it. It's helpful to order the 2951 04:30:49,220 --> 04:30:54,100 x's before you do this. Like notice, I put them in order: 2, 3, 3, 8, 10, 10. It's a good idea 2952 04:30:54,100 --> 04:30:59,400 to just do that, because it helps your brain think about whether or not you're doing the right 2953 04:30:59,400 --> 04:31:05,931 thing. So let's start with the two. So we do two minus six, which is the x bar. Now 2954 04:31:05,931 --> 04:31:11,200 you can look at column two: two minus six equals negative four. I hate negative numbers, 2955 04:31:11,200 --> 04:31:16,240 but you just have to deal with them sometimes. Okay, so it's negative four, so you just deal 2956 04:31:16,240 --> 04:31:21,060 with that. Then you go to the next slide, and it's three minus six, which is negative 2957 04:31:21,060 --> 04:31:24,561 three. So we're still underwater here with the negatives, but you'll notice that the 2958 04:31:24,561 --> 04:31:29,190 next x is three, so you can kind of copy what you just did. So you're getting negative 2959 04:31:29,190 --> 04:31:32,771 three. So what you're actually technically filling in this column: I showed you the equation, 2960 04:31:32,771 --> 04:31:36,590 but you're putting negative four in the first one, negative three in the second one, negative 2961 04:31:36,590 --> 04:31:42,970 three in the third one. And then now finally, the fourth x is eight. So eight minus six, 2962 04:31:42,970 --> 04:31:48,070 we got above water, now we're at two, right. And then we have 10 minus six, which is four, and 10 minus 2963 04:31:48,070 --> 04:31:52,810 six, which is four. And when you order them like that, that's often what happens. 
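The column-two step can be checked with a few lines of Python. This is only a sketch of the arithmetic described above, using the six waiting times from the lecture; the variable names are invented.

```python
waits = [2, 3, 3, 8, 10, 10]  # waiting times in minutes, already in order

# Step between column one and column two: the sample mean, x bar.
x_bar = sum(waits) / len(waits)

# Column two: x minus x bar for every x.
deviations = [x - x_bar for x in waits]

print(x_bar)            # 6.0
print(deviations)       # [-4.0, -3.0, -3.0, 2.0, 4.0, 4.0]
print(sum(deviations))  # 0.0
```

With the x's in order, the negatives come first and the positives last, and the deviations add up to zero. That cancellation always happens with deviations from the mean, so simply adding up column two cannot measure variation.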
In fact, 2964 04:31:52,810 --> 04:31:56,950 that's always what happens, you end up with a bunch of negative ones at the beginning 2965 04:31:56,950 --> 04:32:01,811 and a bunch of positive ones later, that's just totally normal. Don't worry about that. 2966 04:32:01,811 --> 04:32:06,551 But you got to be careful, you got to make sure you make the right mean. I've had people 2967 04:32:06,551 --> 04:32:11,840 on tests actually screw up this mean. So you can just imagine what a train wreck happens 2968 04:32:11,840 --> 04:32:15,990 after that: you do not get anything right after that. So make sure your mean's right. 2969 04:32:15,990 --> 04:32:20,650 And then make sure you subtract it from every single x and put the right answer in column 2970 04:32:20,650 --> 04:32:26,200 two. That's the next step. All right. Okay, so we're done with that step, what do we do 2971 04:32:26,200 --> 04:32:33,800 next? Now, we just take whatever we got in column two and square it. So the first 2972 04:32:33,800 --> 04:32:39,641 one was negative four. So remember, a square is just the number times itself. 2973 04:32:39,641 --> 04:32:45,460 So if you don't like to use the x to the second button on your calculator, you can just do 2974 04:32:45,460 --> 04:32:50,970 negative four times negative four, same thing. And so you'll notice we do negative four times 2975 04:32:50,970 --> 04:32:55,760 negative four, we get 16. Now, it's pretty easy. Negative three times negative three 2976 04:32:55,760 --> 04:33:01,190 is nine, you know, two times two is four. But what I want you to really look at is the 2977 04:33:01,190 --> 04:33:07,590 10s. Notice that they get a 16 too, just like the two did. And that's 2978 04:33:07,590 --> 04:33:12,169 the trick here. Remember, I said I hate negative numbers? Well, a lot of statisticians feel 2979 04:33:12,169 --> 04:33:13,759 the same way I do. 
2980 04:33:13,759 --> 04:33:20,269 And so they often fix it by squaring the number, because it erases the negative. Just 2981 04:33:20,270 --> 04:33:25,621 remember, negative times negative is positive, and positive times positive is also positive. 2982 04:33:25,621 --> 04:33:31,551 That's a little trick, you know, when it comes to multiplying. And so when we do that, we 2983 04:33:31,551 --> 04:33:40,061 are squaring each one in column two. And they're called squares, right? So we've got 16, 9, 9, 4, 2984 04:33:40,061 --> 04:33:47,520 16, 16. These each are squares. So what do you think we do? We add up that entire column, 2985 04:33:47,520 --> 04:33:52,778 and we get the sum of squares. So look at that, we add up that entire column, and we 2986 04:33:52,778 --> 04:33:57,849 get that super complicated looking thing at the bottom, which is the numerator for our 2987 04:33:57,849 --> 04:34:02,339 variance equation, right? Like this wasn't really that hard, was it? Okay, so we sum 2988 04:34:02,340 --> 04:34:09,711 that up. And as it turns out, we get the number 70. So 70 is our sum of squares. All right. 2989 04:34:09,711 --> 04:34:15,438 Now we're back at the sample variance formula. And I'm so excited because look at 2990 04:34:15,438 --> 04:34:21,519 the top of the formula. We have the answer. It's 70. Okay, so we got that 70. But we still 2991 04:34:21,520 --> 04:34:25,938 have to deal with the bottom of the formula. Remember, n was six, right? We had six patients, 2992 04:34:25,938 --> 04:34:30,519 and the bottom of the formula is n minus one. So the bottom of the formula is going to be 2993 04:34:30,520 --> 04:34:36,211 five, right? So let's fill this in. I was kind of running out of room, so I just filled 2994 04:34:36,211 --> 04:34:40,990 it in upstairs. So you see that 70 divided by five, suddenly this looks super easy, right? 2995 04:34:40,990 --> 04:34:48,528 So 70 divided by five is 14. Okay? That's the variance. Totally easy, right? 
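The squared column, the sum of squares, and the sample variance can be reproduced as a small sketch. The variable names are illustrative assumptions; the numbers are the ones from the table.

```python
waiting_times = [2, 3, 3, 8, 10, 10]
n = len(waiting_times)
x_bar = sum(waiting_times) / n

# Column three: square each deviation from the mean, which removes the negatives.
squares = [(x - x_bar) ** 2 for x in waiting_times]

sum_of_squares = sum(squares)                # numerator of the variance formula
sample_variance = sum_of_squares / (n - 1)   # sample formula divides by n - 1

print(squares)                           # [16.0, 9.0, 9.0, 4.0, 16.0, 16.0]
print(sum_of_squares, sample_variance)   # 70.0 14.0
```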
Once you 2996 04:34:48,528 --> 04:34:52,269 make that, I mean, it's tedious, right? You have to make that whole table and 2997 04:34:52,270 --> 04:34:57,141 add things up and stuff. But here, it's not really that hard. Now, guess how we're gonna 2998 04:34:57,141 --> 04:35:03,641 make the standard deviation. You've probably guessed it, we're just going to take the square 2999 04:35:03,641 --> 04:35:08,961 root of 14. So remember that button on your calculator, you could put in 14, hit that 3000 04:35:08,961 --> 04:35:15,141 button, and you get 3.74 and a bunch of other stuff, but I just chopped it off at 3.74. 3001 04:35:15,141 --> 04:35:21,938 So that is your sample standard deviation. Now I promised you I would talk about the 3002 04:35:21,938 --> 04:35:27,779 population formulas for standard deviation and variance, as well as the sample ones. 3003 04:35:27,779 --> 04:35:34,690 And I told you, they wouldn't really be conceptually much different. As you can see on the left 3004 04:35:34,690 --> 04:35:39,790 side of the slide, I made things red so you can see what the 3005 04:35:39,791 --> 04:35:45,391 differences are. Sample variance is s to the second, but population variance uses another 3006 04:35:45,391 --> 04:35:50,750 Greek letter. Remember, I told you that that sum symbol was capital sigma, like, you know, 3007 04:35:50,750 --> 04:35:55,801 Greek is like English, in the sense they have capital and lowercase letters? Well, that 3008 04:35:55,801 --> 04:36:00,009 thing that I always think looks like a jelly roll, the jelly roll looking thing 3009 04:36:00,009 --> 04:36:06,269 is actually lowercase sigma. So I'm never going to say lowercase sigma, except for now, 3010 04:36:06,270 --> 04:36:10,660 I'm going to say population variance and population standard deviation. So you'll see at the bottom 3011 04:36:10,660 --> 04:36:15,070 of the slide, the lowercase sigma alone is the population standard deviation. 
And then 3012 04:36:15,070 --> 04:36:21,230 the lowercase sigma to the second is the variance. So just remember, if you see that jelly roll 3013 04:36:21,230 --> 04:36:26,099 thing, we're talking about a population version of the standard deviation or variance, and not 3014 04:36:26,099 --> 04:36:33,649 the sample. Also, you already know about mu versus x bar, right, so we have x bar on the 3015 04:36:33,650 --> 04:36:39,750 left. And that's the sample mean, and mu on the right, which is the population mean. And you 3016 04:36:39,750 --> 04:36:45,820 also already know about n, which is the number in your sample. And this is where there's 3017 04:36:45,820 --> 04:36:52,131 a big difference actually. In the sample, you have to do n minus one on the bottom, 3018 04:36:52,131 --> 04:36:57,278 and in the population, you just do N, capital N, the whole population. And if you think 3019 04:36:57,278 --> 04:37:02,060 about it, it makes kind of sense, because populations are huge, so it won't even matter 3020 04:37:02,061 --> 04:37:08,301 if you, like, subtracted one. Whereas, you know, samples are small. So you sometimes have to, 3021 04:37:08,301 --> 04:37:12,539 you know, adjust or something, so you have to minus one. But it wouldn't even matter, 3022 04:37:12,539 --> 04:37:17,109 like if people make a mistake and accidentally minus one from the population one, they don't 3023 04:37:17,109 --> 04:37:21,291 get much of a different answer. And so that's why I'm concentrating on the sample ones, 3024 04:37:21,291 --> 04:37:25,150 that's what we normally do. But I wanted to give a shout out, just so you know, if you 3025 04:37:25,150 --> 04:37:30,278 ever see the formulas on the right side of the slide, you know they're population level 3026 04:37:30,278 --> 04:37:38,259 formulas. Alright, now we're gonna move on, we made it through range, variance and standard 3027 04:37:38,259 --> 04:37:43,130 deviation. So now we're gonna move on to talk about the coefficient of variation. 
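To see the sample and population formulas side by side, here is a hedged sketch using Python's standard statistics module, where variance and stdev divide by n - 1 and pvariance and pstdev divide by N. Treating the six waiting times as if they were a whole population is only an assumption made for the comparison.

```python
import statistics

waiting_times = [2.0, 3.0, 3.0, 8.0, 10.0, 10.0]

# Sample versions: sum of squares divided by n - 1.
s_squared = statistics.variance(waiting_times)
s = statistics.stdev(waiting_times)

# Population versions: sum of squares divided by N.
sigma_squared = statistics.pvariance(waiting_times)
sigma = statistics.pstdev(waiting_times)

print(round(s_squared, 2), round(s, 2))           # 14.0 3.74
print(round(sigma_squared, 2), round(sigma, 2))   # 11.67 3.42
```

With only six values the two versions differ visibly (14 versus about 11.67); the gap shrinks as the number of observations grows.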
And this 3028 04:37:43,131 --> 04:37:50,240 is used a lot for comparisons, often for comparing between two different labs. 3029 04:37:50,240 --> 04:37:54,871 I say that because my friends are pathologists, and the first time I actually used this in medicine 3030 04:37:54,871 --> 04:38:01,801 was when we were comparing lab values on the same assay from two different labs. I just wanted 3031 04:38:01,801 --> 04:38:06,340 to explain to you, this might be the first time you've heard the word coefficient. And 3032 04:38:06,340 --> 04:38:11,080 that gets a little confusing for people who are new to statistics, because the word coefficient 3033 04:38:11,080 --> 04:38:17,980 is actually just kind of a generic term for certain kinds of numbers. So you'll hear somebody 3034 04:38:17,980 --> 04:38:22,699 say, coefficient of variation. And you'll hear somebody say coefficient 3035 04:38:22,699 --> 04:38:27,340 of something else, or coefficient of something else. And just the word coefficient, most people 3036 04:38:27,340 --> 04:38:33,449 haven't even heard it. It just means a certain kind of number. So if somebody says, oh, 3037 04:38:33,449 --> 04:38:38,509 the coefficient is not good, or it's high, or whatever, you need to ask them, what coefficient 3038 04:38:38,509 --> 04:38:43,710 are you talking about, right. So in other words, coefficient doesn't mean a specific 3039 04:38:43,711 --> 04:38:49,340 thing. It just means a number that comes out of statistics. And so you have to know which 3040 04:38:49,340 --> 04:38:54,250 coefficient they're talking about. So this is the first time maybe you've heard the word 3041 04:38:54,250 --> 04:38:58,750 coefficient. And I'm going to talk for the first time then, to you if you've never heard 3042 04:38:58,750 --> 04:39:04,169 coefficient before, about a specific coefficient called the coefficient of variation. Now, 3043 04:39:04,169 --> 04:39:10,011 as we go through this textbook, there's other coefficients in it. 
So please remember 3044 04:39:10,011 --> 04:39:16,958 this one is coefficient of variation, right? And a way to remember it is CV for short. 3045 04:39:16,958 --> 04:39:22,999 And so other coefficients have different abbreviations, but the coefficient of variation is CV. So 3046 04:39:23,000 --> 04:39:30,099 I put on the right side of the slide the formulas, and nobody seems to have any trouble 3047 04:39:30,099 --> 04:39:34,329 doing the formula, right, because once you calculate the standard deviation, the sample 3048 04:39:34,330 --> 04:39:38,600 standard deviation or the population one, as you can see in the formulas, and once you 3049 04:39:38,600 --> 04:39:44,380 calculate x bar, which is the mean for the sample, it's pretty easy to do the division, and then 3050 04:39:44,380 --> 04:39:49,520 they like it when you do it in percent. And you'll notice that about statistics, certain 3051 04:39:49,520 --> 04:39:55,282 things they prefer as proportions, and certain things they prefer as percents. It's just 3052 04:39:55,282 --> 04:40:01,050 like, I don't know, it's just like our culture in a way, and so coefficient of variation is always 3053 04:40:01,050 --> 04:40:07,130 expressed as a percent. So you have to times that by 100, and then put a percent sign after 3054 04:40:07,130 --> 04:40:11,560 it. But really, that's pretty easy to do. You take the standard deviation, you'll see I 3055 04:40:11,560 --> 04:40:15,970 did it for our patients, 3.74. It took us all that work to get there, right? Remember square 3056 04:40:15,970 --> 04:40:22,370 root of 14. And then remember, our x bar was six. So we needed that, remember, earlier for 3057 04:40:22,370 --> 04:40:28,872 that column two. So I just dumpster dove those numbers, and then did 3058 04:40:28,872 --> 04:40:34,790 this calculation out and I got 62%. And so students generally don't have trouble getting 3059 04:40:34,790 --> 04:40:40,070 that number. 
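The CV calculation described here is a one-liner. As a minimal sketch, the helper name coefficient_of_variation is an illustrative assumption, and the inputs are the patient sample's standard deviation (square root of 14) and mean (6).

```python
import math

def coefficient_of_variation(std_dev, mean):
    """Return the CV as a percent: standard deviation divided by mean, times 100."""
    return std_dev / mean * 100

s = math.sqrt(14)   # sample standard deviation, about 3.74
x_bar = 6           # sample mean

cv = coefficient_of_variation(s, x_bar)
print(f"{cv:.0f}%")   # 62%
```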
But what the problem is, is like, what does the number even mean? Right? Like, 3060 04:40:40,070 --> 04:40:43,952 what does it mean, if you divide the standard deviation by the x bar and times by 100%? 3061 04:40:43,952 --> 04:40:51,270 And like, how do you interpret that percent? So the easiest way to talk about it is to 3062 04:40:51,270 --> 04:40:55,660 actually compare it with something. Because one thing you'll also notice in statistics 3063 04:40:55,660 --> 04:41:02,800 is if you make ratios of things, they don't have any units. So if I take your blood pressure, 3064 04:41:02,800 --> 04:41:09,100 like your systolic blood pressure, and I say it's whatever, 130 mmHg. If I divide that 3065 04:41:09,100 --> 04:41:14,240 by your diastolic blood pressure, or even by some lab value, or your temperature, or 3066 04:41:14,240 --> 04:41:19,770 whatever, your IQ, suddenly I get a ratio, and that doesn't have units, right, it doesn't 3067 04:41:19,770 --> 04:41:24,720 have mmHg, or anything like that. And if I do that to a bunch of people, all of those 3068 04:41:24,720 --> 04:41:26,460 ratios don't have any units. 3069 04:41:26,460 --> 04:41:30,032 And so they technically could be compared to each other. So you'll see that that's a 3070 04:41:30,032 --> 04:41:35,130 strategy in statistics, they'll make ratios of things and say all those don't have any 3071 04:41:35,130 --> 04:41:41,602 units. So it's, you know, sort of lacking in that way. But the power is you can compare 3072 04:41:41,602 --> 04:41:48,790 these ratios. So, I decided to just pull out other patients, I just made up other patients, 3073 04:41:48,790 --> 04:41:53,510 right. I pretended we went back to the lab the next day, and we gathered some data. And 3074 04:41:53,510 --> 04:41:59,940 we came up with, I just made this up, an x bar of eight, and a 3075 04:41:59,940 --> 04:42:05,852 standard deviation of four. It's a little close to what we had before, right? 
Like x 3076 04:42:05,852 --> 04:42:13,220 bar six and standard deviation 3.74. But anyway, in this next sample of patients, the s of four divided 3077 04:42:13,220 --> 04:42:18,842 by the x bar of eight times 100 equals 50%, and not 62%, like the other one did. So how 3078 04:42:18,842 --> 04:42:23,730 do you interpret that? Well, the CV is a measure of the spread of the data relative to the 3079 04:42:23,730 --> 04:42:29,800 average of the data. So in the purple sample, the standard deviation is only 50% of the 3080 04:42:29,800 --> 04:42:35,650 mean. But in the red sample, the standard deviation is 62% 3081 04:42:35,650 --> 04:42:37,122 of the mean. 3082 04:42:37,122 --> 04:42:47,820 So what I would say is that the red sample, the one with the 62%, has more standard 3083 04:42:47,820 --> 04:42:53,820 deviation compared to the mean. And so that means it's less stable, right? It's got more 3084 04:42:53,820 --> 04:42:57,420 variance compared to its mean, and it's got more standard deviation compared to its 3085 04:42:57,420 --> 04:43:04,100 mean. So it's less stable. So it moves around a lot. So if you said to me, if these were 3086 04:43:04,100 --> 04:43:09,750 actually two different labs, I would say, you know, I prefer the purple 3087 04:43:09,750 --> 04:43:16,840 lab, because it's more predictable. I know it's gonna have less variation, because 3088 04:43:16,840 --> 04:43:23,031 it's 50%. And the 62% means that that one's less predictable. It's a little hard to see in 3089 04:43:23,031 --> 04:43:28,522 this example. But what happens is, if you have two different labs, and you're looking 3090 04:43:28,522 --> 04:43:33,150 at this, like maybe you split a blood sample or a bunch of blood samples and send half 3091 04:43:33,150 --> 04:43:37,380 to one lab and half to the other, well, you're supposed to get the same mean and the same 3092 04:43:37,380 --> 04:43:39,460 standard deviation, right? 
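The comparison between the two samples can be sketched as follows. The "red" and "purple" labels follow the slide colors mentioned in the lecture; the helper and variable names are illustrative assumptions.

```python
def coefficient_of_variation(std_dev, mean):
    """Return the CV as a percent: standard deviation divided by mean, times 100."""
    return std_dev / mean * 100

cv_red = coefficient_of_variation(3.74, 6)   # original patient sample
cv_purple = coefficient_of_variation(4, 8)   # made-up second sample

# The smaller CV has less spread relative to its own mean, so it is more predictable.
more_predictable = "purple" if cv_purple < cv_red else "red"

print(round(cv_red), round(cv_purple), more_predictable)   # 62 50 purple
```

Because the CV has no units, the same comparison works even when the two samples have different means.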
They're the same blood, 3093 04:43:39,460 --> 04:43:40,950 you just split it. 3094 04:43:40,950 --> 04:43:45,410 But sometimes you don't, sometimes you get something like this, in which case, if you're 3095 04:43:45,410 --> 04:43:49,880 comparing labs, you would go with the purple lab and not the red lab, because they produce 3096 04:43:49,880 --> 04:43:57,150 a more predictable result. So CV is a little hard to interpret. But it's easy to calculate. 3097 04:43:57,150 --> 04:44:06,270 So that's one awesome thing about it. Now, we're gonna move on to Chebyshev and his theorem. 3098 04:44:06,270 --> 04:44:12,260 So Chebyshev figured something out a long time ago. And this is how he started thinking 3099 04:44:12,260 --> 04:44:16,310 about it. He first started thinking, well, let's say you have an x bar and an s, like 3100 04:44:16,310 --> 04:44:20,900 we just did with the CV. He noticed something else about it, he didn't notice the CV, he 3101 04:44:20,900 --> 04:44:26,740 noticed that you can create a lower and upper limit by subtracting the s from and adding the 3102 04:44:26,740 --> 04:44:33,570 s to the x bar. So remember back when we were making frequency tables, and I said, well, 3103 04:44:33,570 --> 04:44:39,200 we need to make class limits, we need to make a lower class limit and an upper class limit. 3104 04:44:39,200 --> 04:44:43,602 Well, we use that terminology a lot, like lower limits and upper limits. Well, 3105 04:44:43,602 --> 04:44:49,770 Chebyshev was like, wait a second, I got an idea. Let's say I take a mean. And, you know, this 3106 04:44:49,770 --> 04:44:54,100 will force the mean to be in the middle of this. I can subtract one standard deviation 3107 04:44:54,100 --> 04:44:59,220 from it, and I'll get some sort of lower limit, and I'll add a standard deviation to that 3108 04:44:59,220 --> 04:45:02,760 mean and get some sort of upper limit. 
And of course, let's pretend the standard deviation 3109 04:45:02,760 --> 04:45:07,340 was one, like you'd subtract one and add one. And so this would be like totally symmetrically 3110 04:45:07,340 --> 04:45:11,430 in the middle, right, the x bar would be in the middle, and then it'd be surrounded equally 3111 04:45:11,430 --> 04:45:16,060 by these two standard deviations. And I'm just saying standard deviation generically, 3112 04:45:16,060 --> 04:45:19,390 because you could do this with a mu and the population standard deviation too, you can 3113 04:45:19,390 --> 04:45:25,280 do the population version. So he just sort of, like, figured out that's a thing that can 3114 04:45:25,280 --> 04:45:30,180 happen, you can add and subtract a standard deviation from the mean. And you can get these 3115 04:45:30,180 --> 04:45:34,930 limits. And so, for example, let's say I have a mu. So I'm gonna pretend I have a population, 3116 04:45:34,930 --> 04:45:39,693 a mu of 100. I don't know what I measured, but I got 100 and a population standard deviation 3117 04:45:39,693 --> 04:45:44,911 of five. So Chebyshev was thinking, you know what I could do, I could take that 100 and 3118 04:45:44,911 --> 04:45:51,650 subtract that five from it, and I get 95. I could take that 100 and add five to it, 3119 04:45:51,650 --> 04:45:57,022 I get 105. And so he just started, like, working with this concept, like I could subtract and 3120 04:45:57,022 --> 04:46:01,690 add, like, a standard deviation. And then he thought, wait a second, I could even do this 3121 04:46:01,690 --> 04:46:06,400 with two standard deviations, right? So I could take, like, if it was five, I could take 3122 04:46:06,400 --> 04:46:11,440 that times two, that's 10. And so I could do 100, subtract 10, and I get 90 for the 3123 04:46:11,440 --> 04:46:17,930 lower limit, and 100 and add 10, and I get 110 for the upper limit. And so I can make 3124 04:46:17,930 --> 04:46:22,442 this range or this interval, right? 
From the lower limit to the upper limit, we call 3125 04:46:22,442 --> 04:46:29,120 it an interval, right. And so he just sort of conceptually realized that if he used some 3126 04:46:29,120 --> 04:46:34,590 rules along with this, there might be some useful interpretation of these limits, right, 3127 04:46:34,590 --> 04:46:39,660 there might be some way to use these limits to mean something. So we're going to look at 3128 04:46:39,660 --> 04:46:45,310 how he figured out how to use, you know, one standard deviation on either side 3129 04:46:45,310 --> 04:46:51,320 of the mean, or two, or three, or four multiples of these standard deviations on either side 3130 04:46:51,320 --> 04:46:58,860 of the mean, to actually come up with some lower and upper limits that meant something. 3131 04:46:58,860 --> 04:47:05,600 So he realized that what these lower and upper limits would mean is that at least some 3132 04:47:05,600 --> 04:47:10,820 percent of the data would be between these limits. So in other words, some percent of 3133 04:47:10,820 --> 04:47:16,680 the x's would be between the lower and the upper limit. But that percent would 3134 04:47:16,680 --> 04:47:22,730 depend on how many standard deviations you're going out, right? Like is it one, is it two, 3135 04:47:22,730 --> 04:47:29,100 is it three. The more you go out, obviously, the more percent of your data are covered 3136 04:47:29,100 --> 04:47:33,340 by the limits, because they're just huge, like, get it? The interval's so big, it 3137 04:47:33,340 --> 04:47:37,830 almost covers the whole thing. So you would expect that percentage to go up as the number 3138 04:47:37,830 --> 04:47:43,590 of standard deviations you use goes up. So he was working this out, and he came 3139 04:47:43,590 --> 04:47:48,710 up with this formula, right. And he was also figuring out, he wanted this to work 3140 04:47:48,710 --> 04:47:55,180 for all distributions, like normal, but also skewed. 
And also like uniform and bimodal. 3141 04:47:55,180 --> 04:48:00,862 So this was the formula he came up with. Now, in this formula, see at the bottom, k stands 3142 04:48:00,862 --> 04:48:05,200 for the number of standard deviations, or the number of population standard deviations 3143 04:48:05,200 --> 04:48:12,640 that he's going to use, right? So let's pretend that he made k be two, like two standard deviations, 3144 04:48:12,640 --> 04:48:18,820 right? Then you'd see this, it says one minus one divided by k to the second, which would 3145 04:48:18,820 --> 04:48:25,280 be two to the second, so two to the second is what, four. So one divided by four 3146 04:48:25,280 --> 04:48:31,130 is 0.25. And so one minus 0.25 is 0.75, well, 3147 04:48:31,130 --> 04:48:34,420 you make that a percent, that's 75%. So 3148 04:48:34,420 --> 04:48:38,900 he's like, okay, that's what I'm going to say. If you go out two standard deviations 3149 04:48:38,900 --> 04:48:46,250 up or down, and you make those upper and lower limits, at least 75% of the data, of the x's, 3150 04:48:46,250 --> 04:48:53,120 are going to be there. At least, there might be more, but it'll be at least that. So he 3151 04:48:53,120 --> 04:48:57,850 did this, he used two, and he used three, and he used four. 3152 04:48:57,850 --> 04:48:58,850 So 3153 04:48:58,850 --> 04:49:03,440 two standard deviations either way, three standard deviations either way, or four standard 3154 04:49:03,440 --> 04:49:07,442 deviations either way. Now, students in my class often think that they have to memorize 3155 04:49:07,442 --> 04:49:13,550 this one minus one over k to the second. You don't memorize it. This was just a story of how 3156 04:49:13,550 --> 04:49:19,420 Chebyshev did this proof. So you can memorize it for fun, but nobody memorizes it. I mean, 3157 04:49:19,420 --> 04:49:24,510 you know, Chebyshev did the work. I'm just showing you the proof, right? 
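The formula, one minus one over k to the second, can be tried out for k = 2, 3 and 4 with a short sketch. The function names are made up for illustration, and refusing k of 1 or less is a choice made here (the formula gives nothing useful there), not something stated in the lecture. The limits use the made-up population from earlier, with a mean of 100 and a standard deviation of 5.

```python
def chebyshev_min_fraction(k):
    """Minimum fraction of the data within k standard deviations of the mean."""
    if k <= 1:
        # Assumption of this sketch: for k <= 1 the bound is 0 or less, so reject it.
        raise ValueError("k must be greater than 1")
    return 1 - 1 / k**2

def limits(mean, std_dev, k):
    """Lower and upper limits: the mean minus and plus k standard deviations."""
    return mean - k * std_dev, mean + k * std_dev

for k in (2, 3, 4):
    print(k, f"{chebyshev_min_fraction(k):.1%}")
# 2 75.0%
# 3 88.9%
# 4 93.8%

print(limits(100, 5, 1))   # (95, 105)
print(limits(100, 5, 2))   # (90, 110)
```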
So he figured 3158 04:49:24,510 --> 04:49:30,020 this all out. So as you can see, you can do this with two, three and four, 3159 04:49:30,020 --> 04:49:33,150 you'll get the same answers Chebyshev does. So it's kind of a waste of time, but you can 3160 04:49:33,150 --> 04:49:37,862 do it just for fun. So he did the two one, I showed you that on the top. I even talked 3161 04:49:37,862 --> 04:49:43,890 you through it. So if you plug two into the equation, you'll get 75%. So in that thing 3162 04:49:43,890 --> 04:49:50,410 I was just talking about, like imagine I had 100, right? And that was my x bar, and my standard 3163 04:49:50,410 --> 04:49:57,190 deviation was five, right? And then two times that is 10. So I go, well, my lower limit then 3164 04:49:57,190 --> 04:50:02,780 would be 90 and my upper limit would be 110. And I would be able to confidently 3165 04:50:02,780 --> 04:50:10,930 say at least 75% of my x's are between 90 and 110. So if I'd measured maybe 100 people, 3166 04:50:10,930 --> 04:50:15,870 right, I'd say at least 75 of them are going to be between these limits. In fact, it could 3167 04:50:15,870 --> 04:50:21,070 be 80, could be more, but at least 75. So then remember, I told you to predict that 3168 04:50:21,070 --> 04:50:25,150 as we made this number bigger, you know, we go out more standard deviations, we're going 3169 04:50:25,150 --> 04:50:30,910 to cover more of the data, right? So when we did three, it didn't come out as even, it came 3170 04:50:30,910 --> 04:50:37,240 out to 88.9% of the data. So almost 89% will be covered if you go out three, at least 3171 04:50:37,240 --> 04:50:44,230 88.9%. And if you go out four standard deviations, it's at least 93.8%. Right? And 3172 04:50:44,230 --> 04:50:48,880 just to remind you, you know, when you have upper and lower limits, you have an interval, 3173 04:50:48,880 --> 04:50:53,020 right? We just call it that. 
But this particular interval, if you get it this 3174 04:50:53,020 --> 04:50:57,900 way, it's Chebyshev's interval, because everybody's so happy he did all this work, right? 3175 04:50:57,900 --> 04:51:04,420 Because I wouldn't have figured it out. So I just wanted to demonstrate an example of 3176 04:51:04,420 --> 04:51:09,520 Chebyshev's interval, because then you can know how to interpret them or why anybody 3177 04:51:09,520 --> 04:51:16,070 does them. Okay, so remember our patient sample, they're in the waiting room at the lab, right? 3178 04:51:16,070 --> 04:51:19,600 So they waited, on average, six minutes, and then the standard deviation of them waiting 3179 04:51:19,600 --> 04:51:25,282 was 3.74. Right? Now, when I gave you this demonstration of how to calculate the standard 3180 04:51:25,282 --> 04:51:31,561 deviation, I used this patient sample, and I only had a few patients in the sample 3181 04:51:31,561 --> 04:51:35,750 on purpose, because otherwise the table that we made with the defining formula would be 3182 04:51:35,750 --> 04:51:41,190 huge, and I'd never finish this video. So what I'm gonna ask you to do is pretend that 3183 04:51:41,190 --> 04:51:47,420 instead, we had 100 patients in there, right? Instead, I measured 100, and I got my x bar, 3184 04:51:47,420 --> 04:51:53,070 my 3.74 standard deviation, okay? So if we measured 100 patients, and we got that, I 3185 04:51:53,070 --> 04:52:00,160 just put these Chebyshev rules in that table. So if we go out two standard 3186 04:52:00,160 --> 04:52:05,590 deviations from the mean, from the x bar, on either side, whatever limits we get, whatever 3187 04:52:05,590 --> 04:52:12,100 interval we get, we know, because I made it so we, you know, studied 100 3188 04:52:12,100 --> 04:52:18,710 patients, that by law, at least 75 of those patients will be between those lower 3189 04:52:18,710 --> 04:52:24,610 and upper limits, if we follow Chebyshev's theorem. 
And if I do go out three standard deviations, 3190 04:52:24,610 --> 04:52:30,490 at least 88.9 patients will be in there. Okay, I know that doesn't make any sense, like 88.9 3191 04:52:30,490 --> 04:52:34,490 patients, so there's 0.9 of a patient. But what they're saying is, I guess it would be 3192 04:52:34,490 --> 04:52:41,780 89. All right, yeah, 89% of the patients, or in other words, 89 patients at least would 3193 04:52:41,780 --> 04:52:47,970 be in that interval. And of course, if I went out four, I wouldn't have to say 93.8 3194 04:52:47,970 --> 04:52:52,840 patients, but at least 94 patients would fit in that interval. And if you're thinking 3195 04:52:52,840 --> 04:52:56,920 about it, if we only start with 100 patients, that's almost all of them. So the four one 3196 04:52:56,920 --> 04:53:01,920 isn't so useful, right? So you'll see me on the left side of the slide calculating the 3197 04:53:01,920 --> 04:53:08,290 intervals, right? So let's start with the first one. The first one is two standard deviations 3198 04:53:08,290 --> 04:53:14,940 on either side of the mean. So the Chebyshev interval we get is negative 1.48 to 3199 04:53:14,940 --> 04:53:20,520 13.48. And you probably notice you can't wait negative time. So already, this is kind of 3200 04:53:20,520 --> 04:53:27,282 weird, right? But what this is saying is, of our 100 patients, at least 75 of them, because 3201 04:53:27,282 --> 04:53:33,772 this is the 75% Chebyshev interval, waited between negative 1.48 minutes, so you might 3202 04:53:33,772 --> 04:53:41,870 as well round it to zero, between zero minutes and 13.48 minutes, right. And so at 3203 04:53:41,870 --> 04:53:52,373 least 75% of them fell in that range. Now 13.48 minutes is kind of long. So we would 3204 04:53:52,373 --> 04:53:58,000 be happy, I guess, if 75% of them fell in that range, because then that means 3205 04:53:58,000 --> 04:54:05,430 that they were probably not waiting that long. 
But if you go out, then you widen this interval 3206 04:54:05,430 --> 04:54:13,890 to the 88.9% one. If you do that, then you say at least, we'll round it to 89, 89% of the patients 3207 04:54:13,890 --> 04:54:18,120 waited between negative 5.22 minutes, which you might as well make zero, 3208 04:54:18,120 --> 04:54:24,830 and 17.22 minutes. So as you see, if we widen the interval, we're going to get some later 3209 04:54:24,830 --> 04:54:32,260 waiters in there. And so then we'll say, well, at least 3210 04:54:32,260 --> 04:54:38,250 89% were between there, and that means the interval was bigger, right. And then again, we go 3211 04:54:38,250 --> 04:54:43,970 out one more, we get 93.8%. So let's just round it to 94. So at least 94% of the patients, 3212 04:54:43,970 --> 04:54:50,400 or if we have 100 patients, at least 94 of them, waited between negative 8.96 minutes, 3213 04:54:50,400 --> 04:54:58,160 which again is nonsensical, up to 20.96. But then we're starting to get where we'll have 3214 04:54:58,160 --> 04:55:03,080 almost all the patients somewhere between zero and 20 minutes, and we really don't know 3215 04:55:03,080 --> 04:55:07,950 how long they waited. So this is just kind of to show you what happens when you widen 3216 04:55:07,950 --> 04:55:13,520 that interval, you maybe have less certainty about what happened to individuals, but sort of 3217 04:55:13,520 --> 04:55:22,150 a better idea of what the range is. So again, I just put this at the bottom. If we had 100 3218 04:55:22,150 --> 04:55:26,830 patients, this is how you would interpret it: at least 75 would have waited 3219 04:55:26,830 --> 04:55:33,360 between the lower and upper limit for the 75% Chebyshev interval. And then at least 3220 04:55:33,360 --> 04:55:39,500 88.9 patients, I know, nonsensical. And then the 93.8. So you see that interpretation in the lower 3221 04:55:39,500 --> 04:55:49,320 part of the slide. 
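The three intervals for the waiting-room example can be reproduced with a short sketch. It assumes the pretend sample of 100 patients with a mean of 6 and a standard deviation of 3.74; the function name, the rounding up to whole patients, and the clamping of negative lower limits to zero are choices made here for readability.

```python
import math

def chebyshev_interval(mean, std_dev, k):
    """Return (lower, upper, min_fraction) for k standard deviations around the mean."""
    return mean - k * std_dev, mean + k * std_dev, 1 - 1 / k**2

x_bar, s, n_patients = 6, 3.74, 100

for k in (2, 3, 4):
    lower, upper, min_fraction = chebyshev_interval(x_bar, s, k)
    # A fraction of a patient makes no sense, so round up to whole patients.
    min_patients = math.ceil(min_fraction * n_patients)
    # Waiting time cannot be negative, so read a negative lower limit as zero.
    print(f"k={k}: {lower:.2f} to {upper:.2f} minutes "
          f"(read as {max(lower, 0):.2f} to {upper:.2f}), "
          f"at least {min_patients} of {n_patients} patients")
```

For k = 2, 3 and 4 this gives the limits from the slide: -1.48 to 13.48, -5.22 to 17.22, and -8.96 to 20.96 minutes.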
So this is a really difficult concept for a lot of students. And so I'll 3222 04:55:49,320 --> 04:55:55,830 just give you this take home message. First of all, Chebyshev's interval works for any 3223 04:55:55,830 --> 04:55:59,770 distribution, normal, skewed, whatever. The reason why that's part of the take home message is, 3224 04:55:59,770 --> 04:56:04,610 later, we're going to learn about intervals that only work with normal distributions. 3225 04:56:04,610 --> 04:56:09,390 Okay? So this one is loosey goosey. It works with all distributions. So that's one of the 3226 04:56:09,390 --> 04:56:15,030 take home messages for Chebyshev's interval. Also, Chebyshev's intervals tell you that at 3227 04:56:15,030 --> 04:56:20,460 least a certain percent of the data are in the interval. Later, we're going to learn 3228 04:56:20,460 --> 04:56:25,282 about intervals where exactly a certain amount of data are in that interval. And so 3229 04:56:25,282 --> 04:56:31,690 Chebyshev again, a little loosey goosey, right, he says at least. Next, Chebyshev intervals 3230 04:56:31,690 --> 04:56:36,640 are sometimes nonsensical, as we just talked about. Negative time doesn't work, right. 3231 04:56:36,640 --> 04:56:42,940 Sometimes you'll have very high limits, especially with a four. And so ultimately, they're not 3232 04:56:42,940 --> 04:56:47,520 very useful. And they're not used in health care. I literally had never heard of 3233 04:56:47,520 --> 04:56:52,580 Chebyshev's interval until I started teaching this class. So what is the purpose of teaching 3234 04:56:52,580 --> 04:56:57,820 you Chebyshev's interval? The purpose of teaching this is to point out that in statistics, we often 3235 04:56:57,820 --> 04:57:03,040 use the s or the population standard deviation, you know, just standard deviation. 
And we 3236 04:57:03,040 --> 04:57:08,510 add or subtract, well, add and subtract it from the mean, as a good way of making lower 3237 04:57:08,510 --> 04:57:13,290 and upper limits that have special significance. That's really the main take-home message: 3238 04:57:13,290 --> 04:57:19,200 you'll see this pattern as we go through this class, where we get a mean, either the population's 3239 04:57:19,200 --> 04:57:26,870 or the sample's, and we have x-bar, you know, x-bar or the population mean. And then we have a 3240 04:57:26,870 --> 04:57:31,380 standard deviation, right, either from a sample or a population. And then we take either one 3241 04:57:31,380 --> 04:57:36,342 standard deviation and add and subtract it, or two, or multiples. And those intervals then 3242 04:57:36,342 --> 04:57:40,970 have certain significance. I only taught you in this one about Chebyshev, but you'll learn 3243 04:57:40,970 --> 04:57:48,480 about other intervals later that are made similarly. So in conclusion, what did we learn? 3244 04:57:48,480 --> 04:57:51,920 We learned how to calculate the range, we learned how to calculate the variance and 3245 04:57:51,920 --> 04:57:56,020 standard deviation. We learned how to calculate the coefficient of variation and how 3246 04:57:56,020 --> 04:58:02,660 to interpret it. And we talked about the difference in the formulas for sample versus population. 3247 04:58:02,660 --> 04:58:06,900 And we learned about Chebyshev and his theorem, how he figured it out, and how we calculate 3248 04:58:06,900 --> 04:58:10,770 these intervals and how you interpret them. Now I just thought I'd show you this picture 3249 04:58:10,770 --> 04:58:16,350 of Chebyshev here. He's a Russian guy. Well, the stamp was from the USSR, before the Iron 3250 04:58:16,350 --> 04:58:22,202 Curtain fell. But I just thought I'd show it to you, so you knew who figured all this 3251 04:58:22,202 --> 04:58:27,250 out. Good job, you've made it through the measures of variation. 
And now you're ready 3252 04:58:27,250 --> 04:58:32,138 to do, what, the quiz, the homework, whatever, right? You're totally knowledgeable. 3253 04:58:32,138 --> 04:58:33,460 Good job. 3254 04:58:33,460 --> 04:58:40,990 Well, I'm back. And so are you. Welcome to Chapter 3.3, percentiles and box and whisker 3255 04:58:40,990 --> 04:58:46,270 plots. It's Monica Wahi, Labouré College lecturer. And this is what we're going to talk about. 3256 04:58:46,270 --> 04:58:49,610 And this is what you're going to learn. At the end of this lecture, the students should 3257 04:58:49,610 --> 04:58:55,740 be able to explain what a percentile means, describe what the interquartile range is 3258 04:58:55,740 --> 04:59:01,020 and how to calculate it, explain the steps to making a box and whisker plot, and also 3259 04:59:01,020 --> 04:59:08,110 state how a box and whisker plot helps a person evaluate the distribution of the data. So 3260 04:59:08,110 --> 04:59:12,870 let's get started. You know, whenever we talk about a box and whisker plot, I think of some 3261 04:59:12,870 --> 04:59:15,410 cute little animal with all those whiskers. 3262 04:59:15,410 --> 04:59:19,401 I'll explain what the whiskers really are, I mean, not on the animal, but on the box 3263 04:59:19,401 --> 04:59:23,750 and whisker plot, later. So what are we going to go over? We're going to go over percentiles, 3264 04:59:23,750 --> 04:59:28,670 and we're going to explain what those are. Then we're going to talk about quartiles, which 3265 04:59:28,670 --> 04:59:32,770 sounds a little similar, it's got the "tiles", and you'll understand why they're 3266 04:59:32,770 --> 04:59:36,880 similar. Then we're going to compute quartiles. And then finally, we're going to do 3267 04:59:36,880 --> 04:59:42,700 the box and whisker plot. All right. So let's go. So percentiles: we're going to have a 3268 04:59:42,700 --> 04:59:45,990 flashback, okay. 
You're not going to like this little part because it's going to remind 3269 04:59:45,990 --> 04:59:50,670 you of standardized tests. So maybe not all of you have been subjected to this, but most 3270 04:59:50,670 --> 04:59:55,460 of us have. If you've gone to high school in the US, you probably got to deal with these 3271 04:59:55,460 --> 05:00:00,660 standardized tests. So just remember, we're only talking about quantitative data. All 3272 05:00:00,660 --> 05:00:05,050 right. So if you take a standardized test or a non-standardized test, you usually get 3273 05:00:05,050 --> 05:00:08,730 points. And points are numerical. So that's quantitative 3274 05:00:08,730 --> 05:00:09,730 data. 3275 05:00:09,730 --> 05:00:15,200 So I remember I used to take the standardized tests, and I'd be, you know, showing my friends 3276 05:00:15,200 --> 05:00:19,610 what I got, right, because they'd send you that thing in the mail. Now, I learned pretty 3277 05:00:19,610 --> 05:00:24,790 early on that it mattered who all was in the pool of people taking the test with 3278 05:00:24,790 --> 05:00:29,990 you, right. So if you're taking the test with a lot of stupid people, it's easier to get 3279 05:00:29,990 --> 05:00:35,602 a higher percentile, because what percentile means is, for example, if you test at the 3280 05:00:35,602 --> 05:00:43,520 77th percentile, it means you did better than 77% of people taking the test. And a lot of 3281 05:00:43,520 --> 05:00:48,190 those standardized tests, they didn't care how many points you got; what they cared about 3282 05:00:48,190 --> 05:00:54,430 is what percentile you were at. So different batches of people would have different scores. 3283 05:00:54,430 --> 05:00:59,210 And if you got lucky and got a lot of stupid people, then your score would be higher 3284 05:00:59,210 --> 05:01:04,000 than theirs. So it didn't really matter what your absolute score was, it just mattered 3285 05:01:04,000 --> 05:01:08,410 what your percentile was. 
So just to sort of remind you, if somebody had come up to 3286 05:01:08,410 --> 05:01:14,730 me in high school and said, I got 77th percentile, what I'd say is, okay, if only 100 people had 3287 05:01:14,730 --> 05:01:19,430 taken the test, you'd have done better than 77 of them. Of course, we were 3288 05:01:19,430 --> 05:01:24,770 all Brady, Brady, you know, I was always in like the 95th, or the 97th, or the 98th. And 3289 05:01:24,770 --> 05:01:29,770 it happened so often, I wondered if it was really true. But what I realized is that 3290 05:01:29,770 --> 05:01:33,830 there were so many people in the pool, because you know, I was in public high school in Minnesota, 3291 05:01:33,830 --> 05:01:38,291 well, they were pooling together all the public high schools in Minnesota, ninth grade, you 3292 05:01:38,291 --> 05:01:41,870 know, was pooled with them in 10th grade or whatever. And when you're taking like nursing 3293 05:01:41,870 --> 05:01:46,600 examinations, sometimes they'll do that, they'll put you on a percentile. So I try to tell 3294 05:01:46,600 --> 05:01:49,872 people, you know, strategize, try to take it when only stupid people are taking it, which 3295 05:01:49,872 --> 05:01:53,661 of course makes no sense. How can you tell when stupid people are taking it, right? You 3296 05:01:53,661 --> 05:01:59,130 don't even know who's taking it. But really, that's what a percentile is: it's the 3297 05:01:59,130 --> 05:02:05,210 percentage of people that you did better than. If you're at the 77th percentile, then you 3298 05:02:05,210 --> 05:02:13,640 did better than 77%. Okay, so here's just some rules about percentiles. First of all, 3299 05:02:13,640 --> 05:02:19,640 you know, I gave the example of the 77th percentile. Well, the rule is you have to have one between 3300 05:02:19,640 --> 05:02:24,950 one and 99. Like, you can't have the negative second percentile, or the 105th percentile. 
3301 05:02:24,950 --> 05:02:29,943 So that's the first. Then whatever number you pick, like I was saying, that percent 3302 05:02:29,943 --> 05:02:36,140 of the values would fall below that number, and 100 minus that number of the values 3303 05:02:36,140 --> 05:02:42,353 fall above that number. So like, in my, well, here, we'll give an example. 20 people take 3304 05:02:42,353 --> 05:02:49,880 a test, just 20, right. Let's say there's a maximum score of five on the test. The 25th 3305 05:02:49,880 --> 05:02:56,110 percentile means that 25% of the scores will fall below whatever score that is, and 75% 3306 05:02:56,110 --> 05:03:00,880 will fall above that score. So let's say it's an easy test. And let's say out of my 20 3307 05:03:00,880 --> 05:03:05,510 people, 12 get a four, which is almost the total, right, and the remaining eight get 3308 05:03:05,510 --> 05:03:12,458 a five, so everybody gets either a four or a five. Well, then, you know, the 25th percentile, 3309 05:03:12,458 --> 05:03:18,690 or the score that cuts off the bottom five tests, right, will be a four, just because 3310 05:03:18,690 --> 05:03:22,560 this was an easy test. And, you know, the first 12 people got a four and then the 3311 05:03:22,560 --> 05:03:27,950 rest, eight, got a five. So even the 50th percentile, then, would technically be at a four, right? 3312 05:03:27,950 --> 05:03:32,860 Now, this would all come out differently if it were a hard test, and most people got a 3313 05:03:32,860 --> 05:03:39,440 score below three, right? And so the percentiles would be shifted down. I just tell you that 3314 05:03:39,440 --> 05:03:44,860 so you can keep in mind the difference between the actual score and the percentile. So the 3315 05:03:44,860 --> 05:03:50,560 percentile just happens to mean that this percent of people got a score lower than 3316 05:03:50,560 --> 05:03:56,112 whatever your score is. It doesn't actually say what your score was, right? 
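The easy-test example can be checked with a short Python sketch. The lecture does not name a formula for picking the score at a given percentile, so the nearest-rank rule used below is an assumption, and both function names are hypothetical.

```python
import math


def percent_below(scores, score):
    """Percentage of scores strictly lower than the given score."""
    return 100 * sum(s < score for s in scores) / len(scores)


def nearest_rank_percentile(scores, p):
    """Score at the p-th percentile using the nearest-rank rule (assumed convention)."""
    ordered = sorted(scores)
    rank = math.ceil(p / 100 * len(ordered))  # 1-based position in the sorted list
    return ordered[rank - 1]


# The easy test from the lecture: 12 people score a four, 8 people score a five.
scores = [4] * 12 + [5] * 8

print(nearest_rank_percentile(scores, 25))  # 4
print(nearest_rank_percentile(scores, 50))  # 4
print(percent_below(scores, 5))             # 60.0
```

Both the 25th and the 50th percentile land on a score of four, which is the point of the example: the percentile describes position in the group, not the score itself.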
So that's 3317 05:03:56,112 --> 05:04:02,140 what you just want to remember as we're going through percentiles. Okay, now we're going to talk 3318 05:04:02,140 --> 05:04:07,060 about quartiles, and also the interquartile range. Remember the "tile" thing, so this relates 3319 05:04:07,060 --> 05:04:12,300 to percentiles. So I put a little quarter up there. So quartiles are a specific set 3320 05:04:12,300 --> 05:04:16,780 of percentiles. And you'll see why I put the little quarter up there. It's because there's 3321 05:04:16,780 --> 05:04:22,120 technically four quartiles, it's just that the top quartile doesn't count because it's 3322 05:04:22,120 --> 05:04:27,230 like the 100% one. And remember, it can only go up to 99, like I was just showing you. 3323 05:04:27,230 --> 05:04:32,710 So we calculate the first, second and third quartile. So we have the 25th percentile, which is 3324 05:04:32,710 --> 05:04:38,280 the first quartile; the 50th percentile, which is also known as the median, which you're 3325 05:04:38,280 --> 05:04:43,550 already good at, right? That's known as the second quartile. And then the third quartile 3326 05:04:43,550 --> 05:04:51,240 is the 75th percentile. So those are your quartiles: 25th, 50th and 75th. And technically 3327 05:04:51,240 --> 05:04:55,610 a 100th. But we never say that, right? Because it only goes up to 99. So you have the first 3328 05:04:55,610 --> 05:05:00,800 quartile at the 25th percentile, the second quartile at the 50th percentile, and the third 3329 05:05:00,800 --> 05:05:05,610 quartile at the 75th percentile. And these are actually not that hard to calculate by 3330 05:05:05,610 --> 05:05:07,630 hand. 3331 05:05:07,630 --> 05:05:13,792 So here's how you do it, sort of an overview. So first you order the data from smallest 3332 05:05:13,792 --> 05:05:18,080 to largest, because remember, we have quantitative data, so you can sort them, so you sort them 3333 05:05:18,080 --> 05:05:22,540 smallest to largest. 
And this is feeling very median-y, right? Well, guess what, 3334 05:05:22,540 --> 05:05:27,450 step two is you find the median, because the median is also the second quartile, which 3335 05:05:27,450 --> 05:05:32,810 is also the 50th percentile. So already, you know how to do this, right? Because you 3336 05:05:32,810 --> 05:05:38,330 could already do steps one and two. Now, this is the harder part, this is the new part. 3337 05:05:38,330 --> 05:05:44,050 Step three is where you find the median of the lower half of the data. Right. And so 3338 05:05:44,050 --> 05:05:49,370 wherever you put your median, you pretend that's the end, and you look at the smaller 3339 05:05:49,370 --> 05:05:54,510 values, and you find the median of those. And that would be the first quartile, or the 3340 05:05:54,510 --> 05:06:00,140 25th percentile. Then finally, step four, which you probably guessed, is you find where 3341 05:06:00,140 --> 05:06:03,570 your median was. And then you look at the upper half of the data, between the median 3342 05:06:03,570 --> 05:06:08,180 and the maximum, and you make a median out of that part of the data, and then that's 3343 05:06:08,180 --> 05:06:13,890 your 75th percentile. Okay, and I'll show you an example of us doing that. But this 3344 05:06:13,890 --> 05:06:20,442 is an overview of the steps. Now, remember from before what the range was? Yeah, you 3345 05:06:20,442 --> 05:06:24,793 remember it, that's where we had the maximum minus the minimum, right? And I told you, 3346 05:06:24,793 --> 05:06:29,262 you have to actually do out the equation and tell me what number you get. And that's the 3347 05:06:29,262 --> 05:06:35,570 range. Well, we have something new and improved. In this lecture, here, we have the interquartile 3348 05:06:35,570 --> 05:06:40,202 range. Okay, so you already know about quartiles, we were just talking about them. But 3349 05:06:40,202 --> 05:06:46,650 "inter" sort of means, like, within, right. 
So once you have the third quartile, and you 3350 05:06:46,650 --> 05:06:52,220 have the first quartile, you can calculate the interquartile range, or IQR for short. 3351 05:06:52,220 --> 05:06:56,190 So if you see IQR on here, just remember, that's interquartile range. So that's the 3352 05:06:56,190 --> 05:07:01,050 third quartile minus the first quartile. And again, I'll show you an example. This 3353 05:07:01,050 --> 05:07:07,720 is just an overview. Okay, here's the example I promised. On the right side of the slide, 3354 05:07:07,720 --> 05:07:13,880 you will see a sample of data I collected. I went to ahd.com, that's American Hospital 3355 05:07:13,880 --> 05:07:20,600 Directory dot com, and that provides publicly available information about American hospitals. 3356 05:07:20,600 --> 05:07:26,862 So I went in, and I took a random sample of 11 Massachusetts hospitals. There's a lot 3357 05:07:26,862 --> 05:07:31,920 more, so I took a random sample. And what I did was I wrote down how many beds each 3358 05:07:31,920 --> 05:07:36,952 of those hospitals had. Because if a hospital has several hundred beds, it's considered kind 3359 05:07:36,952 --> 05:07:42,250 of a big hospital. And if it has fewer than 100 beds, it's considered a smaller hospital. 3360 05:07:42,250 --> 05:07:48,130 So I wrote all those numbers down. And then I already did step one of making our quartiles, 3361 05:07:48,130 --> 05:07:51,910 which is to order the data from smallest to largest. So you'll see on the right side of 3362 05:07:51,910 --> 05:07:59,841 the slide, my smallest hospital had only 41 beds, and my largest hospital had 364 beds, 3363 05:07:59,841 --> 05:08:04,702 and see, I put all of them in order, there on the right. And so we already did step one. 3364 05:08:04,702 --> 05:08:12,282 So let's go on to step two. So step two is to find the median, and that's quartile 3365 05:08:12,282 --> 05:08:18,522 two, or the 50th percentile. 
Now, you're already good at that, right. And so we have 11 hospitals. 3366 05:08:18,522 --> 05:08:24,550 So we know that the sixth one in the row is going to be the median, you know, because 3367 05:08:24,550 --> 05:08:29,380 it's an odd number of hospitals that I drew. And so the sixth one will circle it, that's 3368 05:08:29,380 --> 05:08:34,542 the 50th percentile or the median, so we already got quartile two, it's, it's funny that you 3369 05:08:34,542 --> 05:08:36,090 have to start with quartile two, but that's 3370 05:08:36,090 --> 05:08:37,990 what you have to do. 3371 05:08:37,990 --> 05:08:43,770 Now, I just re color coded these. So you could kind of remember what's going on as we do 3372 05:08:43,770 --> 05:08:50,410 the other steps. 126 is the median. That's kind of not on anybody's side, it's not on 3373 05:08:50,410 --> 05:08:55,750 the lowest side, and it's not on the highest side. The orange ones then are considered 3374 05:08:55,750 --> 05:09:01,100 below the median. And the blue ones are considered above the median. And so I just color coded 3375 05:09:01,100 --> 05:09:06,950 them so you can keep track of what's going on in the next slides. Okay, now we're going 3376 05:09:06,950 --> 05:09:12,550 to do the 25th percentile for step three. So the goal is to find the median of the lower 3377 05:09:12,550 --> 05:09:16,840 half of the data. So now you see why I color coded it is because now we're pretending just 3378 05:09:16,840 --> 05:09:21,810 the orange ones exist. And we are just finding the median of that. And we're not counting 3379 05:09:21,810 --> 05:09:29,112 that 126, because that's already been used. And so now we find that 90 is the 25th percentile, 3380 05:09:29,112 --> 05:09:32,770 how you remember that it's not the 75th, it's not the third one is because it's the low 3381 05:09:32,770 --> 05:09:37,350 one, like 25 is a low number. And 75 is a higher number. 
So you go to the lower part 3382 05:09:37,350 --> 05:09:40,880 of the data, you find the median of that, and that's going to be your 25th percentile. 3383 05:09:40,880 --> 05:09:47,122 And so in our case, that's 90. Then, you probably guessed it, you go to the blue ones, right, 3384 05:09:47,122 --> 05:09:54,410 the upper half, and you go get the median out of that. And so of course ours is 254. So 3385 05:09:54,410 --> 05:10:00,020 that's our 75th percentile. So what we just did is we calculated our quartiles. We have 3386 05:10:00,020 --> 05:10:05,010 our 50th percentile, our 25th percentile and our 75th percentile. So that's what I meant 3387 05:10:05,010 --> 05:10:10,760 by that overview slide. This is an example of how you would do that. And of course, I 3388 05:10:10,760 --> 05:10:16,080 have to give a shout-out to the IQR, which is the interquartile range. Remember, you 3389 05:10:16,080 --> 05:10:22,960 just learned that. So that's the 75th percentile minus the 25th percentile. So in our case, 3390 05:10:22,960 --> 05:10:31,920 that's going to be 254 minus 90, which equals 164. So that is your IQR. So if I gave you 3391 05:10:31,920 --> 05:10:37,050 a test, and I asked you what is the IQR for these data, you can't just put 254 minus 3392 05:10:37,050 --> 05:10:42,580 90, you actually have to work it out and put 164. So there you go. So that's our quartile 3393 05:10:42,580 --> 05:10:50,430 example. So I just wanted to step back and give you some philosophical points on what 3394 05:10:50,430 --> 05:10:58,090 happens with Q1 and Q3, depending on how many data points you have. Okay, so remember, the 3395 05:10:58,090 --> 05:11:04,450 first step of this is always to put them in order from smallest to largest. So let's pretend 3396 05:11:04,450 --> 05:11:11,080 I had only drawn the first six values of my hospitals. See how on the slide, I put 3397 05:11:11,080 --> 05:11:18,930 the position of the number, which is 1, 2, 3, 4, 5, 6. 
And I put the example numbers above. So let's 3398 05:11:18,930 --> 05:11:22,772 say I was going to do the median on that, you know, what I'd have to do is I'd have 3399 05:11:22,772 --> 05:11:30,280 to take 90 plus 97, divided by two. But then the next question is, what do we do for Q1 3400 05:11:30,280 --> 05:11:38,510 and Q3? Well, given that in the example of having six values, the 90 and 97 are mushed 3401 05:11:38,510 --> 05:11:44,550 together for the median, they can get reused, or they do get reused, when 3402 05:11:44,550 --> 05:11:50,470 looking at the bottom and the top half of the data. So when we went to go do Q1 3403 05:11:50,470 --> 05:11:55,030 in this, we would actually count that 90 in there. In fact, Q1 would be 74, because 3404 05:11:55,030 --> 05:12:01,280 that's the median of the three numbers below the median, right below that line. And then 3405 05:12:01,280 --> 05:12:07,292 the Q3 would actually be 121, because we actually count the 97 in there. So in other 3406 05:12:07,292 --> 05:12:11,432 words, when you have like six values, and the median is made out of mushing together 3407 05:12:11,432 --> 05:12:15,750 two values, like taking the average of those two values, those two values get to 3408 05:12:15,750 --> 05:12:20,330 double dip: the bottom one gets to be in the bottom, 3409 05:12:20,330 --> 05:12:27,980 and the top one gets to be in the top when calculating Q1 and Q3. Now, well, what if 3410 05:12:27,980 --> 05:12:33,100 we had seven values instead of six? Okay, so I just expanded and pretended we had seven 3411 05:12:33,100 --> 05:12:38,790 hospitals. And you'll see that I have seven positions there. Well, this is a little like 3412 05:12:38,790 --> 05:12:46,190 the one we did together with the 11 values, where the median was clearly one value. Here, 3413 05:12:46,190 --> 05:12:53,280 in this case, it's 97. So that 97 does not get reused in the bottom or the top. 
So you'll 3414 05:12:53,280 --> 05:12:58,890 notice that Q1 is the middle number of the three bottom ones, and Q3 is the 3415 05:12:58,890 --> 05:13:04,390 middle number of the top three ones. And so that's what happens when you have seven values. 3416 05:13:04,390 --> 05:13:11,080 And it also happens when you have 11 values, like I demonstrated with those hospitals. 3417 05:13:11,080 --> 05:13:15,800 But it's not super predictable. Because what if you had eight values? We suddenly see it 3418 05:13:15,800 --> 05:13:20,702 gets a little complicated. So how would we do this? Well, see, the first four are between 3419 05:13:20,702 --> 05:13:26,530 41 and 97, and the top four are between 121 and 155. Well, to make our median, we'd have to take the 3420 05:13:26,530 --> 05:13:32,040 mean of 97 and 121. But remember, they don't get used up: the 97 then gets to double dip 3421 05:13:32,040 --> 05:13:37,830 and be part of the calculation for Q1, and 121 gets to double dip and be part of the 3422 05:13:37,830 --> 05:13:42,770 calculation for Q3. But even with this double dipping, if you go down, you'll see 3423 05:13:42,770 --> 05:13:49,250 that there are then four numbers to contend with for Q1. So of course, to get Q1, you 3424 05:13:49,250 --> 05:13:55,750 actually have to mush together, or take an average of, 74 and 90. And if you go up to the 3425 05:13:55,750 --> 05:14:00,530 upper part of the data, in order to get Q3, you're going to have to make an average 3426 05:14:00,530 --> 05:14:06,650 of 126 and 142, the ones in position six and position seven. So if you're unlucky enough 3427 05:14:06,650 --> 05:14:10,450 to get like eight values, then you realize you're going to have to make your median by 3428 05:14:10,450 --> 05:14:14,990 making an average of two numbers, your Q1 by making an average of two numbers, and your 3429 05:14:14,990 --> 05:14:21,190 Q3 like that. So it's not super predictable what's going to happen. 
You just have to pay 3430 05:14:21,190 --> 05:14:27,820 a lot of attention. Just remember, if your median is made out of two numbers averaged, 3431 05:14:27,820 --> 05:14:34,542 those numbers get to double dip in the downstairs and the upstairs of calculating Q1 and Q3. 3432 05:14:34,542 --> 05:14:39,840 If instead your median is just one number, like because you have an odd number of values, 3433 05:14:39,840 --> 05:14:48,470 then that guy has to just stay there and does not double dip in the Q1 and Q3 calculations. 3434 05:14:48,470 --> 05:14:53,420 So we can just see another example of this. So this is nine values, right? Now remember, 3435 05:14:53,420 --> 05:14:58,000 when I had 11 values, it was like having seven values. I had this median and it was really 3436 05:14:58,000 --> 05:15:03,150 clear, like we have here, but even the medians of the top of the data and the 3437 05:15:03,150 --> 05:15:07,090 bottom of the data, they were just, you know, it was an odd number. And so it was easy to 3438 05:15:07,090 --> 05:15:12,890 figure that out. Well, you see here, in this case, our median is the fifth value, and that's 3439 05:15:12,890 --> 05:15:19,670 121. So 121 does not double dip anywhere, right? So when we go to calculate Q1, we only 3440 05:15:19,670 --> 05:15:24,020 have four values, because we're not counting the 121. And then we're stuck with taking 3441 05:15:24,020 --> 05:15:28,782 an average of the second and third value to get Q1. And then same thing upstairs here, 3442 05:15:28,782 --> 05:15:33,050 between, you know, 142 and 155. You know, those are the two middle numbers of our four 3443 05:15:33,050 --> 05:15:37,710 numbers at the top. And then we have to take an average of those to get Q3. So I guess 3444 05:15:37,710 --> 05:15:41,400 this is just my long way of saying you've got to be really careful what you're doing. 
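The double-dipping rule can be written down as a short Python sketch. It implements the median-of-halves method described in the lecture: with an odd count the median is left out of both halves, and with an even count the data simply split in two, so the two middle values each land in one half. The function name `quartiles` is made up for illustration, and the sample values are the first six, seven, and eight sorted bed counts mentioned above.

```python
from statistics import median


def quartiles(values):
    """Return (Q1, median, Q3) using the median-of-halves method from the lecture."""
    data = sorted(values)
    n = len(data)
    half = n // 2
    lower = data[:half]
    # Odd count: skip the middle value. Even count: the upper half starts right after the lower half.
    upper = data[half + 1:] if n % 2 else data[half:]
    return median(lower), median(data), median(upper)


six = [41, 74, 90, 97, 121, 126]
seven = six + [142]
eight = seven + [155]

print(quartiles(six))    # (74, 93.5, 121)
print(quartiles(seven))  # (74, 97, 126)
print(quartiles(eight))  # (82.0, 109.0, 134.0)
```

Statistical packages use several different quartile conventions, so software output may not match these hand calculations exactly.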
First, 3445 05:15:41,400 --> 05:15:46,760 make sure you've gotten the median, then figure out if that median is the kind of median 3446 05:15:46,760 --> 05:15:51,170 where you're just circling it, or it's a median that came out of an average, because 3447 05:15:51,170 --> 05:15:54,410 if it's a median that came out of an average, just know that those numbers are going to 3448 05:15:54,410 --> 05:15:59,922 double dip in Q1 and Q3. And if it's a median that was because you had an odd number of 3449 05:15:59,922 --> 05:16:06,280 data, and it was just, like, in the middle, that one doesn't get to double dip. Okay, enough 3450 05:16:06,280 --> 05:16:09,872 double dipping, I'm getting hungry. When I go to that roller coaster, I'm going to get 3451 05:16:09,872 --> 05:16:14,970 a double dip ice cream cone. Okay, we're gonna move on to the box and whisker plot, which is 3452 05:16:14,970 --> 05:16:20,230 kind of like your percentiles getting graphed, right. So let's go back to our ingredients. 3453 05:16:20,230 --> 05:16:24,910 We already created our box plot ingredients. In fact, that's why I trickily went through 3454 05:16:24,910 --> 05:16:30,420 those quartiles first, because now we've created our ingredients to make a box plot. So I just 3455 05:16:30,420 --> 05:16:36,372 sort of summarized what we have on the left side of the slide (say that 50 times). 3456 05:16:36,372 --> 05:16:43,092 Hospital beds was what we were counting. The smallest hospital had only 41 beds. 3457 05:16:43,092 --> 05:16:50,350 To make it a little easier, I put it in order: Q1 was 90, the median, Q2, was 126. 3458 05:16:50,350 --> 05:16:55,550 You know what I mean, I mean quartile, right, by these Qs. Then Q3 is 254. And then 3459 05:16:55,550 --> 05:17:00,651 the maximum was 364. Okay, so let's make a box plot. And then you remember what the data 3460 05:17:00,651 --> 05:17:03,680 looks like on the right side of the slide. 
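As a small sketch, the box plot ingredients and the IQR from earlier can be kept together in Python. The five numbers are the ones read off the slide in the lecture; the dictionary layout and the function name are just illustrative choices.

```python
# Five-number summary for the hospital bed counts, as read off the lecture slide.
summary = {"min": 41, "q1": 90, "median": 126, "q3": 254, "max": 364}


def interquartile_range(q1, q3):
    """IQR: the third quartile minus the first quartile."""
    return q3 - q1


iqr = interquartile_range(summary["q1"], summary["q3"])
print(iqr)  # 164
```

The printed value is the worked-out answer the lecture asks for on a test, not the unevaluated expression 254 minus 90.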
Okay, well, now I'm going to walk you through 3461 05:17:03,680 --> 05:17:09,660 how you would make this box plot. So first, you draw this thing? Well, how do you know 3462 05:17:09,660 --> 05:17:14,762 what to draw? Well, I usually just draw a line and a vertical line, and then put a zero 3463 05:17:14,762 --> 05:17:18,760 at the bottom, and then I cheat, I go look at the maximum go, Oh, I wonder where that 3464 05:17:18,760 --> 05:17:24,860 is. And see our maximum was like 364. So I just made 400. At the top, if our maximum 3465 05:17:24,860 --> 05:17:28,880 had been something like, you know, I think Massachusetts General Hospital has something 3466 05:17:28,880 --> 05:17:35,750 like 600 or 800 beds. If we had gotten that one in there, and that was our maximum, I 3467 05:17:35,750 --> 05:17:39,931 would maybe go up to 900, you know, whatever is a little bit above the maximum, that's 3468 05:17:39,931 --> 05:17:45,400 what I put at the top. So this was 364. So I put 400, then what I did was I divided it 3469 05:17:45,400 --> 05:17:50,470 in half, like I see where the 200 is, I just kind of threw that in there. And then I divided 3470 05:17:50,470 --> 05:17:54,980 between the 200 and the 400, a half and put the 300. And so you can just kind of eyeball 3471 05:17:54,980 --> 05:18:00,000 this and draw it out that way if you want. Okay, so I got this thing set up. And then 3472 05:18:00,000 --> 05:18:02,460 here we go, we're going to do the first thing. 3473 05:18:02,460 --> 05:18:09,420 Okay, here's the first thing we're going to draw in q1 or quarter one. So on the left 3474 05:18:09,420 --> 05:18:14,850 side of the slide, you'll see a circle that's 90. On the right side of the slide, I made 3475 05:18:14,850 --> 05:18:21,820 this horizontal line. Now how Why do you make that line? 
Well, look at how it's proportioned 3476 05:18:21,820 --> 05:18:26,970 to that up-and-down graph thing I made, you know, with the numbers. You probably 3477 05:18:26,970 --> 05:18:32,880 don't want it too wide, but you don't want it too skinny. This is just about right, like 3478 05:18:32,880 --> 05:18:38,850 Goldilocks, just right. Okay, so you just make this horizontal line at Q1. So that's the 3479 05:18:38,850 --> 05:18:48,240 first. Now you make a copy of that same line, parallel, and you make it at Q3. So if you 3480 05:18:48,240 --> 05:18:52,740 look at that, I hope you're not lost, if you look at that, you know, 100, 200, 3481 05:18:52,740 --> 05:18:58,271 300, 400, you know, Q1 is 90, so it's about 10 under 100. So that's how I knew where 3482 05:18:58,271 --> 05:19:04,190 to position that lower one. And then 254, that's about, you know, a little bit higher 3483 05:19:04,190 --> 05:19:09,050 than halfway between 200 and 300. So that's where I roughly knew how to position this one. It's 3484 05:19:09,050 --> 05:19:13,670 not perfect. If you do it in statistical software, they put it out and it's perfect. But for 3485 05:19:13,670 --> 05:19:15,850 demonstration purposes, that's 3486 05:19:15,850 --> 05:19:16,850 what I'm doing. 3487 05:19:16,850 --> 05:19:21,640 Okay, so now what we've done is we put in Q1 and Q3, and we put in these horizontal lines 3488 05:19:21,640 --> 05:19:28,960 that are parallel. Alright, here's the next step. We connect them, hence the box. So the 3489 05:19:28,960 --> 05:19:35,990 box gets made, right, you just connect them. Alright, now I put a little 3490 05:19:35,990 --> 05:19:39,910 circle on the right side of the slide because I wanted to make sure you saw what's going 3491 05:19:39,910 --> 05:19:46,170 on there. Okay. That's where we put in Q2, or the median, right? So the median is 126. See 3492 05:19:46,170 --> 05:19:51,230 where 100 is. It's up a little bit, and we make that parallel. 
But you see how I made 3493 05:19:51,230 --> 05:19:56,580 Q1 and Q3, connected the box, and then did the median. I think this is the easiest order 3494 05:19:56,580 --> 05:20:00,600 to do it in when you're drawing it by hand and you're not the statistical software, because 3495 05:20:00,600 --> 05:20:05,380 then that way, you know, this box is all nice. And then your median fits and everything looks 3496 05:20:05,380 --> 05:20:12,690 nice. But we're not done yet. We've got the whiskers. So you're probably wondering this whole time, 3497 05:20:12,690 --> 05:20:16,920 what is this whisker thing? Well, you just figured out what the box is. The whiskers are 3498 05:20:16,920 --> 05:20:24,602 the markers for the minimum and the maximum. So you'll see the minimum's at 41. And then 3499 05:20:24,602 --> 05:20:30,110 we have a whisker at 41. So why is it called a whisker? Well, it's smaller. I don't know 3500 05:20:30,110 --> 05:20:34,350 why it's called a whisker, but it's different from the other ones because it's smaller. 3501 05:20:34,350 --> 05:20:39,030 I guess that's a reason, maybe. But notice how it's like half the size, almost half the 3502 05:20:39,030 --> 05:20:44,040 size. Sometimes they're really, really small, but it's tiny. And you want to position it, 3503 05:20:44,040 --> 05:20:49,530 like, vertically in the middle, like you don't want it off to the side or anything. And 3504 05:20:49,530 --> 05:20:55,820 you also want these parallel. You'll notice the maximum's up there, way high, at 364. So 3505 05:20:55,820 --> 05:20:59,990 I just did both of these on the same slide. So you draw in the whiskers. And then you 3506 05:20:59,990 --> 05:21:05,060 probably can guess the last step. Yeah, connect the whiskers to the box. So good job. There, 3507 05:21:05,060 --> 05:21:11,362 you went and did it. You made a box plot. And then now let's look at the interquartile 3508 05:21:11,362 --> 05:21:18,080 range. 
Remember how you calculated this, you took q3 minus q1? Well, that means 3509 05:21:18,080 --> 05:21:27,770 this boxy thing is 164 beds long, right? So that's where your IQR is. This is a visual 3510 05:21:27,770 --> 05:21:34,700 pictorial of your IQR. So very good. We did our boxplot, we did our interquartile range. 3511 05:21:34,700 --> 05:21:37,940 And you're probably wondering, why did we just do this? 3512 05:21:37,940 --> 05:21:40,250 I'll explain. 3513 05:21:40,250 --> 05:21:46,091 So why do we do this? Well, one of the main things that we do is we look at the distribution 3514 05:21:46,091 --> 05:21:50,800 in the data. I know, I know, you guys learned how to do a histogram already, and you're 3515 05:21:50,800 --> 05:21:55,610 good at a stem and leaf. Those are other ways of looking at the distribution. And if you 3516 05:21:55,610 --> 05:22:01,690 make a histogram of these data, you'll find that, well, I mean, these are only 11. But 3517 05:22:01,690 --> 05:22:05,410 you know, if you get a pile of data, and you make a histogram and the stem and leaf, you'll 3518 05:22:05,410 --> 05:22:11,240 find that those images agree with the boxplot. And you're probably thinking, well, how 3519 05:22:11,240 --> 05:22:14,942 do they agree? Well, if you look on the right side of the slide, I'm just giving you 3520 05:22:14,942 --> 05:22:20,650 an example. So, skewed right. If you had skewed right data, and you knew it, because you made 3521 05:22:20,650 --> 05:22:26,830 a histogram and you saw a skewed right distribution, if you took the same data, and you made a 3522 05:22:26,830 --> 05:22:34,110 boxplot, it would be kind of like that skewed right one that we just did, where the top 3523 05:22:34,110 --> 05:22:38,280 whisker would be really high, and that thing connecting the whisker to the box, that would 3524 05:22:38,280 --> 05:22:43,150 be like really long, whereas the one on the bottom is short. 
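As a quick check on the numbers in this example, here is a minimal Python sketch of the interquartile range and the two whisker-to-box distances. The five values are the ones read out in the lecture; the variable names and the simple "longer whisker" comparison are illustrative assumptions, not a formal test for skew.

```python
# Five-number summary read off the lecture's box plot (units: beds).
minimum, q1, median, q3, maximum = 41, 90, 126, 254, 364

# The IQR is the length of the box: Q3 minus Q1.
iqr = q3 - q1

# Distance from each box edge out to its whisker (the min or the max).
lower_whisker = q1 - minimum
upper_whisker = maximum - q3

print("IQR:", iqr)
print("lower whisker length:", lower_whisker)
print("upper whisker length:", upper_whisker)

# Rough visual hint only: a much longer upper whisker suggests right skew.
if upper_whisker > lower_whisker:
    shape_hint = "skewed right"
elif upper_whisker < lower_whisker:
    shape_hint = "skewed left"
else:
    shape_hint = "roughly symmetric"
print("shape hint:", shape_hint)
```

With these values the upper whisker is more than twice as long as the lower one, which matches the skewed right picture described above.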
As you can see, the skewed 3525 05:22:43,150 --> 05:22:49,971 left is the opposite, right? The bottom one is long, and the top one short. If you have 3526 05:22:49,971 --> 05:22:56,330 a normal distribution, remember that that's symmetrical. That's that mound shaped distribution, 3527 05:22:56,330 --> 05:23:00,530 and you have a larger spread. In other words, you have a bigger standard deviation, you 3528 05:23:00,530 --> 05:23:05,930 have a bigger variance, right? Then you're going to see a box that's really big like 3529 05:23:05,930 --> 05:23:10,260 that. But if you have a smaller spread, and it's a normal distribution, you're going to 3530 05:23:10,260 --> 05:23:13,430 see a box that looks like this. And you're probably wondering, where are you getting 3531 05:23:13,430 --> 05:23:21,610 these shapes? Well, I'll show you a kind of on the last slide here as we wrap up the conclusion. 3532 05:23:21,610 --> 05:23:27,770 It's because if you fly over a roller coaster, like see this roller coaster, this roller 3533 05:23:27,770 --> 05:23:32,520 coaster is skewed right? That would make sense, right? Because you want to go up steeply, 3534 05:23:32,520 --> 05:23:40,530 and then go down really fast. And see how the boxplot for the roller coaster looks. 3535 05:23:40,530 --> 05:23:47,670 You've got sort of the part where you start going up really fast. That's kind of near 3536 05:23:47,670 --> 05:23:53,910 the median and kind of near the the 25th percentile. And the part where you start where you're 3537 05:23:53,910 --> 05:23:58,692 just getting on and it's slowly going there. That's like the bottom whisker. And then you 3538 05:23:58,692 --> 05:24:03,330 go up and you come down. And it's a long tail, which is good, I guess if you design roller 3539 05:24:03,330 --> 05:24:09,042 coasters, and then that long tail, then is that right skew? 
So that's why, I mean, if 3540 05:24:09,042 --> 05:24:13,442 in your mind, you're going, how is she getting this histogram and this box plot, this 3541 05:24:13,442 --> 05:24:19,080 is kind of how I'm doing it. I'm saying, well, if you flew over the histogram, or the 3542 05:24:19,080 --> 05:24:24,990 roller coaster, you might see like a shape of a box plot. So in conclusion, we talked 3543 05:24:24,990 --> 05:24:29,620 about percentiles in general, like the 77th percentile, what that all means. And then 3544 05:24:29,620 --> 05:24:35,430 we focused in on quartiles, which are a specific set of percentiles. And then we're going to, 3545 05:24:35,430 --> 05:24:40,810 or we already did, calculate the quartiles. And the reason why we did that is because 3546 05:24:40,810 --> 05:24:45,800 we first needed to do that in order to make the interquartile range. And then finally, 3547 05:24:45,800 --> 05:24:51,380 we need those quartiles in order to make and interpret a box and whisker plot. Okay, this 3548 05:24:51,380 --> 05:24:56,770 isn't the roller coaster I'm going to, but I'm going to one and I guarantee you it is 3549 05:24:56,770 --> 05:24:58,080 skewed right. 3550 05:24:58,080 --> 05:25:05,000 Greetings and salutations. Hi, this is Monica Wahi, your library college lecturer bringing 3551 05:25:05,000 --> 05:25:13,160 to you chapter 4.1, scatter diagrams and linear correlation. So here's what you're gonna learn. 3552 05:25:13,160 --> 05:25:18,360 At the end of this lecture, you should be able to explain what a scattergram is and 3553 05:25:18,360 --> 05:25:25,952 how to make one, state what strength and direction mean with respect to correlations, and compute 3554 05:25:25,952 --> 05:25:31,920 correlation coefficient r using the computational formula. And finally, you should be able to 3555 05:25:31,920 --> 05:25:38,122 describe why correlation is not necessarily causation. So let's jump right into it. 
First, 3556 05:25:38,122 --> 05:25:42,750 we're going to talk about making a scatter diagram. And the thing on the right side of 3557 05:25:42,750 --> 05:25:47,440 the screen is not a scatter diagram, but it's kind of scattered. So I put it there, it's 3558 05:25:47,440 --> 05:25:51,390 kind of pretty. And then next, we're going to talk about correlation coefficient, R, 3559 05:25:51,390 --> 05:25:56,840 and how to make it. And then finally, we're gonna do a shout out to causation and lurking 3560 05:25:56,840 --> 05:26:01,100 variables, which remember we talked about before, but we're going to talk about them 3561 05:26:01,100 --> 05:26:06,980 again, in relationship to our. So let's start with the scattergram. And I also call it a 3562 05:26:06,980 --> 05:26:10,350 scatter plot, because it's like everything in statistics, there's got to be about eight 3563 05:26:10,350 --> 05:26:15,840 names for everything. So scatter gram, and scatterplot mean the same thing. So let's 3564 05:26:15,840 --> 05:26:23,820 just get with the setup here. So scatter grams, or scatter plots are graphs of x, y pairs. 3565 05:26:23,820 --> 05:26:31,050 So what's an XY pair, xy pairs are measurements, two measurements made of the same individual 3566 05:26:31,050 --> 05:26:37,250 or the same unit. So if you measure my height and my weight, that's an XY pair, if you measure 3567 05:26:37,250 --> 05:26:41,200 my height in the my friend's weight, that's not an XY pair, because that's two different 3568 05:26:41,200 --> 05:26:50,410 people, right? So these xy pairs, the x part is called the explanatory or independent variable. 3569 05:26:50,410 --> 05:26:55,720 And it's always graphed on the x axis. So remember, in algebra, you would do these graphs, 3570 05:26:55,720 --> 05:27:00,920 where you have this vertical line, and that was the y axis, and you have this horizontal 3571 05:27:00,920 --> 05:27:04,730 line, which was the x axis. 
And I always had trouble remembering which is which, but that's 3572 05:27:04,730 --> 05:27:11,040 how it is. And so whichever of the pairs is x, expect that to be graphed 3573 05:27:11,040 --> 05:27:17,080 along the x axis. And it's also called the explanatory or independent variable. Remember, 3574 05:27:17,080 --> 05:27:22,870 there's got to be a million names for everything: explanatory or independent variable. So if 3575 05:27:22,870 --> 05:27:27,560 I talked to you and said, here's an XY pair, and this one is the independent variable, 3576 05:27:27,560 --> 05:27:31,840 or this one is the explanatory variable, you need to like just secretly know I'm talking 3577 05:27:31,840 --> 05:27:38,680 about the X of the two. And then surprise, here's the y of the two, and the Y is also 3578 05:27:38,680 --> 05:27:44,180 called the response variable. It's also called the dependent variable. And that is graphed 3579 05:27:44,180 --> 05:27:50,070 on the y axis. So again, like I said, I used to have trouble remembering, is the vertical 3580 05:27:50,070 --> 05:27:55,830 one the y axis, or the horizontal one? But what I did was I remembered, if you take a 3581 05:27:55,830 --> 05:28:01,120 capital Y, and you go grab onto its tail, and you go pull it straight down, you'll see 3582 05:28:01,120 --> 05:28:05,820 that it's vertical. And that's how I remember that's the y axis. It doesn't hurt the Y. 3583 05:28:05,820 --> 05:28:11,830 It's used to that. So if you can stretch the y's tail down, and you get vertical, remember, 3584 05:28:11,830 --> 05:28:16,762 that's the y axis. And then the other one is the x axis. Okay? And then also, you have 3585 05:28:16,762 --> 05:28:23,370 to find a way to remember which one means what. Like, does x mean explanatory and independent? 3586 05:28:23,370 --> 05:28:28,350 Or does it mean response and dependent? So how I do it is, you know how we sing the 3587 05:28:28,350 --> 05:28:36,622 ABCs, abcdefg. 
Well, if you fast-forward, the end is w, x, y, z, right, so the x comes before 3588 05:28:36,622 --> 05:28:45,000 the Y, you know, in the alphabet, so I do x and then an arrow to y. And then I imagine 3589 05:28:45,000 --> 05:28:49,390 in my head that's saying X causes Y, even though it doesn't necessarily cause y, as you'll see 3590 05:28:49,390 --> 05:28:54,452 at the end of this lecture, but I think about it that way. Because if that happens, then 3591 05:28:54,452 --> 05:29:01,730 y is dependent on x and x is independent, it can do whatever it wants, but y is dependent. 3592 05:29:01,730 --> 05:29:08,910 So that's my way of remembering x is the independent variable, and y is the dependent variable. 3593 05:29:08,910 --> 05:29:15,060 So anyway, that's a long way of saying the scattergram is a graph of these xy pairs. 3594 05:29:15,060 --> 05:29:21,740 And that's what we're going to do is make that graph. So we needed some xy pairs, right? 3595 05:29:21,740 --> 05:29:27,380 So I asked the question, does the number of diagnoses a patient has correlate 3596 05:29:27,380 --> 05:29:32,580 with the number of medications she or he takes? So if you don't have that many diagnoses, 3597 05:29:32,580 --> 05:29:36,820 you probably aren't on that many meds, right. But if you have a lot of diagnoses, you should 3598 05:29:36,820 --> 05:29:38,530 be on a lot of meds. But we all know 3599 05:29:38,530 --> 05:29:42,890 people in real life can sort of violate that, just depending. I mean, you could have one 3600 05:29:42,890 --> 05:29:47,170 really bad diagnosis with a lot of meds. Or you can have a bunch of diagnoses that are 3601 05:29:47,170 --> 05:29:50,780 all taken care of with one med, so it's not perfect, but this is kind of a reasonable 3602 05:29:50,780 --> 05:29:59,500 thing to think. So what I did was I put up here just four x, y pairs, as you can see, 3603 05:29:59,500 --> 05:30:05,670 so I've got four pretend patients. 
And you can see here's the first patient. That person 3604 05:30:05,670 --> 05:30:11,220 has an x of one because they only have one diagnosis, but like I was saying, it must be a bad diagnosis 3605 05:30:11,220 --> 05:30:16,920 because that person has a y of three, or is on three meds for it. Right? So that's how 3606 05:30:16,920 --> 05:30:23,770 you read this table. So let's start making our scattergram out of these data. Okay, so 3607 05:30:23,770 --> 05:30:29,400 here we go. So I labeled the x axis number of diagnoses, right, just to keep things straight, 3608 05:30:29,400 --> 05:30:34,340 and the y axis number of medications, and then you'll see where I put the dot, right? 3609 05:30:34,340 --> 05:30:41,350 Because x is one, I went over to one on number of diagnoses, right? The one diagnosis, and 3610 05:30:41,350 --> 05:30:48,260 then, because y was three, I went up three to this three, right, and there goes the dot, 3611 05:30:48,260 --> 05:30:52,990 that's where that first person gets a dot, okay, you put it there. And that's what you're 3612 05:30:52,990 --> 05:30:58,690 going to do with these other ones, too, so it's four dots. Okay, I just threw all the dots 3613 05:30:58,690 --> 05:31:02,650 down, so you can kind of see what was going on. But here's the second person, right? So 3614 05:31:02,650 --> 05:31:08,590 that person had an X of three. So I went over three. And I just put those green arrows in 3615 05:31:08,590 --> 05:31:12,592 just so you can see what was going on. They're really not part of the scatterplot, it's just 3616 05:31:12,592 --> 05:31:19,060 more like cheating, you know, to show you, because we're just practicing, right? And 3617 05:31:19,060 --> 05:31:24,721 then that person had an X of three and then a y of five, and you see where the dot 3618 05:31:24,721 --> 05:31:30,940 goes, right. And then here, you can see where the fourth dot, or I'm sorry, the third dot 3619 05:31:30,940 --> 05:31:35,080 goes, because there's a four and a four. 
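The plotting steps for the three pairs read out so far, (1, 3), (3, 5) and (4, 4), can be sketched as a small text grid in Python. This is only an illustration: the function name `text_scatter` and the 5 by 5 axis range are assumptions, and the fourth patient's values are not stated in the transcript, so that dot is left out.

```python
# (x, y) pairs from the lecture: x = number of diagnoses, y = number of medications.
pairs = [(1, 3), (3, 5), (4, 4)]

def text_scatter(points, x_max=5, y_max=5):
    """Return text rows for a scattergram; '*' marks a dot, '.' an empty spot."""
    dots = set(points)
    rows = []
    # Top row is the largest y, like the vertical axis on the slide.
    for y in range(y_max, 0, -1):
        cells = "".join(" *" if (x, y) in dots else " ." for x in range(1, x_max + 1))
        rows.append(f"{y} |{cells}")
    # Horizontal axis with the x values underneath.
    rows.append("  +" + "--" * x_max)
    rows.append("   " + "".join(f" {x}" for x in range(1, x_max + 1)))
    return rows

for row in text_scatter(pairs):
    print(row)
```

Each dot is placed the way the lecture describes: go over to the x value first, then up to the y value.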
And then here we have the fourth dot. So this 3620 05:31:35,080 --> 05:31:40,030 is the scattergram of these four patients. Of course, a lot of times you have like hundreds 3621 05:31:40,030 --> 05:31:46,830 of patients in there. But I just showed you the simple example. Okay, now, because we 3622 05:31:46,830 --> 05:31:53,470 did that, I can talk about linear correlation, and you'll kind of get it, right? Linear correlation, 3623 05:31:53,470 --> 05:31:59,610 that term means that when you make a scatterplot of xy pairs, it kind of looks like a line. 3624 05:31:59,610 --> 05:32:05,430 Now, what's over here on the right is not like biology. That's not like statistics. That's like algebra, 3625 05:32:05,430 --> 05:32:09,660 right? Because back in algebra, you'd have these perfect lines where the dot was right 3626 05:32:09,660 --> 05:32:15,320 on the line, and see the x and y. Notice there's no diagnosis, nothing. That's algebra, right. 3627 05:32:15,320 --> 05:32:21,580 So perfect linear correlation looks like graphing points in algebra. And if you actually 3628 05:32:21,580 --> 05:32:27,000 make a scatterplot of, like, people's xy pairs, and you see that, you should suspect there's 3629 05:32:27,000 --> 05:32:31,980 something wrong. It actually happened to me once. One of our statisticians came to me 3630 05:32:31,980 --> 05:32:32,990 and said, Monica, 3631 05:32:32,990 --> 05:32:33,990 look 3632 05:32:33,990 --> 05:32:39,280 at this, you won't believe this. And I said, Well, I don't believe this. What are you graphing? 3633 05:32:39,280 --> 05:32:47,530 And he said, on the x axis, he had put the weight of each person's liver. And 3634 05:32:47,530 --> 05:32:54,120 on the y axis, he put the weight of the whole person. And I'm like, how do you weigh 3635 05:32:54,120 --> 05:32:59,270 people's livers? Like, that sounds painful. And he goes, Oh, let me go see. 
And what he 3636 05:32:59,270 --> 05:33:05,110 learned was that you don't weigh people's livers, you use an equation to estimate the 3637 05:33:05,110 --> 05:33:09,560 weight of their liver, and guess what's in the equation: their actual weight. So I'm 3638 05:33:09,560 --> 05:33:14,880 like, that's why it came out, like, on a line, because you were using the Y to calculate 3639 05:33:14,880 --> 05:33:20,270 the x. And he was like, Oh, you're so smart for a secretary. So then I became an epidemiologist. 3640 05:33:20,270 --> 05:33:27,040 But anyway, if you ever see this in biology, just suspect something's fishy, because really, 3641 05:33:27,040 --> 05:33:32,440 things just don't end up right on the line. But if they get really close, you can say it's 3642 05:33:32,440 --> 05:33:37,190 close to perfect linear correlation. I just wanted to let you know, that's 3643 05:33:37,190 --> 05:33:43,340 what's going on here with this linear correlation. Okay, so let's talk about facts about linear 3644 05:33:43,340 --> 05:33:49,240 correlation. So things can be linearly correlated without being perfectly on the line. Obviously, 3645 05:33:49,240 --> 05:33:56,070 our little thing was. So if, when you make those dots in your scattergram, you imagine 3646 05:33:56,070 --> 05:34:00,372 a line going through it, and you imagine that the line is going up, like it kind of looks 3647 05:34:00,372 --> 05:34:07,860 like it's going up, this is called a positive correlation. But you don't always have a line 3648 05:34:07,860 --> 05:34:13,208 going up. So I want you to look at this. And I made up these data too. But on the x axis 3649 05:34:13,208 --> 05:34:18,780 is the number of patient complaints. So as we go on, the patients are madder and madder. 3650 05:34:18,780 --> 05:34:24,030 They're grouchy and they're making more complaints. On the y axis, we have number 3651 05:34:24,030 --> 05:34:30,150 of nurses staffed on the shift, right? 
And so as you go up, there's more nurses. Well, 3652 05:34:30,150 --> 05:34:34,430 sure enough, when you've got a lot of nurses, you don't have as many patient complaints, 3653 05:34:34,430 --> 05:34:39,860 right? Because they're being attended to. So this is what some people 3654 05:34:39,860 --> 05:34:47,110 call inverse correlation. But in this presentation, I'm calling it a negative correlation. Because 3655 05:34:47,110 --> 05:34:53,520 as one goes up, the other goes down. And as one goes down, the other goes up, 3656 05:34:53,520 --> 05:34:59,692 and that's depicted visually with this line going down. So you see, you can imagine a line 3657 05:34:59,692 --> 05:35:05,570 going down. That's a negative correlation. Neither is better, you know, positive versus 3658 05:35:05,570 --> 05:35:12,042 negative, it just explains how these things are behaving together, how X and Y behave together. 3659 05:35:12,042 --> 05:35:13,570 But then 3660 05:35:13,570 --> 05:35:17,960 you can have situations where there's really no correlation, like x and y really don't 3661 05:35:17,960 --> 05:35:22,880 have anything to do with each other. So as you've seen, you know, when you 3662 05:35:22,880 --> 05:35:27,420 have patients in the hospital, some of them have really big families, and those families 3663 05:35:27,420 --> 05:35:32,660 come a lot. And some of them don't really have that many loved ones. So as you can see 3664 05:35:32,660 --> 05:35:39,630 along x, here are total unique visitors, meaning you just count each person once. 3665 05:35:39,630 --> 05:35:45,730 So you could have, there's a patient who only has one unique visitor. But if you look at 3666 05:35:45,730 --> 05:35:49,958 y, days spent in the hospital, that person has been there seven days, and that 3667 05:35:49,958 --> 05:35:57,260 visitor keeps coming, right. And then you have maybe a patient here, the second one, 3668 05:35:57,260 --> 05:36:01,180 who has two unique visitors. 
And that person's only been in one day, but both those people have 3669 05:36:01,180 --> 05:36:06,200 been there, then you have people like a person with three unique visitors. And they've been 3670 05:36:06,200 --> 05:36:10,960 in the hospital for days, right. And those are probably the same three people coming 3671 05:36:10,960 --> 05:36:16,140 back. So it really doesn't matter how long a person's in the hospital, if they've got 3672 05:36:16,140 --> 05:36:21,792 a lot of loved ones who keep coming, they'll keep coming or not. Right? Right, according 3673 05:36:21,792 --> 05:36:28,130 to this correlation. So you end up imagining a straight line. And that's no correlation, 3674 05:36:28,130 --> 05:36:33,190 that's fine, too. Nothing is better or worse, it's just that you make the scattergram to 3675 05:36:33,190 --> 05:36:41,840 try and understand how x and y are related. This is always fun. Like in books, they always 3676 05:36:41,840 --> 05:36:48,240 make some sort of goofy picture. I don't know why they do this, I would never get a goofy 3677 05:36:48,240 --> 05:36:54,140 picture, like they show in books about, you know, this, I made up the correlation. This 3678 05:36:54,140 --> 05:36:58,820 is in the lobby, the number of the games in the lobby, and the number of the books in 3679 05:36:58,820 --> 05:37:02,792 the lobby, they should really have nothing to do with each other. But if you see something 3680 05:37:02,792 --> 05:37:08,122 just way goofy like this, just say it's no correlation. I don't even know how I get this. 3681 05:37:08,122 --> 05:37:16,420 Hi, there. Alright, so we've been talking about correlation. And it actually has two 3682 05:37:16,420 --> 05:37:21,830 attributes. So far, we've only talked about one and that is direction, we talked about 3683 05:37:21,830 --> 05:37:26,850 positive, negative and no correlation. So whenever you're talking about a correlation, 3684 05:37:26,850 --> 05:37:31,710 you have to say what direction it is. 
But you also have to say the other thing, which 3685 05:37:31,710 --> 05:37:35,940 is what strength it is. So now we're going to talk about how you figure out what the 3686 05:37:35,940 --> 05:37:42,800 strength is. So strength refers to how close to the line all of the dots fall. If they fall really 3687 05:37:42,800 --> 05:37:49,790 close to the line, it is considered strong. If they fall kind of close to the line, it's 3688 05:37:49,790 --> 05:37:54,980 called moderate. And if they are not very close to the line, it's weak. Now remember, that's 3689 05:37:54,980 --> 05:37:59,730 totally different from what the direction is. It could be positive strong, or negative strong, 3690 05:37:59,730 --> 05:38:07,060 right, could be positive moderate, or negative moderate. So this is just a statement, the 3691 05:38:07,060 --> 05:38:12,220 strength is a statement of how close the dots you make in your scattergram fall to 3692 05:38:12,220 --> 05:38:20,360 the line that you end up drawing. So I thought I'd just give you a few examples. So look 3693 05:38:20,360 --> 05:38:24,692 at this, I just made this up. This is what a strong negative one would look like. Notice 3694 05:38:24,692 --> 05:38:32,270 how those pink dots are almost on the line. And this is a strong positive. Again, even 3695 05:38:32,270 --> 05:38:36,870 one of the dots is on it, right, not all of them, you know, or it'd be perfect, but it's 3696 05:38:36,870 --> 05:38:42,130 never perfect. So this is really close. But it's strong positive. So strong just refers 3697 05:38:42,130 --> 05:38:48,790 to the fact that the dots are almost on the line. Now, this is almost the same correlation, 3698 05:38:48,790 --> 05:38:54,182 but the dots are not really almost on the line. The line has to be fair and kind of go between 3699 05:38:54,182 --> 05:38:59,020 them, but they're kind of far away. And so just eyeballing it, you would say this is 3700 05:38:59,020 --> 05:39:06,980 moderate. And here, it gets weak. 
And mainly it's because the dots are more all over the 3701 05:39:06,980 --> 05:39:13,350 place. But you'll notice there's one that's like right on the x axis. And then hey, look 3702 05:39:13,350 --> 05:39:18,920 up there, like in the title, there's one up there, like way up there. And that's like 3703 05:39:18,920 --> 05:39:26,942 an outlier. And sometimes, when you get outliers, they can really whack things out. So even 3704 05:39:26,942 --> 05:39:32,708 though this is a weak correlation, that line looks, like, so powerful, because it's almost 3705 05:39:32,708 --> 05:39:38,400 basically connecting these two outliers. So you just got to be careful, and that's part 3706 05:39:38,400 --> 05:39:43,590 of why you make a scattergram first: outliers can have a really powerful effect on 3707 05:39:43,590 --> 05:39:44,702 the correlation. 3708 05:39:44,702 --> 05:39:50,190 Especially if it's in any of the four corners of the plot. Like if you get a weird outlier 3709 05:39:50,190 --> 05:39:55,432 kinda in the middle, it's not going to do as much as if it's in the upper right, upper 3710 05:39:55,432 --> 05:39:59,810 left, lower right or lower left. It can really affect the direction, like, you know, 3711 05:39:59,810 --> 05:40:06,120 it's like a seesaw, or a teeter totter, you know, an outlier can get on and really change 3712 05:40:06,120 --> 05:40:14,860 the direction of it. And it can also mess with how strong or weak the correlation is. 3713 05:40:14,860 --> 05:40:19,310 So that's why you really want to start with a scatterplot. And that's why the way this 3714 05:40:19,310 --> 05:40:24,112 chapter is organized, it starts with the scatterplot. You just want to look for outliers. 3715 05:40:24,112 --> 05:40:32,300 And also just see how X and Y look when you plot them. Now we're going to get on to correlation 3716 05:40:32,300 --> 05:40:39,350 coefficient r. We're going to get on to computation and actually making a number. 
So you don't 3717 05:40:39,350 --> 05:40:45,840 just use watery terms like direction, you know, positive, negative, or moderate, strong, 3718 05:40:45,840 --> 05:40:53,261 weak to explain it, but you can actually put a number on how correlated x and y are. So 3719 05:40:53,261 --> 05:41:00,190 remember the word coefficient? We did it with coefficient of variation, which is different. 3720 05:41:00,190 --> 05:41:06,430 So the CV, you know, is one kind of coefficient. But what we're going to talk about is a different 3721 05:41:06,430 --> 05:41:12,590 kind. This time, our coefficient is called r. And coefficient just means a 3722 05:41:12,590 --> 05:41:18,010 number; we just like to use that word in statistics. Now, it seems kind of weird, because like, 3723 05:41:18,010 --> 05:41:21,520 I'm talking about correlation, and people are like, Well, why is it r? Why isn't it 3724 05:41:21,520 --> 05:41:26,042 like c for correlation? And I'm like, I don't know, I didn't invent it. But this is how 3725 05:41:26,042 --> 05:41:35,880 you can remember: you can go co-R-relation, co-R-relation. So correlation coefficient r. So just remember, 3726 05:41:35,880 --> 05:41:43,780 r means correlation. And technically r means sample correlation coefficient. For the population correlation 3727 05:41:43,780 --> 05:41:47,780 coefficient, right, like if, you know, imagine you're correlating like height and weight 3728 05:41:47,780 --> 05:41:53,650 in the population, like, oh, everybody in a particular state, you actually need a Greek 3729 05:41:53,650 --> 05:41:57,600 letter for that. And I showed it on the screen. I don't know, it's this fancy p, I don't know 3730 05:41:57,600 --> 05:42:03,690 the right name of it. But we don't actually cover it in this class. So I just want to 3731 05:42:03,690 --> 05:42:12,942 show it to you. We're only going to focus on r, which is the sample correlation coefficient. 3732 05:42:12,942 --> 05:42:19,090 So what is r? 
Well, it's like I said, it's the numerical quantification of how correlated 3733 05:42:19,090 --> 05:42:27,000 a set of x y pairs are. And it's actually calculated by plugging all of the XY pairs 3734 05:42:27,000 --> 05:42:33,370 into the equation. I'll show you how to do it. And you can see that if you do it by hand, 3735 05:42:33,370 --> 05:42:39,230 if you have a lot of xy pairs, that will take forever. So I tried to limit that. And like, 3736 05:42:39,230 --> 05:42:44,990 remember, with standard deviation and variance, there was like a defining formula and a computational 3737 05:42:44,990 --> 05:42:50,650 formula. This time, I'm only going to show you the computational formula. It's, in my 3738 05:42:50,650 --> 05:42:55,830 opinion, way easier to do, but it gets you the same number. Alright. So that's what we're 3739 05:42:55,830 --> 05:42:59,901 going to do is we're going to take a set of xy pairs, and we're going to calculate 3740 05:42:59,901 --> 05:43:00,901 r. 3741 05:43:00,901 --> 05:43:07,192 Um. But then how do you interpret r? Well, let me just prepare you mentally for what 3742 05:43:07,192 --> 05:43:12,060 we're going to get out of this calculation. The r calculation produces a number, and 3743 05:43:12,060 --> 05:43:17,770 the lowest number possible is negative 1.0. So that's perfect negative correlation. So 3744 05:43:17,770 --> 05:43:23,180 if we were like in algebra, and we had a line going down, and all the dots were on 3745 05:43:23,180 --> 05:43:28,793 it, then the r would be negative 1.0. But that never happens. Right? So how you want 3746 05:43:28,793 --> 05:43:33,860 to think about it is, like, if you have a negative correlation, and you get an r that's like 3747 05:43:33,860 --> 05:43:41,610 negative point nine five, or something really close to negative 1.0, then it's close to 3748 05:43:41,610 --> 05:43:46,480 negative 1.0. So it's close to perfect negative correlation. 
That's how you want to think 3749 05:43:46,480 --> 05:43:51,542 about it. And then the opposite is the highest possible number you can get for our is 1.0. 3750 05:43:51,542 --> 05:43:55,630 But most people never do that. except for that one mistake I was telling you about. 3751 05:43:55,630 --> 05:44:00,720 And that would be perfect positive correlation. So if you see that you calculate an R, and 3752 05:44:00,720 --> 05:44:06,910 it gets really close, like point nine, five, like I said, or nine, eight or whatever, then 3753 05:44:06,910 --> 05:44:12,208 you're thinking, whoa, this is really close to perfect positive correlation, right? And 3754 05:44:12,208 --> 05:44:17,860 then everything else is in between. So like, you know, point five or negative point three 3755 05:44:17,860 --> 05:44:25,820 or point 02, or negative point, oh nine, like all of those are between negative 1.0 and 3756 05:44:25,820 --> 05:44:31,610 1.0. And that's where r should be. So let's say you calculate R and you get eight. Okay, 3757 05:44:31,610 --> 05:44:37,852 you did it wrong, right? Or you calculate R and you get negative 2.3. Like that's not 3758 05:44:37,852 --> 05:44:44,942 right, it's got to be between negative 1.0 and 1.0. And if you make a scattergram, you 3759 05:44:44,942 --> 05:44:48,530 should know whether it should be on the negative side of the positive side or it should give 3760 05:44:48,530 --> 05:44:55,420 you a hint. So this is just more to calibrate what to expect from our because it's kind 3761 05:44:55,420 --> 05:45:01,550 of a big calculation. So I'm just going to give you some pictorial example. Because remember, 3762 05:45:01,550 --> 05:45:08,860 every single time we make our right, um, we also have a scatterplot behind it. 
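To make the range of r and the liver-weight story concrete, here is a minimal Python sketch. It uses the standard computational formula for the sample correlation coefficient, which is assumed here to match the formula on the lecturer's slide. The body weights and the liver-weight equation are made up purely for illustration, and the names `pearson_r` and `r_is_plausible` are hypothetical.

```python
import math

def pearson_r(xs, ys):
    """Sample correlation coefficient r from the usual sums (computational formula)."""
    n = len(xs)
    sum_x, sum_y = sum(xs), sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_x2 = sum(x * x for x in xs)
    sum_y2 = sum(y * y for y in ys)
    numerator = n * sum_xy - sum_x * sum_y
    denominator = (math.sqrt(n * sum_x2 - sum_x ** 2)
                   * math.sqrt(n * sum_y2 - sum_y ** 2))
    return numerator / denominator

def r_is_plausible(r, tol=1e-9):
    """r has to land between -1.0 and 1.0; tol absorbs floating-point rounding."""
    return -1.0 - tol <= r <= 1.0 + tol

# Made-up body weights in kg (the y values in the story).
body_weight = [55, 62, 70, 81, 94]
# Liver weights "estimated" from body weight with a made-up linear equation (the x values).
liver_weight = [0.5 + 0.02 * w for w in body_weight]

r = pearson_r(liver_weight, body_weight)
print("r for liver weight vs. body weight:", round(r, 6))
print("plausible r values?", r_is_plausible(r), r_is_plausible(8), r_is_plausible(-2.3))
```

Because x is computed directly from y by a straight-line equation, r comes out at essentially 1.0, which is the warning sign described in the anecdote. Values like 8 or negative 2.3 fail the range check, so they signal an arithmetic mistake.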
And I just 3763 05:45:08,860 --> 05:45:12,980 thought, you know, it would be helpful to see some real life examples of r. These 3764 05:45:12,980 --> 05:45:18,970 are real life examples, okay, real life, you don't get this from just anything, right? 3765 05:45:18,970 --> 05:45:22,990 I'm just teasing. But anyway, so I started with some negative r's because I'm feeling 3766 05:45:22,990 --> 05:45:28,830 negative today. I went into the literature and I found this article 3767 05:45:28,830 --> 05:45:29,830 about, 3768 05:45:29,830 --> 05:45:38,040 oh, it's not MIT and Harvard. It's about the evolutionary principles of modular gene regulation 3769 05:45:38,040 --> 05:45:43,170 in yeast, and all I know is, I'm supposed to cut down on eating bread. So that's all 3770 05:45:43,170 --> 05:45:48,990 I know about this. But they had these really nice scatter plots. And they calculated 3771 05:45:48,990 --> 05:45:52,760 r for them, and they had a little line on them. So I thought I'd show them to you. 3772 05:45:52,760 --> 05:45:58,840 So if you look at the one that's labeled D, see where the dots are, right, and see where 3773 05:45:58,840 --> 05:46:05,860 the line is. And this looks kind of like a moderate to strong negative correlation, 3774 05:46:05,860 --> 05:46:10,740 right? Because the dots are kind of close to the line. And then when the group calculated 3775 05:46:10,740 --> 05:46:16,560 r, they got negative point seven. And so that kind of makes sense, because, and then 3776 05:46:16,560 --> 05:46:22,208 I put my opinion in the lower right, these aren't official cut points or anything, but 3777 05:46:22,208 --> 05:46:27,612 I usually use these as a guide, see how I said negative point four to negative point 3778 05:46:27,612 --> 05:46:35,530 seven is moderate. So I would call that D one moderate. Now let's look at E. 
So see how 3779 05:46:35,530 --> 05:46:43,310 the dots don't cluster so close to the line, as they do with the D one, that's going to 3780 05:46:43,310 --> 05:46:48,890 make it a weaker correlation, it's still it's still negative, right? So it's negative point 3781 05:46:48,890 --> 05:46:54,670 four, four. And when you look at my little opinion, I still call that moderate, but it's 3782 05:46:54,670 --> 05:47:01,530 on the low end, see that. And then if you look at AF, see how many of them are like 3783 05:47:01,530 --> 05:47:09,160 way far away from that line, and they're dragging it down. So now it's in the even weaker correlation, 3784 05:47:09,160 --> 05:47:15,872 negative point two, five, right. And so then that's weak. And so this is just some examples 3785 05:47:15,872 --> 05:47:21,270 to give you a pictorial. And now I'll be I promise to be more positive, here's some positive 3786 05:47:21,270 --> 05:47:26,880 Rs, they didn't draw a line on this one, this is a different article, right? Says obesity 3787 05:47:26,880 --> 05:47:32,630 is associated with macrophage accumulation, and adipose tissue. So again, try to cut down 3788 05:47:32,630 --> 05:47:41,120 on bread. But anyway, um, if you look on the left side, you'll see all of these x y pairs 3789 05:47:41,120 --> 05:47:45,208 plotted on the scattergram. And even though we don't have a line there, we can imagine 3790 05:47:45,208 --> 05:47:50,770 it's going up. So we would expect this to be positive. But we also would imagine they're 3791 05:47:50,770 --> 05:47:56,542 not really clustering around the line very tightly. So when we see that the R is point 3792 05:47:56,542 --> 05:48:01,730 six, we're not surprised. I mean, it's on the high side, a moderate in my world, which 3793 05:48:01,730 --> 05:48:08,192 makes sense. But go look on the right one, you know, under the B one, look at how those, 3794 05:48:08,192 --> 05:48:12,190 you could almost connect the dots and get a line out of that. 
So that's really tightly 3795 05:48:12,190 --> 05:48:17,500 hugging the line. And then we're not surprised to see that the R is point nine, two. So that's 3796 05:48:17,500 --> 05:48:22,450 pretty strong. So I just wanted to give you these tutorials before we actually went forth, 3797 05:48:22,450 --> 05:48:28,300 and calculated r because that's one thing you can do is do the scatterplot have an expectation, 3798 05:48:28,300 --> 05:48:33,710 what r should look like. And then if you calculate R and it's totally wacky, you know that you 3799 05:48:33,710 --> 05:48:41,800 did something wrong. Okay, let's calculate our and let's use the computational formula. 3800 05:48:41,800 --> 05:48:49,200 Okay, I threw the formula up in the upper left, and don't feel overwhelmed by it, we're 3801 05:48:49,200 --> 05:48:55,641 going to take that apart very carefully, right. But before we even do that, I just want you 3802 05:48:55,641 --> 05:49:01,830 to have a flashback to chapter 3.2. c, all those sums of are those capital sigma was 3803 05:49:01,830 --> 05:49:09,150 in the equation. So we're going to handle calculating are a lot like we handled calculation, 3804 05:49:09,150 --> 05:49:14,730 calculating variance and standard deviation. We're going to make like a table with columns. 3805 05:49:14,730 --> 05:49:18,290 And then we're going to fill in those columns with calculations. And then we're going to 3806 05:49:18,290 --> 05:49:23,272 add up the columns to get all those numbers. So already you were good at that, and 3.2, 3807 05:49:23,272 --> 05:49:28,530 you'll be good at this too. And then I made up a story because it's a lot easier to check 3808 05:49:28,530 --> 05:49:34,990 your work if there's some story behind that and statistics. So pretend we have seven patients 3809 05:49:34,990 --> 05:49:39,330 that have been going to your clinic for a year. They're good patients, they keep coming. 3810 05:49:39,330 --> 05:49:46,760 So they came to the clinic over the year. 
And at the last visit of the year. You measured 3811 05:49:46,760 --> 05:49:48,390 the diastolic blood 3812 05:49:48,390 --> 05:49:53,522 pressure, and what you predicted was or what you thought would make sense as those with 3813 05:49:53,522 --> 05:49:57,890 a higher diastolic blood pressure would have had more appointments over the year because 3814 05:49:57,890 --> 05:50:01,532 probably they're trying to stabilize and run power. Sure, maybe they have other problems 3815 05:50:01,532 --> 05:50:06,910 that are driving it up. This makes perfect sense, right? So what you wanted to do is 3816 05:50:06,910 --> 05:50:11,240 see if you are right, so you're going to take the diastolic blood pressure at the last appointment 3817 05:50:11,240 --> 05:50:17,650 as your x, you know, because you think that that's maybe the explanatory variable, or, 3818 05:50:17,650 --> 05:50:22,890 you know, that would be the independent variable that would make it so have something to do 3819 05:50:22,890 --> 05:50:28,510 with whether or not they had a lot of appointments. And then you take why as the number of appointments 3820 05:50:28,510 --> 05:50:34,790 over the last year, because you'd say, Okay, hi, DBP probably means they have more appointments. 3821 05:50:34,790 --> 05:50:40,630 That's just your idea, maybe you're wrong, but we're gonna do that. Okay. So, um, I put 3822 05:50:40,630 --> 05:50:47,600 in the title, just a reminder, access DVP. And why is number of appointments so you don't 3823 05:50:47,600 --> 05:50:52,930 forget. And then we made up this tape. So look at the first column, it's just the patient 3824 05:50:52,930 --> 05:50:56,190 number, it's nothing, you know, exciting, we just want to keep track of which patient 3825 05:50:56,190 --> 05:51:05,280 is one, right. And then notice under x, we just have all of their dbps. So this patient, 3826 05:51:05,280 --> 05:51:11,612 one at the last appointment had a 70 mmHg, and patient two at 115, mmHg. 
That's 3827 05:51:11,612 --> 05:51:12,870 kind of alarming. 3828 05:51:12,870 --> 05:51:17,920 But these are fake data. So don't get worried about these patients. But anyway, we just 3829 05:51:17,920 --> 05:51:23,720 fill in x. And then also, when you have their chart out, you can look up how many appointments 3830 05:51:23,720 --> 05:51:28,100 they had over the last year, and patient one only had three, whereas patient two had like 3831 05:51:28,100 --> 05:51:33,730 45, which you can believe because sometimes they're coming in all the time to get stuff 3832 05:51:33,730 --> 05:51:40,300 adjusted. But then, you know, patient three only had 21 and patient four had seven. So you 3833 05:51:40,300 --> 05:51:44,860 can see these are the x-y pairs for each of these patients, right? And it's pretty simple 3834 05:51:44,860 --> 05:51:51,372 to go to the bottom and sum up each of the columns: we have the sum of x is 678 and the sum of 3835 05:51:51,372 --> 05:51:56,960 y is 166. And also, I'm reminding you of the r calculation, I put that in the upper right, 3836 05:51:56,960 --> 05:52:04,140 just so we can see what we're doing. I just want to call your attention to one of the 3837 05:52:04,140 --> 05:52:10,210 terms in there, which is sum of x, which I put in the parentheses here. And that we already 3838 05:52:10,210 --> 05:52:14,352 know, just from making the first part of this table and adding it up. So we already have 3839 05:52:14,352 --> 05:52:20,930 that thing. And now I just wanted to point out, if you saw the sum of x over here, it's 3840 05:52:20,930 --> 05:52:27,320 not exactly the sum of x, it's the sum of x y. So the y is mushed right next to it. That's 3841 05:52:27,320 --> 05:52:32,122 not sum of x, that's sum of x y. And that's later in the game, we're gonna put the sum 3842 05:52:32,122 --> 05:52:37,070 of x y at the bottom of the last column. So that first term there, that's not sum 3843 05:52:37,070 --> 05:52:46,880 of x, that's sum of x y. 
Okay, now downstairs, we see the sum of x, in parentheses, to the second, right? 3844 05:52:46,880 --> 05:52:53,320 And that looks an awful lot like the one next to it on the left, which says sum of x to 3845 05:52:53,320 --> 05:52:58,740 the second without parentheses, right? And so how do you tell the difference between the kind without the 3846 05:52:58,740 --> 05:53:05,532 parentheses and the kind with the parentheses? So this is how I do it. The rule is always, regardless of 3847 05:53:05,532 --> 05:53:11,880 what's going on, do what's in the parentheses first. So that's easy to do if you have parentheses. 3848 05:53:11,880 --> 05:53:17,870 If you've got the parentheses version, you know that the sum of x to the second with the parentheses 3849 05:53:17,870 --> 05:53:24,120 in it means you just take the sum of x, and you multiply the sum of x by itself. 3850 05:53:24,120 --> 05:53:29,790 Right? But what if you don't have any? Well, what I do is I say, well, if I did have some, 3851 05:53:29,790 --> 05:53:36,000 I'd do it this way. But if I don't have any, then I know I have to do the sum of the x 3852 05:53:36,000 --> 05:53:43,240 squared column, right? So that's where you take x times x, x times x, x times x on each line, 3853 05:53:43,240 --> 05:53:50,458 put it there and sum that. So that's how I go through it no matter where I am in statistics 3854 05:53:50,458 --> 05:53:57,280 or algebra. If I see that sum symbol and then the x squared, I first look for the parentheses. 3855 05:53:57,280 --> 05:54:02,590 If they're there, I know what to do. If they're not there, then I know you don't do the thing 3856 05:54:02,590 --> 05:54:07,680 where you just take the sum of x and square it, you have to go and look at the bottom of the 3857 05:54:07,680 --> 05:54:14,670 column of the x to the second column and take the sum of that. I hope this is helpful. 
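The parentheses rule can be checked on the two patients whose diastolic blood pressures are given earlier (70 and 115 mmHg). This is only a minimal sketch: the variable names are illustrative, and the other five patients are left out because their values are not restated in the transcript.

```python
# DBP values for patients 1 and 2 from the lecture's table.
x = [70, 115]

# No parentheses (sum of x squared): square each line first, then add the column.
sum_of_squares = sum(value * value for value in x)   # 4900 + 13225

# With parentheses ((sum of x) squared): add the column first, then square the total.
square_of_sum = sum(x) ** 2                          # 185 * 185

print(sum_of_squares)  # 18125
print(square_of_sum)   # 34225
```

The two results differ, which is why the presence or absence of parentheses has to be checked before filling in the formula.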
All 3858 05:54:14,670 --> 05:54:22,200 right, so as you can see, there's, I've shown you on the top of the equation is where you 3859 05:54:22,200 --> 05:54:28,890 just take the sum of X and the sum of Y. And on the bottom, I'm showing you where you take 3860 05:54:28,890 --> 05:54:34,530 those and you take the square of them. And then in the other term is the one where you 3861 05:54:34,530 --> 05:54:43,362 just take the sum of the call. All right. And so there you go. So what happened here? 3862 05:54:43,362 --> 05:54:51,310 Well, we filled an x to the second so if you go to a patient, 170 times 70 is 4900. That's 3863 05:54:51,310 --> 05:54:57,130 where we're getting that number. So you go through and then patient to 115 times 115 3864 05:54:57,130 --> 05:55:03,730 is 13,225. So you go from All those and then you sum those up. And that's what goes in 3865 05:55:03,730 --> 05:55:07,630 that first term. And then I'll bet you can guess what the next 3866 05:55:07,630 --> 05:55:08,790 slide is. 3867 05:55:08,790 --> 05:55:13,291 Surprise. Now we do the y one, so don't get confused because you kinda have to skip a 3868 05:55:13,291 --> 05:55:20,280 column there. So three times three is nine. And so that's why in the Y squared, I'm 45 3869 05:55:20,280 --> 05:55:26,500 times 45 is 2025. That's how we're doing those. You sum all that up, and then go look up at 3870 05:55:26,500 --> 05:55:35,230 the equation, that's where you put that sum of Y squared. Now we have x, y. And this reminds 3871 05:55:35,230 --> 05:55:39,890 me of a student I had before. She was really confused. She's like, Monica, I don't know 3872 05:55:39,890 --> 05:55:45,140 what to do with x, y, the x, y quantity. And I go, What do you mean? I mean, it's pretty 3873 05:55:45,140 --> 05:55:53,250 obvious. You just take x times y, like here, 70 times three is 210. She goes, x times y, 3874 05:55:53,250 --> 05:55:58,420 where's the times? 
Like, how do you know it's supposed to be times like, I don't see any 3875 05:55:58,420 --> 05:56:04,810 times. Right? I don't see any dimes either. Like there's no like, like, how do you know 3876 05:56:04,810 --> 05:56:10,270 to do that? Well, anyway, I'll just tell you, I guess, imagine, like a little multiplication 3877 05:56:10,270 --> 05:56:14,690 symbol between x and y. That's what's supposed to be there. That's what you're supposed to 3878 05:56:14,690 --> 05:56:19,320 imagine, I guess I was so used to looking at it was like, you're right, I guess you're 3879 05:56:19,320 --> 05:56:26,881 just supposed to assume that. So take x times y. So for patient two, we just took 115 times 3880 05:56:26,881 --> 05:56:33,960 45. And that's how we got 5175. So you go through each of those, it's a lot of processing. 3881 05:56:33,960 --> 05:56:39,180 And then you sum it up at the bottom, whoo, that's a big number. And then you see, I circled 3882 05:56:39,180 --> 05:56:45,140 it in the our equation. So I think we figured out where to put everything, obviously, n 3883 05:56:45,140 --> 05:56:49,480 is seven, right, because we have seven patients, you see a bunch of ends in there. So I think 3884 05:56:49,480 --> 05:56:57,910 we have all our ingredients. So let's move forward. So all I did here was rewrite the 3885 05:56:57,910 --> 05:57:03,872 exact same equation with all the ingredients in it, right. So like I said, the N is seven. 3886 05:57:03,872 --> 05:57:10,920 And so wherever you see n, you'll see a seven. See that sum of X, Y on the top, you see where 3887 05:57:10,920 --> 05:57:16,880 that goes, see some of x and some of y and then downstairs, you'll see I filled in all 3888 05:57:16,880 --> 05:57:23,930 those numbers too. 
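Written out, the computational formula with the column totals plugged in looks roughly like the block below. The formula is reconstructed from the spoken description of the slide (n times the sum of xy minus the sum of x times the sum of y on top, two square-root terms multiplied together on the bottom), so treat the layout as a sketch rather than a copy of the slide.

```latex
r = \frac{n\sum xy - \left(\sum x\right)\left(\sum y\right)}
         {\sqrt{n\sum x^{2} - \left(\sum x\right)^{2}}\;
          \sqrt{n\sum y^{2} - \left(\sum y\right)^{2}}}
  = \frac{7(18{,}458) - (678)(166)}
         {\sqrt{7(67{,}892) - (678)^{2}}\;
          \sqrt{7(6{,}768) - (166)^{2}}}
```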
Now, let me just talk to you a little bit about both levels, the numerator 3889 05:57:23,930 --> 05:57:29,920 and the denominator. In the numerator, because we have order of operations, you need to do 3890 05:57:29,920 --> 05:57:38,450 out the n times the sum of x y, that's seven times 18,458, you need to do that out first. 3891 05:57:38,450 --> 05:57:44,570 And then you need to do the other one, you know, the 678 times 166, and then after 3892 05:57:44,570 --> 05:57:48,050 you're done with those two things, you have to subtract the second one from the first 3893 05:57:48,050 --> 05:57:53,350 one. That's the order you have to do that in to get the numerator right. Now for the 3894 05:57:53,350 --> 05:57:58,510 denominator, it's a little bit the same, but a little more complicated. You see on the 3895 05:57:58,510 --> 05:58:06,500 left side, you have that seven times 67,892, you have to do that out. And then you have 3896 05:58:06,500 --> 05:58:14,090 678 squared, you have to do that out, then you have to take that and subtract it from the 3897 05:58:14,090 --> 05:58:19,460 first one. And after you have that, you take the square root of all of that, 3898 05:58:19,460 --> 05:58:24,720 and that's your first term. And then you still have to go over to the other one. You have 3899 05:58:24,720 --> 05:58:34,362 to take seven times 6,768. Keep that, then take 166 times 166. Keep that. That second term 3900 05:58:34,362 --> 05:58:38,220 you subtract from the first one. And after you're done with all that, you take the square 3901 05:58:38,220 --> 05:58:43,760 root of that, and then those two things you have to multiply together. So that's a lot 3902 05:58:43,760 --> 05:58:50,690 of work, and you have to do it in the right order. So here, I just wanted you to see how 3903 05:58:50,690 --> 05:58:57,660 you probably want to just work out this term separately first, and then work out this 3904 05:58:57,660 --> 05:59:03,330 term separately. 
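The order of operations described above can be followed step by step in a few lines of Python. This is a sketch that starts from the column totals quoted in the lecture instead of the full seven-row table; the variable names are made up for illustration.

```python
import math

# Column totals read off the lecture's table (n = 7 patients).
n = 7
sum_x = 678        # sum of the DBP column
sum_y = 166        # sum of the appointments column
sum_x2 = 67_892    # sum of the x-squared column
sum_y2 = 6_768     # sum of the y-squared column
sum_xy = 18_458    # sum of the xy column

# Numerator: do each product first, then subtract the second from the first.
numerator = n * sum_xy - sum_x * sum_y

# Denominator: work out each term separately, take its square root,
# then multiply the two square roots together.
left_term = math.sqrt(n * sum_x2 - sum_x ** 2)
right_term = math.sqrt(n * sum_y2 - sum_y ** 2)
denominator = left_term * right_term

r = numerator / denominator
print(numerator, round(denominator, 1), round(r, 3))  # 16658 17561.3 0.949
```

Keeping the numerator and the two denominator terms in separate variables mirrors the advice to work each piece out on its own before combining them.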
And just like that thing I was telling you about x y, those two terms, 3905 05:59:03,330 --> 05:59:06,730 once you work them out, you take the square root of the left one and the square root of 3906 05:59:06,730 --> 05:59:13,622 the right one, and you have to multiply them together to get the denominator. So this slide is to 3907 05:59:13,622 --> 05:59:20,880 help you see. I threw the numerator on, that was relatively easy. But these are the two 3908 05:59:20,880 --> 05:59:26,300 different numbers you should get from the left side of the denominator and the right 3909 05:59:26,300 --> 05:59:31,330 side of the denominator, just to check your work. And then of course, once you multiply 3910 05:59:31,330 --> 05:59:40,940 them by each other, you get this number, 17,561.3. So ultimately, what the calculation for r 3911 05:59:40,940 --> 05:59:46,930 comes down to is you're trying to calculate the numerator and you're trying to calculate 3912 05:59:46,930 --> 05:59:53,150 the denominator. And at the end, you divide the numerator by the denominator and you get 3913 05:59:53,150 --> 05:59:57,792 the answer, which is r. So we're going to do that now. 3914 05:59:57,792 --> 06:00:06,150 And here's what we got: we got this 0.949. And because we see that it's positive, then 3915 06:00:06,150 --> 06:00:12,480 we know it's a positive correlation. And then remember my opinion, and also probably everyone's 3916 06:00:12,480 --> 06:00:17,480 opinion, because if you round that up, you go point nine five, well, that's getting really 3917 06:00:17,480 --> 06:00:23,670 close to 1.0. So most people would agree that that's pretty strong. So how you would diagnose 3918 06:00:23,670 --> 06:00:31,920 this correlation is you would say it's positive, and it's strong. Okay, I just want to wrap 3919 06:00:31,920 --> 06:00:38,481 this up by giving you a few facts about r that I may not have covered yet. 
First, r 3920 06:00:38,481 --> 06:00:45,692 requires data with a bivariate normal distribution, which is something we didn't check before 3921 06:00:45,692 --> 06:00:50,542 doing our r in this class, because I just don't cover that. But please know, if you 3922 06:00:50,542 --> 06:00:55,958 take another statistics class, and they bring up r, they might talk about checking for 3923 06:00:55,958 --> 06:01:02,390 the bivariate normal distribution. So just know about that. Next, please know that r also 3924 06:01:02,390 --> 06:01:06,970 does not have any units. So other things that don't have units, remember, the coefficient 3925 06:01:06,970 --> 06:01:14,102 of variation didn't have any units. Some things just don't have units, and r is one of them. 3926 06:01:14,102 --> 06:01:21,880 Also, we did talk about how perfect linear correlation is where r equals negative 1.0, 3927 06:01:21,880 --> 06:01:28,102 that's if it's a negative correlation, or r equals 1.0, which is a positive correlation. 3928 06:01:28,102 --> 06:01:33,792 But I might not have mentioned that no linear correlation is r equals zero. Now, you probably 3929 06:01:33,792 --> 06:01:39,500 won't see that in real life. But sometimes I'll make an r, and the r is either positive 3930 06:01:39,500 --> 06:01:46,890 or negative, but it's 0.0000000-something, right? Regardless of whether it's positive 3931 06:01:46,890 --> 06:01:52,522 or negative, if it's 0.00000-something, it's really close to zero. So that means there's 3932 06:01:52,522 --> 06:01:59,420 probably, like, no linear correlation. And then we learned about positive or negative 3933 06:01:59,420 --> 06:02:05,122 r, but I just wanted to remind you of the behavior of x and y when you get those circumstances, 3934 06:02:05,122 --> 06:02:11,990 okay? So if you have a positive r, it means as x goes up, y goes up. But it also means 3935 06:02:11,990 --> 06:02:18,890 as x goes down, y goes down. So they travel together. 
When you get a negative r, it means 3936 06:02:18,890 --> 06:02:26,090 as x goes up, y goes down. But also it means the opposite: as x goes down, y goes up, so they 3937 06:02:26,090 --> 06:02:34,100 travel in opposite directions. Now, here's another fact about r, a little factoid: if 3938 06:02:34,100 --> 06:02:40,112 you choose to switch the axes, like let's say you give me x-y pairs, and 3939 06:02:40,112 --> 06:02:45,450 I designate a certain variable as x and the other one as y, and you actually designate 3940 06:02:45,450 --> 06:02:51,510 them the opposite, it really doesn't matter, even in the equation, because you'll end up 3941 06:02:51,510 --> 06:02:59,420 with the same r value. So it doesn't matter if you call my x y, and I call your, 3942 06:02:59,420 --> 06:03:05,390 you know, y x, like we can switch them, but you'll still end up with the same r with 3943 06:03:05,390 --> 06:03:12,090 the calculation. Then finally, even if you converted x and y to different units, you get 3944 06:03:12,090 --> 06:03:17,458 the same r. So let's say that you were in England, and you were doing the correlation 3945 06:03:17,458 --> 06:03:22,140 between height and weight. And you were using the metric system on the same patients that 3946 06:03:22,140 --> 06:03:28,070 I was using the US system on. Even though we'd have different numbers, because obviously you 3947 06:03:28,070 --> 06:03:36,130 have to convert them, we'd still get the same r when we're done. So finally, we get to 3948 06:03:36,130 --> 06:03:41,300 the last subject of this lecture, which is lurking variables, which you've heard about 3949 06:03:41,300 --> 06:03:46,080 before. But the main point I want to make is correlation is not causation. So you don't 3950 06:03:46,080 --> 06:03:47,110 want to be misled 3951 06:03:47,110 --> 06:03:50,140 by correlations. 3952 06:03:50,140 --> 06:03:54,202 So beware of lurking variables. 
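Before moving on, the two facts stated just before the lurking-variable topic (switching x and y gives the same r, and so does converting units) can be checked numerically. The heights and weights below are invented purely for illustration, and `pearson_r` is a hypothetical helper that applies the same computational formula used in the worked example.

```python
import math

def pearson_r(x, y):
    """Computational formula for r, applied to two equal-length lists."""
    n = len(x)
    sum_x, sum_y = sum(x), sum(y)
    sum_x2 = sum(v * v for v in x)
    sum_y2 = sum(v * v for v in y)
    sum_xy = sum(a * b for a, b in zip(x, y))
    numerator = n * sum_xy - sum_x * sum_y
    denominator = (math.sqrt(n * sum_x2 - sum_x ** 2)
                   * math.sqrt(n * sum_y2 - sum_y ** 2))
    return numerator / denominator

# Hypothetical heights (inches) and weights (pounds), not from the lecture.
height_in = [62, 65, 68, 70, 73]
weight_lb = [120, 150, 145, 180, 200]

# The same made-up patients in metric units.
height_cm = [h * 2.54 for h in height_in]
weight_kg = [w * 0.45359237 for w in weight_lb]

r_us = pearson_r(height_in, weight_lb)
r_swapped = pearson_r(weight_lb, height_in)   # x and y switched
r_metric = pearson_r(height_cm, weight_kg)    # units converted

print(math.isclose(r_us, r_swapped))  # True
print(math.isclose(r_us, r_metric))   # True
```

The comparison uses `math.isclose` rather than `==` because the unit conversion introduces tiny floating-point rounding differences.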
So remember, lurking variables are things lurking behind 3953 06:03:54,202 --> 06:03:59,970 the scenes that cause things, right? And so you may have realized that selecting x and 3954 06:03:59,970 --> 06:04:04,300 y, like if you have x-y pairs, designating which one is x and which one is y is kind 3955 06:04:04,300 --> 06:04:09,640 of political, because you're implying that x could cause y. So let's say that you're 3956 06:04:09,640 --> 06:04:15,980 correlating height and weight. Taller people are heavier. So you would call x height 3957 06:04:15,980 --> 06:04:20,320 and y weight. You know, people don't go, oh, I'm too short, I should gain weight 3958 06:04:20,320 --> 06:04:24,400 so I can grow taller. You know, that's just not the way things work. So you have to put 3959 06:04:24,400 --> 06:04:30,660 x as the height, and y as the weight. But there are, in reality, other causes of 3960 06:04:30,660 --> 06:04:35,042 weight besides height. In fact, there are things that cause both height and weight, 3961 06:04:35,042 --> 06:04:41,010 like genetics, right? So a genetic profile that leads to tallness and also obesity could 3962 06:04:41,010 --> 06:04:45,300 be a lurking variable in the relationship between height and weight. So there could 3963 06:04:45,300 --> 06:04:49,920 be some tall people that are always obese, and it's not really just because they're tall. 3964 06:04:49,920 --> 06:04:54,910 It could be because they have the genetics that programmed them to be tall and also obese, 3965 06:04:54,910 --> 06:05:02,240 right? And so here's an example where you've got to be real careful, um, with correlation. 3966 06:05:02,240 --> 06:05:06,708 So there's been this claim that eating ice cream causes murders, because they noticed 3967 06:05:06,708 --> 06:05:11,880 that in areas where ice cream sales go up, murder rates rise. And I don't know about 3968 06:05:11,880 --> 06:05:17,122 you, but when I have some really good ice cream, it just makes me so mad. 
I'm just kidding. 3969 06:05:17,122 --> 06:05:22,470 I mean, why would this happened? Right? Well, the reality is summer and warm weather are 3970 06:05:22,470 --> 06:05:28,640 lurking variables, because we sell more ice cream in the summer. You know, the ice cream 3971 06:05:28,640 --> 06:05:34,060 consumption goes up. But also people are outside more and more murders occur. And you know, 3972 06:05:34,060 --> 06:05:40,670 I from Minnesota, where it gets really cold for periods of the winter, and oh my gosh, 3973 06:05:40,670 --> 06:05:45,510 there are totally no murders, then, like people just don't commit murders, when it's really 3974 06:05:45,510 --> 06:05:51,160 frigid out, it's just really inconvenient. So that's a situation where there's a lurking 3975 06:05:51,160 --> 06:05:56,130 variable. And so you don't want to start, you know, screwing up our ice cream laws and 3976 06:05:56,130 --> 06:06:02,060 making it so we can have ice cream, just because you misappropriate that ice cream causes murders, 3977 06:06:02,060 --> 06:06:08,290 right? There's a lurking variable behind it, that's having something to do with both. Here's 3978 06:06:08,290 --> 06:06:14,260 another one. And this was my professor in my biostatistics class, they use the C put 3979 06:06:14,260 --> 06:06:22,330 up a really like a time series chart over a long time, like since the 1900s. And they 3980 06:06:22,330 --> 06:06:28,270 pointed out as people purchase more onions, the overtime is onion consumption goes up 3981 06:06:28,270 --> 06:06:32,970 and down. The stock market rises, right? So when the stock market slow, people aren't 3982 06:06:32,970 --> 06:06:39,720 eating as many onions. And this is just true over generations in the US. So um, yeah, we've 3983 06:06:39,720 --> 06:06:43,200 had some problems with our economy in the US, do you think we should all start eating 3984 06:06:43,200 --> 06:06:49,780 a bunch of onions, right? So the healthy economy is a lurking variable. 
And a healthy economy, 3985 06:06:49,780 --> 06:06:54,690 people buy more food, they including onions, and also a healthy economy boost the stock 3986 06:06:54,690 --> 06:06:59,862 market. So you got to be careful about this correlation is not causation. You know, and 3987 06:06:59,862 --> 06:07:04,220 so if you want to make the stock market go up, don't make everybody onions. And definitely 3988 06:07:04,220 --> 06:07:11,820 don't make a stop eating ice cream, that would make me very upset. So at the end of the day, 3989 06:07:11,820 --> 06:07:16,080 you're not going to be able to affect the murder rate by bringing down the ice cream 3990 06:07:16,080 --> 06:07:20,390 consumption rate. And you're not going to be able to fix the stock market by making 3991 06:07:20,390 --> 06:07:25,970 people eat onions. And so that's the whole concept behind lurking variables. And correlation 3992 06:07:25,970 --> 06:07:34,290 is not necessarily causation. So in conclusion, when you're doing your correlations, First, 3993 06:07:34,290 --> 06:07:38,612 make a scattergram because you want to get an idea visual idea of the strength in their 3994 06:07:38,612 --> 06:07:44,390 direction. And you also want to look for outliers, then go on and calculate are by hand, but 3995 06:07:44,390 --> 06:07:48,640 be really careful because it's a big hairy calculation. And you don't want to make any 3996 06:07:48,640 --> 06:07:54,090 mistakes. And then finally, when you go to interpret are Be careful of lurking variables. 3997 06:07:54,090 --> 06:08:01,580 And remember that correlation is not necessarily causation. And now, time for some ice cream. 3998 06:08:01,580 --> 06:08:10,592 Hello, it's Monica wahi, your library college lecturer here to ruin your day with chapter 3999 06:08:10,592 --> 06:08:18,060 4.2 linear regression and the coefficient of determination. 
So at the end of this probably 4000 06:08:18,060 --> 06:08:24,070 painstaking lecture, the student should be able to at least explain what the least squares 4001 06:08:24,070 --> 06:08:30,661 line is. Identify and describe the components of the least squares line equation, explain 4002 06:08:30,661 --> 06:08:37,480 how to calculate the residuals, and calculate and interpret the coefficient of determination, 4003 06:08:37,480 --> 06:08:45,870 or CD for short. Alright, so it's really cool if you have a crystal ball, because then you 4004 06:08:45,870 --> 06:08:50,442 can make predictions, right, you just look into the crystal ball. It's some nice equipment, 4005 06:08:50,442 --> 06:08:54,542 I've had friends who have them, they're very nice to put out on your dining room table 4006 06:08:54,542 --> 06:09:00,470 as the centerpiece. Unfortunately, though, they don't really play much into statistical 4007 06:09:00,470 --> 06:09:05,160 prediction. So what I'm going to show you in this lecture is how we use statistics for 4008 06:09:05,160 --> 06:09:10,240 prediction instead of this beautiful crystal ball. So we're going to start by talking about 4009 06:09:10,240 --> 06:09:14,390 what the least squares line is. And then we're going to talk about the least squares line 4010 06:09:14,390 --> 06:09:19,940 equation, which is the crystal ball thing we use only in statistics, okay. And then 4011 06:09:19,940 --> 06:09:24,230 we're going to talk about dealing with prediction using the least squares line. And finally, 4012 06:09:24,230 --> 06:09:29,740 we're going to talk about the coefficient of determination. So let's get started. And 4013 06:09:29,740 --> 06:09:36,690 let's get started with the term least squares. criterion, right? So remember, criteria is 4014 06:09:36,690 --> 06:09:43,000 plural and criterion is singular. 
And it means, well, criteria is stuff you need to meet, right, 4015 06:09:43,000 --> 06:09:48,200 to be eligible, like you have to meet the criteria for registration for college, right? Well, 4016 06:09:48,200 --> 06:09:53,012 the least squares criterion is just one, which is awesome, because then you only have 4017 06:09:53,012 --> 06:09:58,820 to meet one thing. So one of the things you probably wondered when you were watching the last 4018 06:09:58,820 --> 06:10:03,208 lecture is how do you know exactly where to draw this line when you have a scatterplot? 4019 06:10:03,208 --> 06:10:08,060 Like, how do you know where to make the line the most fair? So in the last chapter, when 4020 06:10:08,060 --> 06:10:12,060 we plotted the scattergrams, I just drew a line there for demonstration. But there 4021 06:10:12,060 --> 06:10:17,710 actually is an official rule as to where the line goes. Okay? And basically, the rule is 4022 06:10:17,710 --> 06:10:23,702 it has to meet the least squares criterion. Okay? If it meets that criterion, and there's only 4023 06:10:23,702 --> 06:10:27,790 one line that does, then that is where the line goes. So how do we 4024 06:10:27,790 --> 06:10:29,900 get to that? 4025 06:10:29,900 --> 06:10:37,450 Well, this is roughly what it looks like. When you draw the line, there is a vertical 4026 06:10:37,450 --> 06:10:44,470 distance from each of the dots to the line. Now, as you can see by the slide, sometimes 4027 06:10:44,470 --> 06:10:50,560 the dots are below the line, and sometimes they're above the line. And so the word squares 4028 06:10:50,560 --> 06:10:55,590 indicates that whether it's up or down, you're going to square it. So it's not going 4029 06:10:55,590 --> 06:10:59,872 to be negative anymore, because whenever you square a negative, it becomes positive. So 4030 06:10:59,872 --> 06:11:04,660 first, you're going to have to square all of these things. Okay? 
So imagine you were 4031 06:11:04,660 --> 06:11:10,970 just going to try it out, like, maybe draw this line, and then you calculate the squares, 4032 06:11:10,970 --> 06:11:14,830 and you'd be like, okay, that's how many, and then maybe you tilt the line a little and 4033 06:11:14,830 --> 06:11:20,000 calculate the squares again. And your goal would be, when you added up all the 4034 06:11:20,000 --> 06:11:25,450 squares, to have the least possible. So the line belongs where it causes the smallest sum of 4035 06:11:25,450 --> 06:11:32,390 squares for the whole data set. So if you're software, which you're not, you're a person, 4036 06:11:32,390 --> 06:11:36,640 right, but if you were software, you'd be figuring that out using your software brain, 4037 06:11:36,640 --> 06:11:41,680 as well as how exactly to tilt this line, and where exactly to put it to minimize these 4038 06:11:41,680 --> 06:11:48,110 squares. But we're people. So I'm going to go on and explain how people do this. So the 4039 06:11:48,110 --> 06:11:51,540 trick is, if you can figure out where the line goes, you can draw it on the scatterplot 4040 06:11:51,540 --> 06:11:56,410 and be right. But there is a challenge of knowing exactly where it belongs on the graph. 4041 06:11:56,410 --> 06:12:01,360 And then also, you're probably realizing you don't always have a graph to draw it on. Like 4042 06:12:01,360 --> 06:12:05,900 maybe you need to talk to somebody about where the line goes, and you can't draw a picture. 4043 06:12:05,900 --> 06:12:11,362 So how you explain where the line goes is you use an equation. And some of you may remember 4044 06:12:11,362 --> 06:12:15,730 this, and some of you may not, so I thought I'd do a little quick review of how lines 4045 06:12:15,730 --> 06:12:22,620 and equations relate. Okay, so we're going to get into the least squares line equation. 
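The try-a-line, tilt-it, compare-the-squares idea described above can be imitated in a few lines of Python. The points are invented for illustration, and the grid search is only a crude stand-in for the "software brain": real statistics software gets the least squares line from a direct formula rather than by trying candidates.

```python
# Hypothetical (x, y) points, made up for this sketch.
points = [(1, 2), (2, 3), (3, 5), (4, 4), (5, 6)]

def sum_of_squares(slope, intercept):
    """Add up the squared vertical distances from each dot to the line."""
    return sum((y - (slope * x + intercept)) ** 2 for x, y in points)

# Draw one line, then tilt it a little and recalculate the squares.
first_try = sum_of_squares(slope=1.0, intercept=1.0)
tilted = sum_of_squares(slope=0.9, intercept=1.3)

# Try many slopes and intercepts on a grid and keep the pair
# that gives the smallest sum of squares.
candidates = [(s / 100, i / 100) for s in range(0, 201) for i in range(0, 301)]
best_slope, best_intercept = min(candidates, key=lambda c: sum_of_squares(*c))

print(first_try, tilted)
print(best_slope, best_intercept, sum_of_squares(best_slope, best_intercept))
```

In this made-up data set the tilted line has a smaller sum of squares than the first try, so it is the better fit by the least squares criterion.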
4046 06:12:22,620 --> 06:12:26,430 But first, I'm going to give you a little flashback about algebra, and I'm sorry if 4047 06:12:26,430 --> 06:12:30,820 this is painful, um, this is hard for me, because I wasn't really that good at algebra. 4048 06:12:30,820 --> 06:12:35,270 But, um, this isn't statistics, this is algebra, but I just wanted you to remember 4049 06:12:35,270 --> 06:12:41,250 this part. Okay. So back in algebra, there was a chapter where you were given these 4050 06:12:41,250 --> 06:12:45,630 xy pairs, and it was different from statistics, because they all lined up on a line. See, 4051 06:12:45,630 --> 06:12:49,990 these pink dots are just perfectly on the line, okay, and these are the XY pairs. And 4052 06:12:49,990 --> 06:12:54,730 remember, you had to graph this kind of like we had to do scatter plots. And then you were 4053 06:12:54,730 --> 06:13:02,160 given this equation, y equals b x plus a, right? And that was the linear equation to 4054 06:13:02,160 --> 06:13:09,120 describe this line. And you were like, okay, I don't get how to put this equation together 4055 06:13:09,120 --> 06:13:14,192 with this line. And so first, the teacher would say, well, B stands for the slope of 4056 06:13:14,192 --> 06:13:18,500 the line, right? Because you have to know the slope, I mean, the line can be tilted 4057 06:13:18,500 --> 06:13:22,400 any which way. And so if you know the slope, you already know something about the line. 4058 06:13:22,400 --> 06:13:28,670 And in algebra, how you would make the slope is you calculate the rise over the run, right. 4059 06:13:28,670 --> 06:13:35,230 And so there, you know, B in algebra was rise over run, and you'd get the slope. And 4060 06:13:35,230 --> 06:13:40,150 then you'd be like, great. But you always needed another thing in order to define the 4061 06:13:40,150 --> 06:13:46,320 line. 
Because if you imagine this line is in an elevator, it could still have the same 4062 06:13:46,320 --> 06:13:53,970 slope, but go up or down, right, so we need to anchor it on the y axis somewhere. So a 4063 06:13:53,970 --> 06:14:00,510 stands for the y intercept, or where it spears through the y axis. And, as you can see by 4064 06:14:00,510 --> 06:14:07,032 the drawing, it looks like a is at zero comma zero, right? But you don't have to look at 4065 06:14:07,032 --> 06:14:13,670 it. What you can do in algebra to get a is, since you'd 4066 06:14:13,670 --> 06:14:19,942 filled in B, you just go grab an XY pair, and plug the x in and plug the y in, and then 4067 06:14:19,942 --> 06:14:25,500 plug in the B you just got, and back-calculate the y intercept, right. And that's how you 4068 06:14:25,500 --> 06:14:30,200 would get the whole linear equation. And so that's how you would do it in algebra. And 4069 06:14:30,200 --> 06:14:34,630 I just wanted to remind you of that because we do some similar things in statistics. It's 4070 06:14:34,630 --> 06:14:40,390 a little different. But I wanted to remind you how to connect what a line looks like 4071 06:14:40,390 --> 06:14:46,640 with how this equation works. All right. Well, welcome to statistics. Look, those pink things 4072 06:14:46,640 --> 06:14:52,300 are not on a line. So we want to make a line, but now you know about the least squares criterion. 4073 06:14:52,300 --> 06:14:58,320 What you're trying to do is make a line that minimizes the sum of squares, right? So here 4074 06:14:58,320 --> 06:15:03,520 we go. Um, remember how I was just talking about this linear equation back in algebra? 4075 06:15:03,520 --> 06:15:09,380 Well, notice the difference. The main difference here is the hat, right? The y is wearing a 4076 06:15:09,380 --> 06:15:15,080 hat. 
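Before going on with the hat, the algebra review from a moment ago (slope as rise over run, then back-calculating the y intercept from one xy pair) can be sketched like this. The two points are made up for illustration and sit exactly on a line, the way the pink dots did in algebra.

```python
# Two made-up points that sit exactly on a line (illustration only).
x1, y1 = 1, 5
x2, y2 = 3, 9

# Slope b is rise over run.
b = (y2 - y1) / (x2 - x1)

# Back-calculate the y intercept a by plugging one xy pair and b into y = b*x + a.
a = y1 - b * x1

print(f"y = {b}x + {a}")   # y = 2.0x + 3.0
```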
And that's universal in statistics: whenever you see a letter or a number wearing 4077 06:15:15,080 --> 06:15:20,300 a hat, it means it's an estimate. Okay? So of course, we're estimating y, because if 4078 06:15:20,300 --> 06:15:23,600 you look on that line, none of these dots actually falls 4079 06:15:23,600 --> 06:15:24,890 on that line. 4080 06:15:24,890 --> 06:15:29,910 And we don't really expect even an estimate to fall on that line, just close, right? You 4081 06:15:29,910 --> 06:15:35,080 know, because of the least squares, okay. And so we almost have, in a way, the same 4082 06:15:35,080 --> 06:15:39,980 goal we did back in algebra. We have to get that B, that slope. And then we have to use 4083 06:15:39,980 --> 06:15:46,440 that to back-calculate our a. Okay, so let's go on with that. Um, so like I said, in the 4084 06:15:46,440 --> 06:15:52,292 software approach, you just feed all the XY pairs in, and then the software just actually 4085 06:15:52,292 --> 06:15:57,730 prints out the B and the A, it just prints out the slope and the y intercept, which is 4086 06:15:57,730 --> 06:16:02,380 why I love the software. But we don't get to use that in our class. In our class, we 4087 06:16:02,380 --> 06:16:06,170 have to do the manual approach just because it's painful. And I had to do it too. So now 4088 06:16:06,170 --> 06:16:12,120 I'm making you do it, right? Mean. Okay, what we'll do is plug all the XY pairs into 4089 06:16:12,120 --> 06:16:17,130 an equation to get the slope, the B. And I promise you, I won't give you a ton of xy 4090 06:16:17,130 --> 06:16:22,532 pairs, you know, or you'll be there forever. But this next step, we have to do, we didn't 4091 06:16:22,532 --> 06:16:27,271 have to do in algebra. And that is we're going to have to go back to all of our x's, calculate 4092 06:16:27,271 --> 06:16:34,470 x bar, and go back to all of our y's and calculate y bar. 
Remember, that's the mean of the x's 4093 06:16:34,470 --> 06:16:38,952 and the mean of the y's. And you're probably wondering, well, why do we have to do that? 4094 06:16:38,952 --> 06:16:44,862 I'll show you again. But in case you didn't notice, those dots really didn't fall 4095 06:16:44,862 --> 06:16:51,272 on the least squares line, they fell around it, and you need a dot at least on that line to help back- 4096 06:16:51,272 --> 06:16:56,090 calculate that y intercept. And the rule of the least squares line, one of the rules of 4097 06:16:56,090 --> 06:17:03,600 it, is that x bar comma y bar is on that least squares line. So you can know, if you calculate 4098 06:17:03,600 --> 06:17:08,458 that out, that that's actually on the least squares line. Okay. And so finally, after 4099 06:17:08,458 --> 06:17:13,990 you do x bar and y bar, you plug in B, and you plug in x bar for the x, and you plug 4100 06:17:13,990 --> 06:17:22,790 in y bar for the Y hat to back-calculate the A. So it's a similar, but different process 4101 06:17:22,790 --> 06:17:29,230 from algebra. So the moral of the story is you need to recycle, right, we got to be good 4102 06:17:29,230 --> 06:17:33,710 to the environment. So what has happened? Well, you wouldn't be at this point in your 4103 06:17:33,710 --> 06:17:40,380 life of making a least squares line if you hadn't already started out by making a scatterplot, 4104 06:17:40,380 --> 06:17:46,022 and then deciding you wanted to do r, and then making r. And when you make r, you 4105 06:17:46,022 --> 06:17:50,250 end up with that big table, remember, and you end up with all these calculations, like 4106 06:17:50,250 --> 06:17:56,160 sum of x, sum of y, sum of x squared and sum of xy. Now you want to recycle those, 4107 06:17:56,160 --> 06:18:00,910 you want to save those calculations from r because they fit also into the equation for 4108 06:18:00,910 --> 06:18:06,790 b. So you want to recycle that. 
Also, you want to save the r you made, because you're 4109 06:18:06,790 --> 06:18:12,440 going to recycle that into the coefficient of determination, which I'll explain later. 4110 06:18:12,440 --> 06:18:17,320 And then this is not about recycling, you'll actually have to make this anew, but you 4111 06:18:17,320 --> 06:18:22,530 need to calculate x bar and y bar. Now you never needed to do that before now, but now 4112 06:18:22,530 --> 06:18:29,370 you need this. And so yeah, so get together your old r calculations, and then put your 4113 06:18:29,370 --> 06:18:36,890 x bar and y bar together and you'll be ready to do the least squares line equation. Alright, 4114 06:18:36,890 --> 06:18:43,840 so here's a flashback. Remember this big table? Remember our story, we had seven patients, 4115 06:18:43,840 --> 06:18:47,128 right? And x was their diastolic blood pressure at 4116 06:18:47,128 --> 06:18:48,128 the last 4117 06:18:48,128 --> 06:18:52,980 visit they had of the year. And then y was the number of appointments they had over the 4118 06:18:52,980 --> 06:18:57,150 year. And we thought, well, if your diastolic blood pressure, you know, goes up, then maybe 4119 06:18:57,150 --> 06:19:00,860 you need more appointments because it's a marker of being sick. I don't know. That was my little 4120 06:19:00,860 --> 06:19:06,543 story. Okay, so over on the right now we'll see the formula, we have the formula 4121 06:19:06,543 --> 06:19:12,730 we're using for B. The text gives you two formulas. Again, I've always got my favorite, it's the 4122 06:19:12,730 --> 06:19:19,110 one with the table, right? So here's the formula for B. And then after you calculate B, you'll 4123 06:19:19,110 --> 06:19:25,980 notice in the formula for a, b is in the formula for a, so you got to do B first, right. 
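Written out, the formulas being described look roughly like the following. The slide itself is not in the transcript, so this is a reconstruction from how the lecture walks through the numerator, the denominator, and the back-calculation of a, with n standing for the number of xy pairs.

```latex
b = \frac{n\sum xy - \left(\sum x\right)\left(\sum y\right)}
         {n\sum x^{2} - \left(\sum x\right)^{2}},
\qquad
a = \bar{y} - b\,\bar{x},
\qquad
\hat{y} = bx + a
```

Because b shows up inside the formula for a, b has to be calculated first.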
So 4124 06:19:25,980 --> 06:19:31,070 a lot of times students are a little confused on what the goal is here. The goal is, 4125 06:19:31,070 --> 06:19:35,570 if you look at the bottom of the slide, the goal is to come up with what B is and what 4126 06:19:35,570 --> 06:19:40,180 A is, and then fill it in. And that's your least squares line equation. So your least 4127 06:19:40,180 --> 06:19:45,650 squares line equation is always going to have a y hat in it. That's a variable 4128 06:19:45,650 --> 06:19:50,070 that just gets to stay there. It's always going to have that equals, and then after that, 4129 06:19:50,070 --> 06:19:54,531 whatever your B is, is going to be mushed up next to that x, so it's always gonna have that x 4130 06:19:54,531 --> 06:20:00,300 there. And then plus, and then whatever you get for a. And just as a trick, if a turns 4131 06:20:00,300 --> 06:20:07,140 out to be negative, then it ends up being minus a, right. But that's the generic equation. 4132 06:20:07,140 --> 06:20:11,581 And our goal is to calculate B and A and fill them in. And then we will say this is our 4133 06:20:11,581 --> 06:20:19,510 least squares line equation. Oh, remember how I was saying you actually need to make 4134 06:20:19,510 --> 06:20:23,910 some new calculations, right. So you need to make y bar and you need to make x bar. 4135 06:20:23,910 --> 06:20:29,060 And it's a little easier to show when I've got the columns up. If you look 4136 06:20:29,060 --> 06:20:32,900 at the bottom of the slide, remember how sum of x was 678, and remember how 4137 06:20:32,900 --> 06:20:39,890 our n is seven. And remember how sum of x divided by n is your x bar. 
And the same 4138 06:20:39,890 --> 06:20:44,550 goes for y, right, we have the sum of y divided by seven. I just wanted to quickly remind 4139 06:20:44,550 --> 06:20:48,970 you of this, that you need to generate these things before you can actually completely 4140 06:20:48,970 --> 06:20:56,460 finish the least squares line equation. I just summarized like that, I cut to the chase, 4141 06:20:56,460 --> 06:21:01,790 basically, I just summarized the actual numbers you're going to need and put them 4142 06:21:01,790 --> 06:21:06,660 over here. So we don't have to look at that whole big table anymore. Alright, and you'll 4143 06:21:06,660 --> 06:21:11,300 notice that I grayed out the sum of y squared because I realized later we don't really use 4144 06:21:11,300 --> 06:21:18,200 that. Okay, so let's look on the left side under the big list of numbers we have. 4145 06:21:18,200 --> 06:21:22,990 And you'll see the B equation that I filled in, right, and if you compare that to the 4146 06:21:22,990 --> 06:21:27,750 formula on the right side, you'll see what's going on. You know that n is seven, right? 4147 06:21:27,750 --> 06:21:32,950 So wherever you see that seven, that's where n is. Okay, then the top of the equation, remember 4148 06:21:32,950 --> 06:21:36,640 sum of xy, let's just look that up. Yeah, that's that big number 18,458. 4149 06:21:36,640 --> 06:21:46,290 I wanted to just be clear, you have to do out that left side, the seven times the 18,458, 4150 06:21:46,290 --> 06:21:51,730 you have to do that one out, and then do out the right side, which is that sum of x times 4151 06:21:51,730 --> 06:21:58,290 sum of y, which is 678 times 166, you have to do that one out. And then after that, you 4152 06:21:58,290 --> 06:22:04,380 have to subtract the right one from the left one, because of order of operations. Okay, 4153 06:22:04,380 --> 06:22:08,650 so that's how you make the numerator. 
Now let's just look downstairs. Again, we have 4154 06:22:08,650 --> 06:22:13,100 an n, so we know that's seven, and then that sum of x squared. And remember, it doesn't 4155 06:22:13,100 --> 06:22:18,651 have the parentheses around the sum of x. If it had the parentheses around it, you'd 4156 06:22:18,651 --> 06:22:23,600 be taking like 678 and squaring that, but it doesn't have the parentheses. So you have 4157 06:22:23,600 --> 06:22:30,272 to use that big number, 67,892. Okay. And again, like with the upstairs, you got to 4158 06:22:30,272 --> 06:22:36,520 do out that side of the equation, right, that term, you've got to multiply that out before 4159 06:22:36,520 --> 06:22:41,750 even looking at the rest of the equation, right. And then, oh, here we go. On the right 4160 06:22:41,750 --> 06:22:47,372 side of the denominator, we have sum of x in parentheses, squared. That's exactly the example I was 4161 06:22:47,372 --> 06:22:55,290 giving earlier. So you say 678 times 678. And you have to do that one out, right. And 4162 06:22:55,290 --> 06:22:58,690 then after you do that one out, and you do the first one out, then you subtract the second 4163 06:22:58,690 --> 06:23:02,981 one from the first one, remember order of operations. And if you do it right, you should 4164 06:23:02,981 --> 06:23:09,310 get, see below on the left side of the slide, you should get that for the numerator and that 4165 06:23:09,310 --> 06:23:13,590 for the denominator, and then you divide them out and you get 1.1. And that's your B, right. 4166 06:23:13,590 --> 06:23:21,140 So there you go. That's how you do it. And so now we got to worry about a. So what 4167 06:23:21,140 --> 06:23:28,010 I did was I just wrote B at the top there, so B is 1.1. And so now we can use B to try 4168 06:23:28,010 --> 06:23:34,042 and figure out a. So remember, look at my list. Remember, I did x bar and y bar for 4169 06:23:34,042 --> 06:23:40,590 you just so we had that ready. 
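The slope calculation just walked through can be sketched in Python with the totals quoted in the lecture. The variable names are only for illustration.

```python
# Totals quoted in the lecture for the seven patients.
n = 7
sum_x = 678
sum_y = 166
sum_xy = 18_458
sum_x2 = 67_892   # sum of the squared x's, not (sum of x) squared

# Do out each term first, then subtract (order of operations).
numerator = n * sum_xy - sum_x * sum_y
denominator = n * sum_x2 - sum_x ** 2

b = numerator / denominator
print(numerator, denominator)   # 16658 15560
print(round(b, 1))              # 1.1
```

Before rounding, the division comes out near 1.07, which the lecture reports as 1.1.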
So now we're going to calculate a by putting in y bar minus, 4170 06:23:40,590 --> 06:23:47,012 and remember order of operations again, we got to do the B, which is 1.1, times x bar. 4171 06:23:47,012 --> 06:23:51,740 So we do that one out first, and then subtract it from 23.7. And remember, I was 4172 06:23:51,740 --> 06:23:57,700 saying sometimes you get a negative a? Well, we got negative 80 for a. Alright, so we got 4173 06:23:57,700 --> 06:24:04,940 our B, we got our a, and let's go. Now, oh, if you want to check your work, this should 4174 06:24:04,940 --> 06:24:11,400 work out, right. Like you should be able to take the B times the x bar, right, which is 4175 06:24:11,400 --> 06:24:21,680 1.1 times 96.9, minus 80, you know, the a, and you should get 23.7. So if that works out, 4176 06:24:21,680 --> 06:24:26,080 then you know you did everything right. But remember what the goal was, the goal was to 4177 06:24:26,080 --> 06:24:31,510 actually fill in that least squares line equation. So if you look over on the right, that's what 4178 06:24:31,510 --> 06:24:37,010 we did. So we still have our Y hat, we still have our equals, now we have a 1.1 where the 4179 06:24:37,010 --> 06:24:43,450 B belongs. We still have that x, because those are variables, the y hat and the x, and then 4180 06:24:43,450 --> 06:24:48,080 we do minus 80, because we came out with a negative one. If it had been just plain 80, it 4181 06:24:48,080 --> 06:24:54,622 would say plus 80, okay. All right, at the beginning of this presentation, I teased you 4182 06:24:54,622 --> 06:24:58,500 that we were going to do prediction with the least squares line equation. We weren't going 4183 06:24:58,500 --> 06:25:03,180 to use a crystal ball. We were going to use this equation. Well, I finally get to that 4184 06:25:03,180 --> 06:25:08,940 exciting part of this presentation. But, and there's always a big but, I first have to 4185 06:25:08,940 --> 06:25:13,730 warm you up with some rules, right? 
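Looking back at the a calculation and the check-your-work step, here is a minimal sketch using the lecture's totals. One thing it suggests: the lecture's negative 80 matches what you get when the unrounded slope (about 1.07) is carried through, while plugging in the rounded 1.1, 96.9 and 23.7 gives about negative 82.9. That reading of the rounding is an inference from the totals, not something the lecture states.

```python
# Totals quoted in the lecture for the seven patients.
n = 7
sum_x, sum_y = 678, 166
sum_xy, sum_x2 = 18_458, 67_892

x_bar = sum_x / n   # about 96.9
y_bar = sum_y / n   # about 23.7
b = (n * sum_xy - sum_x * sum_y) / (n * sum_x2 - sum_x ** 2)   # about 1.07

# Back-calculate a from the point (x bar, y bar), which is on the line.
a = y_bar - b * x_bar
print(round(a))                    # -80

# Same step with every input rounded to one decimal first.
a_from_rounded = 23.7 - 1.1 * 96.9
print(round(a_from_rounded, 2))    # -82.89

# Check: plugging x bar back into the unrounded line returns y bar.
print(round(b * x_bar + a, 1))     # 23.7
```

So if your own check comes out a little off, rounding B before calculating a is one likely reason.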
First of all, I just want you to reflect on what 4186 06:25:13,730 --> 06:25:19,170 we just did, and realize that we can draw the least squares line. But unlike algebra, 4187 06:25:19,170 --> 06:25:24,730 our xy pairs probably aren't on it, right? Like in this example, none of the XY pairs 4188 06:25:24,730 --> 06:25:31,378 are on it. So you need to be sure about at least one xy pair that's actually going to 4189 06:25:31,378 --> 06:25:35,570 land on the least squares line. And the only one that you can be sure is going to land 4190 06:25:35,570 --> 06:25:41,850 on the least squares line is x bar comma y bar. And if you reflect on it, that's why we had 4191 06:25:41,850 --> 06:25:46,390 to calculate that, right, because we had to use x bar and y bar in the calculation to 4192 06:25:46,390 --> 06:25:52,730 back-calculate a, the y intercept. Now, you may be lucky and get a data set where there 4193 06:25:52,730 --> 06:25:58,560 is an xy pair that just happens to fall on the least squares line, or maybe even a couple 4194 06:25:58,560 --> 06:26:04,830 or maybe more. But you can't trust that. So if you need to trust that there's a point 4195 06:26:04,830 --> 06:26:10,708 on the least squares line, you know, it's always going to be x bar comma y bar. All 4196 06:26:10,708 --> 06:26:11,708 right. 4197 06:26:11,708 --> 06:26:18,773 And now I want to focus more succinctly on the slope, or B, right. So remember, we 4198 06:26:18,773 --> 06:26:24,321 just, in our example, calculated B and we got 1.1 for B, and that's a slope. So I want 4199 06:26:24,321 --> 06:26:29,520 to point out that the slope B of the least squares line tells us how many units the 4200 06:26:29,520 --> 06:26:35,870 response variable, or y, is expected to change for each one unit of change in the explanatory 4201 06:26:35,870 --> 06:26:41,410 variable, or x. So that's a little kind of a tongue twister. 
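The tongue twister about the slope can be checked with a few lines of Python, using the lecture's equation y hat = 1.1x - 80. The function name and the three starting x values are just for illustration.

```python
# Least squares line from the lecture: y hat = 1.1x - 80.
b, a = 1.1, -80

def y_hat(x):
    """Predicted y for a given x on the least squares line."""
    return b * x + a

# Raising x by one unit changes the prediction by b, wherever you start.
for x in (80, 100, 120):
    print(x, round(y_hat(x + 1) - y_hat(x), 1))   # each difference is 1.1
```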
But if you think of our 4202 06:26:41,410 --> 06:26:47,340 example, it's a little easier to understand. So the fact that that slope was 1.1 in our 4203 06:26:47,340 --> 06:26:52,260 example, and that we were having x be DBP and y be number of appointments over the last 4204 06:26:52,260 --> 06:26:57,930 year, what we're essentially saying by that is, for each increase of one mmHg of DBP, 4205 06:26:57,930 --> 06:27:04,720 or the x, for each increase of one of those, there is a 1.1 increase in the number of appointments 4206 06:27:04,720 --> 06:27:12,450 the patient had over the past year. So as DBP goes up by one, then the appointments 4207 06:27:12,450 --> 06:27:17,140 go up by 1.1. Well, I don't know what 1/10 of an appointment is, but you get what I'm 4208 06:27:17,140 --> 06:27:23,560 saying, because it's just a y, okay. And so the number of units change in the y for each 4209 06:27:23,560 --> 06:27:30,880 unit change in x is called the marginal change in the y. Which, if you sort of think about 4210 06:27:30,880 --> 06:27:37,630 it, that's 1.1. So 1.1 is the slope. But 1.1 is also the marginal change in the y for each 4211 06:27:37,630 --> 06:27:45,720 unit change in the x. Now, I also want to just recall for you this concept of influential 4212 06:27:45,720 --> 06:27:51,370 points, right. So like with r, if a point is an outlier, and remember, we should have 4213 06:27:51,370 --> 06:27:54,720 done a scatterplot and everything before we got to this point, because we need r, 4214 06:27:54,720 --> 06:27:59,958 we need all those sums of x's and sums of y's and sums of sums and whatever, right. 4215 06:27:59,958 --> 06:28:04,640 And so like with r, if a point is an outlier, and you can see it on the scatterplot, it 4216 06:28:04,640 --> 06:28:07,952 can really drastically influence the least squares line equation, just like it can 4217 06:28:07,952 --> 06:28:14,390 screw up r, right. And so an extremely high x or an extremely low x can do this. 
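To see how much one influential point can move the line, here is a minimal sketch. The data are made up for illustration (five points exactly on y = 2x, plus one added point with an extremely high x), and `fit_line` is a hypothetical helper that applies the sums-based formulas for b and a.

```python
def fit_line(xs, ys):
    """Return (b, a) for the least squares line using the sums-based formulas."""
    n = len(xs)
    sum_x, sum_y = sum(xs), sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_x2 = sum(x * x for x in xs)
    b = (n * sum_xy - sum_x * sum_y) / (n * sum_x2 - sum_x ** 2)
    a = sum_y / n - b * sum_x / n
    return b, a

# Made-up data for illustration: five points exactly on y = 2x.
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 6, 8, 10]
print(fit_line(xs, ys))                 # (2.0, 0.0)

# Add one point with an extremely high x and a y far from the pattern.
b_out, a_out = fit_line(xs + [20], ys + [5])
print(round(b_out, 2), round(a_out, 2))   # slope drops to about 0.02
```

One point out of six is enough to flatten the slope from 2 to almost nothing, which is why the scatterplot check comes first.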
And I 4218 06:28:14,390 --> 06:28:20,210 was just, you know, pointing out a culprit we have here on the scatterplot. So always 4219 06:28:20,210 --> 06:28:24,640 check your scattergram first for outliers, because you could end up in a situation where 4220 06:28:24,640 --> 06:28:27,930 you're making a least squares line and there's a bunch of outliers, you know, whacking it 4221 06:28:27,930 --> 06:28:35,352 out. Okay, now I'm gonna also bring up, you're probably like, when do we get to the prediction 4222 06:28:35,352 --> 06:28:38,900 part? I'm like, you just have to relax, I have to get through a few of these issues, 4223 06:28:38,900 --> 06:28:44,680 right? So one of them is the residual. And you know, the word residual, like it kind 4224 06:28:44,680 --> 06:28:49,080 of sounds like residue, right? Like, say, you know, somebody comes over and sets 4225 06:28:49,080 --> 06:28:53,180 their cup on your coffee table without using a coaster, and that leaves some residue and you 4226 06:28:53,180 --> 06:28:57,500 get all mad. Okay, well, that's kind of what a residual is. It's kind of like residue, 4227 06:28:57,500 --> 06:29:01,850 it's like something left over, right. So once the equation is there, once you make the least 4228 06:29:01,850 --> 06:29:07,160 squares line equation, there's something I just want you to notice. And that is you can 4229 06:29:07,160 --> 06:29:12,570 take each x, remember how we had seven patients, they each had an x, you can theoretically 4230 06:29:12,570 --> 06:29:19,310 take each x, plug it into the equation and get the Y hat out, right? So I want to just 4231 06:29:19,310 --> 06:29:25,128 demonstrate doing that. So we have our equation upper right here. So for patient one, I took 4232 06:29:25,128 --> 06:29:31,680 patient one's x, which was 70. And I plugged it in, 70 times 1.1 minus 80. You know, I put it 4233 06:29:31,680 --> 06:29:37,150 in the equation and I got negative three. 
Now that's y hat. The real y, I put it on 4234 06:29:37,150 --> 06:29:42,870 the screen here, is actually three. So as you can see, you know, it's not the same answer, 4235 06:29:42,870 --> 06:29:48,440 right? And then patient two, I did it with patient two also. I did 1.1 times 115, because 4236 06:29:48,440 --> 06:29:52,720 that's the x, and then minus 80, you know, because that's the rest of the equation. And 4237 06:29:52,720 --> 06:30:00,280 I got 46.5. Now that was a little closer, because look at patient two's y. That was 45. So 4238 06:30:00,280 --> 06:30:05,050 it's really close to this 46.5, that's a little bit better. But the reason I was doing all 4239 06:30:05,050 --> 06:30:12,480 that is I just wanted to tell you the residual is y minus y hat. So in the first case, we 4240 06:30:12,480 --> 06:30:17,480 have y hat was negative three and y was three. So patient one, we did three minus negative 4241 06:30:17,480 --> 06:30:21,850 three, and we got six, so that's the residual. It's kind of like residue, right? It's like 4242 06:30:21,850 --> 06:30:27,180 the residue left over between y hat and y, right. And then patient two, we did it again, 4243 06:30:27,180 --> 06:30:35,550 we took y, which is 45, minus y hat, which was bigger, it was 46.5. So we got negative 1.5. 4244 06:30:35,550 --> 06:30:40,660 So that's the residual. So this is how you calculate the residual. And this is what 4245 06:30:40,660 --> 06:30:46,952 it is, this is how you get it. But the bottom line is, you don't want big residuals, right? 4246 06:30:46,952 --> 06:30:52,410 Because that would mean the line didn't fit very well. So you'll find that if you have 4247 06:30:52,410 --> 06:30:56,650 a really good fitting line, you have very small residuals. And so you're probably like, 4248 06:30:56,650 --> 06:31:01,872 well, what's a good fitting line? 
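The two residuals just worked out can be reproduced with a short Python sketch. Only the two patients whose numbers are given in the lecture are included; the dictionary layout is just for illustration.

```python
# Least squares line from the lecture: y hat = 1.1x - 80.
b, a = 1.1, -80

# (x, y) for the two patients worked through in the lecture.
patients = {"patient 1": (70, 3), "patient 2": (115, 45)}

residuals = {}
for name, (x, y) in patients.items():
    y_hat = b * x + a
    residuals[name] = y - y_hat       # residual = y minus y hat
    print(name, round(y_hat, 1), round(residuals[name], 1))
# patient 1 -3.0 6.0
# patient 2 46.5 -1.5
```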
Well, we'll get to the coefficient of determination, and 4249 06:31:01,872 --> 06:31:07,740 that'll help you see what constitutes a good fitting line. 4250 06:31:07,740 --> 06:31:15,140 But first, I will get to the prediction part, okay. So you're done with your least squares 4251 06:31:15,140 --> 06:31:20,490 line equation, and you want to use it for prediction. So let's say you knew someone's 4252 06:31:20,490 --> 06:31:25,190 DBP, and you wanted to predict how many appointments she or he would have in the next year. Now, 4253 06:31:25,190 --> 06:31:29,010 what you're not doing is you're not reusing your x's from your data, 4254 06:31:29,010 --> 06:31:33,800 we just did that to make the residuals. What you're doing is actually imagining a new thing 4255 06:31:33,800 --> 06:31:39,900 out there. And you're gonna use this equation for prediction. So you could plug in the DBP 4256 06:31:39,900 --> 06:31:46,050 as an x, and get the Y hat out, and say that's your prediction, right? But you gotta use 4257 06:31:46,050 --> 06:31:51,380 some caution. If you use an x within the range of the original data, as you can see, 4258 06:31:51,380 --> 06:31:57,321 I put the x's up here, the range of the original data was like 70 to 125. Right, those 4259 06:31:57,321 --> 06:32:03,373 were, you know, the areas covered by x, right? If you do that, if you pick an x somewhere 4260 06:32:03,373 --> 06:32:07,110 in there, this type of prediction is called interpolation. And people feel pretty good 4261 06:32:07,110 --> 06:32:11,190 about it. But if you use an x from outside the range, like one that's smaller, 4262 06:32:11,190 --> 06:32:17,330 like 65, or one that's bigger, like 130, then it's called extrapolation. And then it's not 4263 06:32:17,330 --> 06:32:22,850 such a good idea, because you don't know if it's really going to work, right. So here, 4264 06:32:22,850 --> 06:32:28,250 I'm going to give you an example of interpolation. 
The patient in your study has a DBP of 80. 4265 06:32:28,250 --> 06:32:34,208 Okay, so 80's right in there, it's in that range. So let's use it, right. So we do it. 4266 06:32:34,208 --> 06:32:37,510 Now, this looks familiar to you, because we just did this when we did residuals, but we're 4267 06:32:37,510 --> 06:32:44,240 using a new person now. So 1.1 times 80, minus 80, equals eight. So 4268 06:32:44,240 --> 06:32:48,458 what we would do is predict that this patient would come to eight appointments next year. 4269 06:32:48,458 --> 06:32:53,420 So there, that's how we use our least squares line equation, like a crystal ball where we 4270 06:32:53,420 --> 06:33:00,740 can predict, right? So is it really this easy, right? Is this all you have to do to predict 4271 06:33:00,740 --> 06:33:08,020 the future? Well, it's not really that easy. You can't just make a linear equation out of any 4272 06:33:08,020 --> 06:33:14,050 old xy pairs. So remember this from our last lecture, see, the scatterplot. It looks like 4273 06:33:14,050 --> 06:33:19,050 what? A cloud, that's right. It doesn't have a linear relationship, you know, it doesn't look 4274 06:33:19,050 --> 06:33:23,970 like it should make a line. But you know what, you feed that stuff into the software, or 4275 06:33:23,970 --> 06:33:30,110 you feed that stuff into your B formula and your A formula, and you'll get a 4276 06:33:30,110 --> 06:33:35,010 line out of it, even if there's no linear correlation. And so if you get that line out 4277 06:33:35,010 --> 06:33:40,530 of some scatterplot that looks like this, then it's not a very good line, right? And 4278 06:33:40,530 --> 06:33:45,350 it wouldn't work very well for prediction, right? Because this looks pretty unpredictable. 4279 06:33:45,350 --> 06:33:51,080 So for that reason, we can't just accept any line that is handed to us. 
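The prediction step, together with the interpolation versus extrapolation caution, can be sketched like this. The equation and the 70 to 125 range come from the lecture; the `predict` function and its labels are hypothetical names for illustration.

```python
# Least squares line from the lecture: y hat = 1.1x - 80.
B, A = 1.1, -80
X_MIN, X_MAX = 70, 125   # range of DBP values in the original data

def predict(x):
    """Return (y hat, kind), where kind says whether x is inside the original x range."""
    kind = "interpolation" if X_MIN <= x <= X_MAX else "extrapolation"
    return B * x + A, kind

for dbp in (80, 65, 130):
    y_hat, kind = predict(dbp)
    print(dbp, round(y_hat, 1), kind)
```

For a DBP of 80 this gives the lecture's eight appointments. For 65 the equation returns a negative number of appointments, which is a hint of why extrapolation is not such a good idea.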
To evaluate if 4280 06:33:51,080 --> 06:33:55,700 our least squares line equation should be used for prediction, we need the coefficient 4281 06:33:55,700 --> 06:34:04,020 of determination. So here we are at the coefficient of determination. And so remember how I said 4282 06:34:04,020 --> 06:34:12,190 you have to recycle, recycle, recycle in this? Well, get out your r, time to recycle. So 4283 06:34:12,190 --> 06:34:19,140 the coefficient of determination is also called r squared. And it literally means r times 4284 06:34:19,140 --> 06:34:26,240 r. And I just have to add this on. Just like, remember the coefficient of variation? Remember 4285 06:34:26,240 --> 06:34:34,410 that one? We always turn r squared into a percent, right? And so you times it by 100%. 4286 06:34:34,410 --> 06:34:42,470 So in this example that we did, remember, early on in the last lecture, we did the r for this, 4287 06:34:42,470 --> 06:34:46,910 not for the scatterplot I just showed you, but for the one of DBP and the appointments, 4288 06:34:46,910 --> 06:34:52,820 right? And we got an r that was a really, really strong positive correlation, right, we got 4289 06:34:52,820 --> 06:34:57,510 point nine five. Well, if we want to calculate r squared, which is the coefficient of determination, 4290 06:34:57,510 --> 06:35:02,730 we take point nine five times point nine five, and we get point nine oh, but we got 4291 06:35:02,730 --> 06:35:08,441 to do that percent thing. So we end up with 90%. So this is how you say it, you say that 4292 06:35:08,441 --> 06:35:16,390 90% is the variation that's explained in y by the linear equation, right? So that's, 4293 06:35:16,390 --> 06:35:21,800 you know, y varies, right? Like how many appointments they had, you know, it was different for each 4294 06:35:21,800 --> 06:35:28,750 person. Well, 90% of that variation is explained by the equation. 
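The r squared arithmetic here is short enough to write out. This sketch uses the r of 0.95 quoted from the last lecture; the variable names are only for illustration.

```python
r = 0.95                        # correlation from the last lecture

r_squared = r * r               # coefficient of determination
explained_pct = r_squared * 100
unexplained_pct = 100 - explained_pct

print(round(r_squared, 2))      # 0.9
print(round(explained_pct))     # 90
print(round(unexplained_pct))   # 10
```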
And of course, if you take 4295 06:35:28,750 --> 06:35:34,960 100% minus 90%, there's 10% unexplained variation. So there's still some variation that could 4296 06:35:34,960 --> 06:35:40,750 be explained by other variables, but not a lot. And how you actually state it is, you know, 4297 06:35:40,750 --> 06:35:45,830 when you're done with this, if you were writing a paper, you'd say, 90% of the variation in 4298 06:35:45,830 --> 06:35:51,530 the number of appointments is explained by DBP. And I know people are like, explain, 4299 06:35:51,530 --> 06:35:55,650 like, it doesn't have a mouth, like, what is it talking about? You just have to say 4300 06:35:55,650 --> 06:36:00,330 it this way. It's statistics-ese, this is how you say it. And by 4301 06:36:00,330 --> 06:36:06,291 contrast, or by complement, what you would say is 10% of the variation in the number 4302 06:36:06,291 --> 06:36:12,120 of appointments is not explained by DBP. Right? It could be explained by other things. Well, 4303 06:36:12,120 --> 06:36:17,970 we happened to get a nice, I say CD for coefficient of determination, you know, we got a nice 4304 06:36:17,970 --> 06:36:23,600 high one. But what if it's low? Well, let's just think about it. CD should be better than 4305 06:36:23,600 --> 06:36:30,101 at least 50%, because that would be random, right? And the higher the better. So if you're 4306 06:36:30,101 --> 06:36:34,670 on a test, nobody's going to give you a CD of like 60% and say, is this any good? Because 4307 06:36:34,670 --> 06:36:39,442 I don't know, you'd be very conflicted. In real life, what I use it for is to compare 4308 06:36:39,442 --> 06:36:44,390 models. If one is 60% and the other's 55%, of course I'm going to go with the 60% one, 4309 06:36:44,390 --> 06:36:49,430 but it's still not very good, right. And if it's low, you know, the higher the better, 4310 06:36:49,430 --> 06:36:54,030 basically. 
And if it's low, it means that you probably need other variables to help 4311 06:36:54,030 --> 06:37:02,770 the x you used to explain more of the variation, because that x is not doing it. Okay, in summary, 4312 06:37:02,770 --> 06:37:08,080 I just wanted to go over chapter four, so you realize where we've been. Okay. So we 4313 06:37:08,080 --> 06:37:13,610 started out with a set of quantitative x, y pairs. First thing we did was we made a 4314 06:37:13,610 --> 06:37:17,458 scatterplot. We wanted to look at the linear relationship between x and y. And we wanted 4315 06:37:17,458 --> 06:37:23,080 to look at outliers. If we'd seen a lot of outliers, or no linear relationship, we would 4316 06:37:23,080 --> 06:37:27,810 have stopped there. But because this is a class and we had to learn, I forced it to be 4317 06:37:27,810 --> 06:37:32,680 a scatterplot with a linear relationship, and not too many outliers. So we could move forward 4318 06:37:32,680 --> 06:37:38,542 and do r. So we calculated r to see if our correlation was positive or negative, 4319 06:37:38,542 --> 06:37:43,510 and weak, moderate, or strong. So that's what you do if you find a linear relationship. 4320 06:37:43,510 --> 06:37:50,580 Next, in addition, in this lecture, we calculated b and a to come up with the least squares 4321 06:37:50,580 --> 06:37:56,580 line equation. And I just wanted you to notice that the sign on b will always match 4322 06:37:56,580 --> 06:38:02,200 the sign on r. So if you have a positive r, you'll have a positive slope, and if you have 4323 06:38:02,200 --> 06:38:06,890 a negative r, you'll have a negative slope. But otherwise, the numbers won't match, just the 4324 06:38:06,890 --> 06:38:12,470 sign. And then also, I wanted you to notice that strong correlations will give you a high 4325 06:38:12,470 --> 06:38:16,560 coefficient of determination, even if they're negative correlations, because remember, it's 4326 06:38:16,560 --> 06:38:22,042 r times r.
And so negative times negative is still positive, right? So if you have a 4327 06:38:22,042 --> 06:38:28,810 strong correlation, like negative 0.9 or 0.9, it really doesn't matter what 4328 06:38:28,810 --> 06:38:34,550 direction. If it's strong, then you're going to get a high coefficient of determination. 4329 06:38:34,550 --> 06:38:39,708 So after we did this b and a thing, we used that linear equation to calculate residuals, 4330 06:38:39,708 --> 06:38:44,810 right? Like, we took the x's from the original data and put them in, got the y hat, and calculated 4331 06:38:44,810 --> 06:38:51,660 the residuals. After that, we used r to calculate the coefficient of determination, or CD, to 4332 06:38:51,660 --> 06:38:56,320 decide if we wanted to use the linear equation for prediction. Because if it was bad, we 4333 06:38:56,320 --> 06:39:00,730 weren't going to do that. But we decided it was good for prediction at 90%. And we decided 4334 06:39:00,730 --> 06:39:06,708 to use it. So that was our journey through these x, y pairs all the way down to the coefficient 4335 06:39:06,708 --> 06:39:12,220 of determination. Good job, you made it. So in conclusion, the least squares criterion, 4336 06:39:12,220 --> 06:39:15,910 and calculating the least squares line, was the first thing we went over, how to do that 4337 06:39:15,910 --> 06:39:21,032 and what it all means. And then I reviewed some issues with prediction using the least 4338 06:39:21,032 --> 06:39:26,042 squares line, because it looks kind of easy. It looks kind of, you know, better than sliced 4339 06:39:26,042 --> 06:39:29,550 bread, but there are some things you have to think about. Finally, we went over the 4340 06:39:29,550 --> 06:39:35,410 coefficient of determination so that you could figure out how good your least squares line 4341 06:39:35,410 --> 06:39:41,600 equation was.
And I just wanted to point out that CD kind of looks like CDs, you know, 4342 06:39:41,600 --> 06:39:47,952 like we used to have CDs. They were so pretty and rainbowy like that. But now all CD means 4343 06:39:47,952 --> 06:39:57,530 is coefficient of determination. Hello, and welcome back to statistics. It's Monica Wahi, 4344 06:39:57,530 --> 06:40:03,600 your Laboure College lecturer, and you've made it to chapter seven. I broke up chapter seven 4345 06:40:03,600 --> 06:40:08,730 into bite-sized pieces. And we're going to start with chapter 7.1, talking about the 4346 06:40:08,730 --> 06:40:14,650 normal distribution and the empirical rule. So here are your learning objectives for this 4347 06:40:14,650 --> 06:40:19,950 lecture. At the end of this lecture, you should be able to state two properties of the normal 4348 06:40:19,950 --> 06:40:26,440 curve, state two differences between Chebyshev intervals and the empirical rule, and explain 4349 06:40:26,440 --> 06:40:30,090 how to apply the empirical rule to a normal distribution. 4350 06:40:30,090 --> 06:40:35,690 So, remember distributions? We learned about them a while back, but I'll remind you a little 4351 06:40:35,690 --> 06:40:39,920 bit about them. And then we're going to talk about properties of the normal distribution, 4352 06:40:39,920 --> 06:40:45,530 or specifically the normal curve, that shape that comes out of making a histogram of normally 4353 06:40:45,530 --> 06:40:50,320 distributed data. Then we're going to remember Chebyshev intervals. We're going to talk 4354 06:40:50,320 --> 06:40:54,800 about what Chebyshev did for us, and what Chebyshev really didn't do for us. And then 4355 06:40:54,800 --> 06:40:59,730 we're gonna move on to the empirical rule, which works very well, better than Chebyshev 4356 06:40:59,730 --> 06:41:03,920 intervals, when you have normally distributed data.
And then I'm going to show you an example 4357 06:41:03,920 --> 06:41:10,860 of how to apply the empirical rule to that normally distributed data. So remember, the 4358 06:41:10,860 --> 06:41:15,740 normal distribution, in fact, remember distributions at all right? So to get a distribution, and 4359 06:41:15,740 --> 06:41:19,500 a lot of people sort of forget this, by the time we get to chapter seven, but I just wanted 4360 06:41:19,500 --> 06:41:25,032 to remind you, this is from an earlier lecture, we had a quantitative variable, which was 4361 06:41:25,032 --> 06:41:30,532 how far a patient's had been transported. And we determined classes, and we made a frequency 4362 06:41:30,532 --> 06:41:36,120 table. So remember that. And then after that, we made a frequency histogram, and then made 4363 06:41:36,120 --> 06:41:40,530 a shape. And as you could see that shape, which is the distribution, that shape in this 4364 06:41:40,530 --> 06:41:45,780 one was skewed, right, see that light on the right, okay, but that's an example of something 4365 06:41:45,780 --> 06:41:50,352 we cannot apply the empirical rule to, because the empirical rule only applies to normally 4366 06:41:50,352 --> 06:41:56,980 distributed data. So I had to give you an example of that. And here's my example. So 4367 06:41:56,980 --> 06:42:02,000 when I was in my undergraduate in costume design at the University of Minnesota, they 4368 06:42:02,000 --> 06:42:06,700 made us take a chemistry class and one of those big lecture halls. So I was in a very 4369 06:42:06,700 --> 06:42:11,420 large class that probably had about 100 people. And we were given this really difficult test, 4370 06:42:11,420 --> 06:42:17,220 it was 100 point test, and I was used to getting like A's. And so when they were done with 4371 06:42:17,220 --> 06:42:22,060 the test, the T A's, were handing the tests back to everybody. 
So they could see their 4372 06:42:22,060 --> 06:42:26,670 grade, while the professor was writing on the board, and was reading the frequency of 4373 06:42:26,670 --> 06:42:32,870 all the different scores. And I remember the TA handed me my test, and it said 73 on it. 4374 06:42:32,870 --> 06:42:39,800 And I'm used to getting like 90s, up to 100. And I remember stating out loud, saying 73, 4375 06:42:39,800 --> 06:42:44,150 that is an awful score, I can't believe I did so badly. I was talking like that. But 4376 06:42:44,150 --> 06:42:48,490 at the same time, the professor was writing the frequencies on the board. And what I realized 4377 06:42:48,490 --> 06:42:57,010 is the top score was in the 80s. And I had the third top score was 73. That's how hard 4378 06:42:57,010 --> 06:43:02,030 the test was. And that's a nice Shut up, because I noticed everybody giving me dirty looks 4379 06:43:02,030 --> 06:43:08,500 because they had scored actually below me. So I wanted you to imagine that class. And 4380 06:43:08,500 --> 06:43:13,910 I imagined what the normal distribution would look like for that class with the distribution 4381 06:43:13,910 --> 06:43:18,670 of the scores. And the reason why I thought it would be normal is because we all did badly, 4382 06:43:18,670 --> 06:43:25,380 right. And so nobody got 100. So we were all below the 100. So I imagined this curve here 4383 06:43:25,380 --> 06:43:30,620 for you. And I imagined my class, I had 100 people just to make it easy. Of course, the 4384 06:43:30,620 --> 06:43:36,200 test was difficult. And nobody got 100 points. And the mode, the median. And the mean, were 4385 06:43:36,200 --> 06:43:41,600 all near see great, because you remember how, when you have a normal distribution, the mode, 4386 06:43:41,600 --> 06:43:48,800 median, and mean are all on top of each other. So we all did pretty badly. 
So I'm going to 4387 06:43:48,800 --> 06:43:56,458 use this example of the fake chemistry test scores to exemplify these properties 4388 06:43:56,458 --> 06:44:01,220 of the normal curve. So there's five I'm going to talk about. The first is that the curve 4389 06:44:01,220 --> 06:44:04,990 is bell shaped with the highest point over the mean. And so you can see I drew a scribbly 4390 06:44:04,990 --> 06:44:08,800 little curve, put a little arrow there to show you that that's where the mean of the 4391 06:44:08,800 --> 06:44:14,150 scores was. And then I also wanted you to notice that the curve is symmetrical with 4392 06:44:14,150 --> 06:44:19,452 a vertical line through the mean. So there's like a mirror image of the curve on either 4393 06:44:19,452 --> 06:44:25,280 side. Now, it's not perfect, obviously. But it should be roughly like that. And you know, 4394 06:44:25,280 --> 06:44:30,930 this is not true of skewed or bimodal or these other things we've been talking about. 4395 06:44:30,930 --> 06:44:37,150 Okay, and the third property is that the curve approaches the horizontal axis but never touches 4396 06:44:37,150 --> 06:44:43,240 it. You don't have to memorize this, but remember asymptote, or asymptotically close? That's 4397 06:44:43,240 --> 06:44:46,370 when a line gets really close to another line, but they never touch. 4398 06:44:46,370 --> 06:44:50,660 It's so romantic. But anyway, that's a very Bollywood thing to say, by the way, but uh, 4399 06:44:50,660 --> 06:44:57,080 so the curve approaches the horizontal axis and never touches or crosses. And then also 4400 06:44:57,080 --> 06:45:02,320 there's this inflection, or these transition points between cupping upward and downward. 4401 06:45:02,320 --> 06:45:07,022 And these transition points occur at about the mean plus one standard deviation and 4402 06:45:07,022 --> 06:45:11,800 about the mean minus one standard deviation. And this is a little hard to explain.
But 4403 06:45:11,800 --> 06:45:17,080 imagine you're on a roller coaster and you're going up this normal curve. There's this part 4404 06:45:17,080 --> 06:45:21,200 where you're just mainly going on, well, the part where it seems to kind of level out and 4405 06:45:21,200 --> 06:45:26,792 you're at the top of the curve, he starts to relaxing. That's that inflection point. 4406 06:45:26,792 --> 06:45:30,040 And so as you're going over in the roller coaster, and you're in that flat part, and 4407 06:45:30,040 --> 06:45:35,090 then you start kind of going down, that's the second inflection. So that's where what 4408 06:45:35,090 --> 06:45:38,430 it's saying about is the property of this curve is that you have these inflection points 4409 06:45:38,430 --> 06:45:43,628 like that. And they roughly occur at plus or minus one standard deviation above and 4410 06:45:43,628 --> 06:45:49,250 below the mean. Then finally, and I call it this, and just so you could see it, the area 4411 06:45:49,250 --> 06:45:54,490 under the entire curve is one, so think 100%. So it would be nice if that were a square 4412 06:45:54,490 --> 06:45:59,192 or rectangle, or even a triangle, something that we're used to in geometry, but it's not, 4413 06:45:59,192 --> 06:46:03,830 it's this goofy shape, right? But still, you need to get it in your head that that shape 4414 06:46:03,830 --> 06:46:11,510 is worth 1.0 in proportion land, or 100% in percent land. And what I mean by that is, 4415 06:46:11,510 --> 06:46:17,370 let's say we cut that shape and half, the, each side would have 50% or point five on 4416 06:46:17,370 --> 06:46:23,532 it, then let's cut it a different way. So the part of the curve on the right side of 4417 06:46:23,532 --> 06:46:28,390 that line is a fourth of the curve, or 25% of the curve, even though it's goofy shaped, 4418 06:46:28,390 --> 06:46:33,208 and the part on the left side is 75%. 
So that's what we're trying to get you to think like 4419 06:46:33,208 --> 06:46:38,760 is that, yeah, you can just declare that all the area under the curve equals one or 100%. 4420 06:46:38,760 --> 06:46:42,140 But the reason why we're declaring that is because we're gonna cut it up and say different 4421 06:46:42,140 --> 06:46:48,860 amounts of percent of the curve. Now we get to the empirical rule, since we reviewed this 4422 06:46:48,860 --> 06:46:54,000 whole curve thing, and I'm going to make you remember Chebyshev. I'm sorry, but you know, 4423 06:46:54,000 --> 06:46:58,680 let's talk about Chebyshev. Chebyshev helped us get some intervals, right? And intervals 4424 06:46:58,680 --> 06:47:02,690 have boundaries, or limits; they have a lower limit and an upper limit. That's how you know 4425 06:47:02,690 --> 06:47:07,860 what bounds the interval. So when we were doing Chebyshev intervals, what we would do 4426 06:47:07,860 --> 06:47:12,080 is we'd figure out a lower limit and upper limit, and we'd say at least so much percent 4427 06:47:12,080 --> 06:47:17,110 of the data falls in the interval, right? So when we would choose the lower limit of 4428 06:47:17,110 --> 06:47:22,458 mu minus two times the standard deviation, and the upper limit was mu plus two times 4429 06:47:22,458 --> 06:47:28,530 the standard deviation, we would say at least 75% of the data were in the interval. So I 4430 06:47:28,530 --> 06:47:33,090 wanted to just show you a demonstration using my fake class. So remember, there were 100 4431 06:47:33,090 --> 06:47:37,550 students in the class. I actually came up with a mu for them. And their mu on the test 4432 06:47:37,550 --> 06:47:44,060 was 65.5. So my 73 was better than the mean, but not much better, right? So the mu for 4433 06:47:44,060 --> 06:47:51,200 that class was 65.5. And the standard deviation was 14.5. So I calculated 4434 06:47:51,200 --> 06:47:56,970 this Chebyshev interval for 75% of the data.
So I took 65.5 minus two times 14.5. 4435 06:47:56,970 --> 06:48:02,440 And I got 36.5, which is a pretty bad grade. And then the upper limit was pretty good, 4436 06:48:02,440 --> 06:48:09,570 right? 65.5 plus two times 14.5 equals 94.5. On a 100 point test, that's a pretty good grade, 4437 06:48:09,570 --> 06:48:14,340 right? So if you had 100 data points, or 100 students, at least 75 would have scored between 4438 06:48:14,340 --> 06:48:20,240 36.5 and 94.5. So you're probably already realizing, okay, that doesn't really help 4439 06:48:20,240 --> 06:48:27,170 Monica, who scored 73. And this is a really wide range. We say at least 75% of people 4440 06:48:27,170 --> 06:48:32,040 score there; you could probably guess that without even knowing about Chebyshev intervals, 4441 06:48:32,040 --> 06:48:36,910 right? So it didn't really help me narrow down, like, how well is this class doing? If 4442 06:48:36,910 --> 06:48:40,820 I had had the mu and the standard deviation, I could have calculated this and said, okay, 4443 06:48:40,820 --> 06:48:43,820 I'm no better off. 4444 06:48:43,820 --> 06:48:49,650 So Chebyshev's theorem, on the left side, applies to any distribution. You don't 4445 06:48:49,650 --> 06:48:53,680 need a normal distribution; you can use that skewed distribution. Also, you'll notice it 4446 06:48:53,680 --> 06:48:59,050 says at least. So like this was at least 75% of the data fell in there. Maybe even 100% 4447 06:48:59,050 --> 06:49:03,500 fell in there. So it doesn't really help us. And as you go, like, you start with two standard 4448 06:49:03,500 --> 06:49:09,640 deviations. If you go out three, it's 88.9%. And four, it's 93.8%. You know, you might 4449 06:49:09,640 --> 06:49:14,680 as well start at the beginning and say almost 100% of the data falls in this interval. And 4450 06:49:14,680 --> 06:49:19,208 if you're saying that, it's not very useful, right?
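To make the Chebyshev numbers above concrete, here is a small Python sketch. It uses the standard Chebyshev bound, at least 1 - 1/k² of the data within k standard deviations of the mean, which is consistent with the 75%, 88.9%, and 93.8% quoted for k = 2, 3, and 4. The function name `chebyshev_interval` is just an illustrative choice; mu = 65.5 and the standard deviation of 14.5 come from the fake class.

```python
def chebyshev_interval(mu: float, sigma: float, k: float):
    """Return (lower limit, upper limit, minimum percent of data inside).

    Chebyshev's theorem: at least 1 - 1/k**2 of any distribution lies
    within k standard deviations of the mean (for k > 1).
    """
    lower = mu - k * sigma
    upper = mu + k * sigma
    at_least = (1 - 1 / k**2) * 100
    return lower, upper, at_least


mu, sigma = 65.5, 14.5  # fake chemistry class from the lecture
for k in (2, 3, 4):
    lower, upper, at_least = chebyshev_interval(mu, sigma, k)
    print(f"k={k}: {lower} to {upper}, at least {at_least:.1f}% of scores")
```

Notice that for k = 3 the upper limit is already past 100 points, which is one more sign of how wide these intervals get.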
But it kind of gets stuck doing that 4451 06:49:19,208 --> 06:49:24,240 because Chebyshev's theorem applies to any distribution. The empirical rule is much more 4452 06:49:24,240 --> 06:49:30,700 elite. It only applies to the normal distribution. And you'll see why, if you are lucky enough 4453 06:49:30,700 --> 06:49:35,458 to get the normal distribution, you want to use the empirical rule instead of Chebyshev. 4454 06:49:35,458 --> 06:49:39,690 Okay? Because secondly, the empirical rule says approximately. It doesn't say at least. 4455 06:49:39,690 --> 06:49:46,042 So it's saying basically, not at least, it's saying about exactly this. So you can trust 4456 06:49:46,042 --> 06:49:52,910 it. Okay, you don't have like this unknown, like maybe 100%. So this 4457 06:49:52,910 --> 06:49:58,022 is what it says, and I'll show you a diagram of it, but it says that 68% of the data are 4458 06:49:58,022 --> 06:50:03,920 in the interval mu plus or minus one standard deviation. So mu minus one standard 4459 06:50:03,920 --> 06:50:08,660 deviation all the way up to mu plus one standard deviation, 68% of the data are in there. And 4460 06:50:08,660 --> 06:50:12,140 you'll notice that Chebyshev didn't even say anything about one standard deviation. 4461 06:50:12,140 --> 06:50:18,690 And so already, we've got something way more useful if we apply the empirical rule, right? 4462 06:50:18,690 --> 06:50:25,532 So next we go to 95% of the data are in the interval mu plus or minus two standard deviations, 4463 06:50:25,532 --> 06:50:31,640 95%, approximately 95% are in there. Now, if we had used Chebyshev, we'd be saying 4464 06:50:31,640 --> 06:50:38,290 about this too, we'd be saying 75%. Okay, we'd be saying at least 75%, which could be 4465 06:50:38,290 --> 06:50:39,290 95%.
4466 06:50:39,290 --> 06:50:44,730 But here, if we're using the empirical rule, we're relatively sure that it's 95% between 4467 06:50:44,730 --> 06:50:51,070 mu plus or minus two standard deviations you can like better, right? Finally, if you get 4468 06:50:51,070 --> 06:50:55,840 out to three standard deviations, you're kind of running out of data, because 99.7%, almost 4469 06:50:55,840 --> 06:51:00,708 all of them fall in that interval. So as you can see, the empirical rule is going to give 4470 06:51:00,708 --> 06:51:06,890 you a more specific answer. But again, you can only use it if you have a normal distribution, 4471 06:51:06,890 --> 06:51:11,878 but which we do. So let's go look at that. Okay, this is a diagram that I'm going to 4472 06:51:11,878 --> 06:51:16,872 help I made it myself, actually, because I thought it was the other diagrams I saw were 4473 06:51:16,872 --> 06:51:21,770 not pretty. And this one is very pretty in my mind, but let me unpack this diagram for 4474 06:51:21,770 --> 06:51:22,770 you, because there's 4475 06:51:22,770 --> 06:51:23,980 a lot going on. And 4476 06:51:23,980 --> 06:51:28,350 first of all, I want you to notice the shape of it, it's a normal distribution, okay. And 4477 06:51:28,350 --> 06:51:32,120 then I want you to notice that I put this black line down the middle, and I put a little 4478 06:51:32,120 --> 06:51:37,261 arrow that says mu. So this is where we want to imagine mu, it's no matter what your what 4479 06:51:37,261 --> 06:51:43,610 your actual numbers are from you. Like in our case, this is 65.5 for our points. 
Just 4480 06:51:43,610 --> 06:51:47,820 imagine whatever your mu is, and whatever your standard deviation is, this is where 4481 06:51:47,820 --> 06:51:52,940 you would put the mu, right? Then you'll notice that each of these sections that's 4482 06:51:52,940 --> 06:51:58,850 colored has a little standard deviation symbol in it, because that's representing that 4483 06:51:58,850 --> 06:52:04,700 the width of that is one standard deviation. So if your standard deviation was like five, 4484 06:52:04,700 --> 06:52:09,670 then it would be mu plus or minus five. Like, the green one would be mu plus one standard 4485 06:52:09,670 --> 06:52:13,958 deviation, so it'd be the mean plus five, and then you draw that parallel line there and 4486 06:52:13,958 --> 06:52:18,060 see that arrow that says mu plus one standard deviation, that would be there. And of course 4487 06:52:18,060 --> 06:52:21,720 I just had to use the symbols, because I don't know how big the standard deviation 4488 06:52:21,720 --> 06:52:27,040 really would be, or what the mean really would be. But whatever it was, mu plus one standard 4489 06:52:27,040 --> 06:52:32,452 deviation, if you go up there, you would see that that green area represents 34% of the 4490 06:52:32,452 --> 06:52:37,850 data. And if you're lucky enough to have exactly 100 people, like I did in my demonstration, 4491 06:52:37,850 --> 06:52:43,220 that would mean that between mu and mu plus one standard deviation of these test scores 4492 06:52:43,220 --> 06:52:49,170 would be 34 people's scores, right? So you can really figure that out. Same with the 4493 06:52:49,170 --> 06:52:54,550 yellow section, only that's mu minus one standard deviation, and 34% of the scores would be 4494 06:52:54,550 --> 06:52:55,550 between those two 4495 06:52:55,550 --> 06:52:57,180 numbers.
4496 06:52:57,180 --> 06:53:03,378 Now you'll see as you get up into the blue, that's between one and two standard deviations 4497 06:53:03,378 --> 06:53:08,330 above the mu, you'll see that because the roller coaster's a lot lower to the ground 4498 06:53:08,330 --> 06:53:14,180 there, that section is really small. It's only 13.5% of the data. And the same with 4499 06:53:14,180 --> 06:53:18,210 the orange one that's on the other side of the mu. So that's below the mean. And that's 4500 06:53:18,210 --> 06:53:22,750 only 13.5%. And then you'll notice that at three standard deviations, between two and 4501 06:53:22,750 --> 06:53:27,850 three, there's a little tiny piece, right, the purple piece and the red piece. Those 4502 06:53:27,850 --> 06:53:35,720 are only worth 2.35% of this shape. And then I wanted to point out there is some stuff 4503 06:53:35,720 --> 06:53:41,570 at the end, in the little black part beyond three standard deviations on either side. 4504 06:53:41,570 --> 06:53:46,220 There's 0.15%. And a lot of times people forget that. But one way you can make sure 4505 06:53:46,220 --> 06:53:50,460 that you remember that it's there is that if you add up all these percents on 4506 06:53:50,460 --> 06:53:56,790 the slide, you'll get 100%, because remember, I promised you that the whole curve 4507 06:53:56,790 --> 06:54:01,520 is worth 100%. And this is how we split it up. I also want you to notice that there's 4508 06:54:01,520 --> 06:54:08,290 kind of a cheat, right? If you just add up the green, blue, purple, and then the little 4509 06:54:08,290 --> 06:54:11,910 black part at the end, if you just add up those percents, you'll get 50%, right, because 4510 06:54:11,910 --> 06:54:17,192 that's half the curve. And you'll get the same thing if you do the yellow, orange, 4511 06:54:17,192 --> 06:54:21,640 red, and the little part in the black at the bottom.
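The adding-up check described here can be sketched in a few lines of Python. The band percents are the ones read off the empirical rule diagram; the list name `BANDS` and the band labels are just illustrative.

```python
# Empirical rule bands, left to right, with the percent of data in each.
BANDS = [
    ("below mu - 3 sd (black)", 0.15),
    ("mu - 3 sd to mu - 2 sd (red)", 2.35),
    ("mu - 2 sd to mu - 1 sd (orange)", 13.5),
    ("mu - 1 sd to mu (yellow)", 34.0),
    ("mu to mu + 1 sd (green)", 34.0),
    ("mu + 1 sd to mu + 2 sd (blue)", 13.5),
    ("mu + 2 sd to mu + 3 sd (purple)", 2.35),
    ("above mu + 3 sd (black)", 0.15),
]

# round() only tidies tiny floating-point leftovers from the addition.
whole_curve = round(sum(pct for _, pct in BANDS), 2)
lower_half = round(sum(pct for _, pct in BANDS[:4]), 2)
upper_half = round(sum(pct for _, pct in BANDS[4:]), 2)

print(whole_curve, lower_half, upper_half)
```

Each half of the list covers one side of mu, which is the 50% shortcut.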
If you add those up, you'll get 4512 06:54:21,640 --> 06:54:26,900 50%. So that's how you want to just conceptualize this whole empirical rule diagram. But now 4513 06:54:26,900 --> 06:54:35,050 we'll apply it. So I put the empirical rule diagram on the left, and then I put our class frequency 4514 06:54:35,050 --> 06:54:39,872 histogram on the right, and look, I put the mu and I put the standard deviation so we 4515 06:54:39,872 --> 06:54:44,730 could have it there. Now the first part of this section, I'm just going to show you how 4516 06:54:44,730 --> 06:54:49,390 to fill in the numbers under the diagram. Okay, and then after we fill in the numbers, 4517 06:54:49,390 --> 06:54:51,330 I'm going to talk to you about how to interpret 4518 06:54:51,330 --> 06:54:54,120 those numbers. 4519 06:54:54,120 --> 06:54:59,550 So let's start with the easy one. Let's write the mu underneath the symbol for mu, which was 65.5. 4520 06:54:59,550 --> 06:55:07,640 So we just wrote that, that was simple, okay. Now let's do the plus or minus one standard deviation. 4521 06:55:07,640 --> 06:55:14,810 So you'll see 65.5, which is our mu, minus, and I put one times 14.5. I know, I just did 4522 06:55:14,810 --> 06:55:19,042 that for demonstration purposes, so you see we're doing one times the standard deviation. 4523 06:55:19,042 --> 06:55:24,202 So if you subtract that from the mu, you get 51. And so I wrote that 51 underneath 4524 06:55:24,202 --> 06:55:30,220 the mu minus one standard deviation. And if you go the opposite way, and you add on 14.5, 4525 06:55:30,220 --> 06:55:36,080 you get 80. So I put that up there. So I just labeled those two. You can kind of 4526 06:55:36,080 --> 06:55:38,160 guess what we're going to do on the next 4527 06:55:38,160 --> 06:55:39,160 slide. 4528 06:55:39,160 --> 06:55:43,740 Surprise, we're going to do almost the same thing. Only we're doing the mu minus two times 4529 06:55:43,740 --> 06:55:49,810 the standard deviation to get the 36.5.
And the mu plus two times the standard deviation 4530 06:55:49,810 --> 06:55:56,680 to get that 94.5. And you're probably already ahead of me with this one. This is where 4531 06:55:56,680 --> 06:56:02,680 we do 65.5 minus three standard deviations, and we get 22. And then we add three standard 4532 06:56:02,680 --> 06:56:09,390 deviations, and we get 109. And now we're all labeled. So what does this all mean? Well, 4533 06:56:09,390 --> 06:56:15,310 remember, our n equals 100, just out of convenience. So what does this mean? It means that 34% 4534 06:56:15,310 --> 06:56:22,940 of the scores are between 51 and 65.5. So that's the yellow bar. Right? So 34 scores 4535 06:56:22,940 --> 06:56:27,958 were that, because I had 100 people in the class. So I'm standing there in that class, and I've 4536 06:56:27,958 --> 06:56:34,610 got a 73. But I know 34 of those people I'm looking at have a score between 51 and 65.5. 4537 06:56:34,610 --> 06:56:40,000 I also know that another 34%, or another 34 in this class, because there's 100, have a 4538 06:56:40,000 --> 06:56:45,950 score between 65.5 and 80. And my 73 is somewhere in there, right? So already, I'm getting an 4539 06:56:45,950 --> 06:56:53,390 idea that 68 people, or 68% of the scores, are going to be between 51 and 80. Right? 4540 06:56:53,390 --> 06:57:01,230 And so I'm right there with 68% of the class. So I'm going to go through some fake test 4541 06:57:01,230 --> 06:57:05,600 questions for you to just show you how to come up with the answer. So let's say the 4542 06:57:05,600 --> 06:57:12,782 question was, what percent of the data, student scores, are between 36.5 and 80? So think about 4543 06:57:12,782 --> 06:57:18,050 how you would answer that question. So see where 36.5 is, it's on the lower limit of 4544 06:57:18,050 --> 06:57:23,210 the orange part, and see where the 80 is, it's on the upper limit of the green part.
4545 06:57:23,210 --> 06:57:29,872 So what you would do is you would add up the percents in between, right? 13.5 plus 34, plus 4546 06:57:29,872 --> 06:57:35,292 34. And the answer to what percent of the data are between 36.5 and 80? The answer would 4547 06:57:35,292 --> 06:57:44,360 be 81.5%. Here's another question. What cut point marks the top 16% of the scores? 4548 06:57:44,360 --> 06:57:49,080 So already, you know you're up in that area, probably where the purple or the blue are, 4549 06:57:49,080 --> 06:57:55,458 right? And so what would make the top 16%? Well, if you actually add together that 4550 06:57:55,458 --> 06:58:02,272 0.15% from the little black part, the 2.35% from the purple, and the blue 13.5%, you'll 4551 06:58:02,272 --> 06:58:09,362 get 16%. So the cut point for that is 80. All the scores above 80, that would constitute 4552 06:58:09,362 --> 06:58:10,860 the top 16% 4553 06:58:10,860 --> 06:58:14,020 of the scores. 4554 06:58:14,020 --> 06:58:20,920 Here's another quiz question: what percent of the scores are below 94.5? So we see 94.5 4555 06:58:20,920 --> 06:58:25,230 is at the upper limit of the blue section. So you could kind of say, well, let's just 4556 06:58:25,230 --> 06:58:30,300 add up everything below. Right, we'll add up everything below it, and that percent of the 4557 06:58:30,300 --> 06:58:36,292 scores will be below 94.5. And so we do that, we add everything below it. But remember how 4558 06:58:36,292 --> 06:58:41,240 I said that the yellow, orange, red, and the little black part there, that 4559 06:58:41,240 --> 06:58:47,220 that equals 50%? If you just wanted to say, okay, that's 50% plus the green part, plus 4560 06:58:47,220 --> 06:58:53,330 the blue part, you could do that, and then you get the same answer. So what are the cut 4561 06:58:53,330 --> 06:58:58,100 points for the middle 68% of the data? I just wanted to show you an example. What if 4562 06:58:58,100 --> 06:59:04,300 they say middle, right?
Well, you're gonna have to be centered around mu, right? 4563 06:59:04,300 --> 06:59:10,220 So the middle 68% means 34% above the mean, and 34% below the mean. So the cut points 4564 06:59:10,220 --> 06:59:18,180 would be 51 to 80. Okay, now I'm going to ask a similar question, but I'm going to use 4565 06:59:18,180 --> 06:59:25,700 different words. Okay. What is the probability that if I select one student from this class, 4566 06:59:25,700 --> 06:59:31,170 that student will have a score less than 80? Okay, so notice, I'm using totally different 4567 06:59:31,170 --> 06:59:38,260 terminology. I'm saying, what is the probability? Yet the actual answer is what you 4568 06:59:38,260 --> 06:59:44,470 would probably guess, which is where you add up all the percents below 80. So the point 4569 06:59:44,470 --> 06:59:49,720 of me giving you these quiz questions is to point out that percent and probability mean 4570 06:59:49,720 --> 06:59:54,060 the same thing here. So either I'm gonna say, what percent of the data are below 4571 06:59:54,060 --> 06:59:59,740 the score of 80? Or, what is the probability that if I select one student, that student 4572 06:59:59,740 --> 07:00:05,220 will have scored less than 80? That is actually the same question. So the answer is going 4573 07:00:05,220 --> 07:00:11,522 to be, I use that 50% trick here. The answer is 50%, which is the whole bottom half of 4574 07:00:11,522 --> 07:00:19,350 that curve, plus 34% gets up to 84%. Right? So the probability that if I select one 4575 07:00:19,350 --> 07:00:24,720 student, that student will have a score less than 80 is 84%. And that's the same as what 4576 07:00:24,720 --> 07:00:31,980 percent of the data is below 80, which is 84%. Okay. Here's another probability question: what 4577 07:00:31,980 --> 07:00:39,730 is the probability I will select a student with a score between 36.5 and 51?
Well, that's 4578 07:00:39,730 --> 07:00:46,020 as if I was asking, it's the same question as what percent of the data are between 36.5 4579 07:00:46,020 --> 07:00:51,920 and 51? which you would know the answer that that would be 13.5. That's the orange part, 4580 07:00:51,920 --> 07:00:57,180 right? But even if I say, what is the probability, I will select a student with a score between 4581 07:00:57,180 --> 07:01:04,520 36.5 and 51 13.5%? So let's say that we were at a casino, and we were betting, right. And 4582 07:01:04,520 --> 07:01:09,100 I'm like saying, okay, there's 100 students, I'm going to just grab a score out, and I'm 4583 07:01:09,100 --> 07:01:15,140 betting a lot of money that I'm going to grab somebody between 36.5 and 51. And you'd probably 4584 07:01:15,140 --> 07:01:21,250 be like, you don't want to bet on that. Because you only have 13.5% probability of selecting 4585 07:01:21,250 --> 07:01:26,800 one, you probably want to bet if you're going to bet on something in the in the yellow section 4586 07:01:26,800 --> 07:01:31,240 or something in the green section, because they have higher probability. So that's how 4587 07:01:31,240 --> 07:01:36,070 you would think about probability. And percent, even though they're kind of the same thing. 4588 07:01:36,070 --> 07:01:39,080 I just wanted to show you how they word the questions differently. 4589 07:01:39,080 --> 07:01:45,280 But it means the same thing. So now I want you to just sit back and think for a second. 4590 07:01:45,280 --> 07:01:50,140 So think about what would happen in a different class taking the same hard test, meaning nobody's 4591 07:01:50,140 --> 07:01:55,730 getting 100%? What's the mu was the same, meaning everybody's doing badly. But the standard 4592 07:01:55,730 --> 07:02:00,772 deviation was larger than 14.5? What would that do to the intervals? So let's just stare 4593 07:02:00,772 --> 07:02:05,560 at this for a second. Let's say the mu was still 65.5. 
But the standard deviation was 4594 07:02:05,560 --> 07:02:11,628 like 30. Okay, there was a lot of variation in the class. That would already mean that 4595 07:02:11,628 --> 07:02:19,410 where the 80 is right now, that would actually be 95.5. Right? And where that 51 4596 07:02:19,410 --> 07:02:25,040 is there, now, if we have a standard deviation of 30, that would actually be 35.5. I mean, 4597 07:02:25,040 --> 07:02:30,470 that'd be a way bigger interval, right. And so the class I was in in chemistry was an 4598 07:02:30,470 --> 07:02:34,300 undergraduate class. I was in costume design. This was a whole bunch of different kinds 4599 07:02:34,300 --> 07:02:38,870 of people in chemistry. And that's probably why we even had kind of a big standard deviation 4600 07:02:38,870 --> 07:02:43,522 of 14.5. Even though I made that up. I mean, in reality, we probably did have a big standard 4601 07:02:43,522 --> 07:02:49,780 deviation. I knew in the chemical engineering department, they had chemistry classes for 4602 07:02:49,780 --> 07:02:53,890 chemical engineering majors. I'll tell you, their standard deviation was probably a lot 4603 07:02:53,890 --> 07:02:59,910 smaller, because they were probably more alike and got more similar grades to each other. 4604 07:02:59,910 --> 07:03:04,630 But with this diverse class, we probably had a pretty big standard deviation. So that gets 4605 07:03:04,630 --> 07:03:08,280 to my last question: what if the standard deviation was actually smaller than 14.5? 4606 07:03:08,280 --> 07:03:12,750 So if we were like in the chemical engineering class, and they were taking chemistry, and 4607 07:03:12,750 --> 07:03:17,390 they had a smaller standard deviation, maybe they might have had the same mean, 65.5. But 4608 07:03:17,390 --> 07:03:24,580 let's say their standard deviation was like five. Then where the 80 is now would be a 4609 07:03:24,580 --> 07:03:34,870 70.5. And where the 51 is would be a 60.5. 
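To see how the cut points move when only the standard deviation changes, here is a minimal Python sketch. The function name `empirical_rule_cut_points` is just an illustrative choice, not something from the lecture; it simply computes mu plus or minus one, two and three standard deviations for the three cases discussed (14.5, 30 and 5).

```python
def empirical_rule_cut_points(mu, sigma):
    """Return the seven cut points mu + k*sigma for k = -3, ..., 3."""
    return [mu + k * sigma for k in range(-3, 4)]


mu = 65.5  # class mean from the lecture example

for sigma in (14.5, 30, 5):
    cuts = empirical_rule_cut_points(mu, sigma)
    # cuts[2] is mu - 1 sigma (where the 51 sits), cuts[4] is mu + 1 sigma (where the 80 sits)
    print(f"sigma={sigma}: mu-1sd={cuts[2]}, mu+1sd={cuts[4]}")
```

With the same mu, the bigger standard deviation spreads the cut points out and the smaller one pulls them in toward 65.5.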
And we'd have way more confidence of where 4610 07:03:34,870 --> 07:03:40,910 we knew the scores fell. Like as I was standing there with my 73, I would be saying like, 4611 07:03:40,910 --> 07:03:46,550 Oh, you know, my 73 is pretty high, if everybody has a small standard deviation, right? Whereas 4612 07:03:46,550 --> 07:03:50,128 it's not that high here, because we have kind of a big standard deviation. Here it's just in the 4613 07:03:50,128 --> 07:03:56,320 first part, the green part. So the reason why I want you to think about that is, that's 4614 07:03:56,320 --> 07:03:57,740 why 4615 07:03:57,740 --> 07:03:58,740 this 4616 07:03:58,740 --> 07:04:04,870 shape goes by mu and standard deviation, because it really matters how big the standard deviation 4617 07:04:04,870 --> 07:04:15,010 is, how big each of those areas are with the different colors. So I just wanted to remind 4618 07:04:15,010 --> 07:04:20,700 you that percent, area and probability are all related. The percents literally refer 4619 07:04:20,700 --> 07:04:27,240 to the percent of the area of the shape, okay? And imagine the whole thing is 100%. So just 4620 07:04:27,240 --> 07:04:33,800 to remind you, the orange part is 13.5% of the area of the whole shape, but it also is 4621 07:04:33,800 --> 07:04:40,850 the probability that an x, like a student's x, falls between mu minus one standard deviation 4622 07:04:40,850 --> 07:04:47,120 and mu minus two standard deviations. And if I select one x from this group, 4623 07:04:47,120 --> 07:04:55,180 13.5% is the probability that I will get an x in that range. And so it means both 4624 07:04:55,180 --> 07:05:02,942 things. So in conclusion, the empirical rule helps establish intervals that apply to normally 4625 07:05:02,942 --> 07:05:09,390 distributed data. And it's more useful than Chebyshev, because it's more specific. These 4626 07:05:09,390 --> 07:05:14,330 intervals have a certain percentage of the data points in them. 
And they also refer to 4627 07:05:14,330 --> 07:05:20,730 the probability of selecting an X in that interval. And these intervals depend on the 4628 07:05:20,730 --> 07:05:26,020 mean and the standard deviation of the data distribution. So if those change then exactly 4629 07:05:26,020 --> 07:05:31,860 where the numbers are on those intervals change. Well, I hope you enjoyed my explanation of 4630 07:05:31,860 --> 07:05:39,870 the empirical rule. And now you can practice doing it yourself at home. Good morning, good 4631 07:05:39,870 --> 07:05:45,480 day. And good afternoon. This is Monica wahi, your library college lecturer here moving 4632 07:05:45,480 --> 07:05:51,532 you through chapter 7.2, and 7.3, z scores and probabilities, I decided to merge these 4633 07:05:51,532 --> 07:05:56,280 two chapters together, because I thought they actually kind of belong together, I didn't 4634 07:05:56,280 --> 07:05:59,941 really understand why they were separated. So at the end of this lecture, you should 4635 07:05:59,941 --> 07:06:05,590 be able to explain how to convert an X to a z score, show how to look up a z score in 4636 07:06:05,590 --> 07:06:11,300 a Z table. Explain how to find the probability of an X falling between two values on a normal 4637 07:06:11,300 --> 07:06:16,070 distribution, describe how to use the Z table to look up a z corresponding to a percentage, 4638 07:06:16,070 --> 07:06:21,560 and describe how to use the formula to calculate x from a z score. Well, that sounds like a 4639 07:06:21,560 --> 07:06:25,290 lot, but you'll understand that at the end of this lecture, first, I'm going to go over 4640 07:06:25,290 --> 07:06:29,920 what a z score is and what the standard normal distribution is. Then I'm going to talk about 4641 07:06:29,920 --> 07:06:34,670 Z score probabilities. 
And what those are, I'm going to show you how to use the Z table 4642 07:06:34,670 --> 07:06:39,350 to answer some harder questions besides the ones I talked about during the z score probabilities 4643 07:06:39,350 --> 07:06:43,170 section, then I'm going to show you how to use a slightly different formula to calculate 4644 07:06:43,170 --> 07:06:49,350 x from z. Finally, I'm going to just remind you some tips and tricks about using z scores 4645 07:06:49,350 --> 07:06:55,890 and probabilities correctly. So all this talk about z scores. So what is the z score? And 4646 07:06:55,890 --> 07:07:01,660 what is the standard normal distribution? Well, let's take a look at this very, pretty 4647 07:07:01,660 --> 07:07:08,610 thing I made. You may recognize it from the last lecture, it was my little Empirical Rule 4648 07:07:08,610 --> 07:07:14,400 diagram. So remember, the empirical rule, remember how it required a normal distribution? 4649 07:07:14,400 --> 07:07:19,830 Well, that worked well for the cut points available, right? Like mu mu plus or minus 4650 07:07:19,830 --> 07:07:25,030 one standard deviation, mu plus or minus two standard deviations. If we ask questions that 4651 07:07:25,030 --> 07:07:31,150 were right on those cut points, we had good answers. But what about in between those cut 4652 07:07:31,150 --> 07:07:37,120 points. So I wanted you to notice, in this Empirical Rule diagram, these numbers at the 4653 07:07:37,120 --> 07:07:40,640 bottom, like I just circled them, like negative three, negative two, negative one, and then 4654 07:07:40,640 --> 07:07:46,850 mew doesn't have a number. So pretend there's a zero there. And then there's one, two and 4655 07:07:46,850 --> 07:07:55,670 three, okay? That is the standard normal distribution. And that is also called z. So these things 4656 07:07:55,670 --> 07:08:03,190 on the right, those are z scores. 
So see the green area, zero is the z score that's on 4657 07:08:03,190 --> 07:08:09,160 the lower limit of that, and one is the z score at the upper limit of the green area. 4658 07:08:09,160 --> 07:08:13,378 So you can see that this whole curve, the standard normal distribution on the right, 4659 07:08:13,378 --> 07:08:18,360 the mean of the whole curve is zero. And the standard deviation of the whole 4660 07:08:18,360 --> 07:08:24,270 curve is one. And that is what the z score is. So I just want you to notice the concept of 4661 07:08:24,270 --> 07:08:31,042 standard. I'm in the US. And in the US, we use, you know, the US dollar, but one of 4662 07:08:31,042 --> 07:08:36,250 the things I've noticed is that a lot of countries see it as a standard. So they'll map their 4663 07:08:36,250 --> 07:08:42,490 currency to the US dollar. So maybe the Euro will map its currency to the US dollar, maybe 4664 07:08:42,490 --> 07:08:43,490 the Egyptian pound 4665 07:08:43,490 --> 07:08:48,470 will also map its currency to the US dollar. And once it does that, it's a lot easier to 4666 07:08:48,470 --> 07:08:52,780 compare them, right. And so that's the main reason for the standard normal distribution: 4667 07:08:52,780 --> 07:08:58,790 it helps you compare x's from different distributions, different normal distributions 4668 07:08:58,790 --> 07:09:03,170 that have different means and different standard deviations from each other. It helps you map 4669 07:09:03,170 --> 07:09:10,770 them to this standard normal distribution here, that standard, so you can compare them. 4670 07:09:10,770 --> 07:09:16,150 So let's talk about z scores. Every value on a normal distribution, so every x, can be 4671 07:09:16,150 --> 07:09:22,800 converted to a z score, just like I was saying how you can convert any currency to dollars, 4672 07:09:22,800 --> 07:09:23,800 there's some 4673 07:09:23,800 --> 07:09:25,670 formula for that. 
4674 07:09:25,670 --> 07:09:31,120 You can convert every x on a normal distribution to a z score. But you have to know how to 4675 07:09:31,120 --> 07:09:35,570 use the formula, right? And what goes into that formula? Well, first, you need the x 4676 07:09:35,570 --> 07:09:39,640 that you want to convert to a z score. So you need to pick one. Then you need to know 4677 07:09:39,640 --> 07:09:47,010 the mu of your distribution, your normal distribution, and the standard deviation of your distribution. 4678 07:09:47,010 --> 07:09:52,040 And here are the two formulas that are used. The one I was just talking about, on the 4679 07:09:52,040 --> 07:09:56,970 left, is the formula for calculating the z score. And we'll go over the one on the right 4680 07:09:56,970 --> 07:10:05,060 later in this lecture. So remember in the last lecture, I was talking about a class 4681 07:10:05,060 --> 07:10:10,340 that had 100 people in it. And they all took a really hard test. It was so hard, nobody 4682 07:10:10,340 --> 07:10:16,920 got 100%. And it was a 100 point test. So nobody got 100. The top score was in the 90s. So 4683 07:10:16,920 --> 07:10:17,920 um, 4684 07:10:17,920 --> 07:10:22,860 and remember, in the upper right there, there's the mu. The mu was 65.5, which 4685 07:10:22,860 --> 07:10:24,950 is a pretty bad score on a 100 4686 07:10:24,950 --> 07:10:30,380 point test, and the standard deviation was 14.5. So I'm going to give you an example 4687 07:10:30,380 --> 07:10:35,840 of calculating a z score on that particular distribution. So let's say you got a friend, 4688 07:10:35,840 --> 07:10:40,560 you have a smart friend, and that smart friend got a 90 in the face of all this. Well, let's 4689 07:10:40,560 --> 07:10:45,730 calculate the z score for 90 on this particular distribution. 
Okay, so here's what we're going 4690 07:10:45,730 --> 07:10:50,380 to do: first we're going to remind ourselves (you don't have to do this in real life when 4691 07:10:50,380 --> 07:10:54,890 you're doing it, I'm just doing this for demonstration purposes) what our Empirical 4692 07:10:54,890 --> 07:11:02,820 Rule stuff looked like. Remember, mu plus one standard deviation was 80. And mu plus 4693 07:11:02,820 --> 07:11:08,320 two standard deviations was 94.5. So already, you know, whatever your answer is going to 4694 07:11:08,320 --> 07:11:14,240 be for 90, it's going to be between one and two. Right. But we just don't know exactly 4695 07:11:14,240 --> 07:11:17,800 what it's going to be. So I'm just showing you this for demonstration purposes to relate 4696 07:11:17,800 --> 07:11:22,692 it to the last lecture. But you don't have to do this in real life when you calculate. 4697 07:11:22,692 --> 07:11:26,610 Okay, so we know that the z we're going to calculate is going to be somewhere between 4698 07:11:26,610 --> 07:11:33,770 one and two. And as you'll see on the slide here, over on the z curve, I labeled 4699 07:11:33,770 --> 07:11:39,050 where z equals zero, which is the mu, that's 65.5. So we're going to anticipate we're going 4700 07:11:39,050 --> 07:11:43,510 to get a z score that's somewhere between one and two. And you'll see in blue, I listed 4701 07:11:43,510 --> 07:11:50,160 the ingredients, right? So we have the smart friend's score, 90, we have the mu, 65.5. And we have 4702 07:11:50,160 --> 07:11:57,420 the standard deviation, 14.5. And then we have our z formula. So let's do it. Okay, so x 4703 07:11:57,420 --> 07:12:03,060 minus mu is going to be 90, which is our x, minus 65.5. You do that out first, and then 4704 07:12:03,060 --> 07:12:09,340 you divide it by 14.5. And look, our z score is 1.69. And that's exactly where we thought 4705 07:12:09,340 --> 07:12:14,590 it would be, it would be somewhere between one and two. 
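The same calculation can be sketched in a few lines of Python. This is only an illustration of the arithmetic just described (x minus mu, divided by the standard deviation); the function name `z_score` is an assumed, illustrative name.

```python
def z_score(x, mu, sigma):
    """Convert a raw score x to a z score: (x - mu) / sigma."""
    return (x - mu) / sigma


# Smart friend's score on the hard test (mu = 65.5, sigma = 14.5)
z_smart = z_score(90, 65.5, 14.5)
print(round(z_smart, 2))  # 1.69, which lands between 1 and 2 as expected
```

Rounding to two decimals matches how a z score is usually looked up in a Z table.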
And so as you can see, you can 4706 07:12:14,590 --> 07:12:21,730 take any x and convert it to z. Here we'll do another example, only this friend is not 4707 07:12:21,730 --> 07:12:26,331 so smart. This friend actually got a score that was kind of low. It was so low, it was 4708 07:12:26,331 --> 07:12:31,952 below the mu of 65.5. This poor friend only got a 50. So let's try it again, let's do 4709 07:12:31,952 --> 07:12:39,090 a z score for 50. So again, you know this is just for demonstration purposes. But remember, 4710 07:12:39,090 --> 07:12:47,060 in Empirical Rule land, 51 was that mu minus one standard deviation. So we're going to 4711 07:12:47,060 --> 07:12:52,900 expect that, again, between negative one and negative two in z is where our x of 50 is going 4712 07:12:52,900 --> 07:13:00,140 to land if we calculate the z score. And so here we are, we calculate the z score, we 4713 07:13:00,140 --> 07:13:07,220 have 50 minus 65.5 divided by 14.5, and we get negative 1.07. And the reason why it's 4714 07:13:07,220 --> 07:13:11,452 negative is, as you can see, it's on the left of the mu, 4715 07:13:11,452 --> 07:13:15,640 so then the z score is gonna be negative. And so as you can see, it's exactly where 4716 07:13:15,640 --> 07:13:21,500 we thought it would be, it would be a little bit to the left of negative one. 4717 07:13:21,500 --> 07:13:25,720 So now we're going to get into something that's a little bit harder, which is the z score 4718 07:13:25,720 --> 07:13:30,000 probability. So you're feeling pretty good about the z score. But now let's talk about 4719 07:13:30,000 --> 07:13:36,420 the probabilities. Okay, so remember the probability from the empirical rule, this is just old 4720 07:13:36,420 --> 07:13:40,730 Empirical Rule stuff. So remember, I gave you a question at the end of that lecture, 4721 07:13:40,730 --> 07:13:46,260 I said, What is the probability I will select a student with a score between 36.5 and 51? 
4722 07:13:46,260 --> 07:13:55,230 And remember, the answer was like this orange area, which is 13.5%. But what if you have 4723 07:13:55,230 --> 07:14:01,980 z scores like 1.69, the smart friend, and negative 1.07, which is the not so smart 4724 07:14:01,980 --> 07:14:06,070 friend? You know, in other words, you have x's of 90 and 50, which are not on the 4725 07:14:06,070 --> 07:14:12,128 empirical rule. How do you figure out the percent or the probability? That's the next 4726 07:14:12,128 --> 07:14:20,390 step with your z scores. Okay, so now let's ask this question, let's say, what is the 4727 07:14:20,390 --> 07:14:26,780 probability that students scored above the smart friend? Now, we could also ask for below, 4728 07:14:26,780 --> 07:14:32,310 but I'm just choosing to ask for above this time. So in other words, what is the area 4729 07:14:32,310 --> 07:14:38,990 under the curve from z equals 1.69 all the way up? So see, like a little ways through 4730 07:14:38,990 --> 07:14:46,620 that blue edge. We wish we knew the area for everything up from z equals 1.69, through the purple 4731 07:14:46,620 --> 07:14:51,590 area, through the little black thing at the top. We wish we knew that area. We only know 4732 07:14:51,590 --> 07:14:55,560 from the empirical rule what's on the cut points of like one and two, but we don't know 4733 07:14:55,560 --> 07:15:02,230 these in-between things. So how do we figure that out? Well, this is another problem here. 4734 07:15:02,230 --> 07:15:08,140 What is the probability that students scored below the not so smart friend, right? And 4735 07:15:08,140 --> 07:15:14,560 in that case, see the diagram, we'd have to figure out what is the part of the orange 4736 07:15:14,560 --> 07:15:19,750 that that friend gets, plus the red, plus the little black part at the bottom. What is 4737 07:15:19,750 --> 07:15:26,060 the percent or the proportion of the curve that represents that? 
So that's what we're 4738 07:15:26,060 --> 07:15:33,932 getting into now. And that's what we do is we look these up in a Z table. So what the 4739 07:15:33,932 --> 07:15:41,910 Z table is, is basically, they figured out every single Z score, you could have between 4740 07:15:41,910 --> 07:15:49,650 negative 3.49. And I'll go into why negative 3.49, between negative 3.49 and positive 3.49. 4741 07:15:49,650 --> 07:15:51,420 And they went like every 100. 4742 07:15:51,420 --> 07:15:52,420 So 4743 07:15:52,420 --> 07:15:59,310 they figured out for every single one of those these scores, what the probability is, and 4744 07:15:59,310 --> 07:16:04,390 they actually fit that all on a table. And so now, what I'm going to show you how to 4745 07:16:04,390 --> 07:16:09,520 do is how to use that table to look up the probabilities. And by the way, if you look 4746 07:16:09,520 --> 07:16:14,081 up a probability that happens to be on one of those Empirical Rule cut points, you'll 4747 07:16:14,081 --> 07:16:19,110 get what the empirical rule says. It's just said, the empirical rule is nice, because 4748 07:16:19,110 --> 07:16:22,628 you don't have to pull out the table. But if you have something that's not on the empirical 4749 07:16:22,628 --> 07:16:30,570 rule, cut points, get out your Z table. So how do you use the Z table? Well, the first 4750 07:16:30,570 --> 07:16:36,020 thing is you want to figure out what area you want, right? So we're going to start and 4751 07:16:36,020 --> 07:16:41,530 do the not so smart friend, because that's a little bit easier actually to demonstrate. 4752 07:16:41,530 --> 07:16:48,380 Okay, so what is the probability that students scored below the not so smart friend? So, 4753 07:16:48,380 --> 07:16:53,410 which is a secret way of saying, what is the area under the curve that makes up most of 4754 07:16:53,410 --> 07:16:59,060 that orange part, all the red and the little black part at the bottom? What is that proportion. 
4755 07:16:59,060 --> 07:17:06,340 And so for areas left of specified Z value, you're supposed to use the table directly. 4756 07:17:06,340 --> 07:17:11,220 So I'm going to show you how to use that table to look up negative 1.07. And then I'm going 4757 07:17:11,220 --> 07:17:16,720 to come back and tell you what they mean by use it directly. Hi, there. So here we are 4758 07:17:16,720 --> 07:17:21,990 at the Z table. And if you have the book, you can look it up in the appendix in on page 4759 07:17:21,990 --> 07:17:26,430 eight. But there's also a lot of z tables on the internet. Sometimes they're arranged 4760 07:17:26,430 --> 07:17:31,010 a little differently. So I'm using this one because it's from the book. So remember, the 4761 07:17:31,010 --> 07:17:37,830 Z that we're looking up, we're looking up the Z of negative 1.07. So remember, I said 4762 07:17:37,830 --> 07:17:42,930 they had to somehow calculate all the different probabilities for every single z between negative 4763 07:17:42,930 --> 07:17:49,830 3.49 through positive 3.49. Every 100th, they had to come up with that, well, how did they 4764 07:17:49,830 --> 07:17:53,930 fit it all on their table? Well, this is what they did. See, this is the being the Z table. 4765 07:17:53,930 --> 07:18:00,840 Remember, I said negative 3.49? Well, this is negative 3.4. And then to find the Z and 4766 07:18:00,840 --> 07:18:06,048 negative 3.49, you have to imagine that the nine is here, but it's going to be the last 4767 07:18:06,048 --> 07:18:13,230 one here. So see this nine here, this is what it would be. So just for pretend, if we had 4768 07:18:13,230 --> 07:18:22,360 a z score of negative 2.58, I go 2.5. And then I have to go over to the eight, one right 4769 07:18:22,360 --> 07:18:31,120 here. Okay. Or if I had one that was negative 2.10, right, or negative, just plain 2.1. 4770 07:18:31,120 --> 07:18:39,140 Right? 
Then I'd go over just one, to this zero. And see these little tiny things 4771 07:18:39,140 --> 07:18:43,780 in here? Those are all probabilities. In fact, let's go look up the probability for our z, which is 4772 07:18:43,780 --> 07:18:50,958 negative 1.07. So we're going to go down here, here we are at negative 1.0. And 4773 07:18:50,958 --> 07:18:54,112 then we have to go over to the seven column, right, so where's the seven? Here's the seven, 4774 07:18:54,112 --> 07:19:00,782 it's three from the left, I guess I could have guessed that. So we have negative 1.07. 4775 07:19:00,782 --> 07:19:12,140 So this is 0.1423, otherwise known as 14.23%. So that's actually what you 4776 07:19:12,140 --> 07:19:17,100 get out of the Z table. That's the probability, that's the percent you're looking for. And 4777 07:19:17,100 --> 07:19:21,420 just in case you're wondering, these aren't all negative. The first page is negative. 4778 07:19:21,420 --> 07:19:28,530 The second page is all the positive z scores, all the way up to 3.49. But what 4779 07:19:28,530 --> 07:19:34,570 I want you to hold in your head is what we just looked at, which was negative 1.07, which 4780 07:19:34,570 --> 07:19:35,900 is 0.1423. 4781 07:19:35,900 --> 07:19:38,760 Okay, hold that thought. 4782 07:19:38,760 --> 07:19:46,260 Okay, here we are back at our slides. And so look at that green part where it says for 4783 07:19:46,260 --> 07:19:51,060 areas to the left of a specified z value, which we're doing with the not so smart friend, 4784 07:19:51,060 --> 07:19:57,200 use the table entry directly. So here was our table entry. It was 0.1423. So we're 4785 07:19:57,200 --> 07:20:01,990 just going to use that number that we found and we're gonna say the probability, then, 4786 07:20:01,990 --> 07:20:09,180 is 14.23%. And that kind of makes logical sense knowing the empirical rule. 
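If you don't have the printed table handy, the same left-of-z area can be computed rather than looked up. The sketch below is an alternative to the Z table, not the method used in the lecture: it assumes Python 3.8 or later and uses `NormalDist` from the standard library's `statistics` module, whose `cdf` gives the area to the left of a value. The helper name `area_left_of` is illustrative.

```python
from statistics import NormalDist

# Standard normal distribution: mean 0, standard deviation 1
standard_normal = NormalDist(mu=0, sigma=1)


def area_left_of(z):
    """Area under the standard normal curve to the left of z."""
    return standard_normal.cdf(z)


# Not so smart friend: z = -1.07
p_below = area_left_of(-1.07)
print(f"{p_below:.4f}")  # 0.1423, the same entry the Z table gives
```

Because this is an area to the left of z, the value is used directly, just like the table entry.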
Now, I'm 4787 07:20:09,180 --> 07:20:16,860 going to show you an example of why I was saying, use it directly. In this next 4788 07:20:16,860 --> 07:20:20,250 example, we're going to look at the smart friend's probability. In fact, we're going 4789 07:20:20,250 --> 07:20:25,560 to ask what is the probability that the students scored above the smart friend, and the smart 4790 07:20:25,560 --> 07:20:31,090 friend's z equals 1.69. So I'm going to demonstrate now, for areas to the right of 4791 07:20:31,090 --> 07:20:36,390 a specified z value, you either look them up in the table, then subtract the result from 4792 07:20:36,390 --> 07:20:44,560 one, or you use the opposite z, which in this case would be negative 1.69. And you'll 4793 07:20:44,560 --> 07:20:49,430 get the same answer whether you do it the first way or the second way, but I'm going to 4794 07:20:49,430 --> 07:20:54,490 demonstrate both, okay. So first, I'm going to demonstrate what happens when you look 4795 07:20:54,490 --> 07:20:59,640 up the probability in the table for that z, and then you subtract that probability 4796 07:20:59,640 --> 07:21:07,020 from one. So let's go look up z equals 1.69. All right, here we are back at our Z table, 4797 07:21:07,020 --> 07:21:12,190 only this time, we're looking up a positive z. So we don't want this first one, we want 4798 07:21:12,190 --> 07:21:18,650 the second one. So remember, we're looking up z equals 1.69. So we're looking under here 4799 07:21:18,650 --> 07:21:25,120 for 1.6. And that's right here. And now we have to go over to the nine column. So that's 4800 07:21:25,120 --> 07:21:35,200 going to be 0.9545. So hold that thought, 0.9545. Okay, we're back with our probability 4801 07:21:35,200 --> 07:21:39,670 that we looked up in the Z table. Now remember, we were supposed to look it up in the table 4802 07:21:39,670 --> 07:21:44,690 and subtract the result from one. So that's what we're going to do now. 
So we found 4803 07:21:44,690 --> 07:21:55,860 0.9545 in the table, we're going to take one minus 0.9545. And we get 0.0455, or 4.55%, 4804 07:21:55,860 --> 07:22:00,510 this little tiny piece, which kind of makes sense, because it's right at the top of the 4805 07:22:00,510 --> 07:22:04,790 distribution, just a little piece of the blue, and the purple, and then the little black 4806 07:22:04,790 --> 07:22:11,140 at the top. Alright, and so what you want to imagine is that 0.9545, which 4807 07:22:11,140 --> 07:22:20,452 is like 95.45%, that's the whole piece below z equals 1.69. That's most of the blue, 4808 07:22:20,452 --> 07:22:25,470 the green, the yellow, the orange, the red, and the little black at the bottom, that's 4809 07:22:25,470 --> 07:22:31,458 all in the 0.9545. Okay, so again, we were looking up the area to the right of 4810 07:22:31,458 --> 07:22:37,340 the specified z value, and I showed you the first way of doing it. There's another way 4811 07:22:37,340 --> 07:22:43,640 of doing it, and that's where you just use the opposite z from the get go. So we're going 4812 07:22:43,640 --> 07:22:50,450 to now use the opposite z, we're going to look up negative 1.69. All right, here 4813 07:22:50,450 --> 07:22:58,020 we are back at the Z table. Only this time, we're looking at negative 1.69. So negative 4814 07:22:58,020 --> 07:23:03,430 1.6 is the first thing we need to find in this column. So here we are, negative 1.6. 4815 07:23:03,430 --> 07:23:08,208 And then we know nine is the last column. I'm learning that. So we'll go over here. 4816 07:23:08,208 --> 07:23:14,430 And so that looks familiar. Right? 0.0455. Okay, hold that thought. All right, 4817 07:23:14,430 --> 07:23:20,880 we're back. 
And so as you know, if you look it up in the table directly, like the 1.69 4818 07:23:20,880 --> 07:23:25,208 directly, and you take that probability, and you subtract it from one, which is what we 4819 07:23:25,208 --> 07:23:34,190 did last, we got the same answer we got now, right point, oh, 455, or 4.55%. So it is kind 4820 07:23:34,190 --> 07:23:39,570 of more efficient, to just use the opposite z, if you're looking for areas to the right 4821 07:23:39,570 --> 07:23:45,430 of the specified Z value. But I always say when you're done looking it up, compare it 4822 07:23:45,430 --> 07:23:50,590 to the picture. And I always say draw a picture to, you know, I don't mind if you have normal 4823 07:23:50,590 --> 07:23:57,230 curves drawn, drawn over all of your homework, or all over the wall, I guess, or maybe a 4824 07:23:57,230 --> 07:24:03,610 whiteboard, that's probably more efficient. But it's best to draw it out. label on there, 4825 07:24:03,610 --> 07:24:09,950 where your z and your x are, and then just look at it. Because we know that the little 4826 07:24:09,950 --> 07:24:17,340 piece above z equals 1.69 is not 95% of that curve. It's just not it, that's over 50%. 4827 07:24:17,340 --> 07:24:23,800 And we can tell that little tiny pieces under 50%. So if you accidentally do the first way 4828 07:24:23,800 --> 07:24:30,260 and forget to subtract from one, you know, maybe if you check it against your normal 4829 07:24:30,260 --> 07:24:31,260 curve drawing, 4830 07:24:31,260 --> 07:24:37,878 you'll realize oh, I made a mistake. So even though there's two different ways to find 4831 07:24:37,878 --> 07:24:44,220 the probability, if it's to the right of the z value, just try to make sure no matter which 4832 07:24:44,220 --> 07:24:51,401 ways you use that you finally do a reality check against the drawing you make, just to 4833 07:24:51,401 --> 07:24:54,940 make sure you got the right piece because there's only two pieces. 
There's a big piece 4834 07:24:54,940 --> 07:24:59,910 and a little piece of the curve, and we got 4.55%. We know that's a little piece and we 4835 07:24:59,910 --> 07:25:03,050 know from our drawing that we were looking for the little piece. So that's how you do 4836 07:25:03,050 --> 07:25:09,400 your reality check. Okay, you thought that there weren't any harder questions? Well, 4837 07:25:09,400 --> 07:25:13,070 here are some harder questions. So this is a little bit more on probabilities in the 4838 07:25:13,070 --> 07:25:19,510 Z table. So here's another question we haven't handled yet. What if you were looking at a 4839 07:25:19,510 --> 07:25:24,320 probability between two scores, such as the probability the students will score between 4840 07:25:24,320 --> 07:25:27,560 50 and 90, so it's somewhere in the middle, 4841 07:25:27,560 --> 07:25:28,730 okay. 4842 07:25:28,730 --> 07:25:34,660 Note that in that case, when you have a between one, you actually have two x's, and we'll 4843 07:25:34,660 --> 07:25:39,860 label them x one and x two. So the not so smart friend is going to be x one, and the 4844 07:25:39,860 --> 07:25:45,420 smarter friend is going to be x two, just to keep these x's straight. Okay. So the next 4845 07:25:45,420 --> 07:25:50,060 step is you're going to calculate z one and z two. And I'm kind of cheating, because we 4846 07:25:50,060 --> 07:25:53,560 already did these. We already knew the z one for the not so smart friend was negative 1.07. 4847 07:25:53,560 --> 07:25:59,208 And we already knew the z two for the smarter friend was 1.69. So I just put them on the 4848 07:25:59,208 --> 07:26:05,150 diagram. Okay, and then here's the beginning of the strategy, and I'll just explain the 4849 07:26:05,150 --> 07:26:10,330 strategy, and then I'll do the strategy. So for z one, you find the probability to the 4850 07:26:10,330 --> 07:26:14,920 left of the z, so you find the little piece to the left. 
And remember, you can take the 4851 07:26:14,920 --> 07:26:19,048 direct probability from the Z table. So that's what direct means: you just get to copy 4852 07:26:19,048 --> 07:26:25,020 it directly out of this table. Then for z two, you find the probability to the right 4853 07:26:25,020 --> 07:26:30,600 or above z. So you find the little piece there. And you use one of those two methods I showed 4854 07:26:30,600 --> 07:26:38,480 you, which we did together. And then finally, imagine like the whole curve, you're subtracting 4855 07:26:38,480 --> 07:26:44,180 the piece at the bottom, the z one probability, and you're subtracting the piece at the top. 4856 07:26:44,180 --> 07:26:49,360 So you're trimming off those two pieces to get the between probability. So that's the 4857 07:26:49,360 --> 07:26:56,042 strategy: basically you find out the size, the probability of each of the little 4858 07:26:56,042 --> 07:27:00,452 pieces on the sides, you subtract both of those from one, and that traps whatever's 4859 07:27:00,452 --> 07:27:07,010 left in the middle. So I'll demonstrate this. So remember, for z one, the probability to 4860 07:27:07,010 --> 07:27:14,440 the left of z one was 0.1423. We did that together. And then we used both of those methods, 4861 07:27:14,440 --> 07:27:20,650 and they got the same answer, to find the probability to the right of z two, which was 0.0455. 4862 07:27:20,650 --> 07:27:25,220 Okay, so that's a little piece at the top, and then we got the little piece at the bottom. 4863 07:27:25,220 --> 07:27:32,420 And now we'll take one minus the piece at the bottom minus the piece at the top, and 4864 07:27:32,420 --> 07:27:39,250 the total is 0.8122, or 81.22%, which kind of makes sense; that's a big piece 4865 07:27:39,250 --> 07:27:43,730 in the middle. So it wouldn't be surprising if it was about 80% of the curve. So this 4866 07:27:43,730 --> 07:27:50,660 is how you do a between one. 
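The between strategy above can be sketched in a few lines. This again assumes `statistics.NormalDist` as a stand-in for the z table. The shortcut on the last line is not part of the lecture; it is a standard equivalent included only as a cross-check.

```python
from statistics import NormalDist

std_normal = NormalDist()  # stand-in for the printed z table

z1, z2 = -1.07, 1.69  # z-scores for x1 = 50 and x2 = 90 in the lecture

lower_piece = std_normal.cdf(z1)      # little piece to the left of z1
upper_piece = 1 - std_normal.cdf(z2)  # little piece to the right of z2

# Trim both little pieces off the whole curve (total area 1).
between = 1 - lower_piece - upper_piece

# Cross-check (not in the lecture): area left of z2 minus area left of z1.
between_shortcut = std_normal.cdf(z2) - std_normal.cdf(z1)

print(round(lower_piece, 4), round(upper_piece, 4), round(between, 4))
```

The three printed values should line up with the lecture's 0.1423, 0.0455, and 0.8122. The middle piece coming out above 0.5 agrees with the picture.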
Here's another question I haven't really handled: what if 4867 07:27:50,660 --> 07:27:55,720 you're looking at a probability more than 50%? So such as the probability that students will 4868 07:27:55,720 --> 07:28:04,030 score greater than 50? Right? Like, like the big side? Okay? Well, actually, you just do 4869 07:28:04,030 --> 07:28:08,940 what you normally would do. You say, for areas to the right of the specified z value, either 4870 07:28:08,940 --> 07:28:13,708 look up in the table and subtract the result from one, or use the opposite z, which in 4871 07:28:13,708 --> 07:28:19,730 this case would be 1.07. So if we did method one, we'd end up going one minus 0.1423, 4872 07:28:19,730 --> 07:28:26,610 which we already looked at, and we get 0.8577. If we use method two, we'd take the z of 4873 07:28:26,610 --> 07:28:32,680 1.07, not negative 1.07, but 1.07. And we could go look it up in the Z table, and we get 4874 07:28:32,680 --> 07:28:39,130 0.8577 again, 85.77%. So this isn't actually a harder question, I just wanted 4875 07:28:39,130 --> 07:28:42,298 to show you how it works when you're getting like a bigger piece, bigger than 50% piece 4876 07:28:42,298 --> 07:28:50,780 of the distribution. And here's another sort of similar example, where we're looking at 4877 07:28:50,780 --> 07:28:57,680 the probability that students will score less than 90, okay. So that's easy, right? For the 4878 07:28:57,680 --> 07:29:02,610 areas to the left of the specified z value, just use the table directly. So when we went 4879 07:29:02,610 --> 07:29:09,850 and looked up z equals 1.69, we got 0.9545. So that's the answer. 95.45% of 4880 07:29:09,850 --> 07:29:18,470 the curve is below z equals 1.69, or below x equals 90. So as I mentioned before, but 4881 07:29:18,470 --> 07:29:22,890 I'll just mention again, you're supposed to treat all probabilities to the left of z equals 4882 07:29:22,890 --> 07:29:30,500 negative 3.49 as P equals zero. 
So I showed you what negative 3.49 looks like in the Z 4883 07:29:30,500 --> 07:29:36,260 table. It's like 0.0002. Well, there's not much smaller than that. So just, if you 4884 07:29:36,260 --> 07:29:43,910 actually calculate z and you get like negative four, just say the P is zero, okay. Then the 4885 07:29:43,910 --> 07:29:49,190 second thing is treat all areas and probabilities for a z to the right of 3.49 as P equals one 4886 07:29:49,190 --> 07:29:56,870 or 100%. So as you can imagine, you know, 3.49, that's at the top end of the curve. So if 4887 07:29:56,870 --> 07:30:02,110 you calculate a z and you got like a five, you can just assume that's 100%, right, or 4888 07:30:02,110 --> 07:30:10,458 one. Okay, um, so we've gone through how to calculate z. And we've talked about looking 4889 07:30:10,458 --> 07:30:15,290 at probabilities in the Z table. And we've even talked about manipulating those probabilities 4890 07:30:15,290 --> 07:30:23,798 to get certain probabilities. But we haven't talked about calculating x when z is given. 4891 07:30:23,798 --> 07:30:30,060 So sometimes you're actually given a z, and you have to calculate the x back 4892 07:30:30,060 --> 07:30:35,630 from the z. In fact, sometimes it's even harder. Sometimes you're given a probability. And 4893 07:30:35,630 --> 07:30:39,628 the probability is not as easy. But you can use the probability. Remember those little 4894 07:30:39,628 --> 07:30:43,140 percents in the middle of the table? You can go find it in the middle of the table and 4895 07:30:43,140 --> 07:30:49,230 look up the z that keys to it, and then put it into this equation. And so I'm going to 4896 07:30:49,230 --> 07:30:54,620 just give you examples of some real life questions that you might see, like on a homework or 4897 07:30:54,620 --> 07:31:00,180 on a test, probably not in real, real life, where you need to calculate x, and you 4898 07:31:00,180 --> 07:31:08,292 need to use that formula in the red circle. 
So let's say I was just bored. And I was wondering, 4899 07:31:08,292 --> 07:31:16,770 what is the score, the test score on the score distribution, that is at z equals 1.5? Okay, 4900 07:31:16,770 --> 07:31:20,750 so see where z equals 1.5? We never asked that question before. So let's say I just 4901 07:31:20,750 --> 07:31:25,200 out of curiosity wanted to know, what would the test score be of a student who was at 4902 07:31:25,200 --> 07:31:35,180 z equals 1.5. So what I would do is I would take 1.5 times 14.5, because that's what the 4903 07:31:35,180 --> 07:31:39,900 formula says. It's z times the standard deviation. And then I do that first because of order of 4904 07:31:39,900 --> 07:31:47,120 operations. And then after doing that, I'd add the mu, which is 65.5. And I get 87.3. 4905 07:31:47,120 --> 07:31:55,370 So the x, the student who got 87.3, that student got a score that's at z equals 1.5. Now, 4906 07:31:55,370 --> 07:32:00,378 as you can probably imagine, people don't go around asking so much about, well, I wonder what that 4907 07:32:00,378 --> 07:32:05,830 person's score is at z equals negative 2.3, or whatever. They don't usually phrase it 4908 07:32:05,830 --> 07:32:11,310 like that. Usually, you see more like a question like this, which is, what is the score that 4909 07:32:11,310 --> 07:32:19,140 marks the top 7% of scores? And that's a secret way of saying, we are looking for the z at 4910 07:32:19,140 --> 07:32:24,470 p equals 0.0700. So it's like we turn that 7% backwards into a probability. 4911 07:32:24,470 --> 07:32:29,890 And we say, we're actually looking for the z at p equals 0.0700. So how 4912 07:32:29,890 --> 07:32:33,020 do you do that? Well, I'm going to show you. 4913 07:32:33,020 --> 07:32:34,290 Okay, 4914 07:32:34,290 --> 07:32:43,260 so we're on the hunt for probability 0.0700. Okay, so let's start at the top of the 4915 07:32:43,260 --> 07:32:47,280 table here. 
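The z equals 1.5 calculation walked through above can be written as a small helper. The function names `x_from_z` and `z_from_x` are made up for this sketch. The mean and standard deviation are the ones quoted in the lecture.

```python
MU = 65.5     # mean of the lecture's test-score distribution
SIGMA = 14.5  # standard deviation of that distribution

def x_from_z(z, mu=MU, sigma=SIGMA):
    """Convert a z-score back to a raw score: x = z * sigma + mu."""
    return z * sigma + mu

def z_from_x(x, mu=MU, sigma=SIGMA):
    """Convert a raw score to a z-score: z = (x - mu) / sigma."""
    return (x - mu) / sigma

score = x_from_z(1.5)
print(score)            # 87.25, which the lecture rounds to 87.3
print(z_from_x(score))  # back to 1.5
```

Running the two formulas back to back is a cheap way to confirm they undo each other.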
You'll see we're digging around in the middle of the table, right? And you'll 4916 07:32:47,280 --> 07:32:51,580 see, like, 0.00-something; that's nowhere near the ballpark, because we're looking for 4917 07:32:51,580 --> 07:32:57,460 0.0700. So let's scroll up here, or scroll down, actually. So now we're in 4918 07:32:57,460 --> 07:33:02,400 the 0.04 neighborhood. Here's 0.06. Okay, we're getting close. Well, here 4919 07:33:02,400 --> 07:33:03,670 we have 4920 07:33:03,670 --> 07:33:09,700 0.0708. And that's 0.0008 more than we want it to be. 4921 07:33:09,700 --> 07:33:18,170 Well, here next door, we have 0.0694. And that's only 0.0006 less than we 4922 07:33:18,170 --> 07:33:24,410 want it to be, right, because if it had 0.0006 more, it would be 0.0700. 4923 07:33:24,410 --> 07:33:25,410 So this 4924 07:33:25,410 --> 07:33:30,640 is technically closer than this one, because this is 0.0008 off, and this is 4925 07:33:30,640 --> 07:33:39,101 only off by 0.0006. So we're gonna choose 0.0694 as the probability 4926 07:33:39,101 --> 07:33:44,000 of record for the top 7%. Only, we're not going to just choose this, we're going 4927 07:33:44,000 --> 07:33:48,870 to figure out what z is at that probability. So what are we gonna do? We're gonna map back 4928 07:33:48,870 --> 07:33:54,780 here, negative 1.4. And then we've got to go all the way up, which we can guess is eight. 4929 07:33:54,780 --> 07:34:03,340 So it's negative 1.48. So hold that thought. Okay, we started out looking for the z at p equals 4930 07:34:03,340 --> 07:34:14,480 0.0700. But the closest we got was 0.0694, and that mapped to z equals negative 1.48. Now, 4931 07:34:14,480 --> 07:34:23,390 what I want you to notice is negative 1.48 is actually on the left side of mu. Okay, 4932 07:34:23,390 --> 07:34:29,990 so that is the z score at the bottom 7% of the scores. 
So we're going to use the positive 4933 07:34:29,990 --> 07:34:36,610 version of that z, since we want the top 7%, so we're going to use 1.48. So the opposite 4934 07:34:36,610 --> 07:34:44,458 z, and now we're going to plug it into the equation. So 1.48 times 14.5, which is the 4935 07:34:44,458 --> 07:34:52,420 standard deviation, plus 65.5 equals 87. So 87 is the score that marks the top 4936 07:34:52,420 --> 07:35:01,740 7% of the scores. I'm going to do another exercise for you, this time 4937 07:35:01,740 --> 07:35:06,270 the bottom 3% of the scores, because this is often kind of challenging for students. So 4938 07:35:06,270 --> 07:35:10,980 I'll just give you a second demonstration. So as you can imagine, we're going on the 4939 07:35:10,980 --> 07:35:20,020 hunt now for z at p equals 0.0300. So let's go over to the Z table. All right, now we're 4940 07:35:20,020 --> 07:35:23,900 getting a little good at this, right? So we're digging around in the middle, and we're looking 4941 07:35:23,900 --> 07:35:33,620 for 0.0300. Okay, and starting at the top, we're in the 0.00 department. Oh, here's 0.01 4942 07:35:33,620 --> 07:35:45,810 something, 0.02. Okay, we're getting close to 0.0300. So, 0.0301. 4943 07:35:45,810 --> 07:35:50,720 Could you ask for anything closer? Totally perfect. Okay, so that's what we're going 4944 07:35:50,720 --> 07:35:59,208 to use for our z, the z at 0.0301. So let's look up that z. So that z is negative 4945 07:35:59,208 --> 07:36:04,500 1.8. And then we look up eight, so it's negative 1.88. 4946 07:36:04,500 --> 07:36:06,420 Hold that thought. 4947 07:36:06,420 --> 07:36:13,550 All right. Well, we were on the hunt for p equals 0.0300, and we didn't 4948 07:36:13,550 --> 07:36:20,878 find that. But we did find p equals 0.0301 in the table, and that mapped back 4949 07:36:20,878 --> 07:36:28,450 to z equals negative 1.88. Right. 
And now we go back to the question, we see that we 4950 07:36:28,450 --> 07:36:35,110 want the bottom 3%, so we keep the negative. Now if I'd asked about the top 3%, we'd lose 4951 07:36:35,110 --> 07:36:40,440 the negative and use 1.88 in the equation, but since we want the bottom 3%, we're going 4952 07:36:40,440 --> 07:36:47,030 to keep the negative. Okay, so now let's do the equation. So x equals, and then in the 4953 07:36:47,030 --> 07:36:53,320 parentheses, negative 1.88 times 14.5, which is our standard deviation, then plus our mu, 4954 07:36:53,320 --> 07:37:00,930 which is 65.5. And the score we get is 38.2. So 38.2 is the score that marks the bottom 4955 07:37:00,930 --> 07:37:09,120 3% of scores, and just be happy your score is not in there. Okay, now, here's another 4956 07:37:09,120 --> 07:37:15,060 challenging hard question. What if the question on the test, probably not in real life, 4957 07:37:15,060 --> 07:37:21,430 but on a test, says, what scores mark the middle 20% of the data? And so I put little arrows 4958 07:37:21,430 --> 07:37:25,450 on there just to point out, well, when they say middle, they mean it's hugging 4959 07:37:25,450 --> 07:37:26,910 the mu. 4960 07:37:26,910 --> 07:37:31,290 It's actually assuming that there's gonna be 10% on the right side of the mu, and 4961 07:37:31,290 --> 07:37:37,840 10% on the left side of the mu. And so how you start to do this is you figure out the 4962 07:37:37,840 --> 07:37:47,040 z score for one minus 0.2, which is the 20%, divided by two, which equals 0.4, 4963 07:37:47,040 --> 07:37:53,560 right? So then after that, you know, because one minus 0.2 is 0.8, and 4964 07:37:53,560 --> 07:37:59,720 0.8 divided by two is 0.4. So we get this 0.4, which is the area in each tail outside the middle 20%. So we go find the z score 4965 07:37:59,720 --> 07:38:05,640 at 0.4, which you're good at using the Z table for now. 
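The top 7% and bottom 3% lookups above amount to reading the z table backwards. As a hedged sketch, `NormalDist.inv_cdf` is assumed here in place of digging around in the middle of the table, and z is rounded to two decimals to imitate the table's precision. The helper name `cutoff_score` is invented for illustration. Where the lecture finds the bottom 7% z and flips its sign, this sketch asks directly for the z with 93% of the area below it, which is the same point on the curve.

```python
from statistics import NormalDist

MU, SIGMA = 65.5, 14.5     # mean and standard deviation from the lecture
std_normal = NormalDist()  # stand-in for the printed z table

def cutoff_score(p_left, mu=MU, sigma=SIGMA):
    """Return (z, x) for the score that has area p_left below it.

    z is rounded to two decimals to mimic a printed z table.
    """
    z = round(std_normal.inv_cdf(p_left), 2)
    return z, z * sigma + mu

z_top, x_top = cutoff_score(1 - 0.07)  # top 7%: 93% of the area is below
z_bot, x_bot = cutoff_score(0.03)      # bottom 3%: 3% of the area is below

print(z_top, x_top)
print(z_bot, x_bot)
```

The z values should come out as 1.48 and -1.88, with scores near 87 and 38.2, in line with the lecture's table lookups.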
So uh, so I, you know, 4966 07:38:05,640 --> 07:38:11,433 looked around, and I found 0.4013, digging around in the middle of the 4967 07:38:11,433 --> 07:38:19,458 Z table, and that mapped back to z equals negative 0.25, right. And so that 4968 07:38:19,458 --> 07:38:24,840 is then what I would put on for the lower limit on that one, and then z equals 4969 07:38:24,840 --> 07:38:30,282 0.25, the positive version, goes on the other side. So once you've figured out both of 4970 07:38:30,282 --> 07:38:34,970 the z's, the z on the left and the z on the right, you just have to put them through the 4971 07:38:34,970 --> 07:38:41,610 equation. So for the left side, we use the negative z. And for the right side, we use 4972 07:38:41,610 --> 07:38:46,280 the positive z. And that's how we get our limits. So what scores mark the middle 20% 4973 07:38:46,280 --> 07:38:54,230 of the data? 61.9 and 69.1. Isn't it weird how that worked out? But anyway, 61.9 and 4974 07:38:54,230 --> 07:38:59,760 69.1 mark the middle 20% of the data. I totally didn't do that on purpose. It just 4975 07:38:59,760 --> 07:39:06,140 worked out that way. All right, I can't believe you made it through all this. I'll bet your 4976 07:39:06,140 --> 07:39:11,290 brain is ready to explode. So now is a good time to talk about just a little review, just to 4977 07:39:11,290 --> 07:39:17,930 help you come down a little bit from this whole really intense lecture. Okay. So first, I'm 4978 07:39:17,930 --> 07:39:24,330 going to do a little z score quiz, game show style stuff here, right? So if you ever get 4979 07:39:24,330 --> 07:39:28,050 the question when you're on the test, and you're like, Oh, my gosh, where is x? Where's 4980 07:39:28,050 --> 07:39:33,750 x? Well, if you can't find x, it's usually in the question. 
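For the middle 20% example above, a minimal sketch of the same steps follows. It assumes `NormalDist.inv_cdf` in place of the table search and rounds z to two decimals the way the table would. Variable names are illustrative only.

```python
from statistics import NormalDist

MU, SIGMA = 65.5, 14.5     # mean and standard deviation from the lecture
std_normal = NormalDist()  # stand-in for the printed z table

middle = 0.20
tail = (1 - middle) / 2    # 0.40 of the area sits in each tail

z_low = round(std_normal.inv_cdf(tail), 2)  # table-style z for the lower limit
z_high = -z_low                             # mirror image for the upper limit

lower_limit = z_low * SIGMA + MU
upper_limit = z_high * SIGMA + MU

print(z_low, z_high)
print(lower_limit, upper_limit)

# Reality check: the area trapped between the two z's should be close to 0.20.
print(round(std_normal.cdf(z_high) - std_normal.cdf(z_low), 4))
```

The limits come out as 61.875 and 69.125, which the lecture reports as 61.9 and 69.1. The trapped area is slightly under 0.20 only because z was rounded to two decimals.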
So usually, the way these 4981 07:39:33,750 --> 07:39:40,370 questions go is somebody, like maybe me, will put a mu and a standard deviation at the top 4982 07:39:40,370 --> 07:39:45,820 of the question. And then there'll be like, maybe five questions that pertain to 4983 07:39:45,820 --> 07:39:50,570 that mu and that standard deviation, but they ask about different x's. And when I would 4984 07:39:50,570 --> 07:39:53,730 teach this class in person, you know, people would come running up to me in the middle of 4985 07:39:53,730 --> 07:39:58,660 a test, which you probably shouldn't do. And they would say, where's the x? Where's the 4986 07:39:58,660 --> 07:40:02,900 x? You gave me, you know, these pieces of the equation, but I can't find the x. And I'd be 4987 07:40:02,900 --> 07:40:07,580 like, look in the question. Look in the question, you know, because I don't want to give it 4988 07:40:07,580 --> 07:40:11,970 away, and then they'd all run back to their seats and find it. So if you're 4989 07:40:11,970 --> 07:40:17,560 wondering, you're panicking, and where's x? Look in the question, it's usually in the question. 4990 07:40:17,560 --> 07:40:23,410 Okay, so let's say you find an x, and what do you do with an x? Okay, and you're stuck 4991 07:40:23,410 --> 07:40:28,500 with an x, what do you do? Well, usually, what you have to do is calculate a z score. So 4992 07:40:28,500 --> 07:40:33,330 remember, if you've got an x, you probably have a mu and a standard deviation, and you can 4993 07:40:33,330 --> 07:40:37,410 calculate a z score on that. So if you're panicking on a test, and you have an x, a 4994 07:40:37,410 --> 07:40:41,952 mu, and a standard deviation, just for fun, calculate a z score and see if it gets you anywhere. 4995 07:40:41,952 --> 07:40:46,620 Okay, well, let's say you have a z score, what do you do with a z score? Well, you always 4996 07:40:46,620 --> 07:40:51,140 look it up, right? 
I mean, if you're going this direction, if 4997 07:40:51,140 --> 07:40:56,340 you started with an x, and you get a z, you've got to go to the Z table with it. Okay, so that's 4998 07:40:56,340 --> 07:41:00,031 your next step. So if you're doing all this work, you calculate a z score, and then you're 4999 07:41:00,031 --> 07:41:05,570 done, and you're like, Oh, my gosh, what's my next step? Go look at the Z table. Well, what 5000 07:41:05,570 --> 07:41:10,792 if the question asks for an x, right? Well, remember, we have a whole formula for that. 5001 07:41:10,792 --> 07:41:17,320 So use the x formula. So if there's no x anywhere, and it's asking for an x, then use the other 5002 07:41:17,320 --> 07:41:18,320 formula, use the 5003 07:41:18,320 --> 07:41:20,260 x formula. 5004 07:41:20,260 --> 07:41:26,128 And what if the question gives you a p? I just said p for probability, but it could 5005 07:41:26,128 --> 07:41:30,950 be a percentage, like, remember, the top 7% and the bottom 3%? Well, if they give 5006 07:41:30,950 --> 07:41:37,048 you a percent, just start digging around in the middle of the Z table, just start digging 5007 07:41:37,048 --> 07:41:41,040 around looking for that percent. Because once you start digging around, you realize that 5008 07:41:41,040 --> 07:41:45,590 maps back to a z. And then you can get into the groove of using the x formula, and you'll 5009 07:41:45,590 --> 07:41:52,580 probably get yourself out of this spot. So here are some final tips and tricks for getting 5010 07:41:52,580 --> 07:41:58,340 z scores and probabilities right. And I've said this one before: draw a picture. And 5011 07:41:58,340 --> 07:42:03,140 what do I mean by that? Graph out the question. Draw the curve, draw the line for mu, which 5012 07:42:03,140 --> 07:42:08,330 goes in the middle, and where the x goes, above or below the mu. Just start with that; it doesn't 5013 07:42:08,330 --> 07:42:13,170 have to be to scale. 
But mainly, you want to get those elements in there. If there's one x, 5014 07:42:13,170 --> 07:42:18,040 shade the part of the curve wanted, either above the x or below the x, you know, just 5015 07:42:18,040 --> 07:42:22,760 color it in, so that you get an idea of, do you want the big part, the one that's greater 5016 07:42:22,760 --> 07:42:28,378 than 50%, or the little part, the one that's less than 50%? If there are two x's, then 5017 07:42:28,378 --> 07:42:33,700 shade in the area wanted, which is usually in between them. If it's a calculate-the-x 5018 07:42:33,700 --> 07:42:39,900 question, put where the z or the p is. So if it was like the top 7%, you could shade 5019 07:42:39,900 --> 07:42:44,792 in the top little part of the curve. If it was the bottom 3%, you could shade in the 5020 07:42:44,792 --> 07:42:50,010 bottom little part of the curve. So make this picture and do it at the beginning. Okay, 5021 07:42:50,010 --> 07:42:54,720 then, note that x is usually in the question. If you can't find x, and you're trying to 5022 07:42:54,720 --> 07:42:58,660 do the z formula, and you're saying, Okay, I'm trying to make a z score, that's what 5023 07:42:58,660 --> 07:43:02,890 it asks for, I'm trying to find a probability, that's what it asks for, look in the question, 5024 07:43:02,890 --> 07:43:08,650 and you'll probably find the x in there. A big problem that I see is people mistake 5025 07:43:08,650 --> 07:43:15,590 little z's for p's. Now, obviously, if you've got a z that's like negative, you know, a 5026 07:43:15,590 --> 07:43:18,542 p can't be negative, a probability can't be negative. So you won't make that mistake. 5027 07:43:18,542 --> 07:43:25,510 Even if it's like negative 0.25, right? You won't make that mistake. And if 5028 07:43:25,510 --> 07:43:30,100 the z is bigger than one, you won't make that mistake. So if you see a z equals 2.5, you're 5029 07:43:30,100 --> 07:43:34,900 like, obviously, that's not a probability. 
But when you have a little baby z score that's 5030 07:43:34,900 --> 07:43:41,700 between zero and one, like 0.023, it looks a lot like a p, but it's still a 5031 07:43:41,700 --> 07:43:45,440 z. So a lot of times people get a little lazy, like they hate using the Z table, and 5032 07:43:45,440 --> 07:43:49,030 then they calculate the z score, and it's really little, so they don't look it up. Don't 5033 07:43:49,030 --> 07:43:51,490 be fooled. You still have to look it up. So 5034 07:43:51,490 --> 07:43:56,030 if you're calculating z, and you get a little baby z like that, it still is a z; still go look 5035 07:43:56,030 --> 07:44:00,890 it up. Okay. Then finally, remember how step one was draw a picture? And I went on and 5036 07:44:00,890 --> 07:44:06,450 on about that. Step 99, or the last step before you're done with the question, is check your 5037 07:44:06,450 --> 07:44:11,202 logic against that picture. So if you shaded a big part of your picture, your probability 5038 07:44:11,202 --> 07:44:17,490 should be bigger than 0.5, or 50%. If you shaded a little tiny part of your picture, 5039 07:44:17,490 --> 07:44:21,570 and you're getting like 0.95 something, you know that that's wrong. So 5040 07:44:21,570 --> 07:44:26,050 please check your logic against the picture before you say that you're done with your 5041 07:44:26,050 --> 07:44:35,160 question. Okay. So you made it through this long lecture about z, and about probabilities. 5042 07:44:35,160 --> 07:44:39,872 So I gave you an introduction to the standard normal curve and to those two z score formulas. 5043 07:44:39,872 --> 07:44:45,570 I showed you how to calculate z scores, and how to look up probabilities. And I also showed 5044 07:44:45,570 --> 07:44:51,400 you at the end how to calculate x if given a z score or a probability. 
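The check-your-logic-against-the-picture step can also be written down as a tiny test. This is only an illustration: the helper `matches_picture` is hypothetical, and `statistics.NormalDist` is assumed in place of the printed z table.

```python
from statistics import NormalDist

std_normal = NormalDist()  # stand-in for the printed z table

def matches_picture(p, shaded_big_piece):
    """True when probability p agrees with the piece shaded in the sketch."""
    return p > 0.5 if shaded_big_piece else p < 0.5

z = 1.69
forgot_to_subtract = std_normal.cdf(z)  # left area, about 0.9545
right_tail = 1 - std_normal.cdf(z)      # right area, about 0.0455

# We shaded the little piece above z = 1.69, so only the second value passes.
print(matches_picture(forgot_to_subtract, shaded_big_piece=False))  # False
print(matches_picture(right_tail, shaded_big_piece=False))          # True
```

The first call fails the check, which is the signal that the subtract-from-one step was skipped.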
Okay, and all 5045 07:44:51,400 --> 07:44:56,410 I want to say is, unfortunately, those students those pretend students on that distribution, 5046 07:44:56,410 --> 07:45:02,378 they were none of them got 100% Okay? That's not the case in our class, a lot of times 5047 07:45:02,378 --> 07:45:08,820 people get 100% on the quizzes. That's why I can't use your grades as examples. Okay, 5048 07:45:08,820 --> 07:45:16,700 so good luck on the quiz. Well, hello, it's time for statistics. It's Monica wahi, your 5049 07:45:16,700 --> 07:45:24,870 library college lecturer back with chapter 7.4 and 7.5 sampling distributions and the 5050 07:45:24,870 --> 07:45:31,040 central limit theorem. So at the end of this lecture, you should be able to state the new 5051 07:45:31,040 --> 07:45:36,840 statistical notation for parameters and statistics, for two measures of variation. 5052 07:45:36,840 --> 07:45:38,792 Name one type 5053 07:45:38,792 --> 07:45:44,510 of inference and describe it. explain the difference between a frequency distribution 5054 07:45:44,510 --> 07:45:50,970 and a sampling distribution, describe the central limit theorem in either words or formulas, 5055 07:45:50,970 --> 07:45:57,490 and also describe how to calculate the standard error. So, here's your introduction to this 5056 07:45:57,490 --> 07:46:03,798 lecture. And as you can see, I must 7.4 and 7.5. Together Again, they felt like a natural 5057 07:46:03,798 --> 07:46:09,960 fit. First, we're going to review and maybe overview on parameters, statistics, and also 5058 07:46:09,960 --> 07:46:15,860 inferences, we're going to just talk about those ideas, because that will sort of easy 5059 07:46:15,860 --> 07:46:21,270 into the next part, which is where we start talking about sampling distribution, which 5060 07:46:21,270 --> 07:46:26,650 is the new concept here. Okay. And then we'll go on to talk about the central limit theorem. 
5061 07:46:26,650 --> 07:46:32,202 And finally, I'll do a little demonstration of how to find probabilities regarding x 5062 07:46:32,202 --> 07:46:33,202 bar. 5063 07:46:33,202 --> 07:46:35,690 So if you're not really sure about what that means, don't worry, you should be able to 5064 07:46:35,690 --> 07:46:43,160 understand it at the end of this lecture. All right, here's the first part: parameters, 5065 07:46:43,160 --> 07:46:49,270 statistics and inferences. And this is the review and overview I promised you. So if 5066 07:46:49,270 --> 07:46:54,730 you remember from a long time ago, a statistic is a numerical measure describing a sample. 5067 07:46:54,730 --> 07:47:01,820 And a parameter is a numerical measure describing a population. Remember, s, s, sample statistic; 5068 07:47:01,820 --> 07:47:09,150 p, p, population parameter. You probably remember that. Okay, so we have different ways of notating 5069 07:47:09,150 --> 07:47:14,872 these. So if you look under measure, like, you see mean, right, and if it's a statistic, 5070 07:47:14,872 --> 07:47:20,130 it's x bar, and I say x bar on the slide sometimes because it's hard to make 5071 07:47:20,130 --> 07:47:25,240 that little line always be positioned above the x. So I'm just lazy and say x bar. And 5072 07:47:25,240 --> 07:47:30,940 then under parameter, it's that mu symbol, so it's pronounced mu, but it looks like 5073 07:47:30,940 --> 07:47:36,230 that thing on the slide. All right, um, the next two, variance and standard deviation, 5074 07:47:36,230 --> 07:47:43,000 remember how they're friends. And so the statistic version is the s; for variance, it's the s 5075 07:47:43,000 --> 07:47:50,220 with the little two up there, the exponent, because you know, standard deviation 5076 07:47:50,220 --> 07:47:54,510 to the second is variance, and the square root of variance is standard deviation. 
5077 07:47:54,510 --> 07:48:01,130 So that's why they have s and then s to the second for the statistic, okay. For the parameter, 5078 07:48:01,130 --> 07:48:06,970 it's that lowercase sigma symbol. And it's that to the second when it's variance, 5079 07:48:06,970 --> 07:48:15,000 and it's just without the exponent when it's just the regular parameter of standard deviation, 5080 07:48:15,000 --> 07:48:16,000 right. 5081 07:48:16,000 --> 07:48:19,490 And you're used to seeing these on the slides. This is just review. Also, as mentioned 5082 07:48:19,490 --> 07:48:26,282 in the book, the proportion statistic is p hat, and then the parameter is p. But I don't really go 5083 07:48:26,282 --> 07:48:32,810 into that. I just wanted to do a little shout out to it. Okay, let's think about the word 5084 07:48:32,810 --> 07:48:38,990 inference, like infer, like, if somebody implies something, maybe you'll infer it. Like, he 5085 07:48:38,990 --> 07:48:44,180 implied it would be hard if I came over late that night, so I inferred that I shouldn't 5086 07:48:44,180 --> 07:48:50,110 come over late then. So like here, you know, you may have heard the term, where there's 5087 07:48:50,110 --> 07:48:56,160 smoke, there's fire. And so you see this on the slide, there's a lot of smoke. Is there 5088 07:48:56,160 --> 07:49:01,700 fire, though? Is that smoke coming from fire? Because if you look at it, it probably could 5089 07:49:01,700 --> 07:49:08,660 be coming from fire. But there's sort of this outside chance it's not what we think it 5090 07:49:08,660 --> 07:49:13,070 is. Like maybe, you know, if you've ever used a fire extinguisher, they make all 5091 07:49:13,070 --> 07:49:18,850 this foam come out. Maybe it's that, you know, or maybe it's like, if you've ever had 5092 07:49:18,850 --> 07:49:24,840 dry ice, and then that makes a bunch of smoke. Maybe it's not fire, right? So where there's 5093 07:49:24,840 --> 07:49:28,692 smoke, there's fire. That's an inference. 
Well, let's see 5094 07:49:28,692 --> 07:49:30,420 if it's actually fire, 5095 07:49:30,420 --> 07:49:35,500 right. But we weren't sure we thought it was likely to be fire. But we weren't sure. And 5096 07:49:35,500 --> 07:49:41,200 so there's inference is something that you do in statistics, because you use probability 5097 07:49:41,200 --> 07:49:45,130 to make these inferences because you can't see the fire. You can just see the smoke and 5098 07:49:45,130 --> 07:49:49,890 you're not sure, right? So there's three different kinds. I'm going to talk about the first kind 5099 07:49:49,890 --> 07:49:55,114 of estimation, where we estimate the value of a parameter using a sample. So the sample 5100 07:49:55,114 --> 07:50:00,010 is kind of like the smoke and the parameters the fire we can't see. So we estimate 5101 07:50:00,010 --> 07:50:07,440 Okay, and we're going to talk about that in chapter eight more. A second time, type of 5102 07:50:07,440 --> 07:50:12,160 inference we do is testing, where we do a test to help us make a decision about a population 5103 07:50:12,160 --> 07:50:17,130 parameter. In other words, we don't know one, but we want to make a decision about it. So 5104 07:50:17,130 --> 07:50:22,860 we do a statistical test. And we're not going to get into that, that's in chapter nine. 5105 07:50:22,860 --> 07:50:28,200 Finally, there's regression, where we make predictions or forecasts about a statistic, 5106 07:50:28,200 --> 07:50:34,560 that's a third kind of inference. And we actually already did this in chapter 4.2. So the reason 5107 07:50:34,560 --> 07:50:42,260 why I bring up all of this is that estimation, which is going to be in chapter eight, and 5108 07:50:42,260 --> 07:50:45,510 testing, which is going to be in chapter nine, but we're not going over chapter nine in this 5109 07:50:45,510 --> 07:50:52,360 class. 
But um, but if we were, you know, you'd have to know this because in this lecture, 5110 07:50:52,360 --> 07:50:57,180 I'm going to talk about sampling distributions in the central limit theorem. And you need 5111 07:50:57,180 --> 07:51:01,708 to grasp those things in order to do those, these two things on the slide that with the 5112 07:51:01,708 --> 07:51:07,372 box around them, estimation, and testing. And so that's why I'm bringing this up now. 5113 07:51:07,372 --> 07:51:13,360 Okay, so now we're going to move on to talking about sampling distribution, and how it's 5114 07:51:13,360 --> 07:51:20,830 different from a frequency distribution. Alright, so let's just remind ourselves what a frequency 5115 07:51:20,830 --> 07:51:26,470 distribution actually is. Okay? So remember that from a long time ago, what you would 5116 07:51:26,470 --> 07:51:33,680 have is a quantitative variable, you'd make a frequency table. And then you use that to 5117 07:51:33,680 --> 07:51:39,260 graph the histogram, right. And here, I made an example down there of frequency histogram 5118 07:51:39,260 --> 07:51:43,200 that shows a normal distribution. And so that's what you would do, you know, step two would 5119 07:51:43,200 --> 07:51:50,080 be draw it. And then you see the shape and figure out what the distribution was of that 5120 07:51:50,080 --> 07:51:58,362 quantitative variable, or that x, okay, because each one of these is an X, like the middle 5121 07:51:58,362 --> 07:52:04,100 one, it's almost 30 X's that are in that frequency. Okay, now we're going to talk about sampling 5122 07:52:04,100 --> 07:52:09,730 distribution, it's a little more complicated. In a sampling distribution, you start out 5123 07:52:09,730 --> 07:52:14,230 with a population, that's the first thing is you're dealing with population, then you 5124 07:52:14,230 --> 07:52:20,050 pick an N, of a certain size, like you pick a number, that you're going to have your sample 5125 07:52:20,050 --> 07:52:28,160 size B. 
And then you take as many samples of that size as possible from the population. 5126 07:52:28,160 --> 07:52:34,500 And then you make an x bar from each of the samples. So there's a ton of samples, right? 5127 07:52:34,500 --> 07:52:38,110 And I'll show you a little demonstration, so you can really wrap your mind around how 5128 07:52:38,110 --> 07:52:43,630 many different samples that can be. But each one is going to have an x bar. And then you 5129 07:52:43,630 --> 07:52:47,930 make a histogram of all those x bars. So like I said, I'm going to just kind of show you 5130 07:52:47,930 --> 07:52:53,202 what I'm talking about. So we're going to imagine this is a population of people. And 5131 07:52:53,202 --> 07:52:57,490 we're going to imagine we're going to talk about BMI, or body mass index, just so you 5132 07:52:57,490 --> 07:53:01,878 can wrap your mind around this. So you start with this population. Let's decide on an n. 5133 07:53:01,878 --> 07:53:08,320 How about five? Five is good, right? So now what the deal is, is I'm trying to take as 5134 07:53:08,320 --> 07:53:15,000 many samples of size n as possible from all of these people on the slide. So here's our first 5135 07:53:15,000 --> 07:53:21,030 sample we took, and we got an x bar for BMI of 23 from these five people. Well, let's 5136 07:53:21,030 --> 07:53:25,590 try these five people. Now, look, we double dipped with that first one, okay, but we get 5137 07:53:25,590 --> 07:53:32,090 this x bar of 21. And we can keep going. And actually, there's gonna be a ton of these, 5138 07:53:32,090 --> 07:53:37,160 right, there's a ton of different ones. But it's finite. I mean, at the end of the day, 5139 07:53:37,160 --> 07:53:42,600 there's only so many groups of five I can get out of this population on the slide, and 5140 07:53:42,600 --> 07:53:48,910 each group of five is going to have its own x bar.
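The procedure described so far (start with a population, pick an n, take every possible sample of that size, get an x bar from each) can be sketched in a few lines of Python. This is only an illustration: the ten BMI values are made up, not the ones on the lecturer's slide, and the variable names are assumptions. Exact fractions are used so the comparison of means is not blurred by floating-point rounding.

```python
from collections import Counter
from fractions import Fraction
from itertools import combinations

# Hypothetical BMI values for a tiny population of 10 people (invented for illustration).
population = [19, 21, 22, 23, 24, 25, 26, 28, 30, 32]
n = 5  # sample size, as in the lecture's demonstration

# Population mean (mu), kept as an exact fraction.
mu = Fraction(sum(population), len(population))

# Every possible group of five people, and the x bar of each group.
x_bars = [Fraction(sum(sample), n) for sample in combinations(population, n)]

# Frequency table of the x bars, rounded to whole numbers so they fall into bins.
frequency_table = Counter(round(x_bar) for x_bar in x_bars)

mean_of_x_bars = sum(x_bars) / len(x_bars)

print("number of possible samples:", len(x_bars))
print("population mean (mu):", float(mu))
print("mean of all the x bars:", float(mean_of_x_bars))
for value in sorted(frequency_table):
    # A crude text histogram of the x bars.
    print(f"{value:>3} {'#' * frequency_table[value]}")
```

Even with only ten people there are 252 different groups of five, which is why the frequencies in a histogram of x bars get large so quickly.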
So I could write down every single 5141 07:53:48,910 --> 07:53:53,730 one of those x bars I get for every single group of five I can make out of this. And 5142 07:53:53,730 --> 07:53:59,740 then I can make a histogram of all the x bars. And, of course, I'd start with a frequency 5143 07:53:59,740 --> 07:54:05,150 table. But look at the frequencies, they're huge. That's because you can get just a ton 5144 07:54:05,150 --> 07:54:12,292 of samples out of one population. And so what you'll see is if you make a histogram out 5145 07:54:12,292 --> 07:54:17,692 of that, it looks normally distributed, it's just that the frequencies are really high, 5146 07:54:17,692 --> 07:54:21,910 because there's a whole bunch of different samples you can take. And remember, this is 5147 07:54:21,910 --> 07:54:29,690 a frequency histogram of x bars. Each thing counted in these frequencies is an x bar that 5148 07:54:29,690 --> 07:54:35,870 you got out of a group of five you could take. And so that's what the sampling distribution 5149 07:54:35,870 --> 07:54:41,730 is: it ends up looking like a histogram, but it's a histogram of all the possible x bars 5150 07:54:41,730 --> 07:54:47,540 you could get from all the possible samples of whatever n size you picked from the population 5151 07:54:47,540 --> 07:54:49,890 that you 5152 07:54:49,890 --> 07:54:51,060 have. 5153 07:54:51,060 --> 07:54:57,010 So the fancy way, the official statistical way of saying it, is: a sampling 5154 07:54:57,010 --> 07:55:03,850 distribution is a probability distribution of a sample statistic, in this case x bar, 5155 07:55:03,850 --> 07:55:10,690 based on all possible simple random samples of the same size from the same population. 5156 07:55:10,690 --> 07:55:15,792 So that's what makes it the sampling distribution and not a frequency distribution. And so in 5157 07:55:15,792 --> 07:55:19,900 the next section, so you're probably like, Okay, great, that's wonderful.
You just explained 5158 07:55:19,900 --> 07:55:23,610 that. But in the next section, we're going to talk about the central limit theorem, here 5159 07:55:23,610 --> 07:55:28,390 comes a theorem, right. And there's a proof for the theorem. And you need to understand 5160 07:55:28,390 --> 07:55:34,042 this concept of sampling distribution for inference in order to understand this proof, 5161 07:55:34,042 --> 07:55:40,900 so I just had to go through this. Okay, now we're on to the central limit theorem, and 5162 07:55:40,900 --> 07:55:48,542 how it's used for statistical inference. So I'm gonna start by explaining it in words, 5163 07:55:48,542 --> 07:55:54,110 and see that sampling distribution over there? So these are the words around the central limit 5164 07:55:54,110 --> 07:55:58,970 theorem. It says, for any normal distribution, and remember, we're talking about a normal 5165 07:55:58,970 --> 07:56:04,270 distribution here, the sampling distribution, meaning the distribution of the x bars from 5166 07:56:04,270 --> 07:56:09,272 all possible samples, like we just talked about, is a normal distribution, meaning it's 5167 07:56:09,272 --> 07:56:14,600 not skewed, it's not bimodal, whatever, it looks kinda like what is on the slide. Okay. 5168 07:56:14,600 --> 07:56:23,590 And then two, this is important: the mean of the x bars is actually mu. So I had a student 5169 07:56:23,590 --> 07:56:31,260 who would say, Oh, the x bar of the x bars is mu. And that's actually true. If you actually 5170 07:56:31,260 --> 07:56:35,560 did the thing I described, which don't try it at home, because you'll be up all night 5171 07:56:35,560 --> 07:56:41,700 taking samples, okay.
But if you did, if you actually got all samples of five from a population, 5172 07:56:41,700 --> 07:56:49,090 and got all their x bars, and you made a mean of all those x bars, you'd get mu. And how 5173 07:56:49,090 --> 07:56:53,240 you could check it is, of course, just easily taking a mean of the entire population, like 5174 07:56:53,240 --> 07:56:57,080 that would have been the easy way to do it. But no, if you do it this way, where you get 5175 07:56:57,080 --> 07:57:00,863 every possible x bar for a particular sample size, and then you make an x bar of those x 5176 07:57:00,863 --> 07:57:05,850 bars, you'll get mu. So that's, you know, it's a proof. So that sounds like a thing 5177 07:57:05,850 --> 07:57:10,840 that would be in a proof, right? Now, here's the next part. Three: the standard deviation 5178 07:57:10,840 --> 07:57:17,798 of all those x bars is actually the population standard deviation divided by the square root 5179 07:57:17,798 --> 07:57:23,110 of whatever n you picked. So in other words, if you have the whole population data, and 5180 07:57:23,110 --> 07:57:27,000 you just found out the standard deviation, you just have the standard deviation. But 5181 07:57:27,000 --> 07:57:30,890 if you did this thing with the x bar, where you took all those x bars, and you found the 5182 07:57:30,890 --> 07:57:36,840 standard deviation of those x bars, that would equal the population standard deviation divided 5183 07:57:36,840 --> 07:57:43,192 by the square root of whatever n you used to get all those x bars. Again, sounds really 5184 07:57:43,192 --> 07:57:47,370 proofy in theory, but that's the third part of the central limit theorem 5185 07:57:47,370 --> 07:57:48,780 in words. 5186 07:57:48,780 --> 07:57:54,770 And so some people like to look at it from a formula standpoint.
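The second and third parts of the theorem in words (the mean of the x bars is mu, and their standard deviation is the population standard deviation over the square root of n) can be checked roughly by simulation. This is a sketch under stated assumptions, not the lecturer's method: mu and sigma are invented BMI-like values, and instead of listing every possible sample it draws many random samples from a theoretical normal population. For a small finite population sampled without replacement, the spread of the x bars would come out somewhat smaller than sigma over the square root of n.

```python
import random
from math import sqrt
from statistics import pstdev

random.seed(0)  # fixed seed so the run is repeatable

mu, sigma = 25.0, 4.0   # assumed population mean and standard deviation (illustrative)
n = 5                   # sample size
num_samples = 20000     # many samples stand in for "all possible samples"

# Draw many samples of size n and keep the x bar of each one.
x_bars = [
    sum(random.gauss(mu, sigma) for _ in range(n)) / n
    for _ in range(num_samples)
]

mean_of_x_bars = sum(x_bars) / len(x_bars)
sd_of_x_bars = pstdev(x_bars)
standard_error = sigma / sqrt(n)

print(f"mean of the x bars: {mean_of_x_bars:.3f}   (mu = {mu})")
print(f"sd of the x bars:   {sd_of_x_bars:.3f}   (sigma / sqrt(n) = {standard_error:.3f})")
```

The two printed pairs should agree closely but not exactly, since a simulation only approximates the full sampling distribution.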
So you'll see 5187 07:57:54,770 --> 07:57:58,792 on the right side of the slide, in these little formulas, that n means the sample 5188 07:57:58,792 --> 07:58:03,670 size. And remember, I picked five, you could pick a different one, right? And mu is the 5189 07:58:03,670 --> 07:58:09,452 mean of the x distribution, meaning the population mean, right. And then that population standard 5190 07:58:09,452 --> 07:58:13,048 deviation symbol is the standard deviation of the x distribution, meaning the population 5191 07:58:13,048 --> 07:58:18,480 standard deviation. So we look on the left. Now this is just a formula version of what 5192 07:58:18,480 --> 07:58:24,540 I just said: the mu of all the x bars that you could get from a particular sample size in a particular 5193 07:58:24,540 --> 07:58:28,960 population is going to equal the mean of the population. And the standard deviation of 5194 07:58:28,960 --> 07:58:33,530 all those x bars is going to equal the population standard deviation divided by the square root 5195 07:58:33,530 --> 07:58:41,042 of whatever n you picked. So now, I just want to point out the z thing. We've been doing 5196 07:58:41,042 --> 07:58:47,480 this z thing, right, but we've been doing it with one x. Now, if you imagine grabbing a 5197 07:58:47,480 --> 07:58:53,430 bunch of x's, in other words, a sample, this is the formula you're going to be using, which 5198 07:58:53,430 --> 07:59:01,820 is x bar minus mu, over the standard deviation divided by the square root of n, right? And 5199 07:59:01,820 --> 07:59:07,620 so that's kind of what we're moving into here: what happens if you get a sample and you're 5200 07:59:07,620 --> 07:59:15,640 looking at x bar, not if you just grab one x and you're looking at that. So I wanted to 5201 07:59:15,640 --> 07:59:21,510 point out, first of all, that this whole thing is only supposed to happen if your n is greater 5202 07:59:21,510 --> 07:59:28,170 than 30. Okay? Otherwise, you shouldn't really be doing this.
Then the second thing I wanted 5203 07:59:28,170 --> 07:59:35,202 to point out is that this piece underneath, in the lower part of the equation, that's 5204 07:59:35,202 --> 07:59:41,440 called the standard error. They named that piece. And part of the reason why I like that 5205 07:59:41,440 --> 07:59:47,670 they named that piece separately is I usually make that piece before I even do the equation. 5206 07:59:47,670 --> 07:59:52,270 So I just have that number sitting around because, you know, there's a square root underneath 5207 07:59:52,270 --> 07:59:57,862 this standard deviation, and that whole thing is underneath another thing, so it's hard to 5208 07:59:57,862 --> 08:00:03,530 do all that dividing. So I usually just make that standard error first, by taking the 5209 08:00:03,530 --> 08:00:07,250 population standard deviation divided by the square root of n, and just have that number, 5210 08:00:07,250 --> 08:00:12,470 and then later I use it in this z equation. So that's two things I wanted you to notice. 5211 08:00:12,470 --> 08:00:18,622 So I brought that out on the slide. Okay, here's more on the central limit theorem. 5212 08:00:18,622 --> 08:00:24,770 So if the distribution of x is normal, then the distribution of x bar is also normal. 5213 08:00:24,770 --> 08:00:29,580 So we look at the top, that's an example of just an x distribution. And then if you go 5214 08:00:29,580 --> 08:00:33,950 do that thing, where you take all those samples, and you get all those x bars, and then you 5215 08:00:33,950 --> 08:00:38,590 make the histogram, you'll see the pink one down lower, an x bar distribution. 5216 08:00:38,590 --> 08:00:42,340 This is just a pictorial example. 5217 08:00:42,340 --> 08:00:50,208 But even if the distribution of x is not normal, as long as n is more 5218 08:00:50,208 --> 08:00:56,580 than 30, the central limit theorem says that the x bar distribution is approximately normal.
5219 08:00:56,580 --> 08:01:03,970 So remember, a lot of that hospital data we've been looking at, like hospital beds in a 5220 08:01:03,970 --> 08:01:10,890 state, often you'll see a skewed distribution. But if you have more than 30 hospitals, then 5221 08:01:10,890 --> 08:01:18,390 what you could do is you could pick an n, and take n bigger than 30, and take a bunch 5222 08:01:18,390 --> 08:01:22,730 of samples and get a bunch of x bars, and it's not just a bunch, get all of them, all of the 5223 08:01:22,730 --> 08:01:27,710 possible ones. And then if you made that x bar distribution, even though the hospital 5224 08:01:27,710 --> 08:01:33,792 beds would be skewed, just as an x distribution, their x bar distribution would be approximately normal. 5225 08:01:33,792 --> 08:01:38,730 And that's one other important piece of the central limit theorem. That's one important 5226 08:01:38,730 --> 08:01:45,190 piece of that proof: all of those x bars that you get will end up on an approximately normal 5227 08:01:45,190 --> 08:01:50,290 distribution, even if your underlying distribution is not normal, so long as the n you're picking 5228 08:01:50,290 --> 08:01:57,060 is greater than 30. And finally, that leads to, you know, proofs, they build on each 5229 08:01:57,060 --> 08:02:01,860 other, that leads us to the concept that a sample statistic is considered unbiased, just 5230 08:02:01,860 --> 08:02:10,190 unbiased, right? It's not perfect, but it's unbiased, if the mean of its sampling distribution 5231 08:02:10,190 --> 08:02:16,380 equals the parameter being estimated. In other words, the fact that the x bar of the x bars 5232 08:02:16,380 --> 08:02:23,628 is mu means that an x bar is going to be unbiased. It might not be mu, it might 5233 08:02:23,628 --> 08:02:29,841 not be exactly the same as the population mean. But it will be unbiased. It's not a 5234 08:02:29,841 --> 08:02:38,280 biased representation of mu.
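The point about skewed data can be illustrated with a small simulation. This is a sketch only: an exponential distribution with mean 100 is used as a made-up stand-in for a right-skewed variable like hospital beds (it is not the course's hospital data), and skewness is measured as the average cubed z score, which is near 0 for a symmetric shape. The helper name `skewness` is an assumption for this example.

```python
import random
from math import sqrt

random.seed(1)  # fixed seed so the run is repeatable

def skewness(values):
    """Average cubed z score: about 0 for a symmetric shape, positive for right skew."""
    count = len(values)
    m = sum(values) / count
    s = sqrt(sum((v - m) ** 2 for v in values) / count)
    return sum(((v - m) / s) ** 3 for v in values) / count

n = 36              # sample size, bigger than 30
num_samples = 5000  # many samples stand in for "all possible samples"

# Individual x values from a right-skewed population (illustrative stand-in).
x_values = [random.expovariate(1 / 100) for _ in range(20000)]

# The x bar of each of many samples of size n from the same population.
x_bars = [
    sum(random.expovariate(1 / 100) for _ in range(n)) / n
    for _ in range(num_samples)
]

print(f"skewness of the x distribution:     {skewness(x_values):.2f}")
print(f"skewness of the x bar distribution: {skewness(x_bars):.2f}")
```

The x bars should come out far less skewed than the individual values, though not perfectly symmetric, which matches the "approximately normal" wording.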
All right, now let's move on to finding probabilities regarding 5235 08:02:38,280 --> 08:02:42,230 x bar. So for those of you who want to actually do something and apply something and stop 5236 08:02:42,230 --> 08:02:48,930 thinking about theory, let's go. Okay, but let's remind ourselves, what are we doing? 5237 08:02:48,930 --> 08:02:54,910 Right? What are we doing? Well, what were we doing in chapters 7.1 through 7.3? We were 5238 08:02:54,910 --> 08:03:01,470 looking at having a normally distributed x. So we had this population of quantitative 5239 08:03:01,470 --> 08:03:06,360 values that were normally distributed. And we had a population mean, a mu, and we had the 5240 08:03:06,360 --> 08:03:11,810 population standard deviation. And we kept doing these exercises, where we were finding 5241 08:03:11,810 --> 08:03:17,298 the probability of selecting a value from that population, an x from that population, 5242 08:03:17,298 --> 08:03:22,542 above or below a certain value of x, right. And so we were looking at the probabilities, 5243 08:03:22,542 --> 08:03:27,060 and we'd look up the z score in the z table probabilities. And so basically, what we would 5244 08:03:27,060 --> 08:03:35,070 be doing is converting an x to z, right. And we used this formula here to convert x to z. 5245 08:03:35,070 --> 08:03:39,650 So whenever we had an x, we could put it on the z distribution, and we could figure out 5246 08:03:39,650 --> 08:03:46,060 the probability. So here's what's different now. You'll notice the first thing has not 5247 08:03:46,060 --> 08:03:49,920 changed: we're still talking about normally distributed x's, we're still talking about 5248 08:03:49,920 --> 08:03:55,090 a population where we have a mu and a population standard deviation. But now we're not just 5249 08:03:55,090 --> 08:04:02,370 grabbing one x from that population, we're grabbing a sample. And because we're grabbing a sample, 5250 08:04:02,370 --> 08:04:07,622 we have to pick an n.
So the n is going to be different each time, right? So we're grabbing 5251 08:04:07,622 --> 08:04:11,470 a sample of the population. Well, how do we boil that down to one number? Well, we're 5252 08:04:11,470 --> 08:04:18,640 taking the x bar, or the mean value, from that sample. And that's what we're doing. The z 5253 08:04:18,640 --> 08:04:26,378 score is for that x bar instead of the x, because we're taking a sample. So when you see the 5254 08:04:26,378 --> 08:04:33,230 formula below, you'll notice that the other one just had x in it, because we only had 5255 08:04:33,230 --> 08:04:41,112 one. This one has x bar, because we have a sample. You'll also notice that downstairs, 5256 08:04:41,112 --> 08:04:45,160 what we had before was the population standard deviation, but now 5257 08:04:45,160 --> 08:04:49,522 we have the standard error. Remember I talked about that, the population standard deviation 5258 08:04:49,522 --> 08:04:55,230 divided by the square root of n? That's where n comes in, because it's going to matter 5259 08:04:55,230 --> 08:05:04,160 what n you have to make the z come out right. Alright, so now that we're reminded of what 5260 08:05:04,160 --> 08:05:11,500 we're doing, we'll just explain how to do it, right? So let's say you do have an n, right, 5261 08:05:11,500 --> 08:05:16,730 and you have an x bar, like you grabbed your n and you got an x bar. You can convert that 5262 08:05:16,730 --> 08:05:23,030 x bar to a z score using this formula, where, of course, you have to be told the population 5263 08:05:23,030 --> 08:05:27,311 mean and the population standard deviation, but then you'll have your x bar and you'll 5264 08:05:27,311 --> 08:05:31,970 have your n. So you can do the whole equation. And then you'll get the z, and guess what 5265 08:05:31,970 --> 08:05:35,030 you do. What do you do with a z? You look it up. So you look up the probability for 5266 08:05:35,030 --> 08:05:42,260 the z score in the z table, like in chapters 7.2 and 7.3.
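The conversion just described (make the standard error, turn x bar into z, then look z up) can be sketched as three small functions. This is an illustration, not the course's own material: the function names are invented, the numbers in the usage loop are hypothetical, and Python's `statistics.NormalDist` is swapped in for the printed z table, giving the area to the left of z as such a table does.

```python
from math import sqrt
from statistics import NormalDist

def standard_error(sigma, n):
    """Population standard deviation divided by the square root of the sample size."""
    return sigma / sqrt(n)

def z_for_sample_mean(x_bar, mu, sigma, n):
    """z score for a sample mean: (x bar - mu) / standard error."""
    return (x_bar - mu) / standard_error(sigma, n)

def left_tail_probability(z):
    """Area to the left of z under the standard normal curve (the z table value)."""
    return NormalDist().cdf(z)

# Hypothetical population and sample mean, only to show how n changes the z.
mu, sigma, x_bar = 50.0, 10.0, 52.0
for n in (36, 64, 100):
    se = standard_error(sigma, n)
    z = z_for_sample_mean(x_bar, mu, sigma, n)
    print(f"n = {n:>3}  SE = {se:.2f}  z = {z:.2f}  P(Z < z) = {left_tail_probability(z):.4f}")
```

The same x bar lands at a larger z as n grows, because the standard error in the denominator shrinks. That is why the n has to be known before the z can come out right.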
Only, this is just about x bar, 5267 08:05:42,260 --> 08:05:49,650 basically. So um, and then I thought, what I would do is walk you through two examples. 5268 08:05:49,650 --> 08:05:56,240 You're already kind of good at this, because this is not too different from 7.2 and 7.3. 5269 08:05:56,240 --> 08:06:01,340 But I just want to walk you through it, because it is a little different when you have a sample 5270 08:06:01,340 --> 08:06:07,470 versus just one x. Okay, so remember our poor chemistry class that I was in when I got to 5271 08:06:07,470 --> 08:06:12,050 73? Well, remember, we were assuming it was a 100-student class. So there were 100 students 5272 08:06:12,050 --> 08:06:17,530 in the class, N equals 100 in the class, capital N, right, because they're the population. And 5273 08:06:17,530 --> 08:06:22,420 then if you look on the slide, you'll see the mu of their scores was pretty bad. It 5274 08:06:22,420 --> 08:06:29,950 was 65.5 on a 100-point test, and the population standard deviation was 14.5. So this was the 5275 08:06:29,950 --> 08:06:35,660 population of this 100-student class. So I'm going to do some exercises here. Let's say 5276 08:06:35,660 --> 08:06:40,480 we're going to pick a, we have to pick an n bigger than 30. So we're going to pick an 5277 08:06:40,480 --> 08:06:47,220 n of 49. Right? Now, I'm coming up with a little scenario here. To pass the class, students 5278 08:06:47,220 --> 08:06:52,690 have to get at least 70, which is a C. So let's pretend this is the question: what is 5279 08:06:52,690 --> 08:07:00,890 the probability of me selecting a sample of 49 students with an x bar greater than 70? 5280 08:07:00,890 --> 08:07:04,810 Notice how we ask the question a little bit differently. What's the probability of me 5281 08:07:04,810 --> 08:07:10,390 getting a set of 49 students such that their x bar is greater than 70?
Does that not kind of 5282 08:07:10,390 --> 08:07:15,680 remind you of the central limit theorem, where we had to go back and get, like, an n of five, 5283 08:07:15,680 --> 08:07:20,798 and we got different samples of five? What's the probability of me getting one of those 5284 08:07:20,798 --> 08:07:29,050 samples that has an x bar greater than 70? That's the question, right. And I drew 5285 08:07:29,050 --> 08:07:35,880 this out here, remember our old z distribution, with also our x distribution, and I kind 5286 08:07:35,880 --> 08:07:39,230 of drew where 70 is. But I wanted to point out for you, the 5287 08:07:39,230 --> 08:07:43,798 probability for an x bar is going to be smaller than for x, because you're going to have to 5288 08:07:43,798 --> 08:07:51,798 do a lot of work to get that x bar to be above 70. Right? So here we go. So I'm just going 5289 08:07:51,798 --> 08:07:57,980 to remind you that the equation at the top and the equation at the bottom are the same 5290 08:07:57,980 --> 08:08:03,560 equation. I'm just using the term SE for the standard error. And I like to calculate 5291 08:08:03,560 --> 08:08:08,780 that separately, like I told you, so I like to do that first. So we're going to do that. 5292 08:08:08,780 --> 08:08:15,280 And how do we do that? Well, the n was 49, right? And the population standard deviation 5293 08:08:15,280 --> 08:08:22,250 is 14.5. So that's where we get this number, the standard error of 2.1. So now, 5294 08:08:22,250 --> 08:08:29,351 let's calculate the z. All right, here's z. So z is our x, which is our x bar, which is 5295 08:08:29,351 --> 08:08:39,360 70, minus 65.5, which is our mu, divided by our pre-cooked standard error, which is 2.1. 5296 08:08:39,360 --> 08:08:45,710 And we get a z of 2.17. So we're tempted to look that up. But let's look at our picture. 5297 08:08:45,710 --> 08:08:51,770 So here's our z distribution.
And what we're going for is this little piece at the top, 5298 08:08:51,770 --> 08:08:57,350 right above 2.17. So that's a little piece. So we got to look for that, right? Let's go 5299 08:08:57,350 --> 08:09:03,970 look. So because we're going to go for the piece at the top, we're going to use the opposite 5300 08:09:03,970 --> 08:09:10,920 z. There's, remember, two ways of doing this. But everybody seems to prefer the way where 5301 08:09:10,920 --> 08:09:15,010 you use the opposite z if you're looking for something to the right. So we're going to 5302 08:09:15,010 --> 08:09:21,440 use negative 2.17 to get the little piece, right? Because when you look that up, and I'm not going 5303 08:09:21,440 --> 08:09:28,810 to demonstrate, you guys are good at this now, you get p equals 0.0150. If you were to look 5304 08:09:28,810 --> 08:09:35,351 up 2.17, then you'd get the big piece. So that's why we do this. And so then the answer 5305 08:09:35,351 --> 08:09:40,852 is, remember the question was, what is the probability of me selecting a sample, or a 5306 08:09:40,852 --> 08:09:47,450 set, of 49 students with an x bar that's greater than 70? And remember how this real test really 5307 08:09:47,450 --> 08:09:48,450 sucked. I mean, 5308 08:09:48,450 --> 08:09:55,370 people, that mu was 65.5. So it was pretty hard to get a high score. So the probability 5309 08:09:55,370 --> 08:10:06,280 was pretty low, 0.0150, or if you do the percent version, 1.5%. Okay, now we're 5310 08:10:06,280 --> 08:10:11,860 going to try a different one. That one was asking, what is the probability of me selecting 5311 08:10:11,860 --> 08:10:17,440 a sample with an x bar greater than a certain number? Now we're going to talk about the 5312 08:10:17,440 --> 08:10:23,680 probability of selecting a sample with the x bar between two numbers, right?
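The greater-than-70 example just completed can be re-checked in a few lines. This is a sketch, with `statistics.NormalDist` standing in for the z table and z rounded to two decimals to mimic a table lookup; the variable names are assumptions. One detail worth noticing: the z of 2.17 comes from the unrounded standard error (14.5 divided by 7, about 2.07). Dividing by the rounded 2.1 would give about 2.14 instead.

```python
from math import sqrt
from statistics import NormalDist

mu, sigma = 65.5, 14.5   # population mean and standard deviation from the lecture
n, x_bar = 49, 70        # sample size and the cutoff for the sample mean

se = sigma / sqrt(n)             # standard error, computed first and kept unrounded
z = round((x_bar - mu) / se, 2)  # two decimals, as for a z table lookup

# Area to the right of z. By symmetry this equals the table value for the opposite z.
p_greater = 1 - NormalDist().cdf(z)
p_opposite_z = NormalDist().cdf(-z)

print(f"SE = {se:.2f}, z = {z:.2f}")
print(f"P(x bar > {x_bar}) = {p_greater:.4f}  (opposite-z lookup: {p_opposite_z:.4f})")
```

Both routes give the same little piece at the top, which is why looking up negative 2.17 works.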
So again, 5313 08:10:23,680 --> 08:10:30,140 we're back with our poor student class with this terrible chemistry test. This time 5314 08:10:30,140 --> 08:10:35,150 I decided to choose an n of 36. You'll notice that I always choose perfect squares 5315 08:10:35,150 --> 08:10:41,372 for n's, because you have to take the square root, and I'm just lazy. So okay, here's our 5316 08:10:41,372 --> 08:10:48,010 question: what is the probability of me selecting a sample of 36 students with an x bar between 5317 08:10:48,010 --> 08:10:55,070 60 and 65? And I just drew this picture up here to remind you that that's gonna be on 5318 08:10:55,070 --> 08:11:00,710 the left side of mu, you know, we're going to be dealing with negative z's, right. And 5319 08:11:00,710 --> 08:11:07,048 so we have to remember when we would have two x's, back in 7.2 and 7.3. Well, this 5320 08:11:07,048 --> 08:11:12,250 is now a situation where we have two x bars, so you just got to name them x bar one and 5321 08:11:12,250 --> 08:11:18,250 x bar two. And, again, I show you this demonstration, you know, these red arrows, but the probability 5322 08:11:18,250 --> 08:11:21,960 for x bar will be different than for x, because a whole group of people has to come 5323 08:11:21,960 --> 08:11:28,650 together to give you an x bar, and x bars bunch up closer to mu than single x's do. Alright, so this is not new, 5324 08:11:28,650 --> 08:11:33,530 these are the same formulas I showed you before. I just want to emphasize that making your 5325 08:11:33,530 --> 08:11:39,140 standard error first can really help you as you move along through these problems, 5326 08:11:39,140 --> 08:11:42,400 it just makes it a little easier to calculate, especially in this case, where we're going 5327 08:11:42,400 --> 08:11:50,740 to use the standard error twice. So again, what we do is we take, this would look exactly 5328 08:11:50,740 --> 08:11:55,610 like the last standard error, but it's different because our n is different.
So this time, 5329 08:11:55,610 --> 08:12:01,950 our standard error comes out as 2.4. And what I just want to remind you is that the more 5330 08:12:01,950 --> 08:12:08,260 n you get, the bigger that square root of n gets, I mean, n gets bigger, the square 5331 08:12:08,260 --> 08:12:14,490 root of n gets bigger. And then the smaller the standard error gets. So you can 5332 08:12:14,490 --> 08:12:22,810 make the standard error really small if you just get a lot of n, right. So here's z one 5333 08:12:22,810 --> 08:12:27,420 and z two, I put them both up there. But we can just walk through this, you know, x bar 5334 08:12:27,420 --> 08:12:34,250 one is 60, and x bar two is 65, because it's between 60 and 65. So you see that, um, you 5335 08:12:34,250 --> 08:12:38,750 see what's going on in the slide. And like I told you, you know, both of these 5336 08:12:38,750 --> 08:12:43,890 x bars are below the mu, so they both give negative z's. And so we've got our negative 5337 08:12:43,890 --> 08:12:50,430 z's. And now we have to just remind ourselves, well, what are we doing, right? And so you 5338 08:12:50,430 --> 08:12:56,840 see, z one is at negative 2.28. So that's a little piece at the bottom we're going 5339 08:12:56,840 --> 08:13:03,520 to want to trim off. And then the big piece at the top for z two, that starts at negative 5340 08:13:03,520 --> 08:13:09,520 0.21. So just remember, the picture is really helpful. So now we're going 5341 08:13:09,520 --> 08:13:14,612 to go deal with the probabilities, right? So for z one, we're looking at something to 5342 08:13:14,612 --> 08:13:22,150 the left, so we just leave the z alone and go look it up. And that's p equals 0.0113.
5343 08:13:22,150 --> 08:13:26,730 For z two, we got to flip the sign, because we have to use the opposite z, because we're 5344 08:13:26,730 --> 08:13:31,350 going for the right. So that was probability two, and we can check that, see, because we 5345 08:13:31,350 --> 08:13:37,720 can see that's more than 50% of that shape. So it's 0.5832. Okay, so we got our probabilities 5346 08:13:37,720 --> 08:13:44,298 now. And just like last time, we got to take one minus both of those pieces, right? 5347 08:13:44,298 --> 08:13:48,540 And then we get the probability in the middle. And that's the probability of drawing a sample 5348 08:13:48,540 --> 08:13:56,300 of 36 students with an x bar between 60 and 65. And just to translate that to the answer, 5349 08:13:56,300 --> 08:14:04,650 the probability is 0.4055. Or if you rounded it, you know, when you like percents, you 5350 08:14:04,650 --> 08:14:12,610 could say 41%. So in conclusion, we reviewed the parameters, and the statistics, and those 5351 08:14:12,610 --> 08:14:17,310 notations. And we talked about inferences and what we're doing with inference. Next, 5352 08:14:17,310 --> 08:14:20,540 we talked about what a sampling distribution is, and how that's different from 5353 08:14:20,540 --> 08:14:25,160 a frequency distribution. So you can tell, you know, what's going on with that. Then I 5354 08:14:25,160 --> 08:14:29,290 presented to you the central limit theorem, which may have been kind of confusing, because 5355 08:14:29,290 --> 08:14:33,650 you know, theorems always are, they're always about different principles and about different 5356 08:14:33,650 --> 08:14:39,240 things equaling each other. But because of the central limit theorem, we then have permission 5357 08:14:39,240 --> 08:14:44,220 to do the operations we're doing after that, which is finding probabilities regarding x 5358 08:14:44,220 --> 08:14:48,900 bar. The central limit theorem says that, you know, this is how the world works.
So 5359 08:14:48,900 --> 08:14:53,170 you get to use the standard error, and you get to do these kinds of calculations. So 5360 08:14:53,170 --> 08:14:59,550 now, you know how to, in addition to finding probabilities regarding x, find probabilities 5361 08:14:59,550 --> 08:15:02,270 regarding x bar. Don't you feel smart?
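As a closing check, the between-60-and-65 example from this lecture can be reproduced the same way. This is a sketch under the same assumptions as the lecture's arithmetic: `statistics.NormalDist` replaces the z table, both z scores are rounded to two decimals as for a table lookup, and the variable names are invented for illustration.

```python
from math import sqrt
from statistics import NormalDist

mu, sigma, n = 65.5, 14.5, 36    # population values and sample size from the lecture
x_bar_1, x_bar_2 = 60, 65        # the two sample-mean cutoffs

se = sigma / sqrt(n)             # standard error, made once and used twice
z1 = round((x_bar_1 - mu) / se, 2)
z2 = round((x_bar_2 - mu) / se, 2)

standard_normal = NormalDist()
p_below_z1 = standard_normal.cdf(z1)      # little piece to the left of z one
p_above_z2 = 1 - standard_normal.cdf(z2)  # big piece to the right of z two

# One minus both pieces leaves the probability in the middle.
p_between = 1 - p_below_z1 - p_above_z2

print(f"SE = {se:.2f}, z1 = {z1:.2f}, z2 = {z2:.2f}")
print(f"left piece = {p_below_z1:.4f}, right piece = {p_above_z2:.4f}")
print(f"P({x_bar_1} < x bar < {x_bar_2}) = {p_between:.4f}")
```

Subtracting the two left-tail areas directly, the area left of z two minus the area left of z one, gives the same middle piece.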