2. Creating bins & Visualizing distributions

Today we're going to continue working in our data mining workbook, which contains the bank churn data set. However, we will start looking at our data from a bit of a different angle. Let's create a new tab, and this one we'll call... well, we won't call it anything yet.

What we want to visualize here is how our customers are distributed by age: are our customers predominantly older or younger, what age groups are they in, and so on. As you can see, we already have an Age variable here. So the first thing that comes to mind is: let's take Age and drag it into Columns. And now let's take Number of Records and drag it into Rows. And now we have one big dot over here at the top.

Why do we have one big dot? Because what happened is Tableau took the sum of Age of all of your customers and got just over 380,000 years, and then it took the sum of records, so the total number of records is 10,000. Before we continue, I'm just going to fix up the format here so we can see better. So you can see here just over 380,000 years and 10,000 customers.

So how do we avoid doing that? Well, in my course on Tableau I talk about aggregation and granularity.
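The single dot described above can be illustrated outside Tableau with a minimal Python sketch. The `ages` list is made-up toy data rather than the bank data set, and modelling a mark as a pair of totals is a simplifying assumption.

```python
# Hypothetical toy data: one age per customer (not the real bank data set).
ages = [36, 36, 29, 31, 42, 36, 29]

# Age used as an aggregated measure: every row collapses into one total,
# so plotting sum of age against sum of records leaves a single point.
sum_of_age = sum(ages)
number_of_records = len(ages)

single_mark = (sum_of_age, number_of_records)
print(single_mark)  # (239, 7)
```

However many customers there are, the two sums always produce exactly one pair of numbers, which is why the view shows one mark.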
We won't go into details about that yet; I'll just show you how to avoid this happening. If you go to SUM(Age) and just change this from a measure to a dimension, then what will happen is it's not aggregating Age anymore; it's using it as a dimension. And as you can see, you've got this nice shape of a curve. So basically it just tells you how many customers there are at every given age, and it's plotting it with a line.

So that's good, but the thing here is that this is a nice data set. I'm going to right-click and show you the data set. If you right-click the data set up here and click View Data, you will see the details of your data. So there we go, there's Age. Here you can see that this is a nice data set in the sense that age has been rounded for you. In a way that's not good, because you don't have the details of exactly how old the person is: is he 29 years, 29 and a half, or is he 29 and 11 months and a couple of days old? You don't have that decimal point, you can't tell exactly how old he is, and therefore you've lost some information. But on the other hand, you can't do anything about it.
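As a rough illustration of what using age as a dimension means, the sketch below counts records per distinct age in plain Python. The `ages` list is invented toy data, and `customers_per_age` is a name chosen only for this example.

```python
from collections import Counter

# Hypothetical toy ages, already rounded to whole years.
ages = [36, 36, 29, 31, 42, 36, 29]

# Age used as a dimension: one group per distinct age,
# with the number of records counted inside each group.
customers_per_age = Counter(ages)

for age in sorted(customers_per_age):
    print(age, customers_per_age[age])
```

Because the ages are whole numbers, customers of the same age fall into the same group, which is what makes the line chart readable.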
If this is how the data set was provided to you, then you've got to work with it, and you might as well just enjoy it, because that means it's a nice data set and you have one less thing to worry about: you don't have the decimal point to worry about.

So what does that say to us? Well, it says that in this case, because the ages are rounded, Tableau will be able to group customers. For instance, there's a person who's 36, and there's another person who's 36, and another, and a couple more people who are 36. Then you go to this chart (let me just close this) and you find 36, and you find that there are 456 people that have 36 in that column.

What you will find with other data sets, when age is not rounded for you, is that you'll have people who are 36 and a half, people who are 36.2 or 36.3, and that will cause this chart to create very strange patterns, because you might have a lot of people who are 36 and a half but very few people who are 36.6. So it will be very spiky, and you don't want that. Already you can see a spike over here. This is not normal, right? So 30, 29 and then 31: 348 people, 404 people on this side.
It just so happened; maybe these people are about to turn 30, or these people just turned 31, and therefore you have this spike. And you'll see a lot more of them if your data is, I would say, more proper, in the sense that it's more precise. So we have to find a way around that, and we have to find a way to visualize our distribution of age regardless of the way our data is presented to us.

This is where bins come in, and this is where we start thinking about converting our numeric variables to categorical variables, into dimensions. So let's start by having a look at that. I'm going to right-click Age, and what I want to do is group people into bands. For instance, five-year bands: anybody between 20 and 25 will go into one group, 25 to 30 will go into the next group, and so on. It's very easy to do in Tableau: all you have to do is right-click your variable, select Create, and then select Bins. Here you have Age (bin). We don't want the size to be 10; we want the size to be 5, so every five years there'll be a new bin. We'll click OK. And here you can see Age stayed as a measure, but now we have a dimension which is also age, but it's called Age (bin). And now what we'll do is take Age (bin) and drag it in to replace Age.
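The idea of a fixed-size bin can be sketched in a few lines of Python. This is an illustration under assumptions, not Tableau's internal implementation: `age_bin` is a hypothetical helper, and it assumes each bin is identified by its lower edge.

```python
def age_bin(age, size=5):
    """Return the lower edge of the fixed-width bin that contains `age`."""
    # Floor division drops the remainder, so 25 through 29 all map to 25.
    return (age // size) * size


for age in (20, 24, 25, 29, 30):
    print(age, "->", age_bin(age))
```

The same helper also works for unrounded ages such as 36.5, which is the reason bins remove the spikiness that precise ages would otherwise cause.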
Well, actually I need to take Age out and then put Age (bin) there, because one's a dimension and the other is a measure. What you can see right away is a nice distribution, just like the ones we are used to seeing in economics and in mathematics. What happened here is we've grouped our records by category, by these bins, and the bins are acting as categories, just like, say, gender or the country that people live in. Tableau is no longer recognizing these as numbers; these are categories. So people in this category are between 20 and 25, and people in this category are between 25 and 30. By the way, it includes the starting number, so starting from 25 up to the last number before 30, which in our case is 29. So 25 to 29, 30 to 34, 35 to 39 and so on, and right away you can see how many people there are in each of these categories.

So what else would we want to do here? Maybe let's give it a color so it looks a bit more pleasant. Let's take SUM(Number of Records), hold down Control and drag it onto Color. Personally I use green for money, so let's use a different color. How about we go with kind of a brownish color. Something like that; Apply. And maybe we want to add a label as well, so we don't have to visually check every single time where this is sitting. So we'll go and use SUM(Number of Records) as a label as well.
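To make the "bins act as categories" point concrete, here is a small Python sketch that counts records per five-year bin and prints a range label for each. The ages are invented toy values, and the "25-29" label format is an assumption made for readability.

```python
from collections import Counter

BIN_SIZE = 5  # five-year bands, as in the walkthrough

# Hypothetical toy ages (not the real bank data).
ages = [22, 25, 29, 30, 33, 34, 36, 36, 38, 41]

# Each bin is a category keyed by its lower edge; count the records in each.
records_per_bin = Counter((age // BIN_SIZE) * BIN_SIZE for age in ages)

for start in sorted(records_per_bin):
    # With whole-year ages, the bin starting at 25 covers 25 through 29.
    label = f"{start}-{start + BIN_SIZE - 1}"
    print(label, records_per_bin[start])
```

The lower edge is included in its bin and the upper edge belongs to the next one, so a 30-year-old lands in the 30-34 group rather than 25-29.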
And now you can see exactly how many people you have in each of these categories. This approach allows you to assess your data a bit more. So now you understand from this sample, which is quite a significant sample (10,000 people, so it's very representative of the overall population of this bank), that most people are aged 35 to 40, and quite a few people are between 30 and 40. So basically this is telling you what the demographic of the bank is like. There are, let's say, more younger people, say in their 30s, and most customers are represented by age groups between 30 and 45. So most of your customers are in those age groups.

In fact, if you want to see the percentages, you can easily convert this. You can just go here and add a table calculation like we did previously. You can even add a quick table calculation in this case, so you just need Percent of Total. And there you go, you have the exact percentages. That gives you 43, and 43 plus 16 gives you 59, or 59 plus a bit of a percentage, so that gives you about 60 percent. 60 percent of your customers are in these age bands, and for the rest there's like a long tail. So this is a right-skewed distribution.
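The percent-of-total step can be sketched in Python as follows. The counts are hypothetical toy numbers keyed by each bin's lower edge, not the figures from the bank data set.

```python
# Hypothetical record counts per five-year bin, keyed by lower edge.
records_per_bin = {20: 1, 25: 2, 30: 3, 35: 3, 40: 1}

total = sum(records_per_bin.values())

# Percent of total: each bin's count as a share of all records.
percent_of_total = {
    start: 100 * count / total for start, count in records_per_bin.items()
}

for start in sorted(percent_of_total):
    print(start, f"{percent_of_total[start]:.1f}%")
```

Adding up the shares of neighbouring bins, as done in the walkthrough, is then just a sum over the selected keys.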
So that is a thing to remember as well, maybe another tip: skewed refers to where the tail is going. If the tail's going to the right, then it's right-skewed. This is not left-skewed; a lot of people get it wrong. Left-skewed would be if the tail was going to the left. So that's a statistical term to keep in mind: it's a right-skewed distribution, it's got a long tail to the right, and you even have customers at 90 years old.

And so what are we going to do now? That's pretty much what we wanted to do with this chart. I'll just say one quick thing: you can change the bins very easily. You can just go to Age (bin) and then go to Edit, and here let's say we want a 10-year band. See, right away it's changed, and now you can see that you've got most customers between 30 and 40, and then 40 to 50 you have 26 and so on. So you can control that. And that's all we wanted to do here. The last thing you might want to do is change this axis, because it's not consistent with your percentages; you might want to change it to a percentage. But that's details; we're just investigating, so we won't be doing it right now.

And that's all for today. I look forward to seeing you next time, and until then, happy analyzing.