Would you like to inspect the original subtitles? These are the user uploaded subtitles that are being translated:
1
00:00:00,960 --> 00:00:06,840
Today we're going to continue working our data mining workbook which contains the bank churn data set
2
00:00:06,840 --> 00:00:07,160
.
3
00:00:07,200 --> 00:00:12,580
However we will start looking at a bit of a different rule.
4
00:00:12,630 --> 00:00:15,560
Subtle looking at our daughter from a bit of a different angle.
5
00:00:15,930 --> 00:00:23,280
Let's create a new work tap and this one will call and well we won't call it anything yet.
6
00:00:23,280 --> 00:00:30,690
What we want to visualize here is we want to understand how our customers are distributed by age so
7
00:00:31,020 --> 00:00:35,260
our old customers predominately older or younger what age groups are they in.
8
00:00:35,370 --> 00:00:36,240
And so on.
9
00:00:36,690 --> 00:00:39,510
And as you can see we already have an age variable here.
10
00:00:39,810 --> 00:00:43,900
So the first thing that comes to mind is let's say age and drag it into columns.
11
00:00:44,110 --> 00:00:47,240
And now let's take number of records and drag it into rows.
12
00:00:47,490 --> 00:00:52,860
And now we have one big dot over here at the top.
13
00:00:52,860 --> 00:00:54,100
Why do we have one big dog.
14
00:00:54,120 --> 00:01:04,260
Because well what happened is Tablo took some of Age of all of your customers and it got just over 380000
15
00:01:04,260 --> 00:01:09,840
years and then it took some more records so the toll on them are at it's 10000.
16
00:01:09,840 --> 00:01:14,400
Before we continue I'm just going to fix up the format here so we can see better.
17
00:01:15,330 --> 00:01:21,800
So you can see here three eighty thousand not for 90000 and 10000 customers.
18
00:01:21,810 --> 00:01:23,370
So how do we avoid doing that.
19
00:01:23,370 --> 00:01:30,690
Well in my course on Tablo I talk about aggregation and granularity.
20
00:01:30,690 --> 00:01:32,330
We won't go into details about that yet.
21
00:01:32,340 --> 00:01:34,880
I'll just show you how to avoid this happening.
22
00:01:34,890 --> 00:01:43,470
So if you go to some age and you just change this from a measure to a dimension then what will happen
23
00:01:43,470 --> 00:01:47,250
is it's not aggregating age anymore it's using it as a dimension.
24
00:01:47,280 --> 00:01:52,430
And as you can see you've got this nice shape of a curve.
25
00:01:52,680 --> 00:01:59,430
So basically it just tells you how many customers at every given age they are and it's uploading it
26
00:01:59,430 --> 00:02:00,890
with a line.
27
00:02:00,900 --> 00:02:07,170
So that's good but the problem here is that this is a nice data set so I'm going to Right-Click and
28
00:02:07,170 --> 00:02:12,180
I'll show you the data such if you Right-Click a data set up here and click your data and you go in
29
00:02:12,180 --> 00:02:15,110
here you will see the details of your data.
30
00:02:15,360 --> 00:02:16,930
So there we go.
31
00:02:16,930 --> 00:02:17,900
There's age.
32
00:02:18,150 --> 00:02:23,250
So here you can see that this is a nice data set in the sense that age has been around for you in a
33
00:02:23,250 --> 00:02:30,250
way it's not good because you don't have the details how exactly how old the person is is he for 29
34
00:02:30,250 --> 00:02:34,890
years between nine and a half or Z 29 and 11 months and a couple of days old.
35
00:02:34,890 --> 00:02:41,520
So you don't have that decimal point you can tell how exactly all the Is and therefore you've lost some
36
00:02:41,520 --> 00:02:44,090
information but on the other hand you can't do anything about it.
37
00:02:44,100 --> 00:02:49,170
If this is how the data said was provided to well then you've got to work on it and you might as well
38
00:02:49,170 --> 00:02:54,990
just enjoy it because that means it's a nice data set and you have one less thing to worry about you
39
00:02:54,990 --> 00:02:57,120
don't have the decimal point to worry about.
40
00:02:57,120 --> 00:02:59,320
So what does that say to us.
41
00:02:59,340 --> 00:03:06,230
Well that says to us that yes in this case because they're rounded you Tablo will be able to group customers
42
00:03:06,240 --> 00:03:10,860
for instance there's a person who's 36 and there is a person who's 36 and there's a person who is 36
43
00:03:11,220 --> 00:03:16,520
and there's a couple more people who are 36 and then you go to this chart and then you find 46.
44
00:03:16,550 --> 00:03:24,240
Let me just close this and you find 36 and you find that there's is 456 people that have thirty six
45
00:03:24,540 --> 00:03:26,470
in that column.
46
00:03:26,550 --> 00:03:33,090
What you will find is with other data sets when it's not rounded for you then you'll have people who
47
00:03:33,090 --> 00:03:34,170
are three six and a half.
48
00:03:34,170 --> 00:03:38,640
You'll have the people of thirty six point to thirty six point three and that will cause this charge
49
00:03:38,640 --> 00:03:45,030
to go to create very strange patterns because you might have a lot of people who have three six and
50
00:03:45,390 --> 00:03:51,600
a half but you might have very few people or three 6.6 so it will be very spiky and you don't want that
51
00:03:51,600 --> 00:03:53,340
in your body you can really see a spike over here.
52
00:03:53,340 --> 00:03:54,410
This is not normal right.
53
00:03:54,420 --> 00:04:00,820
So 30 29 and then 31 so 348 people foreign for people on this island.
54
00:04:00,870 --> 00:04:06,170
Just you know just so happened maybe these people are about to turn 30 or these people just turned 31
55
00:04:06,570 --> 00:04:10,280
and therefore you have this spike and you'll see a lot more of them.
56
00:04:10,290 --> 00:04:15,810
If your daughter is I would say more proper in the sense it's more precise.
57
00:04:15,960 --> 00:04:17,530
So you have to find a way around that.
58
00:04:17,550 --> 00:04:24,360
And we have to find a way to visualize our distribution of age regardless of the way our daughter is
59
00:04:24,360 --> 00:04:25,280
presented to us.
60
00:04:25,320 --> 00:04:32,760
And this is where we can use bins. This is where bin's come in and this is where we start thinking about
61
00:04:32,850 --> 00:04:37,970
converting our numeric variables to categorical variables in two dimensions.
62
00:04:38,160 --> 00:04:40,920
So let's start by having a look at that.
63
00:04:40,950 --> 00:04:46,820
I'm going to Right-Click age and what I want to do is I want to group people into bands.
64
00:04:46,950 --> 00:04:53,730
So I want to group people into for instance 5 year bans anybody between 20 and 25 will go into one group
65
00:04:53,730 --> 00:04:56,190
of 25 to 30 will go to the next group and so on.
66
00:04:56,430 --> 00:05:02,490
Well it's very easy to do it and Tablo all you have to go do is right click your variable and then here
67
00:05:02,550 --> 00:05:06,700
select Create and then select bins.
68
00:05:07,140 --> 00:05:12,460
And here you want an age been so we don't want size to be 10 one size to be five.
69
00:05:12,510 --> 00:05:17,470
So every five years there'll be a new bin will click OK.
70
00:05:17,850 --> 00:05:20,030
And here you can see Age state as a measure.
71
00:05:20,040 --> 00:05:24,160
But now we have a dimension which is also age but it's called Age been.
72
00:05:24,540 --> 00:05:29,190
And now what we'll do is we'll take age pin and will drag it in to replace age.
73
00:05:29,970 --> 00:05:35,460
Well actually I need to take age out and then put age and been there because one's a dimension that
74
00:05:35,460 --> 00:05:41,400
is a measure and what you can see right away is a nice distribution just like the ones we are used to
75
00:05:42,000 --> 00:05:46,130
seeing in economics and in mathematics because what happened here.
76
00:05:46,170 --> 00:05:53,730
We've grouped our records by category or by these bins and the bins are acting as categories so just
77
00:05:53,730 --> 00:06:01,440
like let's say gender or the country that people live in a similar way so it's not a recognisable is
78
00:06:01,440 --> 00:06:06,930
no longer recognising these as numbers these are categories so people in this category are between 20
79
00:06:06,930 --> 00:06:10,980
and 25 people in this category between 25 and 30 by the way.
80
00:06:10,980 --> 00:06:18,300
It includes the starting numbers so starting from 25 up to the last number before 30 in our case it's
81
00:06:18,300 --> 00:06:27,720
29 so 25 29 30 to 34 35 to 39 and so on and right away you can see how many people they are in each
82
00:06:27,720 --> 00:06:29,550
of these categories.
83
00:06:29,760 --> 00:06:31,850
So what else would we want to do here.
84
00:06:31,860 --> 00:06:37,700
Maybe let's give it a color so it looks a bit more pleasant.
85
00:06:38,040 --> 00:06:42,540
Let's take some number records hold down control and ragged on to color.
86
00:06:42,550 --> 00:06:45,140
Personally I use green for money.
87
00:06:45,190 --> 00:06:48,810
So let's use a different color.
88
00:06:48,840 --> 00:06:52,970
How about we go with kind of a brownish color.
89
00:06:53,590 --> 00:07:00,690
Something that apply and maybe we want to add a label as well so we don't have to visually check every
90
00:07:00,690 --> 00:07:03,590
single time where this is sitting.
91
00:07:03,750 --> 00:07:09,960
So we'll go and use some number of records as a label as well.
92
00:07:10,080 --> 00:07:11,650
And now you can see that.
93
00:07:11,880 --> 00:07:15,510
Exactly how many people you have in each of these categories.
94
00:07:15,540 --> 00:07:22,980
So and this approach allows you to assess your daughter a bit more so now you understand from this sample
95
00:07:22,980 --> 00:07:28,860
which is quite a significant sample 10000 people in the sample So it's very representative of the overall
96
00:07:28,860 --> 00:07:35,480
population of this bank and what you can see is that most people are age 35 to 40.
97
00:07:35,670 --> 00:07:42,060
Quite a few people between 30 to 40 so basically this is telling you what is the demographic of the
98
00:07:42,060 --> 00:07:43,290
bank like they are.
99
00:07:43,460 --> 00:07:53,760
Let's say more younger people say in their 30s that most customers are presented by age groups between
100
00:07:53,760 --> 00:07:55,590
30 and 45.
101
00:07:55,590 --> 00:07:59,550
So you have the most of your customers are in those age groups.
102
00:07:59,550 --> 00:08:05,100
And in fact if you want to see the percentages you can easily convert this so you can just go here and
103
00:08:05,250 --> 00:08:08,220
add table calculation like we did previously.
104
00:08:08,220 --> 00:08:11,040
You can even at a quick table calculation in this case.
105
00:08:11,040 --> 00:08:13,190
So you just need a percent of total.
106
00:08:13,380 --> 00:08:16,150
And there you go so you have the exact percentages.
107
00:08:16,170 --> 00:08:25,060
This gives you 43 plus 43 plus 16 gives you 59 or 59 plus.
108
00:08:25,200 --> 00:08:32,740
There's like a percentage so that gives you 60 percent 60 percent of your customers are in these H-back
109
00:08:33,030 --> 00:08:35,370
age bands and the rest there's like a long tail.
110
00:08:35,370 --> 00:08:39,440
So this is a right skewed distribution.
111
00:08:39,480 --> 00:08:46,560
So that is a thing to remember as well so maybe another tip that when that skewed means where the tail
112
00:08:46,560 --> 00:08:50,280
is going if the tells going to the right then it's a right skewed.
113
00:08:50,280 --> 00:08:55,260
This is not a left skewed a lot of people get it wrong left skewed would be if the tail was going to
114
00:08:55,260 --> 00:09:01,590
the left so that's a statistical term to keep in mind it's a rice food distribution got a long tail
115
00:09:01,590 --> 00:09:04,050
to the right you and have customers at 90 years old.
116
00:09:04,350 --> 00:09:07,390
And so what are we going to do now.
117
00:09:07,470 --> 00:09:11,560
That's that's pretty much what we want to do with this chart.
118
00:09:11,580 --> 00:09:15,480
I'll just say one quick thing you can change the bins very easily.
119
00:09:15,750 --> 00:09:20,800
You can just go to age and then you go to Edit.
120
00:09:21,060 --> 00:09:24,470
And here let's say we want a 10 year band.
121
00:09:24,810 --> 00:09:31,170
See right away it's changed and now you can see that you got most customers between 30 and 40 and then
122
00:09:31,170 --> 00:09:35,600
40 to 50 you have 26 and so rich controls at that.
123
00:09:35,910 --> 00:09:38,760
And that's all we wanted to do here.
124
00:09:38,760 --> 00:09:43,680
The last thing you might want to do is change this access because it's not consistent with your percentage
125
00:09:43,680 --> 00:09:48,890
you might want to change a percentage but that's details we're just investigating so we won't be doing
126
00:09:48,890 --> 00:09:49,560
it right now.
127
00:09:49,770 --> 00:09:51,600
And that's all for today.
128
00:09:51,600 --> 00:09:53,160
I look forward to seeing you next time.
129
00:09:53,160 --> 00:09:54,780
And until then happy analyzing
14202
Can't find what you're looking for?
Get subtitles in any language from opensubtitles.com, and translate them here.