1
00:00:00,360 --> 00:00:01,400
Instructor: Hey, welcome back
2
00:00:01,400 --> 00:00:03,930
to this market segmentation problem.
3
00:00:03,930 --> 00:00:06,363
We were just investigating an elbow.
4
00:00:07,290 --> 00:00:09,450
There isn't a clear tip of the elbow.
5
00:00:09,450 --> 00:00:12,870
I can see three or four tips which are worth trying,
6
00:00:12,870 --> 00:00:16,590
at two, three, four, and five clusters.
7
00:00:16,590 --> 00:00:19,260
Here's the limitation of the elbow method.
8
00:00:19,260 --> 00:00:22,470
We can see the change in WCSS with the increase
9
00:00:22,470 --> 00:00:23,880
in the number of clusters,
10
00:00:23,880 --> 00:00:26,883
but we don't really know which solution is the best one.
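The elbow plot mentioned here is built from one WCSS value per candidate number of clusters. A minimal sketch, assuming scikit-learn's `KMeans` (whose `inertia_` attribute is the within-cluster sum of squares) and using randomly generated stand-in data, since the lesson's dataset is not part of this transcript. The range of 1 to 9 clusters is also an assumption:

```python
import numpy as np
from sklearn.cluster import KMeans

# Hypothetical stand-in for the standardized features; not the lesson's data.
rng = np.random.default_rng(42)
x_scaled = rng.normal(size=(100, 2))

# Fit one model per candidate number of clusters and record its WCSS.
cluster_counts = range(1, 10)
wcss = []
for k in cluster_counts:
    model = KMeans(k, n_init=10, random_state=0)
    model.fit(x_scaled)
    wcss.append(model.inertia_)  # within-cluster sum of squares

# Print the values that would be plotted against the number of clusters.
for k, value in zip(cluster_counts, wcss):
    print(k, round(value, 2))
```

The numbers show how WCSS changes with each added cluster, but they do not say which solution is best, which is the limitation described above.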
11
00:00:28,830 --> 00:00:32,460
All right, let's try two clusters.
12
00:00:32,460 --> 00:00:34,170
We already discussed qualitatively
13
00:00:34,170 --> 00:00:36,510
that this is probably a suboptimal solution,
14
00:00:36,510 --> 00:00:38,760
but it is worth inspecting the difference
15
00:00:38,760 --> 00:00:40,950
with standardized variables.
16
00:00:40,950 --> 00:00:43,450
We'll declare a variable called kmeans_new
17
00:00:45,690 --> 00:00:48,150
equal to KMeans of two.
18
00:00:48,150 --> 00:00:50,823
Next, we will fit the x_scaled data.
19
00:00:51,870 --> 00:00:54,270
Finally, we will create a new data frame
20
00:00:54,270 --> 00:00:59,270
called clusters_new, containing the values from x.
21
00:00:59,730 --> 00:01:01,410
Then the column cluster_pred
22
00:01:01,410 --> 00:01:03,240
will contain the predicted clusters
23
00:01:03,240 --> 00:01:06,303
from this new clustering solution with the scaled x.
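The steps just described can be sketched as follows. This is an illustration under assumptions, not the instructor's exact notebook: the data, the column names `Satisfaction` and `Loyalty`, and the use of `preprocessing.scale`, `n_init`, and `random_state` are all made up for the example. Only `kmeans_new`, `x_scaled`, `clusters_new`, and `cluster_pred` come from the lesson.

```python
import pandas as pd
from sklearn import preprocessing
from sklearn.cluster import KMeans

# Hypothetical stand-in data; the lesson's real dataset is not shown here.
x = pd.DataFrame({
    "Satisfaction": [2, 3, 1, 9, 8, 10, 4, 7],
    "Loyalty": [-1.5, -1.2, -0.9, 1.1, 0.8, 1.4, -0.7, 0.9],
})

# Standardize each column to mean 0 and unit variance.
x_scaled = preprocessing.scale(x)

# Two clusters, fitted on the standardized values.
kmeans_new = KMeans(2, n_init=10, random_state=0)
kmeans_new.fit(x_scaled)

# Keep the original values, but attach the labels from the standardized solution.
clusters_new = x.copy()
clusters_new["cluster_pred"] = kmeans_new.predict(x_scaled)

print(clusters_new)
```

The data frame holds the unscaled values, so any plot made from it keeps the original axes, while `cluster_pred` reflects the clustering done on the scaled data.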
24
00:01:07,680 --> 00:01:09,753
Here is a crucial moment.
25
00:01:10,830 --> 00:01:13,800
The data frame contains the original values,
26
00:01:13,800 --> 00:01:15,390
but the predicted clusters are based
27
00:01:15,390 --> 00:01:18,330
on the solution using the standardized data.
28
00:01:18,330 --> 00:01:20,700
This is very important.
29
00:01:20,700 --> 00:01:23,220
We will plot the data without standardizing it,
30
00:01:23,220 --> 00:01:26,163
but the solution itself is the standardized one.
31
00:01:27,450 --> 00:01:30,150
Let me show you by plotting the data.
32
00:01:30,150 --> 00:01:33,780
By keeping the original x-axis, we keep the intuition
33
00:01:33,780 --> 00:01:36,540
of how satisfied the customers were.
34
00:01:36,540 --> 00:01:40,230
If we plot the standardized values, we would be deceived.
35
00:01:40,230 --> 00:01:43,260
The middle parts of the two graphs are different.
36
00:01:43,260 --> 00:01:46,500
This one is 5.5, and on the standardized graph,
37
00:01:46,500 --> 00:01:49,020
the midpoint zero actually corresponds
38
00:01:49,020 --> 00:01:51,933
to the mean of the variable, or 6.4.
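A small sketch of why zero on the standardized axis is the mean and not the middle of the original scale. The scores below are hypothetical and are not the lesson's data, so the mean differs from the 6.4 quoted above. The helper names `standardize` and `unstandardize` are made up for this example.

```python
from statistics import fmean, pstdev

# Hypothetical satisfaction scores; not the lesson's data.
satisfaction = [2, 4, 5, 6, 7, 8, 9, 10]

mean = fmean(satisfaction)
std = pstdev(satisfaction)

def standardize(value):
    """Map an original score to its standardized (z-score) value."""
    return (value - mean) / std

def unstandardize(z):
    """Map a standardized value back to the original scale."""
    return z * std + mean

# Middle of the original axis range, as read off an unscaled plot.
axis_midpoint = (min(satisfaction) + max(satisfaction)) / 2

print("mean of the variable:", mean)
print("zero on the standardized axis maps back to:", unstandardize(0.0))
print("middle of the original axis, standardized:", standardize(axis_midpoint))
```

Whenever the mean is not the middle of the range, the middle of the original axis does not land on zero after standardizing, which is why reading satisfaction levels off the standardized plot would be misleading.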
39
00:01:53,010 --> 00:01:55,380
Let that sink in for a second.
40
00:01:55,380 --> 00:01:58,503
If you wish, you can rewind to be sure you got that right.
41
00:01:59,580 --> 00:02:01,803
We will now continue with the solution.
42
00:02:02,640 --> 00:02:04,350
We can see two clusters,
43
00:02:04,350 --> 00:02:06,690
as we specified the number to be two.
44
00:02:06,690 --> 00:02:08,550
No surprise here.
45
00:02:08,550 --> 00:02:12,000
What's different, though, is the clusters themselves.
46
00:02:12,000 --> 00:02:14,340
Comparing this result with the previous one,
47
00:02:14,340 --> 00:02:15,600
we can clearly see
48
00:02:15,600 --> 00:02:18,690
that both dimensions were taken into account.
49
00:02:18,690 --> 00:02:21,480
Moreover, these two clusters coincide
50
00:02:21,480 --> 00:02:23,370
with our initial speculations,
51
00:02:23,370 --> 00:02:27,270
that those two would be the result of k equals two.
52
00:02:27,270 --> 00:02:31,140
Okay, great, we are now much more confident
53
00:02:31,140 --> 00:02:34,440
that standardization is generally a good thing.
54
00:02:34,440 --> 00:02:37,710
However, the problem is not solved yet.
55
00:02:37,710 --> 00:02:40,410
This two-cluster solution does not make a whole lot
56
00:02:40,410 --> 00:02:44,130
of sense, as we discussed before, but it's a good start.
57
00:02:44,130 --> 00:02:45,990
Let's name the two clusters.
58
00:02:45,990 --> 00:02:49,800
One contains people with low loyalty and low satisfaction,
59
00:02:49,800 --> 00:02:53,073
so we can call these people alienated.
60
00:02:54,000 --> 00:02:57,690
By the way, naming your clusters is very important.
61
00:02:57,690 --> 00:03:01,050
In unsupervised learning, clustering included,
62
00:03:01,050 --> 00:03:02,820
the algorithm will do the magic,
63
00:03:02,820 --> 00:03:05,850
but then we step in to interpret the result.
64
00:03:05,850 --> 00:03:09,180
My feeling here is to call them the alienated cluster,
65
00:03:09,180 --> 00:03:11,820
as they are dissatisfied and not loyal.
66
00:03:11,820 --> 00:03:15,063
No wonder, it's unlikely they'll be back to our shop.
67
00:03:16,020 --> 00:03:18,690
As for the other cluster, it is so heterogeneous
68
00:03:18,690 --> 00:03:21,153
that I'd call it the everything else cluster.
69
00:03:22,260 --> 00:03:24,690
All right, let's get back to the elbow.
70
00:03:24,690 --> 00:03:28,593
Noteworthy tips of the elbow are also three, four, and five.
71
00:03:29,430 --> 00:03:31,620
I'll try them one after the other.
72
00:03:31,620 --> 00:03:33,330
With our well-parameterized code,
73
00:03:33,330 --> 00:03:36,450
we can just change the number of clusters in the first line,
74
00:03:36,450 --> 00:03:39,390
and rerunning the code would do the trick.
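One way to picture the parameterized code is a small helper that takes the number of clusters as an argument. This is a sketch under assumptions: the function name `cluster_labels`, the random stand-in data, and the `n_init` and `random_state` settings are not from the lesson.

```python
import numpy as np
from sklearn.cluster import KMeans

# Hypothetical stand-in for the standardized features; not the lesson's data.
rng = np.random.default_rng(0)
x_scaled = rng.normal(size=(60, 2))

def cluster_labels(data, n_clusters):
    """Fit k-means with the given number of clusters and return the labels."""
    model = KMeans(n_clusters, n_init=10, random_state=0)
    return model.fit_predict(data)

# Changing only the number of clusters reruns the whole solution.
for k in (2, 3, 4, 5):
    labels = cluster_labels(x_scaled, k)
    print(k, np.bincount(labels))  # number of points in each cluster
```

In the lesson the same effect is achieved by editing the number in the first line and rerunning the cell; the loop here just tries the candidate values one after the other.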
75
00:03:39,390 --> 00:03:42,240
Let's try with three clusters.
76
00:03:42,240 --> 00:03:43,890
That's the result.
77
00:03:43,890 --> 00:03:46,830
We have the alienated cluster once more.
78
00:03:46,830 --> 00:03:48,390
That's a good sign.
79
00:03:48,390 --> 00:03:49,770
It shows us that we were right
80
00:03:49,770 --> 00:03:52,020
in concluding that it is a cluster of its own,
81
00:03:52,020 --> 00:03:55,233
while the everything else cluster is now split into two.
82
00:03:56,880 --> 00:03:59,250
I'd call this group the supporters.
83
00:03:59,250 --> 00:04:01,050
They are not particularly happy
84
00:04:01,050 --> 00:04:03,810
with the shopping experience, but they like the brand
85
00:04:03,810 --> 00:04:05,760
and wanna keep coming back.
86
00:04:05,760 --> 00:04:07,890
Note that there are not that many of them.
87
00:04:07,890 --> 00:04:09,423
It is a small cluster.
88
00:04:10,860 --> 00:04:14,490
Finally, the third cluster is called, well,
89
00:04:14,490 --> 00:04:17,700
the all that's left cluster, I guess.
90
00:04:17,700 --> 00:04:20,853
We can't really name it as it is still very much mixed.
91
00:04:22,170 --> 00:04:23,820
What happens next?
92
00:04:23,820 --> 00:04:26,343
Let's check out a four-cluster solution.
93
00:04:31,800 --> 00:04:35,280
We have the alienated and the supporters clusters,
94
00:04:35,280 --> 00:04:38,973
and now these two new ones can also be named, finally.
95
00:04:40,290 --> 00:04:42,180
The upper right one consists of clients
96
00:04:42,180 --> 00:04:44,340
that are satisfied and loyal.
97
00:04:44,340 --> 00:04:47,520
These are our fans, the core customers.
98
00:04:47,520 --> 00:04:49,230
Eventually, we hope that all the points
99
00:04:49,230 --> 00:04:51,120
on this graph turn into fans,
100
00:04:51,120 --> 00:04:54,210
but we will elaborate on this later.
101
00:04:54,210 --> 00:04:56,490
Let's name the last cluster.
102
00:04:56,490 --> 00:04:58,950
We have people who are predominantly satisfied
103
00:04:58,950 --> 00:05:02,400
but not loyal, and some of them are actually disloyal.
104
00:05:02,400 --> 00:05:03,840
A term I've seen somewhere
105
00:05:03,840 --> 00:05:07,710
to describe such customers is roamers.
106
00:05:07,710 --> 00:05:11,370
They like your brand, but they are not very loyal to it.
107
00:05:11,370 --> 00:05:13,653
We have all been there for some brand.
108
00:05:14,520 --> 00:05:17,460
Okay, this solution is definitely the best one
109
00:05:17,460 --> 00:05:18,723
we've seen so far.
110
00:05:21,210 --> 00:05:24,030
Here's where it stood on the elbow graph,
111
00:05:24,030 --> 00:05:26,583
but how about we try with five clusters?
112
00:05:31,170 --> 00:05:33,180
The alienated, the supporters,
113
00:05:33,180 --> 00:05:35,640
and the fans remain unchanged.
114
00:05:35,640 --> 00:05:38,970
These people here look like the roamers from before.
115
00:05:38,970 --> 00:05:41,520
Finally, these clients are almost in the middle
116
00:05:41,520 --> 00:05:43,320
of our standardized graph.
117
00:05:43,320 --> 00:05:45,420
They are almost neutral on the loyalty feature
118
00:05:45,420 --> 00:05:47,400
but are generally satisfied.
119
00:05:47,400 --> 00:05:49,200
They are also roamers.
120
00:05:49,200 --> 00:05:51,570
This solution actually split the roamers
121
00:05:51,570 --> 00:05:55,650
into two subclusters, those that are extremely satisfied
122
00:05:55,650 --> 00:05:57,780
and those that are just satisfied,
123
00:05:57,780 --> 00:06:00,813
so there isn't much value added to our segmentation.
124
00:06:01,860 --> 00:06:04,410
We can carry on with as many clusters as we want,
125
00:06:04,410 --> 00:06:05,700
but from now on,
126
00:06:05,700 --> 00:06:09,450
we would just further segment the four core clusters.
127
00:06:09,450 --> 00:06:11,823
Let's finish off with nine clusters.
128
00:06:15,600 --> 00:06:17,670
Similar to what we had a second ago,
129
00:06:17,670 --> 00:06:20,340
many of the clusters were further segmented.
130
00:06:20,340 --> 00:06:22,740
It is extremely hard to name all of them,
131
00:06:22,740 --> 00:06:24,270
and even if we do,
132
00:06:24,270 --> 00:06:27,390
we will probably need to use a lot of adjectives.
133
00:06:27,390 --> 00:06:31,380
For instance, the alienated cluster is split into two,
134
00:06:31,380 --> 00:06:33,240
the very alienated cluster
135
00:06:33,240 --> 00:06:36,300
and the moderately alienated cluster.
136
00:06:36,300 --> 00:06:38,580
As you can imagine, there is not much to gain
137
00:06:38,580 --> 00:06:40,323
by using such a fragmented solution.
138
00:06:42,000 --> 00:06:44,880
In my mind, the four and five-cluster solutions
139
00:06:44,880 --> 00:06:46,560
were the best ones.
140
00:06:46,560 --> 00:06:49,773
Which one you want to use depends on the problem at hand.
141
00:06:50,670 --> 00:06:54,540
Okay, in the next lesson, we will see what we can do
142
00:06:54,540 --> 00:06:56,880
with this new information.
143
00:06:56,880 --> 00:06:57,880
Thanks for watching.