1
00:00:00,990 --> 00:00:03,210
Speaker: So far, most of what we've seen
2
00:00:03,210 --> 00:00:05,580
were examples of numerical variables,
3
00:00:05,580 --> 00:00:09,633
exchange rates, trading volume, security prices, and so on.
4
00:00:11,040 --> 00:00:14,610
Often though, we must deal with categorical data.
5
00:00:14,610 --> 00:00:18,510
In short, categorical data refers to groups or categories,
6
00:00:18,510 --> 00:00:21,330
such as our cat and dog examples,
7
00:00:21,330 --> 00:00:23,220
but the machine learning algorithm
8
00:00:23,220 --> 00:00:26,280
takes only numbers as values, doesn't it?
9
00:00:26,280 --> 00:00:29,520
Therefore, the question when working with categorical data
10
00:00:29,520 --> 00:00:33,150
is how to convert a cat category into a number
11
00:00:33,150 --> 00:00:36,633
so we can input it into a model or output it in the end.
12
00:00:38,100 --> 00:00:41,130
Obviously, a different number should be associated
13
00:00:41,130 --> 00:00:42,990
with each category, right?
14
00:00:42,990 --> 00:00:46,563
Or better, a tensor. We are getting closer.
15
00:00:48,150 --> 00:00:50,520
Imagine our shop has three products,
16
00:00:50,520 --> 00:00:53,070
bread, yogurt, and muffins.
17
00:00:53,070 --> 00:00:56,520
Now, how do we convert these categories to numbers?
18
00:00:56,520 --> 00:01:00,090
A possible solution could be to enumerate them like this.
19
00:01:00,090 --> 00:01:04,712
Bread equals one, yogurt equals two, muffins equals three.
20
00:01:06,150 --> 00:01:09,510
Unfortunately, this implies there is some order.
21
00:01:09,510 --> 00:01:12,330
It's like saying that a muffin is more than a yogurt,
22
00:01:12,330 --> 00:01:13,833
which is more than bread.
23
00:01:15,390 --> 00:01:17,160
Think about prices.
24
00:01:17,160 --> 00:01:21,603
If we instead had three prices, $1, $2, and $3,
25
00:01:22,530 --> 00:01:24,933
three times $1 is equal to $3.
26
00:01:25,830 --> 00:01:28,920
Using the same logic, does it make any sense to you
27
00:01:28,920 --> 00:01:31,503
that three times bread equals one muffin?
28
00:01:32,400 --> 00:01:34,860
There is another level of ambiguity.
29
00:01:34,860 --> 00:01:36,540
To get from bread to muffins,
30
00:01:36,540 --> 00:01:38,253
we always go through yogurt.
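The problem with enumeration can be made concrete with a minimal sketch. It assumes the codes from the lesson (bread = 1, yogurt = 2, muffins = 3); the names `codes` and `code_distance` are hypothetical and chosen only for illustration.

```python
# Integer codes taken from the enumeration in the lesson.
codes = {"bread": 1, "yogurt": 2, "muffins": 3}

def code_distance(a, b):
    """Absolute difference between the integer codes of two products."""
    return abs(codes[a] - codes[b])

# Arithmetic that is valid for numbers but meaningless for products.
three_breads = 3 * codes["bread"]
print(three_breads == codes["muffins"])   # True: "three times bread equals one muffin"

# The codes also impose distances: bread is "closer" to yogurt than to muffins.
print(code_distance("bread", "yogurt"))   # 1
print(code_distance("bread", "muffins"))  # 2
```

A model that reads these codes as ordinary numbers has no way to know that the ordering and the distances are accidental.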
31
00:01:39,780 --> 00:01:42,780
Ultimately, what we have done is assume the data
32
00:01:42,780 --> 00:01:44,790
has some order when it doesn't.
33
00:01:44,790 --> 00:01:46,500
Typically that's an issue
34
00:01:46,500 --> 00:01:49,410
when our data is divided into categories.
35
00:01:49,410 --> 00:01:51,210
Think about the products in a shop,
36
00:01:51,210 --> 00:01:53,733
about different car brands or about people.
37
00:01:55,440 --> 00:01:59,490
So our question becomes how to encode such categories
38
00:01:59,490 --> 00:02:01,110
in a way which will be useful
39
00:02:01,110 --> 00:02:03,540
for a machine learning algorithm.
40
00:02:03,540 --> 00:02:05,760
Two main ways are adopted.
41
00:02:05,760 --> 00:02:08,250
The first one is called one-hot encoding
42
00:02:08,250 --> 00:02:10,650
and the other binary encoding.
43
00:02:10,650 --> 00:02:13,680
We will see how to perform them in the next lesson.
44
00:02:13,680 --> 00:02:14,823
Thanks for watching.
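As a rough preview of the two encodings named above, here is a minimal sketch in plain Python. It is not the method shown in the next lesson: the function names `one_hot` and `binary_code` are hypothetical, and numbering the categories from 1 in the binary version is an assumption made here so that no product maps to all zeros.

```python
products = ["bread", "yogurt", "muffins"]  # categories from the lesson

def one_hot(category, categories):
    """Vector with a 1 in the category's position and 0 everywhere else."""
    return [1 if c == category else 0 for c in categories]

def binary_code(category, categories):
    """Binary digits of the category's 1-based position, padded to a fixed width."""
    # Number of bits needed to write the largest position (assumption: start at 1).
    width = len(categories).bit_length()
    position = categories.index(category) + 1
    return [int(bit) for bit in format(position, f"0{width}b")]

for product in products:
    print(product, one_hot(product, products), binary_code(product, products))
# bread [1, 0, 0] [0, 1]
# yogurt [0, 1, 0] [1, 0]
# muffins [0, 0, 1] [1, 1]
```

One-hot encoding uses one column per category, so no product is "more" than another. Binary encoding needs fewer columns as the number of categories grows, at the cost of some categories sharing a 1 in the same column.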