We've gone through Step 1, problem definition. We've gone through Step 2, data, and Step 3, where we defined what success means for us. Now let's get on to Step 4, which is features. Here, the question we're trying to answer is: what do we already know about the data?
Now, if you haven't worked with data before, you might hear this word features and be wondering what features means. Well, you'll hear this word come up a lot in machine learning, maybe in the form of feature learning or feature variables, or when someone asks how many features there are, or what kind of features there are.
Features is another word for different forms of data. Now, we've already discussed different kinds of data, such as structured and unstructured, but features refers to the different forms of data within structured or unstructured data.
For example, let's go back to our predicting heart disease problem. We might want to see if things such as a person's body weight, their sex, their average resting heart rate and their chest pain rating can be used to predict if they have heart disease or not. These four things, a patient's body weight, sex, average resting heart rate and chest pain rating, are features of the data. They could also be referred to as feature variables.
In other words, we want to use the feature variables to predict the target variable, which is whether or not a person has heart disease.
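To make this split concrete in code, here's a minimal sketch using pandas. The values and column names are assumptions for illustration, not the course's actual dataset:

```python
import pandas as pd

# Hypothetical patient records (all values invented for illustration).
heart_disease = pd.DataFrame({
    "weight": [81.2, 62.5, 95.0],          # numerical feature
    "sex": ["male", "female", "male"],     # categorical feature
    "resting_heart_rate": [72, 65, 88],    # numerical feature
    "chest_pain_rating": [0, 2, 3],        # categorical feature
    "heart_disease": ["no", "no", "yes"],  # target variable
})

# Feature variables: every column except the target.
X = heart_disease.drop("heart_disease", axis=1)

# Target variable: the thing we want to predict.
y = heart_disease["heart_disease"]
```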
Now, when it comes to feature variables, again, there are different kinds. You've got numerical, which means a number, like body weight. There's categorical, which means one thing or another, like sex, or whether a patient is a smoker or not. And then there's derived, which is when someone like yourself looks at the data and creates a new feature using the existing ones.
For example, you might look at someone's hospital visit history timestamps, and if they've had a visit in the last year, you could make a categorical feature called visited in last year. If someone had visited in the last year, they would get true, or in our case, yes. If not, they would get false, or in this case, no. The process of deriving features like this out of data is often referred to as feature engineering.
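As a sketch of what deriving that feature might look like in pandas (the column names, timestamps and cutoff logic here are assumptions, just to show the idea):

```python
import pandas as pd

# Hypothetical visit history: one timestamp per patient's most recent visit.
visits = pd.DataFrame({
    "patient_id": [1, 2, 3],
    "last_visit": pd.to_datetime(["2023-11-02", "2019-06-14", "2023-01-30"]),
})

# Derive a yes/no categorical feature from the existing timestamp column.
one_year_ago = pd.Timestamp.now() - pd.DateOffset(years=1)
visits["visited_in_last_year"] = (
    visits["last_visit"] > one_year_ago
).map({True: "yes", False: "no"})
```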
Our heart disease example is structured, but unstructured data has features too; they're just a little less obvious. If you looked at enough images of dogs, you'd start to figure out, OK, most of these creatures have four shapes coming out of their body, their legs, and a couple of circles up the front, their eyes. As a machine learning algorithm looks at different images, it would start to learn these different shapes, and much more, and figure out how different pictures are similar or different to each other.
Don't worry, when it comes to figuring out the different patterns between features, such as the four rectangle-like shapes coming out of a dog's body, or the circles at the front of the dog's head, you don't have to tell the machine learning algorithm what they are. The beautiful thing is, it learns them on its own.
The final thing to remember is that a feature works best within a machine learning algorithm if many of the samples have it. For our predicting heart disease problem, say we had a feature called most eaten foods, which had a list of the foods the patient ate most often, but only 10 percent, or 10 out of 100, patient records had it.
So this one here, for patient ID 4328, has most eaten foods, which is fries. Not ideal. And these other patients don't have it, because remember, only 10 out of 100 examples have the most eaten foods data, and so the rest are just missing. So if you can imagine there's 100 patients here, only 10 of them will have this most eaten foods column filled.
Since a machine learning algorithm learns best when all samples have similar information, we'd have to leave this one out, or try to collect more information before using it. The process of ensuring all samples have similar information is called feature coverage. In an ideal dataset, you've got complete feature coverage.
So for us to be able to use this most eaten foods feature, ideally we'd want all values here, or at least more than 10 percent coverage, which means that over 10 percent, or over 10 in 100 examples, have some sort of value in this column.
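One quick way to check feature coverage in practice is to look at the proportion of non-missing values in each column. A minimal pandas sketch, with made-up records:

```python
import numpy as np
import pandas as pd

# Hypothetical records: only 1 of 4 patients has most_eaten_foods filled.
records = pd.DataFrame({
    "weight": [81.2, 62.5, 95.0, 70.1],
    "most_eaten_foods": ["fries", np.nan, np.nan, np.nan],
})

# Fraction of samples with a value per column (1.0 means complete coverage).
coverage = records.notna().mean()
print(coverage)
# weight              1.00
# most_eaten_foods    0.25
```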
We'll have plenty of practice looking at different features in coming lectures, projects and lessons. In the meantime, think about a problem you had to solve recently. What features went into it? Were they numerical or categorical? Or did you combine them into your own derived feature?