1
00:00:00,610 --> 00:00:01,890
Look at us go.
2
00:00:01,890 --> 00:00:04,650
We're moving through this framework at lightning pace.
3
00:00:04,650 --> 00:00:06,110
We've done Problem Definition.
4
00:00:06,150 --> 00:00:10,150
We've looked at data, we've decided on an evaluation metric.
5
00:00:10,170 --> 00:00:13,110
We've understood a few of the features we've got in our data.
6
00:00:13,110 --> 00:00:15,620
Now we're up to step five, which is modelling.
7
00:00:15,690 --> 00:00:17,640
Now there's a few parts to modelling.
8
00:00:17,640 --> 00:00:21,750
So we've broken this down into four different sections.
9
00:00:21,750 --> 00:00:23,800
And this is Section One.
10
00:00:23,910 --> 00:00:28,420
And this is probably the most important concept in machine learning: the three sets.
11
00:00:28,630 --> 00:00:35,730
And now, over the whole of modelling, we want to answer the question: based on our problem and data, what
12
00:00:35,730 --> 00:00:43,570
machine learning model should we use? Modelling can be broken down into three parts: choosing and training
13
00:00:43,570 --> 00:00:50,470
a model, tuning a model, and model comparison. Before we get into these, though,
14
00:00:50,680 --> 00:00:57,160
part one of modelling, and the most paramount topic to discuss in this whole course, is the
15
00:00:57,160 --> 00:01:00,550
most important concept in machine learning:
16
00:01:00,760 --> 00:01:07,510
the train, validation and test splits, commonly referred to as the three sets.
17
00:01:07,510 --> 00:01:13,840
Now, since you want to use machine learning models to gain insights from data and to predict the
18
00:01:13,840 --> 00:01:18,930
future, it's important to test how well they would do in the real world.
19
00:01:19,150 --> 00:01:26,740
To do this, you split your data into three different sets: a training set to train your model on, a validation
20
00:01:26,740 --> 00:01:36,600
set to tune your model on, and a test set to test and compare your different models. Why is this important?
21
00:01:36,600 --> 00:01:42,270
Think of it like this: when you're at university, you might study the course materials all through the
22
00:01:42,270 --> 00:01:48,870
semester. Then, before the final exam, you might see how you could improve your knowledge on a practice
23
00:01:48,870 --> 00:01:50,070
exam.
24
00:01:50,070 --> 00:01:57,270
After doing well on the practice exam, you're confident you'll do well on the final exam. Then you take
25
00:01:57,270 --> 00:01:58,490
the final exam,
26
00:01:58,500 --> 00:02:03,330
and although you've never seen some of the problems before, you're able to adapt the knowledge you've
27
00:02:03,330 --> 00:02:10,440
learned from the study materials to the slightly different but similar questions on the final exam.
28
00:02:10,620 --> 00:02:15,730
Because of this, you pass the final exam with great marks.
29
00:02:15,780 --> 00:02:23,760
This adaptation that you made from the course materials and practice exams to the final exam is referred
30
00:02:23,760 --> 00:02:30,540
to in machine learning as generalisation, or the ability of a machine learning model to perform well
31
00:02:30,600 --> 00:02:34,880
on data it hasn't seen before because of what it's learned
32
00:02:34,950 --> 00:02:43,970
on another dataset. Now, where might this go wrong? Well, if your professor accidentally sent out the final
33
00:02:43,970 --> 00:02:49,000
exam for everyone to practice on, then when it came time for the actual exam,
34
00:02:49,070 --> 00:02:52,780
everyone would have already seen it.
35
00:02:52,830 --> 00:02:58,000
Now, since people know what they should be expecting, they go through the exam,
36
00:02:58,090 --> 00:03:03,590
they answer all the questions with ease, and everyone ends up getting top marks.
37
00:03:03,610 --> 00:03:10,530
Now, top marks might appear good, but did the students really learn anything, or were they just expert
38
00:03:10,540 --> 00:03:17,500
memorization machines? For your machine learning models to be valuable at predicting something in the
39
00:03:17,500 --> 00:03:24,130
future, on unseen data, you'll want to avoid them becoming memorization machines.
40
00:03:24,130 --> 00:03:28,900
This is where training, validation and test splits come in.
41
00:03:28,900 --> 00:03:35,750
In our heart disease example, let's say there were 100 patients. You start off with 100.
42
00:03:35,800 --> 00:03:39,910
One way to create these splits is to shuffle these patients,
43
00:03:39,910 --> 00:03:45,440
then select 70 percent for training, which would be about 70
44
00:03:45,440 --> 00:03:46,560
patient records,
45
00:03:47,000 --> 00:03:54,110
and 15 percent for validation and 15 percent for testing. That gives 70 patients in the training
46
00:03:54,110 --> 00:03:54,820
set,
47
00:03:54,830 --> 00:04:00,250
15 patients in the validation split and 15 patients in the test split.
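The shuffle-and-split steps just described could be sketched in Python roughly as follows. This is only a minimal illustration, not the course's own code: the function name `three_way_split`, its parameters, the seed value, and the list of integers standing in for 100 patient records are all assumptions made for the example.

```python
import random


def three_way_split(records, train_frac=0.70, val_frac=0.15, seed=42):
    """Shuffle records, then cut them into train, validation and test lists."""
    shuffled = list(records)               # copy, so the caller's data is untouched
    random.Random(seed).shuffle(shuffled)  # seeded shuffle, so the split is repeatable
    n_train = round(len(shuffled) * train_frac)
    n_val = round(len(shuffled) * val_frac)
    train = shuffled[:n_train]
    val = shuffled[n_train:n_train + n_val]
    test = shuffled[n_train + n_val:]      # whatever remains goes to the test set
    return train, val, test


# Hypothetical stand-in for 100 patient records: just the IDs 0..99.
patients = list(range(100))
train, val, test = three_way_split(patients)
print(len(train), len(val), len(test))  # 70 15 15
```

Because each record lands in exactly one slice of the shuffled list, no patient appears in more than one set.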
48
00:04:00,260 --> 00:04:06,580
Now, the percentages of each of these may vary, but standard practice is usually around 70 to 80 percent
49
00:04:06,590 --> 00:04:07,640
for training,
50
00:04:07,640 --> 00:04:11,570
10 to 15 percent for validation and 10 to 15 percent for test.
51
00:04:11,630 --> 00:04:19,280
You may see in some examples that some data sets only get split into training and test,
52
00:04:19,280 --> 00:04:21,480
but that's a case-by-case scenario.
53
00:04:21,530 --> 00:04:27,030
Usually you'll have three different sets. Then, once you've got these splits,
54
00:04:27,030 --> 00:04:34,170
using a model you've chosen, you'd feed it the training data, or the information from these 70 patient
55
00:04:34,170 --> 00:04:35,310
records.
56
00:04:35,460 --> 00:04:41,550
And once your model has trained, you can check its results and see if you can improve them on the validation
57
00:04:41,550 --> 00:04:41,880
set.
58
00:04:42,180 --> 00:04:44,220
This is where you do model tuning.
59
00:04:44,220 --> 00:04:49,170
So even though your machine learning model has got one set of results on the patient records, you
60
00:04:49,170 --> 00:04:54,000
can actually improve them on the validation split, and we'll see this in a future lesson.
61
00:04:54,080 --> 00:04:58,360
The validation split is where you should be testing to see if you can improve.
62
00:04:59,160 --> 00:05:05,910
Finally, once you've improved your model, you can check the model's results, as well as any other models'
63
00:05:05,910 --> 00:05:12,420
results that you might have produced during experimentation, on the test set. What's important to remember
64
00:05:12,450 --> 00:05:19,020
is that all three of these sets are separate. During training, the model never sees the validation split
65
00:05:19,290 --> 00:05:20,520
or the test split.
66
00:05:20,700 --> 00:05:26,850
And during testing, you're doing it on the test split, not the training set. It's the same as when you
67
00:05:26,850 --> 00:05:33,180
were studying for your exam: if you saw the final exam whilst practicing, that would be cheating, and your
68
00:05:33,180 --> 00:05:37,500
final result wouldn't reflect how well you'd learned.
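To make the train, tune, then test workflow concrete, here is a small sketch in plain Python. Everything in it is hypothetical and chosen only for illustration: the synthetic one-feature data, the 10 percent label noise, the nearest-neighbour model, and the candidate values of `k`. It is not the model or the data used in the course.

```python
import random

rng = random.Random(0)


def make_record():
    """Hypothetical record: one feature x, label 1 when x > 0.5, with 10% label noise."""
    x = rng.random()
    label = int(x > 0.5)
    if rng.random() < 0.1:
        label = 1 - label  # flip the label to simulate noisy data
    return x, label


# 100 records generated in random order, so slicing gives a 70/15/15 split.
data = [make_record() for _ in range(100)]
train, val, test = data[:70], data[70:85], data[85:]


def knn_predict(train_set, x, k):
    """Predict by majority vote among the k training points nearest to x."""
    nearest = sorted(train_set, key=lambda rec: abs(rec[0] - x))[:k]
    votes = sum(label for _, label in nearest)
    return int(votes * 2 > k)


def accuracy(train_set, eval_set, k):
    """Fraction of eval_set records whose label is predicted correctly."""
    correct = sum(knn_predict(train_set, x, k) == y for x, y in eval_set)
    return correct / len(eval_set)


# Tuning: choose the k that scores best on the validation set only.
candidate_ks = [1, 3, 5, 7, 9]
best_k = max(candidate_ks, key=lambda k: accuracy(train, val, k))

# Final check: the test set is used once, after tuning has finished.
test_accuracy = accuracy(train, test, best_k)
print(best_k, test_accuracy)
```

With `k=1`, scoring the model on its own training set gives a perfect result, because each point's nearest neighbour is itself. That is the memorization effect from the exam analogy, and it is why the score that matters is the one on data the model never saw.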
69
00:05:37,610 --> 00:05:43,250
For now, think about the last time you went for a test. Did you practice beforehand?
70
00:05:43,250 --> 00:05:48,530
Was the practice you were doing helpful for the test? And when you're thinking about this, try and think
71
00:05:48,530 --> 00:05:55,740
of how this aligns with why it's important to not let a machine learning model see a test set, or test data
72
00:05:55,740 --> 00:05:57,710
samples, whilst it's training.