8. Modelling - Splitting Data

Look at us go. We're moving through this framework at lightning pace. We've done problem definition, we've looked at data, we've decided on an evaluation metric, and we've understood a few of the features we've got in our data. Now we're up to step five, which is modelling.

Now, there are a few parts to modelling, so we've broken this down into four different sections. This is section one, and it covers probably the most important concept in machine learning: the three sets.

Over the whole of modelling, we want to answer the question: based on our problem and data, what machine learning model should we use? Modelling can be broken down into three parts: choosing and training a model, tuning a model, and model comparison.

Before we get into these, though, part one of modelling, and the most paramount topic to discuss in this whole course, is the most important concept in machine learning: the train, validation and test splits, commonly referred to as the three sets.

Since you want to be using machine learning models to gain insights on some data or to predict the future, it's important to test how well they would do in the real world.
To do this, you split your data into three different sets: a training set to train your model on, a validation set to tune your model on, and a test set to test and compare your different models.

Why is this important? Think of it like this. When you're at university, you might study the course materials all through the semester. Then, before the final exam, you might see how you could improve your knowledge on a practice exam. After doing well on the practice exam, you're confident you'll do well on the final exam. When you take the final exam, although there are some problems you've never seen before, you're able to adapt the knowledge you've learned from the study materials to the slightly different but similar questions on the final exam. Because of this, you pass the final exam with great marks.

This adaptation from the course materials and practice exams to the final exam is referred to in machine learning as generalisation: the ability of a machine learning model to perform well on data it hasn't seen before because of what it has learned on another dataset.

Now, where might this go wrong? Well, if your professor accidentally sent out the final exam for everyone to practice on, then when it came time for the actual exam, everyone would have already seen it. Since people know what to expect, they go through the exam.
They answer all the questions with ease, and everyone ends up getting top marks. Now, top marks might appear good, but did the students really learn anything, or were they just expert memorization machines? For your machine learning models to be valuable at predicting something in the future on unseen data, you'll want to avoid them becoming memorization machines. This is where training, validation and test splits come in.

In our heart disease example, let's say there were 100 patients, so you start off with 100. One way to create these splits is to shuffle these patients, then select 70 percent for training, 15 percent for validation and 15 percent for testing. That would mean 70 patient records in the training set, 15 in the validation split and 15 in the test split.

The percentages for each of these may vary, but standard practice is usually around 70 to 80 percent for training, 10 to 15 percent for validation and 10 to 15 percent for testing. You may see in some examples that some datasets only get split into training and test sets, but that's a case-by-case scenario. Usually you'll have three different sets.

Then, once you've got these splits, using a model you've chosen, you'd feed it the training data, that is, the information from these 70 patient records.
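As a rough illustration of the shuffle-then-split step described above, here is a minimal Python sketch. It assumes the 100 patient records can be stood in for by the integers 0 to 99. The function name `three_way_split`, its parameters and the seed value are made up for this example and do not come from the course.

```python
import random

def three_way_split(records, train_frac=0.70, val_frac=0.15, seed=42):
    """Shuffle records, then cut them into training, validation and test sets."""
    shuffled = list(records)               # copy so the caller's list is untouched
    random.Random(seed).shuffle(shuffled)  # seeded so the split is repeatable
    n_train = round(len(shuffled) * train_frac)
    n_val = round(len(shuffled) * val_frac)
    train = shuffled[:n_train]
    val = shuffled[n_train:n_train + n_val]
    test = shuffled[n_train + n_val:]      # whatever is left over
    return train, val, test

# Hypothetical stand-ins for 100 patient records.
patient_ids = list(range(100))
train_set, val_set, test_set = three_way_split(patient_ids)
print(len(train_set), len(val_set), len(test_set))  # 70 15 15
```

Because the three lists are slices of one shuffled copy, no record can land in more than one set, which is the separation the lesson asks for.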
Once your model has trained, you can check its results and see if you can improve them on the validation set. This is where you do model tuning. Even though your machine learning model has got one set of results on the patient records, you can actually improve them on the validation split, and we'll see this in a future lesson. The validation split is where you should be testing to see if you can improve.

Finally, once you've improved your model, you can check its results, as well as the results of any other models you might have built during experimentation, on the test set.

What's important to remember is that all three of these sets are separate. During training, the model never sees the validation split or the test split, and during testing you're doing it on the test split, not the training set. It's the same as when you were studying for your exam: if you saw the final exam whilst practicing, that would be cheating, and your final result wouldn't reflect how well you'd learned.

For now, think about the last time you went for a test. Did you practice beforehand? Was the practice you were doing helpful for the test? When you're thinking about this, try to think of how it aligns with why it's important not to let a machine learning model see a test set, or test data, whilst it's training.
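The train, tune, then test workflow can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the data is synthetic and made up here, and a tiny one-feature k-nearest-neighbours classifier is swapped in as a stand-in model, since the lesson does not name one. The helper names `knn_predict` and `accuracy` are hypothetical.

```python
import random

def knn_predict(train, x, k):
    """Predict a 0/1 label for x by majority vote among its k nearest training points."""
    nearest = sorted(train, key=lambda point: abs(point[0] - x))[:k]
    votes = sum(label for _, label in nearest)
    return 1 if votes * 2 > k else 0

def accuracy(train, data, k):
    """Fraction of (x, label) pairs in data that k-NN, fitted on train, gets right."""
    correct = sum(knn_predict(train, x, k) == label for x, label in data)
    return correct / len(data)

# Synthetic, made-up records: one feature in [0, 1), label is 1 when x > 0.5,
# with roughly 20% of labels flipped to act as noise.
rng = random.Random(0)
records = []
for _ in range(100):
    x = rng.random()
    label = int(x > 0.5)
    if rng.random() < 0.2:
        label = 1 - label
    records.append((x, label))

rng.shuffle(records)
train, val, test = records[:70], records[70:85], records[85:]

# With k=1 every training point is its own nearest neighbour, so the model
# scores perfectly on data it has memorised (assuming no duplicate x values).
print("train accuracy, k=1:", accuracy(train, train, 1))

# Tuning: pick k using the validation set only (odd values avoid tied votes).
candidate_ks = [1, 3, 5, 7, 9]
best_k = max(candidate_ks, key=lambda k: accuracy(train, val, k))
print("chosen k:", best_k)
print("validation accuracy:", accuracy(train, val, best_k))

# The test set is touched once, at the end, for the final check.
print("test accuracy:", accuracy(train, test, best_k))
```

The perfect training score for `k=1` is the memorization machine from the exam analogy. The validation and test scores are the ones that say something about unseen data.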
