8. Modelling - Splitting Data

Look at us go. We're moving through this framework at lightning pace. We've done problem definition, we've looked at data, we've decided on an evaluation metric, and we've understood a few of the features we've got in our data. Now we're up to step five, which is modelling.

Now, there are a few parts to modelling, so we've broken this down into four different sections. This is section one, and it covers probably the most important concept in machine learning: the three sets.

Over the whole of modelling, we want to answer the question: based on our problem and data, what machine learning model should we use? Modelling can be broken down into three parts: choosing and training a model, tuning a model, and model comparison.

Before we get into these, though, part one of modelling, and the most paramount topic to discuss in this whole entire course, is the most important concept in machine learning: the training, validation and test splits, commonly referred to as the three sets.

Now, since you want to be using machine learning models to gain insights on some data or to predict the future, it's important to test how well they would do in the real world.
To do this, you split your data into three different sets: a training set to train your model on, a validation set to tune your model on, and a test set to test and compare your different models.

Why is this important? Think of it like this. When you're at university, you might study the course materials all through the semester. Then, before the final exam, you might see how you could improve your knowledge on a practice exam. After doing well on the practice exam, you're confident you'll do well on the final exam. When you take the final exam, although you've never seen some of the problems before, you're able to adapt the knowledge you've learned from the study materials to the slightly different but similar questions on the final exam. Because of this, you pass the final exam with great marks.

This adaptation that you had from the course materials and practice exams to the final exam is referred to in machine learning as generalisation: the ability for a machine learning model to perform well on data it hasn't seen before because of what it's learned on another dataset.

Now, where might this go wrong? Well, if your professor accidentally sent out the final exam for everyone to practice on, then when it came time for the actual exam, everyone would have already seen it. Since people know what they should be expecting, they go through the exam.
They answer all the questions with ease, and everyone ends up getting top marks. Now, top marks might appear good, but did the students really learn anything, or were they just expert memorization machines?

For your machine learning models to be valuable at predicting something in the future on unseen data, you'll want to avoid them becoming memorization machines. This is where training, validation and test splits come in.

In our heart disease example, let's say there were 100 patients. One way to create these splits is to shuffle these patients, then select 70 percent for training, 15 percent for validation and 15 percent for testing. That means there'd be 70 patients in the training set, 15 patients in the validation split and 15 patients in the test split.

Now, the percentages of each of these may vary, but standard practice is usually around 70 to 80 percent for training, 10 to 15 percent for validation and 10 to 15 percent for test. You may see in some examples that some datasets only get split into training and test, but that's a case-by-case scenario. Usually you'll have three different sets.

Then, once you've got these splits, using a model you've chosen, you'd feed it the training data, or the information of these 70 patient records.
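To make the shuffle-then-split idea concrete, here is a minimal sketch in Python using only the standard library. The function name `three_way_split`, the fixed seed and the integer patient IDs are assumptions made for illustration, not something from the course itself; in practice you would likely use a library helper instead.

```python
import random

def three_way_split(records, train_frac=0.70, val_frac=0.15, seed=42):
    """Shuffle records, then cut them into train, validation and test lists.

    Hypothetical helper for illustration; the test set takes whatever is
    left after the training and validation portions are removed.
    """
    shuffled = list(records)               # copy so the caller's list is untouched
    random.Random(seed).shuffle(shuffled)  # seeded shuffle for repeatable splits
    n_train = round(len(shuffled) * train_frac)
    n_val = round(len(shuffled) * val_frac)
    train = shuffled[:n_train]
    val = shuffled[n_train:n_train + n_val]
    test = shuffled[n_train + n_val:]
    return train, val, test

# Stand-in for 100 patient records (just IDs here, not real data).
patients = list(range(100))
train, val, test = three_way_split(patients)
print(len(train), len(val), len(test))  # 70 15 15
```

Each patient lands in exactly one of the three lists, which is the property that matters: nothing from the validation or test lists is available during training.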
And once your model has trained, you can check its results and see if you can improve them on the validation set. This is where you do model tuning. So even if your machine learning model has got one set of results on the patient records, you can actually improve them on the validation split, and we'll see this in a future lesson. The validation split is where you should be testing to see if you can improve.

Finally, once you've improved your model, you can check the model's results, as well as any other models' results that you might have produced during experimentation, on the test set.

What's important to remember is that all three of these sets are separate. During training, the model never sees the validation split or the test split, and during testing, you're doing it on the test split, not the training set. It's the same as when you were studying for your exam: if you saw the final exam whilst practicing, that would be cheating, and your final result wouldn't reflect how well you'd learned.

For now, think about the last time you went for a test. Did you practice beforehand? Was the practice you were doing helpful for the test? And when you're thinking about this, try and think of how it aligns to why it's important to not let a machine learning model see a test set or test data samples whilst it's training.
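The train, tune, then test workflow can be sketched roughly as follows. Everything here is an assumption for illustration: the data is synthetic (one made-up feature and a noisy 0/1 label, not real heart disease records), and the model is a tiny hand-written k-nearest-neighbours classifier chosen only because it needs no extra libraries. The point is which set gets used at which step.

```python
import random

def knn_predict(train, x, k):
    """Predict a 0/1 label for x by majority vote of the k nearest training points."""
    nearest = sorted(train, key=lambda rec: abs(rec[0] - x))[:k]
    votes = sum(label for _, label in nearest)
    return 1 if votes * 2 > k else 0

def accuracy(train, data, k):
    """Fraction of (feature, label) pairs in data that the model gets right."""
    correct = sum(knn_predict(train, x, k) == y for x, y in data)
    return correct / len(data)

# Synthetic stand-in data: 100 (feature, label) records with some label noise.
rng = random.Random(0)
records = []
for _ in range(100):
    x = rng.uniform(0, 1)
    y = 1 if x + rng.gauss(0, 0.15) > 0.5 else 0
    records.append((x, y))

# Shuffle, then split 70 / 15 / 15.
rng.shuffle(records)
train, val, test = records[:70], records[70:85], records[85:]

# Tuning: choose k by comparing results on the validation set only.
candidate_ks = [1, 3, 5, 7, 9]  # odd values so a vote can never tie
best_k = max(candidate_ks, key=lambda k: accuracy(train, val, k))

# Final check: the test set is touched once, after tuning is finished.
test_acc = accuracy(train, test, best_k)
print(f"best k = {best_k}, test accuracy = {test_acc:.2f}")
```

Notice that the test records never influence the choice of `best_k`. If they did, the final score would be like the leaked final exam: it would look good without telling you how well the model generalises.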