1
00:00:00,560 --> 00:00:03,110
Now we've figured out what problem we're trying to solve.
2
00:00:03,110 --> 00:00:07,730
We've matched our specific problem to a different type of machine learning problem.
3
00:00:07,730 --> 00:00:10,640
It's time to have a look at what data we have.
4
00:00:10,700 --> 00:00:16,350
As you may have guessed, the question we're trying to answer here is: what kind of data do we have?
5
00:00:16,370 --> 00:00:23,180
Data comes in many different shapes and sizes, but the main two types are structured and unstructured.
6
00:00:23,960 --> 00:00:29,030
Structured data is something you'd expect to see in an Excel file, such as rows and columns of different
7
00:00:29,030 --> 00:00:36,390
patient medical records and whether or not they have heart disease, or customer purchase transactions.
8
00:00:36,440 --> 00:00:42,920
It's called structured data because all of the samples, the different patient records, are typically in a
9
00:00:42,920 --> 00:00:50,450
similar format, meaning one column might contain numbers of a certain type, such as average blood pressure
10
00:00:50,570 --> 00:00:58,190
or sex or weight of a patient, and another column might have whether they have chest pain or not and
11
00:00:58,190 --> 00:01:00,640
what the level of intensity is.
12
00:01:00,680 --> 00:01:08,870
Unstructured data are things like images, natural language text such as transcribed phone calls, videos,
13
00:01:09,110 --> 00:01:15,120
and audio files. Although we can turn these into numbers and create structure,
14
00:01:15,120 --> 00:01:17,960
they typically come in many varying formats.
15
00:01:18,000 --> 00:01:24,180
One picture of a dog may look completely different to another image of a dog, and the emails you write
16
00:01:24,180 --> 00:01:29,760
back and forth with a friend may have a completely different structure to the emails you'd write to
17
00:01:29,760 --> 00:01:31,150
a co-worker.
18
00:01:31,170 --> 00:01:37,590
Now, within these two data types, there's static and streaming data. Static data is data which doesn't
19
00:01:37,590 --> 00:01:39,170
change over time.
20
00:01:39,300 --> 00:01:45,690
You may have a spreadsheet of patient records in a .csv format, which stands for Comma-Separated
21
00:01:45,690 --> 00:01:52,360
Values, which simply means all of the different data is in one file, separated by commas.
22
00:01:52,500 --> 00:01:53,890
It looks like this.
23
00:01:53,940 --> 00:01:55,680
If you check this table, we've got ID,
24
00:01:55,680 --> 00:01:56,720
comma, weight,
25
00:01:56,760 --> 00:02:03,030
comma, sex. And if we were to read that into a DataFrame using a tool like pandas (we'll have a
26
00:02:03,030 --> 00:02:04,690
look at this in a future lesson),
27
00:02:04,800 --> 00:02:06,510
it would look something like this.
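As a rough sketch of that step, the snippet below reads a small comma-separated table into a pandas DataFrame. The rows are made-up illustrative values, not data from the lesson, and the column names (`ID`, `weight`, `sex`, `heart_disease`) are assumptions based on the table described above.

```python
import io

import pandas as pd

# Hypothetical CSV contents; every value is invented for illustration.
csv_text = """ID,weight,sex,heart_disease
1,80,M,yes
2,62,F,no
3,95,M,yes
"""

# read_csv accepts a file path or any file-like object,
# so an in-memory string buffer stands in for a real .csv file here.
patients = pd.read_csv(io.StringIO(csv_text))

# Each CSV line becomes a row and each comma-separated field a column.
print(patients)
print(patients.dtypes)
```

With a real file, the same call would take a path instead, for example `pd.read_csv("patients.csv")`, where the file name is hypothetical.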
28
00:02:06,540 --> 00:02:11,970
So a lot of data you'll actually come across comes in a simple format like this.
29
00:02:11,970 --> 00:02:18,150
But to turn it into something that's a little bit more structured, you can convert it to this.
30
00:02:18,290 --> 00:02:21,720
Now, CSV is one of the most common types of static data formats.
31
00:02:21,800 --> 00:02:24,920
We're going to get very used to this by the end of the course.
32
00:02:25,140 --> 00:02:31,890
And since these values won't really change over time, they're called static. Usually, what you'll want
33
00:02:31,980 --> 00:02:35,120
is a lot of these examples in machine learning.
34
00:02:35,130 --> 00:02:38,660
There's a saying: the more data, the better.
35
00:02:38,850 --> 00:02:44,520
Which makes sense if you think about it. The more examples you have of something, such as the inputs and
36
00:02:44,610 --> 00:02:52,140
outputs of patient records, where the inputs are a patient's body parameters and the outputs are whether
37
00:02:52,140 --> 00:02:54,260
they have heart disease or not,
38
00:02:54,540 --> 00:02:58,090
the more chances you'll have to find patterns between them.
39
00:02:58,110 --> 00:03:00,390
The same goes for machine learning algorithms.
40
00:03:00,390 --> 00:03:07,380
The more examples they can look at, the more chance they have of finding patterns and thus using those
41
00:03:07,380 --> 00:03:10,160
patterns to predict something in the future,
42
00:03:10,260 --> 00:03:14,970
like whether a new patient who comes along, who isn't in this table, has heart disease or
43
00:03:14,970 --> 00:03:15,240
not.
44
00:03:17,070 --> 00:03:20,830
Streaming data is data which is constantly changing over time.
45
00:03:20,880 --> 00:03:26,430
For example, say you wanted to predict how a stock price will change based on news headlines. You'll be
46
00:03:26,430 --> 00:03:27,990
working with streaming data.
47
00:03:28,050 --> 00:03:34,380
Since news headlines are being updated constantly, you'll want to be the first to see how they change
48
00:03:34,380 --> 00:03:42,790
stocks. Most of the work you will do in practice will start on static data, and then, if your data analysis
49
00:03:42,850 --> 00:03:48,580
and machine learning efforts show some insights, you'll move towards streaming data when
50
00:03:48,580 --> 00:03:51,470
you go to deployment or production.
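One way to picture the difference between static and streaming data is the toy sketch below. It is only an illustration under stated assumptions: the headlines are invented, and `headline_stream` is a hypothetical stand-in for a live news feed, not a real API.

```python
import itertools
from typing import Iterator, List

# Static data: a fixed collection that is fully available up front.
static_headlines: List[str] = [
    "Company A reports record profits",
    "Company B recalls product",
]


def headline_stream() -> Iterator[str]:
    """Hypothetical live feed: yields one new headline at a time, without end."""
    for i in itertools.count(1):
        yield f"Breaking headline #{i}"


# Static data has a known size, so it can be processed in one pass.
print(len(static_headlines))

# A stream has no known end, so we only take what has arrived so far.
first_three = list(itertools.islice(headline_stream(), 3))
print(first_three)
```

The static list can be measured and re-read at any time, while the generator only ever hands over the next item, which is roughly the situation a deployed model faces with incoming headlines.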
51
00:03:51,910 --> 00:03:58,510
A common data science workflow begins by opening a CSV file in a Jupyter Notebook, a tool for building
52
00:03:58,510 --> 00:04:05,830
machine learning projects, then exploring the data and performing data analysis using pandas, a Python
53
00:04:05,830 --> 00:04:12,790
library for data analysis, and making visualizations such as graphs and comparing different data points
54
00:04:12,790 --> 00:04:21,280
using Matplotlib, then building machine learning models on the data using scikit-learn, such as a machine
55
00:04:21,280 --> 00:04:25,060
learning model to predict, using these patterns here,
56
00:04:25,060 --> 00:04:31,790
whether or not a patient has heart disease. Don't worry if you're thinking, what's a Jupyter Notebook
57
00:04:32,110 --> 00:04:35,730
and pandas? What, are we at the zoo?
58
00:04:35,840 --> 00:04:40,750
We've got dedicated sections and projects coming up for each of these tools.
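To give a feel for the workflow just described, here is a minimal sketch of the explore and model steps using pandas and scikit-learn. The patient values are invented for illustration, the column names are assumptions, and `LogisticRegression` is simply one model chosen for the example; the lesson does not name a specific model. The plotting step with Matplotlib is left out to keep the sketch short.

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression

# Hypothetical patient table; every value is invented for illustration.
patients = pd.DataFrame(
    {
        "weight": [80, 62, 95, 58, 102, 70],
        "heart_rate": [81, 75, 99, 70, 104, 78],
        "heart_disease": [1, 0, 1, 0, 1, 0],
    }
)

# Explore: summary statistics for each column.
print(patients.describe())

# Model: the inputs are body measurements, the output is the label.
X = patients[["weight", "heart_rate"]]
y = patients["heart_disease"]
model = LogisticRegression(max_iter=1000)
model.fit(X, y)

# Predict for a new patient who is not in the table.
new_patient = pd.DataFrame({"weight": [88], "heart_rate": [92]})
prediction = model.predict(new_patient)
print(prediction)
```

Six rows are far too few to learn anything meaningful; the point is only the shape of the steps: load a table, look at it, fit a model on inputs and outputs, then predict for an unseen sample.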
59
00:04:40,810 --> 00:04:46,030
For now, think about the different kinds of data you create or use every day.
60
00:04:46,030 --> 00:04:48,670
Are they structured or unstructured?
61
00:04:48,670 --> 00:04:49,710
How much data is there?