1
00:00:01,140 --> 00:00:04,380
Instructor: Hey, this is the last theoretical section
2
00:00:04,380 --> 00:00:05,430
of the course.
3
00:00:05,430 --> 00:00:07,740
It is about the first activity you want to do
4
00:00:07,740 --> 00:00:10,740
when you start creating a machine learning algorithm.
5
00:00:10,740 --> 00:00:12,450
Preprocessing.
6
00:00:12,450 --> 00:00:14,910
Preprocessing refers to any manipulation
7
00:00:14,910 --> 00:00:16,230
we apply to the data set
8
00:00:16,230 --> 00:00:18,270
before running it through the model.
9
00:00:18,270 --> 00:00:20,400
Everything we saw so far was conditioned
10
00:00:20,400 --> 00:00:22,830
on the fact that we had already pre-processed our data
11
00:00:22,830 --> 00:00:25,230
in a way suitable for training.
12
00:00:25,230 --> 00:00:27,690
You've already seen some preprocessing.
13
00:00:27,690 --> 00:00:31,680
In the TensorFlow intro, we created an npz file.
14
00:00:31,680 --> 00:00:34,050
All the training we did came from there,
15
00:00:34,050 --> 00:00:37,320
so, if you must work with data in an Excel file,
16
00:00:37,320 --> 00:00:38,910
CSV, or whatever,
17
00:00:38,910 --> 00:00:40,980
saving it into an npz file
18
00:00:40,980 --> 00:00:43,023
would be a type of preprocessing.
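A minimal sketch of that kind of preprocessing, assuming a hypothetical numeric CSV file named data.csv whose last column holds the targets; the file name and column layout are illustrative, not taken from the course:

```python
import numpy as np

# Load a purely numeric CSV file (hypothetical name and layout).
raw = np.loadtxt('data.csv', delimiter=',')

inputs = raw[:, :-1]   # every column except the last one
targets = raw[:, -1]   # the last column

# Store both arrays in a single .npz file for the training code to load.
np.savez('data_preprocessed.npz', inputs=inputs, targets=targets)

# Later, the training code can reload the arrays like this:
# data = np.load('data_preprocessed.npz')
# train_inputs, train_targets = data['inputs'], data['targets']
```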
19
00:00:44,190 --> 00:00:45,450
In this section though,
20
00:00:45,450 --> 00:00:48,270
we will mainly focus on data transformations
21
00:00:48,270 --> 00:00:50,433
rather than reordering as before.
22
00:00:52,350 --> 00:00:55,200
What is the motivation for preprocessing?
23
00:00:55,200 --> 00:00:57,363
There are several important points.
24
00:00:58,350 --> 00:01:00,390
The first one is about compatibility
25
00:01:00,390 --> 00:01:02,220
with the libraries we use.
26
00:01:02,220 --> 00:01:03,600
As we saw earlier,
27
00:01:03,600 --> 00:01:05,550
TensorFlow works with tensors
28
00:01:05,550 --> 00:01:07,500
and not Excel spreadsheets.
29
00:01:07,500 --> 00:01:08,730
In data science,
30
00:01:08,730 --> 00:01:11,460
you will often be given data in whatever format
31
00:01:11,460 --> 00:01:14,373
and you must make it compatible with the tools you use.
32
00:01:16,170 --> 00:01:20,580
Second, we may need to adjust inputs of different magnitude.
33
00:01:20,580 --> 00:01:22,950
Let's say we are FX traders.
34
00:01:22,950 --> 00:01:24,810
If one input we are working with
35
00:01:24,810 --> 00:01:27,450
is the end of the day Euro/Dollar exchange rate,
36
00:01:27,450 --> 00:01:29,493
it would be a value around 1.
37
00:01:30,540 --> 00:01:34,110
However, if another input is the daily trading volume,
38
00:01:34,110 --> 00:01:37,380
we would have values like 100,000 and higher.
39
00:01:37,380 --> 00:01:41,340
Obviously, the orders of magnitude are quite different.
40
00:01:41,340 --> 00:01:43,260
A linear combination of numbers
41
00:01:43,260 --> 00:01:46,200
based on such different scales is problematic.
42
00:01:46,200 --> 00:01:48,210
In purely mathematical terms,
43
00:01:48,210 --> 00:01:53,130
a value of 1 is negligible compared to a value of 100,000.
44
00:01:53,130 --> 00:01:56,100
As all the inputs are on an equal footing in a vector
45
00:01:56,100 --> 00:01:57,150
or a matrix,
46
00:01:57,150 --> 00:02:00,573
the algorithm is likely to ignore all values around 1.
47
00:02:01,650 --> 00:02:03,570
These values essentially represent
48
00:02:03,570 --> 00:02:05,790
the Euro/Dollar exchange rate itself,
49
00:02:05,790 --> 00:02:09,509
so they are often more important than the trading volume.
50
00:02:09,509 --> 00:02:13,203
Obviously, something needs to be done to solve this issue.
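One common way to address this, sketched here with made-up numbers rather than real market data, is to standardize each input column so that an exchange rate around 1 and a volume around 100,000 end up on comparable scales:

```python
import numpy as np

# Dummy inputs: column 0 is an exchange rate, column 1 is a trading volume.
inputs = np.array([
    [1.08, 120_000.0],
    [1.10,  95_000.0],
    [1.07, 150_000.0],
])

# Subtract each column's mean and divide by its standard deviation,
# so every column ends up with mean 0 and standard deviation 1.
standardized = (inputs - inputs.mean(axis=0)) / inputs.std(axis=0)
print(standardized)
```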
51
00:02:15,720 --> 00:02:18,360
A third reason is generalization.
52
00:02:18,360 --> 00:02:19,980
Problems that seem different
53
00:02:19,980 --> 00:02:22,800
can often be solved by similar models.
54
00:02:22,800 --> 00:02:25,170
Standardizing inputs of different problems
55
00:02:25,170 --> 00:02:28,200
allows us to reuse the exact same models.
56
00:02:28,200 --> 00:02:29,850
Sometimes there are cases
57
00:02:29,850 --> 00:02:32,820
when we can even reuse already trained networks.
58
00:02:32,820 --> 00:02:34,230
Imagine that!
59
00:02:34,230 --> 00:02:36,510
You have trained a model previously,
60
00:02:36,510 --> 00:02:39,000
you face a new problem, you test your model,
61
00:02:39,000 --> 00:02:40,620
and it works like a charm.
62
00:02:40,620 --> 00:02:42,903
That's not unusual in machine learning.
63
00:02:44,190 --> 00:02:45,450
In the next few lessons,
64
00:02:45,450 --> 00:02:47,190
we will focus on these concepts
65
00:02:47,190 --> 00:02:50,160
and introduce several pre-processing techniques.
66
00:02:50,160 --> 00:02:51,393
Thanks for watching.