0
00:00:00,000 --> 00:00:03,090
1
00:00:03,090 --> 00:00:07,520
RAFAEL IRIZARRY: The data sets used in this series
2
00:00:07,520 --> 00:00:12,520
have been made available to you as R objects, specifically as data frames.
3
00:00:12,520 --> 00:00:16,340
The US murders data, the reported heights data, the Gapminder data,
4
00:00:16,340 --> 00:00:18,950
and the poll data are all examples.
5
00:00:18,950 --> 00:00:22,950
These data sets come included in the dslabs package,
6
00:00:22,950 --> 00:00:25,620
and we loaded them using the data function.
7
00:00:25,620 --> 00:00:28,700
Furthermore, we have made the data available in what
8
00:00:28,700 --> 00:00:34,530
is referred to as tidy form, a concept we define later in this course.
9
00:00:34,530 --> 00:00:38,360
The tidyverse packages and functions assume that the data is tidy,
10
00:00:38,360 --> 00:00:40,710
and this assumption is a big part of the reason
11
00:00:40,710 --> 00:00:42,730
these packages work so well together.
12
00:00:42,730 --> 00:00:45,540
We did quite a bit of work behind the scenes
13
00:00:45,540 --> 00:00:49,710
to get the original raw data into the tidy tables you work with.
14
00:00:49,710 --> 00:00:53,480
However, in a typical data science project,
15
00:00:53,480 --> 00:00:57,770
it is much more common for the data to be in a file, a database,
16
00:00:57,770 --> 00:01:02,540
or extracted from a document, including web pages, tweets, or PDFs.
17
00:01:02,540 --> 00:01:07,600
In these cases, the first step is to import the data into R,
18
00:01:07,600 --> 00:01:11,350
and, when using the tidyverse, tidy up the data.
19
00:01:11,350 --> 00:01:13,750
The first step in the data analysis process
20
00:01:13,750 --> 00:01:17,360
usually involves several, often complicated, steps
21
00:01:17,360 --> 00:01:21,350
to convert data from its raw form to the tidy form that greatly
22
00:01:21,350 --> 00:01:24,010
facilitates the rest of the analysis.
23
00:01:24,010 --> 00:01:27,050
We refer to this process as data wrangling.
24
00:01:27,050 --> 00:01:30,400
In this course, we cover several common steps
25
00:01:30,400 --> 00:01:35,270
of the data wrangling process including importing data into R from files,
26
00:01:35,270 --> 00:01:41,220
tidying data, string processing, HTML parsing, working with dates and times,
27
00:01:41,220 --> 00:01:42,980
and text mining.
28
00:01:42,980 --> 00:01:47,210
Rarely are all these wrangling steps necessary in a single analysis,
29
00:01:47,210 --> 00:01:51,140
but a data scientist will likely face them all at some point.
30
00:01:51,140 --> 00:01:54,970
Some of the examples we use to demonstrate data wrangling techniques
31
00:01:54,970 --> 00:01:59,800
are based on the work we did to convert the raw data into the tidy data
32
00:01:59,800 --> 00:02:05,850
sets provided by the dslabs package and used in the series as examples.
33
00:02:05,850 --> 00:02:08,707