Would you like to inspect the original subtitles? These are the user uploaded subtitles that are being translated:
1
00:00:01,050 --> 00:00:06,600
Hello everyone and welcome to the overview of data frame operations data frames are the workhorse of
2
00:00:06,600 --> 00:00:07,130
our.
3
00:00:07,350 --> 00:00:12,330
So in this lecture we're going to basically be going over all the common operations use of data frames
4
00:00:12,330 --> 00:00:15,090
and our this is going to be a very useful lecture.
5
00:00:15,100 --> 00:00:19,350
We're going to be going over material we've already covered since it's so vital to know data frames
6
00:00:19,470 --> 00:00:22,910
very well in order to save yourself time in the future.
7
00:00:22,920 --> 00:00:25,720
Let's go ahead and open up our studio and get started.
8
00:00:26,280 --> 00:00:33,840
OK so here our studio and you can see here in this the script I have listed a basically a list of topics
9
00:00:33,840 --> 00:00:37,380
we're going to be going over we're going to start off with creating data frames importing exporting
10
00:00:37,380 --> 00:00:42,990
data all the way down until we can get to how to deal of missing data and selecting multiple rows multiple
11
00:00:42,990 --> 00:00:43,950
columns.
12
00:00:43,950 --> 00:00:46,920
So we're going to take this one line slash topic at a time.
13
00:00:46,960 --> 00:00:51,650
You can always reference the notes from this lecture in case you interview anything we go over.
14
00:00:51,650 --> 00:00:56,460
We're going to go ahead and make the console larger and most of what we're going to be doing isn't in
15
00:00:56,460 --> 00:00:57,960
the con..
16
00:00:58,050 --> 00:01:01,640
Let's go ahead and start with creating data frames.
17
00:01:01,690 --> 00:01:04,250
I in and clear the console of control here.
18
00:01:04,810 --> 00:01:05,640
OK.
19
00:01:05,940 --> 00:01:10,950
So if we want to just create a basic data frame just an empty data frame we can do something like this
20
00:01:10,950 --> 00:01:11,010
.
21
00:01:11,020 --> 00:01:16,800
You can say empty and data that frame is going to be the main function you can see our studios already
22
00:01:16,800 --> 00:01:18,300
completing it for us.
23
00:01:18,960 --> 00:01:22,530
And that's the way you can make just a basic empty data frame.
24
00:01:22,530 --> 00:01:27,250
Now let's go ahead and make a data frame from Vector we can say something like this.
25
00:01:27,340 --> 00:01:35,780
C1 which is going to be our first column is going to be just a vector of integers 1 through 10 reason
26
00:01:35,790 --> 00:01:43,740
that colon rotations quickly create that vector of integers and in other key trick to R is if you just
27
00:01:43,740 --> 00:01:49,980
type in letters letters is basically a vector of the alphabet that's built into our.
28
00:01:50,060 --> 00:01:53,390
You can quickly reference for letters A through Z.
29
00:01:53,400 --> 00:02:04,750
So what I'm going to go ahead and do is say C-2 and assign letters 1 through 10.
30
00:02:05,670 --> 00:02:08,830
Enough I check out C-One Anssi too.
31
00:02:09,120 --> 00:02:11,580
I have two columns that are available for me.
32
00:02:11,610 --> 00:02:15,650
Let's go ahead and combine those columns and make them part of a data frame.
33
00:02:16,040 --> 00:02:18,400
Then I go out and make an object called DSF.
34
00:02:18,540 --> 00:02:21,250
Use my keyword data frame data frame.
35
00:02:21,270 --> 00:02:24,950
That's going to be the function that allows me to build these data frames.
36
00:02:25,370 --> 00:02:37,750
I'll start off with call that name that one is equal to see one call not name that two is equal to C2
37
00:02:37,810 --> 00:02:38,770
.
38
00:02:39,840 --> 00:02:45,150
And then we can go ahead and check out this data frame if we can see here we have a rows and or two
39
00:02:45,150 --> 00:02:45,780
columns.
40
00:02:45,780 --> 00:02:49,590
So let's break down the state of frame function just pretty quickly.
41
00:02:49,620 --> 00:02:54,980
It's data frame and then it's the parameters that you pass in or the column names.
42
00:02:55,110 --> 00:02:57,630
And you can actually just create these call names you can make them up.
43
00:02:57,630 --> 00:02:59,360
They don't have to be a reference to something.
44
00:02:59,580 --> 00:03:04,530
And then you say equal and then what we actually want to pasan as the data for that column.
45
00:03:04,530 --> 00:03:07,490
So again I'm saying call that name too.
46
00:03:07,500 --> 00:03:11,460
So I want that to be the second column equals and then I pass in the data.
47
00:03:11,460 --> 00:03:16,450
If you don't pass in these variable names you're kind of making up for the column names.
48
00:03:16,500 --> 00:03:22,800
What are you going to do it's going to automatically name your columns for the variable names of whatever
49
00:03:22,800 --> 00:03:25,420
the data you were putting into it.
50
00:03:25,530 --> 00:03:28,900
All right now let's briefly go over importing and exporting data.
51
00:03:28,920 --> 00:03:32,570
There's a lot more lectures on imprinting exporting data to specific formats.
52
00:03:32,680 --> 00:03:37,740
Here we're just going to go have a brief overview of how to import and export from CSP files.
53
00:03:38,030 --> 00:03:39,680
Let's go at in clear a con..
54
00:03:40,260 --> 00:03:44,170
To read a C as we filed the command the just read that C S V.
55
00:03:44,430 --> 00:03:48,810
That's going to be the most common command throughout the course as far as reading our information.
56
00:03:48,870 --> 00:03:55,080
They can say I've read that C S V and then you just pasan whatever your file name is you can say some
57
00:03:55,080 --> 00:04:03,660
file does CXXVI and this will go ahead and read that into a different if you want to writes to a C S
58
00:04:03,660 --> 00:04:04,180
V.
59
00:04:04,320 --> 00:04:08,550
You can just use write thought C S V passing your data.
60
00:04:08,550 --> 00:04:15,180
So for instance I can pass in the data frame we just made if and then the second argument is file and
61
00:04:15,180 --> 00:04:17,180
you pass in the file you want.
62
00:04:17,760 --> 00:04:22,910
So we can say something like Saved f not sci fi.
63
00:04:22,980 --> 00:04:25,590
So let's go ahead and read that CXXVI.
64
00:04:25,590 --> 00:04:30,950
So say DFI to read the CXXVIII.
65
00:04:30,960 --> 00:04:34,110
We'll go ahead and pass in the file name.
66
00:04:34,410 --> 00:04:40,470
So if we look at this it was saved underscore that CXXVI and I can actually press tab and it should
67
00:04:40,520 --> 00:04:49,430
auto complete to the filename you know we have DFI too which is the same thing as the if.
68
00:04:49,600 --> 00:04:57,790
But the thing to notice here is when you're saving it as a C S V with just these two parameters your
69
00:04:57,790 --> 00:05:01,650
index will also be saved as a column into that CAC file.
70
00:05:01,720 --> 00:05:05,200
So you'll notice that when we read that cxxviii file it's a Ze'ev too.
71
00:05:05,320 --> 00:05:11,570
We've got an extra column called X and the X column is just the index of that data frame that you saved
72
00:05:11,570 --> 00:05:12,970
at CAC.
73
00:05:13,420 --> 00:05:17,900
After reading this CACP you can either drop it keep it doesn't really matter what you do with it but
74
00:05:17,910 --> 00:05:21,820
just keep in mind that it's also going to save that index information.
75
00:05:21,880 --> 00:05:26,320
And the reason for that is because a lot of times in your data frames your index isn't just going to
76
00:05:26,320 --> 00:05:31,610
be numbers are actually going to be string names such as names of states or people cetera.
77
00:05:31,690 --> 00:05:36,400
So it's good to say that information when you're writing to see every file when you're reading it you
78
00:05:36,400 --> 00:05:41,470
can go ahead and either delete drop or do whatever you want that information.
79
00:05:41,470 --> 00:05:42,040
All right.
80
00:05:42,040 --> 00:05:47,650
Now let's go ahead and move on to our third topic getting information about the data frame going to
81
00:05:47,650 --> 00:05:49,040
go ahead and clear the con..
82
00:05:49,540 --> 00:05:51,870
And we have our data frame Dia.
83
00:05:52,460 --> 00:05:54,920
Let's who wanted to know the row and column counts.
84
00:05:54,940 --> 00:06:00,880
There's a couple of ways to do this but the most direct way is the NRO function that's all lowercase
85
00:06:00,940 --> 00:06:01,050
.
86
00:06:01,090 --> 00:06:03,430
And then you just go ahead and pass in your data frame.
87
00:06:03,730 --> 00:06:10,060
And I'll tell you back how many rows you have and you can also say and cnl for end columns passing data
88
00:06:10,060 --> 00:06:12,110
frame it'll tell you how many columns you have.
89
00:06:12,110 --> 00:06:16,980
Back again this doesn't include the actual index just the columns that have names.
90
00:06:17,350 --> 00:06:25,300
If you actually want to get those names you can use the call names function so that C O L names and
91
00:06:25,300 --> 00:06:30,390
then just call on any data frame and all or turnback a vector of the column names.
92
00:06:30,490 --> 00:06:32,110
You can do the same with forenames.
93
00:06:32,120 --> 00:06:36,200
However we should probably be careful of this because we have a very large data frame.
94
00:06:36,470 --> 00:06:38,460
This output will be really large.
95
00:06:38,470 --> 00:06:39,660
So just keep that in mind.
96
00:06:39,730 --> 00:06:44,590
And in this case where that data frame just has an index it's not super useful to have the road names
97
00:06:44,830 --> 00:06:45,560
.
98
00:06:45,670 --> 00:06:51,130
Couple of other ways to get information but data frame is the SDR function which returns the structure
99
00:06:51,190 --> 00:06:52,040
of the data frame.
100
00:06:52,090 --> 00:06:58,210
So here are the math here I'll tell you how many observations you have of how many variables which are
101
00:06:58,210 --> 00:07:01,440
the same as just those columns will tell you the column names.
102
00:07:01,690 --> 00:07:07,940
So you kind of the data type is in those columns so integers or a factor of 10 levels etc..
103
00:07:08,110 --> 00:07:10,600
That's one way of getting information off the data frame.
104
00:07:10,870 --> 00:07:15,820
The other way is with summary and this will give you a statistical summary of what's going on in your
105
00:07:15,820 --> 00:07:17,260
data frame.
106
00:07:17,260 --> 00:07:22,040
Some of these don't make too much sense if you're looking at it from a statistical standpoint such as
107
00:07:22,040 --> 00:07:24,450
a column name to the column name one.
108
00:07:24,450 --> 00:07:27,520
If we check a look it's just a bunch of integers.
109
00:07:27,730 --> 00:07:29,910
So here we get statistical information on the median.
110
00:07:29,910 --> 00:07:36,190
The mean the max value cetera recall him to since it was a bunch of string values what you're going
111
00:07:36,190 --> 00:07:42,100
to get in return instead of the sort of statistical information it's just a count of the terms how many
112
00:07:42,100 --> 00:07:42,940
times they show up.
113
00:07:42,940 --> 00:07:44,140
ABC the.
114
00:07:44,230 --> 00:07:44,880
Cetera.
115
00:07:45,010 --> 00:07:49,920
And if there's too many values here they'll eventually say other and give you another count.
116
00:07:49,960 --> 00:07:52,650
Otherwise this would be a really long output.
117
00:07:53,140 --> 00:07:53,970
OK.
118
00:07:54,280 --> 00:07:56,650
That's it for getting information about a data frame.
119
00:07:56,650 --> 00:07:57,710
Our next topic.
120
00:07:57,790 --> 00:08:00,790
Topic number four is going to be referencing cells.
121
00:08:00,790 --> 00:08:06,070
So now we're going to learn how to reference a particular cell off a data frame and go out and clear
122
00:08:06,070 --> 00:08:06,940
this.
123
00:08:06,940 --> 00:08:08,730
Check out our data from DIA.
124
00:08:09,340 --> 00:08:15,970
And there it is the most basic way to reference a single cell in a data frame is by using two sets of
125
00:08:15,970 --> 00:08:17,980
brackets with index numbers.
126
00:08:17,980 --> 00:08:22,010
For example you could say if your first set of brackets.
127
00:08:22,120 --> 00:08:27,710
And then inside a second set of brackets you'll pasan a row comma column value.
128
00:08:27,730 --> 00:08:35,200
So imagine you wanted to get the value at row number five on column two or I could say something like
129
00:08:35,200 --> 00:08:38,260
this 5 which is the row column too.
130
00:08:38,470 --> 00:08:44,380
So if I look at that I get back e as well of some more information on the levels where the levels is
131
00:08:44,380 --> 00:08:48,620
basically just unique values in this case of column name too.
132
00:08:48,670 --> 00:08:50,010
So it's actually going on here.
133
00:08:50,050 --> 00:08:55,000
We checked that row five value in column two which is.
134
00:08:55,390 --> 00:08:59,920
Now this isn't going to be super useful because a lot of times you want to reference the actual column
135
00:08:59,980 --> 00:09:02,560
name not the column number.
136
00:09:02,860 --> 00:09:07,180
So the way you do that is really similar pasan a set of brackets.
137
00:09:07,180 --> 00:09:09,940
Again the row you want in this case it's five.
138
00:09:09,940 --> 00:09:16,090
You can also pass on that index name and in the string you can pass in the column name instead of the
139
00:09:16,090 --> 00:09:17,580
column number.
140
00:09:17,720 --> 00:09:22,330
So call name 2 here and we get back the same value.
141
00:09:22,330 --> 00:09:27,970
You're probably going to be using this sort of reference usually instead of memorizing what positions
142
00:09:27,970 --> 00:09:34,600
the columns are in and we can use this sort of syntax to actually reassign single cell values.
143
00:09:34,770 --> 00:09:40,380
So for instance let's say I wanted to change this to into a number like nine thousand nine hundred ninety
144
00:09:40,380 --> 00:09:41,170
nine.
145
00:09:41,460 --> 00:09:45,900
I would say DMF has a second set of brackets.
146
00:09:46,020 --> 00:09:57,240
That too is in rote too and it's in call that name one and I'm going to head and use a assignments to
147
00:09:57,250 --> 00:09:58,720
pass in a number there.
148
00:09:58,750 --> 00:10:03,100
So let's go in and say nine thousand nine hundred ninety nine that does the assignment.
149
00:10:03,120 --> 00:10:04,830
So if I take a look at.
150
00:10:05,280 --> 00:10:08,620
Notice now that value is nine thousand nine hundred ninety nine.
151
00:10:08,730 --> 00:10:14,040
So I can call that single cell value and then pass an assignment to it.
152
00:10:14,070 --> 00:10:19,500
So that's an important feature to know that you can use when you're referencing single cells.
153
00:10:19,500 --> 00:10:19,870
All right.
154
00:10:19,890 --> 00:10:26,040
Let's go ahead and move on to referencing rows so Rose is going to be similar except without that double
155
00:10:26,040 --> 00:10:33,030
bracket notation going to go and clear this in order to return a row which you can do is something like
156
00:10:33,030 --> 00:10:33,270
this.
157
00:10:33,270 --> 00:10:41,790
You can say DSF racquet pass in the number or index value of the row you want comma nothing.
158
00:10:41,790 --> 00:10:45,970
Now we should be pretty familiar with this sort of format from our study of Meecher season.
159
00:10:46,830 --> 00:10:47,890
And there it is.
160
00:10:47,940 --> 00:10:48,650
That's the row.
161
00:10:48,660 --> 00:10:51,150
And it returns a data frame not a vector.
162
00:10:51,150 --> 00:10:52,460
So it's something to keep in mind.
163
00:10:52,590 --> 00:10:55,820
You'll still have information of those two column names.
164
00:10:56,760 --> 00:11:01,770
If for some reason you actually didn't want that data frame information you just wanted a straight vector
165
00:11:02,160 --> 00:11:07,920
of those values which you could do is something like an as numeric call to try to convert that row to
166
00:11:07,920 --> 00:11:09,460
a vector.
167
00:11:09,720 --> 00:11:14,960
But again the easiest way to get back a row is just called data frame a single set of brackets.
168
00:11:15,120 --> 00:11:20,250
The row you want comma nothing and then close that bracket.
169
00:11:20,250 --> 00:11:24,960
Let's talk about referencing columns which is going to be kind of similar except our placement is going
170
00:11:24,960 --> 00:11:26,400
to be a little different.
171
00:11:27,060 --> 00:11:32,370
Let's go ahead and play around with the empty cars state a frame that comes built into our.
172
00:11:32,490 --> 00:11:37,290
So the empty car as we take a look at it again it's a bunch of cars and then a bunch of columns with
173
00:11:37,300 --> 00:11:38,210
their difference.
174
00:11:38,250 --> 00:11:43,400
Are there stats like MPG number cylinders horsepower etc..
175
00:11:43,740 --> 00:11:46,810
Let me go ahead and change to suggest looking at the whips.
176
00:11:47,040 --> 00:11:51,000
The head of cars.
177
00:11:51,450 --> 00:11:51,750
All right.
178
00:11:51,880 --> 00:11:55,240
So that's a little more manageable and a control.
179
00:11:55,260 --> 00:11:57,030
I'll do that one more time.
180
00:11:57,090 --> 00:11:58,440
So we have a cleaner screen.
181
00:11:58,800 --> 00:12:00,760
OK we're looking at the head of Indy cars.
182
00:12:00,880 --> 00:12:03,360
There's lots of different ways to actually get a column.
183
00:12:03,370 --> 00:12:06,550
Let's go ahead and talk about four main ways.
184
00:12:06,900 --> 00:12:15,690
One way is just to get a vector back so I can say cars or excuse me empty cars dollar sign and then
185
00:12:15,690 --> 00:12:22,380
the name of my column says one of the most common ways to grab columns as a vector of values.
186
00:12:22,500 --> 00:12:28,260
And you'll notice this is common enough since our studio will actually have a nice little dropdown menu
187
00:12:28,260 --> 00:12:29,970
of all the columns available for you.
188
00:12:29,980 --> 00:12:34,650
So we'll go on to say NPG and then we get back a vector of all the values.
189
00:12:34,650 --> 00:12:40,900
One other way to do this is going to be similar to the way we reference the row.
190
00:12:41,140 --> 00:12:47,730
It's just essentially going to be backwards so you can say nothing comma and then pass in mpg and you'll
191
00:12:47,730 --> 00:12:49,980
get back a vector of those values.
192
00:12:49,980 --> 00:12:55,190
You could have also passed in the number of that column.
193
00:12:55,260 --> 00:12:57,180
So MPG is the first column.
194
00:12:57,180 --> 00:13:02,490
So if I pass then a comma one there I would have also gone back mpg.
195
00:13:02,490 --> 00:13:04,460
And finally a little less common.
196
00:13:04,470 --> 00:13:10,720
But another way we could add this is passing in two sets of brackets like this with a single column
197
00:13:10,710 --> 00:13:12,930
name such as mpg.
198
00:13:13,590 --> 00:13:19,530
OK so those are four different ways you can get back a vector of those column values so empty cars mpg
199
00:13:20,490 --> 00:13:25,450
just with the dollar sign a comma with nothing in front of it.
200
00:13:25,680 --> 00:13:32,190
First comma and then either the name of the column you want or it's position in relation to the other
201
00:13:32,190 --> 00:13:33,780
columns.
202
00:13:33,780 --> 00:13:37,430
Or you can just use a double set of brackets of the column name.
203
00:13:37,470 --> 00:13:43,310
Now something to quickly note here is this were methods of just returning the vectors if we wanted to
204
00:13:43,320 --> 00:13:45,030
actually return a data frame.
205
00:13:45,030 --> 00:13:49,310
So keep this car information here is the MPG with it.
206
00:13:49,360 --> 00:13:58,540
We would just use two other methods so that we could say empty cars single set of brackets mpg and that's
207
00:13:58,530 --> 00:14:02,940
going to return a data frame instead of just a vector of the values.
208
00:14:02,940 --> 00:14:07,680
So note the difference here between these two methods using two sets of brackets is going to return
209
00:14:07,680 --> 00:14:09,340
that vector of values.
210
00:14:09,510 --> 00:14:15,320
If we just use a single set of brackets they'll return a data frame with just that column back.
211
00:14:15,690 --> 00:14:21,100
Now we could have also instead of having to say mpg if we knew the order of the columns we could always
212
00:14:21,100 --> 00:14:25,920
just replace this with an integer value for that column which in this case is one that said use the
213
00:14:25,920 --> 00:14:27,100
first column.
214
00:14:27,610 --> 00:14:33,000
So keep in mind you have this method available to you if you want to add data frame with that column
215
00:14:33,000 --> 00:14:33,940
back.
216
00:14:34,410 --> 00:14:38,270
Let's go ahead and clear the con..
217
00:14:38,600 --> 00:14:42,660
And last thing I want to talk about before we move on from referencing columns is how to get multiple
218
00:14:42,660 --> 00:14:45,120
columns back as a data frame.
219
00:14:45,120 --> 00:14:53,720
So if we wanted to for instance get empty cars this would just return the MPG column who want multiple
220
00:14:53,730 --> 00:14:59,460
columns we could pass any vector of column names such as MPG and cylinders.
221
00:14:59,460 --> 00:15:04,560
Let's say that I'm going to go ahead and type and head at the beginning of all this.
222
00:15:04,560 --> 00:15:07,380
So you only see the first six rows.
223
00:15:08,010 --> 00:15:13,380
Whoops forgot to close off the vector.
224
00:15:13,410 --> 00:15:14,370
There it is.
225
00:15:14,880 --> 00:15:15,470
OK.
226
00:15:15,690 --> 00:15:21,000
So again empty cars instead of just passing MPG to get a single column back as the data frame.
227
00:15:21,050 --> 00:15:25,470
Can pass on a vector of column values you want and you can also pass them back in in whatever order
228
00:15:25,470 --> 00:15:25,850
you want.
229
00:15:25,870 --> 00:15:29,880
They don't have to be in the same order they are in that original data frame.
230
00:15:29,880 --> 00:15:30,150
All right.
231
00:15:30,150 --> 00:15:35,520
So you've gone over creating data frames importing exporting data from a CNC file getting information
232
00:15:35,520 --> 00:15:39,710
about the data frame referencing cells rows and columns.
233
00:15:40,020 --> 00:15:41,310
We'll go ahead and stop here.
234
00:15:41,350 --> 00:15:45,660
And then the next lecture will have a part two to this lecture we'll all go over adding rows adding
235
00:15:45,660 --> 00:15:51,090
columns setting column names or multiple rows or multiple columns and then dealing with missing data
236
00:15:51,100 --> 00:15:51,820
.
237
00:15:51,810 --> 00:15:56,190
All right thanks everyone and I'll see you at the next lecture which is just going to be part two of
238
00:15:56,250 --> 00:15:57,960
overview data frame operations
25024
Can't find what you're looking for?
Get subtitles in any language from opensubtitles.com, and translate them here.