Data Analysis in R, the data.table way DataCamp
1. Data.table Novice
1.1. Create and subset a data.table
Welcome to the interactive exercises for your data.table course. Here you will
learn the ins and outs of working with the data.table package.
While most of the material is covered by Matt and Arun in the videos, you will
sometimes need to show some street smarts to get to the right answer.
Remember that before using the hint you can always have a look at the official
documentation by typing ?data.table in the console.
Let's start with some warm-up exercises based on the topics covered in the
video. Recall from the video that you can use L after a numeric to specify that
it is an integer. You can also give columns with different lengths when creating
a data.table, and R will "recycle" the shorter column to match the length of the
longer one by re-using the first items. In the example below, column x is
recycled to match the length of column y:
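The example this sentence refers to did not survive extraction. A minimal sketch of the recycling behaviour, with made-up values (the name DT and the data are assumptions, not the course's original example), could look like this:

```r
library(data.table)

# Hypothetical data: x has 2 items, y has 4, so x is recycled to length 4
DT <- data.table(x = c("a", "b"), y = 1:4)
print(DT)
```

Here 1:4 is an integer vector, in line with the L suffix mentioned above (1:4 is the same as c(1L, 2L, 3L, 4L)).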
Select the second and third rows without using commas and print the
result to the console.
# Create my_first_data_table
my_first_data_table <- data.table(x = c("a", "b", "c", "d", "e"),
                                  y = c(1, 2, 3, 4, 5))
# Print the second and third row to the console without using commas
DT[c(2,3)]
Select row 2 twice and row 3 once, returning a data.table with three rows (two of
which are identical).
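One way to approach this instruction is to repeat the row number inside i. The sketch below reuses my_first_data_table from earlier as a stand-in for the exercise's DT, which is an assumption:

```r
library(data.table)

my_first_data_table <- data.table(x = c("a", "b", "c", "d", "e"),
                                  y = c(1, 2, 3, 4, 5))

# Row 2 twice, then row 3 once; no comma is needed when only i is given
repeated <- my_first_data_table[c(2, 2, 3)]
print(repeated)
```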
Correct. When you use .() in j, the result is always a data.table. For
convenience, data.table also returns a plain vector when you compute on just a
single column without wrapping it in .().
Type D <- 5 in the console. What do you think is the output of
DT[, .(D)] and DT[, D]?
Well done! DT has no column named D, so D is not resolved as a column. This
causes data.table to look for D in DT's parent frame (the calling scope). Also note
that .() in j always returns a data.table.
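The lookup described above can be tried out with a small sketch. The table contents are hypothetical. The bare-symbol call is wrapped in tryCatch() because, as far as I know, recent data.table releases may raise an error for a single symbol in j that is not a column, rather than falling back to the calling scope:

```r
library(data.table)

DT <- data.table(A = 1:3, B = c("x", "y", "z"))  # hypothetical table without a column D
D <- 5

# .() in j: D is not a column, so it is found in the calling scope
as_table <- DT[, .(D)]
print(as_table)

# Bare symbol in j: behaviour may depend on the data.table version, so guard it
bare <- tryCatch(DT[, D], error = function(e) conditionMessage(e))
print(bare)
```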
In the video, the second argument j was covered. j can be used to select
columns by wrapping the column names in .() .
In addition to selecting columns, you can also call functions on them as if the
columns were variables. For example, if you had a data.table heights storing
people's heights in inches, you could compute their heights in feet as follows:
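The code for this example is missing from the extracted text. A sketch under stated assumptions: the column names name and inches, and the first row, are hypothetical, while the Boris and Jim rows are chosen to agree with the output fragment that survives in the source.

```r
library(data.table)

# Hypothetical heights table; inches is an assumed column name
heights <- data.table(name   = c("Anna", "Boris", "Jim"),
                      inches = c(73, 71, 68))

# Call a function (here plain division) on a column inside j
in_feet <- heights[, .(name, feet = inches / 12)]
print(in_feet)
```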
2: Boris 5.916667
3: Jim 5.666667
Create a subset containing the columns B and C for rows 1 and 3 of DT.
Simply print out this subset to the console.
From DT, create a data.table, ans with two columns: B and val,
where val is the product of A and C.
Fill in the blanks in the assignment of ans2, such that it equals the
data.table specified in target. Use columns from the previously defined
data.tables to produce the val column.
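The exercise's DT is not shown in the text, so the sketch below builds a hypothetical one with columns A, B and C to illustrate the first two instructions. The third instruction depends on a target table that is not reproduced here, so it is left out.

```r
library(data.table)

# Hypothetical stand-in for the exercise's DT
DT <- data.table(A = 1:5, B = letters[1:5], C = 6:10)

# Columns B and C for rows 1 and 3
subset_bc <- DT[c(1, 3), .(B, C)]
print(subset_bc)

# Two columns: B, and val computed as the product of A and C
ans <- DT[, .(B, val = A * C)]
print(ans)
```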
First, just print iris to the console and observe that all rows are printed and
that the column names scroll off the top of your screen. This is
because iris is a data.frame. Scroll back up to the top to see the column
names.
Convert the iris dataset to a data.table DT. You're now ready to use
data.table magic on it!
Compute the mean Sepal.Length for each Species. Do not provide a name
for the resulting column.
Do exactly the same as in the instruction above, but this time, group by
the first letter of the Species name instead. Use substr() for this.
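A possible way to work through these iris instructions is sketched below. The variable names for the results are my own:

```r
library(data.table)

# Convert the built-in iris data.frame to a data.table
DT <- as.data.table(iris)

# Mean Sepal.Length per Species; the unnamed result column is auto-named
by_species <- DT[, mean(Sepal.Length), by = Species]
print(by_species)

# Same computation, grouped by the first letter of the Species name
by_letter <- DT[, mean(Sepal.Length), by = substr(Species, 1, 1)]
print(by_letter)
```

Because versicolor and virginica both start with "v", the second call returns two groups instead of three.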
# Group the specimens by Sepal area (to the nearest 10 cm2) and count how many
# occur in each group
DT[, .N, by = 10 * round(Sepal.Length * Sepal.Width / 10)]
Correct! Notice that the groups are returned in the order in which they first
appear in DT. This exercise was not that simple, so it's good to see you've made
it!
Create a new data.table DT2 with 3 columns, A, B and C, where C is the cumulative
sum of the C column of DT. Call the cumsum() function in the j argument, and group
by .(A, B) (i.e. both columns A and B).
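A sketch of this step, assuming DT is built the way the chaining exercise in the next chapter builds it. The set.seed() call is my addition so that the random C column is reproducible:

```r
library(data.table)

set.seed(1L)  # assumption: added only to make sample() reproducible
DT <- data.table(A = rep(letters[2:1], each = 4L),
                 B = rep(1:4, each = 2L),
                 C = sample(8))

# Cumulative sum of C within each (A, B) group
DT2 <- DT[, .(C = cumsum(C)), by = .(A, B)]
print(DT2)
```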
Select from DT2 the last two values of C using the tail() function, and assign that
to column C while you group by A alone. Make sure the column names don't change.
# Select from DT2 the last two values from C while you group by A
DT2[, .(C = tail(C,2)), by = A]
2. Data.table yeoman
2.1. Chaining: the basics
Now that you are comfortable with data.table's DT[i, j, by] syntax, it is time
to practice some other very useful concepts in data.table. Here, we'll have a
more detailed look at chaining.
In the previous section, you calculated DT2 by taking the cumulative sum of C while
grouping by A and B. Next, you selected the last two values of C from DT2 while grouping
by A alone. This code is included in the sample code. Use chaining to restructure the code.
Simply print out the result of chaining.
# Build DT
DT <- data.table(A = rep(letters[2:1], each = 4L),
                 B = rep(1:4, each = 2L),
                 C = sample(8))
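The chained form is not shown in the extracted text. One way to write it, as a sketch, is to append the second [ ] call directly to the first. The set.seed() call is my addition for reproducibility:

```r
library(data.table)

set.seed(1L)  # assumption: added only to make sample() reproducible
DT <- data.table(A = rep(letters[2:1], each = 4L),
                 B = rep(1:4, each = 2L),
                 C = sample(8))

# Two separate steps, as in the previous chapter
DT2 <- DT[, .(C = cumsum(C)), by = .(A, B)]
two_step <- DT2[, .(C = tail(C, 2)), by = A]

# The same computation as a single chained expression
chained <- DT[, .(C = cumsum(C)), by = .(A, B)][, .(C = tail(C, 2)), by = A]
print(chained)
```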
Great! Note that chaining can significantly reduce the number of statements
and intermediate objects in your code.
In the previous chapter, you converted the iris dataset to a data.table DT.
This DT is already available in your workspace. Print DT to the console to
remind yourself of its contents. Now, let's see how you can use chaining to
simplify manipulations and calculations.
Nicely done, but this is a little bit repetitive and an error can easily be made if you
aren't careful. Maybe there is a better way to do this kind of analysis? Let's find
out…
sex  weight  age  height
M        40    1      12
F        30    4       7
F        80   12       9
M        90    3      14
M        40    6      12
to produce average weights, ages, and heights for male and female dogs
separately:
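The call that this sentence introduces is missing from the extracted text. A plausible sketch uses the dog values listed above. The table name dogs and the column names sex, weight, age and height are assumptions inferred from the surrounding prose:

```r
library(data.table)

# Dog data from the table above; names are assumed
dogs <- data.table(sex    = c("M", "F", "F", "M", "M"),
                   weight = c(40, 30, 80, 90, 40),
                   age    = c(1, 4, 12, 3, 6),
                   height = c(12, 7, 9, 14, 12))

# .SD holds every non-grouping column for the current group
avg_by_sex <- dogs[, lapply(.SD, mean), by = sex]
print(avg_by_sex)
```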
A data.table DT has been created for you and is available in the workspace.
Type DT in the console to print it out and inspect it.
# Median of columns
DT[, lapply(.SD, median), by = x]
Using .SDcols allows you to apply a function to all rows of a data.table, but
only to some of the columns. For example, consider the dog example from the
last exercise. If you wanted to compute the average weight and age (the
second and third columns) for all dogs, you could assign .SDcols accordingly:
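The .SDcols call described here is also missing from the extraction. A sketch, again assuming a table named dogs with columns sex, weight, age and height:

```r
library(data.table)

# Assumed dog data (names are not confirmed by the source)
dogs <- data.table(sex    = c("M", "F", "F", "M", "M"),
                   weight = c(40, 30, 80, 90, 40),
                   age    = c(1, 4, 12, 3, 6),
                   height = c(12, 7, 9, 14, 12))

# Restrict .SD to the second and third columns (weight and age), all rows
avg_all <- dogs[, lapply(.SD, mean), .SDcols = 2:3]
print(avg_all)
```

.SDcols also accepts column names, so .SDcols = c("weight", "age") should be an equivalent and less position-dependent choice.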
While learning the data.table package, you may want to occasionally refer to
the documentation. Have a look at ?data.table for more info on .SDcols.
Yet another data.table, DT, has been prepared for you in your workspace. Start
by printing it to the console.
# Select all but the first row of groups 1 and 2, returning only the grp column
# and the Q columns
DT[, .SD[-1], by = grp, .SDcols = paste0("Q", 1:3)]
will return a data.table containing average weights and ages for dogs of each
sex.
It's also helpful to know that combining a list with a vector results in a new
longer list. Lastly, note that when you select .N on its own, it is renamed N in
the output for convenience when chaining.
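The points in this paragraph can be combined in one call. The sketch below assumes the same hypothetical dogs table as in the dog example (names are not confirmed by the source). It groups by sex, limits .SD to weight and age, and appends .N by combining a list with a vector:

```r
library(data.table)

# Assumed dog data
dogs <- data.table(sex    = c("M", "F", "F", "M", "M"),
                   weight = c(40, 30, 80, 90, 40),
                   age    = c(1, 4, 12, 3, 6),
                   height = c(12, 7, 9, 14, 12))

# c() joins the list returned by lapply() with the group size .N
per_sex <- dogs[, c(lapply(.SD, mean), .N), by = sex,
                .SDcols = c("weight", "age")]
print(per_sex)
```

The name given to the count column is chosen by data.table; the text above states that it is N.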
For this exercise, DT, which contains variables x, y, and z, is loaded in your
workspace. You must combine lapply(), .SD, .SDcols, and .N to get your call
to return a specific output. Good luck!
Get the sum of all columns x, y and z and the number of rows in each
group while grouping by x. Your answer should be identical to this:
   x x  y  z N
1: 2 8 26 30 4
2: 1 3 23 26 3
Get the cumulative sum of column x and y while grouping by x and z >
8 such that the answer looks like this:
   by1   by2 x  y
1:   2 FALSE 2  1
2:   2 FALSE 4  6
3:   1 FALSE 1  3
4:   1 FALSE 2 10
5:   2  TRUE 2  9
6:   2  TRUE 4 20
7:   1  TRUE 1 13
# DT is pre-loaded
DT
# Sum of all columns and the number of rows
DT[, c(lapply(.SD, sum), .N), by = x, .SDcols = 1:3]
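The second instruction (the cumulative sums) has no solution in the extracted text. The sketch below is one possible answer. The DT values are an assumption: they were chosen so that both printed target outputs above are reproduced, and are not confirmed by the source.

```r
library(data.table)

# Assumed contents of the pre-loaded DT
DT <- data.table(x = c(2, 1, 2, 1, 2, 2, 1),
                 y = c(1, 3, 5, 7, 9, 11, 13),
                 z = c(2, 4, 6, 8, 10, 12, 14))

# Cumulative sum of x and y, grouped by x and by whether z exceeds 8;
# naming the by expressions gives the by1 and by2 column headers
cum <- DT[, lapply(.SD, cumsum),
          by = .(by1 = x, by2 = z > 8),
          .SDcols = c("x", "y")]
print(cum)
```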
For example, the following line multiplies every row of column C by 10 and
stores the result in C:
DT[, C := C * 10]
This first exercise will thoroughly test your understanding of := used in the LHS
:= RHS form. It's time for you to show off your knowledge! A data.table DT has
been defined for you in the sample code.
# The data.table DT
DT <- data.table(A = letters[c(1, 1, 1, 2, 2)], B = 1:5)
# Add 1 to column B, but only in rows 2 and 4
DT[c(2, 4), B := B + 1L]
Great job, this was not that easy. Note that for the second instruction,
performance improves if you make the RHS an integer yourself, via 1L or
as.integer(), because B is an integer column.
Print DT to the console. When using the := operator in j, do you need to assign
the result to DT as follows?
Well done! The DT <- part is not necessary because the call makes updates to the
column by reference.
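A short sketch of the by-reference behaviour, using the DT from the sample code above. The update itself is an arbitrary example of mine:

```r
library(data.table)

DT <- data.table(A = letters[c(1, 1, 1, 2, 2)], B = 1:5)

# No `DT <-` in front: := modifies DT in place
DT[, B := B * 10L]
print(DT)
```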
Try deleting a column only for a subset of rows: DT[2, B := NULL]. Did this
work?
Well done! A column is either there or it's not. It makes no sense to partially delete
it. If you find yourself needing to do this, then consider using NAs instead. Rather
than silently ignoring the mistaken use of i, data.table throws an error
straight away so you can fix it.
Notice that the := is surrounded by two backticks! Otherwise R will
throw a syntax error. It is also important to note that in the generic functional
form above, my_fun() can refer to any function, including the basic arithmetic
functions. The nice thing about the functional form is that each LHS sits right
next to its RHS, so the code is easier to read.
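The generic functional form referred to above is missing from the extracted text. A sketch of what such a call can look like follows. The table and the new column names B_times_10 and A_upper are hypothetical:

```r
library(data.table)

DT <- data.table(A = letters[c(1, 1, 1, 2, 2)], B = 1:5)  # hypothetical table

# Functional form of :=, wrapped in backticks; each LHS = RHS pair adds
# or updates one column by reference
DT[, `:=`(B_times_10 = B * 10L,
          A_upper    = toupper(A))]
print(DT)
```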
Time for some experimentation. A data.table DT has been prepared for you in
the sample code.
# Delete my_cols
my_cols <- c("B", "C")
DT[, (my_cols) := NULL]
In the next two exercises, you will focus on using set() and its
siblings setnames() and setcolorder(). You are two exercises away from
becoming a data.table yeoman!
A data.table DT has been created for you in the workspace. Check it out!
Loop through columns 2, 3, and 4, and for each one, select 3 rows at
random and set the value of that column to NA.
Change the column names to lower case using the tolower() function.
When setnames() is passed a single input vector, that vector needs to
contain all the new names.
Print the resulting DT to the console to see what changed.
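The workspace DT for this exercise is not shown, so the sketch below invents one (four columns with upper-case names) to illustrate set() in a loop followed by setnames():

```r
library(data.table)

set.seed(1L)
# Hypothetical stand-in for the exercise's DT
DT <- data.table(A = 1:10, B = rnorm(10), C = rnorm(10), D = rnorm(10))

# For columns 2, 3 and 4: pick 3 random rows and set them to NA by reference
for (j in 2:4) {
  set(DT, i = sample(nrow(DT), 3), j = j, value = NA)
}

# A single vector passed to setnames() must contain all the new names
setnames(DT, tolower(names(DT)))
print(DT)
```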
First, add a suffix "_2" to all column names of DT. Use paste0() here.
Next, use setnames() to change a_2 to A2.
Lastly, reverse the order of the columns with setcolorder().
# Define DT
DT <- data.table(a = letters[c(1, 1, 1, 2, 2)], b = 1)
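One way to carry out the three renaming and reordering instructions, sketched with the DT defined above:

```r
library(data.table)

DT <- data.table(a = letters[c(1, 1, 1, 2, 2)], b = 1)

# Add the suffix "_2" to every column name
setnames(DT, paste0(names(DT), "_2"))

# Rename a single column: old name first, new name second
setnames(DT, "a_2", "A2")

# Reverse the column order by reference
setcolorder(DT, rev(names(DT)))
print(DT)
```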
3. Data.table expert
3.1. Selecting rows the data.table way
In the video, Matt showed you how to use column names in i to select certain
rows. Since practice makes perfect, and since you will find yourself selecting
rows over and over again, it'll be good to do a small exercise on this with the
familiar iris dataset.
Convert the iris dataset to a data.table and store the result as iris.
Select all the rows where Species is "virginica".
Select all the rows where Species is
either "virginica" or "versicolor".
# Species is "virginica"
iris[Species == "virginica"]
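For the instruction that asks for two species, %in% is one natural choice. A sketch (the result name two_species is my own):

```r
library(data.table)

# Store the data.table under the name iris, as the exercise asks
iris <- as.data.table(iris)

# Rows where Species is either "virginica" or "versicolor"
two_species <- iris[Species %in% c("virginica", "versicolor")]
print(two_species)
```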
Good job! Now you know how to select using column names in i (to select rows)
and in j (to select columns and run functions on columns).
# iris as a data.table
iris <- as.data.table(iris)
The iris data.table with the variable names you updated in the previous
exercise is available in your workspace.
Select the rows where the area is greater than 20 square centimeters.
Add a new boolean column containing Width * Length > 25 and call
it is_large. Remember that := can be used to create new columns.
Select the rows for which the value of is_large is TRUE.
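The renaming step from the previous exercise is not part of the extracted text. The sketch below therefore assumes that the sepal measurements ended up in columns called Length and Width, and uses a hypothetical table name flowers:

```r
library(data.table)

# Assumption: Sepal.Length and Sepal.Width were renamed to Length and Width
flowers <- as.data.table(iris)
setnames(flowers, c("Sepal.Length", "Sepal.Width"), c("Length", "Width"))

# Rows where the area exceeds 20 square centimeters
big_area <- flowers[Width * Length > 20]

# Add a logical column by reference, then filter on it
flowers[, is_large := Width * Length > 25]
large_rows <- flowers[is_large == TRUE]
print(large_rows)
```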
A data.table DT has already been created for you with the keys set to A and B.
The same keyed data.table from before, DT, has been provided in the sample
code.
# Keyed data.table DT
DT <- data.table(A = letters[c(2, 1, 2, 3, 1, 2, 3)],
                 B = c(5, 4, 1, 9, 8, 8, 6),
                 C = 6:12,
                 key = "A,B")
For the group where column A is equal to "b", print out the sequence
when column B is set equal to (-2):10. Remember, A and B are the keys
for this data.table.
Extend the code you wrote for the first instruction to roll the prevailing
values forward to replace the NAs.
Extend your code with the appropriate rollends value to roll the first
observation backwards.
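These three rolling-join instructions could be approached as sketched below, using the keyed DT from the sample code. The result names are my own.

```r
library(data.table)

DT <- data.table(A = letters[c(2, 1, 2, 3, 1, 2, 3)],
                 B = c(5, 4, 1, 9, 8, 8, 6),
                 C = 6:12,
                 key = "A,B")

# Keyed lookup: A == "b" and B in -2..10; unmatched B values give NA in C
plain <- DT[.("b", (-2):10)]

# roll = TRUE carries the last observed C forward into the gaps
rolled <- DT[.("b", (-2):10), roll = TRUE]

# rollends = TRUE additionally rolls the first observation backwards
both_ends <- DT[.("b", (-2):10), roll = TRUE, rollends = TRUE]
print(both_ends)
```

Only B = 1, 5 and 8 exist for group "b", so the plain lookup leaves 10 of the 13 rows with NA in C.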