
Data Structures

R Programming Cheat Sheet Vector

• Group of elements of the SAME type
data.frame while using single-square brackets, use
‘drop’: df1[, 'col1', drop = FALSE]
just the basics • R is a vectorized language, operations are applied to
each element of the vector automatically
• R has no concept of column vectors or row vectors What is a data.table
Created By: Arianne Colton and Sean Chen • Special vectors: letters and LETTERS, that contain • Extends and enhances the functionality of data.frames
lower-case and upper-case letters Differences: data.table vs. data.frame
Create Vector v1 <- c(1, 2, 3) • By default data.frame turns character data into factors,
General Manipulating Strings Get Length length(v1) while data.table does not
Check if All or Any is True all(v1); any(v1) • When you print data.frame data, all data prints to the
• R version 3.0 and greater adds support for 64 bit paste('string1', 'string2', sep Integer Indexing v1[1:3]; v1[c(1,6)]
console, with a data.table, it intelligently prints the first
integers = '/') and last five rows
Boolean Indexing v1[] <- 0
• R is case sensitive Putting # separator ('sep') is a space by default • Key Difference: Data.tables are fast because
Together c(first = 'a', ..)or
• R index starts from 1 paste(c('1', '2'), collapse = Naming they have an index like a database.
Strings names(v1) <- c('first', ..)
i.e., this search, dt1$col1 > number, does a
HELP # returns '1/2' Factor sequential scan (vector scan). After you create a key
stringr::str_split(string = v1, for this, it will be much faster via binary search.
• as.factor(v1) gets you the levels which is the
help(functionName) or ?functionName Split String pattern = '-')
number of unique values Create data.table from data.frame data.table(df1)
# returns a list
Help Home Page help.start()
Special Character Help help('[')
Search Help ??..
Search Function - with Partial Name apropos('mea')
See Example(s) example(topic)

Get Substring stringr::str_sub(string = v1, start = 1, end = 3)
Match String isJohnFound <- stringr::str_detect(string = df1$col1, pattern = 'John')
# returns TRUE/FALSE if John was found
df1[isJohnFound, c('col1', ...)]

• Factors can reduce the size of a variable because they only store unique values, but could be buggy if not used properly

List
Store any number of items of ANY type
Create List list1 <- list(first = 'a', ...)

Index by Column(s)* dt1[, 'col1', with = FALSE] or dt1[, list(col1)]
Show info for each data.table in memory (i.e., size, ...) tables()
Show Keys in data.table key(dt1)
Create index for col1 and reorder data according to col1 setkey(dt1, col1)
vector(mode = 'list', length dt1[c('col1Value1',
Objects in current environment Create Empty List = 3) Use Key to Select Data
'col1Value2'), ]
Get Element list1[[1]] or list1[['first']] Multiple Key Select dt1[J('1', c('2', '3')), ]
Display Object Name
Remove Object
objects() or ls()
rm(object1, object2,..)
Data Types Append Using
Numeric Index
list1[[6]] <- 2 dt1[, list(col1 =
mean(col1)), by =
Append Using Name list1[['newElement']] <- 2 col2]
Aggregation ** dt1[, list(col1 =
Notes: Check data type: class(variable)
mean(col1), col2Sum
Note: repeatedly appending to list, vector, data.frame
1. .name starting with a period are accessible but Four Basic Data Types etc. is expensive, it is best to create a list of a certain
= sum(col2)), by =
list(col3, col4)]
invisible, so they will not be found by ‘ls’ 1. Numeric - includes float/double, int, etc. size, then fill it.
2. To guarantee memory removal, use ‘gc’, releasing * Accessing columns must be done via list of actual
unused memory to the OS. R performs automatic ‘gc’
data.frame names, not as characters. If column names are
periodically 2. Character(string) • Each column is a variable, each row is an observation characters, then "with" argument should be set to
• Internally, each column is a vector FALSE.
nchar(variable) # length of a character or numeric
Symbol Name Environment • idata.frame is a data structure that creates a reference ** Aggregate and d*ply functions will work, but built-in
3. Date/POSIXct to a data.frame, therefore, no copying is performed aggregation functionality of data table is faster
• If multiple packages use the same function name the • Date: stores just a date. In numeric form, number
df1 <- data.frame(col1 = v1,
function that the package loaded the last will get called. of days since 1/1/1970 (see below). Create Data Frame col2 = v2, v3) Matrix
date1 <- as.Date('2012-06-28'), Dimension nrow(df1); ncol(df1); dim(df1) • Similar to data.frame except every element must be
• To avoid this precede the function with the name of the as.numeric(date1) Get/Set Column names(df1) the SAME type, most commonly all numerics
package. e.g. packageName::functionName(..) Names names(df1) <- c(...) • Functions that work with data.frame should work with
• POSIXct: stores a date and time. In numeric
form, number of seconds since 1/1/1970. Get/Set Row rownames(df1) matrix as well
Names rownames(df1) <- c(...)
Library date2 <- as.POSIXct('2012-06-28 18:00') Preview head(df1, n = 10); tail(...) Create Matrix matrix1 <- matrix(1:10, nrow = 5), # fills
rows 1 to 5, column 1 with 1:5, and column 2 with 6:10
Only trust reliable R packages i.e., 'ggplot2' for plotting, Get Data Type class(df1) # is data.frame Matrix matrix1 %*% t(matrix2)
'sp' for dealing spatial data, 'reshape2', 'survival', etc. Note: Use 'lubridate' and 'chron' packages to work df1['col1']or df1[1];† Multiplication # where t() is transpose
with Dates Index by Column(s) df1[c('col1', 'col3')] or
df1[c(1, 3)] Array
Load Package 4. Logical Index by Rows and df1[c(1, 3), 2:3] # returns data • Multidimensional vector of the SAME type
require(packageName) Columns from row 1 & 3, columns 2 to 3
Unload Package detach(package:packageName) • (TRUE = 1, FALSE = 0) • array1 <- array(1:12, dim = c(2, 3, 2))
• Use ==/!= to test equality and inequality † Index method: df1$col1 or df1[, 'col1'] or • Using arrays is not recommended
Note: require() returns the status(True/False) df1[, 1] returns as a vector. To return single column • Matrices are restricted to two dimensions while array
as.numeric(TRUE) => 1 can have any dimension
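The data structures above can be tried out with a short sketch. The names v1, list1, df1 and matrix1 follow the sheet; the values and the col2 column are made up for illustration only.

```r
# Vector: all elements share one type; arithmetic is applied element-wise
v1 <- c(1, 2, 3)
v1_doubled <- v1 * 2

# Data frame: each column is a vector (col2 is a hypothetical column)
df1 <- data.frame(col1 = v1, col2 = c('x', 'y', 'z'), stringsAsFactors = FALSE)
as_vector <- df1[, 'col1']                 # single column simplifies to a vector
as_df     <- df1[, 'col1', drop = FALSE]   # drop = FALSE keeps a data.frame

# Name the vector elements after the fact
names(v1) <- c('first', 'second', 'third')

# List: items of any type; create it at a known size, then fill it
list1 <- vector(mode = 'list', length = 3)
list1[[1]] <- 'a'
list1[['newElement']] <- 2                 # appending by name grows the list

# Matrix: values fill column 1 first (1:5), then column 2 (6:10)
matrix1 <- matrix(1:10, nrow = 5)
```

Comparing class(as_vector) with class(as_df) shows why the drop argument matters when code expects a data.frame back.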
Data Munging Functions and Controls Data Reshaping
Apply (apply, tapply, lapply, mapply) group_by(), sample_n() say_hello <- function(first,
Create Function last = 'hola') { } Rearrange
• Apply - most restrictive. Must be used on a matrix, all elements must be the same type
• If used on some other object, such as a data.frame, it will be converted to a matrix first
apply(matrix1, 1 - rows or 2 - columns, function to apply)
# if rows, then pass each row as input to the function
• By default, computation on NA (missing data) always returns NA, so if a matrix contains NAs, you can ignore them (use na.rm = TRUE in the apply(..), which doesn’t pass NAs to your function)

lapply
Applies a function to each element of a list and returns the results as a list

sapply
Same as lapply except return the results as a vector

Note: lapply & sapply can both take a vector as input, a vector is technically a form of list

• Chain functions
df1 %>% group_by(year, month) %>% select(col1, col2) %>% summarise(col1mean = mean(col1))
• Much faster than plyr, with four types of easy-to-use joins (inner, left, semi, anti)
• Abstracts the way data is stored so you can work with data frames, data tables, and remote databases with the same set of functions

Helper functions
each() - supply multiple functions to a function like aggregate
aggregate(price ~ cut, diamonds, each(mean, median))

Call Function say_hello(first = 'hello')
• R automatically returns the value of the last line of code in a function. This is bad practice. Use return() explicitly instead.
• - specify the name of a function either as string (i.e. 'mean') or as object (i.e. mean) and provide arguments as a list.
, args = list(first = '1st'))

if /else /else if /switch
                                             if { } else   ifelse
Works with Vectorized Argument               No            Yes
Most Efficient for Non-Vectorized Argument   Yes           No
Works with NA *                              No            Yes
Use &&, || **†                               Yes           No
Use &, | ***†                                No            Yes

Melt Data - from column to row
reshape2::melt(df1, id.vars = c('col1', 'col2'), variable.name = 'newCol1', value.name = 'newCol2')
Cast Data - from row to column
reshape2::dcast(df1, col1 + col2 ~ newCol1, value.var = 'newCol2')
If df1 has 3 more columns, col3 to col5, 'melting' creates a new df that has 3 rows for each combination of col1 and col2, with the values coming from the respective col3 to col5.

Combine (multiple sets into one)
1. cbind - bind by columns
data.frame from two vectors cbind(v1, v2)
data.frame combining df1 and df2 columns cbind(df1, df2)
2. rbind - similar to cbind but for rows, you can assign new column names to vectors in cbind
cbind(col1 = v1, ...)

Data
Load Data from CSV
Aggregate (SQL groupby) • Read csv * NA == 1 result is NA, thus if won’t work, it’ll be an
read.table(file = url or filepath, header = error. For ifelse, NA will return instead 3. Joins - (merge, join, data.table) using common keys
• aggregate(formulas, data, function)
TRUE, sep = ',') ** &&, || is best used in if, since it only compares the 3.1 Merge
• Formulas: y ~ x, y represents a variable that we first element of vector from each side
• “stringsAsFactors” argument defaults to TRUE, set it to • by.x and by.y specify the key columns used in the
want to make a calculation on, x represents one or FALSE to prevent converting columns to factors. This
more variables we want to group the calculation by *** &, | is necessary for ifelse, as it compares every join() operation
saves computation time and maintains character data element of vector from each side
• Can only use one function in aggregate(). To apply • Other useful arguments are "quote" and "colClasses", • Merge can be much slower than the alternatives
more than one function, use the plyr package specifying the character used for enclosing cells and † &&, || are similar to if in that they don’t work with
vectors, where ifelse, &, | work with vectors merge(x = df1, y = df2, by.x = c('col1',
In the example below diamonds is a data.frame; price, the data type for each column. 'col3'), by.y = c('col3', 'col6'))
cut, color etc. are columns of diamonds. • If cell separator has been used inside a cell, then use • Similar to C++/Java, for &, |, both sides of operator
read.csv2() or read.delim2() instead of read. 3.2 Join
aggregate(price ~ cut, diamonds, mean)
are always checked. For &&, ||, if left side fails, no • Join in plyr() package works similar to merge but
# get the average price of different cuts for the diamonds need to check the right side. much faster, drawback is key columns in each
aggregate(price ~ cut + color, diamonds, Database • } else, else must be on the same line as } table must have the same name
mean) # group by cut and color Connect to
aggregate(cbind(price, carat) ~ cut, Database
db1 <- RODBC::odbcConnect('conStr') • join() has an argument for specifying left, right,
diamonds, mean) # get the average price and average Query df1 <- RODBC::sqlQuery(db1, 'SELECT inner joins
carat of different cuts Database
..', stringsAsFactors = FALSE) Graphics join(x = df1, y = df2, by = c('col1',
Plyr ('split-apply-combine') Connection
RODBC::odbcClose(db1) 'col3'))

• ddply(), llply(), ldply(), etc. (1st letter = the type of • Only one connection may be open at a time. The Default basic graphic 3.3 data.table
input, 2nd = the type of output) connection automatically closes if R closes or another
connection is opened. hist(df1$col1, main = 'title', xlab = 'x dt1 <- data.table(df1, key = c('1',
• plyr can be slow, most of the functionality in plyr axis label')
can be accomplished using base function or other • If table name has space, use [ ] to surround the table '2')), dt2 <- ...‡
packages, but plyr is easier to use name in the SQL string. plot(col2 ~ col1, data = df1),
aka y ~ x or plot(x, y) • Left Join
ddply • which() in R is similar to ‘where’ in SQL
Takes a data.frame, splits it according to some Included Data lattice and ggplot2 (more popular) dt1[dt2]
variable(s), performs a desired action on it and returns a R and some packages come with data included.
data.frame • Initialize the object and add layers (points, lines, ‡ Data table join requires specifying the keys for the data
List Available Datasets data() histograms) using +, map variable in the data to an
List Available Datasets in data(package = tables
a Specific Package 'ggplot2')
axis or aesthetic using ‘aes’
• Can use this instead of lapply ggplot(data = df1) + geom_histogram(aes(x
• For sapply, can use laply (‘a’ is array/vector/matrix), Missing Data (NA and NULL) = col1)) Created by Arianne Colton and Sean Chen
however, laply result does not include the names. NULL is not missing, it’s nothingness. NULL is atomic
and cannot exist within a vector. If used inside a vector, it • Normalized histogram (pdf, not relative frequency
DPLYR (for data.frame ONLY) simply disappears. histogram) Based on content from
• Basic functions: filter(), slice(), arrange(), select(), ggplot(data = df1) + geom_density(aes(x = 'R for Everyone' by Jared Lander
Check Missing Data
rename(), distinct(), mutate(), summarise(), col1), fill = 'grey50')
Avoid Using is.null() Updated: December 2, 2015
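A small base-R sketch of the data munging ideas on this sheet: aggregate with a formula, merge with differently named keys, and how ifelse and c() treat NA and NULL. The sales and labels data frames and their columns are hypothetical stand-ins (the sheet itself uses the diamonds data from ggplot2).

```r
# Hypothetical data, for illustration only
sales <- data.frame(
  cut   = c('Fair', 'Good', 'Fair', 'Good'),
  price = c(10, 20, 30, 40),
  stringsAsFactors = FALSE
)

# Average price per cut (similar to SQL GROUP BY)
avg_price <- aggregate(price ~ cut, sales, mean)

# Merge on key columns that have different names in each table
labels <- data.frame(cut_name = c('Fair', 'Good'), rank = c(1, 2),
                     stringsAsFactors = FALSE)
merged <- merge(x = avg_price, y = labels, by.x = 'cut', by.y = 'cut_name')

# ifelse is vectorized and returns NA where the test is NA;
# if (NA == 1) would stop with an error instead
x <- c(1, NA, 3)
flags <- ifelse(x == 1, 'one', 'other')

# NULL disappears inside a vector, NA keeps its slot
with_null <- c(1, NULL, 3)
with_na   <- c(1, NA, 3)
```

Here is.na(with_na) locates the missing value, while with_null simply has one element fewer.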
R Programming Cheat Sheet Access any
Note: Every R package has two environments
associated with it (package and namespace).
advanced on the Every exported function is bound into the package
search list environment, but enclosed by the namespace
Created By: Arianne Colton and Sean Chen Find the
environment 3. Execution environment
where a pryr::where('func1')
name is • Each time a function is called, a new environment
defined is created to host execution. The parent of the
Environments execution environment is the enclosing environment
of the function.
Function environments • Once the function has completed, this environment
Environment Basics Search Path There are 4 environments for functions.
is thrown away.
What is an Environment? What is the Search Path?
1. Enclosing environment (used for lexical Note: Each execution environment has two
Data structure (that powers lexical scoping) is made up An R internal mechanism to look up objects, specifically, scoping) parents: a calling environment and an enclosing
of two components, the frame, which contains the name- functions. environment.
object bindings (and behaves much like a named list), • When a function is created, it gains a reference
and the parent environment. • Access with search(), which lists all parents of the to the environment where it was made. This is the
enclosing environment. • R’s regular scoping rules only use the enclosing
global environment. (See Figure 1) parent; parent.frame() allows you to access the
Named List • The enclosing environment belongs to the function, calling parent.
• It contains one environment for each attached and never changes, even if the function is moved
• You can think of an environment as a bag of names. package. to a different environment.
Each name points to an object stored elsewhere in 4. Calling environment
memory. • Objects in the search path environments can be • Every function has one and only one enclosing • This is the environment where the function was
found from the top-level interactive workspace. environment. For the three other types of called.
• If an object has no names pointing to it, it gets environment, there may be 0, 1, or many
environments associated with each function. • Looking up variables in the calling environment
automatically deleted by the garbage collector. rather than in the enclosing environment is called
• You can determine the enclosing environment of a dynamic scoping.
Parent Environment function by calling i.e. environment(func1)
• Dynamic scoping is primarily useful for developing
2. Binding environment functions that aid interactive data analysis.
• Every environment has a parent, another
environment. Only one environment doesn’t have a • The binding environments of a function are all the Binding names to values
parent: the empty environment. Figure 1. The Search Path environments which have a binding to it.
• The enclosing environment determines how the Assignment
• The parent is used to implement lexical scoping: if function finds values; the binding environments • Assignment is the act of binding (or rebinding) a
a name is not found in an environment, then R will • If you look for a name in a search, it will always start
from global environment first, then inside the latest determine how we find the function. name to a value in an environment.
look in its parent (and so on).
attached package. Name rules
Example for enclosing and binding environment • A complete list of reserved words can be found in
Environments can also be useful data structures in their
own right because they have reference semantics. If there are functions with the same name in two y <- 1 ?Reserved.
different packages, the latest package will get called. e <- new.env()
Regular assignment arrow, <-
Four special environments • Each time you load a new package with library()/ e$g <- function(x) x + y • The regular assignment arrow always creates a
require() it is inserted between the global # function g enclosing environment is the global variable in the current environment.
1. Global environment, access with globalenv(), environment, and the binding environment is “e”.
environment and the package that was previously at Deep assignment arrow, <<-
is the interactive workspace. This is the environment the top of the search path.
in which you normally work. • The deep assignment arrow modifies an
existing variable found by walking up the parent
The parent of the global environment is the last search() : environments. If <<- doesn’t find an existing variable,
package that you attached with library() or it will create one in the global environment. This
'.GlobalEnv' ... 'Autoloads' 'package:base' is usually undesirable, because global variables
library(reshape2); search() introduce non-obvious dependencies between
2. Base environment, access with baseenv(), is '.GlobalEnv' 'package:reshape2' ...
the environment of the base package. Its parent is the
empty environment.
'Autoloads' 'package:base' Environment Creation

3. Empty environment, access with emptyenv(), • To create an environment manually, use new.env().
is the ultimate ancestor of all environments, and Note: There is a special environment called Autoloads You can list the bindings in the environment’s frame
the only environment without a parent. Empty which is used to save memory by only loading package with ls() and see its parent with parent.env().
environments contain nothing. objects (like big datasets) when needed. • When creating your own environment, note that you
Figure 2. Function Environment should set its parent environment to be the empty
4. Current environment, access with environment. This ensures you don’t accidentally
environment() inherit objects from somewhere else.
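The environment notes above can be explored with a minimal sketch. The names e, g and y come from the sheet's own example; isolated, make_counter and counter are hypothetical names added for illustration.

```r
y <- 1
e <- new.env()
e$g <- function(x) x + y       # g is bound in e, but encloses the environment where it was created

enclosing <- environment(e$g)  # the enclosing environment, not e
g_result  <- e$g(1)            # y is found through the enclosing environment

# An environment whose parent is the empty environment inherits nothing
isolated <- new.env(parent = emptyenv())
assign('a', 10, envir = isolated)
isolated_names <- ls(isolated)

# <<- modifies an existing variable found by walking up the parent environments
make_counter <- function() {
  count <- 0
  function() {
    count <<- count + 1        # updates count in the enclosing environment
    count
  }
}
counter <- make_counter()
first  <- counter()
second <- counter()
```

Because count lives in the enclosing environment of the inner function, its value persists between calls without creating a global variable.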
Functions Data Structures
Function Basics i <- function(a, b) {
• You can also create infix functions where the function 2. Class: represents the object’s abstract type;
name comes in between its arguments, like + or -.
The most important thing to understand about R is that missing(a) -> # return true or false • ‘type’ of the object from R’s object-oriented
functions are objects in their own right. } • All user-created infix functions must start and end programming point of view
with %.
• Access with class()

All R functions have three parts:
body()          code inside the function
formals()       list of arguments which controls how you can call the function
environment()   “map” of the location of the function’s variables (see “Enclosing Environment”)

• By default, R function arguments are lazy -- they're only evaluated if they're actually used
f <- function(x) {
10
}
f(stop('This is an error!')) -> 10
However, since x is not used, stop("This is an error!") never gets evaluated.

`%+%` <- function(a, b) paste0(a, b)
'new' %+% 'string'
• Useful way of providing a default value in case the output of another function is NULL:
`%||%` <- function(a, b) if (!is.null(a)) a else b
function_that_might_return_null() %||% default value

                               typeof()            class()
strings or vector of strings   character           character
numbers or vector of numbers   double or integer   numeric or integer
list                           list                list
data.frame*                    list                data.frame

• When you print(func1) a function in R, it shows
you these three important components. If the • Default arguments are evaluated inside the function.
environment isn't displayed, it means that the function This means that if the expression depends on the * Internally, data.frame is a list of equal-length vectors.
was created in the global environment. current environment the results will differ depending Replacement functions
• Like all objects in R, functions can also possess any on whether you use the default value or explicitly
number of additional attributes(). provide one: • Act like they modify their arguments in place, and 1d (vectors: atomic vector and list)
have the special name xxx <-
Every operation is a function call f <- function(x = ls()) { • They typically have two arguments (x and value), • Use is.atomic() || is.list() to test if an object
a <- 1 is actually a vector, not is.vector().
• Everything that exists is an object x although they can have more, and they must return
• Everything that happens in R is a function call, even if } the modified object.
Type typeof() what it is
it doesn’t look like it. (i.e. +, for, if, [, $, { ...) `second<-` <- function(x, value) {
f() -> 'a' 'x' ls() evaluated inside f x[2] <- value Length length() how many elements
Note: the backtick (`), lets you refer to functions or x
f(ls()) ls() evaluated in global environment } Attributes attributes() additional arbitrary metadata
variables that have otherwise reserved or illegal names: x <- 1:10
e.g. x + y is the same as `+`(x, y) second(x) <- 5L
Return values Factors
Lexical scoping • I say they "act" like they modify their arguments in • Factors are built on top of integer vectors using two
• The last expression evaluated in a function becomes place, because they actually create a modified copy. attributes :
the return value, the result of invoking the function.
What is Lexical Scoping? • We can see that by using pryr::address() to find class(x) -> 'factor'
• Only use explicit return() for when you are
Looks up value of a symbol. (See "Enclosing returning early, such as for an error. the memory address of the underlying object. levels(x) # defines the set of allowed values
Environment" in the "Environment" section.) • Functions can return only a single object. But this
• findGlobals() # lists all the external dependencies of a is not a limitation because you can return a list • While factors look (and often behave) like character
function containing any number of objects. vectors, they are actually integers. Be careful when
treating them like strings.
• Functions can return invisible values, which are not
f <- function() x + 1
printed out by default when you call the function. • Factors are useful when you know the possible
> '+' 'x' f1 <- function() 1 Data Structures values a variable may take, even if you don’t see all
values in a given dataset.
f2 <- function() invisible(1)
environment(f) <- emptyenv() • Most data loading functions in R automatically
f() convert character vectors to factors, use the
• The most common function that returns invisibly is <- Homogeneous Heterogeneous
# error in f(): could not find function “+” * argument stringsAsFactors = FALSE to suppress
Primitive functions 1d Atomic vector List this behavior.

* This doesn’t work because R relies on lexical scoping • There is one exception to the rule that functions 2d Matrix Data frame Attributes
to find everything, even the + operator. It’s never have three components. • All objects can have arbitrary additional attributes.
nd Array
possible to make a function completely self-contained • Primitive functions, like sum(), call C code directly
because you must always rely on functions defined in • Attributes can be accessed individually with attr() or
with .Primitive() and contain no R code. Note: R has no 0-dimensional or scalar types. Individual all at once (as a list) with attributes().
base R or other packages.
• Therefore their formals(), body(), and environment() numbers or strings, are actually vectors of length one,
are all NULL: attr(v1, 'attr1') <- 'my vector'
NOT scalars.
Function arguments sum : function (..., na.rm = FALSE) • By default, most attributes are lost when modifying a
Human readable description of any R data structure:
.Primitive('sum') vector. The only attributes not lost are the three most
When calling a function you can specify arguments by str(variable) important:
position, by complete name, or by partial name. • Primitive functions are only found in the base
Arguments are matched first by exact name (perfect package, and since they operate at a low level, they Names
a character vector giving names(x)
matching), then by prefix matching, and finally by position. can be more efficient. Every Object has a mode and a class each element a name
1. Mode: represents how an object is stored in memory; used to turn vectors into
• Function arguments are passed by reference and Infix functions Dimensions
matrices and arrays
copied on modify. • ‘type’ of the object from R’s point of view
• Most functions in R are ‘prefix’ operators: the name used to implement the S3
• You can determine if an argument was supplied or not of the function comes before the arguments. • Access with typeof() Class
object system
with the missing() function.
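The function features on this page (lazy arguments, missing(), infix functions, replacement functions) fit in one runnable sketch. The function bodies follow the sheet's snippets; the result variable names are added for illustration.

```r
# Lazy evaluation: x is never used, so stop() is never evaluated
f <- function(x) {
  10
}
lazy_result <- f(stop('This is an error!'))

# missing() reports whether an argument was supplied
i <- function(a, b) {
  missing(a)
}
a_missing  <- i()
a_supplied <- i(1)

# Infix functions: user-defined names must start and end with %
`%+%` <- function(a, b) paste0(a, b)
joined <- 'new' %+% 'string'

# Default value when the left-hand side is NULL
`%||%` <- function(a, b) if (!is.null(a)) a else b
fallback <- NULL %||% 'default value'

# Replacement function: has the special name xxx<- and returns the modified object
`second<-` <- function(x, value) {
  x[2] <- value
  x
}
x <- 1:10
second(x) <- 5L
```

After the last line x[2] holds 5L; R rewrites second(x) <- 5L as x <- `second<-`(x, 5L), which is why the function must return the modified object.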
Subsetting (operators: [, [[, $)
Simplifying vs. preserving subsetting Subsetting returns a copy of the original Examples Set individual columns to df1$col3 <- NULL
data, NOT copy-on-modified. NULL
• Simplifying subsetting returns the simplest 1. Lookup tables (character subsetting)
Subset to return only the df1[c('col1',
possible data structure that can represent the output. Out of Bounds Character matching provides a powerful way to make columns you want 'col2')]
• Preserving subsetting keeps the structure of the lookup tables.
output the same as the input. • [ and [[ differ slightly in their behavior when the index 5. Selecting rows based on a condition
is out of bounds (OOB). x <- c('m', 'f', 'u', 'f', 'f', 'm', 'm')
lookup <- c(m = 'Male', f = 'Female', u = NA) (logical subsetting)
Simplifying* Preserving • For example, when you try to extract the fifth element
of a length four vector, aka OOB x[5] -> NA, or lookup[x] • Logical subsetting is probably the most commonly
Vector x[[1]] x[1] >m f u f f m m used technique for extracting rows out of a data
subset a vector with NA or NULL: x[NULL] -> x[0]
List x[[1]] x[1] > 'Male' 'Female' NA 'Female' 'Female' 'Male' 'Male' frame.
Operator Index Atomic List unname(lookup[x]) df1[df1$col1 == 5 & df1$col2 == 4, ]
Factor x[1:4, drop = T] x[1:4]
> 'Male' 'Female' NA 'Female' 'Female' 'Male' 'Male'
x[1, , drop = F] or [ OOB NA list(NULL) • Remember to use the vector boolean operators &
Array x[1, ] or x[, 1]
x[, 1, drop = F]
[ NA_real_ NA list(NULL) 2. Matching and merging by hand and |, not the short-circuiting scalar operators &&
x[, 1, drop = F] (integer subsetting) and || which are more useful inside if statements.
Data frame x[, 1] or x[[1]] or x[1] [ NULL x[0] list(NULL) Lookup table which has multiple columns of • subset() is a specialised shorthand function for
• When you use drop = FALSE, it's preserving. [[ OOB Error Error information. subsetting data frames, and saves some typing
• Omitting drop = FALSE when subsetting matrices grades <- c(1, 2, 2, 3, 1)
because you don't need to repeat the name of the
[[ NA_real_ Error NULL
and data frames is one of the most common sources info <- data.frame(
data frame.
of programming errors. [[ NULL Error Error grade = 3:1, subset(df1, col1 == 5 & col2 == 4)
• [[ is similar to [, except it can only return a single • If the input vector is named, then the names of OOB, desc = c('Excellent', 'Good', 'Poor'),
value and it allows you to pull pieces out of a list. missing, or NULL components will be "<NA>". fail = c(F, F, T) Boolean algebra vs. sets
) (logical & integer subsetting)
* Simplifying behavior varies slightly between different $ Subsetting Operator First method : • It's useful to be aware of the natural equivalence
data types: • $ is a useful shorthand for [[ combined with between set operations (integer subsetting) and
id <- match(grades, info$grade)
character subsetting: boolean algebra (logical subsetting).
• Atomic Vector: x[[1]] is the same as x[1]. info[id, ]
• Using set operations is more effective when:
• List: [ ] always returns a list, to get the contents use x$y is equivalent to x[['y', exact = FALSE]]
[[ ]]. Second method : »» You want to find the first (or last) TRUE.
• Factor: drops any unused levels but it remains a • One common mistake with $ is to try and use it when rownames(info) <- info$grade »» You have very few TRUEs and very many
factor class. you have the name of a column stored in a variable: info[as.character(grades), ] FALSEs; a set representation may be faster and
• Matrix or array: if any of the dimensions has var <- 'cyl' • If you have multiple columns to match on, you’ll require less storage.
length 1, drops that dimension. x$var
need to first collapse them to a single column (with • which() allows you to convert a boolean
• Data.frame is similar, if output is a single column, # doesn't work, translated to x[['var']] representation to an integer representation. There’s
interaction(), paste(), or plyr::id()).
it returns a vector instead of a data frame. # Instead use x[[var]] no reverse operation in base R.
• You can also use merge() or plyr::join(), which
• There's one important difference between $ and [[, do the same thing for you. which(c(T, F, T, F)) -> 1 3
Data.frames subsetting $ does partial matching, [[ does not:
# returns the index of the true*
• Data frames possess the characteristics of x <- list(abc = 1)
3. Expanding aggregated counts
x$a -> 1 # since "exact = FALSE" (integer subsetting)
both lists and matrices. If you subset with a
single vector, they behave like lists; if you subset with
x[['a']] -> NULL # [[ does no partial matching • Sometimes you get a data frame where identical * The integer representation length is always <=
two vectors, they behave like matrices. rows have been collapsed into one and a count boolean representation length.
Subsetting with Assignment column has been added.
List Subsetting df1[c('col1', 'col2')] • rep() and integer subsetting make it easy to • When first learning subsetting, a common mistake is
• All subsetting operators can be combined with uncollapse the data by subsetting with a repeated to use x[which(y)] instead of x[y].
Matrix Subsetting df1[, c('col1', 'col2')]
assignment to modify selected values of the input row index: rep(x, y) • Here the which() achieves nothing, it switches from
The subsetting results are the same in this example. vector. • rep replicates the values in x, y times. logical to integer subsetting but the result will be
• Subsetting with nothing can be useful in conjunction exactly the same.
• Single column subsetting: matrix subsetting with assignment because it will preserve the original df1$countCol is c(3, 5, 1)
simplifies by default, list subsetting does not. • Also beware that x[-which(y)] is not equivalent
object class and structure. rep(1:nrow(df1), df1$countCol)
to x[!y]. If y is all FALSE, which(y) will be
>111222223 integer(0) and -integer(0) is still integer(0),
str(df1[, 'col1']) -> int [1:3] df1[] <- lapply(df1, as.integer)
# the result is a vector # df1 will remain as a data frame so you’ll get no values, instead of all values.
4. Removing columns from data frames
(character subsetting) • In general, avoid switching from logical to integer
str(df1['col1']) -> 'data.frame' df1 <- lapply(df1, as.integer)
subsetting unless you want, for example, the first or
# the result remains a data frame of 1 column # df1 will become a list There are two ways to remove columns from a data last TRUE value.
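The subsetting pitfalls noted above can be checked with a short sketch. The objects x, df1, v and y below are made-up illustration data (df1 here is not the df1 used elsewhere in the sheet), and the example is only meant to show the behaviours described, not a recommended workflow.

```r
# Illustration data (hypothetical, not from the cheat sheet)
x <- list(abc = 1)
df1 <- data.frame(col1 = c("a", "b", "c"), countCol = c(3, 5, 1))
v <- c(10, 20, 30)
y <- c(FALSE, FALSE, FALSE)

# $ does partial matching, [[ does not
partial <- x$a        # matches the element named "abc"
exact <- x[["a"]]     # no element is named exactly "a"

# Expanding aggregated counts: repeat each row index countCol times
idx <- rep(seq_len(nrow(df1)), df1$countCol)
expanded <- df1[idx, ]

# x[-which(y)] is not equivalent to x[!y] when y is all FALSE
dropped_all <- v[-which(y)]   # -integer(0) selects nothing
kept_all <- v[!y]             # !y is all TRUE, so everything is kept

# drop = FALSE keeps a single column as a data frame
one_col <- df1[, "col1", drop = FALSE]
```

Inspecting nrow(expanded), length(dropped_all) and class(one_col) after running the block shows the difference between each pair of approaches.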
Debugging, condition handling, & defensive programming

Debugging
Use traceback() and browser(), and interactive tools in RStudio:
• RStudio's error inspector or traceback(), which list the sequence of calls that lead to the error.
• RStudio's breakpoints or browser(), which open an interactive debug session at an arbitrary location in the code.
• RStudio's "Rerun with Debug" tool or options(error = browser)*, which open an interactive debug session where the error occurred.

* There are two other useful functions that you can use with the error option:
1. recover is a step up from browser, as it allows you to enter the environment of any of the calls in the call stack. This is useful because often the root cause of the error is a number of calls back.
2. dump.frames is an equivalent to recover for non-interactive code. It creates a last.dump.rda file in the current working directory. Then, in a later interactive R session, you load that file, and use debugger() to enter an interactive debugger with the same interface as recover(). This allows interactive debugging of batch code.

# In batch R process ----
dump_and_quit <- function() {
  # Save debugging info to file last.dump.rda
  dump.frames(to.file = TRUE)
  # Quit R with error status
  q(status = 1)
}
options(error = dump_and_quit)

# In a later interactive session ----
load("last.dump.rda")
debugger()

Condition handling (of expected errors)
1. Communicating potential problems to the user is the job of conditions: errors, warnings, and messages:
• Fatal errors are raised by stop() and force all execution to terminate. Errors are used when there is no way for a function to continue.
• Warnings are generated by warning() and are used to display potential problems, such as when some elements of a vectorised input are invalid.
• Messages are generated by message() and are used to give informative output in a way that can easily be suppressed by the user using suppressMessages().

2. Handling conditions programmatically:
• try() gives you the ability to continue execution even when an error occurs.
• tryCatch() lets you specify handler functions that control what happens when a condition is signaled.

result = tryCatch(code,
  error = function(c) "error",
  warning = function(c) "warning",
  message = function(c) "message"
)

Use conditionMessage(c) or c$message to extract the message associated with the original error.
• You can also capture the output of the try() and tryCatch() functions. If successful, it will be the last result evaluated in the block, just like a function. If unsuccessful, try() returns an invisible object of class "try-error".

3. Custom signal classes:
• One of the challenges of error handling in R is that most functions just call stop() with a string.
• Since conditions are S3 classes, the solution is to define your own classes if you want to distinguish different types of error.
• Each condition signalling function, stop(), warning(), and message(), can be given either a list of strings, or a custom S3 condition object.

Defensive Programming
The basic principle of defensive programming is to "fail fast", to raise an error as soon as something goes wrong.
In R, this takes three particular forms:
1. Checking that inputs are correct using stopifnot(), the 'assertthat' package, or simple if statements and stop()
2. Avoiding non-standard evaluation like subset(), transform(), and with(). These functions save time when used interactively, but because they make assumptions to reduce typing, when they fail, they often fail with uninformative error messages.
3. Avoiding functions that can return different types of output. The two biggest offenders are [ and sapply().
Note: Whenever subsetting a data frame in a function, you should always use drop = FALSE

Object Oriented (OO) Field Guide

Object Oriented Systems
R has three object oriented systems (plus the base types)
1. S3 is a very casual system. It has no formal definition of classes. S3 implements a style of OO programming called generic-function OO.
• Generic-function OO - a special type of function called a generic function decides which method to call.
Example: drawRect(canvas, 'blue')
Language: R
• Message-passing OO - messages (methods) are sent to objects and the object determines which function to call.
Example: canvas.drawRect('blue')
Language: Java, C++, and C#
2. S4 works similarly to S3, but is more formal. There are two major differences to S3.
• S4 has formal class definitions, which describe the representation and inheritance for each class, and has special helper functions for defining generics and methods.
• S4 also has multiple dispatch, which means that generic functions can pick methods based on the class of any number of arguments, not just one.
3. Reference classes, called RC for short, are quite different from S3 and S4.
• RC implements message-passing OO, so methods belong to classes, not functions.
• $ is used to separate objects and methods, so method calls look like canvas$drawRect('blue').

C Structure
• Underlying every R object is a C structure (or struct) that describes how that object is stored in memory.
• The struct includes the contents of the object, the information needed for memory management and a type.

typeof() # determines an object's base type

• The "Data structures" section explains the most common base types (atomic vectors and lists), but base types also encompass functions, environments, and other more exotic objects like names, calls, and promises.
• To see if an object is a pure base type (i.e., it doesn't also have S3, S4, or RC behavior), check that is.object(x) returns FALSE.

S3
• S3 is R's first and simplest OO system. It is the only OO system used in the base and stats packages.
• In S3, methods belong to functions, called generic functions, or generics for short. S3 methods do not belong to objects or classes.
• Given a class, the job of an S3 generic is to call the right S3 method. You can recognise S3 methods by their names, which look like generic.class().
For example, the Date method for the mean() generic is called mean.Date()
This is the reason that most modern style guides discourage the use of . in function names: it makes them look like S3 methods.
• See all methods that belong to a generic:

methods('mean')
#> mean.Date
#> mean.default
#> mean.difftime

• List all generics that have a method for a given class:

methods(class = 'Date')

• S3 objects are usually built on top of lists, or atomic vectors with attributes. Factor and data frame are S3 classes.

Check if an object is a S3 object          is.object(x) & !isS4(x) or pryr::otype()
Check if inherits from a specific class    inherits(x, 'classname')
Determine class of any object              class(x)

Created by Arianne Colton and Sean Chen
Based on content from "Advanced R" by Hadley Wickham
Updated: January 15, 2016
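The "Custom signal classes" and tryCatch() notes above can be combined into one small sketch. The helper my_condition(), the function parse_positive(), and the class names invalid_type and invalid_value are hypothetical names introduced only for this illustration; this is one possible way to build a custom S3 condition object, not the only one.

```r
# Hypothetical helper: build a custom S3 condition object.
# A condition is a list with message and call, plus a class vector
# that ends in "condition".
my_condition <- function(msg, class) {
  structure(
    class = c(class, "condition"),
    list(message = msg, call = NULL)
  )
}

# Hypothetical function that "fails fast" with classed errors
parse_positive <- function(x) {
  if (!is.numeric(x)) {
    stop(my_condition("x must be numeric", c("invalid_type", "error")))
  }
  if (any(x <= 0)) {
    stop(my_condition("x must be positive", c("invalid_value", "error")))
  }
  x
}

# tryCatch() picks the handler whose name matches the condition's class
handle <- function(x) {
  tryCatch(parse_positive(x),
    invalid_type = function(c) paste("type problem:", conditionMessage(c)),
    invalid_value = function(c) paste("value problem:", conditionMessage(c))
  )
}

r1 <- handle("a")   # handled by the invalid_type handler
r2 <- handle(-1)    # handled by the invalid_value handler
r3 <- handle(2)     # no condition signalled, value returned unchanged
```

Because each error carries its own class, a caller can handle the two failure modes separately instead of parsing the message string that a plain stop("...") would produce.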