
# Vocabulary

The following functions outline my expectation for a working vocabulary of R functions. I don't expect you to be intimately familiar with the details of every function, but you should at least be aware that they all exist, so that if you do encounter a problem that requires a special tool, you already know about it. The functions come from the base and stats packages, but include a few pointers to other packages and important options.

To get the most out of this book, you should be familiar with "the basics". For data analysis, you also need I/O and special data. My current favourite reference for this level of introduction to R is the R Cookbook by Paul Teetor.

## The basics

    # The first functions to learn
    ?
    str

    # Operators
    %in%, match
    =, <-, <<-, assign
    $, [, [[, replace, head, tail, subset
    with
    within

    # Comparison
    all.equal, identical
    !=, ==, >, >=, <, <=
    is.na, is.nan, is.finite
    complete.cases

    # Basic math
    *, +, -, /, ^, %%, %/%
    abs, sign
    acos, acosh, asin, asinh, atan, atan2, atanh
    sin, sinh, cos, cosh, tan, tanh
    ceiling, floor, round, trunc, signif
    exp, log, log10, log1p, log2, logb, sqrt
    cummax, cummin, cumprod, cumsum, diff
    max, min, prod, sum
    range
    mean, median, cor, cov, sd, var
    pmax, pmin
    rle

    # Functions
    function
    missing
    on.exit
    return, invisible

    # Logical & sets
    &, |, !, xor
    all, any
    intersect, union, setdiff, setequal
    which

    # Vectors and matrices
    c, matrix
    # automatic coercion rules character > numeric > logical
    length, dim, ncol, nrow
    cbind, rbind
    names, colnames, rownames
    t
    diag
    sweep
    as.matrix, data.matrix

    # Making vectors
    c, rep, seq, seq_along, seq_len
    rev
    sample
    choose, factorial, combn
    (is/as).(character/numeric/logical)

    # Lists & data.frames
    list, unlist
    data.frame, as.data.frame
    split
    expand.grid

    # Control flow
    if, &&, || (short circuiting)
    for, while
    next, break
    switch
    ifelse

## Statistics

    # Linear models
    fitted, predict, resid, rstandard
    lm, glm
    hat, influence.measures
    logLik, df, deviance
    formula, ~, I
    anova, coef, confint, vcov
    contrasts

    # Miscellaneous tests
    apropos("\\.test$")

    # Random variables
    beta, binom, cauchy, chisq, exp, f, gamma, geom, hyper, lnorm, logis, multinom, nbinom, norm, pois, signrank, t, unif, weibull, wilcox, birthday, tukey

    # Matrix algebra
    crossprod, tcrossprod
    eigen, qr, svd
    %*%, %o%, outer
    rcond
    solve

    # Ordering and tabulating
    duplicated, unique
    merge
    order, rank, quantile
    sort
    table, ftable

## Working with R

    # Workspace
    ls, exists, get, rm
    getwd, setwd
    q
    source
    install.packages, library, require

    # Help
    help, ?
    help.search
    apropos
    RSiteSearch
    citation
    demo
    example
    vignette

    # Debugging
    traceback
    browser
    recover
    options(error = )
    stop, warning, message
    tryCatch, try

## I/O

    # Output
    print, cat
    message, warning
    dput
    format
    sink

    # Reading and writing data
    data
    count.fields
    read.csv, read.delim, read.fwf, read.table
    library(foreign)
    write.table
    readLines, writeLines
    load, save
    readRDS, saveRDS

    # Files and directories
    dir
    basename, dirname, file.path, path.expand
    file.choose
    file.copy, file.create, file.remove, file.rename, dir.create
    file.exists
    tempdir, tempfile
    download.file

## Special data

    # Date time
    ISOdate, ISOdatetime, strftime, strptime, date
    difftime
    julian, months, quarters, weekdays
    library(lubridate)

    # Character manipulation
    grep, agrep
    gsub
    strsplit
    chartr
    nchar
    tolower, toupper
    substr
    paste
    library(stringr)

    # Factors
    factor, levels, nlevels
    reorder, relevel
    cut, findInterval
    interaction
    options(stringsAsFactors = FALSE)

    # Array manipulation
    array
    dim
    dimnames
    aperm
    library(abind)

# Introduction

This book has grown out of over 10 years of programming in R, and constantly struggling to understand the best way of doing things. I would particularly like to thank the tireless contributors to R-help. There are too many who have helped me over the years to list individually, but I'd particularly like to thank Luke Tierney, John Chambers, and Brian Ripley for correcting countless of my misunderstandings and helping me to deeply understand R.

R is still a relatively young language, and the resources to help you understand it are still maturing. In my personal journey to understand R, I've found it particularly helpful to refer to resources that describe how other programming languages work. R has aspects of both functional and object-oriented (OO) programming languages, and learning how these aspects are expressed in R will help you translate your existing knowledge from other programming languages, and help you identify areas where you can improve.

Functional

* First class functions
* Pure functions: a goal, not a prerequisite
* Recursion: no tail call elimination. Slow
* Lazy evaluation: but only of function arguments. No infinite streams
* Untyped

OO

* Has three distinct OO frameworks built in to base. And more available in add-on packages.
* Two of the OO styles are built around generic functions, a style of OO that comes from lisp.

I found the following two books particularly helpful:

* The structure and interpretation of computer programs by Harold Abelson and Gerald Jay Sussman.
* Concepts, Techniques and Models of Computer Programming by Peter van Roy and Seif Haridi

It's also very useful to learn a little about lisp, because many of the ideas in R are adapted from lisp, and there are often good descriptions of the basic ideas, even if the implementation differs somewhat. Part of the purpose of this book is so that you don't have to consult these original sources, but if you want to learn more, this is a great way to develop a deeper understanding of how R works.

Other websites that helped me to understand smaller pieces of R are:

* Getting Started with Dylan for understanding S4
* Frames, Environments, and Scope in R and S-PLUS. Section 2 is recommended as a good introduction to the formal vocabulary used in much of the R documentation.
* Lexical scope and statistical computing gives more examples of the power and utility of closures.

## What you will get out of this book

This book describes the skills that I think you need to be an advanced R developer, producing reproducible code that can be used in a wide variety of circumstances. After reading this book, you will be:

* Familiar with the fundamentals of R, so that you can represent complex data types and simplify the operations performed on them. You will have a deep understanding of the language, and know how to override default behaviours when necessary.
* Able to produce packages to make your work available to a wider audience, and to program efficiently "in the large", so you spend your time solving new problems, not struggling with old code.
* Comfortable reading and understanding the majority of R code. This is important so that you can learn from and critique others' code.

## Who should read this book

* Experienced programmers from other languages who want to learn about the features of R that make it special.
* Existing package developers who want to make it less work.
* R developers who want to take it to the next level, and who are ready to release their own code into the wild.

To get the most out of this book you should already be familiar with the basics of R as described in the next section, and you should have started developing your R vocabulary.

# Basics

## Data structures

The basic data structure in R is the vector, which comes in two basic flavours: atomic vectors and lists. Atomic vectors are logical, integer, numeric, character and raw. Common vector properties are mode, length and names:

    x <- 1:10
    mode(x)
    length(x)
    names(x)
    names(x) <- letters[1:10]
    x
    names(x)

Lists are different from atomic vectors in that they can contain any other type of vector. This makes them recursive, because a list can contain other lists.

    x <- list(list(list(list())))
    x
    str(x)

`str` is one of the most important functions in R: it gives a human readable description of any R data structure.

Vectors can be extended into multiple dimensions. If 2d they are called matrices, if more than 2d they are called arrays. Length generalises to `nrow` and `ncol` for matrices, and `dim` for arrays. Names generalises to `rownames` and `colnames` for matrices, and `dimnames` for arrays.

    y <- matrix(1:20, nrow = 4, ncol = 5)
    # 24 values fill a 2 x 3 x 4 array exactly (no recycling)
    z <- array(1:24, dim = c(2, 3, 4))
    nrow(y)
    rownames(y)
    ncol(y)
    colnames(y)
    dim(z)
    dimnames(z)

All vectors can also have additional arbitrary attributes. These can be thought of as a named list (although the names must be unique), and can be accessed individually with `attr` or all at once with `attributes`. `structure` returns a new object with modified attributes.

Another extremely important data structure is the data.frame. A data frame is a named list with a restriction that all elements must be vectors of the same length. Each element in the list represents a column, which means that each column must be one type, but a row may contain values of different types.
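As a small illustration of the two ideas above, the sketch below attaches attributes to a vector and then inspects a data frame as a list of columns. The attribute names `units` and `source` and the data frame `df` are made up for this example.

```r
# Attributes behave like a named list attached to the object
v <- structure(1:3, units = "cm")   # `units` is a hypothetical attribute name
attr(v, "units")                    # read one attribute
attr(v, "source") <- "ruler"        # add another one
names(attributes(v))                # all attributes at once

# A data frame is a list of equal-length vectors, one per column
df <- data.frame(id = 1:3, label = c("a", "b", "c"))
is.list(df)
length(df)                          # the number of columns
df[2, ]                             # a single row can mix types
```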

## Subsetting

Three subsetting operators. Five types of subsetting. Extensions to more than 1d.

All basic data structures can be teased apart using the subsetting operators: `[`, `[[` and `$`. It's easiest to explain subsetting for 1d first, and then show how it generalises to higher dimensions. You can subset by 5 different things:

* blank: return everything
* positive integers: return elements at those positions
* negative integers: return all elements except at those positions
* character vector: return elements with matching names
* logical vector: return all elements where the corresponding logical value is TRUE

For higher dimensions these are separated by commas.

* `[`: the `drop` argument controls simplification.
* `[[` returns an element.
* `x$y` is equivalent to `x[["y"]]`.
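The five kinds of 1d subsetting and the three operators can be tried out with a short sketch. The objects `x`, `m` and `lst` are invented for illustration only.

```r
x <- c(a = 1, b = 2, c = 3, d = 4)

x[]                    # blank: everything
x[c(1, 3)]             # positive integers: elements 1 and 3
x[-c(1, 3)]            # negative integers: everything except 1 and 3
x[c("a", "d")]         # character vector: elements named a and d
x[x > 2]               # logical vector: elements where the test is TRUE

# Higher dimensions: one index per dimension, separated by commas
m <- matrix(1:6, nrow = 2)
m[1, ]                 # first row, simplified to a vector
m[1, , drop = FALSE]   # first row, kept as a 1 x 3 matrix

# [ keeps the container, [[ and $ pull out a single element
lst <- list(y = 1:3, z = "text")
lst["y"]               # a list of length one
lst[["y"]]             # the integer vector itself
identical(lst$y, lst[["y"]])
```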

## Functions

Functions in R are created by `function`. They consist of an argument list (which can include default values), and a body of code to execute when evaluated. In R arguments are passed-by-value, so the only way a function can affect the outside world is through its return value:

    f <- function(x) {
      x$a <- 2
    }
    x <- list(a = 1)
    f(x)
    x$a
    # [1] 1

Functions can return only a single value, but this is not a limitation in practice because you can always return a list containing any number of objects.

When calling a function you can specify arguments by position, or by name:

    mean(1:10)
    mean(x = 1:10)
    mean(x = 1:10, trim = 0.05)

Arguments are matched first by exact name, then by prefix matching and finally by position.

There is a special argument called `...`. This argument will match any arguments not otherwise specifically matched, and can be used to call other functions. This is useful if you want to collect arguments to call another function, but you don't want to prespecify their possible names.

You can define new infix operators with a special syntax:

    "%+%" <- function(a, b) paste(a, b)
    "new" %+% "string"

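The matching order for arguments (exact name, then prefix, then position) and the role of `...` can be checked with a small sketch. Both `f` and `trimmed_mean` are hypothetical functions written only for this illustration.

```r
# Hypothetical function used only to show the matching order
f <- function(alpha, beta, ...) {
  list(alpha = alpha, beta = beta, dots = list(...))
}

# beta matches exactly, al matches alpha by prefix,
# and everything left over (3 and extra) is collected by ...
res <- f(3, beta = 2, al = 1, extra = 4)
str(res)

# ... forwards otherwise unmatched arguments to another function
trimmed_mean <- function(x, ...) mean(x, ...)
trimmed_mean(c(1:9, 100), trim = 0.1)
```

Prefix matching is convenient interactively, but spelling argument names out in full is usually safer in code that other people will read.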
And replacement functions to modify arguments "in-place":

    "second<-" <- function(x, value) {
      x[2] <- value
      x
    }
    x <- 1:10
    second(x) <- 5
    x

But this is really the same as

    x <- "second<-"(x, 5)

and actual modification in place should be considered a performance optimisation, not a fundamental property of the language.

## Scoping

Scoping is the set of rules that govern how R looks up the value of a symbol, or name. That is, the rules that R applies to go from the symbol `x` to its value `10` in the following example.

    x <- 10
    x
    # [1] 10

R has two types of scoping: lexical scoping, implemented automatically at the language level, and dynamic scoping, used in select functions to save typing during interactive analysis. This document describes lexical scoping, as well as environments (the underlying data structure). Dynamic scoping is described in the context of controlling evaluation.

## Lexical scoping

Lexical scoping looks up symbol values based on how functions were nested when they were written, not on how they are nested when they are called. With lexical scoping, you can figure out where the value of each variable will be looked up just by looking at the definition of the function; you don't need to know anything about how the function is called.

The "lexical" in lexical scoping doesn't correspond to the usual English definition ("of or relating to words or the vocabulary of a language as distinguished from its grammar and construction") but comes from the computer science term "lexing", which is part of the process that converts code represented as text to meaningful pieces that the programming language understands. It's lexical in this sense, because you only need the definition of the functions, not how they are called. The following example illustrates the basic principle:

    x <- 5
    f <- function() {
      y <- 10
      c(x = x, y = y)
    }
    f()
    #  x  y
    #  5 10

Lexical scoping is the rule that determines where values are looked for, not when. Unlike some languages, R looks up the values at run-time, not when the function is created. This means results from a function can be different depending on objects outside its environment:

    x <- 15
    f()
    #  x  y
    # 15 10
    x <- 20
    f()
    #  x  y
    # 20 10

If a name is defined inside a function, it will mask the top-level definition:

    g <- function() {
      x <- 21
      y <- 11
      c(x = x, y = y)
    }
    f()
    #  x  y
    # 20 10
    g()
    #  x  y
    # 21 11

The same principle applies regardless of the degree of nesting. See if you can predict what the following function will return before trying it out yourself.

    w <- 0
    f <- function() {
      x <- 1
      g <- function() {
        y <- 2
        h <- function() {
          z <- 3
          c(w = w, x = x, y = y, z = z)
        }
        h()
      }
      g()
    }
    f()

To better understand how scoping works, it's useful to know a little about environments, the data structure that powers scoping.

## Environments

An environment is very similar to a list, with two important differences. Firstly, an environment has reference semantics: R's usual copy on modify rules do not apply. Secondly, an environment has a parent: if an object is not found in an environment, then R will look in its parent. Technically, an environment is made up of a frame, a collection of named objects (like a list), and a link to a parent environment.

When a function is created, it gains a pointer to the environment where it was made. You can access this environment with the `environment` function. This environment may have access to objects that are not in the global environment: this is how namespaces work.

    environment(plot)
    ls(environment(plot), all = TRUE)
    get(".units", environment(plot))
    get(".units")

Every time a function is called, a new environment is created to host execution. The following example illustrates that each time a function is run it gets a new environment:

    f <- function(x) {
      if (!exists("a")) {
        message("Defining a")
        a <- 1
      } else {
        a <- a + 1
      }
      a
    }
    f()
    # Defining a
    # [1] 1
    f()
    # Defining a
    # [1] 1
    a <- 2
    f()
    # [1] 3
    a
    # [1] 2

The section on closures describes how to work around this limitation by using a parent environment that stays the same between runs.

Environments can also be useful in their own right, if you want to create a data structure that has reference semantics. This is not something that should be undertaken lightly: it will violate users' expectations about how R code works, but it can sometimes be critical. The following example shows how to create an environment for this purpose.

    e <- new.env(hash = TRUE, parent = emptyenv())
    f <- e
    exists("a", e)   # FALSE
    e$a              # NULL
    get("a", e)      # error: object 'a' not found
    ls(e)
    # Environments are reference objects: R's usual copy-on-modify semantics
    # do not apply
    e$a <- 10
    ls(e)
    f$a              # 10, because f and e are the same environment

The new reference based classes, introduced in R 2.12, provide a more formal way to do this, with usual inheritance semantics.

There are also a few special environments that you can access directly:

* `globalenv()`: the user's workspace
* `baseenv()`: the environment of the base package
* `emptyenv()`: the ultimate ancestor of all environments

The only environment that doesn't have a parent is `emptyenv()`, which is the eventual parent of every other environment. The most common environment is the global environment (`globalenv()`), which corresponds to your top-level workspace. The parent of the global environment is one of the packages you have loaded (the exact order will depend on which packages you have loaded in which order). The eventual parent will be the base environment, which is the environment of "base R" functionality, which has the empty environment as a parent.

Apart from that, the environment hierarchy is created by function definition. When you create a function, `f`, in the global environment, the environment of the function `f` will have the global environment as a parent. If you create a function `g` inside `f`, then the environment of `g` will have the environment of `f` as a parent, and the global environment as a grandparent.
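The hierarchy described above can be inspected directly with `parent.env`. This is a minimal sketch: the variable names are arbitrary, and the middle of the printed chain depends on which packages are attached in your session.

```r
# Walk from the global environment up to (but not including) the empty environment
env <- globalenv()
chain <- character()
while (!identical(env, emptyenv())) {
  chain <- c(chain, environmentName(env))
  env <- parent.env(env)
}
chain   # begins with "R_GlobalEnv" and ends with "base"

# Function definition extends the hierarchy
f <- function() {
  g <- function() NULL
  # return g's environment together with the environment of this call to f
  list(g_env = environment(g), f_call_env = environment())
}
envs <- f()
identical(envs$g_env, envs$f_call_env)             # g was created inside this call
identical(parent.env(envs$g_env), environment(f))  # whose parent is where f was defined
```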

## Active bindings

`makeActiveBinding` allows you to create names that look like variables, but act like functions. Every time you access the object a function is run. This lets you do crazy things like:

    makeActiveBinding("x", function(...) rnorm(1), globalenv())
    x
    # [1] 0.4754442
    x
    # [1] -1.659971
    x
    # [1] -1.040291

## Lazy loading

http://stackoverflow.com/questions/8700619/get-specific-object-from-rdata-file

## Dynamic scoping

`parent.frame`, `sys.calls` etc

## Explicit scoping with local

## Namespaces

## Laziness

R has a number of features designed to help you do as little work as possible. These are collectively known as lazy loading tools, and allow you to put off doing work as long as possible (so hopefully you never have to do it). You are probably familiar with lazy function arguments, but there are a few other ways you can make operations lazy by hand.

To demonstrate the use of `delayedAssign` we're going to create a caching function that saves the results of an expensive operation to disk, and then, when you next load it, lazy loads the objects. This means that if we cached something you didn't need, we only pay a small disk usage penalty, and we don't use up any memory in R.

Challenge: if we do it one file per object, how do we know whether the cache has been run before or not? Should print message when loading from cache. Can we make caching robust enough that some objects can be retrieved from disk and some can be computed afresh?
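One possible shape for such a caching function is sketched below. It is an assumption-laden illustration, not the design the text goes on to develop: the helper name `cache_lazy`, the one-`.rds`-file-per-object layout and the use of `tempdir()` are all invented here.

```r
# Hypothetical helper: compute a value once, save it to disk, and bind `name`
# lazily so the file is only read when the object is first used.
cache_lazy <- function(name, compute, dir = tempdir(), env = parent.frame()) {
  path <- file.path(dir, paste0(name, ".rds"))
  if (!file.exists(path)) {
    saveRDS(compute(), path)     # first run: do the work and store the result
  }
  delayedAssign(name, {
    message("Loading ", name, " from cache")
    readRDS(path)
  }, eval.env = environment(), assign.env = env)
  invisible(path)
}

cache_lazy("big", function() sum(as.numeric(1:1e6)))
# Nothing has been read back yet; the first use of `big` triggers the load
big
```

Using one file per object means `file.exists` doubles as the check for whether the cache has been populated, which is one answer to the first challenge question.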

# Functions

## Creating functions

## Anonymous functions

In R, functions are objects in their own right. Unlike many other programming languages, functions aren't automatically bound to a name: they can exist independently. You might have noticed this already, because when you create a function, you use the usual assignment operator to give it a name.

Given the name of a function as a string, you can find that function using `match.fun`. The inverse is not possible, because not all functions have a name, and a function may have more than one name. Functions that don't have a name are called anonymous functions. Anonymous functions are not that useful by themselves, so in this section you'll learn about their basic properties, and then see how they are used in subsequent sections.

The tool we use for creating functions is `function`. It is very close to being an ordinary R function, but it has special syntax: the last argument to the function is outside the call and provides the body of the new function. If we don't assign the result of `function` to a variable we get an anonymous function:

    function(x) 3
    # function(x) 3

You can call anonymous functions, but the code is a little tricky to read because you must use parentheses in two different ways: to call a function, and to make it clear that we want to call the anonymous function `function(x) 3`, rather than define an anonymous function whose body calls a function named `3` (not a valid function name):

    (function(x) 3)()
    # [1] 3

    # Exactly the same as
    f <- function(x) 3
    f()

    function(x) 3()
    # function(x) 3()

The syntax extends in a straightforward way if the function has parameters:

    (function(x) x)(3)
    # [1] 3
    (function(x) x)(x = 4)
    # [1] 4

Functions have three important components:

* `body()`: the quoted object representing the code inside the function
* `formals()`: the argument list to the function
* `environment()`: the environment in which the function was defined

All three can also be used to modify the structure of the function in their assignment form. These are illustrated below:

    formals(function(x = 4) g(x) + h(x))
    # $x
    # [1] 4
    body(function(x = 4) g(x) + h(x))
    # g(x) + h(x)
    environment(function(x = 4) g(x) + h(x))
    # <environment: R_GlobalEnv>
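The assignment forms of these accessors can be sketched as follows. The function `f` is a throwaway example, not one used elsewhere in the text.

```r
f <- function(x = 4) x + 1
f()                             # uses the default x = 4

formals(f)$x <- 10              # change the default value of x
body(f) <- quote(x * 2)         # replace the code
f()

environment(f) <- new.env()     # attach a fresh environment to f
f()                             # same result: `*` is still found via the parent chain
```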

## Pure functions

Why pure functions are easy to reason about. Ways in which functions can have side effects.

## Primitive functions

## Lazy evaluation of function arguments

By default, R function arguments are lazy - they're not evaluated when you call the function, but only when that argument is used:


    f <- function(x) {
      10
    }
    system.time(f(Sys.sleep(10)))
    #    user  system elapsed
    #       0       0       0

If you want to ensure that an argument is evaluated you can use `force`:

    f <- function(x) {
      force(x)
      10
    }
    system.time(f(Sys.sleep(10)))
    #    user  system elapsed
    #       0       0  10.001

But note that `force` is just syntactic sugar. The definition of `force` is:

    force <- function(x) x

The argument is evaluated in the environment in which it was created, not the environment of the function:

    f <- function() {
      y <- "f"
      g(y)
    }
    g <- function(x) {
      y <- "g"
      x
    }
    y <- "toplevel"
    f()
    # [1] "f"

More technically, an unevaluated argument is called a promise, or a thunk. A promise is made up of two parts:

* an expression giving the delayed computation, which can be accessed with `substitute` (see controlling evaluation for more details)
* the environment where the expression was created and where it should be evaluated

You may notice this is rather similar to a closure with no arguments, and in many languages that don't have laziness built in like R, this is how you can implement laziness.

Laziness is particularly useful in if statements:

    if (!is.null(x) && y > 0)

And you can use it to write functions that are not possible otherwise:

    and <- function(x, y) {
      if (!x) FALSE else y
    }
    a <- 1
    and(!is.null(a), a > 0)
    a <- NULL
    and(!is.null(a), a > 0)

This function would not work without lazy evaluation because both x and y would always be evaluated, testing if a > 0 even if a was NULL.

## delayedAssign

Delayed assign is particularly useful for doing expensive operations that you're not sure you'll need. This is the essence of laziness: put off doing any work until the last possible minute. To create a variable `x` that is the sum of the values `a` and `b`, but is not evaluated until we need it, we use `delayedAssign`:

    a <- 1
    b <- 2
    delayedAssign("x", a + b)
    a <- 10
    x
    # [1] 12

`delayedAssign` also provides two parameters that control where the evaluation happens (`eval.env`) and in which environment the variable is assigned (`assign.env`).

`autoload` is an example of this: it's a wrapper around `delayedAssign` for functions or data in a package. It makes R behave as if the package is loaded, but it doesn't actually load it (i.e. do any work) until you call one of the functions. This is the way that data sets in most packages work: you can call (e.g.) `diamonds` after `library(ggplot2)` and it just works, but it isn't loaded into memory unless you actually use it.
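To see that a delayed assignment really does no work until the value is needed, and that the result is then cached, here is a small sketch using `assign.env`. The counter `calls` and the function `slow_value` are stand-ins invented for this example.

```r
calls <- 0
slow_value <- function() {
  calls <<- calls + 1   # count how often the "expensive" step actually runs
  42
}

e <- new.env()
delayedAssign("result", slow_value(), assign.env = e)
calls          # still 0: nothing has been evaluated yet
e$result       # first access forces the promise
e$result       # second access reuses the stored value
calls          # 1: slow_value() ran exactly once
```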

## Default arguments

# First class functions

R supports "first class functions", functions that can be:

* created anonymously,
* assigned to variables and stored in data structures,
* returned from functions (closures),
* passed as arguments to other functions (higher-order functions)

You've already learned about anonymous functions in functions. This chapter will explore the other three properties, and show how they can remove redundancy in your code. The chapter concludes with an exploration of numerical integration, showing how all of the properties of first-class functions can be used to solve a real problem.

You should be familiar with the basic properties of scoping and environments before reading this chapter.

## Closures

"An object is data with functions. A closure is a function with data." --- John D Cook

A closure is a function written by another function. Closures are so called because they enclose the environment of the parent function, and can access all variables and parameters in that function. This is useful because it allows us to have two levels of parameters. One level of parameters (the parent) controls how the function works. The other level (the child) does the work. The following example shows how we can use this idea to generate a family of power functions. The parent function (`power`) creates child functions (`square` and `cube`) that actually do the hard work.

    power <- function(exponent) {
      function(x) x ^ exponent
    }

    square <- power(2)
    square(2) # -> [1] 4
    square(4) # -> [1] 16

    cube <- power(3)
    cube(2) # -> [1] 8
    cube(4) # -> [1] 64

The ability to manage variables at two levels makes it possible to maintain the state across function invocations by allowing a function to modify variables in the environment of its parent. Key to managing variables at different levels is the double arrow assignment operator (`<<-`). Unlike the usual single arrow assignment (`<-`) that always assigns in the current environment, the double arrow operator will keep looking up the chain of parent environments until it finds a matching name.

This makes it possible to maintain a counter that records how many times a function has been called, as shown in the following example. Each time `new_counter` is run, it creates an environment, initialises the counter `i` in this environment, and then creates a new function.

    new_counter <- function() {
      i <- 0
      function() {
        # do something useful, then ...
        i <<- i + 1
        i
      }
    }

The new function is a closure, and its environment is the enclosing environment. When the closures counter_one and counter_two are run, each one modifies the counter in its enclosing environment and then returns the current count.

    counter_one <- new_counter()
    counter_two <- new_counter()

    counter_one() # -> [1] 1
    counter_one() # -> [1] 2
    counter_two() # -> [1] 1
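The difference between `<-` and `<<-` inside a closure can be made concrete by comparing the counter with a deliberately broken variant. `broken_counter` is a hypothetical counter-example written for this sketch; `new_counter` is repeated so that the block runs on its own.

```r
new_counter <- function() {
  i <- 0
  function() {
    i <<- i + 1   # modifies i in the enclosing environment
    i
  }
}

broken_counter <- function() {
  i <- 0
  function() {
    i <- i + 1    # creates a new local i on every call
    i
  }
}

good <- new_counter()
bad <- broken_counter()
good(); good()         # 1, then 2
bad(); bad()           # 1, then 1 again

environment(good)$i    # 2: the state lives in the enclosing environment
environment(bad)$i     # 0: the enclosing i was never touched
```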

This is an important technique because it is one way to generate "mutable state" in R. R5 expands on this idea in considerably more detail.

An interesting property of functions in R is that basically every function in R is a closure, because all functions remember the environment in which they were created: typically either the global environment, if it's a function that you've written, or a package environment, if it's a function that someone else has written. When you print a function in R, it shows you which environment it comes from, unless that environment is the global environment. If the environment isn't displayed, it doesn't mean the function doesn't have an environment; it means that it was created in the global environment. The environment created by running a function doesn't have a special name, so the environments of closures that you've created are displayed as a memory address.

    f <- function(x) x
    f
    # function(x) x
    environment(f)
    # <environment: R_GlobalEnv>

    print
    # function (x, ...)
    # UseMethod("print")
    # <environment: namespace:base>

    counter_one
    # function() {
    #   # do something useful, then ...
    #   i <<- i + 1
    #   i
    # }
    # <environment: 0x1022be7f0>

The only way to create a function that isn't a closure is to manually set its environment to the empty environment:

    f <- function(x) x + 1
    environment(f) <- emptyenv()
    f
    # function(x) x + 1
    # <environment: R_EmptyEnv>

But this isn't very useful because functions in R rely on lexical scoping:

    f(1)
    # Error in f(1) : could not find function "+"

## Built-in functions

There are two useful built-in functions that return closures:

`Negate` takes a function that returns a logical vector, and returns the negation of that function. This can be a useful shortcut when the function you have returns the opposite of what you need.

    Negate <- function(f) {
      f <- match.fun(f)
      function(...) !f(...)
    }

    (Negate(is.null))(NULL)

This is most useful in conjunction with higher-order functions, as we'll see in the next section.

`Vectorize` takes a non-vectorised function and vectorises it with respect to the arguments given in the `vectorize.args` parameter. This doesn't give you any magical performance improvements, but it is useful if you want a quick and dirty way of making a vectorised function.

A mildly useful extension of `sample` would be to vectorise it with respect to size: this would allow you to generate multiple samples in one call.

    sample2 <- Vectorize(sample, "size", SIMPLIFY = FALSE)
    sample2(1:10, rep(5, 4))
    sample2(1:10, 2:5)

In this example we have used SIMPLIFY = FALSE to ensure that our newly vectorised function always returns a list. This is usually a good idea. Vectorize does not work with primitive functions.

## Higher-order functions

The power of closures is tightly coupled to another important class of functions: higher-order functions (HOFs), functions that take functions as arguments. Mathematicians distinguish between functionals, which accept a function and return a scalar, and function operators, which accept a function and return a function. Integration over an interval is a functional, the indefinite integral is a function operator.

Higher-order functions of use to R programmers fall into two main camps: data structure manipulation and mathematical tools, as described below.

## Data structure manipulation

The first important family of higher-order functions manipulate vectors. They each take a function as their first argument, and a vector as their second argument.

The first three functions take a logical predicate, a function that returns either `TRUE` or `FALSE`. The predicate function does not need to be vectorised, as all three functions call it element by element.

* `Filter`: returns a new vector containing only elements where the predicate is `TRUE`.
* `Find`: return the first element that matches the predicate (or the last element if `right = TRUE`).
* `Position`: return the position of the first element that matches the predicate (or the last element if `right = TRUE`).

The following example shows some simple uses:

    x <- 200:250
    is.even <- function(x) x %% 2 == 0
    is.odd <- Negate(is.even)
    is.prime <- function(x) gmp::isprime(x) > 1

    Filter(is.prime, x)
    # [1] 211 223 227 229 233 239 241
    Find(is.even, x)
    # [1] 200
    Find(is.odd, x)
    # [1] 201
    Position(is.prime, x, right = TRUE)
    # [1] 42

The next two functions work with more general classes of functions:

* `Map`: can take more than one vector as an input and calls `f` element-by-element on each input. It returns a list.
* `Reduce` recursively reduces a vector to a single value by first calling `f` with the first two elements, then the result of `f` and the third element and so on. If `x = 1:5` then the result would be `f(f(f(f(1, 2), 3), 4), 5)`. If `right = TRUE`, then the function is called in the opposite order: `f(1, f(2, f(3, f(4, 5))))`. You can also specify an `init` value in which case the result would be `f(f(f(f(f(init, 1), 2), 3), 4), 5)`.

Reduce is useful for implementing many types of recursive operations: merges, finding smallest values, intersections, unions. Apart from Map , the implementation of these five vector-processing HOFs is straightforward and I encourage you to read the source code to understand how they each work.
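The call orders described for `Reduce` can be made visible by folding with a function that builds a string instead of computing a number. The helper `show_call` is hypothetical and exists only to print the nesting.

```r
# Map: call the function on the first elements, then the second, and so on
Map(function(x, y) x + y, 1:3, 4:6)

# A stand-in for f that records how it was called
show_call <- function(a, b) paste0("f(", a, ", ", b, ")")

Reduce(show_call, 1:5)                  # left to right
Reduce(show_call, 1:5, right = TRUE)    # right to left
Reduce(show_call, 1:5, "init")          # with an initial value

# A typical practical use: intersect several vectors in one go
Reduce(intersect, list(1:5, 3:7, 4:9))
```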

Other families of higher-order functions include:

* The apply family: `eapply`, `lapply`, `mapply`, `tapply`, `sapply`, `vapply`, `by`. Each of these functions breaks up a data structure in some way, applies the function to each piece and then joins the pieces back together again.
* The `**ply` functions of the plyr package, which attempt to unify the base apply functions by cleanly separating them based on the type of input they break up and the type of output that they produce.
* The array manipulation functions, which modify arrays to compute various margins or other summaries, or generalise matrix multiplication in various ways: `apply`, `outer`, `kronecker`, `sweep`, `addmargins`.
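A brief sketch of the split, apply and combine pattern shared by the base apply functions. The list `scores` and the matrix `m` are made-up data.

```r
scores <- list(a = c(1, 2, 3), b = c(10, 20), c = 5)   # made-up data

lapply(scores, mean)                # always returns a list
sapply(scores, mean)                # simplifies to a named numeric vector
vapply(scores, mean, numeric(1))    # like sapply, but the result type is checked
mapply(function(x, w) sum(x) * w, scores, c(1, 2, 3))

# tapply splits a vector by a grouping variable before applying the function
tapply(c(1, 2, 3, 4), c("u", "v", "u", "v"), sum)

# apply works over the margins of a matrix or array
m <- matrix(1:6, nrow = 2)
apply(m, 1, sum)                    # row sums
apply(m, 2, sum)                    # column sums
```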

## Mathematical higher order functions

Higher order functions arise often in mathematics. In this section we'll explore some of the built in mathematical HOFs in R. There are three functions that work with a 1d numeric function:

* `integrate`: integrate it over a given range
* `uniroot`: find where it hits zero over a given range
* `optimise`: find location of minima (or maxima)

Let's explore how these are used with a simple function:

    integrate(sin, 0, pi)
    uniroot(sin, pi * c(1 / 2, 3 / 2))
    optimise(sin, c(0, 2 * pi))
    optimise(sin, c(0, pi), maximum = TRUE)

There is one function that works with a more general n-dimensional numeric function, optim , which finds the location of a minimum. In statistics, optimisation is often used for maximum likelihood estimation. Maximum likelihood estimation is a natural match to closures because the arguments to a likelihood fall into two groups: the data, which is fixed for a given problem, and the parameters, which will vary as we try to find a maximum numerically. This naturally gives rise to an approach like the following:

    # Negative log-likelihood for Poisson distribution
    poisson_nll <- function(x) {
      n <- length(x)
      function(lambda) {
        n * lambda - sum(x) * log(lambda) # + terms not involving lambda
      }
    }

    nll1 <- poisson_nll(c(41, 30, 31, 38, 29, 24, 30, 29, 31, 38))
    nll2 <- poisson_nll(c(6, 4, 7, 3, 3, 7, 5, 2, 2, 7, 5, 4, 12, 6, 9))

    optimise(nll1, c(0, 100))
    optimise(nll2, c(0, 100))
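A quick sanity check for this approach: the maximum likelihood estimate of a Poisson rate is the sample mean, so the numerical minimum should land very close to mean(x) . The sketch below repeats the closure from the text so that it runs on its own; x1 and fit are names chosen for this example.

```r
poisson_nll <- function(x) {
  n <- length(x)
  function(lambda) n * lambda - sum(x) * log(lambda)
}

x1 <- c(41, 30, 31, 38, 29, 24, 30, 29, 31, 38)
fit <- optimise(poisson_nll(x1), c(0, 100))

fit$minimum  # numerical estimate of lambda
mean(x1)     # analytical estimate
```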

## Lists of functions

In R, functions can be stored in lists. Together with closures and higher-order functions, this gives us a set of powerful tools for reducing duplication in code.

We'll start with a simple example: benchmarking, when you are comparing the performance of multiple approaches to the same problem. For example, if you wanted to compare a few approaches to computing the mean, you could store each approach (function) in a list:

    compute_mean <- list(
      base = function(x) mean(x),
      sum = function(x) sum(x) / length(x),
      manual = function(x) {
        total <- 0
        n <- length(x)
        for (i in seq_along(x)) {
          total <- total + x[i] / n
        }
        total
      }
    )

Calling a function from a list is straightforward: just get it out of the list first:

    x <- runif(1e5)
    system.time(compute_mean$base(x))
    system.time(compute_mean$manual(x))

If we want to call all functions to check that we've implemented them correctly and they return the same answer, we can use lapply , either with an anonymous function, or a new function that calls its first argument with all other arguments:

    lapply(compute_mean, function(f) f(x))

    call_fun <- function(f, ...) f(...)
    lapply(compute_mean, call_fun, x)

We can time every function on the list with lapply or Map along with a simple anonymous function:

    lapply(compute_mean, function(f) system.time(f(x)))
    Map(function(f) system.time(f(x)), compute_mean)

If timing functions is something we want to do a lot, we can add another layer of abstraction: a closure that automatically times how long a function takes. We then create a list of timed functions and call the timers with our specified x .

    timer <- function(f) {
      force(f)
      function(...) system.time(f(...))
    }
    timers <- lapply(compute_mean, timer)
    lapply(timers, call_fun, x)

Another useful example is when we want to summarise an object in multiple ways. We could store each summary function in a list, and run each function with lapply and call_fun :

    funs <- list(
      sum = sum,
      mean = mean,
      median = median
    )
    lapply(funs, call_fun, 1:10)

What if we wanted to modify our summary functions to automatically remove missing values? One approach would be to make a list of anonymous functions that call our summary functions with the appropriate arguments:

    funs2 <- list(
      sum = function(x, ...) sum(x, ..., na.rm = TRUE),
      mean = function(x, ...) mean(x, ..., na.rm = TRUE),
      median = function(x, ...) median(x, ..., na.rm = TRUE)
    )

But this leads to a lot of duplication - each function is almost identical apart from a different function name. We could write a closure to abstract this away:

    remove_missings <- function(f) {
      function(...) f(..., na.rm = TRUE)
    }
    funs2 <- lapply(funs, remove_missings)

We could also take a more general approach. A useful function here is Curry (named after the logician Haskell Curry, not the food), which implements "partial function application". What the Curry function does is create a new function that passes on the arguments you specify. An example will make this clearer:

    add <- function(x, y) x + y
    addOne <- function(x) add(x, 1)
    addOne <- Curry(add, y = 1)

One way to implement Curry is as follows:

    Curry <- function(FUN, ...) {
      .orig <- list(...)
      function(...) {
        do.call(FUN, c(.orig, list(...)))
      }
    }

(You should be able to figure out how this works. See the exercises.) But implementing it like this prevents arguments from being lazily evaluated, so the version below is somewhat more complicated: it works by building up an anonymous function by hand. You should be able to work out how this works after you've read the computing on the language chapter. (Hopefully this function will be included in a future version of R.)

    Curry <- function(FUN, ...) {
      args <- match.call(expand.dots = FALSE)$...
      args$... <- as.name("...")

      env <- new.env(parent = parent.frame())

      if (is.name(FUN)) {
        fname <- FUN
      } else if (is.character(FUN)) {
        fname <- as.name(FUN)
      } else if (is.function(FUN)) {
        fname <- as.name("FUN")
        env$FUN <- FUN
      } else {
        stop("FUN not function or name of function")
      }
      curry_call <- as.call(c(list(fname), args))

      f <- eval(call("function", as.pairlist(alist(... = )), curry_call))
      environment(f) <- env
      f
    }

But back to our problem. With the Curry function we can reduce the code a bit:
    funs2 <- list(
      sum = Curry(sum, na.rm = TRUE),
      mean = Curry(mean, na.rm = TRUE),
      median = Curry(median, na.rm = TRUE)
    )

But if we look closely, we're just applying the same function to every element in a list, and that's the job of lapply . This drastically reduces the amount of code we need:

    funs2 <- lapply(funs, Curry, na.rm = TRUE)

Let's think about a similar, but subtly different case. Let's take a vector of numbers and generate a list of functions corresponding to trimmed means with that amount of trimming. The following code doesn't work because we want the first argument of Curry to be fixed to mean. We could try specifying the argument name because named matching overrides positional matching, but that doesn't work because the name of the function to call in lapply is also FUN . And there's no way to specify that we want to set the trim argument.

    trims <- seq(0, 0.9, length = 5)
    lapply(trims, Curry, "mean")
    lapply(trims, Curry, FUN = "mean")

Instead we could use an anonymous function:

    funs3 <- lapply(trims, function(t) Curry("mean", trim = t))
    lapply(funs3, call_fun, c(1:100, (1:50) * 100))

But that doesn't work because each function gets a promise to evaluate t , and that promise isn't evaluated until all of the functions are run. To make it work you need to manually force the evaluation of t :

    funs3 <- lapply(trims, function(t) {
      force(t)
      Curry("mean", trim = t)
    })
    lapply(funs3, call_fun, c(1:100, (1:50) * 100))
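The effect of an unforced promise is easiest to see with a plain for loop. The sketch below uses two hypothetical function factories, make_adder and make_adder_forced , that differ only in the call to force . (As far as I know, recent versions of R force arguments inside lapply itself, so the lapply version of this problem may not reproduce there; the for loop version still does.)

```r
# Hypothetical factories, identical apart from force(n)
make_adder <- function(n) {
  function(x) x + n
}
make_adder_forced <- function(n) {
  force(n)
  function(x) x + n
}

lazy <- list()
eager <- list()
for (i in 1:3) {
  lazy[[i]] <- make_adder(i)
  eager[[i]] <- make_adder_forced(i)
}

# The promise for n in lazy[[1]] is only evaluated now, when i is already 3
lazy[[1]](10)   # 13
eager[[1]](10)  # 11
```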

A simpler solution in this case is to use Map , as described in the last chapter, which works similarly to lapply except that you can supply multiple arguments by both name and position. For this example, it doesn't do a good job of figuring out how to name the functions, but that's easily fixed.

    funs3 <- Map(Curry, "mean", trim = trims)
    names(funs3) <- trims
    lapply(funs3, call_fun, c(1:100, (1:50) * 100))

It's usually better to use lapply because it is more familiar to most R programmers, and it is somewhat simpler and so is slightly faster.

## Case study: numerical integration

To conclude this chapter, we will develop a simple numerical integration tool, and along the way, illustrate the use of many properties of first-class functions: we'll use anonymous functions, lists of functions, functions that make closures and functions that take functions as input. Each step is driven by a desire to make our approach more general and to reduce duplication.

We'll start with two very simple approaches: the midpoint and trapezoid rules. Each takes a function we want to integrate, f , and a range to integrate over, from a to b . For this example we'll try to integrate sin x from 0 to pi, because it has a simple answer: 2.

    midpoint <- function(f, a, b) {
      (b - a) * f((a + b) / 2)
    }
    trapezoid <- function(f, a, b) {
      (b - a) / 2 * (f(a) + f(b))
    }

    midpoint(sin, 0, pi)
    trapezoid(sin, 0, pi)

Neither of these functions gives a very good approximation, so we'll do what we normally do in calculus: break up the range into smaller pieces and integrate each piece using one of the simple rules. To do that we create two new functions for performing composite integration:

    midpoint_composite <- function(f, a, b, n = 10) {
      points <- seq(a, b, length = n + 1)
      h <- (b - a) / n

      area <- 0
      for (i in seq_len(n)) {
        area <- area + h * f((points[i] + points[i + 1]) / 2)
      }
      area
    }
    trapezoid_composite <- function(f, a, b, n = 10) {
      points <- seq(a, b, length = n + 1)
      h <- (b - a) / n

      area <- 0
      for (i in seq_len(n)) {
        area <- area + h / 2 * (f(points[i]) + f(points[i + 1]))
      }
      area
    }

    midpoint_composite(sin, 0, pi, n = 10)
    midpoint_composite(sin, 0, pi, n = 100)
    trapezoid_composite(sin, 0, pi, n = 10)
    trapezoid_composite(sin, 0, pi, n = 100)

    mid <- sapply(1:20, function(n) midpoint_composite(sin, 0, pi, n))
    trap <- sapply(1:20, function(n) trapezoid_composite(sin, 0, pi, n))
    matplot(cbind(mid = mid, trap = trap))

But notice that there's a lot of duplication across midpoint_composite and trapezoid_composite : they are basically the same apart from the internal rule used to integrate over a simple range. Let's extract out a general composite integrate function:

    composite <- function(f, a, b, n = 10, rule) {
      points <- seq(a, b, length = n + 1)

      area <- 0
      for (i in seq_len(n)) {
        area <- area + rule(f, points[i], points[i + 1])
      }
      area
    }

    midpoint_composite(sin, 0, pi, n = 10)
    composite(sin, 0, pi, n = 10, rule = midpoint)
    composite(sin, 0, pi, n = 10, rule = trapezoid)

This function now takes two functions as arguments: the function to integrate, and the integration rule to use for simple ranges. We can now add even better rules for integrating small ranges:

    simpson <- function(f, a, b) {
      (b - a) / 6 * (f(a) + 4 * f((a + b) / 2) + f(b))
    }

    boole <- function(f, a, b) {
      pos <- function(i) a + i * (b - a) / 4
      fi <- function(i) f(pos(i))

      (b - a) / 90 *
        (7 * fi(0) + 32 * fi(1) + 12 * fi(2) + 32 * fi(3) + 7 * fi(4))
    }
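One way to convince yourself that these rules are implemented correctly is to test them on polynomials. Simpson's rule is exact for polynomials up to degree 3, and Boole's rule up to degree 5, so a single application over the whole range should reproduce the analytical integral (up to floating point error). The sketch repeats the two rules so that it is self-contained; the test polynomials are arbitrary choices.

```r
simpson <- function(f, a, b) {
  (b - a) / 6 * (f(a) + 4 * f((a + b) / 2) + f(b))
}
boole <- function(f, a, b) {
  pos <- function(i) a + i * (b - a) / 4
  fi <- function(i) f(pos(i))
  (b - a) / 90 *
    (7 * fi(0) + 32 * fi(1) + 12 * fi(2) + 32 * fi(3) + 7 * fi(4))
}

# Integral of x^3 from 0 to 2 is 2^4 / 4 = 4
simpson_cubic <- simpson(function(x) x^3, 0, 2)
# Integral of x^5 from 0 to 1 is 1 / 6
boole_quintic <- boole(function(x) x^5, 0, 1)

all.equal(simpson_cubic, 4)
all.equal(boole_quintic, 1 / 6)
```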

Let's compare these different approaches.

    expt1 <- expand.grid(
      n = 5:50,
      rule = c("midpoint", "trapezoid", "simpson", "boole"),
      stringsAsFactors = FALSE
    )

    abs_sin <- function(x) abs(sin(x))
    run_expt <- function(n, rule) {
      composite(abs_sin, 0, 4 * pi, n = n, rule = match.fun(rule))
    }

    library(plyr)
    res1 <- mdply(expt1, run_expt)

    library(ggplot2)
    qplot(n, V1, data = res1, colour = rule, geom = "line")

It turns out that the midpoint, trapezoid, Simpson and Boole rules are all examples of a more general family called Newton-Cotes rules. We can take our integration one step further by extracting out this commonality to produce a function that can generate any general Newton-Cotes rule:

    # http://en.wikipedia.org/wiki/Newton%E2%80%93Cotes_formulas
    newton_cotes <- function(coef, open = FALSE) {
      # Number of equal sub-intervals between a and b. Closed rules evaluate f
      # at both end points; open rules use only the interior points.
      n <- length(coef) - 1 + 2 * open

      function(f, a, b) {
        pos <- function(i) a + i * (b - a) / n
        points <- pos(seq.int(0, length(coef) - 1) + open)

        (b - a) / sum(coef) * sum(f(points) * coef)
      }
    }

    trapezoid <- newton_cotes(c(1, 1))
    midpoint <- newton_cotes(1, open = TRUE)
    simpson <- newton_cotes(c(1, 4, 1))
    boole <- newton_cotes(c(7, 32, 12, 32, 7))
    milne <- newton_cotes(c(2, -1, 2), open = TRUE)

    # Alternatively, make list then use lapply
    # lapply(values, newton_cotes, closed)
    # lapply(values, newton_cotes, open, open = TRUE)
    # lapply(values, do.call, what = "newton_cotes")

    rules <- list(midpoint = midpoint, trapezoid = trapezoid,
      simpson = simpson, boole = boole, milne = milne)

    expt1 <- expand.grid(n = 5:50, rule = names(rules),
      stringsAsFactors = FALSE)
    run_expt <- function(n, rule) {
      composite(abs_sin, 0, 4 * pi, n = n, rule = rules[[rule]])
    }

Mathematically, the next step in improving numerical integration is to move from a grid of evenly spaced points to a grid where the points are closer together near the end of the range.

## Summary

## Exercises

1. Read the source code for Filter , Negate , Find and Position . Write a couple of sentences for each describing how they work.
2. Write an And function that, given two logical functions, returns a logical And of all their results. Extend the function to work with any number of logical functions. Write similar Or and Not functions.
3. Write a general compose function that composes together an arbitrary number of functions. Write it using both recursion and looping.
4. How does the first version of Curry work?

# Evaluation

In this document, we're going to write our own (slightly) simplified version of subset , and along the way, learn how we can capture and control evaluation in R. These tools are very useful for developing convenient user-facing functions because they dramatically reduce the amount of typing required to specify an action. This abbreviation comes at the cost of somewhat increasing ambiguity in the function call, and of making the function difficult to call from another function.

You might already be familiar with the subset function. If not, it's a useful interactive shortcut for subsetting data frames: instead of repeating the data frame you're working with again and again, you can save some typing:

    subset(mtcars, vs == am)
    # equivalent to:
    mtcars[mtcars$vs == mtcars$am, ]

    subset(mtcars, cyl == 4)
    # equivalent to:
    mtcars[mtcars$cyl == 4, ]

Subset is special because vs == am or cyl == 4 aren't evaluated in the global environment (your usual workspace), they're evaluated in the context of the data frame. In other words, subset implements different scoping rules so instead of looking for those variables in the current environment, subset looks in the specified data frame. (If you're unfamiliar with R's scoping rules, I'd suggest you read up on them before you continue.) To do this, subset must be able to capture the meaning of your condition string without evaluating it. The next section describes how to do that.

## Quoting

To talk about quoting, capturing a language object without evaluating it, we need some new vocabulary:

* a symbol, or name , describes the name of an object, like x or y , not its value like 5 or "a" (see is.symbol & as.symbol ).
* a call is an unevaluated expression, like x == y , rather than the result of that call (see is.call & as.call ).
* an expression is a list of calls and/or symbols (see is.expression & as.expression ).
* a language object is a name, call, or expression ( is.language ).

The most direct way to quote a call is (surprise!) with quote :
quote(vs == am) # vs == am
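The vocabulary above maps directly onto predicates you can run. The following sketch checks one object of each kind; the variable names are arbitrary.

```r
sym <- quote(x)               # a symbol (name)
cl  <- quote(x == y)          # a call
ex  <- expression(x, x == y)  # an expression: a list of symbols and calls

is.symbol(sym)      # TRUE
is.call(cl)         # TRUE
is.expression(ex)   # TRUE
is.call(ex[[2]])    # TRUE: the second element of the expression is a call

# All three are language objects
c(is.language(sym), is.language(cl), is.language(ex))
```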

We could also build a call manually, using call :

    call("==", as.name("vs"), as.name("am"))
    # vs == am

    # We need to use as.name because we want to refer to the variables called
    # vs and am, not the strings "vs" and "am"
    call("==", "vs", "am")
    # "vs" == "am"

Or we could create it from a string, with parse :

    parse(text = "vs == am")[[1]]
    # vs == am

    # Parse returns an expression, but we just want the first element, a call

The simplest way to write subset would be to require that the user supply a call created using one of the above techniques. But that requires extra typing, just what we were trying to avoid. We want to capture the call inside the function, not outside of it. Think about why quote won't work, and then run the following code to check if your guess was correct:
    subset <- function(x, condition) {
      quote(condition)
    }
    subset(mtcars, cyl == 4)

We need a new approach. One way is to capture the entire function call and then extract the piece that corresponds to condition . This is what the match.call function lets us do:

    subset <- function(x, condition) {
      match.call()
    }
    subset(mtcars, vs == am)
    # subset(x = mtcars, condition = vs == am)

    subset <- function(x, condition) {
      match.call()$x
    }
    subset(mtcars, vs == am)
    # mtcars

    subset <- function(x, condition) {
      match.call()$condition
    }
    subset(mtcars, vs == am)
    # vs == am

It's also worth looking at the classes of the objects we are dealing with:

    subset <- function(x, condition) {
      class(match.call()$x)
    }
    subset(mtcars, vs == am)
    # [1] "name"

    subset <- function(x, condition) {
      class(match.call()$condition)
    }
    subset(mtcars, vs == am)
    # [1] "call"

Another function that can capture the original call is substitute :

    subset <- function(x, condition) {
      substitute(condition)
    }
    subset(mtcars, vs == am)
    # vs == am

Substitute is the tool most commonly used for this task, but it's the hardest to explain because it relies on lazy evaluation. In R, function arguments are only evaluated when they are needed, not automatically when the function is called. Before the argument is evaluated (aka forced), it is stored as a promise . A promise stores two things: the call that should be evaluated and the environment in which to evaluate it. When substitute is passed a promise, it extracts the call, which is the thing we're looking for.

Once we have the call, we need to evaluate it in the way we want, as described next.

## Evaluation

Now that we have the call that represents the subset condition we want, we want to evaluate it in the right context, so that cyl is interpreted as mtcars$cyl . To do this we need the eval function:

    eval(quote(cyl), mtcars)
    # [1] 6 6 4 6 8 6 8 4 4 6 6 8 8 ...

    eval(quote(cyl1), mtcars)
    # Error in eval(expr, envir, enclos) : object 'cyl1' not found

    eval(cyl, mtcars)
    # Error in eval(cyl, mtcars) : object 'cyl' not found

    eval(as.name('cyl'), mtcars)
    # [1] 6 6 4 6 8 6 8 4 4 6 6 8 8 8 8 8 8 4 4 4 4 8 8 8 8 4 4 4 8 6 8 4

    cyl <- quote(cyl)
    eval(cyl, mtcars)
    # [1] 6 6 4 6 8 6 8 4 4 6 6 8 8 ...

The first argument to eval is the language object to evaluate, and the second argument is the environment to use for evaluation. If the second argument is a list or data frame, eval will convert it to an environment for you. (There are a number of shortcut functions: evalq , eval.parent , and local that are also documented with eval - I won't use or explain these here, but I'd recommend you read about them and figure out what they do.)

Now we have all the pieces we need to write the subset function: we can capture the call representing condition then evaluate it in the context of the data frame:

    subset <- function(x, condition) {
      condition_call <- substitute(condition)
      r <- eval(condition_call, x)
      x[r, ]
    }
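To make the role of eval's second argument concrete, here is a small sketch that evaluates the same kind of quoted call against a plain list and against the built-in mtcars data frame. The variable names total and keep are only for illustration.

```r
# A list works as the evaluation context just like a data frame does
total <- eval(quote(a + b), list(a = 1, b = 2))
total  # 3

# Evaluating a quoted condition inside mtcars gives one logical per row
keep <- eval(quote(cyl == 4), mtcars)
length(keep)  # one value for each of the 32 rows
sum(keep)     # number of four-cylinder cars
```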

Unfortunately, we're not quite done because this function doesn't always work as we expect.
    subset(mtcars, cyl == 4)

## Scoping issues

What do you expect the following to do:

    x <- 4
    subset(mtcars, cyl == x)
    #      mpg cyl disp hp drat wt qsec vs am gear carb
    # NA    NA  NA   NA NA   NA NA   NA NA NA   NA   NA
    # NA.1  NA  NA   NA NA   NA NA   NA NA NA   NA   NA
    # NA.2  NA  NA   NA NA   NA NA   NA NA NA   NA   NA
    # NA.3  NA  NA   NA NA   NA NA   NA NA NA   NA   NA
    # NA.4  NA  NA   NA NA   NA NA   NA NA NA   NA   NA
    # NA.5  NA  NA   NA NA   NA NA   NA NA NA   NA   NA
    # ...

It should be the same as subset(mtcars, cyl == 4) , right? But how exactly does this work and where does subset look for the value of x ? It's not going to be the right value inside the subset function because the first argument of subset is called x . Inside the subset function x will have the value of mtcars .

The key is the third argument of eval : enclos . This allows us to specify the parent (or enclosing) environment for objects that don't have one, like lists and data frames ( enclos is ignored if we pass in a real environment). The enclosing environment is where any objects that aren't found in the data frame will be looked for. By default it uses the current environment, which is not what we want.

We want to look for x in the environment in which subset was called. In R terminology this is called the parent frame and is accessed with parent.frame() . This is an example of dynamic scope. With this modification our function works:

    subset <- function(x, condition) {
      condition_call <- substitute(condition)
      r <- eval(condition_call, x, parent.frame())
      x[r, ]
    }

    x <- 4
    subset(mtcars, cyl == x)

When evaluating code in a non-standard way, it's also a good idea to test your code outside of the global environment:

    f <- function() {
      x <- 6
      subset(mtcars, cyl == x)
    }
    f()

And it works :)

There is one other approach we could use: a formula. ~ works much like quote , but it also captures the environment in which it is created. We need to extract the second component of the formula because the first component is ~ .

    subset <- function(x, f) {
      r <- eval(f[[2]], x, environment(f))
      x[r, ]
    }
    subset(mtcars, ~ cyl == x)

## Calling from another function

While subset saves typing, it has one big disadvantage: it's now difficult to use non-interactively, e.g. from another function. Imagine we want to create a function that randomly reorders a subset of the data. A nice way to write that function would be to write a function for random reordering and a function for subsetting (that we already have!) and combine the two together. Let's try that:
    scramble <- function(x) x[sample(nrow(x)), ]

    subscramble <- function(x, condition) {
      scramble(subset(x, condition))
    }

But when we run that we get:

    subscramble(mtcars, cyl == 4)
    # Error in eval(expr, envir, enclos) : object 'cyl' not found
    traceback()
    # 5: eval(expr, envir, enclos)
    # 4: eval(condition_call, x)
    # 3: subset(x, condition)
    # 2: scramble(subset(x, condition))
    # 1: subscramble(mtcars, cyl == 4)

What's gone wrong? To figure it out, let's debug subset and work through the code line-by-line:

    > debugonce(subset)
    > subscramble(mtcars, cyl == 4)
    debugging in: subset(x, condition)
    debug: {
        condition_call <- substitute(condition)
        r <- eval(condition_call, x)
        x[r, ]
    }
    Browse> n
    debug: condition_call <- substitute(condition)
    Browse> n
    debug: r <- eval(condition_call, x)
    Browse> condition_call
    condition
    Browse> eval(condition_call, x)
    Error in eval(expr, envir, enclos) : object 'cyl' not found
    Browse> condition
    Error: object 'cyl' not found
    In addition: Warning messages:
    1: restarting interrupted promise evaluation
    2: restarting interrupted promise evaluation

Can you see what the problem is? condition_call contains the call condition , and when we try to evaluate that it looks up the symbol condition , which has the value cyl == 4 . That can't be computed in the parent environment because it doesn't contain an object called cyl . If cyl is set in the global environment, far more confusing things can happen:

    cyl <- 4
    subscramble(mtcars, cyl == 4)

    cyl <- sample(10, 100, replace = TRUE)
    subscramble(mtcars, cyl == 4)

This is an example of the general tension in R between functions that are designed for interactive use, and functions that are safe to program with. Generally any function that uses substitute or match.call to retrieve a call, instead of a value, is more suitable for interactive use.

As a developer you should also provide an alternative version that works when passed a call. You might wonder why the function couldn't do this automatically:
    subset <- function(x, condition) {
      if (!is.call(condition)) {
        condition <- substitute(condition)
      }
      r <- eval(condition, x)
      x[r, ]
    }
    subset(mtcars, quote(cyl == 4))
    subset(mtcars, cyl == 4)

But hopefully a little thought, or maybe some experimentation, will show why this doesn't work.
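A separate version that takes an already-quoted call, as recommended above, might look like the following sketch. The name subset_q and its env argument are assumptions made for this example, not functions from base R. The wrapper captures the call and the calling environment once, then hands both to the quoted version.

```r
scramble <- function(x) x[sample(nrow(x)), ]

# Hypothetical programming-friendly version: condition_call is already quoted
subset_q <- function(x, condition_call, env = parent.frame()) {
  r <- eval(condition_call, x, env)
  x[r, ]
}

subscramble <- function(x, condition) {
  condition_call <- substitute(condition)  # capture the user's expression
  env <- parent.frame()                    # where the user's variables live
  scramble(subset_q(x, condition_call, env))
}

res <- subscramble(mtcars, cyl == 4)

# Also works when called from inside another function
f <- function() {
  k <- 6
  subscramble(mtcars, cyl == k)
}
res6 <- f()
```

The row order of res and res6 is random, but every row satisfies the condition.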

## Conclusion

Now that you understand how our version of subset works, go back and read the source code for subset.data.frame , the base R version which does a little more. Other functions that work similarly are with.default , within.data.frame , transform.data.frame , and, in the plyr package, the . function, arrange , and summarise . Look at the source code for these functions and see if you can figure out how they work.

## Exercises

1. Compare the simplified subset function described in this chapter with the real subset.data.frame . What's different? Why? How does the select parameter work?
2. Read the code for transform.data.frame and subset.data.frame . What do they do and how do they work? Compare transform.data.frame to plyr::mutate : what's different?
3. Read the code for write.csv . How does it work? What are the advantages/disadvantages of this approach over calling write.table in the usual manner?

## Computing on the language

Writing R code that modifies R code.

## Basics of R code

There are three fundamental building blocks of R code:

* names, which represent the name, not value, of a variable
* constants, like "a" or 10
* calls, which represent a function call

Collectively I'll call these three things parsed code, because they each represent a stand-alone piece of R code that you could run from the command line. A call is made up of two parts:

* a name, giving the name of the function to be called
* arguments, a list of parsed code

Calls are recursive because the arguments to a call can be other calls, e.g. f(x, 1, g(), h(i())) . This means we can think about calls as trees. For example, we can represent that call as:

    \- f()
      \- x
      \- 1
      \- g()
      \- h()
        \- i()

Everything in R parses into this tree structure - even things that don't look like calls such as { , function , control flow, infix operators and assignment. The figure below shows the parse tree for some of these special constructs. All calls are labelled with "()" even if we don't normally think of them as function calls.

    draw_tree(expression(
      { a + b; 2 },
      function(x, y = 2) 3,
      (a - b) * 3,
      if (x < 3) y <- 3 else z <- 4,
      name(df) <- c("a", "b", "c"),
      -x
    ))
    # \- {()
    #   \- +()
    #     \- a
    #     \- b
    #   \- 2
    # \- function()
    #   \- list(x = , y = 2)
    #   \- 3
    #   \- "function(x, y = 2) 3"
    # \- *()
    #   \- (()
    #     \- -()
    #       \- a
    #       \- b
    #   \- 3
    # \- if()
    #   \- <()
    #     \- x
    #     \- 3
    #   \- <-()
    #     \- y
    #     \- 3
    #   \- <-()
    #     \- z
    #     \- 4
    # \- <-()
    #   \- name()
    #     \- df
    #   \- c()
    #     \- "a"
    #     \- "b"
    #     \- "c"
    # \- -()
    #   \- x

Code chunks can be built up into larger structures in two ways: with expressions or braced expressions:

* Expressions are lists of code chunks, and are created when you parse a file. Expressions have one special behaviour compared to lists: when you eval() an expression, it evaluates each piece in turn and returns the result of the last piece of parsed code.
* Braced expressions represent complex multiline functions as a call to the special function { , with one argument for each code chunk in the function. Despite the name, braced expressions are not actually expressions, although they accomplish much the same task.

## Code to text and back again

As well as its representation as an abstract syntax tree (AST), code also has a string representation. This section shows how to go from a string to an AST, and from an AST to a string.

The parse function converts a string into an expression. It's called parse because this is the formal CS name for converting a string representing code into a format that the computer can work with. Note that parse defaults to working with files - if you want to parse a string, you need to use the text argument.

The deparse function is an almost inverse of parse - it converts a call into a text string representing that call. It's an almost inverse because it's impossible to be completely symmetric. deparse returns a character vector with an entry for each line - if you want a single string, be sure to paste it back together.

A common idiom in R functions is deparse(substitute(x)) - this will capture the character representation of the code provided for the argument x . Note that you must run this code before you do anything to x , because substitute can only access the code which will be used to compute x before the promise is evaluated.
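A short round trip illustrates both directions, along with the deparse(substitute(x)) idiom. The function name label is made up for this sketch.

```r
# String -> call: parse returns an expression, so take its first element
code <- parse(text = "vs == am")[[1]]
class(code)    # "call"

# Call -> string
deparse(code)  # "vs == am"

# The deparse(substitute(x)) idiom, collapsing multi-line output to one string
label <- function(x) paste(deparse(substitute(x)), collapse = " ")
label(a + b * 2)  # "a + b * 2"
```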

## Missing argument

If you experiment with the structure of function arguments, you might come across a rather puzzling beast: the "missing" object:

    formals(plot)$x
    x <- formals(plot)$x
    x
    # Error: argument "x" is missing, with no default

It is basically an empty symbol, but you can't create it directly:

    is.symbol(formals(plot)$x)
    # [1] TRUE
    deparse(formals(plot)$x)
    # [1] ""
    as.symbol("")
    # Error in as.symbol("") : attempt to use zero-length variable name

You can either capture it from a missing argument in the formals of a function, as above, or create it with substitute() or bquote() .

## Modifying calls

It's a bad idea to create code by operating on its string representation: there is no guarantee that you'll create valid code. Instead, you should use tools like substitute and bquote to modify expressions, where you are guaranteed to produce syntactically correct code (although of course it's still easy to make code that doesn't work).

We've seen substitute used for its ability to capture the unevaluated expression of a promise, but it also has another important role for modifying expressions. The second argument to substitute , env , can be an environment or list giving a set of replacements. It's easiest to see this with an example:

    substitute(a + b, list(a = 1, b = 2))
    # 1 + 2

Note that substitute expects R code in its first argument, not parsed code:

    x <- quote(a + b)
    substitute(x, list(a = 1, b = 2))
    # x

If you want to substitute in a variable or function call, you need to be careful to supply the right type of object to substitute:

    substitute(a + b, list(a = y))
    # Error: object 'y' not found
    substitute(a + b, list(a = "y"))
    # "y" + b
    substitute(a + b, list(a = as.name("y")))
    # y + b
    substitute(a + b, list(a = quote(y)))
    # y + b
    substitute(a + b, list(a = y()))
    # Error: could not find function "y"
    substitute(a + b, list(a = call("y")))
    # y() + b
    substitute(a + b, list(a = quote(y())))
    # y() + b

Another useful tool is bquote . It quotes an expression apart from any terms wrapped in .() which it evaluates:

    x <- 5
    bquote(y + x)
    # y + x
    bquote(y + .(x))
    # y + 5
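Calls built with bquote or substitute are ordinary call objects, so they can be evaluated later with eval . The sketch below builds one call each way and evaluates it against a list that supplies the remaining variable; cl and cl2 are arbitrary names.

```r
x <- 5
cl <- bquote(y + .(x))         # the value of x is inlined, y stays a symbol
deparse(cl)                    # "y + 5"
eval(cl, list(y = 1))          # 6

cl2 <- substitute(a + b, list(a = as.name("y"), b = 2))
deparse(cl2)                   # "y + 2"
eval(cl2, list(y = 10))        # 12
```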

You can also modify calls because of their list-like behaviour: just like a list, a call has length , [[ and [ methods. The length of a call minus 1 gives the number of arguments:

    x <- quote(write.csv(x, "important.csv", row.names = FALSE))
    length(x) - 1
    # [1] 3

The first element of the call is the name of the function:

    x[[1]]
    # write.csv
    is.name(x[[1]])
    # [1] TRUE

The remaining elements are the arguments to the function, which can be extracted by name or by position.

    x$row.names
    # FALSE
    x[[3]]
    # [1] "important.csv"

You can modify elements of the call with replacement operators:

    x$col.names <- FALSE
    x
    # write.csv(x, "important.csv", row.names = FALSE, col.names = FALSE)

    x[[5]] <- NULL
    x[[3]] <- "less-important.csv"
    x
    # write.csv(x, "less-important.csv", row.names = FALSE)

Calls also support the [ method, but use it with care: it produces a call object, and it's easy to produce invalid calls. If you want to get a list of the arguments, explicitly convert to a list.

    x <- quote(write.csv(x, "important.csv", row.names = FALSE))
    x[-3] # remove the second argument
    # write.csv(x, row.names = FALSE)

    x[-1] # just look at the arguments - but is still a call!
    # x("important.csv", row.names = FALSE)

    as.list(x[-1])
    # [[1]]
    # x
    #
    # [[2]]
    # [1] "important.csv"
    #
    # $row.names
    # [1] FALSE

## Creating a function

* when you can't use a closure because you don't know in advance what the arguments will be, and similarly you can't use substitute (need to make sure both of those sections point here)
* the function function, and its two arguments: arglist and body
* pairlists, and creating arguments with no default values: http://stackoverflow.com/questions/8611080/how-do-i-create-pairlist-with-empty-elements-in-r
* correcting the environment of the function
* look at examples in email from John Nash and Randy Pruim

A first version:

    make_function <- function(args, body, env) {
      args <- as.pairlist(args)
      stopifnot(is.language(body))
      f <- eval(call("function", args, body))
      environment(f) <- env
      f
    }
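As a rough illustration of how make_function might be used, the sketch below builds a two-argument function from an argument list and a quoted body. Giving env a default of parent.frame() is an assumption added for convenience here; alist is used because it allows arguments with no default value.

```r
make_function <- function(args, body, env = parent.frame()) {
  args <- as.pairlist(args)
  stopifnot(is.language(body))
  f <- eval(call("function", args, body))
  environment(f) <- env
  f
}

# a has no default, b defaults to 2
add <- make_function(alist(a = , b = 2), quote(a + b))

names(formals(add))  # "a" "b"
add(1)               # 3
add(1, 10)           # 11
```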

## Walking the code tree

Because code is a tree, we're going to need recursive functions to work with it. You now have basic understanding of how code in R is put together internally, and so we can now start to write some useful functions. In this section we'll write some functions to tackle some useful problems. In particular, we will look at two common development mistakes.

Find assignments Replace T with TRUE and F with FALSE The codetools package, included in the base distribution, provides some other built -in tools based for automated code inspection: findGlobals : locates all global variables used by a function. This can be useful if you want to check that your functions don't inadvertently rely on variables defined in their parent environment. checkUsage : checks for a range of common problems including unused local variables, unused parameters and use of partial argument matching. showTree : works in a similar way to the drawTree used above, but presents the parse tree in a more compact format based on Lisp's S-expressions

Find assignment
The key to any function that works with the parse tree is getting the recursion right, which means making sure that you know what the base case is (the leaves of the tree) and figuring out how to combine the results from the recursive case. The nodes of the tree can be any recursive data structure that can appear in a quoted object: pairlists, calls, and expressions. The leaves can be 0-argument calls (e.g. f()), atomic vectors, names or NULL. One way to detect a leaf is that its length is less than or equal to 1. One way to detect a node is to use is.recursive.

In this section, we will write a function that figures out all variables that are created by assignment in an expression. We'll start simply, and make the function progressively more rigorous. One reason to start with this function is that the recursion is a little bit simpler - we never need to go all the way down to the leaves because we are looking for assignment, a call to <-. This means that our base case is simple: if we're at a leaf, we've gone too far and can immediately return. We have two other cases: either we have hit a call, in which case we should check if it's <-, or it's some other recursive structure and we should call the function recursively on each element. Note the use of identical to compare the first element of the call to the name of the assignment function, and recall that the second element of a call object is the first argument, which for <- is the left hand side: the object being assigned to.

    find_assign <- function(obj) {
      if (length(obj) <= 1) {
        NULL
      } else if (is.call(obj) && identical(obj[[1]], as.name("<-"))) {
        obj[[2]]
      } else {
        lapply(obj, find_assign)
      }
    }
    find_assign(quote(a <- 1))
    find_assign(quote({
      a <- 1
      b <- 2
    }))

This function seems to work for these simple cases, but it's a bit verbose. Instead of returning a list, let's keep it simple and stick with a character vector. We'll also test it with two slightly more complicated examples:
    find_assign <- function(obj) {
      if (length(obj) <= 1) {
        character(0)
      } else if (is.call(obj) && identical(obj[[1]], as.name("<-"))) {
        as.character(obj[[2]])
      } else {
        unlist(lapply(obj, find_assign))
      }
    }
    find_assign(quote({
      a <- 1
      b <- 2
      a <- 3
    }))
    # [1] "a" "b" "a"
    find_assign(quote({
      system.time(x <- print(y <- 5))
    }))
    # [1] "x"

The fix for the first problem is easy: we need to wrap unique around the recursive case to remove duplicate assignments. The second problem is a bit more subtle: it's possible to do assignment within the arguments to a call, but we're failing to recurse down into this case.

    find_assign <- function(obj) {
      if (length(obj) <= 1) {
        character(0)
      } else if (is.call(obj) && identical(obj[[1]], as.name("<-"))) {
        call <- as.character(obj[[2]])
        c(call, unlist(lapply(obj[[3]], find_assign)))
      } else {
        unique(unlist(lapply(obj, find_assign)))
      }
    }
    find_assign(quote({
      a <- 1
      b <- 2
      a <- 3
    }))
    # [1] "a" "b"
    find_assign(quote({
      system.time(x <- print(y <- 5))
    }))
    # [1] "x" "y"

There's one more case we need to test:

    find_assign(quote({
      ls <- list()
      ls$a <- 5
      names(ls) <- "b"
    }))
    # [1] "ls"    "$"     "a"     "names"
    # (as.character() splits the calls ls$a and names(ls) into their pieces)

    draw_tree(quote({
      ls <- list()
      ls$a <- 5
      names(ls) <- "b"
    }))
    # \- {()
    #   \- <-()
    #     \- ls
    #     \- list()
    #   \- <-()
    #     \- $()
    #       \- ls
    #       \- a
    #     \- 5
    #   \- <-()
    #     \- names()
    #       \- ls
    #     \- "b"

This behaviour might be ok, but we probably just want assignment into whole objects, not assignment that modifies some property of the object. Drawing the tree for that quoted object helps us see what condition we should test for: we want the object on the left hand side of the assignment to be a name. This gives the final version of the find_assign function.

    find_assign <- function(obj) {
      if (length(obj) <= 1) {
        character(0)
      } else if (is.call(obj) && identical(obj[[1]], as.name("<-"))) {
        call <- if (is.name(obj[[2]])) as.character(obj[[2]])
        c(call, unlist(lapply(obj[[3]], find_assign)))
      } else {
        unique(unlist(lapply(obj, find_assign)))
      }
    }
    find_assign(quote({
      ls <- list()
      ls$a <- 5
      names(ls) <- "b"
    }))
    # [1] "ls"

Making this function absolutely correct requires quite a lot more work, because we need to figure out all the other ways that assignment might happen: with = , assign , or delayedAssign .
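As a rough sketch of one step in that direction, the variant below also treats = as assignment and recurses into the whole right hand side, so chained assignments such as x <- y <- 1 are picked up. The name find_assign2 is hypothetical, and the sketch still ignores assign and delayedAssign.

```r
# Sketch only: extends the idea above to `=`; assign() and delayedAssign()
# are still not handled.
find_assign2 <- function(obj) {
  if (length(obj) <= 1) {
    character(0)
  } else if (is.call(obj) &&
             (identical(obj[[1]], as.name("<-")) ||
              identical(obj[[1]], as.name("=")))) {
    # Only count assignment into a whole object (a name on the left)
    lhs <- if (is.name(obj[[2]])) as.character(obj[[2]])
    # Recurse into the whole right hand side so nested assignments are found
    unique(c(lhs, find_assign2(obj[[3]])))
  } else {
    unique(unlist(lapply(obj, find_assign2)))
  }
}

find_assign2(quote({
  a = 1
  b <- 2
}))
find_assign2(quote(x <- y <- 1))
# A named argument is not a call to `=`, so it is not reported
find_assign2(quote(f(a = 1)))
```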

## Replacing logical shortcuts

We'll start just by locating logical abbreviations. A logical abbreviation is the name T or F. We need to recursively break up any recursive objects; anything else is not a shortcut.

    find_logical_abbr <- function(obj) {
      if (is.name(obj)) {
        identical(obj, as.name("T")) || identical(obj, as.name("F"))
      } else if (is.recursive(obj)) {
        any(vapply(obj, find_logical_abbr, logical(1)))
      } else {
        FALSE
      }
    }

    f <- function(x = T) 1
    g <- function(x = 1) 2
    h <- function(x = 3) T

    find_logical_abbr(body(f))
    find_logical_abbr(formals(f))
    find_logical_abbr(body(g))
    find_logical_abbr(formals(g))
    find_logical_abbr(body(h))
    find_logical_abbr(formals(h))

To replace T with TRUE and F with FALSE, we need to make our recursive function return either the original object or the modified object, making sure to put calls back together appropriately. The main difference here is that we need to be much more careful with recursive objects, because each type needs to be treated differently:

- pairlists, which are used for function formals, can be processed as lists, but need to be turned back into pairlists with as.pairlist
- expressions can again be processed like lists, but need to be turned back into expression objects with as.expression. You won't find these inside a quoted object, but this might be the quoted object that you're passing in.
- calls require the most special treatment: don't process the first element (that's the name of the function), and turn the result back into a call with as.call

This leads to:

    fix_logical_abbr <- function(obj) {
      if (is.name(obj)) {
        if (identical(obj, as.name("T"))) {
          quote(TRUE)
        } else if (identical(obj, as.name("F"))) {
          quote(FALSE)
        } else {
          obj
        }
      } else if (is.call(obj)) {
        args <- lapply(obj[-1], fix_logical_abbr)
        as.call(c(list(obj[[1]]), args))
      } else if (is.pairlist(obj)) {
        as.pairlist(lapply(obj, fix_logical_abbr))
      } else if (is.expression(obj)) {
        as.expression(lapply(obj, fix_logical_abbr))
      } else {
        obj
      }
    }

However, this function is still not that useful because it loses all non-code structure:

    g <- quote(f <- function(x = T) {
      # This is a comment
      if (x)                  return(4)
      if (emergency_status()) return(T)
    })
    fix_logical_abbr(g)
    # f <- function(x = TRUE) {
    #   if (x)
    #     return(4)
    #   if (emergency_status())
    #     return(TRUE)
    # }

    g <- parse(text = "f <- function(x = T) {
      # This is a comment
      if (x)                  return(4)
      if (emergency_status()) return(T)
    }")
    attr(g, "srcref")
    attr(g, "srcref")[[1]]
    as.character(attr(g, "srcref")[[1]])

    source(textConnection("f <- function(x = T) {
      # This is a comment
      if (x)                  return(4)
      if (emergency_status()) return(T)
    }"))

If we want to be able to apply this function to an existing code base, we need to be able to maintain all the non-code structure. This leads us to our next topic: srcrefs.

Source references
Would have to recursively parse.

## The parser package

Exceptions Debugging
This chapter describes techniques to use when things go wrong:

- Exceptions: dealing with errors in your code
- Debugging: understanding or figuring out problems in other people's code. The debugging techniques are also useful when you're trying to understand other people's R code, or R code that I've highlighted throughout this book that you should be able to tease apart and figure out how it works.
- Getting help: what to do if you can't figure out what the problem is

As with many other parts of R, the approach to dealing with errors and exceptions comes from a LISP heritage, and is quite different (although some of the terminology is the same) to that of languages like Java.

## Interactive analysis vs. programming

There is a tension between interactive analysis and programming. When you are doing an analysis, you want R to do what you mean, and if it guesses wrong, you'll discover it right away and can fix it. If you're creating a function, you want to make it as robust as possible so that any problems become apparent right away (see fail fast below).
- Be explicit:
  - Be explicit about missings
  - Use TRUE and FALSE instead of T and F
  - Try to avoid relying on position matching or partial name matching when calling functions
- Avoid functions that can return different types of objects:
  - Always use drop = FALSE
  - Don't use sapply: use vapply, or lapply plus the appropriate transformation
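A small sketch of why the last two points matter; the data frame df and the empty list are made up for illustration:

```r
df <- data.frame(a = 1:3, b = c("x", "y", "z"))

# Without drop = FALSE, selecting a single column silently returns a vector
class(df[, "a"])
# With drop = FALSE the result is always a data frame
class(df[, "a", drop = FALSE])

# sapply() changes its return type on empty input; vapply() does not
empty <- list()
class(sapply(empty, length))              # a list
class(vapply(empty, length, integer(1)))  # an integer vector
```

Code that calls these functions on inputs of varying size keeps working when the result type is fixed in advance.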

Debugging
The key function for performing a post-mortem on an error is traceback, which shows all the calls leading up to the error. This can help you figure out where the error occurred, but often you'll need the interactive environment created by browser, which understands the following commands:

- c: leave interactive debugging and continue execution
- n: execute the next step. Be careful if you have a variable named n: to print it you'll need to be explicit and use print(n).
- return (pressing Enter on an empty line): the default behaviour is the same as c, but this is somewhat dangerous as it makes it very easy to accidentally continue during debugging. I recommend options(browserNLdisabled = TRUE) so that return is simply ignored.
- Q: stop debugging, terminate the function and return to the global workspace
- where: print a stack trace of active calls (the interactive equivalent of traceback)

Don't forget that you can combine if statements with browser() to only debug when a certain situation occurs.
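For instance, a conditional breakpoint might look like the sketch below. The function standardise and the choice of condition (missing values in the input) are hypothetical; the interactive() guard is an added assumption so that scripts run in batch mode are not interrupted.

```r
standardise <- function(x) {
  # Only drop into the debugger for the suspicious case, and only
  # when a person is there to use it
  if (interactive() && anyNA(x)) browser()
  (x - mean(x, na.rm = TRUE)) / sd(x, na.rm = TRUE)
}

standardise(c(1, 2, 3))      # never stops: no missing values
# standardise(c(1, NA, 3))   # would start browser() in an interactive session
```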

## Browsing arbitrary R code

There are two ways to insert browser() statements in arbitrary R code:

- debug inserts a browser statement in the first line of the specified function. undebug will remove it, or you can use debugonce to insert a browser call for the next run, and have it automatically removed afterwards.
- utils::setBreakpoint does the same thing, but instead inserts browser in the function corresponding to the specified file name and line number.

These two functions are both special cases of trace(), which allows you to insert arbitrary code in any position in an existing function. The complement of trace is untrace. You can only perform one trace per function - subsequent traces will replace prior ones.

Locating warnings is a little trickier. The easiest way is to turn them into errors with options(warn = 2) and then use the standard functions described above. Turn back to the default behaviour with options(warn = 0).
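A brief sketch of the effect of options(warn = 2): a call that would normally only warn now raises an error, so the usual error tools (traceback, browser, tryCatch) apply. The input string is made up, and saving and restoring the old option value is a choice made for this example.

```r
old <- options(warn = 2)        # warnings now become errors
res <- tryCatch(
  as.numeric("not a number"),   # would normally only warn
  error = function(e) e
)
options(old)                    # restore the previous setting

inherits(res, "error")
conditionMessage(res)
```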

Browsing on error
It's also possible to start browser automatically when an error occurs, by setting options(error = browser). This will start the interactive debugger in the environment in which the error occurred. Other functions that you can supply to error are:

- recover: a step up from browser, as it allows you to drill down into any of the calls in the call stack. This is useful because often the cause of the error is a number of calls back - you're just seeing the consequences. This is the result of "fail-slow" code.
- dump.frames: an equivalent to recover for non-interactive code. It will save an rdata file containing the nested environments where the error occurred. This allows you to later use debugger to re-create the error as if you had called recover from where the error occurred.
- NULL: the default. Prints an error message and stops function execution.

We can use withCallingHandlers to set up something similar for warnings. The following function will call the specified action when a warning is generated. The code is slightly tricky because we need to find the right environment to evaluate the action - it should be the function that calls warning.

    on_warning <- function(action, code) {
      q_action <- substitute(action)
      withCallingHandlers(code, warning = function(c) {
        # Walk the call stack until we find the call to warning()
        for (i in seq_len(sys.nframe())) {
          f <- as.character(sys.call(i)[[1]])
          if (identical(f, "warning")) break
        }
        # Evaluate the action in the frame that called warning()
        eval(q_action, sys.frame(i - 1))
      })
    }

    x <- 1
    f <- function() {
      x <- 2
      g()
    }
    g <- function() {
      x <- 3
      warning("Leaving g")
    }
    on_warning(browser(), f())
    on_warning(recover(), f())

## Creative uses of trace

Trace is a useful debugging function that, along with some of our computing on the language tools, can be used to set up warnings on a large number of functions at a time. This is useful for automatically detecting some of the errors described above. The first step is to find all functions that have a na.rm argument. We'll do this by first building a list of all functions in base and stats, then inspecting their formals.

    # ls() takes one environment at a time, so list each package separately
    objs <- c(ls("package:base"), ls("package:stats"))
    has_missing_arg <- function(name) {
      x <- get(name)
      if (!is.function(x)) return(FALSE)
      args <- names(formals(x))
      "na.rm" %in% args
    }
    f_miss <- Filter(has_missing_arg, objs)

Next, we write a version of trace vectorised over the function name, and then use that function to add a warning to every function that we found above.
    trace_all <- function(fs, tracer, ...) {
      lapply(fs, trace, tracer = tracer, print = FALSE, ...)
      invisible(NULL)
    }
    trace_all(f_miss, quote(if (missing(na.rm)) stop("na.rm not set")))
    # But misses primitives

    pmin(1:10, 1:10)
    # Error in eval(expr, envir, enclos) : na.rm not set
    pmin(1:10, 1:10, na.rm = TRUE)
    # [1] 1 2 3 4 5 6 7 8 9 10

One problem with this approach is that we don't automatically pick up any primitive functions, because for these functions formals returns NULL.

Exceptions
Defensive programming is the art of making code fail in a well-defined manner even when something unexpected occurs. There are two components of this art related to exceptions: raising exceptions as soon as you notice something has gone wrong, and responding to errors as cleanly as possible. A general principle for errors is to "fail fast": as soon as you figure out that something is wrong, for example because your inputs are not as expected, you should raise an error.

Creating
There are a number of options for letting the user know when something has gone wrong:

- don't use cat() or print(), except for print methods, or for optional debugging information.
- use message() to inform the user about something expected - I often do this when filling in important missing arguments that have a non-trivial computation or impact. Two examples are reshape2::melt, which informs the user what melt and id variables were used if not specified, and plyr::join, which informs the user which variables were used to join the two tables.
- use warning() for unexpected problems that aren't show stoppers. options(warn = 2) will turn warnings into errors. Warnings are often more appropriate for vectorised functions when a single value in the vector is incorrect, e.g. log(-1:2) and sqrt(-1:2).
- use stop() when the problem is so big you can't continue.
- stopifnot() is a quick and dirty way of checking that pre-conditions for your function are met. The problem with stopifnot is that if they aren't met, it will display the test code as an error, not a more informative message.

Checking pre-conditions with stopifnot is better than nothing, but it's better still to check the condition yourself and return an informative message with stop().
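A minimal sketch of the difference, using a hypothetical function rescale01 that expects a numeric vector: the explicit check costs a few more lines, but the message tells the user what was expected and what was supplied.

```r
rescale01 <- function(x) {
  # Check the pre-condition ourselves so the error message is informative
  if (!is.numeric(x)) {
    stop("`x` must be numeric, not ", class(x)[1], call. = FALSE)
  }
  rng <- range(x, na.rm = TRUE)
  (x - rng[1]) / (rng[2] - rng[1])
}

rescale01(c(0, 5, 10))
# Capture the message instead of stopping, to show what the user would see
msg <- tryCatch(rescale01("a"), error = function(e) conditionMessage(e))
msg
```

With stopifnot(is.numeric(x)) the user would instead see the test code itself reported as the error.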

Handling
Error handling is performed with the try and tryCatch functions. try is the simpler of the two, so we'll start with it. The try function allows execution to continue even if an exception occurs, and is particularly useful when operating on elements in a loop. The silent argument controls whether or not the error is still printed.

    elements <- list(1:10, c(-1, 10), c(TRUE, FALSE), letters)
    # Fails on the last element, so no results are returned at all
    results <- lapply(elements, log)
    # Continues past the failure
    results <- lapply(elements, function(x) try(log(x)))

If code fails, try invisibly returns an object of class try-error . There isn't a built in function for testing for this class, so we'll define one. Then we can easily strip all errors out of a list with Filter :
    is.error <- function(x) inherits(x, "try-error")
    successful <- Filter(Negate(is.error), results)

Or if we want to know where the successes and failures were, we can use sapply or vapply:

    vapply(results, is.error, logical(1))
    sapply(results, is.error)

    # (Aside: vapply is preferred when you're writing a function because
    # it guarantees you'll always get a logical vector as a result, even if
    # your list has length zero)
    sapply(NULL, is.error)
    vapply(NULL, is.error, logical(1))

If we wanted to avoid the anonymous function call, we could create our own function to automatically wrap a call in a try :
    try_to <- function(f, silent = FALSE) {
      function(...) try(f(...), silent = silent)
    }
    results <- lapply(elements, try_to(log))

tryCatch gives more control than try, but to understand how it works, we first need to learn a little about conditions, the S3 objects that represent errors, warnings and messages.

is.condition <- function(x) inherits(x, "condition")

There are three convenience functions for creating errors, warnings and messages. All take two arguments: the message to display, and an optional call indicating where the condition was created.

    e <- simpleError("My error", quote(f(x = 71)))
    w <- simpleWarning("My warning")
    m <- simpleMessage("My message")

There is one class of conditions that can't be created directly: interrupts, which occur when the user presses Ctrl + Break, Escape, or Ctrl + C (depending on the platform) to terminate execution. The components of a condition can be extracted with conditionMessage and conditionCall:

    conditionMessage(e)
    conditionCall(e)

Conditions can be signalled using signalCondition . By default, no one is listening, so this doesn't do anything.
    signalCondition(e)
    signalCondition(w)
    signalCondition(m)

To listen to signals, we have two tools: tryCatch and withCallingHandlers. tryCatch is an exiting handler: it catches the condition, but the rest of the code after the exception is not run. withCallingHandlers sets up calling handlers: it catches the condition, and then resumes execution of the code. We will focus first on tryCatch.

The tryCatch call has three arguments:

- expr: the code to run.
- ...: a set of named arguments setting up error handlers. If an error occurs, tryCatch will call the first handler whose name matches one of the classes of the condition. The only useful names for built-in conditions are interrupt, error, warning and message.
- finally: code to run regardless of whether expr succeeds or fails. This is useful for clean up, as described below. All handlers have been turned off by the time the finally code is run, so errors will propagate as usual.

The following examples illustrate the basic properties of tryCatch:

    # Handlers are passed a single argument
    tryCatch(stop("error"),
      error = function(...) list(...)
    )
    # This argument is the signalled condition, so we'll call
    # it c for short.

    # If multiple handlers match, the first is used
    tryCatch(stop("error"),
      error = function(c) "a",
      error = function(c) "b"
    )

    # If multiple signals are nested, the most internal is used first.
    tryCatch(
      tryCatch(stop("error"), error = function(c) "a"),
      error = function(c) "b"
    )

    # Uncaught signals propagate outwards.
    tryCatch(
      tryCatch(stop("error")),
      error = function(c) "b"
    )

    # The first handler that matches a class of the condition is used,
    # not the "best" match:
    a <- structure(list(message = "my error", call = quote(a)),
      class = c("a", "error", "condition"))
    tryCatch(stop(a),
      error = function(c) "error",
      a = function(c) "a"
    )
    tryCatch(stop(a),
      a = function(c) "a",
      error = function(c) "error"
    )

    # No matter what happens, finally is run:
    tryCatch(stop("error"),
      finally = print("Done."))
    tryCatch(a <- 1,
      finally = print("Done."))

    # Any errors that occur in the finally block are handled normally
    a <- 1
    tryCatch(a <- 2,
      finally = stop("Error!"))
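The difference between exiting and calling handlers can be sketched with a message condition. The variable log_steps is a made-up name used only to record what ran, and the use of the muffleMessage restart to silence the message is a choice made for this example.

```r
log_steps <- character(0)

# Exiting handler: control leaves the expression as soon as the
# message is signalled, so the code after message() never runs
res_exit <- tryCatch({
  message("note")
  log_steps <- c(log_steps, "after message (tryCatch)")
  "finished"
}, message = function(c) "handled")

# Calling handler: the handler runs, then execution resumes where
# the message was signalled
res_call <- withCallingHandlers({
  message("note")
  log_steps <- c(log_steps, "after message")
  "finished"
}, message = function(c) {
  log_steps <<- c(log_steps, "handler ran")
  invokeRestart("muffleMessage")  # stop the message from being printed
})

res_exit
res_call
log_steps
```

The value of the tryCatch call is the handler's return value, while the value of the withCallingHandlers call is the value of the code itself.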

## What can handler functions do?

- Return a value.
- Pass the condition along, by re-signalling the error with stop(c), or signalCondition(c) for non-error conditions.
- Kill the function completely and return to the top-level with invokeRestart("abort")

A common use for try and tryCatch is to set a default value if a condition fails. The easiest way to do this is to assign within the try:

    success <- FALSE
    try({
      # Do something that might fail
      success <- TRUE
    })

We can write a simple version of try using tryCatch . The real version of try is considerably more complicated to preserve the usual error behaviour.
    try <- function(code, silent = FALSE) {
      tryCatch(code, error = function(c) {
        if (!silent) message("Error: ", conditionMessage(c))
        invisible(structure(conditionMessage(c), class = "try-error"))
      })
    }
    try(1)
    try(stop("Hi"))
    try(stop("Hi"), silent = TRUE)
    rm(try)

    # A calling handler that simply returns does not stop the error:
    # it still propagates, and a <- 2 is never run
    withCallingHandlers({
      a <- 1
      stop("Error")
      a <- 2
    }, error = function(c) {})

Using tryCatch

With the basics in place, we'll next develop some useful tools based on the ideas we just learned about.

    browse_on_error <- function(f) {
      function(...) {
        tryCatch(f(...), error = function(c) browser())
      }
    }

The finally argument to tryCatch is very useful for clean up, because it is always called, regardless of whether the code executed successfully or not. This is useful when you have:

- modified options, par or locale
- opened connections, or created temporary files and directories
- opened graphics devices
- changed the working directory
- modified environment variables

The following function changes the working directory, executes some code, and always resets the working directory back to what it was before, even if the code raises an error.

    in_dir <- function(path, code) {
      cur_dir <- getwd()
      tryCatch({
        setwd(path)
        force(code)
      }, finally = setwd(cur_dir))
    }
    getwd()
    in_dir(R.home(), dir())
    getwd()
    in_dir(R.home(), stop("Error!"))
    getwd()

Another more casual way of cleaning up is the on.exit function, which is called when the function terminates. It's not as fine grained as tryCatch , but it's a bit less typing.
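A sketch of the same idea written with on.exit; the name in_dir2 and the use of tempdir() as the target directory are assumptions made for illustration:

```r
in_dir2 <- function(path, code) {
  old <- setwd(path)               # setwd() returns the previous directory
  on.exit(setwd(old), add = TRUE)  # runs however the function exits
  force(code)
}

before <- getwd()
inside <- in_dir2(tempdir(), getwd())
# The directory is restored even when the code fails
try(in_dir2(tempdir(), stop("Error!")), silent = TRUE)
after <- getwd()
identical(before, after)
```

The clean-up is registered right next to the change it undoes, which makes it harder to forget.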

Getting help
Currently, there are two main venues to get help when you are stuck and can't figure out what's causing the problem: stackoverflow and the R-help mailing list. While you can get some excellent advice on R-help, fools are not tolerated gladly, so follow the tips below for getting the best help:

- Make sure you have the latest version of R, and of the package (or packages) you are having problems with. It may be that your problem is the result of a bug that has been fixed recently.
- If it's not clear where the problem is, include the results of sessionInfo() so others can see your R setup.
- Spend some time creating a reproducible example. Make sure you have run your script in a vanilla R session (R --vanilla). This is often a useful exercise by itself, because in the course of making the problem reproducible you figure out what's causing the problem.

Exercises
1. Write a function that walks the code tree to find all function calls that are missing an explicit drop argument but need one.

2. Write a function that takes code as an argument and runs that code with options(warn = 2) and returns options back to their previous values on exit (either from an exception or a normal exit).

3. Write a function that opens a graphics device, runs the supplied code, and closes the graphics device (always, regardless of whether or not the plotting code worked).

S3
R has three object oriented (OO) systems: S3, S4 and R5. This page describes S3.

Central to any object-oriented system are the concepts of class and method. A class defines a type of object, describing what properties it possesses, how it behaves, and how it relates to other types of objects. Every object must be an instance of some class. A method is a function associated with a particular type of object.

S3 implements a style of object oriented programming called generic-function OO. This is different to most programming languages, like Java, C++ and C#, which implement message-passing OO. In message-passing style, messages (methods) are sent to objects and the object determines which function to call. Typically this object has a special appearance in the method call, usually appearing before the name of the method/message: e.g. canvas.drawRect("blue"). S3 is different. While computations are still carried out via methods, a special type of function called a generic function decides which method to call. Methods are defined in the same way as a normal function, but are called in a different way, as we'll see shortly.

The primary use of OO programming in R is for print, summary and plot methods. These methods allow us to have one generic function, e.g. print(), that displays the object differently depending on its type: printing a linear model is very different to printing a data frame.

Object class
The class of an object is determined by its class attribute, a character vector of class names. The following example shows how to create an object of class foo :
    x <- 1
    attr(x, "class") <- "foo"
    x

    # Or in one line
    x <- structure(1, class = "foo")
    x

Class is stored as an attribute, but it's better to modify it using the class() function, since this communicates your intent more clearly:

    class(x) <- "foo"
    class(x)
    # [1] "foo"

You can use this approach to turn any object into an object of class "foo", whether it makes sense or not. Objects are not limited to a single class, and can have many classes:

    class(x) <- c("A", "B")
    class(x) <- LETTERS

As discussed in the next section, R looks for methods in the order in which they appear in the class vector. So in this example, it would be like class A inherits from class B - if a method isn't defined for A, it will fall back to B. However, if you switched the order of the classes, the opposite would be true! This is because S3 doesn't define any formal relationship between classes, or even any definition of what an individual class is. If you're coming from a strict environment like Java, this will seem pretty frightening (and it is!) but it does give your users a tremendous amount of freedom. While it's very difficult to stop someone from doing something you don't want them to do, your users will never be held back because there is something you haven't implemented yet.

## Generic functions and method dispatch

Method dispatch starts with a generic function that decides which specific method to dispatch to. Generic functions all have the same form: a call to UseMethod that specifies the generic name and the object to dispatch on. This means that generic functions are usually very simple, like mean:

    mean <- function (x, ...) {
      UseMethod("mean", x)
    }

Methods are ordinary functions that use a special naming convention: generic.class:

    mean.numeric <- function(x, ...) sum(x) / length(x)
    mean.data.frame <- function(x, ...) sapply(x, mean, ...)
    mean.matrix <- function(x, ...) apply(x, 2, mean)

(These are somewhat simplified versions of the real code). As you might guess from this example, UseMethod uses the class of x to figure out which method to call. If x had more than one class, e.g. c("foo","bar") , UseMethod would look for mean.foo and if not found, it would then look for mean.bar . As a final fallback, UseMethod will look for a default method, mean.default , and if that doesn't exist it will raise an error. The same approach applies regardless of how many classes an object has:
    x <- structure(1, class = letters)
    bar <- function(x) UseMethod("bar", x)
    bar.z <- function(x) "z"
    bar(x)
    # [1] "z"

Once UseMethod has found the correct method, it's invoked in a special way. Rather than creating a new evaluation environment, it uses the environment of the current function call (the call to the generic), so any assignments or evaluations that were made before the call to UseMethod will be accessible to the method. The arguments that were used in the call to the generic are passed on to the method in the same order they were received.

Because methods are normal R functions, you can call them directly. However, you shouldn't do this because you lose the benefits of having a generic function:

    bar.x <- function(x) "x"
    # You can call methods directly, but you shouldn't!
    bar.x(x)
    # [1] "x"
    bar.z(x)
    # [1] "z"

Methods
To find out which classes a generic function has methods for, you can use the methods function. Remember that in R, methods are associated with functions (not objects), so you pass in the name of the function, rather than the class, as you might expect:

    methods("bar")
    # [1] bar.x bar.z
    methods("t")
    # [1] t.data.frame t.default    t.ts*
    #
    #    Non-visible functions are asterisked

Non-visible functions are functions that haven't been exported by a package, so you'll need to use the getAnywhere function to access them if you want to see the source.
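As a small sketch, the non-visible method t.ts from the listing above can be located and retrieved as follows. getS3method is not mentioned in the text; it is another utils function that returns the method directly.

```r
# Locate the hidden method and see where it was found
hidden <- getAnywhere("t.ts")
hidden$where

# Retrieve the method itself by generic and class
t_ts <- getS3method("t", "ts")
is.function(t_ts)
```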

Internal generics
Some internal C functions are also generic, which means that the method dispatch is not performed by an R function, but is instead performed by special C functions. It's important to know which functions are internally generic, so you can write methods for them, and so you're aware of the slight differences in method dispatch. It's not easy to tell if a function is internally generic, because it just looks like a typical call to C code:

    length <- function (x) .Primitive("length")
    cbind <- function (..., deparse.level = 1)
      .Internal(cbind(deparse.level, ...))

As well as length and cbind, internal generic functions include dim, c, as.character, names and rep. A complete list can be found in the global variable .S3PrimitiveGenerics, and more details are given in ?InternalMethods.

Internal generics have a slightly different dispatch mechanism to other generic functions: before trying the default method, they will also try dispatching on the mode of an object, i.e. mode(x). The following example shows the difference:

    x <- structure(as.list(1:10), class = "myclass")
    length(x)
    # [1] 10

    mylength <- function(x) UseMethod("mylength", x)
    mylength.list <- function(x) length(x)
    mylength(x)
    # Error in UseMethod("mylength", x) :
    #   no applicable method for 'mylength' applied to an object of class "myclass"

Inheritance
The NextMethod function provides a simple inheritance mechanism, using the fact that the class of an S3 object is a vector. This is very different behaviour to most other languages because it means that it's possible to have different inheritance hierarchies for different objects:
    baz <- function(x) UseMethod("baz", x)
    baz.A <- function(x) "A"
    baz.B <- function(x) "B"

    ab <- structure(1, class = c("A", "B"))
    ba <- structure(1, class = c("B", "A"))
    baz(ab)
    baz(ba)

NextMethod() works like UseMethod, but instead of dispatching on the first element of the class vector, it will dispatch based on the second (or subsequent) element:

    baz.C <- function(x) c("C", NextMethod())
    ca <- structure(1, class = c("C", "A"))
    cb <- structure(1, class = c("C", "B"))
    baz(ca)
    baz(cb)

The exact details are a little tricky: NextMethod doesn't actually work with the class attribute of the object, it uses a global variable ( .Class ) to keep track of which class to call next. This means that manually changing the class of the object will have no impact on the inheritance:
    # Turn object into class A - doesn't work!
    baz.D <- function(x) {
      class(x) <- "A"
      NextMethod()
    }
    da <- structure(1, class = c("D", "A"))
    db <- structure(1, class = c("D", "B"))
    baz(da)
    baz(db)

Methods invoked as a result of a call to NextMethod behave as if they had been invoked from the previous method. The arguments to the inherited method are in the same order and have the same names as the call to the current method, and are therefore the same as in the call to the generic. However, the expressions for the arguments are the names of the corresponding formal arguments of the current method. Thus the arguments will have values that correspond to their value at the time NextMethod was invoked. Unevaluated arguments remain unevaluated. Missing arguments remain missing.

If NextMethod is called in a situation where there is no second class it will return an error. A selection of these errors is shown below so that you know what to look for.

    c <- structure(1, class = "C")
    baz(c)
    # Error in UseMethod("baz", x) :
    #   no applicable method for 'baz' applied to an object of class "C"
    baz.C(c)
    # Error in NextMethod() : generic function not specified
    baz.C(1)
    # Error in NextMethod() : object not specified

(Contents adapted from the R language definition. This document is licensed with the GPL-2 license.)

Double dispatch
How does + work.

Best practices

Create a constructor function that checks the types of the input, and returns a list with the correct class label:

    new_XXX <- function(...) {}

Write a function to check if an object is of your class:

    is.XXX <- function(x) inherits(x, "XXX")

When implementing a vector class, you should implement these methods: length, [, [<-

When implementing a matrix/array class, you should implement these methods: dim (gets you nrow and ncol), t, dimnames (gets you rownames and colnames), dimnames<- (gets you colnames<-, rownames<-)
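As a rough illustration of the practices above, the sketch below builds a small S3 vector class. The class name money, its amount and currency fields, and the constructor name new_money are all hypothetical, chosen only for this example.

```r
# Hypothetical S3 vector class "money"; every name here is illustrative.
new_money <- function(amount, currency = "USD") {
  # Check the input types before attaching the class label
  stopifnot(is.numeric(amount), is.character(currency), length(currency) == 1)
  structure(list(amount = amount, currency = currency), class = "money")
}

# Class test
is.money <- function(x) inherits(x, "money")

# Vector-like behaviour: length, [ and [<-
length.money <- function(x) length(unclass(x)$amount)

`[.money` <- function(x, i) {
  new_money(unclass(x)$amount[i], unclass(x)$currency)
}

`[<-.money` <- function(x, i, value) {
  amounts <- unclass(x)$amount
  amounts[i] <- value
  new_money(amounts, unclass(x)$currency)
}

m <- new_money(c(5, 10, 20))
length(m)        # number of amounts, not number of list elements
m[2:3]$amount    # subsetting returns another money object
m[1] <- 7        # replacement goes back through the constructor
is.money(m)
```

Routing `[` and `[<-` back through the constructor means the type checks are re-run on every modified object, at the cost of a little overhead.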

## The S4 object system


R has three object oriented (OO) systems: S3, S4 and R5. This page describes S4. Compared to S3, the S4 object system is much stricter, and much closer to other OO systems. I recommend you familiarise yourself with the way that S3 works before reading this document - many of the underlying ideas are the same, but the implementation is much stricter. There are two major differences from S3:

formal class definitions: unlike S3, S4 formally defines the representation and inheritance for each class

multiple dispatch: the generic function can be dispatched to a method based on the class of any number of arguments, not just one

Here we introduce the basics of S4, trying to stay away from the esoterica and focussing on the ideas that you need to understand and write the majority of S4 code.

## Classes and instances

In S3, you can turn any object into an object of a particular class just by setting the class attribute. S4 is much stricter: you must define the representation of the class using setClass, and the only way to create an instance is through the constructor function new. A class has three key properties:

a name: an alpha-numeric string that identifies the class

representation: a list of slots (or attributes), giving their names and classes. For example, a person class might be represented by a character name and a numeric age, as follows: representation(name = "character", age = "numeric")

a character vector of classes that it inherits from, or in S4 terminology, contains. Note that S4 supports multiple inheritance, but this should be used with extreme caution as it makes method lookup extremely complicated.

You create a class with setClass:

    setClass("Person", representation(name = "character", age = "numeric"))
    setClass("Employee", representation(boss = "Person"), contains = "Person")

and create an instance of a class with new:

hadley <- new("Person", name = "Hadley", age = 31)

Unlike S3, S4 checks that all of the slots have the correct type:
    hadley <- new("Person", name = "Hadley", age = "thirty")
    # invalid class "Person" object: invalid object for slot "age" in class
    # "Person": got class "character", should be or extend class "numeric"

    hadley <- new("Person", name = "Hadley", sex = "male")
    # invalid names for slots of class "Person": sex

If you omit a slot, it will be initialised with the default object of the slot's class. To access slots of an S4 object you use @, not $:

    hadley <- new("Person", name = "Hadley")
    hadley@age
    # numeric(0)

Or if you have a character string giving a slot name, you use the slot function:
slot(hadley, "age")

This is the equivalent of [[. An empty value for age is probably not what you want, so you can also assign a default prototype for the class:

    setClass("Person", representation(name = "character", age = "numeric"),
      prototype(name = NA_character_, age = NA_real_))
    hadley <- new("Person", name = "Hadley")
    hadley@age
    # [1] NA

getSlots will return a description of all the slots of a class:

    getSlots("Person")
    #        name         age
    # "character"   "numeric"

You can find out the class of an object with is.

Note that there's some tension between the usual interactive functional style of R and the global side-effect causing S4 class definitions. In most programming languages, class definition occurs at compile-time, while object instantiation occurs at run-time - it's unusual to be able to create new classes interactively. In particular, note that the examples rely on the fact that multiple calls to setClass with the same class name will silently override the previous definition unless the first definition is sealed with sealed = TRUE.
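A minimal sketch of is in use, reusing the Person and Employee definitions from this section. The intern object and its slot values are made up for the illustration.

```r
library(methods)

setClass("Person", representation(name = "character", age = "numeric"))
setClass("Employee", representation(boss = "Person"), contains = "Person")

hadley <- new("Person", name = "Hadley", age = 31)
# Hypothetical employee, used only to show inheritance
intern <- new("Employee", name = "Sam", age = 22, boss = hadley)

is(hadley)               # the class of the object
is(intern)               # the class plus every class it extends
is(intern, "Person")     # TRUE: Employee contains Person
is(hadley, "Employee")   # FALSE: inheritance only goes one way
```

With one argument is lists the classes; with two it tests whether the object is, or extends, the named class.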

Checking validity
You can also provide an optional method that applies additional restrictions. This function should have a single argument called object and should return TRUE if the object is valid, and if not it should return a character vector giving all reasons it is not valid.
    check_person <- function(object) {
      errors <- character()
      length_age <- length(object@age)
      if (length_age != 1) {
        msg <- paste("Age is length ", length_age, ". Should be 1", sep = "")
        errors <- c(errors, msg)
      }

      length_name <- length(object@name)
      if (length_name != 1) {
        msg <- paste("Name is length ", length_name, ". Should be 1", sep = "")
        errors <- c(errors, msg)
      }

      if (length(errors) == 0) TRUE else errors
    }
    setClass("Person", representation(name = "character", age = "numeric"),
      validity = check_person)

    new("Person", name = "Hadley")
    # invalid class "Person" object: Age is length 0. Should be 1
    new("Person", name = "Hadley", age = 1:10)
    # Error in validObject(.Object) :
    #   invalid class "Person" object: Age is length 10. Should be 1

    # But note that the check is not automatically applied when we modify
    # slots directly
    hadley <- new("Person", name = "Hadley", age = 31)
    hadley@age <- 1:10

    # Can force check with validObject:
    validObject(hadley)
    # invalid class "Person" object: Age is length 10. Should be 1

## Generic functions and methods

Generic functions and methods work similarly to S3, but dispatch is based on the class of all arguments, and there is a special syntax for creating both generic functions and new methods. The setGeneric function provides two main ways to create a new generic. You can either convert an existing function to a generic function, or you can create a new one from scratch.
    sides <- function(object) 0
    setGeneric("sides")

If you create your own, the second argument to setGeneric should be a function that defines all the arguments that you want to dispatch on and contains a call to standardGeneric :
setGeneric("sides", function(object) { standardGeneric("sides") })

The following example sets up a simple hierarchy of shapes to use with the sides function.
    setClass("Shape")
    setClass("Polygon", representation(sides = "integer"), contains = "Shape")
    setClass("Triangle", contains = "Polygon")
    setClass("Square", contains = "Polygon")
    setClass("Circle", contains = "Shape")

Defining a method for polygons is straightforward: we just use the sides slot. The setMethod function takes three arguments: the name of the generic function, the signature to match for this method and a function to compute the result. Unfortunately R doesn't offer any syntactic sugar for this task so the code is a little verbose and repetitive.
setMethod("sides", signature(object = "Polygon"), function(object) { object@sides })

For the others we supply exact values. Note that for generics with few arguments you can simplify the signature by omitting the argument names. This saves space at the expense of having to remember which position corresponds to which argument - not a problem if there's only one argument.

    setMethod("sides", signature("Triangle"), function(object) 3)
    setMethod("sides", signature("Square"), function(object) 4)
    setMethod("sides", signature("Circle"), function(object) Inf)

You can optionally also specify valueClass to define the expected output of the generic. This will raise a run-time error if a method returns output of the wrong class.

    setGeneric("sides", valueClass = "numeric", function(object) {
      standardGeneric("sides")
    })
    setMethod("sides", signature("Triangle"), function(object) "three")
    sides(new("Triangle"))
    # invalid value from generic function "sides", class "character", expected
    # "numeric"

Note that arguments that the generic dispatches on can't be lazily evaluated - otherwise how would R know which class the object was? This also means that you can't use substitute to access the unevaluated expression.

To find what methods are already defined for a generic function, use showMethods:
    showMethods("sides")
    # Function: sides (package .GlobalEnv)
    # object="Circle"
    # object="Polygon"
    # object="Square"
    # object="Triangle"

    showMethods(class = "Polygon")
    # Function: initialize (package methods)
    # .Object="Polygon"
    #     (inherited from: .Object="ANY")
    #
    # Function: sides (package .GlobalEnv)
    # object="Polygon"

Method dispatch
This section describes the strategy for matching a call to a generic function to the correct method. If there's an exact match between the class of the objects in the call, and the signature of a method, it's easy - the generic function just calls that method. Otherwise, R will figure out the method using the following procedure:

For each argument to the function, calculate the distance between the class in the call, and the class in the signature. If they are the same, the distance is zero. If the class in the signature is a parent of the class in the call, then the distance is 1. If it's a grandparent, 2, and so on. Compute the total distance by adding together the individual distances.

Calculate this distance for every method. If there's a method with a unique smallest distance, use that. Otherwise, give a warning and call one of the matching methods as described below.

Note that it's possible to create methods that are ambiguous - i.e. it's not clear which method the generic should pick. In this case R will pick the method that is first alphabetically and return a warning message about the situation:
    setClass("A")
    setClass("A1", contains = "A")
    setClass("A2", contains = "A1")
    setClass("A3", contains = "A2")

    setGeneric("foo", function(a, b) standardGeneric("foo"))
    setMethod("foo", signature("A1", "A2"), function(a, b) "1-2")
    setMethod("foo", signature("A2", "A1"), function(a, b) "2-1")

    foo(new("A2"), new("A2"))
    # Note: Method with signature "A2#A1" chosen for function "foo",
    # target signature "A2#A2". "A1#A2" would also be valid

Generally, you should avoid this ambiguity by providing a more specific method:
setMethod("foo", signature("A2", "A2"), function(a, b) "2-2")

foo(new("A2"), new("A2"))
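The distance rule used for method selection can be sketched in a few lines. This is only an illustration, not how R implements dispatch internally: the Animal/Dog/Puppy hierarchy and the helpers class_distance and signature_distance are hypothetical, and the sketch assumes that the contains slot of a class definition records the distance to each superclass.

```r
library(methods)

# Hypothetical three-level hierarchy, used only for this illustration
setClass("Animal", representation(name = "character"))
setClass("Dog", contains = "Animal")
setClass("Puppy", contains = "Dog")

# Distance from a class to one of its ancestors: 0 for the class itself,
# 1 for a parent, 2 for a grandparent, NA if `to` is not an ancestor.
class_distance <- function(from, to) {
  if (identical(from, to)) return(0)
  ancestors <- getClass(from)@contains
  if (!to %in% names(ancestors)) return(NA_real_)
  ancestors[[to]]@distance
}

# Total distance of a method: the sum over the dispatched arguments
signature_distance <- function(call_classes, method_signature) {
  sum(mapply(class_distance, call_classes, method_signature))
}

class_distance("Puppy", "Dog")      # parent
class_distance("Puppy", "Animal")   # grandparent

# A call with two Puppy arguments, scored against two candidate methods:
signature_distance(c("Puppy", "Puppy"), c("Dog", "Animal"))
signature_distance(c("Puppy", "Puppy"), c("Animal", "Dog"))
```

The two candidate signatures score the same total, which is exactly the kind of tie that produces the ambiguity note shown earlier.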

(The computation is cached for this combination of classes so that it doesn't have to be done again.) There are two special classes that can be used in the signature: missing and ANY . missing matches the case where the argument is not supplied, and ANY is used for setting up default methods. ANY has the lowest possible precedence in method matching. You can also use basic classes like numeric , character and matrix . A matrix of (e.g.) characters will have class matrix .
    setGeneric("type", function(x) standardGeneric("type"))
    setMethod("type", signature("matrix"), function(x) "matrix")
    setMethod("type", signature("character"), function(x) "character")

    type(letters)
    type(matrix(letters, ncol = 2))
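The special classes missing and ANY can be seen at work in a small sketch. The generic describe and its return strings are hypothetical, invented for this example.

```r
library(methods)

# Hypothetical two-argument generic
setGeneric("describe", function(x, y) standardGeneric("describe"))

# ANY sets up the default method, with the lowest precedence
setMethod("describe", signature("ANY", "ANY"),
  function(x, y) "default")

# missing matches calls where the second argument is not supplied
setMethod("describe", signature("numeric", "missing"),
  function(x, y) "one number")

setMethod("describe", signature("numeric", "numeric"),
  function(x, y) "two numbers")

describe(1)        # y not supplied, so the missing method is used
describe(1, 2)     # exact match on both arguments
describe("a", 2)   # no specific method, falls back to ANY
```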

You can also dispatch on S3 classes provided that you have made S4 aware of them by calling setOldClass .
    foo <- structure(list(x = 1), class = "foo")
    type(foo)

    setOldClass("foo")
    setMethod("type", signature("foo"), function(x) "foo")
    type(foo)

    setMethod("+", signature(e1 = "foo", e2 = "numeric"), function(e1, e2) {
      structure(list(x = e1$x + e2), class = "foo")
    })
    foo + 3

It's also possible to dispatch on ... under special circumstances. See ?dotsMethods for more details.

Inheritance
Let's develop a fuller example. This is inspired by an example from the Dylan language reference, one of the languages that inspired the S4 object system. In this example we'll develop a simple model of vehicle inspections that vary depending on the type of vehicle (car or truck) and type of inspector (normal or state).

In S4, it's callNextMethod that (surprise!) is used to call the next method. It figures out which method to call by pretending the current method doesn't exist, and looking for the next closest match.

First we set up the classes: two types of vehicle (car and truck), and two types of inspector.
    setClass("Vehicle")
    setClass("Truck", contains = "Vehicle")
    setClass("Car", contains = "Vehicle")

    setClass("Inspector", representation(name = "character"))
    setClass("StateInspector", contains = "Inspector")

Next we define the generic function for inspecting a vehicle. It has two arguments: the vehicle being inspected and the person doing the inspection.

    setGeneric("inspect.vehicle", function(v, i) {
      standardGeneric("inspect.vehicle")
    })

All vehicles must be checked for rust by all inspectors, so we'll add that method first. Cars also need to have working seatbelts.

    setMethod("inspect.vehicle",
      signature(v = "Vehicle", i = "Inspector"),
      function(v, i) {
        message("Looking for rust")
      })

    setMethod("inspect.vehicle",
      signature(v = "Car", i = "Inspector"),
      function(v, i) {
        callNextMethod() # perform vehicle inspection
        message("Checking seat belts")
      })

    inspect.vehicle(new("Car"), new("Inspector"))
    # Looking for rust
    # Checking seat belts

Note that it's the most specific method that's responsible for ensuring that the more generic methods are called. We'll next add methods for trucks (cargo attachments need to be ok), and the special task that the state inspector performs on cars: checking for insurance.

    setMethod("inspect.vehicle",
      signature(v = "Truck", i = "Inspector"),
      function(v, i) {
        callNextMethod() # perform vehicle inspection
        message("Checking cargo attachments")
      })

    inspect.vehicle(new("Truck"), new("Inspector"))
    # Looking for rust
    # Checking cargo attachments

    setMethod("inspect.vehicle",
      signature(v = "Car", i = "StateInspector"),
      function(v, i) {
        callNextMethod() # perform car inspection
        message("Checking insurance")
      })

    inspect.vehicle(new("Car"), new("StateInspector"))
    # Looking for rust
    # Checking seat belts
    # Checking insurance

This setup ensures that when a state inspector checks a truck, they perform all of the checks a regular inspector would:

    inspect.vehicle(new("Truck"), new("StateInspector"))
    # Looking for rust
    # Checking cargo attachments

In the wild
To conclude, let's look at some S4 code in practice. The Bioconductor EBImage package, by Oleg Sklyar, Gregoire Pau, Mike Smith and Wolfgang Huber, is a good place to start because it's so simple. It has only one class, an Image, which represents an image as an array of pixel values.

    setClass ("Image",
      representation (colormode="integer"),
      prototype (colormode=Grayscale),
      contains = "array"
    )

    imageData = function (y) {
      if (is(y, 'Image')) y@.Data
      else y
    }

The author wrote the imageData convenience method to extract the underlying S3 object, the array. They could have also used the S3Part function to extract this.

Methods are used to define numeric operations for combining two images, or an image with a constant. Here the author is using the Ops group generic which will match all calls to +, -, *, ^, %%, %/%, /, ==, >, <, !=, <=, >=, &, and |. The callGeneric function then passes this call on to the generic method for arrays. Finally, each method checks that the modified object is valid, before returning it.

    setMethod("Ops", signature(e1="Image", e2="Image"),
      function(e1, e2) {
        e1@.Data=callGeneric(imageData(e1), imageData(e2))
        validObject(e1)
        return(e1)
      }
    )
    setMethod("Ops", signature(e1="Image", e2="numeric"),
      function(e1, e2) {
        e1@.Data=callGeneric(imageData(e1), e2)
        validObject(e1)
        return(e1)
      }
    )
    setMethod("Ops", signature(e1="numeric", e2="Image"),
      function(e1, e2) {
        e2@.Data=callGeneric(e1, imageData(e2))
        validObject(e2)
        return(e2)
      }
    )
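To try the Ops group generic without installing EBImage, a much smaller sketch is enough. The Metres class and its value slot are hypothetical, and unlike the Image methods this version simply returns the plain result rather than a new object.

```r
library(methods)

# Hypothetical class wrapping a numeric vector
setClass("Metres", representation(value = "numeric"))

# One method covers every operator in the Ops group; callGeneric
# re-dispatches whichever operator was actually called.
setMethod("Ops", signature(e1 = "Metres", e2 = "Metres"),
  function(e1, e2) {
    callGeneric(e1@value, e2@value)
  }
)

a <- new("Metres", value = 5)
b <- new("Metres", value = 3)

a + b    # arithmetic on the underlying values
a > b    # comparison operators are handled by the same method
a == b
```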

The Matrix package by Douglas Bates and Martin Maechler is a great example of a more complicated setup. It is designed to efficiently store and compute with many different special types of matrix. As at version 0.999375-50 it defines 130 classes and 24 generic functions. The package is well written, well commented and fairly easy to read. The accompanying vignette gives a good overview of the structure of the package. I'd highly recommend downloading the source and then skimming the following R files:

AllClass.R: where all classes are defined

AllGenerics.R: where all generics are defined

Ops.R: where pairwise operators are defined, including automatic conversion of standard S3 matrices

Most of the hard work is done in C for efficiency, but it's still useful to look at the other R files to see how the code is arranged.

R5

R has three object oriented (OO) systems: S3, S4 and R5. This page describes the new reference based class system, colloquially known as R5. Reference classes are new in R 2.12. They fill a long standing need for mutable objects that had previously been filled by non-core packages like R.oo, proto and mutatr.

While the core functionality of R5 is solid, reference classes are still under active development and some details will change. The most up-to-date documentation for R5 can always be found in ?ReferenceClasses.

There are two main differences between R5 and S3 and S4:

R5 objects use message-passing OO

R5 objects are mutable: the usual R copy on modify semantics do not apply

These properties make this object system behave much more like Java and C#. Surprisingly, the implementation of R5 is almost entirely in R code - reference classes are a combination of S4 methods and environments. This is a testament to the flexibility of S4.

Particularly suited for: simulations where you're modelling complex state, GUIs.

Note that when using reference based classes we want to minimise side effects, and use them only where mutable state is absolutely required. The majority of functions should still be "functional", and side effect free. This makes code easier to reason about (because you don't need to worry about methods changing things in surprising ways), and easier for other R programmers to understand.

Limitations: can't use enclosing environment - because that's used for the object.

## Classes and instances

Creating a new reference based class is straightforward: you use setRefClass . Unlike setClass from S4, you want to keep the results of that function around, because that's what you use to create new objects of that type:
    # Or keep reference to class around.
    Person <- setRefClass("Person")
    Person$new()

A reference class has three main components, given by three arguments to setRefClass:

contains, the classes which the class inherits from. These should be other reference class objects:

    setRefClass("Polygon")
    setRefClass("Regular")

    # Specify parent classes
    setRefClass("Triangle", contains = "Polygon")
    setRefClass("EquilateralTriangle", contains = c("Triangle", "Regular"))

fields are the equivalent of slots in S4. They can be specified as a vector of field names, or a named list of field types:

    setRefClass("Polygon", fields = c("sides"))
    setRefClass("Polygon", fields = list(sides = "numeric"))

The most important property of R5 objects is that they are mutable, or equivalently they have reference semantics:
    Polygon <- setRefClass("Polygon", fields = c("sides"))
    square <- Polygon$new(sides = 4)

    triangle <- square
    triangle$sides <- 3

    square$sides

methods are functions that operate within the context of the object and can modify its fields. These can also be added after object creation, as described below.

    setRefClass("Dist")
    setRefClass("DistUniform", c("a", "b"), "Dist", methods = list(
      mean = function() {
        (a + b) / 2
      }
    ))

You can also add methods after creation:

    # Instead of creating a class all at once:
    Person <- setRefClass("Person", methods = list(
      say_hello = function() message("Hi!")
    ))

    # You can build it up piece-by-piece
    Person <- setRefClass("Person")
    Person$methods(say_hello = function() message("Hi!"))

It's not currently possible to modify fields because adding fields would invalidate existing objects that didn't have those fields.

The object returned by setRefClass (or retrieved later by getRefClass) is called a generator object. It has methods:

new for creating new objects of that class. The new method takes named arguments specifying initial values for the fields

methods for modifying existing or adding new methods

help for getting help about methods

fields to get a list of fields defined for class

lock locks the named fields so that their value can only be set once

accessors a convenience method that automatically sets up accessors of the form getXXX and setXXX.
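A short sketch of a few of the generator methods listed above. The Person class here has a single hypothetical name field, and greet is an invented method.

```r
library(methods)

# The generator object is what setRefClass returns
Person <- setRefClass("Person", fields = list(name = "character"))

# fields: the fields defined for the class and their types
Person$fields()

# methods: add a method to the class before any objects are created
Person$methods(greet = function() paste("Hi,", name))

# new: create an object, giving initial field values by name
ann <- Person$new(name = "Ann")
ann$greet()

# getRefClass retrieves the generator again from the class name
Person2 <- getRefClass("Person")
bob <- Person2$new(name = "Bob")
bob$greet()
```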

Methods
R5 methods are associated with objects, not with functions, and are called using the special syntax obj$method(arg1, arg2, ...). (You might recall we've seen this construction before when we called functions stored in a named list). Methods are also special because they can modify fields. We've also seen this construct before, when we used closures to create mutable state. Reference classes work in a similar manner but give us some extra functionality:

inheritance

a way of documenting methods

a way of specifying fields and their types

Modify fields with <<-. Will call accessor functions if defined.

Special fields: .self (Don't use fields with names starting with . as these may be used for special purposes in future versions.)

initialize
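The points above (modifying fields with <<-, the special .self field, and inheritance) can be sketched together. The Account and BonusAccount classes and their deposit method are hypothetical examples, not taken from any package.

```r
library(methods)

Account <- setRefClass("Account",
  fields = list(balance = "numeric"),
  methods = list(
    deposit = function(x) {
      "Add x to the balance"
      # <<- modifies the field of this object in place
      balance <<- balance + x
      # returning .self invisibly allows calls to be chained
      invisible(.self)
    }
  )
)

# A subclass overrides deposit and delegates to the parent with callSuper
BonusAccount <- setRefClass("BonusAccount",
  contains = "Account",
  methods = list(
    deposit = function(x) {
      invisible(callSuper(x + 1))
    }
  )
)

a <- Account$new(balance = 10)
a$deposit(5)$deposit(5)
a$balance

b <- BonusAccount$new(balance = 0)
b$deposit(5)
b$balance
```

The string on the first line of deposit is a doc-string of the kind the Documentation section mentions.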

Common methods

Because all R5 classes inherit from the same superclass, envRefClass, they have a common set of methods:

obj$callSuper:

obj$copy: creates a copy of the current object. This is necessary because R5 classes don't behave like most R objects, which are copied on assignment or modification.

obj$field: named access to fields. Equivalent to slots for S4. obj$field("xxx") is the same as obj$xxx. obj$field("xxx", 5) is the same as obj$xxx <- 5

obj$import(x) coerces x into this object, and obj$export(Class) coerces a copy of obj into that class. These should be super classes.

obj$initFields
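A minimal sketch of copy and field, reusing the Polygon class from earlier in this section with a typed sides field. The variable names alias and clone are only for illustration.

```r
library(methods)

Polygon <- setRefClass("Polygon", fields = list(sides = "numeric"))
square <- Polygon$new(sides = 4)

alias <- square          # a second name for the same object
clone <- square$copy()   # an independent object with the same field values

alias$sides <- 3
square$sides             # changed, because alias and square are one object
clone$sides              # unchanged, because the clone was copied first

# field() reads or sets a field by name
clone$field("sides")
clone$field("sides", 5)
clone$sides
```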

Documentation
Python style doc-strings. obj\$help() .

In packages
Note: collation Note: namespaces and exporting

In the wild
Rook package. Scales package?

## Profiling performance and memory


"We should forget about small efficiencies, say about 97% of the time: premature optimization is the root of all evil" -- Donald Knuth. Your code should be correct, maintainable and fast. Notice that speed comes last - if your function is incorrect or unmaintainable (i.e. will eventually become incorrect) it doesn't matter if it's fast. As computers get faster and R is optimised, your code will get faster all by itself. Your code is never going to automatically become correct or elegant if it is not already. What can you do if your code is slow? Before you can figure out how to improve it, you first need to figure out why it's slow. That's what this chapter will teach you.

Build up your vocab, and use built-in vectorised operations where possible.

## Figure out what is slow and then optimise that

Benchmarking
The basic tool for benchmarking, recording how long an operation takes, is system.time . There are two contributed packages that provide useful additional functionality:

The rbenchmark package provides a convenient wrapper around system.time that allows you to compare multiple functions and run them multiple times

The microbenchmark function uses a more precise timing mechanism than system.time, allowing it to accurately compare functions that take a small amount of time without having to repeat the function millions of times. It also takes care to estimate the overhead associated with timing, and randomly orders the evaluation of each expression. It also times each expression individually, so you get a distribution of times, which helps estimate error.

The following example compares two different ways of computing means of random uniform numbers, as you might use for simulations teaching the central limit theorem (inspired by the rbenchmark documentation):
    sequential <- function(n, m) mean(replicate(n, runif(m)))
    parallel <- function(n, m) colMeans(matrix(runif(n*m), m, n))

    library(rbenchmark)
    res1 <- benchmark(
      seq = sequential(100, 100),
      par = parallel(100, 100),
      replications = 10^(0:2),
      order = c('replications', 'elapsed'))
    res1$elapsed <- res1$elapsed / res1$replications

    library(microbenchmark)
    res2 <- microbenchmark(
      seq = sequential(100, 100),
      par = parallel(100, 100),
      times = 100
    )
    print(res2, unit = "s")
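If neither contributed package is installed, a rough comparison is still possible with system.time alone. The sketch below reuses the two functions from the example above; the helper time_per_call and its reps argument are hypothetical.

```r
sequential <- function(n, m) mean(replicate(n, runif(m)))
parallel <- function(n, m) colMeans(matrix(runif(n * m), m, n))

# Hypothetical helper: average elapsed seconds per call over `reps` runs.
# Repeating the call smooths out the coarse resolution of system.time.
time_per_call <- function(f, reps = 50) {
  timing <- system.time(for (i in seq_len(reps)) f())
  timing[["elapsed"]] / reps
}

timings <- c(
  seq = time_per_call(function() sequential(100, 100)),
  par = time_per_call(function() parallel(100, 100))
)
timings
```

This gives only a single average per function, without the distribution of times or the overhead correction that microbenchmark provides.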

Performance profiling
R provides a built in tool for profiling: Rprof. When active, this records the current call stack to disk every interval seconds. This provides a fine grained report showing how long each function takes. The function summaryRprof provides a way to turn this list of call stacks into useful information. But I don't think it's terribly useful, because it makes it hard to see the entire structure of the program at once. Instead, we'll use the profr package, which turns the call stack into a data.frame that is easier to manipulate and visualise.
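For reference, a minimal sketch of the built-in workflow with Rprof and summaryRprof. The function slow_sqrt_sum is a deliberately inefficient, made-up example, and the summary step is guarded because a very short run may record no samples at all.

```r
# Hypothetical slow function: growing a vector inside a loop
slow_sqrt_sum <- function(n) {
  x <- numeric(0)
  for (i in seq_len(n)) x <- c(x, sqrt(i))  # copies x on every iteration
  sum(x)
}

prof_file <- tempfile(fileext = ".out")

utils::Rprof(prof_file, interval = 0.01)  # start sampling the call stack
result <- slow_sqrt_sum(2e4)
utils::Rprof(NULL)                        # stop profiling

# summaryRprof fails if no samples were recorded, so guard the call
prof <- tryCatch(utils::summaryRprof(prof_file), error = function(e) NULL)
if (!is.null(prof)) head(prof$by.self)
```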

## Byte code compilation

R 2.13 introduced a new byte code compiler which can increase the speed of certain types of code 4-5 fold. This improvement is likely to get better in the future as the compiler implements more optimisations - this is an active area of research. Using the compiler is an easy way to get speed ups - it's easy to use, and if it doesn't work well for your function, then you haven't invested a lot of time in it, and so you haven't lost much.
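A minimal sketch of compiling a single function with cmpfun from the compiler package. The function sum_to is a made-up loop-heavy example. Note that since R 3.4.0 functions are byte compiled automatically by default, so on a recent version the two timings may be very close.

```r
library(compiler)

# Hypothetical loop-heavy function
sum_to <- function(n) {
  total <- 0
  for (i in seq_len(n)) total <- total + i
  total
}

# cmpfun returns a byte compiled version of the same function
sum_to_compiled <- cmpfun(sum_to)

system.time(for (i in 1:100) sum_to(1e4))
system.time(for (i in 1:100) sum_to_compiled(1e4))
```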

Memory profiling
object.size

There are three ways to explore memory usage:

tracemem

Rprof + memory

Rprofmem
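A small sketch of two of the tools named above. tracemem is only available when R was built with memory profiling enabled, so that part is guarded with capabilities("profmem"); the vectors x and y are just illustrative.

```r
x <- numeric(1e6)

# object.size reports an estimate of the memory used by an object
size_bytes <- as.numeric(object.size(x))
print(object.size(x), units = "Mb")

# tracemem prints a message whenever the traced object is copied
if (capabilities("profmem")) {
  tracemem(x)
  y <- x        # no copy yet: both names point at the same memory
  y[1] <- 1     # modifying y forces the copy, which tracemem reports
  untracemem(x)
}
```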

## High performance functions with Rcpp


Sometimes R code just isn't fast enough - you've already used all of the tips and tricks you know and you've used profiling to find the bottleneck, and there's simply no way to make the code any faster. This chapter is the answer to that problem: use Rcpp to easily write key functions in C++ to get all the performance of C, while sacrificing the minimum of convenience. Rcpp is a fantastic tool written by Dirk Eddelbuettel and Romain Francois that makes it dead simple to write high-performance code in C++ that easily interfaces with the R-level data structures.

You can also write high performance code in straight C or Fortran. These may (or may not) be more performant than C++, but you have to sacrifice a lot of convenience and master the complex C internals of R, as well as doing memory management yourself. In my opinion, using Rcpp is currently the best balance between speed and convenience.

Writing performant code may also require you to rethink your basic approach: a solid understanding of basic data structures and algorithms is very helpful here. That's beyond the scope of this book, but I'd suggest the "algorithm design handbook" as a good place to start. Or http://ocw.mit.edu/courses/electrical-engineering-and-computer-science/6-046j-introduction-to-algorithms-sma-5503-fall-2005/

The basic strategy is to keep as much code as possible in R, because:

don't need to compile which makes development faster

you are probably more familiar with R than C++

people reading/maintaining your code in the future will probably be more familiar with R than C++

don't know the length of the output vector

Implementing bottlenecks in C++ can give considerable speed ups (2-3 orders of magnitude) and allows you to easily access best-of-breed data structures. Keeping the majority of your code in straight R means that you don't have to sacrifice the benefits of R.

Typically bottlenecks involve:

loops that can't easily be vectorised because each iteration depends on the previous. (e.g. Sieve of Eratosthenes: http://stackoverflow.com/questions/3789968/generate-a-list-of-primes-in-r-up-to-a-certain-number, simulation)

recursive functions

The aim of this chapter is to give you the absolute basics to get up and running with Rcpp for the purpose of speeding up slow parts of your code. Other resources that I found helpful when writing this chapter and you might too are:

Slides from the Rcpp master class taught by Dirk Eddelbuettel and Romain Francois in April 2011.

Essentials of C++

There are many, many other components of C++; the idea here is to give you the absolute basics so that you can start writing code. Emphasis on the C-like aspect of C++. Minimal OO, minimal templating - just how to use them, not how to program them yourself.

basic syntax

creating variables and assigning

control flow

functions

Variable types
Table that shows R, standard C++, and Rcpp classes.

int, long, float, double, bool, char, std::string

lists and data frames, named vectors

functions

from R to C++: as

from C++ to R: wrap

Differences from R

semi-colon at end of each line

static typing

vectors are 0-based

compiled, not interpreted

variables are scalars by default

pow instead of ^

## Rcpp syntax sugar

vectorised operations: ifelse, sapply

lazy functions: any, all

More details are available in the Rcpp syntactic sugar vignette.

Building
Prerequisites: R development environment (which you need for package building anyway). C++ compiler (e.g. g++).

Inline
Inline package is the absolute easiest way to
    # TODO: Update to own example
    library(inline)

    # The C++ body is passed to cxxfunction as a character string
    src <- '
      int n = as<int>(ns);
      double x = as<double>(xs);
      for (int i = 0; i < n; i++) x = 1 / (1 + x);
      return wrap(x);
    '
    l <- cxxfunction(signature(ns = "integer", xs = "numeric"),
      body = src, plugin = "Rcpp")

In package
For more details see the vignette, Writing a package that uses Rcpp.
Rcpp.package.skeleton( "mypackage" )

Description:
    Depends: Rcpp (>= 0.9.4.1)
    LinkingTo: Rcpp

Namespace:
useDynLib(mypackage)

Makefiles R function

## Useful data structures and algorithms

From the standard template library (STL). Can be very useful when implementing algorithms that rely on data structures with particular performance characteristics. Useful resources:

http://www.davethehat.com/articles/eff_stl.htm

http://www.sgi.com/tech/stl/

http://www.uml.org.cn/c++/pdf/EffectiveSTL.pdf

Compared to R these objects are typically mutable, and most operations operate destructively in place - this can be much faster.

vector: like an R vector, but efficiently grows (but see the reserve method if you know something about how much space you'll need)

deque

list

set: no duplicates, sorted

map

ropes

Useful algorithms

partial sort

binary search

accumulate (example: total length of strings)

sort with custom comparison operation?

http://community.topcoder.com/tc?module=Static&d1=tutorials&d2=alg_index

More Rcpp
This chapter has only touched on a small part of Rcpp, giving you the basic tools to rewrite poorly performing R code in C++. Rcpp has many other capabilities that make it easy to interface R to existing C++ code, including:

automatically creating the wrappers between R data structures and C++ data structures (modules). A good introduction to this topic is the vignette of Rcpp modules

mapping of C++ classes to reference classes.

I strongly recommend keeping an eye on the Rcpp homepage and signing up for the Rcpp mailing list. Rcpp is still under active development, and is getting better with every release.