
Data Transformation with dplyr :: CHEAT SHEET

dplyr functions work with pipes and expect tidy data. In tidy data:
Each variable is in its own column.
Each observation, or case, is in its own row.
With pipes, x %>% f(y) becomes f(x, y).

Summarise Cases

These apply summary functions to columns to create a new table of summary statistics. Summary functions take vectors as input and return one value (see back).

summarise(.data, …) Compute table of summaries. summarise(mtcars, avg = mean(mpg))
count(x, ..., wt = NULL, sort = FALSE) Count number of rows in each group defined by the variables in … Also tally(). count(iris, Species)

Manipulate Cases

EXTRACT CASES
Row functions return a subset of rows as a new table.

filter(.data, …) Extract rows that meet logical criteria. filter(iris, Sepal.Length > 7)
distinct(.data, ..., .keep_all = FALSE) Remove rows with duplicate values. distinct(iris, Species)
sample_frac(tbl, size = 1, replace = FALSE, weight = NULL, .env = parent.frame()) Randomly select fraction of rows. sample_frac(iris, 0.5, replace = TRUE)
sample_n(tbl, size, replace = FALSE, weight = NULL, .env = parent.frame()) Randomly select size rows. sample_n(iris, 10, replace = TRUE)
slice(.data, …) Select rows by position. slice(iris, 10:15)
top_n(x, n, wt) Select and order top n entries (by group if grouped data). top_n(iris, 5, Sepal.Width)

Manipulate Variables

EXTRACT VARIABLES
Column functions return a set of columns as a new vector or table.

pull(.data, var = -1) Extract column values as a vector. Choose by name or index. pull(iris, Sepal.Length)
select(.data, …) Extract columns as a table. Also select_if(). select(iris, Sepal.Length, Species)

Use these helpers with select(), e.g. select(iris, starts_with("Sepal"))
contains(match)
ends_with(match)
matches(match)
num_range(prefix, range)
one_of(…)
starts_with(match)
:, e.g. mpg:cyl
-, e.g. -Species

MAKE NEW VARIABLES
These apply vectorized functions to columns. Vectorized functions take vectors as input and return vectors of the same length as output (see back).
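The difference between a vectorized function (for mutate()) and a summary function (for summarise()) can be sketched as follows. This is only an illustration and assumes dplyr is installed and the built-in mtcars data set is available; the names with_gpm and avg_tbl are made up for the example.

```r
library(dplyr)

mpg <- mtcars$mpg

# A vectorized function returns one output value per input value.
gpm <- 1 / mpg
length(gpm) == length(mpg)

# A summary function collapses a vector to a single value.
length(mean(mpg))

# mutate() keeps every row and adds the new column;
# summarise() returns a one-row table of summary statistics.
with_gpm <- mutate(mtcars, gpm = 1 / mpg)
avg_tbl  <- summarise(mtcars, avg = mean(mpg))

nrow(with_gpm)  # same number of rows as mtcars
nrow(avg_tbl)   # a single row
```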
mutate(.data, …) Compute new column(s). mutate(mtcars, gpm = 1/mpg)
transmute(.data, …) Compute new column(s), drop others. transmute(mtcars, gpm = 1/mpg)
mutate_all(.tbl, .funs, …) Apply funs to every column. Use with funs(). Also mutate_if(). mutate_all(faithful, funs(log(.), log2(.))) mutate_if(iris, is.numeric, funs(log(.)))
mutate_at(.tbl, .cols, .funs, …) Apply funs to specific columns. Use with funs(), vars() and the helper functions for select(). mutate_at(iris, vars(-Species), funs(log(.)))

VARIATIONS
summarise_all() - Apply funs to every column.
summarise_at() - Apply funs to specific columns.
summarise_if() - Apply funs to all cols of one type.

Logical and boolean operators to use with filter()
<   <=   >   >=   is.na()   !is.na()   %in%   !   |   &   xor()
See ?base::Logic and ?Comparison for help.

ARRANGE CASES
arrange(.data, …) Order rows by values of a column or columns (low to high), use with desc() to order from high to low. arrange(mtcars, mpg) arrange(mtcars, desc(mpg))

Group Cases
Use group_by() to create a "grouped" copy of a table. dplyr functions will manipulate each "group" separately and then combine the results.
mtcars %>%
  group_by(cyl) %>%
  summarise(avg = mean(mpg))
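To illustrate that grouped tables are processed one group at a time and then recombined, here is a small sketch with a grouped mutate(). It assumes dplyr is installed; the column name mpg_dev and the objects centered and check are hypothetical names chosen for the example.

```r
library(dplyr)

# Within each cyl group, subtract that group's mean mpg from every car's mpg.
centered <- mtcars %>%
  group_by(cyl) %>%
  mutate(mpg_dev = mpg - mean(mpg)) %>%  # mean(mpg) is computed per group
  ungroup()

# Sanity check: deviations should sum to (approximately) zero in every group.
check <- centered %>%
  group_by(cyl) %>%
  summarise(total_dev = sum(mpg_dev), n = n())

check
```

The result still has one row per car; only the value used for mean(mpg) changes from group to group.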
group_by(.data, ..., add = FALSE) Returns copy of table grouped by … g_iris <- group_by(iris, Species)
ungroup(x, …) Returns ungrouped copy of table. ungroup(g_iris)

ADD CASES
add_row(.data, ..., .before = NULL, .after = NULL) Add one or more rows to a table. add_row(faithful, eruptions = 1, waiting = 1)

MAKE NEW VARIABLES (continued)
add_column(.data, ..., .before = NULL, .after = NULL) Add new column(s). Also add_count(), add_tally(). add_column(mtcars, new = 1:32)
rename(.data, …) Rename columns. rename(iris, Length = Sepal.Length)
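As a recap of how the functions on this page chain together, the following sketch compares nested calls with the equivalent pipe, using the rule that x %>% f(y) becomes f(x, y). It assumes dplyr is installed; avg_width, nested and piped are illustrative names only.

```r
library(dplyr)

# Nested calls: read from the inside out.
nested <- summarise(
  filter(iris, Sepal.Length > 7),
  avg_width = mean(Sepal.Width)
)

# The same steps with the pipe: each result becomes the first argument
# of the next function.
piped <- iris %>%
  filter(Sepal.Length > 7) %>%
  summarise(avg_width = mean(Sepal.Width))

# Both approaches should produce the same one-row table.
all.equal(nested, piped)
```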
RStudio® is a trademark of RStudio, Inc. • CC BY SA RStudio • info@rstudio.com • 844-448-1212 • rstudio.com • Learn more with browseVignettes(package = c("dplyr", "tibble")) • dplyr 0.7.0 • tibble 1.2.0 • Updated: 2017-03
Vector Functions

TO USE WITH MUTATE()
mutate() and transmute() apply vectorized functions to columns to create new columns. Vectorized functions take vectors as input and return vectors of the same length as output.

OFFSETS
dplyr::lag() - Offset elements by 1
dplyr::lead() - Offset elements by -1

CUMULATIVE AGGREGATES
dplyr::cumall() - Cumulative all()
dplyr::cumany() - Cumulative any()
cummax() - Cumulative max()
dplyr::cummean() - Cumulative mean()
cummin() - Cumulative min()
cumprod() - Cumulative prod()
cumsum() - Cumulative sum()

Summary Functions

TO USE WITH SUMMARISE()
summarise() applies summary functions to columns to create a new table. Summary functions take vectors as input and return single values as output.

COUNTS
dplyr::n() - number of values/rows
dplyr::n_distinct() - # of uniques
sum(!is.na()) - # of non-NA's

LOCATION
mean() - mean, also mean(!is.na())
median() - median

LOGICALS
mean() - Proportion of TRUE's
sum() - # of TRUE's

Combine Tables

COMBINE VARIABLES
Example tables x and y (as pictured):
  x        y
  A B C    A B D
  a t 1    a t 3
  b u 2    b u 2
  c v 3    d w 1

Use bind_cols() to paste tables beside each other as they are.
bind_cols(…) Returns tables placed side by side as a single table. BE SURE THAT ROWS ALIGN.
  A B C A B D
  a t 1 a t 3
  b u 2 b u 2
  c v 3 d w 1

Use a "Mutating Join" to join one table to columns from another, matching values with the rows that they correspond to. Each join retains a different combination of values from the tables.
left_join(x, y, by = NULL, copy = FALSE, suffix = c(".x", ".y"), …) Join matching values from y to x.
  A B C D
  a t 1 3
  b u 2 2
  c v 3 NA

COMBINE CASES
Example tables x and y (as pictured):
  x        y
  A B C    A B C
  a t 1    c v 3
  b u 2    d w 4
  c v 3

Use bind_rows() to paste tables below each other as they are.
bind_rows(…, .id = NULL) Returns tables one on top of the other as a single table. Set .id to a column name to add a column of the original table names (as pictured).
  DF A B C
  x  a t 1
  x  b u 2
  x  c v 3
  z  c v 3
  z  d w 4

intersect(x, y, …) Rows that appear in both x and y.
  A B C
  c v 3
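The pictured left_join() and bind_rows() results can be reproduced with a short sketch. It assumes dplyr is installed and rebuilds the example tables by hand; the table name z and the column name DF follow the labels in the picture, and joined and stacked are illustrative names.

```r
library(dplyr)

# Example tables for the mutating join (values taken from the picture).
x <- data.frame(A = c("a", "b", "c"), B = c("t", "u", "v"), C = 1:3)
y <- data.frame(A = c("a", "b", "d"), B = c("t", "u", "w"), D = 3:1)

# Keep every row of x and add D wherever A and B match a row of y.
# Row "c" has no match in y, so its D is NA.
joined <- left_join(x, y, by = c("A", "B"))
joined

# Second table for stacking (same columns as x).
z <- data.frame(A = c("c", "d"), B = c("v", "w"), C = 3:4)

# Stack the tables; .id adds a column recording which input each row came from.
stacked <- bind_rows(x = x, z = z, .id = "DF")
stacked
```

Naming the inputs (x = x, z = z) is what supplies the labels stored in the .id column.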
Vector Functions (continued)

RANKINGS
dplyr::cume_dist() - Proportion of all values <=
dplyr::dense_rank() - rank with ties = min, no gaps
dplyr::min_rank() - rank with ties = min
dplyr::ntile() - bins into n bins
dplyr::percent_rank() - min_rank scaled to [0,1]
dplyr::row_number() - rank with ties = "first"

MATH
+, -, *, /, ^, %/%, %% - arithmetic ops
log(), log2(), log10() - logs
<, <=, >, >=, !=, == - logical comparisons
dplyr::between() - x >= left & x <= right
dplyr::near() - safe == for floating point numbers

MISC
dplyr::case_when() - multi-case if_else()
dplyr::coalesce() - first non-NA values by element across a set of vectors
dplyr::if_else() - element-wise if() + else()
dplyr::na_if() - replace specific values with NA
pmax() - element-wise max()
pmin() - element-wise min()
dplyr::recode() - Vectorized switch()
dplyr::recode_factor() - Vectorized switch() for factors

Summary Functions (continued)

POSITION/ORDER
dplyr::first() - first value
dplyr::last() - last value
dplyr::nth() - value in nth location of vector

RANK
quantile() - nth quantile
min() - minimum value
max() - maximum value

SPREAD
IQR() - Inter-Quartile Range
mad() - median absolute deviation
sd() - standard deviation
var() - variance

Combine Tables (continued)

COMBINE VARIABLES (continued)
Example tables x and y (as pictured):
  x        y
  A B C    A B D
  a t 1    a t 3
  b u 2    b u 2
  c v 3    d w 1

right_join(x, y, by = NULL, copy = FALSE, suffix = c(".x", ".y"), …) Join matching values from x to y.
  A B C  D
  a t 1  3
  b u 2  2
  d w NA 1

inner_join(x, y, by = NULL, copy = FALSE, suffix = c(".x", ".y"), …) Join data. Retain only rows with matches.
  A B C D
  a t 1 3
  b u 2 2

full_join(x, y, by = NULL, copy = FALSE, suffix = c(".x", ".y"), …) Join data. Retain all values, all rows.
  A B C  D
  a t 1  3
  b u 2  2
  c v 3  NA
  d w NA 1

Use by = c("col1", "col2") to specify the column(s) to match on. left_join(x, y, by = "A")
  A B.x C B.y D
  a t   1 t   3
  b u   2 u   2
  c v   3 NA  NA

Use a named vector, by = c("col1" = "col2"), to match on columns with different names in each data set. left_join(x, y, by = c("C" = "D"))
  A.x B.x C A.y B.y
  a   t   1 d   w
  b   u   2 b   u
  c   v   3 a   t

Use suffix to specify the suffix to give to duplicate column names. left_join(x, y, by = c("C" = "D"), suffix = c("1", "2"))
  A1 B1 C A2 B2
  a  t  1 d  w
  b  u  2 b  u
  c  v  3 a  t

COMBINE CASES (continued)
setdiff(x, y, …) Rows that appear in x but not y.
  A B C
  a t 1
  b u 2

union(x, y, …) Rows that appear in x or y. (Duplicates removed). union_all() retains duplicates.
  A B C
  a t 1
  b u 2
  c v 3
  d w 4

Use setequal() to test whether two data sets contain the exact same rows (in any order).

EXTRACT ROWS
Use a "Filtering Join" to filter one table against the rows of another.

semi_join(x, y, by = NULL, …) Return rows of x that have a match in y. USEFUL TO SEE WHAT WILL BE JOINED.
  A B C
  a t 1
  b u 2

anti_join(x, y, by = NULL, …) Return rows of x that do not have a match in y. USEFUL TO SEE WHAT WILL NOT BE JOINED.
  A B C
  c v 3

Row Names
Tidy data does not use rownames, which store a variable outside of the columns. To work with the rownames, first move them into a column.

rownames_to_column() Move row names into a column. a <- rownames_to_column(iris, var = "C")
column_to_rownames() Move a column into row names. column_to_rownames(a, var = "C")
Also has_rownames(), remove_rownames()
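A round trip through the row-name helpers might look like the sketch below. It assumes the tibble package is installed and uses mtcars, whose row names hold the car models; the column name "model" and the objects cars and back are hypothetical choices for the example.

```r
library(tibble)

# mtcars stores the car model outside the columns, as row names.
has_rownames(mtcars)

# Move the row names into a regular column called "model".
cars <- rownames_to_column(mtcars, var = "model")
head(cars$model)

# Move the column back into row names.
back <- column_to_rownames(cars, var = "model")
head(rownames(back))
```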

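The filtering joins described under EXTRACT ROWS can be tried with a minimal sketch like the one below. It assumes dplyr is installed and rebuilds the pictured x and y tables by hand; kept and dropped are illustrative names.

```r
library(dplyr)

x <- data.frame(A = c("a", "b", "c"), B = c("t", "u", "v"), C = 1:3)
y <- data.frame(A = c("a", "b", "d"), B = c("t", "u", "w"), D = 3:1)

# Rows of x that have a match in y (rows "a" and "b").
# Only the columns of x are returned; nothing from y is added.
kept <- semi_join(x, y, by = c("A", "B"))

# Rows of x that have no match in y (row "c").
dropped <- anti_join(x, y, by = c("A", "B"))

kept
dropped
```

Running both before a mutating join is a quick way to see which rows will and will not find a partner.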