A collection of lesser-known but powerful base R idioms and shortcuts for writing concise and fast base R code, useful for beginner level to intermediate level R developers.
Please help me improve and extend this list. See contributing guide and code of conduct.
Why?
From 2012 to 2022, I answered thousands of R questions in the online community Capital of Statistics. These recipes are distilled from recurring patterns in frequently asked questions whose answers are not widely known.
- Object creation
  - Create sequences with seq_len() and seq_along()
  - Repeat character strings with strrep()
  - Create an empty list of a given length
  - Create and assign S3 classes in one step
  - Assign names to vector elements or data frame columns at creation
  - Use I() to include objects as is in data frames
  - Generate factors using gl()
- Object transformation
  - Insert elements into a vector with append()
  - Modify data frames with transform()
  - Modify data frames with within()
  - Use [ and [[ as functions in apply calls
  - Sum all components in a list
  - Bind multiple data frames in a list
  - Use modifyList() to update a list
  - Use aperm() and asplit() to permute and split arrays
  - Run-length encoding
- Conditions
- Vectorization
  - Use match() for fast lookups
  - Use environments as fast key-value stores for fast lookups
  - Use mapply() for element-wise operations on multiple lists
  - Simplify element-wise min and max operations with pmin() and pmax()
  - Apply a function to all combinations of parameters
  - Generate all possible combinations of given characters
  - Vectorize a function with Vectorize()
  - Pairwise computations using outer()
  - Subtract column means from non-zero elements in a sparse matrix
- Functions
- Side-effects
- Numerical computations
- Further reading
seq_len() and seq_along() are safer than 1:length(x) or 1:nrow(x) because they
avoid the unexpected result when x is of length 0, where 1:0 counts down and
yields c(1, 0) instead of an empty sequence:
# Safe version of 1:length(x)
seq_len(length(x))
# Safe version of 1:length(x)
seq_along(x)
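The pitfall is easiest to see with an empty vector. The following is a small sketch; the variable names naive_idx and safe_idx are illustrative only:

```r
x <- character(0)

# 1:length(x) becomes 1:0, which counts down instead of being empty
naive_idx <- 1:length(x)
naive_idx
#> [1] 1 0

# seq_along() returns a zero-length sequence, so a loop body never runs
safe_idx <- seq_along(x)
safe_idx
#> integer(0)

for (i in safe_idx) stop("never reached")
```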
When you need to repeat a string a certain number of times, instead of using
the tedious pattern of paste(rep("foo", 10), collapse = ""), you can use
the strrep() function:
strrep("foo", 10)
strrep() is vectorized over both arguments: the strings and the repetition
counts are recycled against each other, so the result is as long as the longer
of the two:
fruits <- c("apple", "banana", "orange")
strrep(c("*"), nchar(fruits))
strrep(c("-", "=", "**"), nchar(fruits))
Use the vector() function to create an empty list of a specific length:
x <- vector("list", length = 10)
Avoid creating an object and assigning its class separately.
Instead, use the structure() function to do both at once:
x <- structure(list(), class = "my_class")
Instead of:
x <- list()
class(x) <- "my_class"
This makes the code more concise when returning an object of a specific class.
The setNames() function allows you to assign names to vector elements or
data frame columns during creation:
x <- setNames(1:3, c("one", "two", "three"))
x <- setNames(data.frame(...), c("names", "of", "columns"))
The I() function allows you to include objects as is when creating data frames:
df <- data.frame(x = I(list(1:10, letters)))
df$x
#> [[1]]
#> [1] 1 2 3 4 5 6 7 8 9 10
#>
#> [[2]]
#> [1] "a" "b" "c" "d" "e" "f" "g" "h" "i" "j" "k" "l" "m"
#> [14] "n" "o" "p" "q" "r" "s" "t" "u" "v" "w" "x" "y" "z"
This creates a data frame with one column x that is a list of vectors.
Create a factor with a regular pattern of levels with gl() by specifying the
number of levels (n) and the number of consecutive repetitions of each level (k):
gl(n = 2, k = 5, labels = c("Low", "High"))
#> [1] Low Low Low Low Low High High High High High
#> Levels: Low High
The gl() function is particularly useful when setting up experiments
or simulations that involve categorical variables.
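As a rough illustration of the experiment use case, the sketch below builds a balanced two-factor layout. The design, the column names dose and batch, and the labels are all hypothetical; the length argument of gl() repeats the pattern until the requested length is reached.

```r
# Hypothetical 2 x 3 layout with 2 replicates per cell (12 runs in total)
design <- data.frame(
  dose = gl(n = 2, k = 6, labels = c("Low", "High")),
  batch = gl(n = 3, k = 2, length = 12, labels = c("A", "B", "C"))
)

# Every dose/batch combination appears exactly twice
table(design$dose, design$batch)
```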
When you need to insert elements into a vector at a specific position,
use append(). It has an argument after that specifies the position after
which the new elements should be inserted, defaulting to the length of the
vector being appended to.
For example, to insert the numbers 4, 5, 6 between 1, 2, 3 and 7, 8, 9:
x <- c(1, 2, 3, 7, 8, 9)
append(x, 4:6, after = 3)
#> [1] 1 2 3 4 5 6 7 8 9
Without append(), the solution would be more verbose and less readable:
c(x[1:3], 4:6, x[4:length(x)])
#> [1] 1 2 3 4 5 6 7 8 9
When after is set to 0, the new values are "appended" to the beginning of
the input vector:
append(x, 4:6, after = 0)
#> [1] 4 5 6 1 2 3 7 8 9
When adding new columns or modifying existing columns in a data frame,
instead of assigning each column individually, use transform() to perform
multiple transformations in a single step:
df <- data.frame(x = 1:5, y = 6:10)
transform(df, z = x + y, y = y * 2, w = sqrt(x))
This is more concise and readable compared to the alternative of
multiple assignments and repeating df$:
df$z <- df$x + df$y
df$y <- df$y * 2
df$w <- sqrt(df$x)
For more complex data transformations that involve multiple steps or
intermediate variables, consider using the within() function
(not to be confused with with()):
df <- data.frame(x = 1:5, y = 6:10)
within(df, {
  y <- x / sum(x)
  z <- log(y)
  category <- ifelse(z > -2, "High", "Low")
})
Note that both transform() and within() return a modified copy of the
original data frame and do not change the original data frame,
unless you assign the result back.
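A short sketch of the copy behavior, assuming a small example data frame; the name df_new is illustrative:

```r
df <- data.frame(x = 1:5, y = 6:10)

# The result is a new data frame; df itself is untouched
df_new <- transform(df, z = x + y)
names(df)
#> [1] "x" "y"
names(df_new)
#> [1] "x" "y" "z"

# Assign back to the same name to update df
df <- within(df, {
  z <- x + y
})
```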
When you need to extract the same element from each item in a list or
list-like object, you can leverage `[` and `[[` as functions
(they actually are!) within lapply() and sapply() calls.
Consider a list of named vectors:
lst <- list(
  item1 = c(a = 1, b = 2, c = 3),
  item2 = c(a = 4, b = 5, c = 6),
  item3 = c(a = 7, b = 8, c = 9)
)
# Extract named element "a" using `[[`
element_a <- sapply(lst, `[[`, "a")
lst <- list(
  item1 = c(1, 2, 3),
  item2 = c(4, 5, 6),
  item3 = c(7, 8, 9)
)
# Extract first element using `[`
first_element <- sapply(lst, `[`, 1)
Use the Reduce() function with the infix function + to sum up all components
in a list:
lst <- list(1:3, 4:6, 7:9)
x <- Reduce("+", lst)
The do.call() function with the rbind argument allows you to bind
multiple data frames in a list into one data frame:
df_combined <- do.call("rbind", list_of_dfs)
Alternatively, more performant solutions for such operations are offered in
data.table::rbindlist() and dplyr::bind_rows(). See this article for details.
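A minimal base R sketch of the do.call() approach; list_of_dfs here is hypothetical data made up for illustration. Note that rbind() on data frames matches columns by name, so all pieces need the same column names.

```r
# Hypothetical list of data frames sharing the same columns
list_of_dfs <- list(
  data.frame(id = 1:2, value = c("a", "b")),
  data.frame(id = 3:4, value = c("c", "d")),
  data.frame(id = 5L, value = "e")
)

# Equivalent to rbind(list_of_dfs[[1]], list_of_dfs[[2]], list_of_dfs[[3]])
df_combined <- do.call(rbind, list_of_dfs)
nrow(df_combined)
#> [1] 5
```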
The modifyList() function allows you to easily update values in a list
without a verbose syntax:
old_list <- list(a = 1, b = 2, c = 3)
new_vals <- list(a = 10, c = 30)
new_list <- modifyList(old_list, new_vals)
This can be very useful for maintaining and updating a set of configuration parameters.
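For the configuration use case, two behaviors of modifyList() are worth knowing: nested lists are merged recursively, and a NULL value removes the component. The sketch below uses made-up settings (defaults, overrides) purely for illustration:

```r
# Hypothetical default settings with a nested list
defaults <- list(
  width = 800,
  height = 600,
  theme = list(font = "sans", size = 12)
)

# Change height, change one nested value, and drop width
overrides <- list(height = 400, theme = list(size = 14), width = NULL)

config <- modifyList(defaults, overrides)
str(config)
```

Here config keeps theme$font from the defaults, takes theme$size and height from the overrides, and no longer has a width component.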
Use aperm() and asplit() to avoid nested for-loops in array manipulation.
aperm() is the generalization of matrix transpose. asplit() can split along
any array dimension.
arr <- array(
  1:24,
  dim = c(2, 3, 4),
  dimnames = list(
    row = paste0("R", 1:2),
    col = paste0("C", 1:3),
    slice = paste0("S", 1:4)
  )
)
# Rearrange dimensions from (2 x 3 x 4) to (4 x 3 x 2)
aperm(arr, perm = c(3, 2, 1))
# Split into a length-4 list of (2 x 3) matrices
asplit(arr, MARGIN = 3)
Run-length encoding is a simple form of data compression in which sequences of the same element are replaced by a single instance of the element followed by the number of times it appears in the sequence.
Suppose you have a vector with many repeating elements:
x <- c(1, 1, 1, 2, 2, 3, 3, 3, 3, 2, 2, 2, 1, 1)
You can use rle() to compress this vector and decompress the result back
into the original vector with inverse.rle():
x <- c(1, 1, 1, 2, 2, 3, 3, 3, 3, 2, 2, 2, 1, 1)
(y <- rle(x))
#> Run Length Encoding
#> lengths: int [1:5] 3 2 4 3 2
#> values : num [1:5] 1 2 3 2 1
inverse.rle(y)
#> [1] 1 1 1 2 2 3 3 3 3 2 2 2 1 1
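Beyond compression, the lengths and values components of the rle() result are handy for questions about runs. The sketch below finds the longest run and where it starts; the names runs, longest, and run_starts are illustrative:

```r
x <- c(1, 1, 1, 2, 2, 3, 3, 3, 3, 2, 2, 2, 1, 1)
runs <- rle(x)

# Length and value of the longest run
longest <- which.max(runs$lengths)
runs$lengths[longest]
#> [1] 4
runs$values[longest]
#> [1] 3

# Position in x where each run starts, then the start of the longest one
run_starts <- cumsum(c(1, head(runs$lengths, -1)))
run_starts[longest]
#> [1] 6
```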
Instead of using the class() function in conjunction with ==, !=,
or %in% operators to check if an object belongs to a certain class,
use the inherits() function.
if (inherits(x, "class"))
This will return TRUE if "class" is one of the classes from which x inherits.
This replaces the following more verbose forms:
if (class(x) == "class")
or
if (class(x) %in% c("class1", "class2"))
It is also more reliable because it checks for class inheritance, not just the first class name (R supports multiple classes for S3 and S4 objects).
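The difference shows up as soon as an object has more than one class. A small sketch with a hypothetical S3 object whose class names are made up:

```r
# Hypothetical S3 object with two classes
x <- structure(list(), class = c("child", "parent"))

# Comparing against class() yields one result per class name
class(x) == "parent"
#> [1] FALSE  TRUE

# inherits() always returns a single TRUE or FALSE
inherits(x, "parent")
#> [1] TRUE

# `what` can hold several candidates; TRUE if x inherits from any of them
inherits(x, c("parent", "other"))
#> [1] TRUE
```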
For a series of range-based conditions, use cut() instead of chaining
multiple if-else conditions or ifelse() calls:
categories <- cut(
  x,
  breaks = c(-Inf, 0, 10, Inf),
  labels = c("negative", "small", "large")
)
This assigns each element in x to the category that corresponds to the
range it falls in.
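A runnable sketch with made-up input values shows how boundary values are treated. By default cut() uses right-closed intervals, so here the ranges are (-Inf, 0], (0, 10], and (10, Inf]:

```r
x <- c(-5, 0, 3, 10, 25)
categories <- cut(
  x,
  breaks = c(-Inf, 0, 10, Inf),
  labels = c("negative", "small", "large")
)

# 0 falls in (-Inf, 0] and 10 falls in (0, 10]
as.character(categories)
# negative, negative, small, small, large
```

Passing right = FALSE switches to left-closed intervals, which would put 0 in "small" and 10 in "large".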
When dealing with categorical variables, you might need to replace or
recode certain levels. This can be achieved using chained ifelse() statements,
but a more efficient and readable approach is to use the factor() function:
x <- c("M", "F", "F", NA)
factor(
  x,
  levels = c("F", "M", NA),
  labels = c("Female", "Male", "Missing"),
  exclude = NULL # Include missing values in the levels
)
Sometimes, the number of conditions checked in multiple if statements
can be reduced by cleverly using the fact that in R,
TRUE is coerced to 1 and FALSE to 0 in numeric contexts.
This can be useful for selecting an index based on a set of conditions:
i <- (width >= 960) + (width >= 1140) + 1
p <- p + facet_wrap(vars(class), ncol = c(1, 2, 4)[i])
This does the same thing as the following code, but in a much more concise way:
if (width >= 1140) p <- p + facet_wrap(vars(class), ncol = 4)
if (width >= 960 && width < 1140) p <- p + facet_wrap(vars(class), ncol = 2)
if (width < 960) p <- p + facet_wrap(vars(class), ncol = 1)
This works because each comparison in the parentheses results in
TRUE or FALSE, and when they are added together, they are
coerced to 1 or 0.
If you want to assign a variable to many different groups or intervals,
instead of using a series of if statements, you can use the
findInterval() function. Using the same example above:
breakpoints <- c(960, 1140)
ncols <- c(1, 2, 4)
i <- findInterval(width, breakpoints) + 1
p <- p + facet_wrap(vars(class), ncol = ncols[i])
The findInterval() function finds which interval each number in a
given vector falls into and returns a vector of interval indices.
It's a faster alternative when there are many breakpoints.
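To check that the two index computations agree, here is a sketch that leaves out the plotting code and uses a made-up vector of widths, including values that sit exactly on the breakpoints:

```r
width <- c(800, 960, 1000, 1140, 1200)
breakpoints <- c(960, 1140)
ncols <- c(1, 2, 4)

# Index from adding up logical comparisons
i_logical <- (width >= 960) + (width >= 1140) + 1

# Index from findInterval(); intervals are closed on the left by default
i_interval <- findInterval(width, breakpoints) + 1

ncols[i_interval]
#> [1] 1 2 2 4 4
```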
The match() function can be faster than which(my_vector == value) for looking
up the position of a value in a vector:
index <- match(value, my_vector)
This code sets index to the position of the first occurrence of value
in my_vector, or to NA if value is not found.
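The sketch below contrasts match() with which() and shows match() used as a vectorized table lookup. All data (my_vector, codes, countries) are made up for illustration:

```r
my_vector <- c("b", "a", "c", "a")

# match() reports only the first position; which() reports all of them
match("a", my_vector)
#> [1] 2
which(my_vector == "a")
#> [1] 2 4

# A value that is absent gives NA rather than an empty result
match("z", my_vector)
#> [1] NA

# Vectorized lookup: translate codes via a lookup table
codes <- c("US", "JP", "DE")
countries <- c("United States", "Japan", "Germany")
found <- countries[match(c("DE", "US", "FR"), codes)]
found
```

In this sketch, found holds "Germany", "United States", and NA for the unknown code "FR".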
Hashed environments created by new.env(hash = TRUE) can be used as fast
key–value stores (hash tables).
Lookups (to check if a key exists) are effectively O(1) in a hashed environment
versus O(N) when scanning the names() of a regular list.
This makes it a much faster and more memory-friendly choice than lists or
named vectors for determining if "something already exists".
# Generate keys
set.seed(42)
n_keys <- 100000
keys <- replicate(n_keys, paste0(sample(letters, 10, replace = TRUE), collapse = ""))
# Store in a hashed environment
hash_env <- new.env(hash = TRUE, size = n_keys)
for (k in keys) hash_env[[k]] <- TRUE
# Store in a named list
my_list <- vector("list", length(keys))
names(my_list) <- keys
# Benchmark
n_tests <- 50000
test_keys <- sample(keys, n_tests, replace = TRUE)
system.time(for (k in test_keys) invisible(exists(k, envir = hash_env, inherits = FALSE)))
# user system elapsed
# 0.044 0.000 0.045
system.time(for (k in test_keys) invisible(k %in% names(my_list)))
# user system elapsed
# 29.518 2.026 32.129
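Outside of benchmarking, the everyday operations on such a store look roughly like the sketch below. The environment name seen and the keys are illustrative. Keep in mind that environments have reference semantics, so modifying one inside a function also changes it for the caller.

```r
seen <- new.env(hash = TRUE)

# Insert keys
for (k in c("apple", "banana")) assign(k, TRUE, envir = seen)

# Membership test that ignores enclosing environments
exists("apple", envir = seen, inherits = FALSE)
#> [1] TRUE
exists("cherry", envir = seen, inherits = FALSE)
#> [1] FALSE

# Retrieve with a fallback value for missing keys
get0("cherry", envir = seen, inherits = FALSE, ifnotfound = NA)
#> [1] NA

# Delete a key and list what is left
rm("apple", envir = seen)
ls(seen)
#> [1] "banana"
```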
mapply() applies a function over a set of lists in an element-wise fashion:
mapply(sum, list1, list2, list3)
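To make the call above concrete, the sketch below defines three small lists (made-up values) and adds a second call that uses MoreArgs for a value shared by every call; the scale argument is hypothetical:

```r
list1 <- list(1, 2, 3)
list2 <- list(10, 20, 30)
list3 <- list(100, 200, 300)

# Calls sum(1, 10, 100), sum(2, 20, 200), sum(3, 30, 300)
sums <- mapply(sum, list1, list2, list3)
sums
#> [1] 111 222 333

# MoreArgs holds arguments that stay the same for every call
scaled <- mapply(
  function(x, y, scale) (x + y) * scale,
  1:3, 4:6,
  MoreArgs = list(scale = 10)
)
scaled
#> [1] 50 70 90
```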
To compare two or more vectors element-wise and get the minimum or maximum
of each set of elements, use pmin() and pmax().
vec1 <- c(1, 5, 3, 9, 5)
vec2 <- c(4, 2, 8, 1, 7)
# Instead of using sapply() or a loop:
sapply(seq_along(vec1), function(i) min(vec1[i], vec2[i]))
sapply(seq_along(vec1), function(i) max(vec1[i], vec2[i]))
# Use pmin() and pmax() for a more concise and efficient solution:
pmin(vec1, vec2)
pmax(vec1, vec2)
pmin() and pmax() perform these operations much more efficiently than
alternatives such as applying min() and max() in a loop or using sapply().
This can lead to a noticeable performance improvement when working with large vectors.
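A common application is clamping values to a range, since shorter arguments such as scalar bounds are recycled. A sketch with made-up values:

```r
x <- c(-0.5, 0.2, 1.7, NA)

# Clamp to [0, 1]; the scalar bounds are recycled along x
clamped <- pmin(pmax(x, 0), 1)
clamped
#> [1] 0.0 0.2 1.0  NA

# na.rm = TRUE ignores the NA within each element-wise comparison
floored <- pmax(x, 0, na.rm = TRUE)
floored
#> [1] 0.0 0.2 1.7 0.0
```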
Sometimes we need to run a function on every combination of a set of
parameter values, for example, in grid search. We can use the combination of
expand.grid(), mapply(), and do.call() + rbind() to accomplish this.
Suppose we have a simple function that takes two parameters, a and b:
f <- function(a, b) {
  result <- a * b
  data.frame(a = a, b = b, result = result)
}
Create a grid of a and b parameter values to evaluate:
params <- expand.grid(a = 1:3, b = 4:6)
We use mapply() to apply f to each row of our parameter grid.
We will use SIMPLIFY = FALSE to keep the results as a list of data frames:
lst <- mapply(f, a = params$a, b = params$b, SIMPLIFY = FALSE)
Finally, we bind all the result data frames together into one final data frame:
do.call(rbind, lst)
To generate all possible combinations of a given set of characters,
expand.grid() and do.call() with paste0() can help.
The following snippet produces all possible three-character
strings consisting of lowercase letters and digits:
x <- c(letters, 0:9)
do.call(paste0, expand.grid(x, x, x))
Here, expand.grid() generates a data frame where each row is a unique
combination of three elements from x. Then, do.call(paste0, ...)
concatenates each combination together into a string.
If a function is not natively vectorized (it has arguments that only take
one value at a time), you can use Vectorize() to create a new function
that accepts vector inputs:
f <- function(x) x^2
lower <- c(1, 2, 3)
upper <- c(4, 5, 6)
integrate_vec <- Vectorize(integrate, vectorize.args = c("lower", "upper"))
result <- integrate_vec(f, lower, upper)
unlist(result["value", ])
The Vectorize() function works internally by leveraging the mapply()
function, which applies a function over two or more vectors or lists.
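For comparison, the same integrals can be computed with an explicit mapply() call. This is only a sketch of the idea, not the exact code Vectorize() generates; the helper argument names lo and up are illustrative:

```r
f <- function(x) x^2
lower <- c(1, 2, 3)
upper <- c(4, 5, 6)

# One integrate() call per (lower, upper) pair, keeping only the estimate
values <- mapply(
  function(lo, up) integrate(f, lo, up)$value,
  lower, upper
)
values
#> [1] 21 39 63
```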
The outer() function is useful for applying a function to every pair of
elements from two vectors. This can be particularly useful for U-statistics
and other situations requiring pairwise computations.
Consider two vectors of numeric values for which we wish to compute a custom function for each pair:
x <- rnorm(5)
y <- rnorm(5)
outer(x, y, FUN = function(x, y) x + x^2 - y)
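One caveat: outer() calls FUN once with two long vectors holding all pairs, so FUN must be vectorized. When it is not, wrapping it in Vectorize() is a simple workaround. The scalar-only function g below is a made-up example:

```r
# A function that only handles scalars, because `if` needs a
# length-one condition
g <- function(a, b) if (a > b) a - b else 0

x <- 1:3
y <- 1:3

# outer(x, y, FUN = g) would not work here; Vectorize() makes g
# accept the expanded vectors that outer() passes in
m <- outer(x, y, FUN = Vectorize(g))
m
```

In the result, m[i, j] equals g(x[i], y[j]), so only the entries below the diagonal are non-zero.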
Here are three methods to subtract column means from the non-zero elements of a sparse matrix, with increasing levels of optimization.
library(Matrix)
set.seed(42)
mat <- rsparsematrix(nrow = 1000, ncol = 500, density = 0.01)
Method 1. Loop over columns and subtract the mean for non-zero elements:
f1 <- function(mat) {
  col_means <- colSums(mat) / colSums(mat != 0)
  for (i in seq_len(ncol(mat))) {
    mat[mat[, i] != 0, i] <- mat[mat[, i] != 0, i] - col_means[i]
  }
  mat
}
Method 2. Use a helper matrix to subtract column means with matrix multiplication:
f2 <- function(mat) {
  mat_copy <- mat
  mat_copy@x <- rep(1, length(mat_copy@x))
  col_means <- colSums(mat) / colSums(mat_copy)
  mat - mat_copy %*% Diagonal(x = col_means)
}
Method 3. Modify sparse matrix non-zero values directly:
f3 <- function(mat) {
  col_means <- colSums(mat) / colSums(mat != 0)
  mat@x <- mat@x - rep(col_means, diff(mat@p))
  mat
}
microbenchmark::microbenchmark(f1(mat), f2(mat), f3(mat), times = 100)
#> Unit: microseconds
#> expr min lq mean median uq max
#> f1(mat) 110731.242 113995.1290 133040.4843 115918.4595 119605.159 263641.562
#> f2(mat) 473.509 504.6280 680.0215 571.9705 602.659 4543.620
#> f3(mat) 172.446 192.5155 278.6069 238.0460 259.448 3965.356
The speedup is achieved by avoiding making redundant copies (from R's copy-on-modify semantics) and making in-place modifications as much as possible.
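Method 3 relies on the compressed column layout of the matrix: the x slot stores the non-zero values column by column, and the p slot stores column pointers, so diff(mat@p) is the number of stored entries in each column. The tiny example below, which assumes the Matrix package is installed and uses made-up values, makes this visible. Unlike f3, it uses diff(m@p) for the counts, which is the same thing as long as no explicit zeros are stored.

```r
library(Matrix)

# 3 x 3 sparse matrix with entries (1,1) = 2, (3,1) = 4, (2,2) = 5;
# column 3 is empty
m <- sparseMatrix(i = c(1, 3, 2), j = c(1, 1, 2), x = c(2, 4, 5), dims = c(3, 3))

m@x        # stored values, column by column: 2 4 5
m@p        # column pointers: 0 2 3 3
diff(m@p)  # stored entries per column: 2 1 0

# Column means over stored entries: 3, 5, and NaN for the empty column
col_means <- colSums(m) / diff(m@p)

# Expand each mean to line up with the stored values, then subtract
centered <- m
centered@x <- m@x - rep(col_means, diff(m@p))
centered@x
#> [1] -1  1  0
```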
The alist() function can create lists where some elements are intentionally
left blank (or are "missing"), which can be helpful when we want to specify
formal arguments of a function, especially in conjunction with formals().
Consider this scenario. Suppose we are writing a function that wraps another
function, and we want our wrapper function to have the same formal arguments
as the original function, even if it does not use all of them.
Here is how we can use alist() to achieve that:
original_function <- function(a, b, c = 3, d = "something") a + b
wrapper_function <- function() {
  # Forward the arguments to the original function
  original_function(a = a, b = b, c = c, d = d)
}
# Set the formals using `alist()`; `a` and `b` are left without defaults.
# This must happen outside the function body to affect the function itself.
formals(wrapper_function) <- alist(a = , b = , c = 3, d = "something")
Now, wrapper_function() has the same formal arguments as
original_function(), and any arguments passed to wrapper_function()
are forwarded to original_function(). This way, even if wrapper_function()
does not use all the arguments, it can still accept them, and code that uses
wrapper_function() can be more consistent with code that uses
original_function().
The alist() function is used here to create a list of formals where
some elements are missing, which represents the fact that some arguments
are required and have no default values. This would not be possible
with list(), which cannot create lists with missing elements.
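The contrast between the two can be sketched as follows. The function g and the result variable res are illustrative names only:

```r
# list() evaluates its arguments, so an empty one is an error
res <- tryCatch(list(a = , b = 2), error = function(e) "error")
res
#> [1] "error"

# alist() keeps the empty argument, which formals() reads as "no default"
g <- function() a + b
formals(g) <- alist(a = , b = 2)

names(formals(g))
#> [1] "a" "b"

# b falls back to its default; a must be supplied
g(1)
#> [1] 3
```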
To use internal functions from packages without using :::, you can use
f <- utils::getFromNamespace("f", ns = "package")
f(...)
R functions always return a value. However, some functions are primarily
designed for their side effects. To suppress the automatic printing
of the returned value, use invisible().
f <- function(x) {
  print(x^2)
  invisible(x)
}
The value of x can be used later when the result is assigned to a variable
or piped into the next function.
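A quick sketch of how the invisible return value behaves in practice, reusing the same f; the variable y is illustrative:

```r
f <- function(x) {
  print(x^2)
  invisible(x)
}

# Prints 9 (the side effect) but not the returned 3
f(3)
#> [1] 9

# The returned value is still available when assigned
y <- f(3)
#> [1] 9
y
#> [1] 3

# withVisible() reports whether a value would be auto-printed
withVisible(invisible(3))$visible
#> [1] FALSE
```

Wrapping the call in parentheses, as in (f(3)), makes the returned value print again.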
on.exit() is a useful function for cleaning up side effects, such as
deleting temporary files or closing opened connections, even if a function
exits early due to an error:
f <- function() {
  temp_file <- tempfile()
  on.exit(unlink(temp_file))
  # Do stuff with temp_file
}
f <- function(file) {
  con <- file(file, "r")
  on.exit(close(con))
  readLines(con)
}
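The first function creates a temporary file and ensures it gets deleted
when the function exits, regardless of why it exits. The second does the same
for an open connection. Note that the arguments add and after in on.exit()
control the overwriting and ordering behavior of the expressions: by default
(add = FALSE), a new on.exit() call replaces any expression registered earlier
in the same function, while add = TRUE keeps the earlier ones and
after = FALSE runs the new expression before them.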
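The ordering rules can be observed with a small sketch that records the order in which the exit handlers run. The helper names log_env, record, and cleanup_demo are hypothetical and exist only for this illustration:

```r
# Shared environment used to record the order of events
log_env <- new.env()
log_env$events <- character()
record <- function(label) log_env$events <- c(log_env$events, label)

cleanup_demo <- function() {
  on.exit(record("first"))
  # add = TRUE keeps "first" and runs this one after it
  on.exit(record("second"), add = TRUE)
  # after = FALSE places this one ahead of the others
  on.exit(record("zeroth"), add = TRUE, after = FALSE)
  invisible(NULL)
}

cleanup_demo()
log_env$events
#> [1] "zeroth" "first"  "second"
```

Without add = TRUE on the second call, only the last registered expression would run.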
The stepfun() function is an effective tool for creating step functions,
which can be particularly handy in survival analysis.
For instance, say we have two survival curves generated from Kaplan-Meier
estimators, and we want to determine the difference in survival probabilities
at a given time.
Create the survival curves using survfit():
library("survival")
fit_km <- survfit(Surv(stop, event == "pcm") ~ 1, data = mgus1, subset = (start == 0))
fit_cr <- survfit(Surv(stop, event == "death") ~ 1, data = mgus1, subset = (start == 0))
Convert these survival curves into step functions:
step_km <- stepfun(fit_km$time, c(1, fit_km$surv))
step_cr <- stepfun(fit_cr$time, c(1, fit_cr$surv))
With these step functions, it becomes straightforward to compute the difference in survival probabilities at specific times:
t <- 1:3 * 1000
step_km(t) - step_cr(t)
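To see how stepfun() evaluates a curve without needing the survival package, here is a sketch on made-up survival estimates. The leading 1 is the value before the first event time, and with the default right = FALSE the function jumps exactly at each time point:

```r
# Hypothetical survival estimates: S(t) drops at times 2, 5, and 9
times <- c(2, 5, 9)
surv <- c(0.8, 0.5, 0.3)
step_s <- stepfun(times, c(1, surv))

# Before the first event, at an event time, between events, and beyond the last
probs <- step_s(c(0, 2, 4.9, 5, 100))
probs
#> [1] 1.0 0.8 0.8 0.5 0.3
```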