Leaving the valley of intermediate competence

Replacing for loops with apply

Published 1 Mar 2021 7 min read code

If it ain’t broke, don’t fix it?

So you’ve spent a lot of time learning and practising R and you’re pretty comfortable with using functions, if else statements and loops like they teach at introductory programming. What more is there to improve?

If the answer is no or you subscribe to the quote above, then turn back now. If yes, continue.

I think that even if one has the skills to do fundamental programming competently, there’s always room for improvement or something new to learn. Or you know that there’s a better, more efficient, way to do it but something is holding you back. For me, it’s usually the latter.

In a milestone of using R I think I have wrapped my head around replacing for loops with the apply family, specifically mapply. The last hurdle in delving into functional programming.

I’ve used iterative coding quite a bit over the years and I’ve been using for loops to do so. As I’ve gotten more competent with applying basic concepts (like loops and functions), I’ve been moving towards optimising my code with more advanced R methods. I started with using more manual functions and sourcing functions from external scripts but I was still relying on loops to apply those functions iteratively.

I know loops are inefficient. I’ve waited days for computationally intensive loops on large datasets to finish. I know that apply and co. can be more computationally efficient but in your typical learning something new way, they hadn’t really clicked for me…until now.

I’ve been trying to use apply family functions where appropriate for years but I’ve never felt comfortable with using them to use them from the start. So, I default back to loops to save time and frustration.

I think the slow uptake is because the syntax is different to the logic of loops that are taught, even if apply’s logic is better from a computing perspective. The syntax and the logic is also inconsistent within the apply family; a known disadvantage over similar functions (like purrr::map).

But let’s focus on a specific case before this becomes a cooking blog: replacing for loops. I’m going to assume that you are competent with manual functions, for loops and lists, and that you want to improve your code. I’m going to focus on lists because they are an efficient way of storing lots of similarly structured data in R.

Here are two ways to replace a for loop.

An example loop

Let’s create an example scenario and data:

# some data to use
loop_data <- data.frame(col1 = c(11:15), col2 = c(20:24))

# define variable to change
a <- seq(0.2, 1, 0.2)

loop_data is a data frame with two numeric columns (col1 & col2). We technically won’t use loop_data$col2 but it’s there to create a 5x2 data frame.
a is a variable that we need for our function. There are 5 values.

We want to add each element of a to loop_data$col1 and save that in a new column loop_data$col1a. We will also add a as a column in loop_data just so we can keep track of which value was used to calculate col1a. So the final output should have 25 rows (5 observations in loop_data x 5 values of a) and 4 columns (col1, col2, col1a, a).

We will be storing our data in lists in all our scenarios. Note that I create the list to hold the answers (loop_ans) before the function rather than to append newly calculated answers sequentially to the list within the function. I use the same replicate function before all the examples. You could also start with an empty list.

# data sets stored as a list - must not simplify or it will reduce to a matrix!
loop_ans <- replicate(length(a), loop_data, simplify = FALSE)

# A function to add a value a to a data frame x
loop_function <- function(x, a) {
  x$col1a <- x$col1 + a # add answer to a new column 
  x$a <- a # add a to a new column
  return(x) # give us the updated data frame
}

# Let's loop
for(i in seq_along(a)){
  loop_ans[[i]] <- loop_function(loop_ans[[i]], a = a[i]) 
}

# merge to single data frame
loop_ans <- do.call(rbind, loop_ans)

# view the data
summary(loop_ans)

##       col1         col2        col1a            a      
##  Min.   :11   Min.   :20   Min.   :11.2   Min.   :0.2  
##  1st Qu.:12   1st Qu.:21   1st Qu.:12.4   1st Qu.:0.4  
##  Median :13   Median :22   Median :13.6   Median :0.6  
##  Mean   :13   Mean   :22   Mean   :13.6   Mean   :0.6  
##  3rd Qu.:14   3rd Qu.:23   3rd Qu.:14.8   3rd Qu.:0.8  
##  Max.   :15   Max.   :24   Max.   :16.0   Max.   :1.0

That’s the loop - should be familiar to you. Merging into a single data frame is optional if you want to keep using lists. Now let’s look at lapply for a less elegant solution (!).

1. `lapply`

lapply takes a list as input, does stuff and gives a list as output. Hence, the l in lapply stands for list. The difference with loops and lapply is that lapply can only take one input - your data frame (or element in list). This means that we need to add the corresponding value of a as a column in each element of lapply - in other words to do part of what loop_function did but outside the loop/lapply. Thus, each data frame in the input list should have three columns: col1, col2 & a.

Incidentally, we can add the corresponding a value as a column using mapply and cbind.

# the function only accepts one element: x
lapply_function <- function(x){
  x$col1a <- x$col1 + x$a
  return(x)
}

# Prepare the answer list
lapply_ans <- replicate(length(a), loop_data, simplify = FALSE)

# add a column using mapply
lapply_ans <- mapply(FUN = cbind, lapply_ans, "a" = a, SIMPLIFY = FALSE)

# apply function
lapply_ans <- lapply(lapply_ans, FUN = lapply_function)

# merge to single data frame
lapply_ans <- do.call(rbind, lapply_ans)

# view the data
summary(lapply_ans)

##       col1         col2          a           col1a     
##  Min.   :11   Min.   :20   Min.   :0.2   Min.   :11.2  
##  1st Qu.:12   1st Qu.:21   1st Qu.:0.4   1st Qu.:12.4  
##  Median :13   Median :22   Median :0.6   Median :13.6  
##  Mean   :13   Mean   :22   Mean   :0.6   Mean   :13.6  
##  3rd Qu.:14   3rd Qu.:23   3rd Qu.:0.8   3rd Qu.:14.8  
##  Max.   :15   Max.   :24   Max.   :1.0   Max.   :16.0

As you see it’s not as simple as the loop or mapply and requires mapply anyway 🤷
So we can do better…

2. `mapply`

The m in mapply stands for multiple because it takes multiple arguments and applies them to the data. There are some key differences in the structure of the data and the function compared to lapply:

We can use the original loop function with two variables!
- The additional variables (a in this example) are written after the function FUN is defined in mapply
We can also use the original list (loop_data) without further modification!
We need to tell mapply not to simplify the output into a matrix by default. Note the use of upper case in SIMPLIFY.

# Prepare the answer list
mapply_ans <- replicate(length(a), loop_data, simplify = FALSE)
# mapply function
mapply_ans <- mapply(mapply_ans, FUN = loop_function, a = a, SIMPLIFY = FALSE)
# merge to single data frame
mapply_ans <- do.call(rbind, mapply_ans)
# view the data
summary(mapply_ans)

##       col1         col2        col1a            a      
##  Min.   :11   Min.   :20   Min.   :11.2   Min.   :0.2  
##  1st Qu.:12   1st Qu.:21   1st Qu.:12.4   1st Qu.:0.4  
##  Median :13   Median :22   Median :13.6   Median :0.6  
##  Mean   :13   Mean   :22   Mean   :13.6   Mean   :0.6  
##  3rd Qu.:14   3rd Qu.:23   3rd Qu.:14.8   3rd Qu.:0.8  
##  Max.   :15   Max.   :24   Max.   :16.0   Max.   :1.0

What mapply is doing is using the n^th element of a with the corresponding n^th element in the list loop_data. So the fifth value of a (1.0) is used in the calculations on the 5th data frame in loop_data.

We’ve replace the for loop with a mapply function! 👏
Here’s to functional programming. Next up is purrr::map…

code R stats

Jacinta Kong

Postdoctoral Fellow

My research interests include species distributions, phenology & climate adaptation of ectotherms.