Loops in R

Tidyverse

ggplot2

forloop

map

purrr

An introductory look at loops, if/else, apply and mapping.

Published

June 5, 2024

Repeating Tasks

Often in R (or other programming languages) we want to repeat a task many times. There are a few options for carrying out the same task multiple times: copy and paste the code with minor changes, write a loop that repeats the task for us, map a function across a list (map not in the GIS sense but in the programming sense of applying a function to each element), or write a whole function to carry out our task and then map that function across our list. Generally, we copy and paste code when we only repeat the task a couple of times, map a function over a list when the function to be applied is simple, loop when it is a complex task we want to repeat many times, and write a bespoke function when it is a complex process that we will carry out many times and will come back to often.
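As a rough illustration of the first three options, the sketch below squares three made-up numbers by copying and pasting, with a for loop, and by applying a function across a vector. It uses base R’s sapply() as a stand-in for the map functions; the numbers and object names are invented for this example.

```r
# Made-up input values, for illustration only
values <- c(2, 5, 9)

# Option 1: copy and paste the code with minor changes
square_1 <- values[1] * values[1]
square_2 <- values[2] * values[2]
square_3 <- values[3] * values[3]
pasted <- c(square_1, square_2, square_3)

# Option 2: a for loop that fills a results vector
looped <- numeric(length(values))
for (i in seq_along(values)) {
  looped[i] <- values[i] * values[i]
}

# Option 3: apply a small function to every element
applied <- sapply(values, function(v) v * v)

# All three approaches give the same answer
print(pasted)
print(looped)
print(applied)
```

With only three values the copy and paste version is manageable, but it grows with every extra value while the other two do not.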

Loops

So we want to carry out a task multiple times with a slight difference each time, and we don’t want to copy and paste the code lots of times. For example, we might be reading in many csv files where there is a different one for each day of a month, or we might have 10 locations and want to create and save a map (GIS sense) for each, all in the same style. To achieve this we can use loops. Many of the examples for loops could also be achieved with the apply functions or map functions; each approach repeats some task while systematically changing some element of the task each time. Which one to use depends on the task and on personal preference. We will first talk about for loops.

We set up a for loop by saying:

for (variable in sequence){expression}.

This is potentially easier to understand in code. We will tell R that for every number from 1 to 7 we want it to calculate and print the square of this value.

The curly brackets aren’t needed when the expression is on the same line, but for easier reading we will normally use curly brackets and spread the code over multiple lines.

for (i in 1:7){
  
  x<-i*i
  
  print(paste0(i, " Squared is equal to ", x))
  
}

[1] "1 Squared is equal to 1"
[1] "2 Squared is equal to 4"
[1] "3 Squared is equal to 9"
[1] "4 Squared is equal to 16"
[1] "5 Squared is equal to 25"
[1] "6 Squared is equal to 36"
[1] "7 Squared is equal to 49"

So we can see that the code inside the loop first creates an object called x that stores the square of i, and then prints a statement saying “i squared is equal to x”. On each pass through the loop, i is replaced by the values 1 to 7 in ascending order. We could change the sequence to any arbitrary values we might want. i is often the first name used for the loop variable, but it could be anything (a letter or a phrase).

for (ThisNumber in c(10,2,33,8,152)){
  
  SquareNumber<-ThisNumber*ThisNumber
  
  print(paste0(SquareNumber ," is ",ThisNumber, " squared"))
  
}

[1] "100 is 10 squared"
[1] "4 is 2 squared"
[1] "1089 is 33 squared"
[1] "64 is 8 squared"
[1] "23104 is 152 squared"
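The earlier remark about curly brackets can be seen in a short sketch: when the whole expression sits on the same line as the for statement, the brackets can be left out. The sequence 1:3 is chosen here only to keep the output short.

```r
# A one-line for loop: no curly brackets are needed because
# the expression is on the same line as the for statement
for (i in 1:3) print(i * i)
```

This prints 1, 4 and 9 in turn. As soon as the body needs more than one statement, the curly brackets are required again.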

If / Else

Within our for loop we might want a logical statement so that we apply a different process depending on the sequence value. For example, we might want to print the square of all numbers up to 10 and then print just the raw value after that. To do this we can create an if else statement: when the if condition is true we do one thing, else we do something else.

for (j in 1:20){
  
  if (j<11){
    
    print(paste0(j*j ," is ",j, " squared"))
    
  }else {
    
    print(paste0(j ," is not squared"))
    
    }
  
}

[1] "1 is 1 squared"
[1] "4 is 2 squared"
[1] "9 is 3 squared"
[1] "16 is 4 squared"
[1] "25 is 5 squared"
[1] "36 is 6 squared"
[1] "49 is 7 squared"
[1] "64 is 8 squared"
[1] "81 is 9 squared"
[1] "100 is 10 squared"
[1] "11 is not squared"
[1] "12 is not squared"
[1] "13 is not squared"
[1] "14 is not squared"
[1] "15 is not squared"
[1] "16 is not squared"
[1] "17 is not squared"
[1] "18 is not squared"
[1] "19 is not squared"
[1] "20 is not squared"
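When the rule is as simple as this one, the same result can often be produced without a loop by using the vectorised ifelse() function from base R, which checks the condition for every element of a vector at once. This is offered as an alternative sketch rather than a replacement for the loop above; the object name result is invented here.

```r
# The same sequence as in the loop above
j <- 1:20

# ifelse(test, yes, no) picks from "yes" where the test is TRUE
# and from "no" where it is FALSE, element by element
result <- ifelse(j < 11,
                 paste0(j * j, " is ", j, " squared"),
                 paste0(j, " is not squared"))

print(result)
```

The loop version is easier to extend when each branch needs several steps, whereas ifelse() is compact when each branch is a single expression.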

Saving Objects from Loops

For some more complex analyses we might want to save a dataframe (df) for each loop iteration that we carry out. One easy way to do this is to save each df as an element of a list object. To do this we first create an empty list object. Let’s create a column of random data with some other grouping columns. The random data will have a mean equal to the iteration number of the for loop, and then we can calculate the mean of a specific iteration to see that it is close to the iteration value.

library(tidyverse)

── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ stringr   1.5.1
✔ forcats   1.0.0     ✔ tibble    3.2.1
✔ lubridate 1.9.3     ✔ tidyr     1.3.1
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

List_to_fill<-list()

for (i in 1:100){
  
  df<-data.frame(
    response = rnorm(300,mean=i,sd=4),
    group1 = c("a","b","c"),
    group2 = c("d","e","f")
  )
  
  List_to_fill[[i]]<-df
  
}

df_25<-List_to_fill[[25]]

mean(df_25$response)

[1] 24.90474
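A small variation on the same idea, sketched below under a few assumptions: the list is created at its final length with vector("list", n) instead of starting empty, and set.seed() is called so the random numbers are the same on every run. The seed value, the number of iterations and the object names are all arbitrary choices for this example.

```r
# Fix the random number generator so the example is repeatable
# (the seed value 42 is arbitrary)
set.seed(42)

n_iterations <- 5

# Create a list with one empty slot per iteration
results <- vector("list", n_iterations)

for (i in seq_len(n_iterations)) {
  # Random data whose mean matches the iteration number
  results[[i]] <- data.frame(response = rnorm(300, mean = i, sd = 4))
}

# Number of elements saved, then the mean of each saved df
length(results)
sapply(results, function(d) mean(d$response))
```

Growing an empty list inside the loop, as above, also works; setting the length first is mainly a habit that helps when the number of iterations becomes large.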

Looping Over Elements of a List

So we might have a list of elements (as we do, because we just created one) that we want to loop over. Let’s print the mean values for every tenth element (10, 20, 30, etc.) and plot the data of the last (100th) iteration. We know there are 100 dfs in our list, but we don’t need to know that to set the for loop going, as we can use the length of the list instead. We will use the %in% operator to say “is one of”, and create a vector of all the tens between 10 and the length of our list. For some other examples later on we will also save every tenth dataframe as a csv.

You will notice we use an if without an else. R will just carry out the code when the condition is true and do nothing otherwise. To plot a graph from within a for loop we have to explicitly use the print() function. We want our computer to know that Dataframe_100 is our last df and not our second one, so we will add leading zeros to the file names to make sure the ordering works.

TenSequence<-seq(10,round(length(List_to_fill)),by=10)

TenSequence

 [1]  10  20  30  40  50  60  70  80  90 100

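# Create the output folder if it does not already exist, so write_csv() has somewhere to save
dir.create("loop_data", showWarnings = FALSE)

NewList<-list()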

for ( i in 1:length(List_to_fill)){
  
  if( i %in% TenSequence){
    Mean_Response<-mean(List_to_fill[[i]]$response)
    print(Mean_Response)
    write_csv(List_to_fill[[i]],paste0("loop_data/Dataframe_",str_pad(i,3,pad="0"),".csv"))
  }
  
  if( i == length(List_to_fill)){
    
    df_100<-List_to_fill[[i]]
    
    print(
      ggplot(df_100)+
      geom_density(aes(x=response,colour=group1))+
      theme_classic()+
      labs(x=paste0("Response variable for i = ", i),
         y=paste0("Density distribution of Response for i = ", i))
      )
    
  }
  
  NewList[[i]]<-List_to_fill[[i]]$response
  
}

[1] 10.36748
[1] 19.93955
[1] 30.4026
[1] 40.13306
[1] 49.78063
[1] 60.10767
[1] 69.72728
[1] 79.94347
[1] 90.00314
[1] 99.8032
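To see why the leading zeros matter, the sketch below sorts some file name stems with and without padding. It uses base R’s sprintf() for the padding, which for whole numbers gives the same result as the str_pad() call used above, and it asks sort() for the "radix" method so that the ordering does not depend on the computer’s language settings. The three ID values are picked just for the demonstration.

```r
# Three iteration numbers, stored as integers
ids <- c(10L, 20L, 100L)

# File name stems without and with leading zeros
unpadded <- paste0("Dataframe_", ids)
padded   <- paste0("Dataframe_", sprintf("%03d", ids))

# Text is sorted character by character, so "100" lands before "20"
sort(unpadded, method = "radix")

# With leading zeros the text order matches the numeric order
sort(padded, method = "radix")
```

Without padding, Dataframe_100 sorts into second place, ahead of Dataframe_20; with padding it sorts last, as intended.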

Map

So we can create a new list where we have applied (mapped) a function to every element of the list object we have. This is very similar to for looping, but imagine we create a list of file paths for all the csv files we want to read into R; we can then apply read_csv() across this list of paths and read all the dfs into a new list object. Let’s find the max value in each of our random distributions; remember the means increased and the standard deviation stayed constant. Then let’s read back all the csv files we saved earlier. The first argument to map is the list (or vector) to work over, here the file paths; the second is the function to apply to each element; and any further arguments are passed on to that function, for example col_types=cols(), which stops read_csv from printing column information.

MaxValues<-map_dbl(NewList,max)

ggplot(mapping=aes(x=1:length(MaxValues),y=MaxValues))+
  geom_point()+
  theme_classic()

Vector_of_File_Locations<-list.files("loop_data/",full.names = TRUE)

List_of_dfs<-map(Vector_of_File_Locations,
                 read_csv, 
                 col_types = cols())


head(List_of_dfs[[1]])

# A tibble: 6 × 3
  response group1 group2
     <dbl> <chr>  <chr> 
1     7.27 a      d     
2     9.02 b      e     
3    11.5  c      f     
4     7.51 a      d     
5    13.2  b      e     
6    13.0  c      f
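The argument order of map can also be tried out on something smaller than a folder of csv files. The sketch below uses two made-up vectors, one containing a missing value, and passes na.rm = TRUE through map_dbl() to mean(), in the same way col_types was passed through to read_csv(). The second call shows the alternative of wrapping the call in a small anonymous function. The object names are invented for this example.

```r
library(purrr)

# Made-up data: the first vector contains a missing value
list_of_vectors <- list(c(1, 2, NA), c(4, 5, 6))

# First argument: the list; second: the function;
# anything after that is handed on to the function
means <- map_dbl(list_of_vectors, mean, na.rm = TRUE)

# The same idea written with an anonymous function
maxes <- map_dbl(list_of_vectors, function(v) max(v, na.rm = TRUE))

print(means)
print(maxes)
```

map_dbl() is used rather than map() because each result is a single number, so a numeric vector is returned instead of a list.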

So we now have a list of dataframes. Maybe we would want to combine them but keep the information about which element of the list each came from. To do this we can apply mutate to add a new column and then join them together with bind_rows(). As these functions are part of the tidyverse, we can start to use pipes to lay out our code more nicely. As we want to include information about how far along the list we are, we will use imap before using map_df to bind the dfs together and return a single df. Because our list has no names, imap supplies the position in the list (1 to 10) as .y, so ID records the order of the files rather than the original iteration number.

DF<-List_of_dfs %>% 
  imap(~.x %>% mutate(ID = as.factor(.y))) %>% 
  map_df(bind_rows)
  
ggplot(DF)+
  geom_density(aes(x=response,colour=ID,fill=ID),
               alpha=0.7)+
  scale_colour_viridis_d()+
  scale_fill_viridis_d()+
  theme_classic()
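If a more informative ID is wanted, one possible approach (a sketch, not the method used in this post) is to give the list elements names first and then let bind_rows() turn those names into a column through its .id argument. The two tiny dataframes and the names below are made up for illustration.

```r
library(purrr)
library(dplyr)

# Two tiny made-up dataframes standing in for the files read in above
small_list <- list(
  data.frame(response = c(1, 2)),
  data.frame(response = c(3, 4))
)

# Give each element a name (hypothetical names for this example)
named_list <- set_names(small_list, c("Dataframe_010", "Dataframe_020"))

# .id creates a column holding the name of the element each row came from
combined <- bind_rows(named_list, .id = "ID")

print(combined)
```

In practice the names could be built from the vector of file paths, so each row keeps a record of the file it was read from.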

Function Writing

So we can apply ready-made functions to lists/vectors using map, but maybe what we want to do is more complex than a couple of functions, and we will be doing it really often, so we don’t want to write a for loop every time. This is where we can create our own function that we will save in our global environment and use when we want.

To write a function we create a name (this should not be the same as other functions from packages we will use), define what inputs the function will take (arguments), and then write the code that shows how the inputs will be processed.

FunctionName <- function(argument1,argument2,…){The process we want our function to do}

At its simplest it could be taking a vector and returning the mean. For example:

NewMean<- function(vector){
  
  sum_of_vector<-sum(vector)
  N_vector<-length(vector)
  Average <- sum_of_vector/N_vector
  return(Average)
  
}

Vector_of_Numbers<-c(2,3,6,4,2,2,4,455,6,777,8,8,9,5)

NewMean(Vector_of_Numbers)

[1] 92.21429
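Functions can take more than one argument, and an argument can be given a default value that is used when the caller leaves it out. The sketch below is a hypothetical extension of the idea above, named NewMeanNA here, with an na.rm argument that optionally drops missing values before averaging; it is an illustration only.

```r
# A hypothetical variant of the mean function with a second argument.
# na.rm defaults to FALSE, so missing values are kept unless asked otherwise.
NewMeanNA <- function(vector, na.rm = FALSE) {

  if (na.rm) {
    # Drop missing values before doing the calculation
    vector <- vector[!is.na(vector)]
  }

  sum(vector) / length(vector)
}

Numbers_with_gap <- c(2, 4, NA, 6)

NewMeanNA(Numbers_with_gap)                # NA, because of the missing value
NewMeanNA(Numbers_with_gap, na.rm = TRUE)  # 4
```

The last evaluated expression is returned automatically, so an explicit return() is optional.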

So now we have our bespoke function we can map it over a list of vectors to give a mean value for each vector in the list!

map_dbl(NewList,NewMean)

  [1]  0.5214515  2.0179997  2.7349912  4.0123729  4.9755988  6.0041771
  [7]  6.8691304  8.2631078  8.8711523 10.3674789 11.0093459 11.8649303
 [13] 13.0470201 13.8633256 14.9915393 15.9014576 16.9110091 18.1242437
 [19] 19.1040508 19.9395497 20.7747025 22.4479201 23.1566201 23.8133111
 [25] 24.9047370 25.7758256 26.8396820 28.2881186 29.1104265 30.4026036
 [31] 30.8958319 31.8929704 33.1412271 34.2243564 34.7783360 35.7962329
 [37] 36.4030056 37.8922454 39.0394889 40.1330569 41.1846061 42.4162350
 [43] 42.6296000 43.6898546 44.9472961 45.4498889 46.8415029 47.6696720
 [49] 49.2711117 49.7806280 50.7993460 52.2251594 52.8410748 53.6879532
 [55] 55.0303707 56.3160863 57.0412353 57.9503410 59.3288498 60.1076732
 [61] 60.7913924 62.6105341 62.8846831 63.7208229 64.9659304 65.8981619
 [67] 67.2200312 67.8072290 69.1879936 69.7272831 71.0474479 72.0917500
 [73] 73.0701974 73.8670266 74.6634460 76.1995099 76.8326152 77.9235796
 [79] 79.0114163 79.9434696 81.1518071 82.0465290 82.7513367 84.0554713
 [85] 84.6188140 85.8771705 86.6407995 88.4460343 88.6255059 90.0031396
 [91] 90.9752420 92.1510657 93.0900721 94.1502352 95.2476322 95.8953304
 [97] 96.6317660 97.8135420 98.7293428 99.8031986

This is quite simplistic, but the function could be quite complex, for example some way of plotting distributions. As ggplot2 selects columns by unquoted name rather than by character strings, as some other packages do, a double curly bracket syntax ({{ }}) is used inside the function to show that the argument is the name of a column inside the data and not a separate object in the environment.

Density_Plot_New<-function(data,response_colname,group_colname){
  
  ggplot(data)+
    geom_density(aes(x={{response_colname}},
                     fill={{group_colname}},
                     colour={{group_colname}}),
               alpha=0.7)+
    scale_colour_viridis_d()+
    scale_fill_viridis_d()+
    coord_cartesian(xlim=c(-10,160))+
    theme_classic()
  
}

List_of_dfs %>% 
  imap(~.x %>% mutate(ID = as.factor(.y),
                      response = case_when(group1=="a"~response*0.5,
                                           group1=="b"~response,
                                           group1=="c"~response*1.5))) %>% 
  map(Density_Plot_New,response_colname=response,group_colname=group1)

[[1]] to [[10]]: one density plot is printed for each of the 10 dataframes in the list (plots not shown here).
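The double curly bracket syntax is not specific to ggplot2; it works the same way inside dplyr verbs. As a minimal sketch, the hypothetical function Group_Means below takes a dataframe plus two unquoted column names and returns the mean of one column for each level of the other. The function name and the toy data are made up for this example.

```r
library(dplyr)

# A hypothetical helper: mean of one column within each group of another.
# {{ }} passes the unquoted column names through to group_by() and summarise().
Group_Means <- function(data, response_colname, group_colname) {

  data %>%
    group_by({{ group_colname }}) %>%
    summarise(mean_response = mean({{ response_colname }}),
              .groups = "drop")

}

# Toy data, made up for illustration
toy <- data.frame(response = c(1, 3, 10, 20),
                  group1   = c("a", "a", "b", "b"))

group_summary <- Group_Means(toy, response, group1)

print(group_summary)
```

Because the column names are arguments, the same function can be reused on any dataframe, or mapped over a list of dataframes in the same way as the plotting function.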