Introduction to R

Introduction

An Introduction to R for Research Scientists: From Installation to Reading and Writing Data.

Published

September 20, 2023

Installation

R

To install R we can download it from the internet, either by googling it or by using one of the links below:

For Windows: https://cran.r-project.org/bin/windows/base/

For Mac: https://cran.r-project.org/bin/macosx/

For Linux: https://cran.r-project.org/bin/linux/

RStudio

R is an open source statistical programming language. It can be used through many Graphical User Interfaces (GUIs). My preference is to use RStudio, but VSCode is good and you can also code in base R itself (as well as many others).

We can install RStudio from https://posit.co/download/rstudio-desktop/

Basics

This tutorial relies on code written in RStudio, so the locations of things (Script, Console, Environment, Plots/Help) will be RStudio specific, but the code could be run in any GUI.

Scripts are saved code that you are editing (what I am writing in currently). You then execute (run) the code in the ‘console’ (normally located below the script window).

You can execute one line of code by having your cursor on that line in the script, or you can select many lines, then click the run button or press cmd+enter (Mac) or ctrl+enter (PC).

Everything to the right of a hashtag ‘#’ is not executed, so we can use this to make comments or write notes in our scripts.

R code can be used to do simple calculations with values, or to create lists, vectors, values, dataframes and more complex objects in the “global environment” (normally top right), where R stores our information in a session.

Simple Mathematics

6*6

[1] 36

4*3/4

[1] 3

4+3/5

[1] 4.6

(4+3)/5 # mathematical ordering matters!

[1] 1.4

sqrt(144)

[1] 12

pi

[1] 3.141593

We can use either <- or = to assign a value, list or dataframe into an object, thus saving it to R’s global environment for use later

An object is something (usually some sort of data) that is saved in temporary memory (global environment)

a<- 17

Functions

In R we can use functions to do tasks for us. A function name is normally followed by parentheses (). Some functions are named after what they do and some are less well named.

Within functions there are ‘arguments’ (like options); what you put into these arguments will define how they perform.

One function used very often is c().

We use c() to concatenate elements together, which means combining them into a vector, which is a series of values

b<- c(1,5,5,3,7)

We can apply functions to an object

mean(b)

[1] 4.2

If we want to check the documentation for a function we can go to the Help window, or type ? before the function name.

?mean
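As a small illustration of how arguments change what a function does, the sketch below calls mean() on a vector that contains a missing value. The object name values is made up for this example; na.rm is a documented argument of mean().

```r
# A made-up vector with one missing value (NA)
values <- c(1, 5, 5, 3, 7, NA)

# With the default arguments the missing value carries through to the result
mean(values)

# na.rm = TRUE tells mean() to drop missing values before averaging
mean(values, na.rm = TRUE)

# Arguments can also be matched by name, in any order
mean(na.rm = TRUE, x = values)
```

Naming arguments explicitly is usually easier to read than relying on their position.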

We can then perform different functions between objects

a*b

[1]  17  85  85  51 119

mean(a*b)

[1] 71.4

We can even save the results to a new object

p<-a*b

Then we can look at what is in the new object by running the object's name or by printing it with print()

p

[1]  17  85  85  51 119

print(p)

[1]  17  85  85  51 119

We can also create data systematically with R

For example a sequence of values from 1 to 10 going up by 0.5, or a sequence of whole numbers from 1 to 200

Sequence<-seq(from=1,to=10,by=0.5)

Sequence

 [1]  1.0  1.5  2.0  2.5  3.0  3.5  4.0  4.5  5.0  5.5  6.0  6.5  7.0  7.5  8.0
[16]  8.5  9.0  9.5 10.0

AnotherSequence<-c(1:200)

AnotherSequence

  [1]   1   2   3   4   5   6   7   8   9  10  11  12  13  14  15  16  17  18
 [19]  19  20  21  22  23  24  25  26  27  28  29  30  31  32  33  34  35  36
 [37]  37  38  39  40  41  42  43  44  45  46  47  48  49  50  51  52  53  54
 [55]  55  56  57  58  59  60  61  62  63  64  65  66  67  68  69  70  71  72
 [73]  73  74  75  76  77  78  79  80  81  82  83  84  85  86  87  88  89  90
 [91]  91  92  93  94  95  96  97  98  99 100 101 102 103 104 105 106 107 108
[109] 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126
[127] 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144
[145] 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162
[163] 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179 180
[181] 181 182 183 184 185 186 187 188 189 190 191 192 193 194 195 196 197 198
[199] 199 200
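Two other base R options are often handy for building vectors systematically: the length.out argument of seq() and the rep() function. The sketch below is only an illustration with made-up values.

```r
# Five evenly spaced values between 0 and 1
# (length.out fixes how many values we get instead of the step size)
seq(from = 0, to = 1, length.out = 5)

# Repeat a whole vector twice
rep(c("A", "B"), times = 2)

# Repeat each element twice before moving on to the next
rep(c("A", "B"), each = 2)
```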

We will come back to generating data systematically later.

Data Types

In R there are many different types of data, the most common four are Numeric, Integer, Character and Factor.

Logical and Complex are also data types but very rarely used explicitly. (logical is used a lot by functions but we rarely use it ourselves)

Numeric data are any real numbers, so 8 or 12.3 or 1.00000002 etc., while Integer data are just whole numbers such as 3, 4, 111 etc.

Character data are words or letters surrounded by quotation marks (either " or ') such as "A", "Red", 'Treated'. Character data has no order to it in R's ‘mind’.

Factor data is like character data but R (or you) have assigned an order to it, e.g. "A", "B", "C"
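To check how R is treating a value we can use class(). The sketch below also builds a factor with an order we choose ourselves; the dose labels are made up for illustration.

```r
# class() reports how R is treating a value
class(12.3)     # "numeric"
class(3L)       # the L suffix asks for an integer: "integer"
class("Red")    # "character"
class(TRUE)     # "logical"

# A factor with an order we choose ourselves (hypothetical dose labels)
dose <- factor(c("Low", "High", "Medium", "Low"),
               levels = c("Low", "Medium", "High"))
class(dose)     # "factor"
levels(dose)    # "Low" "Medium" "High"
```

Without the levels argument, R would order the levels alphabetically (High, Low, Medium).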

Objects

As we saw above we can store data in R as an Object. These can be many different types and combinations.

The most common Object types are Vectors, Lists, Matrices, DataFrames and Arrays.

The main differences between these Object types are what types and combinations of data can be stored in them and how many dimensions they have.

Vector

A single group of one data type (it could be Numeric, Character, Integer, Factor), with one dimension is called a Vector.

Vector_Numeric<- c(1.3,5.8,5.122,3.00,7.12)

Vector_Integer<- as.integer(c(1,5,5,3,7)) 
# we change between data types with these functions 

Vector_Character<- c("This","is","A","Character Vector")

Vector_Factor<-as.factor(c("This","is","A","Character", "Vector")) 
# Notice how R automatically orders the levels alphabetically if we don't tell it the order

We can also change between types (even if they don’t fit that description)

Convert_Numbers_To_Characters<-as.character(Vector_Numeric)

Convert_Numbers_To_Characters

[1] "1.3"   "5.8"   "5.122" "3"     "7.12"

Now our numbers are thought of as characters, so we can’t apply numeric operations to them!
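A minimal sketch of what this means in practice, using a made-up character vector: arithmetic fails on the text version, and as.numeric() converts it back.

```r
# Numbers stored as text (a made-up example)
numbers_as_text <- c("1.3", "5.8", "5.122")

# Arithmetic on character data fails, so we catch the error and show its message
tryCatch(numbers_as_text * 2,
         error = function(e) conditionMessage(e))

# Converting back to numeric makes arithmetic possible again
numbers_again <- as.numeric(numbers_as_text)
numbers_again * 2
```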

Matrix

Multiple groups of one data type (it could be Numeric, Character, Integer, Factor), with two dimensions is called a Matrix.

Matrix_Numeric<- as.matrix(c(1.3,5.8,5.122,3.00,7.12))

Matrix_Character<- as.matrix(c("This","is","A","Character Matrix"))
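Note that as.matrix() applied to a vector gives a matrix with a single column. If we want to choose the shape ourselves, the base R function matrix() has nrow and ncol arguments. The object names in this sketch are made up for illustration.

```r
# as.matrix() on a vector gives a matrix with one column
one_column <- as.matrix(c(1.3, 5.8, 5.122, 3.00, 7.12))
dim(one_column)    # 5 rows, 1 column

# matrix() lets us choose the shape; by default values fill column by column
two_by_three <- matrix(1:6, nrow = 2, ncol = 3)
two_by_three
dim(two_by_three)  # 2 rows, 3 columns
```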

DataFrame

Multiple groups of a combination of data types (it could be Numeric, Character, Integer, Factor), with two dimensions is called a Dataframe. Each element (column) of a dataframe must be the same length as the other elements.

df<-data.frame(Column1=c(1.3,5.8,5.122,3.00,7.12),Column2=c(1,5,5,3,7),Column3=Vector_Factor)

List

Multiple groups of a combination of data types or object types (it could be Numeric, Character, Integer, Factor or vectors, dataframes or matrices of these) is called a List. A list itself has a length rather than rows and columns, and each element in a list can be a different length to the other elements.

List_Numeric<-list(c(1.3,5.8,5.122,3.00,7.12),
                   c(1,5,5,3,7))

List_From_Vectors<-list(Vector_Character,Matrix_Numeric,Matrix_Numeric)

List_Different_Lengths<-list(Item1=c(1,2,3,4,5,6),Item2=c("a","B","C","D"), Item3=seq(from=1,to=100,by=1))

List_Different_Lengths

$Item1
[1] 1 2 3 4 5 6

$Item2
[1] "a" "B" "C" "D"

$Item3
  [1]   1   2   3   4   5   6   7   8   9  10  11  12  13  14  15  16  17  18
 [19]  19  20  21  22  23  24  25  26  27  28  29  30  31  32  33  34  35  36
 [37]  37  38  39  40  41  42  43  44  45  46  47  48  49  50  51  52  53  54
 [55]  55  56  57  58  59  60  61  62  63  64  65  66  67  68  69  70  71  72
 [73]  73  74  75  76  77  78  79  80  81  82  83  84  85  86  87  88  89  90
 [91]  91  92  93  94  95  96  97  98  99 100
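To see how many elements a list has, and how long each one is, we can use the base R functions length(), lengths() and str(). The list below is made up for this sketch.

```r
# A small made-up list with elements of different types and lengths
example_list <- list(Numbers = c(1, 2, 3),
                     Letters = c("a", "b"))

# length() counts the elements of the list, not the values inside them
length(example_list)

# lengths() gives the length of each element separately
lengths(example_list)

# str() gives a compact overview of the whole structure
str(example_list)
```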

Array

Multiple groups of one data type (it could be Numeric, Character, Integer, Factor or vectors or matrices), with any number of dimensions (usually more than two, otherwise a Vector or Matrix would do) is called an Array.

Array_1d<-array(c(Matrix_Numeric,c(1.3,5.8,5.122,3.00,7.12)),dim=c(5))
Array_2d<-array(c(Matrix_Numeric,c(1.3,5.8,5.122,3.00,7.12)),dim=c(5,2))
Array_3d<-array(c(Matrix_Numeric,c(1.3,5.8,5.122,3.00,7.12)),dim=c(5,2,2))

Arrays are rarely used, so we probably won't discuss them much further.

Table of Object Types and Their Dimensions

Indexing

Objects have dimensions and we can use a technique called indexing to select specific elements of an object

We use square brackets to do this,

If the object is 1 dimensional, one number will return one value

Vector_Numeric[4]

[1] 3

If the object is a 2 dimensional dataframe, one number will return that column (as a dataframe with a single column)

df[2]

By adding a comma we can select from both dimensions (rows first, then columns)

df[4,2]

[1] 3

If we want all rows but only a specific column we add a comma without a number

df[,2]

[1] 1 5 5 3 7

And vice versa

df[2,]

  Column1 Column2 Column3
2     5.8       5      is

We can also use multiple numbers inside c() to select multiple elements

For example, rows 2 and 4 of all columns

df[c(2,4),]

  Column1 Column2   Column3
2     5.8       5        is
4     3.0       3 Character

Or we can use -c() to select all but the mentioned elements

For example, all rows but not columns 2 and 4 (our df only has three columns, so asking to drop column 4 has no effect here)

df[,-c(2,4)]

  Column1   Column3
1   1.300      This
2   5.800        is
3   5.122         A
4   3.000 Character
5   7.120    Vector
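Positions are not the only way to index. A column can also be selected by its name, and rows can be selected with a condition that is TRUE or FALSE for each row. The dataframe in this sketch is made up for illustration.

```r
# A small made-up dataframe
example_df <- data.frame(Site = c("A", "B", "C", "D"),
                         Count = c(12, 7, 30, 4))

# Select a column by its name instead of its position
example_df[, "Count"]

# Select the rows where a condition is TRUE (all columns kept)
example_df[example_df[, "Count"] > 10, ]

# Combine both: the Site values of rows with a Count above 10
example_df[example_df[, "Count"] > 10, "Site"]
```

Indexing by name tends to be safer than by position, because it still works if the column order changes.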

Packages

R relies upon packages, groups of specific functions, which can be installed from the internet and then loaded into a script.

Base R, a package already installed and loaded within R, is very powerful and useful but less user friendly for some tasks.

From Base R we can use the install.packages() function to install a package from online repositories.

R assumes you want to download packages from CRAN (the official online repository), but sometimes you might want to download from other repositories.

#install.packages("dplyr")

You only have to do this when you first want the package, when you want to update it, or when you have updated R.

Once a package is installed we have to tell R that we want to use functions from this package so we load it

library(dplyr)


Attaching package: 'dplyr'

The following objects are masked from 'package:stats':

    filter, lag

The following objects are masked from 'package:base':

    intersect, setdiff, setequal, union

This needs to be run every new R session when this package is used.
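The masking message above means that, once dplyr is loaded, filter() and lag() refer to the dplyr versions rather than the ones in the stats package. One way to be explicit about which version we want is the package::function() notation. The sketch below only uses packages that ship with R, and the object name moving_average is made up.

```r
# Call a function from a named package, regardless of what else is loaded.
# stats::filter() here calculates a moving average over a window of 3 values
moving_average <- stats::filter(1:5, rep(1/3, 3))
moving_average

# The same notation works for any installed package, for example base::mean()
base::mean(c(1, 5, 5, 3, 7))
```

The first and last values of the moving average are NA because there are not enough neighbouring values to fill the window.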

We can now run functions from the dplyr package. dplyr is part of a group or ‘ecosystem’ of packages called the tidyverse.

We will use this group of packages for reading and writing data into and out of R (readr), manipulating and organising data (dplyr and tidyr) and visualising data (ggplot2)

Data Inspection

First we can make some data into a dataframe and explore this data. Then we can save it as a file and read the file back into R.

R has some very useful random and non-random data generation functions

Year <- seq(from=1950,to=2023,by=1)
Treatment <- c("Control","Treatment 1","Treatment 2")
Rep<- seq(from=1,to=10,by=1)

These are three vectors, and we can check information about them with a few simple functions

length(Year)

[1] 74

summary(Year)

   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
   1950    1968    1986    1986    2005    2023

length(Treatment)

[1] 3

summary(Treatment)

   Length     Class      Mode 
        3 character character

length(Rep)

[1] 10

summary(Rep)

   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
   1.00    3.25    5.50    5.50    7.75   10.00

We want to combine these vectors so that we have a row for every combination of rep, year and treatment. We can do this with expand.grid(), creating a new dataframe called df (this replaces the df we made earlier).

df<-expand.grid(Year=Year,Treatment=Treatment,Rep=Rep)

We can inspect specific elements of a dataframe too

class(df) # type of object

[1] "data.frame"

nrow(df) # number of rows

[1] 2220

ncol(df) # number of columns

[1] 3

dim(df) # dimensions of object

[1] 2220    3

head(df) # the first 6 rows of the df

  Year Treatment Rep
1 1950   Control   1
2 1951   Control   1
3 1952   Control   1
4 1953   Control   1
5 1954   Control   1
6 1955   Control   1

tail(df) # the last 6 rows of the df

     Year   Treatment Rep
2215 2018 Treatment 2  10
2216 2019 Treatment 2  10
2217 2020 Treatment 2  10
2218 2021 Treatment 2  10
2219 2022 Treatment 2  10
2220 2023 Treatment 2  10
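The 2220 rows come from every combination of the inputs: 74 years x 3 treatments x 10 reps. A smaller sketch with made-up vectors makes this easier to see.

```r
# Two short made-up vectors
Site <- c("North", "South")
Visit <- c(1, 2, 3)

# expand.grid() creates one row for every combination of the inputs
small_grid <- expand.grid(Site = Site, Visit = Visit)
small_grid

# The number of rows is the product of the input lengths (2 * 3 = 6)
nrow(small_grid) == length(Site) * length(Visit)
```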

This df holds all the metadata we want for our dataframe. Now we want to make up some response data to go with it

Response<-rnorm(n=nrow(df),mean = 15,sd=8) # we need the response to be same length as the df so we use nrow() for the number of values we want.

We can then add this to our df. The dollar sign is used to select one element (column) from within an object (here a dataframe).

The column Response isn’t present in the data yet, but assigning our Response vector to it with df$Response <- Response adds a new column called Response to the dataframe df.

df$Response<-Response
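Because rnorm() draws random numbers, the Response values will differ every time the code is run. If we want the same values each time, one option is to set a seed first with the base R function set.seed(). The seed value 123 and the object names below are arbitrary choices for this sketch.

```r
# Setting a seed makes the "random" numbers repeatable
set.seed(123)  # 123 is an arbitrary choice
first_draw <- rnorm(n = 5, mean = 15, sd = 8)

# Resetting the same seed and drawing again gives the same values
set.seed(123)
second_draw <- rnorm(n = 5, mean = 15, sd = 8)

# Both draws are identical because they started from the same seed
identical(first_draw, second_draw)
```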

Saving and Loading Data

Data Writing

Once we have our data set we can save it to our computer, but where that is on our computer is important.

To do this we need to know where R is looking for files on your computer. This is called your current working directory.

This information is displayed at the top of the console in RStudio, or you can use the base R function getwd().

We can set our working directory to change where this is in R (not recommended normally), or we can use our saving/loading functions to look in the correct folders (recommended).

Side Note: there is a method for not really needing either of these, called using Projects (highly recommended), but that is a bit more advanced so let's leave that for now.

Let's find out where our current working directory is. We can then create a new folder in that location and save our df there.

getwd()

dir.create("NewFolderName/")

We can now save the df we created to this new folder using the write.csv() function from base R, or even better the write_csv() function from the readr package.

To save to our folder we only need to give the directory we want the file saved to, followed by a /, then the name of the file with its file extension.

Inside reading and writing functions such as write_csv() the main argument will be where the file is to be saved to or read from. We write this out as a character string inside quotations.

# install.packages("readr")

library(readr)

write_csv(df,"NewFolderName/OurNewFile.csv")
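Paths can also be built with the base R function file.path(), and we can check a folder exists before creating it. The sketch below writes to a temporary folder so it does not touch your own files, and it uses base R's write.csv() so it runs without readr. The folder and file names are made up for illustration.

```r
# Use a temporary folder so this sketch does not touch your own files
base_folder <- tempdir()

# file.path() joins folder and file names with a / between them
new_folder <- file.path(base_folder, "ExampleFolder")  # hypothetical folder name
new_file <- file.path(new_folder, "ExampleFile.csv")   # hypothetical file name

# Only create the folder if it is not already there
if (!dir.exists(new_folder)) {
  dir.create(new_folder)
}

# Base R alternative to write_csv(); row.names = FALSE avoids an extra column
write.csv(data.frame(x = 1:3), new_file, row.names = FALSE)

# Check that the file now exists
file.exists(new_file)
```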

Data Reading

Often we don’t want to make fake data as done here, but we will have our own data set that we want to read in from our computer to then clean, organise, manipulate, visualise, analyse and report on.

These data are normally saved as Excel spreadsheets. However, Excel is awful and should never be used for reproducible science! That being said, it is often where a data spreadsheet starts before we bring it into R.

Excel spreadsheets (.xls or .xlsx) have lots of added information that is not actually needed and becomes complicated to work with, so the easiest file format to read into R is a Comma Separated Values sheet (.csv)

We can convert our spreadsheet in Excel to a .csv file, then we read in the .csv file with the base function read.csv() or even better the readr function read_csv().

Again, inside reading and writing functions such as read_csv() the main argument will be where the file is. We write this out as a character string inside quotations.

To navigate up or down inside folders on your computer you use / to signify a folder, with the highest level folder on the far left

For example:

My_DF<-read_csv("NewFolderName/OurNewFile.csv")
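After reading a file in, it is worth checking that what arrived is what we expected. The sketch below does a full round trip with base R's write.csv() and read.csv() on a temporary file, so it runs without readr. The dataframe and object names are made up for illustration.

```r
# Write a small made-up dataframe to a temporary file
temp_file <- tempfile(fileext = ".csv")
original <- data.frame(Year = c(2021, 2022, 2023),
                       Treatment = c("Control", "Treatment 1", "Treatment 2"))
write.csv(original, temp_file, row.names = FALSE)

# Read it back in and check what arrived
read_back <- read.csv(temp_file)
dim(read_back)    # 3 rows, 2 columns
names(read_back)  # "Year" "Treatment"
str(read_back)    # the data type R has chosen for each column
```

The same checks (dim(), names(), str(), head()) apply to a dataframe read in with read_csv().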

Extra Resources: Cheatsheets

For almost all popular and well used packages there are “cheatsheets” available that provide info on all their most used functions and how to use them.

You can google them. However, as Packages update there may be deprecated functions (not in use any more), but normally the package will tell you the name of the new function.

Here is a short list of some of the packages we will use (these links were all found from googling so may break over time):