Introduction to Machine Learning

Brief Introduction to Machine Learning

Categories: Introduction, R, Machine Learning, Stats, Random Forest

Published: January 8, 2024

Basic Concepts

So far we have used models to describe data, but often we want to create a model that can predict new data. This may be new data outside of our temporal or spatial range, or outside of our current knowledge base. The best set of tools for predicting new data is generally Machine Learning (ML) methods. For a full and comprehensive discussion, including advantages and disadvantages, see: https://ml-science-book.com/

Unsupervised ML

One group of ML models, requiring minimal input, are unsupervised models. These do not require training/labelled data and thus require less initial work. These methods group similar data into clusters. As the vast majority of data is unlabelled, unsupervised models are appealing because they can be applied with no prior labelling work. However, the user will normally define how many groups to split the data into, or at what level of similarity to split, and may thereby influence the model itself. Generally, these methods are poor at complex prediction tasks and are easily outperformed by supervised ML.
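As a quick illustration of the idea (a minimal sketch, not part of the worked example later in this post), we can run k-means clustering on the penguin biometric measurements from the palmerpenguins dataset used below; the algorithm never sees the species labels, but note that we still have to choose the number of clusters ourselves.

library(palmerpenguins)

# Unsupervised sketch: cluster the numeric biometric columns only,
# scaled so that no single variable dominates the distance calculation
penguin_numeric <- scale(na.omit(penguins[, c("bill_length_mm", "bill_depth_mm",
                                              "flipper_length_mm", "body_mass_g")]))

set.seed(123)
km <- kmeans(penguin_numeric, centers = 3)  # the user chooses how many clusters

table(km$cluster)  # cluster sizes; the species labels were never used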

Supervised ML

Supervised learning algorithms, or models, are models where we ‘train’ a model with known, labelled data so that it can discern patterns; when it is then given unknown/unlabelled data it will, hopefully, be able to predict what the label should be. These models can perform both regression (continuous data) and classification (categorical data), but they are highly reliant on the data used to train them, as values or classes outside of what they have already seen will be predicted with much lower accuracy. There are several types of supervised ML with varying levels of complexity, including General Linear Models (GLMs), K-Nearest Neighbours (KNN), Random Forests (RF), Support Vector Machines (SVMs), eXtreme Gradient Boosting (XGBoost) and Neural Networks (NNs).
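As a minimal sketch of that idea (using the class package’s k-nearest neighbours function rather than the tidymodels approach used later), here we train on labelled penguin measurements and then predict species for a held-out set:

library(class)           # provides knn()
library(palmerpenguins)

peng <- na.omit(penguins[, c("species", "bill_length_mm", "bill_depth_mm",
                             "flipper_length_mm", "body_mass_g")])

set.seed(123)
train_rows <- sample(nrow(peng), size = round(0.75 * nrow(peng)))  # simple 75/25 split

# scale the predictors, applying the training centring/scaling to the test set
train_x <- scale(peng[train_rows, -1])
test_x  <- scale(peng[-train_rows, -1],
                 center = attr(train_x, "scaled:center"),
                 scale  = attr(train_x, "scaled:scale"))

pred <- knn(train = train_x, test = test_x, cl = peng$species[train_rows], k = 5)

mean(pred == peng$species[-train_rows])  # proportion of test labels predicted correctly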

General Workflow for Supervised ML

Regardless of the algorithm used, the general steps to create an ML model are the same:

1). You have a dataset with a labelled response variable (quantitative or qualitative) and one or more predictor variables (quantitative, qualitative or a combination; also called features).

2). These data are split into training data* and testing data. Depending on the amount of data you have the ratio will vary, but 75/25 is common.

3). The training data are used to train an ML algorithm*.

4). The trained model is applied to the testing data to predict what their labels should be.

5). The predicted label is compared to the true label of the testing data, and the accuracy of the model is calculated from the difference/similarity between the predicted and true labels.

*For higher accuracy and better model performance, the training data are often split again into folds (cross-validation folds) so that model training can be repeated with different values of settings termed hyperparameters (the tuning elements differ for each type of model). Each set of hyperparameter values is assessed on each fold, and the best-performing values are selected for the final model, which is then used to predict on the testing data.

Which model type to choose? And why are the data more important?

The most popular ML models are Random Forests (RFs), which require very little hyperparameter tuning and usually provide good accuracy “off the shelf”. XGBoost models can reach higher accuracy once their hyperparameters are effectively tuned, but are not as good as RFs off the shelf. Similarly, Neural Networks are generally the most accurate and powerful models available, but require far more hyperparameter tuning and model architecture design to perfect. With all the different ML models, the quantity and quality of training data will dictate the eventual ability of the model. For example, none of these models can predict a class they have never seen in training in a classification problem, and likewise many of them (but not all) will be unable, or very poor at, extrapolating outside of the range of the training labels.
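To make the extrapolation point concrete, here is a small sketch (an added illustration using the ranger package, which we also use as the modelling engine later): a random forest trained on x values between 0 and 10 cannot predict responses beyond the range of y values it saw in training, whereas a simple linear model can.

library(ranger)

set.seed(123)
train_df <- data.frame(x = runif(200, 0, 10))
train_df$y <- 2 * train_df$x + rnorm(200)     # y ranges roughly from 0 to 20

new_df <- data.frame(x = c(5, 15, 20))        # 15 and 20 lie outside the training range

rf_fit <- ranger(y ~ x, data = train_df)
lm_fit <- lm(y ~ x, data = train_df)

predict(rf_fit, data = new_df)$predictions    # RF predictions plateau near the training maximum (~20)
predict(lm_fit, newdata = new_df)             # the linear model extrapolates (~10, ~30, ~40)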

Theory

We won’t go into depth on what each of these models is doing inside these functions, as there are many useful and clear tutorials on the theory behind each machine learning algorithm. Here we will focus on their application, common mistakes, tricks and challenges. For further reading I would recommend: Julia Silge for tidymodels tutorials (where I learnt): https://juliasilge.com/, for more detailed theory of random forests (although with Python code examples): https://www.analyticsvidhya.com/blog/2021/06/understanding-random-forest/, and for an in-depth explanation of neural networks I would recommend Jeremy Howard’s YouTube channel: https://www.youtube.com/@howardjeremyp (again mostly from a Python coding approach).

Random Forest with Tidymodels

Here we will go through a simple worked example of a Random Forest in R. There are many packages for applying machine learning algorithms, but to continue the ‘tidy’ theme we will use the ‘tidymodels’ ecosystem of packages. Tidymodels brings together many different modelling engines, packages and frameworks, with a consistent coding style across all of them, so that we can swap model type, package or style with minimal effort.
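For example (a sketch only, assuming the relevant engine packages such as ranger and xgboost are installed), swapping from a random forest to a boosted-tree model only means changing the model specification; the recipe, workflow and tuning code around it stay the same.

library(tidymodels)

# A random forest specification using the ranger engine...
rf_spec <- rand_forest(trees = 500) %>%
  set_mode("classification") %>%
  set_engine("ranger")

# ...and a boosted-tree (XGBoost) specification. Everything else in the
# workflow (recipe, tuning, fitting) would be left unchanged.
xgb_spec <- boost_tree(trees = 500) %>%
  set_mode("classification") %>%
  set_engine("xgboost")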

ML Modelling steps:

  • Pre-process: tidying, splitting and transforming data

  • Train: Training the model (with or without hyperparameter training)

  • Validate: Testing the model on the testing data and calculating accuracy

Let’s Give it a Go

We will actually only need the tidymodels package, as it loads much of the tidyverse in the background too! But let’s also bring in our old favourite penguin dataset from the palmerpenguins package!

library(tidymodels)
── Attaching packages ────────────────────────────────────── tidymodels 1.1.1 ──
✔ broom        1.0.5     ✔ recipes      1.0.8
✔ dials        1.2.0     ✔ rsample      1.2.0
✔ dplyr        1.1.4     ✔ tibble       3.2.1
✔ ggplot2      3.4.4     ✔ tidyr        1.3.1
✔ infer        1.0.5     ✔ tune         1.1.2
✔ modeldata    1.2.0     ✔ workflows    1.1.3
✔ parsnip      1.1.1     ✔ workflowsets 1.0.1
✔ purrr        1.0.2     ✔ yardstick    1.2.0
── Conflicts ───────────────────────────────────────── tidymodels_conflicts() ──
✖ purrr::discard() masks scales::discard()
✖ dplyr::filter()  masks stats::filter()
✖ dplyr::lag()     masks stats::lag()
✖ recipes::step()  masks stats::step()
• Use suppressPackageStartupMessages() to eliminate package startup messages
library(palmerpenguins)

Attaching package: 'palmerpenguins'
The following object is masked from 'package:modeldata':

    penguins

[Penguins image by Allison Horst]

So the first step in any analysis is some basic data exploration, i.e. plotting our response variable against other features.

We will try to classify penguin species based on biometric information.

penguins %>% 
  ggplot(aes(x = flipper_length_mm, y = bill_length_mm, colour = species)) +
  geom_point() +
  theme_classic()
Warning: Removed 2 rows containing missing values (`geom_point()`).

Pre-Processing

The data are already tidy, so let’s split them (initial_split() uses a 75/25 split by default) and then carry out any transformations we might want to do.

penguins_noNA<-penguins %>% 
  drop_na()

splitdata<-initial_split(penguins_noNA)

df_train<-splitdata %>% 
  training()

df_test<-splitdata %>% 
  testing()

We will be tuning hyperparameters later, so we will also split the training data further into cross-validation folds (vfold_cv() creates 10 by default). When we look at the fold object we can see that there are 10 resamples of the same training data, each holding out a different randomly chosen subset. As this element involves randomisation, we set a seed to make sure we get consistent results each time the code is run. We also want to make sure that the different levels of the response variable are present on both sides of each fold, so we set strata to be species.

set.seed(234)

df_train_folds<-vfold_cv(df_train,strata = species)

df_train_folds
#  10-fold cross-validation using stratification 
# A tibble: 10 × 2
   splits           id    
   <list>           <chr> 
 1 <split [223/26]> Fold01
 2 <split [223/26]> Fold02
 3 <split [223/26]> Fold03
 4 <split [223/26]> Fold04
 5 <split [223/26]> Fold05
 6 <split [224/25]> Fold06
 7 <split [225/24]> Fold07
 8 <split [225/24]> Fold08
 9 <split [226/23]> Fold09
10 <split [226/23]> Fold10

With tidymodels we use recipes to integrate transformations and the model formula.

For our categorical predictors (island and sex) we will convert them to dummy variables. We will also “prep” and “juice” the recipe, which creates a dataset with the recipe applied to it, for use later on.

penguin_recipe<-recipe(species~.,data=df_train)%>% 
  step_dummy(island,sex) 


penguin_juiced <- prep(penguin_recipe) %>% 
  juice()

Model Building

We now create a model object. We can give specific values to the three hyperparameters (mtry, trees and min_n), or we can tell the model we will tune them across our data folds. Here we create the random forest model, fix trees at 500, mark mtry and min_n for tuning, and tell tidymodels it is a classification-mode model using the ranger “engine”.

tune_spec <- rand_forest(
  mtry = tune(),
  trees = 500,
  min_n = tune()
) %>%
  set_mode("classification") %>%
  set_engine("ranger")

Once we have the recipe and the model we can then add them to a workflow and tune the model on the folded training data.

tune_wf <- workflow() %>%
  add_recipe(penguin_recipe) %>%
  add_model(tune_spec)

set.seed(234)

tune_res <- tune_grid(
  tune_wf,
  resamples = df_train_folds,
  grid = 10
)
i Creating pre-processing data to finalize unknown parameter: mtry
tune_res
# Tuning results
# 10-fold cross-validation using stratification 
# A tibble: 10 × 4
   splits           id     .metrics          .notes          
   <list>           <chr>  <list>            <list>          
 1 <split [223/26]> Fold01 <tibble [20 × 6]> <tibble [0 × 3]>
 2 <split [223/26]> Fold02 <tibble [20 × 6]> <tibble [0 × 3]>
 3 <split [223/26]> Fold03 <tibble [20 × 6]> <tibble [0 × 3]>
 4 <split [223/26]> Fold04 <tibble [20 × 6]> <tibble [0 × 3]>
 5 <split [223/26]> Fold05 <tibble [20 × 6]> <tibble [0 × 3]>
 6 <split [224/25]> Fold06 <tibble [20 × 6]> <tibble [0 × 3]>
 7 <split [225/24]> Fold07 <tibble [20 × 6]> <tibble [0 × 3]>
 8 <split [225/24]> Fold08 <tibble [20 × 6]> <tibble [0 × 3]>
 9 <split [226/23]> Fold09 <tibble [20 × 6]> <tibble [0 × 3]>
10 <split [226/23]> Fold10 <tibble [20 × 6]> <tibble [0 × 3]>

Let’s plot the results of each fold using the ROC AUC (area under the ROC curve).

tune_res %>%
  collect_metrics() %>%
  filter(.metric == "roc_auc") %>%
  select(mean, min_n, mtry) %>%
  pivot_longer(min_n:mtry,
    values_to = "value",
    names_to = "parameter"
  ) %>%
  ggplot(aes(value, mean, color = parameter)) +
  geom_point(show.legend = FALSE) +
  facet_wrap(~parameter, scales = "free_x") +
  labs(x = NULL, y = "AUC")+
  theme_classic()

This shows how the AUC varies with min_n and mtry across the folds. We can then use select_best() to pick the hyperparameter values with the best AUC, and finalise our model with them.

best_auc <- select_best(tune_res, metric = "roc_auc")

final_rf <- finalize_model(
  tune_spec,
  best_auc
)

final_rf
Random Forest Model Specification (classification)

Main Arguments:
  mtry = 2
  trees = 500
  min_n = 30

Computational engine: ranger 

So this is our model. Let’s investigate the different features and their importance using the vip package. This is where we use the juiced dataset.

library(vip)

set.seed(234)

final_rf %>%
  set_engine("ranger", importance = "permutation") %>%
  fit(species~.,data=penguin_juiced
  ) %>%
  vip(geom = "point")+
  theme_classic()

From this we can see that bill length is the best predictor of species, followed by flipper length.
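As a quick sanity check on that ranking (an extra plot, not part of the original workflow), we can look at how cleanly bill length separates the three species in the training data.

df_train %>% 
  ggplot(aes(x = species, y = bill_length_mm, fill = species)) +
  geom_boxplot(show.legend = FALSE) +
  theme_classic()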

Validate the Model on the Testing Data

Let’s prepare the workflow with our final model, then apply the model to the testing data and collect the accuracy metrics.

final_wf <- workflow() %>%
  add_recipe(penguin_recipe) %>%
  add_model(final_rf)

final_res <- final_wf %>%
  last_fit(splitdata)

final_res %>%
  collect_metrics()
# A tibble: 2 × 4
  .metric  .estimator .estimate .config             
  <chr>    <chr>          <dbl> <chr>               
1 accuracy multiclass     0.952 Preprocessor1_Model1
2 roc_auc  hand_till      0.995 Preprocessor1_Model1

So in this very specific example we have got ridiculously good results in our model. This is not always the case.

We can also create a confusion matrix to see where the inaccuracies lie (i.e. which species are being mistaken for which).

final_res  %>% 
  unnest(cols = .predictions) %>% 
  conf_mat(species, .pred_class)   # truth first, then the predicted class
           Truth
Prediction  Adelie Chinstrap Gentoo
  Adelie        30         3      0
  Chinstrap      1        17      0
  Gentoo         0         0     33

So we can see from this that almost all the predictions are the same as the truth. If you want to see how much the random number generation from before affects the results, you can rerun this script with different set.seed() numbers.
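If you want more detail than overall accuracy, the confusion matrix object from yardstick also has a summary() method (a quick sketch below) that reports additional metrics such as sensitivity, specificity and balanced accuracy, macro-averaged across the three species.

final_res %>% 
  unnest(cols = .predictions) %>% 
  conf_mat(species, .pred_class) %>% 
  summary()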