In this tutorial we will take some known landcover classes, associate them with the surface reflectance of those areas from Sentinel-2, and then use tidymodels to build a supervised model that predicts habitat from the surface reflectance of another Sentinel-2 image provided by the European Space Agency (ESA). We will then validate this new prediction using separate data.
Necessary Data For ML
This breaks down into four different datasets:
Training/Testing Sentinel-2 Image: the surface reflectance we will use to train the supervised machine learning model.
Training/Testing Habitat Information: the classes associated with the surface reflectance from the Training/Testing Sentinel-2 Image.
Validation Sentinel-2 Image: a new Sentinel-2 image that we will use our trained model on.
Validation Habitat Information: the true classes associated with the Validation Sentinel-2 Image.
Our Data
Creating these datasets is itself a large task and requires resources outside of R. To download Sentinel-2 imagery you need to create a free account here: https://dataspace.copernicus.eu/. Once you have that you can browse locations and dates and download a full tile. This arrives as a folder called a SAFE file, which contains a lot of important information and data that we don't need right now. Likewise, the tif file for a whole Sentinel-2 tile is quite large and would cost space and time in this tutorial (plus it won't work nicely with GitHub as it is over 100 MB). So I have taken two Sentinel-2 tiles from the European Atlantic coast covering areas with large cities, forests and farmland: Bordeaux and Bilbao. I have then taken the landcover classification from ESRI (found here: https://www.arcgis.com/home/item.html?id=cfcb7609de5f478eb7666240902d4d3d). This landcover classification is itself created with complex modelling techniques and has its own uncertainty/accuracy. We will ignore this uncertainty and pretend the landcover is the truth! (Just for simplicity's sake; this isn't good practice at all!)
We will plot the RGB alongside the landcover classification.
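A minimal sketch of that plot, assuming the 10 m bands are already stacked in a SpatRaster called bdx_s2 (layers ordered B02, B03, B04) and the classification is in bdx_class; the object names and layer order are assumptions, not the tutorial's exact code:

```r
library(terra)

par(mfrow = c(1, 2))
# true-colour composite: red = B04 (layer 3), green = B03 (2), blue = B02 (1)
plotRGB(bdx_s2, r = 3, g = 2, b = 1, stretch = "lin")
# the ESRI classification, still using its shipped colour table at this point
plot(bdx_class, main = "ESRI landcover")
```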
The classes are:
1: Water
2: Trees
4: Flooded vegetation
5: Crops
7: Built Area
8: Bare ground
9: Snow/Ice
10: Clouds
11: Rangeland
We will convert the numbers to their classification names, turn them into a factor, and assign them a nice(ish) colour scheme. There is already a colour table (coltab) associated with the raster, which would plot the colours ESRI assigned to the tif, but it causes issues when altering the factor, so we remove it by setting it to NULL.
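A sketch of those steps in terra, assuming the classification raster is called bdx_class (the colour hex codes are placeholder choices, not the tutorial's exact palette):

```r
library(terra)

# remove the ESRI colour table; it interferes with altering the factor below
coltab(bdx_class) <- NULL

# map the numeric codes to readable class names (makes the layer categorical)
class_df <- data.frame(
  id    = c(1, 2, 4, 5, 7, 8, 9, 10, 11),
  class = c("Water", "Trees", "Flooded", "Crops", "Built",
            "Bare", "Snow/Ice", "Clouds", "Rangeland")
)
levels(bdx_class) <- class_df

# a nice(ish) colour per class
class_cols <- c("#419BDF", "#397D49", "#7A87C6", "#E49635", "#C4281B",
                "#A59B8F", "#E3E2DE", "#D6D6D6", "#E9C882")
plot(bdx_class, col = class_cols)
```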
Before we can combine all bands together into one multiband raster we need to resample the 20 m resolution bands to 10 m. We will use the function we used previously. Let's also plot this to visualise.
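The tutorial's earlier helper function isn't reproduced here; in plain terra the step might look like the sketch below (s2_10m and s2_20m are assumed object names). Bilinear resampling suits continuous reflectance values; categorical layers would need method = "near".

```r
library(terra)

# resample the 20 m bands onto the 10 m grid so the geometries match
s2_20m_resampled <- resample(s2_20m, s2_10m, method = "bilinear")

# now the grids align and all bands can be stacked into one SpatRaster
bdx_s2 <- c(s2_10m, s2_20m_resampled)
plot(bdx_s2)
```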
The intensity of spectral bands normally increases with wavelength, and for Sentinel-2 the values range from roughly 0 to 10,000 (surface reflectance scaled by 10,000). I set a colour-scale limit of 5000 to see the majority of the data; only a few pixels have a much higher value than that.
Align Classification
So now we can add the classification as another band in the raster, using c() as above. The bands are also not named in an easy-to-read way, so let's rename them to make them easier to work with. Note that the original name of the classification tif started with a number and contained a hyphen, which is generally bad naming style; we can use backticks (``) to tell R that it is a column name. We also remove NA values to make life easier. These tidyverse functions applied to SpatRasters are made available by the tidyterra package, which we have used previously. They make code easier to read, but are slower than terra functions when dealing with very big SpatRasters.
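A sketch of that tidy-up with tidyterra; the layer names here (including the awkward hyphenated one) are hypothetical stand-ins for the real ones:

```r
library(tidyterra)

# stack the classification onto the reflectance bands with c(), as above
bdx_all <- c(bdx_s2, bdx_class)

bdx_all <- bdx_all |>
  rename(
    blue = B02, green = B03, red = B04,  # assumed original layer names
    Class = `2021-landcover`             # backticks handle the bad name
  ) |>
  drop_na()  # drop cells with NA in any layer
```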
Now this could be our training data, but ~20,000,000 pixels is a lot to train a model with. We will therefore take a random subset to create our training data. spatSample() takes a random sample of our raster and returns a dataframe. Let's create a dataset of 10,000 rows. We will also change the Class column to be a readable factor.
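A minimal sketch of the sampling step, under the object names assumed above:

```r
library(terra)
library(dplyr)

set.seed(123)  # make the random sample reproducible
# random sample of 10,000 pixels, returned as a dataframe
train_df <- spatSample(bdx_all, size = 10000, method = "random", na.rm = TRUE)

# ensure the labels are a readable factor
train_df <- train_df |> mutate(Class = as.factor(Class))
```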
This will follow exactly the same steps as the other ML tutorials: split data, create recipe, specify model, set up workflow, tune hyperparameters and then finalise the best model. (We could also ensemble many models when finalising, as in the stacks package.)
Data Splitting
First we will split our model-building data from Bordeaux into training and testing sets, and create cross-validation folds of the training data.
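A sketch of the split with rsample (loaded via tidymodels); the split proportion and fold count are assumptions, not the tutorial's exact values:

```r
library(tidymodels)

set.seed(234)
# stratify by Class so rare classes appear in both splits
bdx_split <- initial_split(train_df, prop = 0.75, strata = Class)
bdx_train <- training(bdx_split)
bdx_test  <- testing(bdx_split)

# cross-validation folds of the training data, for tuning
bdx_folds <- vfold_cv(bdx_train, v = 5, strata = Class)
```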
We can now write a recipe with any transformations and updates of roles we want. We will keep this simple but we could add in all sorts of updates or transformations if we wanted to.
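Keeping it simple, the recipe can just declare the formula; steps such as step_normalize(all_numeric_predictors()) could be slotted in here later:

```r
# predict Class from every reflectance band in the training data
bdx_recipe <- recipe(Class ~ ., data = bdx_train)
```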
We will create a random forest model and tune mtry, trees and min_n over all the folds. We could use a different model at this stage and the rest of the code would stay the same, or we could even create a model stack as in the Comparing and Ensembling ML tutorial.
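A sketch of that specification with parsnip, assuming the ranger engine:

```r
# random forest with three hyperparameters flagged for tuning
rf_spec <- rand_forest(mtry = tune(), trees = tune(), min_n = tune()) |>
  set_engine("ranger") |>
  set_mode("classification")
```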
Once we have the recipe and the specification we can add them to a workflow and tune the hyperparameters of the model across the folded training data. We will use a different method to tune the hyperparameters here: tune_race_anova() from the finetune package. This races the candidate hyperparameter combinations, dropping ones that have performed poorly on earlier folds rather than evaluating them on every fold. We will also set up some simple parallel processing to speed this up, using the parallel and doParallel packages.
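A sketch of the workflow, parallel backend and racing tune; the grid size and core count are assumptions:

```r
library(finetune)
library(doParallel)

rf_workflow <- workflow() |>
  add_recipe(bdx_recipe) |>
  add_model(rf_spec)

# simple parallel backend: one worker per core, minus one for the OS
cl <- makePSOCKcluster(parallel::detectCores() - 1)
registerDoParallel(cl)

set.seed(345)
# racing: clearly poor hyperparameter combinations are abandoned early
rf_tuned <- tune_race_anova(
  rf_workflow,
  resamples = bdx_folds,
  grid = 20
)

stopCluster(cl)
```

The racing approach saves time precisely because bad candidates never get evaluated on the remaining folds.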
So we see varying accuracy depending on the hyperparameters, but all combinations are very similar, around or above 0.8. Let's select the best and finalise our model.
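A sketch of the finalising step, reusing the objects assumed above:

```r
# pick the best hyperparameters by accuracy and lock them into the workflow
best_params <- select_best(rf_tuned, metric = "accuracy")
final_workflow <- finalize_workflow(rf_workflow, best_params)

# fit on the full training data and evaluate once on the held-out test set
final_fit <- last_fit(final_workflow, bdx_split)
collect_metrics(final_fit)
```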
This is pretty good accuracy overall. We can also create a confusion matrix to see where the inaccuracies lie (i.e. false positives and false negatives).
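The matrix below can be produced with yardstick's conf_mat() on the collected test predictions (object names as assumed above):

```r
# cross-tabulate predicted class against the 'true' class on the test set
final_fit |>
  collect_predictions() |>
  conf_mat(truth = Class, estimate = .pred_class)
```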
Truth
Prediction Bare Built Crops Flooded Rangeland Trees Water
Bare 2 0 0 0 0 0 0
Built 1 451 46 0 14 60 0
Crops 1 64 164 0 36 127 0
Flooded 0 0 0 0 0 1 0
Rangeland 3 36 71 0 63 28 0
Trees 0 44 29 0 12 963 0
Water 2 2 0 0 0 3 278
Validate with New Data
Read in New Data
Above we used the testing dataset to evaluate model accuracy. Those data were randomly split from the same dataset, which in our case means the same satellite image, the same day and a geographically similar location. But we may want a model that can be applied at scale, across large geographical ranges and different images. To do this we want to validate our model on another image. We will take our Bilbao image and its labels (again, the labels were generated by another model, so a bit flawed, but let's assume they are correct), then apply our model and compare our predictions with the 'true' labels. I have used the full tif to predict on, but a sample could be taken as above; that code is shown commented out.
We will predict on these new data, creating a new column. To calculate the confusion matrix and accuracy the two columns need the same factor levels, but not every level appears in both columns, so we will set the levels of both factor columns to include all potential levels.
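A hedged sketch of that step, assuming the Bilbao pixels sit in a dataframe called bilbao_df and reusing final_fit from above:

```r
# pull the fitted workflow out of the last_fit() result and predict
rf_model <- extract_workflow(final_fit)
bilbao_preds <- bilbao_df |>
  mutate(.pred_class = predict(rf_model, bilbao_df)$.pred_class)

# give both columns the union of all levels so conf_mat()/accuracy() work
all_levels <- union(levels(bilbao_preds$Class),
                    levels(bilbao_preds$.pred_class))
bilbao_preds <- bilbao_preds |>
  mutate(
    Class       = factor(Class, levels = all_levels),
    .pred_class = factor(.pred_class, levels = all_levels)
  )

accuracy(bilbao_preds, truth = Class, estimate = .pred_class)
```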
Okay so we can see that the accuracy is 0.701, not amazing but also not awful.
Plotting Prediction
So let's look at a map, comparing the new prediction with the 'true' class. We will clean the prediction dataset, convert the predictions to rasters, then plot them alongside the original 'true' classification. As before, we remove the colour table as it makes mutation difficult.
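One way to rebuild rasters from the prediction dataframe, assuming it kept its x and y pixel coordinates (a sketch, not the tutorial's exact code):

```r
library(terra)

# rebuild SpatRasters from x/y coordinates plus the class columns
pred_rast <- rast(bilbao_preds[, c("x", "y", ".pred_class")], type = "xyz")
true_rast <- rast(bilbao_preds[, c("x", "y", "Class")], type = "xyz")

par(mfrow = c(1, 2))
plot(pred_rast, main = "Predicted")
plot(true_rast, main = "'True' ESRI class")
```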
Okay, so these two images look very similar overall but with some big differences, especially in the amount of rangeland predicted. We can also check whether certain areas were predicted better than others.