Using Machine Learning with Satellite Imagery
Downloading Satellite Imagery
In this tutorial we will take some known landcover classes, associate them with the Sentinel-2 surface reflectance of those areas, and then use tidymodels to build a supervised prediction model that predicts habitat from the surface reflectance of another Sentinel-2 image provided by the European Space Agency (ESA). We will then validate this new prediction using separate data.
Necessary Data For ML
This breaks down into 4 different data types:
- Training/Testing Sentinel-2 Image: the surface reflectance we will use to train the supervised machine learning model.
- Training/Testing Habitat Information: the classes associated with the surface reflectance from the Training/Testing Sentinel-2 Image.
- Validation Sentinel-2 Image: a new Sentinel-2 image that we will use our trained model on.
- Validation Habitat Information: the true classes associated with the Validation Sentinel-2 Image.
Our Data
Creating these datasets is itself a large task and requires resources outside of R. To download Sentinel-2 imagery you need to create a free account here: https://dataspace.copernicus.eu/. Once you have that, you can browse locations and dates and download a full tile. This comes as a folder called a SAFE file, which contains a lot of important information and data that we don't currently need. Likewise, the tif files for a whole Sentinel-2 tile are quite large and would take up space and time in this tutorial (plus they won't work nicely with GitHub, being over 100 MB). So I have taken two Sentinel-2 tiles from the European Atlantic coast, covering areas that include large cities, forests and farmland: Bordeaux and Bilbao. I have then taken the landcover classification from ESRI (found here: https://www.arcgis.com/home/item.html?id=cfcb7609de5f478eb7666240902d4d3d). This landcover classification is created with complex modelling techniques and has its own uncertainty/accuracy. We will ignore this uncertainty and pretend the landcover is the truth! (Just for simplicity's sake; this isn't good practice at all!)
I have downloaded the S2 imagery and landcover classifications, applied spatial cropping to minimise the file sizes, and selected the bands we are interested in for this tutorial. The two folders in this repository contain the 10 bands of Sentinel-2 and a landcover classification for Bordeaux (https://github.com/BedeFfinian/BedeFfinianRoweDavies/tree/main/StatisticsMLTutorials/S2Data/Bordeaux_S2) and Bilbao (https://github.com/BedeFfinian/BedeFfinianRoweDavies/tree/main/StatisticsMLTutorials/S2Data/Bilbao_S2). You can download these datasets by copying the URLs above into this website: https://download-directory.github.io/. Otherwise, you will have to download the S2 imagery yourself, crop it to an area of interest and do the same with the accompanying landcover file downloaded from the ArcGIS link above. Let's first have a look at the Bordeaux data.
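If you do crop your own tiles, a minimal terra sketch might look like the following (the file name and extent are hypothetical placeholders, not files from this tutorial):

# A minimal sketch (not part of this tutorial's data prep) for cropping a
# downloaded Sentinel-2 band to an area of interest with terra.
# The file name and extent here are hypothetical placeholders.
library(terra)

full_band <- rast("S2Data/MyTile_B02_10m.jp2")    # hypothetical full-tile band
aoi <- ext(582000, 620000, 4950000, 4990000)      # xmin, xmax, ymin, ymax in the tile's CRS
cropped_band <- crop(full_band, aoi)

writeRaster(cropped_band, "S2Data/MyTile_B02_10m_crop.tif", overwrite = TRUE)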
Read Data
We will plot the RGB alongside the landcover classification.
The classes are:
1: Water
2: Trees
4: Flooded vegetation
5: Crops
7: Built Area
8: Bare ground
9: Snow/Ice
10: Clouds
11: Rangeland
We will convert the numbers to their classification labels and turn them into a factor, then assign them a nice(ish) colour scheme. There is already a colour table (coltab) associated with the raster, which would plot the colours ESRI assigned to the tif, but it causes issues with factor alteration, so we remove it by setting it to NULL.
library(tidyverse)
library(terra)
library(sf)
library(patchwork)
library(tidyterra)

landcover_Bordeaux <- as.factor(rast("S2Data/Bordeaux_S2/T30TXQ_20230821T105629_Classification.tif"))

Bordeaux_FileNames_10m <- list.files("S2Data/Bordeaux_S2/", full.names = TRUE, pattern = "10m")
Bordeaux_FileNames_20m <- list.files("S2Data/Bordeaux_S2/", full.names = TRUE, pattern = "20m")

Bordeaux_10m <- rast(Bordeaux_FileNames_10m)
Bordeaux_20m <- rast(Bordeaux_FileNames_20m)

p1 <- ggplot() +
  geom_spatraster_rgb(data = Bordeaux_10m,
                      r = 3, g = 2, b = 1,
                      max_col_value = 5000,
                      interpolate = TRUE) +
  theme_classic()

landcover_Bordeaux_fct <- landcover_Bordeaux %>%
  mutate(`30T_20230101-20240101` = factor(case_when(`30T_20230101-20240101` == 1 ~ "Water",
                                                    `30T_20230101-20240101` == 2 ~ "Trees",
                                                    `30T_20230101-20240101` == 4 ~ "Flooded",
                                                    `30T_20230101-20240101` == 5 ~ "Crops",
                                                    `30T_20230101-20240101` == 7 ~ "Built",
                                                    `30T_20230101-20240101` == 8 ~ "Bare",
                                                    `30T_20230101-20240101` == 9 ~ "Snow",
                                                    `30T_20230101-20240101` == 10 ~ "Clouds",
                                                    `30T_20230101-20240101` == 11 ~ "Rangeland"),
                                          levels = c("Water", "Trees", "Flooded",
                                                     "Crops", "Built", "Bare",
                                                     "Snow", "Clouds", "Rangeland")))

coltab(landcover_Bordeaux_fct[[1]]) <- NULL

p2 <- ggplot() +
  geom_spatraster(data = landcover_Bordeaux_fct) +
  scale_fill_manual(values = c("#2494a2", "#389318", "#3d26ab", "#DAA520",
                               "#e4494f", "#bec3c5", "#f5f8fd", "#70543e")) +
  labs(fill = "Class") +
  theme_classic()

p1 / p2
Resample Bands
Before we can combine all bands into one multiband raster we need to resample the 20 m resolution bands to 10 m. We will use the same function we used previously. Let's also plot this to visualise it.
The intensity of spectral bands normally increases with increasing wavelength, and for Sentinel-2 the values range from roughly 0 to 10,000. I set a colour-scale limit of 5,000 to see the majority of the data; only a few pixels have a much higher value than that.
Bordeaux_20m_resampled <- resample(Bordeaux_20m, Bordeaux_10m, method = "average")

Bordeaux_Multispec <- c(Bordeaux_10m, Bordeaux_20m_resampled)
ggplot()+
geom_spatraster(data=Bordeaux_Multispec)+
labs(fill="")+
facet_wrap(~lyr,ncol = 2)+
scale_fill_whitebox_c("viridi",na.value = NA,limits=c(0,5000))+
theme_classic()
<SpatRaster> resampled to 500955 cells for plotting
Align Classification
Now we can add the classification as another layer in the raster, again using c() as above. The bands are not named in an easily readable way, so let's rename them to make them easier to work with. Note that the original name of the classification tif started with a number and contained a hyphen, which is generally bad naming style; we can use backticks `` to tell R that it is a column name. We also remove NA values to make life easier. These tidyverse functions applied to SpatRasters are made available by the tidyterra package, which we have used previously. They make code easier to read, but are slower than terra functions when dealing with very big SpatRasters.
Bordeaux_Multispec_cleaned <- c(Bordeaux_Multispec, landcover_Bordeaux) %>%
  rename(B02 = T30TXQ_20230821T105629_B02_10m,
         B03 = T30TXQ_20230821T105629_B03_10m,
         B04 = T30TXQ_20230821T105629_B04_10m,
         B08 = T30TXQ_20230821T105629_B08_10m,
         B05 = T30TXQ_20230821T105629_B05_20m,
         B06 = T30TXQ_20230821T105629_B06_20m,
         B07 = T30TXQ_20230821T105629_B07_20m,
         B11 = T30TXQ_20230821T105629_B11_20m,
         B12 = T30TXQ_20230821T105629_B12_20m,
         B8A = T30TXQ_20230821T105629_B8A_20m,
         Class = `30T_20230101-20240101`) %>%
  drop_na()
names(Bordeaux_Multispec_cleaned)
[1] "B02" "B03" "B04" "B08" "B05" "B06" "B07" "B11" "B12"
[10] "B8A" "Class"
ncell(Bordeaux_Multispec_cleaned)
[1] 21298846
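As an aside, since tidyterra verbs can be slow on very large SpatRasters, here is a sketch of the equivalent renaming in base terra (Bordeaux_Multispec_alt is my hypothetical name; it assumes the layer order shown above and skips the drop_na() step):

# Base-terra sketch of the same renaming (assumed layer order as above).
# names()<- renames layers in place without the tidyverse overhead.
Bordeaux_Multispec_alt <- c(Bordeaux_Multispec, landcover_Bordeaux)
names(Bordeaux_Multispec_alt) <- c("B02", "B03", "B04", "B08",
                                   "B05", "B06", "B07", "B11", "B12", "B8A",
                                   "Class")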
Now, this could be our training data, but ~21 million pixels is a lot to train a model with. We will therefore take a random subset to create our training data. spatSample takes a random sample of our raster and returns a dataframe; let's create a dataset of 10,000 rows. We will also change the Class column to a readable factor.
set.seed(2345)
Bordeaux_Multispec_cleaned_Sample <- spatSample(Bordeaux_Multispec_cleaned, size = 10000, xy = TRUE) %>%
  mutate(Class = as.factor(case_when(Class == 1 ~ "Water",
                                     Class == 2 ~ "Trees",
                                     Class == 4 ~ "Flooded",
                                     Class == 5 ~ "Crops",
                                     Class == 7 ~ "Built",
                                     Class == 8 ~ "Bare",
                                     Class == 9 ~ "Snow",
                                     Class == 10 ~ "Clouds",
                                     Class == 11 ~ "Rangeland")))
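It is worth a quick look (my addition, output not shown) at how balanced the sampled classes are; rare classes such as Flooded or Snow may have very few pixels, which partly explains warnings we will see during tuning later:

# Quick check of class balance in the random sample.
Bordeaux_Multispec_cleaned_Sample %>%
  count(Class, sort = TRUE)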
Model Building
This will follow exactly the same steps as the other ML tutorials: split the data, create a recipe, specify a model, set up a workflow, tune hyperparameters and then finalise the best model. (We could also ensemble many models when finalising, as in the stacks package; see the sketch below.)
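For reference, a rough sketch of what that stacking step could look like (not run here; note that stacks needs candidates tuned with control settings that save predictions, e.g. control_grid(save_pred = TRUE, save_workflow = TRUE)):

# Hypothetical ensembling sketch with the stacks package (not run here).
library(stacks)

model_stack <- stacks() %>%
  add_candidates(tune_res) %>%   # tuned candidates (need saved predictions/workflows)
  blend_predictions() %>%        # estimate stacking weights for the members
  fit_members()                  # refit the retained members on the training data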
Data Splitting
First we will split our model-building data from Bordeaux into training and testing sets, and create cross-validation folds of the training data.
library(tidymodels)
init <- initial_split(Bordeaux_Multispec_cleaned_Sample, strata = Class)

train <- training(init)

folds <- vfold_cv(train, strata = Class)

test <- testing(init)
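As a quick sanity check (my addition, output not shown), we can confirm that stratification has kept the class proportions similar between the training and testing sets:

# Class proportions should be roughly equal across the stratified sets.
bind_rows(train = count(train, Class),
          test  = count(test, Class),
          .id = "set") %>%
  group_by(set) %>%
  mutate(prop = n / sum(n))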
Recipe
We can now write a recipe with any transformations and updates of roles we want. We will keep this simple but we could add in all sorts of updates or transformations if we wanted to.
recipe_Class <- recipe(Class ~ B02 + B03 + B04 + B08 + B05 + B06 + B07 + B11 + B12 + B8A, data = train)

class_juiced <- prep(recipe_Class) %>%
  juice()
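As an illustration of the kind of optional steps we could add (these are my additions and are not used below), a richer recipe might normalise the bands or engineer an NDVI-style ratio:

# Hypothetical richer recipe (not used in this tutorial): add an NDVI-style
# band ratio as a feature and normalise all numeric predictors.
recipe_Class_extended <- recipe(Class ~ B02 + B03 + B04 + B08 + B05 + B06 + B07 + B11 + B12 + B8A,
                                data = train) %>%
  step_mutate(NDVI = (B08 - B04) / (B08 + B04)) %>%
  step_normalize(all_numeric_predictors())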
Specification
We will create a random forest model and tune mtry, trees and min_n over all the folds. We could use a different model at this stage and the rest of the code would stay the same, or we could even create a model stack as in the Comparing and Ensembling ML tutorial; an alternative specification is sketched below.
tune_spec <- rand_forest(
  mtry = tune(),
  trees = tune(),
  min_n = tune()
) %>%
  set_mode("classification") %>%
  set_engine("ranger")
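For example, a boosted-tree specification (a sketch; not fitted in this tutorial) could be swapped in for tune_spec without changing anything downstream:

# Hypothetical alternative model specification: gradient boosting via xgboost.
boost_spec <- boost_tree(
  trees = tune(),
  tree_depth = tune(),
  learn_rate = tune()
) %>%
  set_mode("classification") %>%
  set_engine("xgboost")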
Workflow
Once we have the recipe and the specification we can add them to a workflow and tune the hyperparameters of the model across the folded training data. We will use a different tuning method here: tune_race_anova() from the finetune package, which searches for the best hyperparameters while discarding candidate values that have performed poorly on earlier folds. We will also set up some simple parallel processing to speed things up, using the parallel and doParallel packages.
library(finetune)
library(parallel)
library(doParallel)
tune_wf <- workflow() %>%
  add_recipe(recipe_Class) %>%
  add_model(tune_spec)

set.seed(123)

cores <- detectCores(logical = FALSE)
cl <- makePSOCKcluster(cores)
registerDoParallel(cl)

tune_res <- tune_race_anova(
  tune_wf,
  resamples = folds
)
i Creating pre-processing data to finalize unknown parameter: mtry
tune_res
# Tuning results
# 10-fold cross-validation using stratification
# A tibble: 10 × 5
splits id .order .metrics .notes
<list> <chr> <int> <list> <list>
1 <split [6747/752]> Fold01 2 <tibble [20 × 7]> <tibble [0 × 3]>
2 <split [6747/752]> Fold03 3 <tibble [20 × 7]> <tibble [10 × 3]>
3 <split [6750/749]> Fold06 1 <tibble [20 × 7]> <tibble [0 × 3]>
4 <split [6751/748]> Fold10 4 <tibble [18 × 7]> <tibble [0 × 3]>
5 <split [6750/749]> Fold07 5 <tibble [16 × 7]> <tibble [0 × 3]>
6 <split [6751/748]> Fold09 6 <tibble [16 × 7]> <tibble [0 × 3]>
7 <split [6747/752]> Fold02 7 <tibble [14 × 7]> <tibble [0 × 3]>
8 <split [6749/750]> Fold04 8 <tibble [12 × 7]> <tibble [6 × 3]>
9 <split [6750/749]> Fold08 9 <tibble [12 × 7]> <tibble [0 × 3]>
10 <split [6749/750]> Fold05 10 <tibble [12 × 7]> <tibble [0 × 3]>
There were issues with some computations:
- Warning(s) x6: No observations were detected in `truth` for level(s): 'Bare', 'F...
- Warning(s) x10: No observations were detected in `truth` for level(s): 'Flooded' ...
Run `show_notes(.Last.tune.result)` for more information.
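One housekeeping step the original output doesn't show: once tuning has finished, it is good practice to release the parallel workers (my addition):

# Release the parallel workers now that tuning is done.
stopCluster(cl)
registerDoSEQ()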
Let's plot the results of each fold using the ROC AUC.
tune_res %>%
  collect_metrics() %>%
  filter(.metric == "roc_auc") %>%
  select(mean, min_n, trees, mtry) %>%
  pivot_longer(min_n:mtry,
               values_to = "value",
               names_to = "parameter") %>%
  ggplot(aes(value, mean, color = parameter)) +
  geom_point(show.legend = FALSE) +
  facet_wrap(~parameter, scales = "free_x") +
  labs(x = NULL, y = "AUC") +
  theme_classic()
So we see accuracy varying with our hyperparameters, but all combinations perform similarly, at around or above 0.8. Let's select the best and finalise our model.
best_auc <- select_best(tune_res, metric = "roc_auc")

final_rf <- finalize_model(
  tune_spec,
  best_auc
)

final_rf
Random Forest Model Specification (classification)
Main Arguments:
mtry = 8
trees = 1518
min_n = 37
Computational engine: ranger
So this is our model. Let's investigate the different features and their importance using the vip package. This is where we use the juiced data set.
library(vip)
set.seed(234)
final_rf %>%
  set_engine("ranger", importance = "permutation") %>%
  fit(Class ~ B02 + B03 + B04 + B08 + B05 + B06 + B07 + B11 + B12 + B8A,
      data = class_juiced) %>%
  vip(geom = "point") +
  theme_classic()
Validate Model on Testing Data
Let's prepare the workflow with our final model, apply the model to the testing data, and collect the accuracy metrics.
final_wf <- workflow() %>%
  add_recipe(recipe_Class) %>%
  add_model(final_rf)

final_res <- final_wf %>%
  last_fit(init)

final_model <- final_res %>%
  extract_workflow()

final_res %>%
  collect_metrics()
# A tibble: 2 × 4
.metric .estimator .estimate .config
<chr> <chr> <dbl> <chr>
1 accuracy multiclass 0.768 Preprocessor1_Model1
2 roc_auc hand_till 0.899 Preprocessor1_Model1
This is pretty good accuracy generally. We can also create a confusion matrix to see where the inaccuracies lie (i.e. false positives and false negatives).
final_res %>%
  unnest(cols = .predictions) %>%
  conf_mat(.pred_class, Class)
Truth
Prediction Bare Built Crops Flooded Rangeland Trees Water
Bare 2 0 0 0 0 0 0
Built 1 451 46 0 14 60 0
Crops 1 64 164 0 36 127 0
Flooded 0 0 0 0 0 1 0
Rangeland 3 36 71 0 63 28 0
Trees 0 44 29 0 12 963 0
Water 2 2 0 0 0 3 278
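If we want per-class detail beyond eyeballing the matrix (my addition, output not shown), summary() on a conf_mat object returns a whole table of metrics:

# summary() on a conf_mat gives sensitivity, specificity, balanced accuracy,
# etc., macro-averaged across the classes.
final_res %>%
  unnest(cols = .predictions) %>%
  conf_mat(.pred_class, Class) %>%
  summary()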
Validate with New Data
Read in New Data
Above we used the testing data to evaluate model accuracy. These data were randomly split from the training data, which in our case means they come from the same satellite image: same day and geographically similar locations. But we may want a model that can be applied at scale, across large geographical ranges and different images. To assess this we will validate our model on another image. We will take our Bilbao image and its labels (again, these labels were generated by another model, so they are a bit flawed, but let's assume they are correct), then apply our model and compare our predictions with the 'true' labels. I have used the full tif to predict on, but a sample could be taken as above; that sampling code is included below, commented out.
<-as.factor(rast("S2Data/Bilbao_S2/T30TWN_20230925T105801_Classification.tif"))
landcover_Bilbao
<-list.files("S2Data/Bilbao_S2/",full.names = T,pattern = "10m")
Bilbao_FileNames_10m
<-list.files("S2Data/Bilbao_S2/",full.names = T,pattern = "20m")
Bilbao_FileNames_20m
<-rast(Bilbao_FileNames_10m)
Bilbao_10m
<-rast(Bilbao_FileNames_20m)
Bilbao_20m
<-resample(Bilbao_20m,Bilbao_10m,method="average")
Bilbao_20m_resampled
<-c(Bilbao_10m,Bilbao_20m_resampled)
Bilbao_Multispec
<-c(Bilbao_Multispec,landcover_Bilbao) %>%
Bilbao_Multispec_cleanedrename(B02=T30TWN_20230925T105801_B02_10m,
B03=T30TWN_20230925T105801_B03_10m,
B04=T30TWN_20230925T105801_B04_10m,
B08=T30TWN_20230925T105801_B08_10m,
B05=T30TWN_20230925T105801_B05_20m,
B06=T30TWN_20230925T105801_B06_20m,
B07=T30TWN_20230925T105801_B07_20m,
B11=T30TWN_20230925T105801_B11_20m,
B12=T30TWN_20230925T105801_B12_20m,
B8A=T30TWN_20230925T105801_B8A_20m,
Class=`30T_20230101-20240101`) %>%
drop_na()
set.seed(2345)
Bilbao_Multispec_cleaned_Sample <- as.data.frame(Bilbao_Multispec_cleaned, xy = TRUE) %>%
  mutate(Class = as.factor(case_when(Class == 1 ~ "Water",
                                     Class == 2 ~ "Trees",
                                     Class == 4 ~ "Flooded",
                                     Class == 5 ~ "Crops",
                                     Class == 7 ~ "Built",
                                     Class == 8 ~ "Bare",
                                     Class == 9 ~ "Snow",
                                     Class == 10 ~ "Clouds",
                                     Class == 11 ~ "Rangeland")))
#Bilbao_Multispec_cleaned_Sample<-spatSample(Bilbao_Multispec_cleaned,size=1000000,xy=TRUE) %>%
# mutate(Class=as.factor(case_when(Class==1~"Water",
# Class==2~"Trees",
# Class==4~"Flooded",
# Class==5~"Crops",
# Class==7~"Built",
# Class==8~"Bare",
# Class==9~"Snow",
# Class==10~"Clouds",
# Class==11~"Rangeland")))
Predict on New Data
We will predict on these new data, creating a new column. To calculate the confusion matrix and accuracy we need the two columns to have the same factor levels, but not all levels appear in both columns, so we will set the levels of both factor columns to include all potential levels.
BilbaoPrediction <- Bilbao_Multispec_cleaned_Sample %>%
  mutate(predict(final_model, .),
         .pred_class = factor(.pred_class, levels = c("Water", "Trees", "Flooded",
                                                      "Crops", "Built", "Bare",
                                                      "Snow", "Clouds", "Rangeland")),
         Class = factor(Class, levels = c("Water", "Trees", "Flooded",
                                          "Crops", "Built", "Bare",
                                          "Snow", "Clouds", "Rangeland")))
BilbaoPrediction %>%
  conf_mat(truth = Class, estimate = .pred_class)
Truth
Prediction Water Trees Flooded Crops Built Bare Snow Clouds Rangeland
Water 9869 9551 0 21 9546 0 0 0 248
Trees 436 473788 0 2498 84310 3 0 0 36686
Flooded 0 0 0 0 0 0 0 0 0
Crops 39 31138 0 4970 45239 15 0 0 65353
Built 1283 16852 0 1994 265142 529 0 2 10960
Bare 0 0 0 0 442 0 0 0 0
Snow 0 0 0 0 0 0 0 0 0
Clouds 0 0 0 0 0 0 0 0 0
Rangeland 0 12592 0 1409 7418 1 0 0 39848
BilbaoPrediction %>%
  accuracy(truth = Class, estimate = .pred_class)
# A tibble: 1 × 3
.metric .estimator .estimate
<chr> <chr> <dbl>
1 accuracy multiclass 0.701
Okay, so the accuracy is 0.701: not amazing, but also not awful.
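Raw accuracy can flatter a model when classes are imbalanced (there are a lot of Trees pixels here), so we might also check a balanced metric (my addition, output not shown):

# Balanced accuracy averages per-class recall, so dominant classes like
# Trees cannot mask poor performance on rarer classes.
BilbaoPrediction %>%
  bal_accuracy(truth = Class, estimate = .pred_class)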
Plotting Prediction
So let's look at a map, comparing the new prediction with the 'true' class. We will clean the prediction dataset, convert the predictions to a raster, then plot it alongside the original 'true' classification. As before, we remove the colour table because it makes mutation difficult.
BilbaoPrediction_tif <- BilbaoPrediction %>%
  mutate(.pred_class = case_when(.pred_class == "Water" ~ 1,
                                 .pred_class == "Trees" ~ 2,
                                 .pred_class == "Flooded" ~ 4,
                                 .pred_class == "Crops" ~ 5,
                                 .pred_class == "Built" ~ 7,
                                 .pred_class == "Bare" ~ 8,
                                 .pred_class == "Snow" ~ 9,
                                 .pred_class == "Clouds" ~ 10,
                                 .pred_class == "Rangeland" ~ 11)) %>%
  dplyr::select(x, y, .pred_class) %>%
  rast(type = "xyz", crs = "EPSG:32630")
<-as.factor(rast("S2Data/Bilbao_S2/T30TWN_20230925T105801_Classification.tif"))
landcover_Bilbao_True
<-c(landcover_Bilbao_True,BilbaoPrediction_tif) %>%
Bilbao_Comparisonrename(Truth=`30T_20230101-20240101`,
Prediction=.pred_class)
coltab(Bilbao_Comparison[[1]])<-NULL
Bilbao_Comparison_fct <- Bilbao_Comparison %>%
  mutate(Prediction = factor(case_when(Prediction == 1 ~ "Water",
                                       Prediction == 2 ~ "Trees",
                                       Prediction == 4 ~ "Flooded",
                                       Prediction == 5 ~ "Crops",
                                       Prediction == 7 ~ "Built",
                                       Prediction == 8 ~ "Bare",
                                       Prediction == 9 ~ "Snow",
                                       Prediction == 10 ~ "Clouds",
                                       Prediction == 11 ~ "Rangeland"),
                             levels = c("Water", "Trees", "Flooded",
                                        "Crops", "Built", "Bare",
                                        "Snow", "Clouds", "Rangeland")),
         Truth = factor(case_when(Truth == 1 ~ "Water",
                                  Truth == 2 ~ "Trees",
                                  Truth == 4 ~ "Flooded",
                                  Truth == 5 ~ "Crops",
                                  Truth == 7 ~ "Built",
                                  Truth == 8 ~ "Bare",
                                  Truth == 9 ~ "Snow",
                                  Truth == 10 ~ "Clouds",
                                  Truth == 11 ~ "Rangeland"),
                        levels = c("Water", "Trees", "Flooded",
                                   "Crops", "Built", "Bare",
                                   "Snow", "Clouds", "Rangeland")))
ggplot()+
geom_spatraster(data=Bilbao_Comparison_fct,
maxcell = 10000000)+
scale_fill_manual(values=c("#2494a2",
"#389318",
"#DAA520",
"#e4494f",
"#bec3c5",
"#f5f8fd",
"#70543e"))+
facet_wrap(~lyr,ncol = 1)+
labs(fill="Class")+
theme_classic()
Okay, so the two images look broadly similar, but with some big differences, especially in the amount of Rangeland predicted. We can also check whether certain areas were better predicted than others.
Correct <- Bilbao_Comparison_fct %>%
  mutate(Correct = as.factor(case_when(Truth == Prediction ~ "Correct",
                                       TRUE ~ "Incorrect"))) %>%
  select(Correct)
ggplot()+
geom_spatraster(data=Correct,
maxcell = 10000000)+
scale_fill_manual(values=c("#cccc00",
"#b3002d"))+
labs(fill="Correct?")+
theme_classic()
Okay, so there are clear spatial patterns in where we have classified correctly and where we haven't. Depending on our goals, this may or may not be acceptable.
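To put an overall number on that map (an optional extra, output not shown), terra's freq() tabulates the Correct/Incorrect pixels directly from the raster:

# Tabulate pixel counts per category and convert to proportions.
freq(Correct) %>%
  mutate(prop = count / sum(count))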