So in the common issues with GLMs tutorial we talked about zero inflated data and how this type of data will cause a issues, often diagnosed by residuals plots. Here we will have some examples of discovering zero inflated data then how to model more appropriately using zero inflated distributions often called ZIP models. But first a little bit of theory and thought.

Cause of the Zeros?

As we have mentioned throughout the tutorials, the cause of data allows us best to model it and discover or describe natural phenomena. This structure data allows us to effectively create models that capture the causal links between factors. Likewise, with high levels of zeros we want to know why there are lots of zeros? Are the Zeros True Zeros? or are they false zeros?

A false zero is often called non-detection, as with the experimental fish example, if we have a net with holes bigger than a certain size of fish we will get lots of zeros for that fish, even if they were in the net but just swam through before we counted the sample.

A true zero would be there were non of that fish there. This is almost impossible to know for sure but if we suspect the zeros are true zeros (we didn’t use a net, we used a bucket and nothing could escape if it was there) then we can model those zeros as well as the counts we do get in a very similar way.

We need to decide if we want to model the process of the zeros occuring or not. This will be case specific.

Zero Inflated Poisson Data

What better way to explore poisson models than with fish data. We will use the remotes package to install a package from github called stats4nr. Within this package there is a fishing data set of fish caught in counts with livebait, whether they came in a camper, number of persons in the group and number of children in the group.

── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr 1.1.4 ✔ readr 2.1.5
✔ forcats 1.0.0 ✔ stringr 1.5.1
✔ ggplot2 3.5.1 ✔ tibble 3.2.1
✔ lubridate 1.9.3 ✔ tidyr 1.3.1
✔ purrr 1.0.2
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag() masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

library(patchwork)data(fishing)summary(fishing)

nofish livebait camper persons
Min. :0.000 Min. :0.000 Min. :0.000 Min. :1.000
1st Qu.:0.000 1st Qu.:1.000 1st Qu.:0.000 1st Qu.:2.000
Median :0.000 Median :1.000 Median :1.000 Median :2.000
Mean :0.296 Mean :0.864 Mean :0.588 Mean :2.528
3rd Qu.:1.000 3rd Qu.:1.000 3rd Qu.:1.000 3rd Qu.:4.000
Max. :1.000 Max. :1.000 Max. :1.000 Max. :4.000
child count
Min. :0.000 Min. : 0.000
1st Qu.:0.000 1st Qu.: 0.000
Median :0.000 Median : 0.000
Mean :0.684 Mean : 3.296
3rd Qu.:1.000 3rd Qu.: 2.000
Max. :3.000 Max. :149.000

As expected this isn’t very good at all, with much of the residuals all around zero, maybe there are too many zeros? Lets look at the distribution of count data.

ggplot(fishing)+geom_bar(aes(count),fill="darkcyan")+labs(x="Count of Fish Caught", y="Count")+theme_classic()

Well there are a lot of zeros there! But remember poisson will break down when the variance isn’t proportional to the expected value, so we cal look at the variance and the mean and see how similar they are.

mean(fishing$count)

[1] 3.296

var(fishing$count)

[1] 135.3739

Definitely not. This shows extreme over dispersion. We have discussed using a Negative Binomial model for this situation. But as the name of tutorial shows we are looking at Zero-Inflated models. But lets try the negative binomial anyway..

library(MASS)

Attaching package: 'MASS'

The following object is masked from 'package:patchwork':
area

The following object is masked from 'package:dplyr':
select

It does go some of the way, but there are still issues around the zero values in the Homogeneity of variance plot! Okay lets get on task then, we will use the zeroinfl() function from the pscl package. We will need to extract the fitted and residuals to plot ourselves as the check_model() function doesn’t support this glm type yet.

#install.packages("pscl")library(pscl)

Classes and Methods for R originally developed in the
Political Science Computational Laboratory
Department of Political Science
Stanford University (2002-2015),
by and under the direction of Simon Jackman.
hurdle and zeroinfl functions by Achim Zeileis.

For a full example we will actually make up some data. Lets pretend we have count of fish inside and outside of a Marine Protected Area (MPA). We will also pretend our MPA is unbelieveablly good at providing protection to the species of fish we are looking at. But we have some unknown situation that means that sometimes we have high levels of zero records.

We will first make our factors, Time and MPA, then we will create poisson data where lambda is influenced by both the MPA and Time factors. Lets visualise each factor so we can see how we have created the data and what our data look like. Don’t forget we still need to add in the zero-inflation. We will create lambda from a linear equation using b0, b1, b2 and b3 as the intercept and the effects of Time, MPA, and interaction of Time and MPA.

library(patchwork)n <-5000MPA <-sample(c(0,1), size = n, replace =TRUE)Time <-c(1:100)b0<-log(2) # Interceptb1<-log(1.1) # Effect of Timeb2<-log(1.2) # Effect of MPAb3<-log(1.3) # Effect of interaction between Time and MPAlambda<-exp(b0 + b1 *log(Time) + b2 * (MPA==1) + b3 * (MPA==1) *log(Time))y_sim <-rpois(n = n, lambda = lambda) df<-data.frame(MPA=as.factor(MPA),Time=Time,Count=y_sim)p6<-ggplot(df)+geom_bar(aes(x=MPA),fill="darkcyan")+theme_classic()+labs(y="Count",x="MPA")p7<-ggplot(df)+geom_bar(aes(x=Time),fill="darkcyan")+theme_classic()+labs(y="Count",x="Time")p8<-ggplot(df)+geom_bar(aes(x=Count),fill="darkcyan")+theme_classic()+labs(y="Count",x="Fish Count")p6+p7+p8

Now lets create a high number of zeros, which are going to more likely to occur (with a probability of 0.1) if it is inside the MPA but not difference with time.