Binomial GLMs

Introduction
R
GLMs
Stats
Binomial
Examples of Binomial GLMs
Published

October 2, 2023

Using Binomial GLMs

Data Loading Simple - Diamonds

Okay let’s grab the data from the yarrr dataset for diamonds to assess what drives the value of diamonds based on their weight, clarity and colour.

library(performance)
library(tidyverse)
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.5
✔ forcats   1.0.0     ✔ stringr   1.5.1
✔ ggplot2   3.5.1     ✔ tibble    3.2.1
✔ lubridate 1.9.3     ✔ tidyr     1.3.1
✔ purrr     1.0.2     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(patchwork)
#install.packages("yarrr")
library(yarrr)
Loading required package: jpeg
Loading required package: BayesFactor
Loading required package: coda
Loading required package: Matrix

Attaching package: 'Matrix'

The following objects are masked from 'package:tidyr':

    expand, pack, unpack

************
Welcome to BayesFactor 0.9.12-4.7. If you have questions, please contact Richard Morey (richarddmorey@gmail.com).

Type BFManual() to open the manual.
************
Loading required package: circlize
========================================
circlize version 0.4.16
CRAN page: https://cran.r-project.org/package=circlize
Github page: https://github.com/jokergoo/circlize
Documentation: https://jokergoo.github.io/circlize_book/book/

If you use it in published research, please cite:
Gu, Z. circlize implements and enhances circular visualization
  in R. Bioinformatics 2014.

This message can be suppressed by:
  suppressPackageStartupMessages(library(circlize))
========================================

yarrr v0.1.5. Citation info at citation('yarrr'). Package guide at yarrr.guide()
Email me at Nathaniel.D.Phillips.is@gmail.com

Attaching package: 'yarrr'

The following object is masked from 'package:ggplot2':

    diamonds
data(diamonds)
summary(diamonds)
     weight          clarity           color          value      
 Min.   : 5.360   Min.   :0.4400   Min.   :2.00   Min.   :174.0  
 1st Qu.: 8.598   1st Qu.:0.8900   1st Qu.:4.00   1st Qu.:184.0  
 Median : 9.805   Median :1.0000   Median :5.00   Median :189.6  
 Mean   : 9.901   Mean   :0.9996   Mean   :4.96   Mean   :189.4  
 3rd Qu.:11.155   3rd Qu.:1.1200   3rd Qu.:6.00   3rd Qu.:194.9  
 Max.   :14.270   Max.   :1.4400   Max.   :8.00   Max.   :206.4  

Step One - Scienctific Model to Stats Model

So we could model the actual diamond value but for this example we will split the value into high and low values (higher than 190 or lower than 190).

df_bin<-diamonds %>% 
  mutate(binary=if_else(value>190,1,0))

ggplot(df_bin,aes(x=as.factor(binary)))+
  geom_bar(fill="darkcyan",alpha=0.7)+
  labs(x="Binary Value Above 190 or not",y="Count")+
  theme_classic()

We imagine that the colour, weight and clarity will all influence the value of a diamond, thus we shall additively include each of these variables in the model.

Binary Value (above 190) ~ Clarity + Weight + Colour

Step Two - Resonse Variable Distribution

We created the response variable to be either 1 or 0. We therefore know that it is a binomial distribution but more specifically it is a Bernoulli distribution. For modelling in r the binomial distribution family is what is used.

Step Three - Organising Fixed Effects

Okay so lets look at our fixed effects, we will use density for continuous values (clarity and weight), while bars can show the counts for the integer Colour. Here colour is an ordinal integer, therefore we will model it as such. If we had Colour as a category (such as red, green, blue etc) we would want to convert it to be a factor for modelling as it would be more nominal.

p1<-ggplot(df_bin,aes(x=clarity))+
  geom_density(fill="darkcyan",alpha=0.7)+
  labs(x="Diamond Clarity",y="Density")+
  theme_classic()

p2<-ggplot(df_bin,aes(x=color))+
  geom_bar(fill="darkcyan",alpha=0.7)+
  labs(x="Diamond Colour",y="Count")+
  theme_classic()

p3<-ggplot(df_bin,aes(x=weight))+
  geom_density(fill="darkcyan",alpha=0.7)+
  labs(x="Diamond Weight",y="Denisty")+
  theme_classic()


p1+p2+p3

Let’s fit the glm using these our statistical model from above.

glm1<-glm(binary~clarity+color+weight,data=df_bin, family= "binomial")

Step Four - Assessing Model Functioning

ModelOutputs<-data.frame(Fitted=fitted(glm1),
                  Residuals=resid(glm1))

p3<-ggplot(ModelOutputs)+
    geom_histogram(aes(x=Residuals))+
    theme_classic()+
    labs(x="Residuals",y="Count")


p4<-ggplot(ModelOutputs) +
    stat_qq(aes(sample=Residuals))+
    stat_qq_line(aes(sample=Residuals))+
    theme_classic()+
    labs(y="Sample Quartiles",x="Theoretical Quartiles")


p3+p4
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

Okay, here we have a different looking plot than before as the response variable is not continuous or integers. Therefore, homogeneity of residuals is less important, but normality of residuals is. Here we see that the qqnorm plot looks fairly good but high values are moving away from normality. Generally the model seems to fit well so we will interpret it.

summary(glm1)

Call:
glm(formula = binary ~ clarity + color + weight, family = "binomial", 
    data = df_bin)

Coefficients:
            Estimate Std. Error z value Pr(>|z|)    
(Intercept) -18.8009     3.4634  -5.428 5.69e-08 ***
clarity       9.2910     1.9629   4.733 2.21e-06 ***
color        -0.3836     0.2481  -1.547    0.122    
weight        1.1251     0.1968   5.716 1.09e-08 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 207.52  on 149  degrees of freedom
Residual deviance: 106.67  on 146  degrees of freedom
AIC: 114.67

Number of Fisher Scoring iterations: 6

Lets write out the equation for the model, then we can use the values from the summary to create an equation for the model (the predict function will do this for us).

If we wanted to we could write out our model as this:

\[ Value \:Above \:190 \:(or \:Not) = Binomial(N, Pr)\]

\[Pr=logit(y)\]

\[ \begin{aligned} y = \beta_{1} Weight + \beta_{2} Colour + \beta_{3} Clarity+ Intercept \end{aligned} \]

Here our link function is slightly different again. This time it is a logit. This is effectively a way of converting a value to be a probability.

We can check this is the link function r is using here.

glm1$family$link
[1] "logit"

Step Five - Model Interpretation

Thankfully we don’t have to extract each \(\beta\) parameter from the summary table as R has useful functions that can do this for us! To do this we make simulated raw data with the same predictor variables in.

We then use the model to predict() the response variable based on those predictor variables.

Therefore, we make a data set with just Clarity, Color and Weight the same as our original data (be careful of spelling and capitalisation, R wants it identical).

The model then predicts the average Probability that a Diamond has a value over 190 based on its weight, color and clarity.

We can also tell the predict function to predict error (Standard Error here that we then convert to an approximation of the 95% confidence interval, we make sure these Upper and Lower confidence intervals don’t stray outside of 0 and 1).

Note that here I tell Predict that I want the fit to be returned on the response scale and not the link scale.

To make plotting easier I will set a Low, middle and high value for weight and color as clarity has the highest influence on diamond value.

NewData_1<-expand.grid(weight=c(min(df_bin$weight),mean(df_bin$weight),max(df_bin$weight)),
                       color=c(2,5,8),
                       clarity=seq(min(df_bin$clarity),max(df_bin$clarity),length.out=50)
                       ) %>% 
  mutate(Weight_Group=factor(case_when(weight==min(df_bin$weight)~"Low",
                                weight==mean(df_bin$weight)~"Mid",
                                weight==max(df_bin$weight)~"High"
                                ),levels=c("Low","Mid","High")),
         Color_Group=factor(case_when(color==2~"Low Colour",
                                color==5~"Mid Colour",
                                color==8~"High Colour"
                                ),levels=c("Low Colour","Mid Colour","High Colour")))

Pred<-predict(glm1,NewData_1,se.fit=TRUE,type="response")

NewData<-NewData_1 %>% 
  mutate(response=Pred$fit,
         se.fit=Pred$se.fit,
         Upr=response+(se.fit*1.96),
         Lwr=response-(se.fit*1.96),
         Upr=case_when(Upr>1~1,
                       TRUE~Upr),
         Lwr=case_when(Lwr<0~0,
                       TRUE~Lwr))

ggplot(NewData)+
  geom_ribbon(aes(x=clarity,
                    ymax=Upr,
                    ymin=Lwr,
                  fill=Weight_Group),
              alpha=0.6)+
  geom_line(aes(x=clarity,
                 y=response,
                  colour=Weight_Group))+
  facet_wrap(~Color_Group,ncol = 1)+
   scale_colour_manual(values=c("darkorange","grey70","darkcyan"),)+
   scale_fill_manual(values=c("darkorange","grey70","darkcyan"))+
  labs(x="Clarity",y="Response Variable (Probability of Diamond value over 190)",
       fill="Weight Group",colour="Weight Group")+
  theme_classic()

From this graph we can see that Higher Weight class leads to a higher probability of a diamond being over 190 in value, likewise higher clarity leads to higher probability of high value. Colour seems to have less effect, although this is hard to see from this graph.

Now lets plot this model output over the raw values to see how well the model has worked. We will create new columns to show the Colour and Weight Group from the raw data. I will set arbitary cut offs, which could be interquartiles or thirds of the data. Or if there were specific values of interest these could be plotted on their own.

Raw_df_bin<-df_bin %>% 
  mutate(Weight_Group=factor(case_when(weight<=7.5~"Low",
                                weight>7.5 & weight<12.5~"Mid",
                                weight>=12.5~"High"
                                ),levels=c("Low","Mid","High")),
         Color_Group=factor(case_when(color<=3~"Low Colour",
                                color>3 & color<7~"Mid Colour",
                                color>=7~"High Colour"
                                ),levels=c("Low Colour","Mid Colour","High Colour")))


ggplot(NewData)+
  geom_ribbon(aes(x=clarity,
                    ymax=Upr,
                    ymin=Lwr,
                  fill=Weight_Group),
              alpha=0.6)+
  geom_line(aes(x=clarity,
                 y=response,
                  colour=Weight_Group))+
  geom_point(data=Raw_df_bin,aes(x=clarity,y=binary,colour=Weight_Group))+
  facet_grid(Weight_Group~Color_Group)+
   scale_colour_manual(values=c("darkorange","grey70","darkcyan"),)+
   scale_fill_manual(values=c("darkorange","grey70","darkcyan"))+
  labs(x="Clarity",y="Response Variable (Probability of Diamond value over 190)",
       fill="Weight Group",colour="Weight Group")+
  theme_classic()

Some Caveats

Now this looks quite good, with more uncertainty where there are less values to influence the prediction. This is a very simplified model that is not taking into consideration many different factors, such as origin of diamond, the current diamond market etc. It is also a relatively small data set.

Data Loading Complex - Affairs

Lets create a more complex Binomial model. This data set is the frequency of affairs within the last year with their gender, age, yearsmarried, children, religiousness, education, occupation and self rating of marriage.

#install.packages("AER")
library("AER")
Loading required package: car
Loading required package: carData

Attaching package: 'car'
The following object is masked from 'package:dplyr':

    recode
The following object is masked from 'package:purrr':

    some
Loading required package: lmtest
Loading required package: zoo

Attaching package: 'zoo'
The following objects are masked from 'package:base':

    as.Date, as.Date.numeric
Loading required package: sandwich
Loading required package: survival
data(Affairs, package="AER")

summary(Affairs)
    affairs          gender         age         yearsmarried    children 
 Min.   : 0.000   female:315   Min.   :17.50   Min.   : 0.125   no :171  
 1st Qu.: 0.000   male  :286   1st Qu.:27.00   1st Qu.: 4.000   yes:430  
 Median : 0.000                Median :32.00   Median : 7.000            
 Mean   : 1.456                Mean   :32.49   Mean   : 8.178            
 3rd Qu.: 0.000                3rd Qu.:37.00   3rd Qu.:15.000            
 Max.   :12.000                Max.   :57.00   Max.   :15.000            
 religiousness     education       occupation        rating     
 Min.   :1.000   Min.   : 9.00   Min.   :1.000   Min.   :1.000  
 1st Qu.:2.000   1st Qu.:14.00   1st Qu.:3.000   1st Qu.:3.000  
 Median :3.000   Median :16.00   Median :5.000   Median :4.000  
 Mean   :3.116   Mean   :16.17   Mean   :4.195   Mean   :3.932  
 3rd Qu.:4.000   3rd Qu.:18.00   3rd Qu.:6.000   3rd Qu.:5.000  
 Max.   :5.000   Max.   :20.00   Max.   :7.000   Max.   :5.000  
Affairs_binom<-Affairs %>% 
  mutate(affairs_bin=if_else(affairs>0,1,0))

Our response variable will be a summary of the affairs column for occurrence or not of an affair in the last year where affairs was how often someone engaged in extramarital sexual intercourse during the past year? 0 = none, 1 = once, 2 = twice, 3 = 3 times, 7 = 4–10 times, 12 = monthly, 12 = weekly, 12 = daily.

Step One - Scienctific Model to Stats Model

We will assess the frequency of affairs and whether someones gender, religiousness affects this value, we will also include yearsmarried and whether the pattern of yearsmarried changes depending having children or not.

This is a bit more complex model with some interacting fixed effects and additional additive fixed effects and can be written as:

Frequency of Affairs ~ Children*Yearsmarried + Gender + Religiousness

Step Two - Resonse Variable Distribution

As before there are only integer values for affairs or none in the last year. This type of data could be presented in a different way if we wanted but we will use this scale for now.

Step Three - Organising Fixed Effects

Lets check all our fixed effects. For numeric values we can assess their distribution, categorical we can see the number of samples and if it is relatively even. From the way our data has been stored, they are all ordered categories but r has read some as numeric values. This is why we code them as factors first. Gender and children are already factors. We could use the yearsmarried as a categry but we will use a numeric here. I will also do a bit of house keeping such as capitalising the factors (personal preference!).

Affairs_binom<-Affairs_binom %>% 
  mutate(Yearsmarried_fct=factor(case_when(yearsmarried==0.125~"<1 Year",
                                          yearsmarried==0.417~"<1 Year",
                                          yearsmarried==0.75~"<1 Year",
                                          yearsmarried==1.5~"1-2 Years",
                                          yearsmarried==4~"3-5 Years",
                                          yearsmarried==7~"6-8 Years",
                                          yearsmarried==10~"9-11 Years",
                                          yearsmarried==15~"> 12 Years"
                                          ),
                                 levels=c("<1 Year",
                                          "1-2 Years","3-5 Years","6-8 Years",
                                          "9-11 Years","> 12 Years")),
         Religiousness_fct=factor(case_when(religiousness==1~"Anti",
                                          religiousness==2~"Not",
                                          religiousness==3~"Slightly",
                                          religiousness==4~"Somewhat",
                                          religiousness==5~"Very"
                                          ),
                                 levels=c("Anti","Not","Slightly","Somewhat","Very")),
         Gender=case_when(gender=="male"~"Male",
                          gender=="female"~"Female"),
         Children=case_when(children=="yes"~"Yes",
                            children=="no"~"No")) %>% 
  dplyr::rename(Yearsmarried=yearsmarried)
  

p1<-ggplot(Affairs_binom,aes(x=Children))+
  geom_bar(fill="darkcyan",alpha=0.7)+
  theme_classic()


p2<-ggplot(Affairs_binom,aes(x=Yearsmarried))+
  geom_density(fill="darkcyan",alpha=0.7)+
  theme_classic()

p3<-ggplot(Affairs_binom,aes(x=Gender))+
  geom_bar(fill="darkcyan",alpha=0.7)+
  theme_classic()
  
p4<-ggplot(Affairs_binom,aes(x=Religiousness_fct))+
  geom_bar(fill="darkcyan",alpha=0.7)+
  theme_classic()

(p1+p2)/(p3+p4)

These look okay, not always totally even but we shall see how the model performs.

glm2<-glm(affairs_bin~Children*Yearsmarried+Gender+Religiousness_fct,data=Affairs_binom, family= "binomial")

Step Four - Assessing Model Functioning

ModelOutputs<-data.frame(Fitted=fitted(glm2),
                  Residuals=resid(glm2))

p3<-ggplot(ModelOutputs)+
    geom_histogram(aes(x=Residuals))+
    theme_classic()+
    labs(x="Residuals",y="Count")


p4<-ggplot(ModelOutputs) +
    stat_qq(aes(sample=Residuals))+
    stat_qq_line(aes(sample=Residuals))+
    theme_classic()+
    labs(y="Sample Quartiles",x="Theoretical Quartiles")


p3+p4
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

As earlier, we don’t see the plot for homogeneity of residuals but we do see binned residuals, which show two clear peaks of residuals. There seems to be two clear elements/patterns within our data. Generally, we shouldn’t accept this model diagnostic. One trick to find out where these issues come from can be to plot the residuals against the fixed factors in the model.

ModelOutputs<-data.frame(Fitted=fitted(glm2),
                  Residuals=resid(glm2),
                  Children= Affairs_binom$Children,
                  Yearsmarried= Affairs_binom$Yearsmarried,
                  Gender= Affairs_binom$Gender,
                  Religiousness_fct= Affairs_binom$Religiousness_fct)

p3<-ggplot(ModelOutputs)+
    geom_point(aes(x=Residuals,y=Children))+
    theme_classic()+
    labs(x="Residuals",y="Children")

p4<-ggplot(ModelOutputs)+
    geom_point(aes(x=Residuals,y=Yearsmarried))+
    theme_classic()+
    labs(x="Residuals",y="Years Married")

p5<-ggplot(ModelOutputs)+
    geom_point(aes(x=Residuals,y=Gender))+
    theme_classic()+
    labs(x="Residuals",y="Gender")

p6<-ggplot(ModelOutputs)+
    geom_point(aes(x=Residuals,y=Religiousness_fct))+
    theme_classic()+
    labs(x="Residuals",y="Religiousness")

(p3+p4)/(p5+p6)

There seems to be no clear patterns we can see, so we must have some other issue somewhere, we will continue with caution now but that would not be advisable normally.

summary(glm2)

Call:
glm(formula = affairs_bin ~ Children * Yearsmarried + Gender + 
    Religiousness_fct, family = "binomial", data = Affairs_binom)

Coefficients:
                          Estimate Std. Error z value Pr(>|z|)    
(Intercept)               -1.61951    0.42272  -3.831 0.000128 ***
ChildrenYes                1.10918    0.40614   2.731 0.006314 ** 
Yearsmarried               0.16780    0.05222   3.213 0.001312 ** 
GenderMale                 0.19597    0.19837   0.988 0.323212    
Religiousness_fctNot      -0.73164    0.35634  -2.053 0.040052 *  
Religiousness_fctSlightly -0.42674    0.36140  -1.181 0.237685    
Religiousness_fctSomewhat -1.45482    0.36579  -3.977 6.97e-05 ***
Religiousness_fctVery     -1.37705    0.44453  -3.098 0.001950 ** 
ChildrenYes:Yearsmarried  -0.12986    0.05682  -2.286 0.022282 *  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 675.38  on 600  degrees of freedom
Residual deviance: 628.38  on 592  degrees of freedom
AIC: 646.38

Number of Fisher Scoring iterations: 4

Here we have, for the first time sing our penguins example, multiple levels in a categorical fixed effect. This highlights a behaviour that \(\beta\) values or Estimate values from GLMs show pairwise effect. So each row in this summary where it says yearsmarried_ft then one of the factor levels, it is comparing that factor level to the base level (<1 Year). Likewise, for religiousness_fct, where Anti is our base value and all factor levels are compared to this base level.

Lets write out the equation for the model, then we can use the values from the summary to create an equation for the model (the predict function will do this for us).

If we wanted to we could write out our model as this (this is a long one as there are so many factors):

\[ Affair \:Occurence \:in \:Last \:Year = Binomial(N, Pr)\]

\[Pr=logit(y)\]

\[ \begin{aligned} y = \beta_{1} Children : Years \:Married \\ + \beta_{2} Anti \:vs \:Not \:Regligious \\ + \beta_{3} Anti \:vs \:Somewhat \:Regligious \\ + \beta_{4} Anti \:vs \:Slightly \:Regligious \\ + \beta_{5} Anti \:vs \:Very \:Regligious \\ + \beta_{6} Female \:vs \:Male \:Gender \\ + \beta_{7} Years \:Married \\ + \beta_{8} Children \\ + Intercept \end{aligned} \]

This is a lot of \(\beta\) values but thankfully we don’t have to deal with them directly. As before our link function is the logit value.

glm2$family$link
[1] "logit"

When we plot the estimates into this equation, this should be similar to our raw data but not identical. Remember we are creating a model to Generalise the patterns of the raw data, not copy them!

Step Five - Model Interpretation

Thankfully we don’t have to extract each \(\beta\) parameter from the summary table as R has useful functions that can do this for us! To do this we make simulated raw data with the same predictor variables in.

We then use the model to predict() the response variable based on those predictor variables.

Therefore, we make a data set with Children, Years Married, Gender and Religiousness the same as our original data (be careful of spelling and capitalisation, R wants it identical).

The model then predicts the probability of affairs based on those Children, Years Married, Gender and Religiousness factors.

We can also tell the predict function to predict error (Standard Error here that we then convert to an approximation of the 95% confidence interval again we make sure they stay between 0 and 1).

Note that here I tell Predict that I want the fit to be returned on the response scale and not the link scale.

NewData_1<-expand.grid(Children=as.factor(c("Yes","No")),
                       Yearsmarried=unique(Affairs_binom$Yearsmarried),
                       Gender=as.factor(c("Female","Male")),
                       Religiousness_fct=as.factor(c("Anti","Not","Slightly","Somewhat","Very")))

Pred<-predict(glm2,NewData_1,se.fit=TRUE,type="response")

NewData<-NewData_1 %>% 
  mutate(response=Pred$fit,
         se.fit=Pred$se.fit,
         Upr=response+(se.fit*1.96),
         Lwr=response-(se.fit*1.96),
         Upr=case_when(Upr>1~1,
                       TRUE~Upr),
         Lwr=case_when(Lwr<0~0,
                       TRUE~Lwr))

ggplot(NewData)+
  geom_ribbon(aes(x=Yearsmarried,
                    ymax=Upr,
                    ymin=Lwr,
                  fill=Children),
              alpha=0.6)+
  geom_line(aes(x=Yearsmarried,
                 y=response,
                  colour=Children),
              alpha=0.6)+
  facet_grid(Religiousness_fct~Gender)+
   scale_colour_manual(values=c("darkorange","darkcyan"),)+
   scale_fill_manual(values=c("darkorange","darkcyan"))+
  labs(x="Years Married",y="Probability of an Affair\nin the Last Year",
       fill="Children",
       colour="Children")+
  theme_classic()

We wont plot this model output with the raw values to see how well the model has worked as it will look pretty horrible. This is one of the biggest issues with Binomial models, the raw data are all 0s and 1s so when plotted it is very hard to see any patterns.

We can see from these plots that the probability of having had an affair in the last year increases with how many years you have been married especially if you don’t have children. This pattern is less clear or obvious if someone is more religious and there doesn’t appear to be any difference in these trends between Males or Females.