A quick logistic regression example in ggplot

Using the Pinus example data (grottkop.csv) from the Day 4 assignment, we found a possible best-fit model: glm(formula = invasiveness ~ ln_sm + SLA + MGT + ln_sm:MGT + SLA:MGT, family = "binomial", data = pine).

Let’s recreate that model. First I will load the dataset and create the log-transformed seed mass variable, ln_sm. If you already worked through the assignment, you likely have this variable in your dataframe already.

pine <- read.csv("grottkop.csv", header=TRUE)
pine$ln_sm <- log(pine$seed_mass)

Use summary to check that your dataframe has the log-transformed variable.
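For example:

summary(pine)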

Now re-create the best-fit model:

invasive.mod <- glm(formula = invasiveness ~ ln_sm + SLA + MGT + ln_sm:MGT + SLA:MGT, family = "binomial", data = pine)

summary(invasive.mod)
## 
## Call:
## glm(formula = invasiveness ~ ln_sm + SLA + MGT + ln_sm:MGT + 
##     SLA:MGT, family = "binomial", data = pine)
## 
## Deviance Residuals: 
##      Min        1Q    Median        3Q       Max  
## -2.09543  -0.16221   0.03605   0.30675   1.43845  
## 
## Coefficients:
##              Estimate Std. Error z value Pr(>|z|)
## (Intercept) 13.923098  11.551066   1.205    0.228
## ln_sm       -2.485442   2.508049  -0.991    0.322
## SLA         -0.024663   0.080596  -0.306    0.760
## MGT         -0.691732   0.618000  -1.119    0.263
## ln_sm:MGT    0.055283   0.139932   0.395    0.693
## SLA:MGT      0.004219   0.004822   0.875    0.382
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 34.6173  on 24  degrees of freedom
## Residual deviance:  8.2474  on 19  degrees of freedom
## AIC: 20.247
## 
## Number of Fisher Scoring iterations: 7

With the model summary we can see the effect size and direction (the “Estimate” column) for each variable and interaction term we specified in the model. We also get the null and residual deviance, and the AIC value, which we can use to compare this model fit to other models fit to the same data.
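Because the model uses a logit link, those estimates are on the log-odds scale. As a quick side note (not needed for the plotting below), you can exponentiate the coefficients to read the effect sizes as odds ratios:

#exponentiate the coefficients to interpret them as odds ratios (logit link)
exp(coef(invasive.mod))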

Now, I want to plot a logistic regression onto my binomial data (invasiveness is a yes/no variable) using ggplot. This is fairly straightforward.

Remember that when we added a linear regression to a plot, we used the command stat_smooth(method = "lm", se = TRUE), which told R to add a line to our plot representing the linear regression between our two variables. If we were only using two variables in this model, the plotting would be equally easy. Just make your plot of the data points:

library(ggplot2)

basic.plot <- ggplot(pine, aes(x = ln_sm, y = invasiveness)) + 
  geom_point() +
  theme_bw()

basic.plot

And then add the logistic regression curve using stat_smooth:

basic.plot +
  stat_smooth(method = "glm", method.args = list(family="binomial"))

#you can get rid of the standard-error ribbon using se = FALSE:
basic.plot +
  stat_smooth(method = "glm", method.args = list(family="binomial"), se=FALSE)

But our actual model has several explanatory variables, which are not represented at all in this graph. To make a curve that reflects the full model, we will need to generate predictions of invasiveness from our best-fit model and then plot those predicted values.

First, generate your predictions (base the range of values in your simulated data on the actual spread of your real data):

#make the dataframe
newdat <- data.frame(ln_sm = seq(1, 6, length.out = 25),
                     MGT = seq(3, 40, length.out = 25),
                     SLA = rep(82, length.out = 25))
#predict your response variable using your best-fit model
newdat$response <- predict(invasive.mod, newdata = newdat, type = "response")
summary(newdat)
##      ln_sm           MGT             SLA        response      
##  Min.   :1.00   Min.   : 3.00   Min.   :82   Min.   :0.02723  
##  1st Qu.:2.25   1st Qu.:12.25   1st Qu.:82   1st Qu.:0.07846  
##  Median :3.50   Median :21.50   Median :82   Median :0.48182  
##  Mean   :3.50   Mean   :21.50   Mean   :82   Mean   :0.51380  
##  3rd Qu.:4.75   3rd Qu.:30.75   3rd Qu.:82   3rd Qu.:0.97331  
##  Max.   :6.00   Max.   :40.00   Max.   :82   Max.   :0.99981
#the response column is the one we want to use for our Y-axis now, as it contains all of the information on the effect of the model parameters on invasiveness. 

Now that we have our new data, we can use it to plot the predicted response from our actual model. This time we will use geom_line, which lets us supply the data we want to add (rather than having ggplot calculate a regression for us based on the x and y variables in the graph).

basic.plot +
  geom_line(aes(y = response), data = newdat) 

It should be fairly easy to make new curves for your response variable at different levels of your explanatory variables. Just build a new data frame in which one of your explanatory variables is set to a few fixed levels, and generate a new set of predictions.

I will now try to generate new lines based on the min, mean, and max values of the MGT and SLA variables, just as we do in the practical assignment.

NOTE: Sometimes I struggle with R:

I am going to type in the min, mean, and max of the MGT and SLA variables here because it’s not working when I try to pull them directly out of the data frame. You should be able to specify these values within your data.frame( ) command (one workaround is sketched just after the commented-out values below), but it’s just not working out for me today.

#min(pine$MGT) #3
#mean(pine$MGT) #16.6
#max(pine$MGT) #40
#min(pine$SLA) #40
#mean(pine$SLA) #82.784
#max(pine$SLA) #132.3
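If you would rather not hardcode those numbers, here is a rough sketch of one way that should work: compute the summaries into vectors first, and then hand those vectors to data.frame(). (MGT.levels, SLA.levels, and newdat.alt are just illustrative names; I have not tested this on the assignment data.)

#compute the levels first, then pass them to data.frame()
MGT.levels <- c(min(pine$MGT), mean(pine$MGT), max(pine$MGT))
SLA.levels <- c(min(pine$SLA), mean(pine$SLA), max(pine$SLA))
newdat.alt <- data.frame(ln_sm = seq(min(pine$ln_sm), max(pine$ln_sm), length.out = 60),
                         MGT = rep(MGT.levels, length.out = 60),
                         SLA = rep(SLA.levels, length.out = 60))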

All right, back to the business of making multiple curves:

newdat2 <- data.frame( 
            ln_sm = seq(min(pine$ln_sm), max(pine$ln_sm), length.out = 60),
            MGT = rep(c(3, 16.6, 40), length.out = 60), 
            SLA = rep(c(40, 82.784, 132.3), length.out = 60))

#use the model to predict the values of invasiveness (your response variable) 
#based on the new dataframe:
newdat2$predicted.invasiveness <- predict(invasive.mod, newdata = newdat2, type = "response")
summary(newdat2)
##      ln_sm            MGT             SLA         predicted.invasiveness
##  Min.   :1.281   Min.   : 3.00   Min.   : 40.00   Min.   :0.02249       
##  1st Qu.:2.572   1st Qu.: 3.00   1st Qu.: 40.00   1st Qu.:0.47933       
##  Median :3.863   Median :16.60   Median : 82.78   Median :0.97171       
##  Mean   :3.863   Mean   :19.87   Mean   : 85.03   Mean   :0.73755       
##  3rd Qu.:5.155   3rd Qu.:40.00   3rd Qu.:132.30   3rd Qu.:0.98737       
##  Max.   :6.446   Max.   :40.00   Max.   :132.30   Max.   :0.99977
#Now re-make the basic plot and add the predicted lines to it
basic.plot + 
 geom_line(data = newdat2, aes(y=predicted.invasiveness, group = as.factor(MGT),
                               linetype = as.factor(MGT)))

And that’s it for the basic ggplot-ing of logistic regression!

A few things to try:

  1. Change the grouping and linetype to as.factor(SLA) to see how that changes the curves.
  2. Try increasing and decreasing the length.out number in the seq() and rep() calls to see how it makes the curves smoother or rougher.
  3. Try changing the line color based on the grouping factor (either MGT or SLA).
  4. Use labs(x = "", y = "", title = "") to add labels to the graph (a quick sketch combining points 3 and 4 follows this list).
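For points 3 and 4, a minimal sketch might look like the following (the label text is just an example; swap in whatever suits your assignment):

#color the predicted curves by MGT and add axis labels and a title
basic.plot +
  geom_line(data = newdat2, aes(y = predicted.invasiveness, group = as.factor(MGT),
                                color = as.factor(MGT))) +
  labs(x = "ln(seed mass)", y = "Invasiveness", title = "Predicted invasiveness",
       color = "MGT")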