PART I. Plotting Group Differences

Suppose we want to plot data which varies according to a factor/categorical variable. These data are typically analyzed using an ANOVA, which tells us whether the grouping factor has a significant effect on the variation among group means.

In the ANOVA/ANCOVA assignment, we performed a one-way ANOVA testing the effect of zinc pollution on diversity in streams. Let’s re-load the necessary packages and the zinc data (medley.csv from Canvas):

#load packages
library(ggplot2)
library(broom)
#read in data (be sure to add the path to the folder 
#where you stored the 'medley' csv file)
medley<-read.csv("medley.csv")
summary(medley)

##    STREAM    ZINC     DIVERSITY        ZNGROUP     
##  Arkan:7   BACK:8   Min.   :0.630   Min.   :1.000  
##  Blue :7   HIGH:9   1st Qu.:1.377   1st Qu.:2.000  
##  Chalk:5   LOW :8   Median :1.855   Median :3.000  
##  Eagle:4   MED :9   Mean   :1.694   Mean   :2.559  
##  Snake:5            3rd Qu.:2.058   3rd Qu.:3.750  
##  Splat:6            Max.   :2.830   Max.   :4.000

Now re-make that model:

zinc.aov<-aov(DIVERSITY~ZINC, data=medley)
summary(zinc.aov)

##             Df Sum Sq Mean Sq F value Pr(>F)  
## ZINC         3  2.567  0.8555   3.939 0.0176 *
## Residuals   30  6.516  0.2172                 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Okay, so zinc does cause significant variation among groups. Remember we used the Tukey test to compare the differences between each possible pairing of groups (I’m using the stats package for this, you could also go back and find the code we used on Friday to run the Tukey post-hoc test):

library(stats)
TukeyHSD(zinc.aov)

##   Tukey multiple comparisons of means
##     95% family-wise confidence level
## 
## Fit: aov(formula = DIVERSITY ~ ZINC, data = medley)
## 
## $ZINC
##                  diff        lwr        upr     p adj
## HIGH-BACK -0.51972222 -1.1355064 0.09606192 0.1218677
## LOW-BACK   0.23500000 -0.3986367 0.86863665 0.7457444
## MED-BACK  -0.07972222 -0.6955064 0.53606192 0.9847376
## LOW-HIGH   0.75472222  0.1389381 1.37050636 0.0116543
## MED-HIGH   0.44000000 -0.1573984 1.03739837 0.2095597
## MED-LOW   -0.31472222 -0.9305064 0.30106192 0.5153456

Bar charts in base graphics

We plotted a bar-chart in base graphics that looked pretty good:

#First, order the categorical variable into a logical order
medley$ZINCorder<-ordered(medley$ZINC, levels=c("BACK", "LOW", "MED", "HIGH"))
#calculate means
zmeans<-by(medley$DIVERSITY, medley$ZINCorder, mean)
#calculate standard errors
zses<-by(medley$DIVERSITY, medley$ZINCorder, function(x) sd(x)/sqrt(length(x)-1))
#Make the barplot
zxs<-barplot(zmeans, ylim=c(0, 3), ylab="Diversity")
#Add error bars
segments(x0=zxs, y0=zmeans-zses, y1=zmeans+zses)
#Add letters indicating differences
text(zxs, rep(2.5, 2), labels=c("a", "a", "ab", "b"))

(You can decide whether my contrast labels are correct; if not, go ahead and change them in the code).

Bar charts in ggplot

We can make the same chart using ‘ggplot’ code. First, I will make a new data frame with all of the variables we need for plotting the groups and the standard error bars (ggplot works with data frames rather than vectors or functions).

To make the data frame, we need to calculate the mean and standard errors and save them as objects, then put them together into a data fram. There is no shortcut that I know of for this!

First, group the values of DIVERSITY according to the ZINCorder variable, and calculate the means and standard deviations. I have done that here using the ‘aggregate’ function.

dat<-aggregate(medley$DIVERSITY, 
               by=list(ZINCorder = medley$ZINCorder), 
               FUN = function(x) c(mean=mean(x), sd=sd(x), n=length(x)))

Next, specify that this is a dataframe so we can use it later. You can do this with the do.call() function produces a dataframe with your mean, sd, and n. View the dataframe to see what you get.

dat<-do.call(data.frame, dat)

Now calculate the standard errors for each mean and add it to your dataframe. Note that I saved this variable directly into the dataframe by specifying ‘se’ as a column within the data frame ‘dat’: dat$se.

dat$se <- dat$x.sd/sqrt((dat$x.n)-1)

Now we can make a bar plot and add standard error as a graphics parameter in our plot. In this plot I am adding the error using geom_errorbar and the labels for groupwise differences using geom_text. There are many parameters specified within these arguments, you can adjust them as you would like.

bar.plot2<-ggplot(dat, aes(x=ZINCorder, y=x.mean, fill=ZINCorder))+
  geom_col()+
  geom_errorbar(aes(ymin=x.mean-se, ymax=x.mean+se), width=0.05) +
  geom_text(aes(label=c("a", "a", "ab", "b"), vjust = -3.5)) 

bar.plot2

#This is fine, but the labels are way at the top of the graph and sort of hard to read. If I want more space within the graph for my labels I can specify the xlim and ylim using the command coord_cartesian.
bar.plot2 + 
  coord_cartesian(xlim=NULL, ylim=c(0, 3.5))

Try playing around with the colors, adding labels, or doing whatever else you want with the graphics. Just remember to add extra parameters using a +. For example, here I want to change the axis labels and add a chart title:

library(viridis)
bar.plot2 + 
  xlab("Zinc Concentration") +
  ylab("Mean Diversity") +
  ggtitle("TITLE")

Notice this is a different format than I used for the linear regression tutorial. If you want to use that method, you can put all of your labels in a list and it produces the same thng:

bar.plot2 +
  labs(x="Zinc Concentration", 
       y="Mean Diversity", 
       title="TITLE")

This is fine, but to be honest, I don’t like bar charts very much, I think they cover up a lot of interesting and useful information.

Strip plots

I tend to use strip plots in the place of bar charts and box plots. This is really easy in ggplot and it lets you plot all of the individual observations and provide means and standard error bars with just a few lines of code.

First I will plot the points, axis labels, and add the parameter position_jitter() to keep the data points from all stacking on top of one another.

#Basic plot
zincdiv<-ggplot(medley, aes(x=ZINC, y=DIVERSITY)) 
#add points, axis labels, and title: 
zincdiv +
  ggtitle("Diversity by Zinc concentration") +
  geom_point(cex=1.0, pch=1.0, position = position_jitter(w = 0.1, h = 0.1)) +
  xlab("Zinc Concentration") +
  ylab("Mean Diversity")

But how do we add the means and errorbars? This is going to be so much work, right?

Nope.

This is why I love strip-plots in ggplot:

#first do all the stuff we did before, then add stat_summary():
zincdiv +
  ggtitle("Diversity by Zinc level") +
  geom_point(cex=1.0, pch=1.0, position = position_jitter(w = 0.1, h = 0.1)) +
  xlab("Zinc Concentration") +
  ylab("Mean Diversity") +
#stat_summary() adds the statistical information you want to your plot.
stat_summary(fun.data=mean_se, fun.args=list(mult=1),
  geom="pointrange", color="red", position=position_dodge(w=0.5))

That’s it. One function. I will save it as an object so I don’t have to keep typing out all of this code every time…

zincdiv <- zincdiv +
  ggtitle("Diversity by Zinc level") +
  geom_point(cex=1.0, pch=1.0, position = position_jitter(w = 0.1, h = 0.1)) +
  xlab("Zinc Concentration") +
  ylab("Mean Diversity")+
  stat_summary(fun.data=mean_se, fun.args=list(mult=1),
  geom="pointrange", color="red", position=position_dodge(w=0.5))

…then add text themes (you can look these up on your own if you want):

zincdiv + theme(plot.title = element_text(lineheight=.8, 
                                          family="serif", face="plain", vjust=1),
        axis.title.x = element_text(family="serif", vjust=-0.5),
        axis.title.y = element_text(family="serif", vjust=0.3))

#add summary statistics (in this case the error bars)

You can also make a strip plot using base graphics, though I don’t like the coding as much.

medley$ZINCorder<-ordered(medley$ZINC, levels=c("BACK", "LOW", "MED", "HIGH"))

plot(DIVERSITY ~ as.numeric(ZINCorder), data=medley,
     xaxt="n", ylim=c(0,3), xlim=c(0.5, 4.5),
     xlab="Zinc Concentration")

axis(side=1, at=1:4, labels=levels(medley$Zincordered))

means<-by(medley$DIVERSITY, medley$ZINCorder, mean)

segments(x0=(1:4)-0.1, y0=as.numeric(means), 
         x1=(1:4+0.1), lwd=3)

PART II. Plotting the results of ANCOVA

In the ANOVA/ANCOVA lab we also made an ANCOVA model (one continuous response variable ~ one categorical + one continuous explanatory variable), where the response variable was a function of both a categorical and a continuous variable.

If we want to plot several regression lines on our plot to show how the y-intercept and/or the slope of the response variable changes depending on the categorical grouping variable, we can specify this in our ggplot graphics. I will use the model we created from the fruit fly data.

First, read in the data.

partridge <- read.csv("partridge.csv")
summary(partridge)

##     PARTNERS        TYPE        TREATMEN     LONGEV         LLONGEV     
##  Min.   :0.0   Min.   :0.0   Min.   :1   Min.   :16.00   Min.   :1.204  
##  1st Qu.:1.0   1st Qu.:0.0   1st Qu.:2   1st Qu.:46.00   1st Qu.:1.663  
##  Median :1.0   Median :1.0   Median :3   Median :58.00   Median :1.763  
##  Mean   :3.6   Mean   :2.2   Mean   :3   Mean   :57.44   Mean   :1.736  
##  3rd Qu.:8.0   3rd Qu.:1.0   3rd Qu.:4   3rd Qu.:70.00   3rd Qu.:1.845  
##  Max.   :8.0   Max.   :9.0   Max.   :5   Max.   :97.00   Max.   :1.987  
##      THORAX     
##  Min.   :0.640  
##  1st Qu.:0.760  
##  Median :0.840  
##  Mean   :0.821  
##  3rd Qu.:0.880  
##  Max.   :0.940

#make TREATMENT a factor
partridge$ftreat<-factor(partridge$TREATMEN)

Next, recreate the model:

model<-lm(LLONGEV~ftreat + THORAX, data=partridge)
summary(model)

## 
## Call:
## lm(formula = LLONGEV ~ ftreat + THORAX, data = partridge)
## 
## Residuals:
##       Min        1Q    Median        3Q       Max 
## -0.226736 -0.058445 -0.003469  0.059961  0.170389 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  0.82743    0.08149  10.154  < 2e-16 ***
## ftreat2     -0.03648    0.02385  -1.530 0.128726    
## ftreat3     -0.01389    0.02374  -0.585 0.559705    
## ftreat4     -0.09030    0.02387  -3.783 0.000244 ***
## ftreat5     -0.21813    0.02366  -9.218 1.33e-15 ***
## THORAX       1.19385    0.09900  12.060  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.08364 on 119 degrees of freedom
## Multiple R-squared:  0.7055, Adjusted R-squared:  0.6932 
## F-statistic: 57.02 on 5 and 119 DF,  p-value: < 2.2e-16

You can go back to the ANCOVA assignment to see how we made the regression lines in base graphics. Here I will just use ggplot. First, as before I will make a data frame of the predictor variables and use this to calculate my values for the fitted lines (remember ggplot works with data frames, not vectors or matrices or functions).

Creating a data frame with information needed to plot the model:

Here I use the function expand.grid() which automatically creates a dataframe from the information you supply. If you don’t know the functions I use here (ie,seq(), expand.grid(), glance(), augment()), that’s okay, you can just copy it for now and read up on the details later. I will point out, though, that the reason the last argument in my THORAX=seq() function is length.out=5 is to make sure that the length of my THORAX column is the same as the length of my ftreat column.

#make the data frame:
newdat<-expand.grid(THORAX=seq(min(partridge$THORAX),
                    max(partridge$THORAX), length.out=5),
                    ftreat=factor(seq(c(1,2,3,4,5))))
             
#use the model to predict the values of LLONGEV (your response variable)
#based on the new dataframe:
predicted.long<-augment(model, newdata=newdat)

Plotting the ANCOVA model in ggplot

Now we’ll make the plot. First, we will use the original data frame to make our axes and plot our data points.

longev<-ggplot(partridge, aes(x=THORAX, y=LLONGEV)) +
  geom_point()
longev

Now, with one more argument, I will pull the fitted lines out of the predicted.long object (remember, the predicted.long object contains the output of our fitted model using the new data frame. Type summary(predicted.long) into your console to see what the data frame contains.

longev + 
   geom_jitter(width = 0.1) +
    geom_line(data=predicted.long, aes(x=THORAX, y=.fitted, group=ftreat,
                                       linetype = ftreat, color = ftreat))

Graphics challenge:

Try adding your own axis labels, coloring the points according to some other variable in the data frame, changing the shape and size of the points, etc. It’s pretty to play around with graphics in ggplot once you’ve plotted the data. I would say this is the main reason I like ggplot.

Thanks for trying out ggplot, I hope it was helpful for you!

Becca

GGPLOT for ANOVA and ANCOVA models

Becca

October 2017 (Updated October 2018)