6. Make It Pretty: Plotting 2-way Interactions with ggplot2

ggplot2, as I’ve already made clear, is one of my favourite packages for R. And since that original post about ggplot2 remains one of my most frequently visited, I thought I would proceed with starting a series of posts called “Make It Pretty”, all about sharing ways of visualizing data that I think are attractive/effective/comprehensive. So with this inaugural MIP post, I will be covering how to plot 2-way interactions using ggplot2.

2-way interactions can come in one of three general forms, and I will be providing code for plotting each. This will be a pretty lengthy post (lots of code/explanation), so if you’re only interested in learning how to plot a particular form, just click the the one below. Oh, and each uses an APA format theme that I’ve shared before, but just click here to quickly flip to the code if you need it.

  1. Two categorical variables (e.g., an experimental manipulation and participant gender) interact to predict some outcome
  2. One categorical variable (e.g., an experimental manipulation) interacts with a continuous variable (e.g., participant age) to predict some outcome
  3. Two continuous variables (e.g., participant age and participant income) interact to predict some outcome (Updated Mar. 22, 2016)

Disclaimer About My Plotting Philosophy

I think it is important for visualizations of data to plot both signal and noise. Often times, I encounter plots–mainly from regression analyses–where all I can see are the trends (e.g., lines of best fit), with no attempts made to visualize uncertainty or error in the those trends. As such, you will notice that with my scatterplots, in particular, I am adamant about plotting the raw data points along linear trends, as I think it is important to recognize that significant associations between variables often still leave much to be explained. This approach may not be for you–I’ve had some colleagues say it’s not a “clean” approach to visualization (especially for continuous x continuous interactions)–but it is how I will proceed, and wanted to at least provide some rationale for this practice.

2-Way Interactions with Two Categorical Variables

2-way interactions between categorical variables will most commonly be analyzed using a factorial ANOVA approach. Visualizing 2-way interactions from this kind of design actually takes more coding effort, because you will not be plotting the raw data. Instead, you will need to first summarize the data (means, standard deviations, n per group) via the psychpackage, then calculate a standard error for each group’s mean, and finally, create a bar graph.

Start by installing and calling the necessary packages (psych for summary statistics, and ggplot2 for plotting). For all the example figures, I will be using the mtcars dataset, which we will store in a data frame called dat.

install.packages('psych')
install.packages('ggplot2')
library(psych)
library(ggplot2)

dat=mtcars

For this particular example, we are going to plot the interaction between the transmission type (variable name: am; 0 = automatic, 1 = manual) and V/S* (variable name: vs), for vehicle fuel efficiency (variable name: mpg). A precautionary first step in creating a bar graph is ensuring that R knows to treat your independent variables as categorical factors and not continuous variables (this matters for plotting with ggplot2):

dat$am=as.factor(dat$am)
dat$vs=as.factor(dat$vs)

Next, we are going to use the psych package’s describeBy function, in order to calculate summary statistics for mpg, for each combination of am and vs levels, and we will store these summary statistics in a new data frame called dat2.

dat2 = describeBy(dat$mpg,list(dat$am,dat$vs), mat=TRUE,digits=2)

Have a look at the new data frame, dat2. describeBY will give very generic names (“group1” and “group2”) to your two independent variables. The next few lines of code will give more meaningful names to your variables (Transmission and VS, respectively, because we listed am before vs in the previous line of code), and then we will recode the levels of Transmission to be more descriptive (from 0 and 1 to Automatic and Manual).

names(dat2)[names(dat2) == 'group1'] = 'Transmission'
names(dat2)[names(dat2) == 'group2'] = 'VS'

levels(dat2$Transmission)[levels(dat2$Transmission)=='0'] = 'Automatic'
levels(dat2$Transmission)[levels(dat2$Transmission)=='1'] = 'Manual'

Lastly, we will create a new variable for the standard errors of each group (called se) by dividing the group standard deviations (sd) by the sqrt group sizes (n):

dat2$se = dat2$sd/sqrt(dat2$n)

Now we can finally dive into some plotting-related code. We are going to quickly define values for our 95% confidence interval bars (in limits), the spacing for the bars in our bar graph (in dodge), and a theme that encapsulate a number of formatting changes to make our figure compliant with APA style (in apatheme):

limits = aes(ymax = mean + (1.96*se), ymin=mean - (1.96*se))

dodge = position_dodge(width=0.9)

apatheme=theme_bw()+
  theme(panel.grid.major=element_blank(),
        panel.grid.minor=element_blank(),
        panel.border=element_blank(),
        axis.line=element_line(),
        text=element_text(family='Times'))

And finally, the main event: coding the plot itself (into an object called p). We will map levels of Transmission to the x-axis, the mean of MPG to the y-axis, and fill bars with different color depending on levels of VS. We will then add bars to the plot (via geom_bar), spacing them by the dodge parameter we specified earlier. Then add the 95% CIs (via geom_errorbar), also spaced by the same dodge parameter. Finally, take care of some minor formatting by applying our APA theme, relabeling the y-axis, and using a grey-scale color-scheme for the bars, appropriate for publication.

p=ggplot(dat2, aes(x = Transmission, y = mean, fill = VS))+
  geom_bar(stat='identity', position=dodge)+
  geom_errorbar(limits, position=dodge, width=0.25)+
  apatheme+
  ylab('Mean MPG')+
  scale_fill_grey()
p
A pretty bar plot!
A pretty bar plot!

If you like how it looks, use the ggsave command to save a high-res .png image file to your home directory.

ggsave('barplot.png', width=6, height=6, unit='in', dpi=300)

2-Way Interactions with One Categorical and One Continuous Variable

Visualizing an interaction between a categorical variable and a continuous variable is the easiest of the three types of 2-way interactions to code (usually done in regression models). In this example, we will visualize the interaction between the same transmission type variable as before (variable name: am) and the weight of vehicle (variable name: wt) for vehicle fuel efficiency (variable name: mpg).

We will start off, as before, by ensuring that R accurately treats transmission type as a categorical variable:

dat$am=as.factor(dat$am)

Then, we will actually need to run the analysis (note the significant interaction, using the summary function) underlying this visualization (output stored in model1), in order to create a column of predicted values (called predicted) in our data frame, for plotting lines of best fit:

model1=lm(mpg~am*wt, data=dat)
summary(model1)

dat$predicted=predict(model1)

Next, we’ll store the same APA-format template to use for this plot:

apatheme=theme_bw()+
  theme(panel.grid.major=element_blank(),
        panel.grid.minor=element_blank(),
        panel.border=element_blank(),
        axis.line=element_line(),
        text=element_text(family='Times'))

And then we can go right into creating our scatterplot (called p), mapping weight to the x-axis, mpg to the y-axis, and using different shapes for the data points based on transmission type. We’ll add data points (via geom_point), and manually specify shape numbers corresponding to solid and hollow circles (good for two-group plots; hollow for Automatic cars, solid for Manual cars), and creating a more informative legend with descriptive labels. Add a separate line of best fit each group (solid for Automatic, and dashed for Manual), relabel the x- and y-axes, and apply the APA template:

p=ggplot(dat, aes(x=wt, y=mpg, shape=am))+
geom_point()+
scale_shape_manual(values=c(1,16), name='Transmission', labels=c('Automatic','Manual'))+
geom_line(aes(x = wt, y = predicted, linetype=am)) +
scale_linetype_discrete(name='Transmission', labels=c('Automatic','Manual'))+
labs(x = 'Vehicle Weight', y = 'Vehicle MPG')+
apatheme
p
A pretty scatter plot!
A pretty scatter plot!

Save the plot, and Bob’s your Uncle:

ggsave('scatter1.png', width=6, height=6, unit='in', dpi=300)

One word of caution: be sure to use the same name and labels when specifying the legend details for the shape and line-type parameters, otherwise it will create two separate legends (in this case: … name=”Transmission”, labels=c(“Automatic”,”Manual”)).
2-Way Interactions with Two Continuous Variables (updated Mar. 22, 2016)

Visualizing an interaction between two continuous variables presents a bit of a problem. Specifically, because there is a continuous range of moderator values, it’s not as if there are discrete lines of best–there are no groups here. Instead, there are a number of simple slopes–presumably varying in magnitude and direction–for the entire range of the moderator variable. Often times, only a couple simple slopes are plotted, from one standard deviation above and below the mean of the moderator variable.  I am going to keep with this practice, though I am very open to suggestions for how to improve this plot (and either of the others too!).

We are going to be looking at the interaction between vehicle weight (variable name: wt) and number of gears (variable name: gear) for predicting vehicle fuel efficiency (variable name: mpg). A quick disclaimer: gear is a variable with only three different values: 3, 4, and 5. As such, it’s pretty tenuous to call it continuous, but I’m just using it for the sake of a plotting example.

The first thing we will need to do is run our regression model in order to calculate the simple slopes of wt on mpg at different levels of gear. I like to use Kris Preacher’s simple intercepts/slopes calculator, but do what works for you–but you need BOTH the simple slopes AND (less commonly reported by some programs, like the PROCESS Macro) the simple intercepts.

model2=lm(mpg~gear*wt, data=dat)

After running the model, I used the vcov command to get the parameter variances and covariances for Kris’s interaction utility. Then, I got the degrees of freedom for the model from the summary of the regression model, and used the psych package’s describe function to get the mean and standard deviation of gear, so that I could plug in the values for 1 SD below the mean, the mean itself, and 1 SD above the mean, as the conditional values of the moderator. I also scooped the min and max of wt as the x-variable points.

summary(model2)
vcov(model2)
describe(dat)
Info needed for Kris Preacher's 2-way interaction utility.
Info needed for Kris Preacher’s 2-way interaction utility.

Click “Calculate”, and scroll down to the “Simple Intercepts and Slopes at Conditional Values of Z” section of output in the first window of the interaction utility–these are the values we will use to plot our simple slope lines.

Thanks to a question/suggestion by James Conigrave, I’ve been able to update the code make the plot more pretty. Make a plot called p, and map wt to the x-axis, mpg to the y-axis, and our continuous moderator gear to size–this will produce smaller or bigger points in our scatter plot depending on the value of gear. Give your axes more informative labels using the labs command. We’re going to be creating a legend for different styles of regression lines, and data point size is a pretty intuitive visualization (you can just mention its interpretation in a table note, if necessary), so silence the default size-based legend using “guide = FALSE” in the scale_size_continuous command (otherwise the plot looks cluttered). Now we manually add our simple slopes using the geom_abline command, for which we need to–at the very least in this case–specify intercept and slope values. We will do this three times: first for the simple intercept and slope of wt on mpg at 1 SD below the mean of gear, then at the mean of gear, and then at 1 SD above the mean of gear. In order to specify a different looking line for each (so we can distinguish them), we will enter labels for our lines in the “linetype” sub-command; this simultaneously assigns different linestyles to each simple slope, while controlling how the lines are labeled in the resulting legend. Then, we use scale_linetype_manual in order to specify the particular styles of lines we want for each simple slope, the order we wish for each to appear in our legend, and a name for our legend. Lastly, add our APA format theme, and call the plot.


p=ggplot(dat, aes(x = wt, y = mpg, size = gear))+
geom_point()+
labs(x = 'Vehicle Weight', y = 'Vehicle MPG')+
scale_size_continuous(guide = FALSE)+
geom_abline(aes(intercept=33.965, slope=-4.3985, linetype='-1SD Gear'))+
geom_abline(aes(intercept=38.1208, slope=-5.854, linetype='Mean Gear'))+
geom_abline(aes(intercept=42.2767, slope=-7.3095, linetype='+1SD Gear'))+
scale_linetype_manual(values=c('dotted','dashed','solid'),
breaks=c('-1SD Gear','Mean Gear','+1SD Gear'),name='Simple\nSlope')+
apatheme
p

prettycontinuousinteraction

If you’re happy, save the plot.

ggsave('scatter2.png', width=6, height=6, unit='in', dpi=300)

Wrap Up and Extras

We’ve made some pretty plots, but you might have suggestions for how to improve them.

Three pretty plots!
Three pretty plots!

My buddy Matt Baldwin, for example, likes to put a vertical line(s) (based on the Johnson-Neyman technique; see Preacher, Curran, & Bauer, 2006, for a nice writeup) showing at what value(s) for a moderator that simple slopes diverge (see below). This can be easily done by using the geom_vline() command, and entering in the particular J-N value for the xintercept value (e.g., geom_vline(xintercept=4.45)).

From Baldwin & Biernat (2015)
From Baldwin & Biernat (2015); the simple slopes for the nostalgia condition and control condition significantly diverge beyond a certain level of past self inclusion (indicated by dashed line).

I’m sure there are a number of other ways to improve these plots. Feel free to post suggestions and/or code in the comments; I’ll amend the code in the core of the post based on really good suggestions, or for others, might provide example code and  plots down here in the extras. And if you have any requests for a Make It Pretty post, let me know–I’m thinking I might do forest plots next.

Footnotes

*Full disclosure: I have no idea what V/S refers to. I’m pretty car-stupid, and the online documentation for the mtcars dataset isn’t helpful for this particular variable. I simply selected it for the sake of example, as it produces a significant interaction with transmission type.

Advertisements

11 thoughts on “6. Make It Pretty: Plotting 2-way Interactions with ggplot2

    • Thank you–glad you like it! I ended up finding a solution with the help of this post on StacksOverflow (http://stackoverflow.com/questions/18060116/adding-legend-to-ggplot-when-lines-were-added-manually). Amended code is below (and Ill make the default in the post later, because it’s much prettier than the original (so thank you for indirectly improving the plot)!

      In a nutshell, for each abline you add, use the aesthetics subcommand to capture the intercept and slope values, but then for the linetype aestethics, specify text for the legend labels you’d like for each line. ggplot2 will automatically assign different linetypes to ablines with different text, so we add a scale_linetype_manual option, where we manually specify the linetypes we want for each simple slope (I think solid line for mean makes most sense), reorder the legend using the breaks option, and give the legend a sensible name. I also silenced the sizing legend because it already was ugly (sizing interpretation can just be conveyed in a note under your table), and it looks cluttered with both present–do this by adding the “guide=false) option to scale_size_continuous:

      p=ggplot(dat, aes(x = wt, y = mpg, size = gear))+
      geom_point()+
      labs(x = ‘Vehicle Weight’, y = ‘Vehicle MPG’)+
      scale_size_continuous(name=’# of Gears’,
      guide = FALSE)+
      geom_abline(aes(intercept=33.965,
      slope=-4.3985,
      linetype=’-1SD’))+
      geom_abline(aes(intercept=38.1208,
      slope=-5.854,
      linetype=’Mean’))+
      geom_abline(aes(intercept=42.2767,
      slope=-7.3095,
      linetype=’+1SD’))+
      scale_linetype_manual(values=c(‘dotted’,’dashed’,’solid’),
      breaks=c(‘-1SD’,’Mean’,’+1SD’),
      name=’Simple\nSlope’)+
      apatheme
      p

      Like

  1. Hi, I am really thankful for such a nice explanation about interaction mode. Could you please do a favor by describing the statistical significant effect of each graph you created? More precise, how can I understand, looking at the figure, whether the result is statistically significant particularly for the one continuous and one categorical interaction? Many thanks again!

    Like

    • Hi Winnipeg,

      Unfortunately, there is no simple, foolproof way to eye-ball the figure to determine what, if anything, is significant. To get this kind of information, you first need to test the interaction term (in regression) to see if it is significant, and then, if so, follow up by testing the simple-slopes for significance. In the present example, this amounts to evaluating whether the slope of weight is significant among automatic cars, and then separately, whether the slope of weight is significant among manual cars. You can do this using some available simple-slope calculators (such as the one, here, supplied by Kris Preacher: http://quantpsy.org/interact/index.htm), or you can use a re-centering method. With the re-centering method, you simply conduct the same regression model two different ways, changing the way the categorical variable is dummy coded each time. For example, if you coded automatic as 0 and manual as 1, the slope of weight in the full model would represent the simple slope of weight for automatic cars; changing the dummy coding so that automatic is 1 and manual is 0 and running the same model again, will result in the slope of weight representing the simple slope of weight for manual cars.

      Hope that helps! If you need a more thorough walkthrough of testing simple slopes, I recommend checking out Andrew Hayes’s newish book on testing mediation, moderation, and conditional process models in regression.

      Like

    • Hi Syuhada, I’ll consider it, but even if I do, it probably won’t happen for awhile (I’m in the midst of preparing to move). However, I’m a little apprehensive; 3-way interactions are UGLY, and oftentimes research (at least in my area–psychology) is too terribly underpowered to be even attempting to look at such nuanced effects, though I appreciate this might not be the case in other areas of research. I’ll think it over. If you wanted to give it a shot yourself, though, I would think about creating separate plots (using facets) showing the 2-way continuous interactions (like I’ve plotted above) for separate levels of your third continuous moderators (e.g., a 2-way interaction plot for 1 SD below, a 2-way interaction plot for mean, and a 2-way interactionplot for 1 SD above, for the third moderator)

      Like

      • Thanks for the quick reply and your suggestions! I am in psychology too, though specifically psycholinguistics. I’ll try out your suggestion. 🙂 Hope your move goes smoothly!

        Like

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s