ggplot2, as I’ve already made clear, is one of my favourite packages for
R. And since that original post about
ggplot2 remains one of my most frequently visited, I thought I would proceed with starting a series of posts called “Make It Pretty”, all about sharing ways of visualizing data that I think are attractive/effective/comprehensive. So with this inaugural MIP post, I will be covering how to plot 2-way interactions using
2-way interactions can come in one of three general forms, and I will be providing code for plotting each. This will be a pretty lengthy post (lots of code/explanation), so if you’re only interested in learning how to plot a particular form, just click the the one below. Oh, and each uses an APA format theme that I’ve shared before, but just click here to quickly flip to the code if you need it.
- Two categorical variables (e.g., an experimental manipulation and participant gender) interact to predict some outcome
- One categorical variable (e.g., an experimental manipulation) interacts with a continuous variable (e.g., participant age) to predict some outcome
- Two continuous variables (e.g., participant age and participant income) interact to predict some outcome (Updated Mar. 22, 2016)
Disclaimer About My Plotting Philosophy
I think it is important for visualizations of data to plot both signal and noise. Often times, I encounter plots–mainly from regression analyses–where all I can see are the trends (e.g., lines of best fit), with no attempts made to visualize uncertainty or error in the those trends. As such, you will notice that with my scatterplots, in particular, I am adamant about plotting the raw data points along linear trends, as I think it is important to recognize that significant associations between variables often still leave much to be explained. This approach may not be for you–I’ve had some colleagues say it’s not a “clean” approach to visualization (especially for continuous x continuous interactions)–but it is how I will proceed, and wanted to at least provide some rationale for this practice.
2-way interactions between categorical variables will most commonly be analyzed using a factorial ANOVA approach. Visualizing 2-way interactions from this kind of design actually takes more coding effort, because you will not be plotting the raw data. Instead, you will need to first summarize the data (means, standard deviations, n per group) via the
psychpackage, then calculate a standard error for each group’s mean, and finally, create a bar graph.
Start by installing and calling the necessary packages (
psych for summary statistics, and
ggplot2 for plotting). For all the example figures, I will be using the
mtcars dataset, which we will store in a data frame called
install.packages('psych') install.packages('ggplot2') library(psych) library(ggplot2) dat=mtcars
For this particular example, we are going to plot the interaction between the transmission type (variable name:
am; 0 = automatic, 1 = manual) and V/S* (variable name:
vs), for vehicle fuel efficiency (variable name:
mpg). A precautionary first step in creating a bar graph is ensuring that
R knows to treat your independent variables as categorical factors and not continuous variables (this matters for plotting with
Next, we are going to use the
describeBy function, in order to calculate summary statistics for
mpg, for each combination of
vs levels, and we will store these summary statistics in a new data frame called
dat2 = describeBy(dat$mpg,list(dat$am,dat$vs), mat=TRUE,digits=2)
Have a look at the new data frame,
describeBY will give very generic names (“group1” and “group2”) to your two independent variables. The next few lines of code will give more meaningful names to your variables (
VS, respectively, because we listed
vs in the previous line of code), and then we will recode the levels of
Transmission to be more descriptive (from 0 and 1 to Automatic and Manual).
names(dat2)[names(dat2) == 'group1'] = 'Transmission' names(dat2)[names(dat2) == 'group2'] = 'VS' levels(dat2$Transmission)[levels(dat2$Transmission)=='0'] = 'Automatic' levels(dat2$Transmission)[levels(dat2$Transmission)=='1'] = 'Manual'
Lastly, we will create a new variable for the standard errors of each group (called
se) by dividing the group standard deviations (
sd) by the sqrt group sizes (
dat2$se = dat2$sd/sqrt(dat2$n)
Now we can finally dive into some plotting-related code. We are going to quickly define values for our 95% confidence interval bars (in
limits), the spacing for the bars in our bar graph (in
dodge), and a theme that encapsulate a number of formatting changes to make our figure compliant with APA style (in
limits = aes(ymax = mean + (1.96*se), ymin=mean - (1.96*se)) dodge = position_dodge(width=0.9) apatheme=theme_bw()+ theme(panel.grid.major=element_blank(), panel.grid.minor=element_blank(), panel.border=element_blank(), axis.line=element_line(), text=element_text(family='Times'))
And finally, the main event: coding the plot itself (into an object called
p). We will map levels of Transmission to the x-axis, the mean of MPG to the y-axis, and fill bars with different color depending on levels of VS. We will then add bars to the plot (via
geom_bar), spacing them by the
dodge parameter we specified earlier. Then add the 95% CIs (via
geom_errorbar), also spaced by the same
dodge parameter. Finally, take care of some minor formatting by applying our APA theme, relabeling the y-axis, and using a grey-scale color-scheme for the bars, appropriate for publication.
p=ggplot(dat2, aes(x = Transmission, y = mean, fill = VS))+ geom_bar(stat='identity', position=dodge)+ geom_errorbar(limits, position=dodge, width=0.25)+ apatheme+ ylab('Mean MPG')+ scale_fill_grey() p
If you like how it looks, use the
ggsave command to save a high-res .png image file to your home directory.
ggsave('barplot.png', width=6, height=6, unit='in', dpi=300)
Visualizing an interaction between a categorical variable and a continuous variable is the easiest of the three types of 2-way interactions to code (usually done in regression models). In this example, we will visualize the interaction between the same transmission type variable as before (variable name:
am) and the weight of vehicle (variable name:
wt) for vehicle fuel efficiency (variable name:
We will start off, as before, by ensuring that R accurately treats transmission type as a categorical variable:
Then, we will actually need to run the analysis (note the significant interaction, using the
summary function) underlying this visualization (output stored in
model1), in order to create a column of predicted values (called
predicted) in our data frame, for plotting lines of best fit:
model1=lm(mpg~am*wt, data=dat) summary(model1) dat$predicted=predict(model1)
Next, we’ll store the same APA-format template to use for this plot:
apatheme=theme_bw()+ theme(panel.grid.major=element_blank(), panel.grid.minor=element_blank(), panel.border=element_blank(), axis.line=element_line(), text=element_text(family='Times'))
And then we can go right into creating our scatterplot (called
p), mapping weight to the x-axis, mpg to the y-axis, and using different shapes for the data points based on transmission type. We’ll add data points (via
geom_point), and manually specify shape numbers corresponding to solid and hollow circles (good for two-group plots; hollow for Automatic cars, solid for Manual cars), and creating a more informative legend with descriptive labels. Add a separate line of best fit each group (solid for Automatic, and dashed for Manual), relabel the x- and y-axes, and apply the APA template:
p=ggplot(dat, aes(x=wt, y=mpg, shape=am))+ geom_point()+ scale_shape_manual(values=c(1,16), name='Transmission', labels=c('Automatic','Manual'))+ geom_line(aes(x = wt, y = predicted, linetype=am)) + scale_linetype_discrete(name='Transmission', labels=c('Automatic','Manual'))+ labs(x = 'Vehicle Weight', y = 'Vehicle MPG')+ apatheme p
Save the plot, and Bob’s your Uncle:
ggsave('scatter1.png', width=6, height=6, unit='in', dpi=300)
One word of caution: be sure to use the same name and labels when specifying the legend details for the shape and line-type parameters, otherwise it will create two separate legends (in this case: … name=”Transmission”, labels=c(“Automatic”,”Manual”)).
2-Way Interactions with Two Continuous Variables (updated Mar. 22, 2016)
Visualizing an interaction between two continuous variables presents a bit of a problem. Specifically, because there is a continuous range of moderator values, it’s not as if there are discrete lines of best–there are no groups here. Instead, there are a number of simple slopes–presumably varying in magnitude and direction–for the entire range of the moderator variable. Often times, only a couple simple slopes are plotted, from one standard deviation above and below the mean of the moderator variable. I am going to keep with this practice, though I am very open to suggestions for how to improve this plot (and either of the others too!).
We are going to be looking at the interaction between vehicle weight (variable name:
wt) and number of gears (variable name:
gear) for predicting vehicle fuel efficiency (variable name:
mpg). A quick disclaimer:
gear is a variable with only three different values: 3, 4, and 5. As such, it’s pretty tenuous to call it continuous, but I’m just using it for the sake of a plotting example.
The first thing we will need to do is run our regression model in order to calculate the simple slopes of
mpg at different levels of
gear. I like to use Kris Preacher’s simple intercepts/slopes calculator, but do what works for you–but you need BOTH the simple slopes AND (less commonly reported by some programs, like the PROCESS Macro) the simple intercepts.
After running the model, I used the
vcov command to get the parameter variances and covariances for Kris’s interaction utility. Then, I got the degrees of freedom for the model from the
summary of the regression model, and used the
describe function to get the mean and standard deviation of
gear, so that I could plug in the values for 1 SD below the mean, the mean itself, and 1 SD above the mean, as the conditional values of the moderator. I also scooped the min and max of
wt as the x-variable points.
summary(model2) vcov(model2) describe(dat)
Click “Calculate”, and scroll down to the “Simple Intercepts and Slopes at Conditional Values of Z” section of output in the first window of the interaction utility–these are the values we will use to plot our simple slope lines.
Thanks to a question/suggestion by James Conigrave, I’ve been able to update the code make the plot more pretty. Make a plot called
p, and map
wt to the x-axis,
mpg to the y-axis, and our continuous moderator
gear to size–this will produce smaller or bigger points in our scatter plot depending on the value of
gear. Give your axes more informative labels using the
labs command. We’re going to be creating a legend for different styles of regression lines, and data point size is a pretty intuitive visualization (you can just mention its interpretation in a table note, if necessary), so silence the default size-based legend using “guide = FALSE” in the
scale_size_continuous command (otherwise the plot looks cluttered). Now we manually add our simple slopes using the
geom_abline command, for which we need to–at the very least in this case–specify intercept and slope values. We will do this three times: first for the simple intercept and slope of
mpg at 1 SD below the mean of
gear, then at the mean of
gear, and then at 1 SD above the mean of
gear. In order to specify a different looking line for each (so we can distinguish them), we will enter labels for our lines in the “linetype” sub-command; this simultaneously assigns different linestyles to each simple slope, while controlling how the lines are labeled in the resulting legend. Then, we use
scale_linetype_manual in order to specify the particular styles of lines we want for each simple slope, the order we wish for each to appear in our legend, and a name for our legend. Lastly, add our APA format theme, and call the plot.
p=ggplot(dat, aes(x = wt, y = mpg, size = gear))+ geom_point()+ labs(x = 'Vehicle Weight', y = 'Vehicle MPG')+ scale_size_continuous(guide = FALSE)+ geom_abline(aes(intercept=33.965, slope=-4.3985, linetype='-1SD Gear'))+ geom_abline(aes(intercept=38.1208, slope=-5.854, linetype='Mean Gear'))+ geom_abline(aes(intercept=42.2767, slope=-7.3095, linetype='+1SD Gear'))+ scale_linetype_manual(values=c('dotted','dashed','solid'), breaks=c('-1SD Gear','Mean Gear','+1SD Gear'),name='Simple\nSlope')+ apatheme p
If you’re happy, save the plot.
ggsave('scatter2.png', width=6, height=6, unit='in', dpi=300)
Wrap Up and Extras
We’ve made some pretty plots, but you might have suggestions for how to improve them.
My buddy Matt Baldwin, for example, likes to put a vertical line(s) (based on the Johnson-Neyman technique; see Preacher, Curran, & Bauer, 2006, for a nice writeup) showing at what value(s) for a moderator that simple slopes diverge (see below). This can be easily done by using the
geom_vline() command, and entering in the particular J-N value for the
xintercept value (e.g.,
I’m sure there are a number of other ways to improve these plots. Feel free to post suggestions and/or code in the comments; I’ll amend the code in the core of the post based on really good suggestions, or for others, might provide example code and plots down here in the extras. And if you have any requests for a Make It Pretty post, let me know–I’m thinking I might do forest plots next.
*Full disclosure: I have no idea what V/S refers to. I’m pretty car-stupid, and the online documentation for the
mtcars dataset isn’t helpful for this particular variable. I simply selected it for the sake of example, as it produces a significant interaction with transmission type.