I’m back with a new Make It Pretty post. I’ve been quietly thrilled with how well my other two Make It Pretty posts have done. My post on visualizing various 2-way interactions (easily my most popular not-current-issue post) has been viewed over 1000 times, and more excitingly, is now the top hit if you google “2-way interaction ggplot2”. And though much less popular, I’m still happy with the ~300 views my meta-analysis visualizations post (on forest and funnel plots) has attracted–I even saw one of my funnel plots in the ‘wild’ shortly after! With this post, I’m going to be showing how you can use the
psych package in conjunction with
ggplot2 in order to create a prettier scree plot with parallel analysis–a very useful visualization when conducting exploratory factor analysis.
psych package is one of the first packages for
R that I learned how to use, and now resides in my Top 5 favourite packages to use. Compiled by William Revelle,
psych provides users with a number of simple yet powerful functions, mostly related to various ways of analyzing data from psychological scales (e.g., factor analysis, cluster analysis, item response theory, reliability analysis, etc.,). And since I am a bit…preoccupied?…obsessed?…let’s maybe go with passionate? about the finer points of factor analysis methodology, the
psych package immediately drew me in as a new
R user, because of its powerful factor-analysis-related functions.
Factor analysis is a statistical technique for uncovering latent (i.e., unobserved) continuous dimensions that parsimoniously explain patters of correlations between variables, commonly used to group items into (sub)scales for self-report measurement of a given construct. Factor analysis, however, is a very flexible form of analysis, in that there are dozens of options for deciding how many factors to retain, and how to estimate and rotate those factors. It’s therefore difficult to describe my parallel analysis code without getting a little bit into the details of some of these options. But if you’re a well-read EFA user, or you simply don’t care, feel free to skip the next section or two and get right to conducting parallel analysis using
- EFA Estimation Options and their Relevance for Parallel Analysis
- What Is Parallel Analysis?
- Parallel Analysis Using the
- Making a Pretty Scree Plot with Parallel Analysis Using
Parallel analysis is one method for helping to determine how many factors to retain, but it, like your EFA itself, is affected by your choice of estimation method. Specifically, your EFA and parallel analysis are going to be impacted by whether you adopt a common factor (CF) or principal components (PC) model. I won’t bore you blog readers with the full version of my common factor vs. principle components rant, but here’s the short-form:
- EFA users often consider the CF and PC models to be identical, in practice, if not in principle. But the two models are definitely not the same (theoretically), and often are do not produce the same results (empirically).
- Empirically, simulation studies have shown that PC models often produce exaggerated factor (err, component) loading estimates, and attenuated component correlations (when more than one component is estimated), compared to estimates yielded from CF models.
- Theoretically, the PC model assumes that all observed variance in your items is “true construct variance” that is meaningful to retain. The CF model, by comparison, partitions observed variance into variance shared between other items, and variance that is unique to each item (thought to represent unique aspects of that item + error). So in effect, if you believe that your measures contain some amount of error, it’s a little weird to adopt a model that assumes that they are error-free.
These are just a couple “big picture” reasons why many EFA experts advocate for CF models (there are a few to choose from) over the PC model. If you want more detail on this aspect of factor analysis, you can check out the pre-print of a recent EFA review article I coauthored for sexuality researchers, and I also cannot recommend Fabrigar and Wegener’s short-but-sweet EFA book (and the 1999 paper they wrote with their colleagues) enough.
Anyways, what does this have to do with parallel analysis? Well, parallel analysis is visualized using a scree plot, which highlights the eigenvalues (a metric of variance explained) for each component/factor that you could possibly extract–from 1 all the way to the maximum number (i.e., however many items you have). But your eigenvalue estimates will be HUGELY different, depending on whether you use a CF model or a PC model. PC model eigenvalues are larger because they assume all variance is meaningful construct variance; CF model eigenvalues are smaller because they assume that some variance is error/unique variance, and therefore there is a smaller amount of true construct variance to explain.
What’s the takeaway from all of this? I conduct my parallel analyses, using a CF model. And because it’s my blog, well…
So here’s how parallel analysis helps you to decide how may factors to retain. You start with creating a scree plot, in which you plot the eigenvalues (variance explained), in descending order, of each factor you could conceivably extract. The scree plot is so-named because of the scree test proposed by Cattell (1966), who suggested you could look at where the “scree”–a term for rubble at the base of a hill–begins, in order to determine how many factors are necessary to retain.
Sometimes this scree test is a legitimately helpful tool for making factor retention decision, but this is most often in cases when scree plots are painfully obvious as to where there is a huge drop and levelling-off of eigenvalues. And as we shall soon see, there are often cases where the beginning of the “scree” is not so easy to identify in a scree plot. In those cases, the scree test is highly subjective at best, and simply uninformative at worst.
Parallel analysis (introduced by Horn, 1965) is a technique designed to help take some of the subjectivity out of interpreting the scree plot. It is a simulation-based method, and the logic is pretty straightforward:
- Simulate a random data set (or sets) with the same number of items that have the same possible range of observed values. In essence, this data is junk/garbage/noise, in the same number of variables and on the same scale as your EFA.
- Extract eigenvalues from the junk data set(s)
- Plot the eigenvalues from the junk data alongside your observed eigenvalues on the same scree plot
- Retain, at maximum, the number of factors with observed eigenvalues that are larger than those extracted from corresponding factors based on junk/garbage/noise data.
There are some variations/nuance with parallel analysis, but that’s pretty much it, in a nutshell: keep the number of “real” factors that outperform “garbage” factors–seems reasonable enough. And indeed, simulation studies have shown that it does a reasonably decent job at guiding factor retention decisions, especially when used to identify the maximum number of factors to consider (and you can choose among that solution, and the simpler ones, using other criteria).
Conducting parallel analysis with the
psych package is actually super simple. First, let’s install and call the two packages we’ll be using:
install.packages('psych') install.packages('ggplot2') library(psych) library(ggplot2)
Now, we need some data to work with–a set of variables that we hypothetically might be interested in factor analyzing. Run the code below mindlessly if you want to generate the data used in this example, but truthfully, the code presented later to run parallel analysis isn’t too difficult to amend for your own data–just make sure you input the name of your own data frame (instead of efa.dat, which is what I’ve called mine), and ensure that your data frame consists ONLY of the variables to-be factor-analyzed.
dat = msq keys = make.keys(msq[1:75],list( EA = c('active', 'energetic', 'vigorous', 'wakeful', 'wide.awake', 'full.of.pep', 'lively', '-sleepy', '-tired', '-drowsy'), TA =c('intense', 'jittery', 'fearful', 'tense', 'clutched.up', '-quiet', '-still', '-placid', '-calm', '-at.rest') , PA =c('active', 'excited', 'strong', 'inspired', 'determined', 'attentive', 'interested', 'enthusiastic', 'proud', 'alert'), NAf =c('jittery', 'nervous', 'scared', 'afraid', 'guilty', 'ashamed', 'distressed', 'upset', 'hostile', 'irritable' ), HAct = c('active', 'aroused', 'surprised', 'intense', 'astonished'), aPA = c('elated', 'excited', 'enthusiastic', 'lively'), uNA = c('calm', 'serene', 'relaxed', 'at.rest', 'content', 'at.ease'), pa = c('happy', 'warmhearted', 'pleased', 'cheerful', 'delighted' ), LAct = c('quiet', 'inactive', 'idle', 'still', 'tranquil'), uPA =c( 'dull', 'bored', 'sluggish', 'tired', 'drowsy'), naf = c( 'sad', 'blue', 'unhappy', 'gloomy', 'grouchy'), aNA = c('jittery', 'anxious', 'nervous', 'fearful', 'distressed'), Fear = c('afraid' , 'scared' , 'nervous' , 'jittery' ) , Hostility = c('angry' , 'hostile', 'irritable', 'scornful' ), Guilt = c('guilty' , 'ashamed' ), Sadness = c( 'sad' , 'blue' , 'lonely', 'alone' ), Joviality =c('happy','delighted', 'cheerful', 'excited', 'enthusiastic', 'lively', 'energetic'), Self.Assurance=c( 'proud','strong' , 'confident' , '-fearful' ), Attentiveness = c('alert' , 'determined' , 'attentive' ) #acquiscence = c('sleepy' , 'wakeful' , 'relaxed','tense') )) msq.scores = scoreItems(keys,msq[1:75]) efa.data = msq.scores$scores
What we’ve generated is a humongous set of 19 personality variables (see the
psych documentation of the msq dataset for more details) that we might be interested in factor analyzing.
Next, we use the package’s
fa.parallel function to run parallel analysis. Then, in order, we: (1) specify our data frame; (2) indicate that we want to estimate eigenvalues using maximum likelihood (a form of CF model); (3) indicate that we only want the CF eigenvalues (and not the component eigenvalues) reported; (4) indicate that we 50 random datasets simulated for our parallel analysis; (5) indicate that we want to use squared multiple correlations (SMCs) as the starting communality estimates for our factor analysis (a common default with CF methods); and finally (6) specify that we would like the 95th quantile of simulated eigenvalues to be used as the point of comparison when deciding whether our observed eigenvalues are explaining more variance.
set.seed(123) parallel = fa.parallel(efa.data, fm = 'ml', fa = 'fa', n.iter = 50, SMC = TRUE, quant = .95)
Execute that command, and
R will store the detailed output (which we’ll use for pretty plotting next) in the object that we called “parallel”. Note, since parallel analysis is a simulation technique, you may get slightly different results from run-to-run. Be sure to re-run it a couple times to make sure it provides you a relatively stable recommendation. I’ve set the random seed generator in the code above to ensure this code will produce the same results each time.
R will also output the number of maximum number of factors you are encouraged to retain, as well as the following plot, showing the results of your parallel analysis–a helpful visualization to be sure, but it could use some gussying up.
In order to make a pretty scree plot with parallel analysis using
ggplot2, we’re going to have to extract and save some information from our parallel object in our to render it in a more useable form for plotting. Note, as long as you named your saved
fa.parallel object “parallel”, as in the example above, you should be able to execute all the code in this section mindlessly to produce a pretty plot, even if you’ve used your own data with different numbers of variables during parallel analysis.
Okay, first things first: we’re going to create a data frame called “obs” in which we are going to store our estimated observed eigenvalues (
fa.values). We’ll then add a column to this data frame called “type” and give every row a value of “Observed Data” (to distinguish these eigenvalues from our simulated eigenvalues). Finally, we add a column that automatically populates with the number of factors that could be extracted, codes it as a numeric value, and then we assign column names for the variables in our new “obs” data frame:
#Create data frame &amp;amp;amp;amp;quot;obs&amp;amp;amp;amp;quot; from observed eigenvalue data obs = data.frame(parallel$fa.values) obs$type = c('Observed Data') obs$num = c(row.names(obs)) obs$num = as.numeric(obs$num) colnames(obs) = c('eigenvalue', 'type', 'num')
Now, we want to create a similar data frame for our simulated eigenvalues, but unfortunately, the values saved in the parallel object (
fa.sim) are the 50th quantile–not the 95th like we want. So we first need to calculate these ourselves. The psych documentation provides some code to help us to this, but making matters more complicated, it spits out 95th quantiles for four groups of eigenvalues: (1) observed PC eigenvalues; (2) observed CF eigenvalues; (3) simulated PC eigenvalues; and (4) simulated CF eigenvalues. And we only want to use the simulated CF eigenvalues. So the first chunk of code below calculates the 95th quantiles of these four groups of eigenvalues, and then extracts the ones we care about (simulated CF eigenvalues) and saves them in an object called “percentile1”. Then we repeat the process above, but this time create a data frame (called “sim”) of simulated eigenvalues, and finally, merge the two data frames together into a data frame called “eigendat”:
#Calculate quantiles for eigenvalues, but only store those from simulated CF model in percentile1 percentile = apply(parallel$values,2,function(x) quantile(x,.95)) min = as.numeric(nrow(obs)) min = (4*min) - (min-1) max = as.numeric(nrow(obs)) max = 4*max percentile1 = percentile[min:max] #Create data frame called &amp;amp;amp;amp;quot;sim&amp;amp;amp;amp;quot; with simulated eigenvalue data sim = data.frame(percentile1) sim$type = c('Simulated Data (95th %ile)') sim$num = c(row.names(obs)) sim$num = as.numeric(sim$num) colnames(sim) = c('eigenvalue', 'type', 'num') #Merge the two data frames (obs and sim) together into data frame called eigendat eigendat = rbind(obs,sim)
Now we just need to save an APA-format-friendly theme for
ggplot2. Note, if you’ve used my apatheme before, I’ve had to amend it a bit to get it to cooperate, following a major-ish update to the
apatheme=theme_bw()+ theme(panel.grid.major = element_blank(), panel.grid.minor = element_blank(), panel.background = element_blank(), panel.border = element_blank(), text=element_text(family='Arial'), legend.title=element_blank(), legend.position=c(.7,.8), axis.line.x = element_line(color='black'), axis.line.y = element_line(color='black'))
Note, the legend placement (
legend.position) for most scree plots of your own data should be fine. This example, however, has an oddly high number of recommended factors, so our legend partially obscures the vertical line showing the recommended number of factors. If your data leaves you wanting to shift the legend around a bit, play around with the first value in legend.position until your happy; decreasing it will move the legend left, whereas increasing it will move the legend to the right. We can now finally create our pretty scree plot with parallel analysis:
#Use data from eigendat. Map number of factors to x-axis, eigenvalue to y-axis, and give different data point shapes depending on whether eigenvalue is observed or simulated p = ggplot(eigendat, aes(x=num, y=eigenvalue, shape=type)) + #Add lines connecting data points geom_line()+ #Add the data points. geom_point(size=4)+ #Label the y-axis 'Eigenvalue' scale_y_continuous(name='Eigenvalue')+ #Label the x-axis 'Factor Number', and ensure that it ranges from 1-max # of factors, increasing by one with each 'tick' mark. scale_x_continuous(name='Factor Number', breaks=min(eigendat$num):max(eigendat$num))+ #Manually specify the different shapes to use for actual and simulated data, in this case, white and black circles. scale_shape_manual(values=c(16,1)) + #Add vertical line indicating parallel analysis suggested max # of factors to retain geom_vline(xintercept = parallel$nfact, linetype = 'dashed')+ #Apply our apa-formatting theme apatheme #Call the plot. Looks pretty! p
Save a high-res version of your plot, and you’ve got yourself a pretty scree plot with parallel analysis:
ggsave('parallel.png', width=6, height=6, unit='in', dpi=300)
Unlike my previous Make It Pretty posts, I actually have a publication that is related to this code (our supplemental materials on OSF have more EFA-related-code-goodness, like excel calculators for model fit indexes and nested model comparisons). So if you end up using the code supplied in this post (or any of the supplemental materials) in your own research, I’d be grateful (as would Steve) if you’d cite our paper:
Sakaluk, J. K., & Short, S. D. (2016). A Methodological Review of Exploratory Factor Analysis in Sexuality Research: Used Practices, Best Practices, and Data Analysis Resources. Journal of Sex Research.
Until next time, keep those plots pretty!