9  EFA

9.1 Factor analysis: exploratory vs confirmatory

In exploratory factor analysis (EFA), all measured variables are related to every latent variable. It is used to reduce data to a smaller set of summary variables and to explore the underlying theoretical structure of the phenomena. It asks which factors are present in the observed data and therefore requires an interpretation of how useful a model is (ideally, the model should subsequently be confirmed with confirmatory factor analysis).

In confirmatory factor analysis (CFA), researchers specify in advance how many factors are required and which measured variable is related to which latent variable. It asks how well a proposed model fits the given data. However, it does not give a definitive answer, but is rather useful for comparing models (and data).

In short, EFA explores which factor structure underlies the data, whereas CFA tests how well a prespecified structure fits the data.

9.1.1 Prerequisites

Prerequisites of factor analysis include:

  • Condition: There are several interval-scaled characteristics (items).
  • Rule of thumb: at least 50 participants, and at least three times as many participants as variables (ideally five times as many!). For example, with 25 items you would want at least 75 respondents, ideally 125 or more.

9.1.2 Dimensionality reduction

Dimensionality reduction transforms a data set from a high-dimensional space into a low-dimensional space. It can be a good choice when you suspect there are too many variables, because data in higher dimensions is difficult to understand (or visualize).

Another potential consequence of having a multitude of predictors is possible harm to a model. The simplest example is a method like ordinary linear regression where the number of predictors should be less than the number of data points used to fit the model.

Another issue is multicollinearity, where between-predictor correlations can negatively impact the mathematical operations used to estimate a model. If there are an extremely large number of predictors, it is fairly unlikely that there are an equal number of real underlying effects. Predictors may be measuring the same latent effect(s), and thus such predictors will be highly correlated.

There are several dimensionality reduction methods that can be used with different types of data for different requirements:

  • combining features:
    • linear: Principal component analysis (PCA), Factor analysis (FA), Multiple correspondence analysis (MCA), Linear discriminant analysis (LDA) or Singular value decomposition (SVD)
    • non-linear: Kernel PCA, t-distributed Stochastic Neighbor Embedding (t-SNE) or Multidimensional scaling (MDS)
  • keeping most important features: Random forests, Forward or Backward selection
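
As a minimal illustration in R, a linear feature-combination method such as PCA can be run with the base function prcomp(). This is only a sketch; it assumes the psych package is installed and reuses the bfi personality items that also appear in the exercise below.

# minimal sketch: PCA as linear dimensionality reduction
data <- na.omit(psych::bfi[, 1:25])      # 25 personality items, complete cases only
pca <- prcomp(data, center = TRUE, scale. = TRUE)
summary(pca)$importance[, 1:5]           # variance explained by the first 5 components
head(pca$x[, 1:5])                       # each person's scores on the first 5 components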

9.2 General procedure of EFA

Several steps need to be undertaken to conduct exploratory factor analysis:

  • Setup and evaluate data set
  • Choose number of factors to extract
  • Extract (and rotate) factors
  • Evaluate what you have and possibly repeat the second and third steps
  • Interpret and write-up results

There are also general guidelines to follow:

  • It is better to select only the variables of interest from the data set.
  • As factor analysis (and PCA) does not play well with missing data, it is better to remove cases that have missing data.
  • Remember that factor analysis is designed for continuous data, although it is possible to include categorical data in a factor analysis.

9.3 Suitability of the data

Only relevant items may be included in the factor analysis. For instance, items must correlate significantly. Here, Bartlett's test is useful for assessing the hypothesis that the sample came from a population in which the variables are uncorrelated. It checks whether the correlation matrix is an identity matrix or not:

  • H0: The variables are uncorrelated in the population.
  • H1: The variables are correlated in the population.

Kaiser-Meyer-Olkin criterion (KMO) and Measure of Sampling Adequacy (MSA) further test to what extent the variance of one variable is explained by the other variables. This indicates whether a data set is suitable for a factor analysis:

  • value range 0-1
  • values from .8 desirable
  • values below .5 unacceptable
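
Both checks are available in the psych package. The following is a minimal sketch; it assumes psych is installed and again uses the bfi items from the exercise below as a stand-in for your own item set.

# minimal sketch: suitability checks before a factor analysis
items <- na.omit(psych::bfi[, 1:25])
psych::KMO(items)                                      # overall MSA and per-item MSA
psych::cortest.bartlett(cor(items), n = nrow(items))   # Bartlett's test of sphericity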

9.4 PCA versus EFA

In Principal Component Analysis (PCA), each variable can be fully explained by a linear combination of the extracted factors (components):

\[ Z_{ij} = f_{i1}a_{1j} + f_{i2}a_{2j} + ... + f_{ip}a_{pj} \]

where \(Z_{ij}\) is the z-standardized value of variable \(x_j\) for person \(i\), \(f_{ik}\) are the factor scores and \(a_{kj}\) are the factor loadings.

This approach should be used when structuring the data set and data reduction are the primary goals.

In Exploratory Factor Analysis (EFA), each variable cannot be fully explained by a linear combination of the factors; a unique (error) term remains:

\[ Z_{ij} = f_{i1}a_{1j} + f_{i2}a_{2j} + ... + f_{ip}a_{pj} + e_j \]

This approach should be used when trying to identify latent variables that are crucial to answering the items.
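
The contrast can be made visible in R with the psych package (a minimal sketch, not part of the lecture code): psych::principal() extracts components, psych::fa() extracts common factors. With all components retained, PCA reproduces each variable exactly, whereas EFA always leaves a unique term.

# minimal sketch: component analysis vs. common factor analysis
items <- na.omit(psych::bfi[, 1:25])
pca <- psych::principal(items, nfactors = ncol(items), rotate = "none")
round(range(pca$communality), 2)   # all communalities are 1: no unexplained variance
efa <- psych::fa(items, nfactors = 5, rotate = "none")
round(range(efa$communality), 2)   # communalities < 1: a unique term e_j remains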

9.5 Spatial representation

Each of the given vectors (items) can be written exactly using the basis vectors (factors). The cosine of the angle between a basis vector (factor) and a given vector (item) corresponds to the correlation coefficient between item and factor: the factor loading.

Fundamental theorem: every observed value of a variable \(x_j\) can be described as a linear combination of several (hypothetical) factors (\(f_{jn}\)). That means that the answer to an item can be traced back to the sum of the factor values (\(f_{jn}\)), which are weighted with the factor loadings (\(a_{jn}\)).
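
Written out, the fundamental theorem takes the same form as the EFA equation in Section 9.4:

\[ Z_{ij} = f_{i1}a_{1j} + f_{i2}a_{2j} + ... + f_{ip}a_{pj} + e_j \]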

9.6 Extraction of the factors

To extract the factors, we first z-standardize all variables (mean = 0, standard deviation = 1, variance = 1). Then, step by step, the factor that explains the most variance across all variables (\(x_1\) to \(x_j\)) is searched for. Similar to multiple regression, this factor is a linear combination of the items:

\[ F_1 = b_{11}x_1 + b_{12}x_2 + ... + b_{1j}x_j \]

To measure how much of a variable's variance is explained by a factor, the squared factor loading serves as the coefficient of determination (see \(R^2\) in regression). The first factor is the linear combination of the z-standardized variables for which the sum of these determination coefficients is maximal.
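
A minimal sketch in R, assuming the psych package: extract a single unrotated factor and check that its explained variance equals the sum of the squared loadings (the determination coefficients).

# minimal sketch: first factor and its sum of squared loadings
items <- scale(na.omit(psych::bfi[, 1:25]))   # z-standardize all variables
f1 <- psych::fa(items, nfactors = 1, rotate = "none")
round(f1$loadings[1:5, 1], 2)   # loadings = correlations between items and the factor
sum(f1$loadings^2)              # sum of the determination coefficients (variance explained)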

9.7 Decision about the number of factors

There are four decision aids for choosing the number of factors:

  • Kaiser criterion: a factor is considered significant if it explains more variance than any single original (standardized) variable; this criterion is fulfilled from an eigenvalue of 1 onwards
  • Scree plot: the eigenvalues are plotted and the factors that lie above the “elbow” (threshold) are extracted
  • Content plausibility: as many factors are retained as result in a plausible interpretation
  • A priori criterion: it is theoretically determined in advance how many factors there should be
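
The Kaiser criterion and the scree plot can be inspected in R with functions already used in the exercises below (a minimal sketch, assuming the psych package):

# minimal sketch: Kaiser criterion and scree plot
items <- na.omit(psych::bfi[, 1:25])
ev <- eigen(cor(items))$values
sum(ev > 1)                       # Kaiser criterion: number of eigenvalues > 1
psych::scree(items, pc = FALSE)   # scree plot: extract the factors above the "elbow"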

9.8 Factor rotation

By rotating the factor solution one obtains a simple structure, which makes the factors easier to interpret. There are two (or three) types of rotation:

  • Varimax (orthogonal): the factor axes remain at right angles, so that the factors are uncorrelated
  • Oblique: the factor axes do not remain at right angles, so that the factors may be correlated
  • Combination (first orthogonal, then oblique)

Which form of factor rotation is chosen in a specific case often depends on the theory behind it:

  • In the case of orthogonal rotations, the results are easier to interpret because the factors are uncorrelated.
  • Orthogonal rotations are usually also appropriate for pure dimension reduction.

However, orthogonal models often do not do justice to the underlying relationships:

  • The assumption of zero correlation between the factors is often too strict and does not reflect the complexity of the data.
  • If highly correlated factors are assumed in the data, an oblique rotation appears to make more sense than an orthogonal one.
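
A minimal sketch of both rotation types in R (assuming the psych and GPArotation packages are installed, as in the exercises below). For an oblique solution, the factor intercorrelations are returned in the Phi matrix.

# minimal sketch: orthogonal vs. oblique rotation of the same 5-factor solution
items <- na.omit(psych::bfi[, 1:25])
efa_varimax <- psych::fa(items, nfactors = 5, rotate = "varimax")   # factors stay uncorrelated
efa_oblimin <- psych::fa(items, nfactors = 5, rotate = "oblimin")   # factors may correlate
round(efa_oblimin$Phi, 2)   # factor intercorrelations (only exist for oblique rotations)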

9.9 Interpretation and naming

Factor interpretation is based on the factor loadings. Loadings of .5 and above are usually interpreted. Ideally, each variable loads high on exactly one factor and low on the others.

For factor naming, variables with higher loadings should be given more weight. A negative loading does not mean that the item does not belong to the factor; however, the sign must be taken into account in the interpretation.

Possible problems:

  • Variables that load on several factors (cross-loadings) pose interpretation difficulties
  • Negative factor, or one-item factor
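
For the interpretation it often helps to print the loading matrix with a cutoff, so that only loadings above the interpretation threshold remain visible (a minimal sketch, assuming a fitted psych::fa object as in the exercises below):

# minimal sketch: show only loadings >= .5 and sort items by factor
items <- na.omit(psych::bfi[, 1:25])
efa <- psych::fa(items, nfactors = 5, rotate = "oblimin")
print(efa$loadings, cutoff = 0.5, sort = TRUE)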

9.10 Lexicon

  • Factor: latent dimension “behind” the variables, which is responsible for the manifestation of a directly measured variable
  • Component: Collection of variables that have things in common
  • Indicator (Item): a directly measured variable that, together with other variables, makes up a component/factor
  • Factor loading: Correlation between a variable and a factor (should be > .5 on one factor and preferably <.3 on other factors)
  • Squared factor loadings: indicates the proportion of variance (=determination coefficient) of the variable that is explained by the factor
  • Eigenvalue of the factor: proportion of the variance of all directly measured variables that is explained by a factor (=sum of the squared factor loadings [column by column]). For a factor to be relevant enough to be extracted, its eigenvalue should be > 1
  • Communality of a variable: proportion of the variance of an item that is explained by all factors (= sum of the squared factor loadings [row by row])
  • Factor value: calculated value based on the observed values on the measured variables and the factor loading of the variable on the factor
  • Rotation: optimization of the factor solution, perpendicular (orthogonal) vs oblique rotation
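
Several of these quantities can be computed directly from the loading matrix. A minimal sketch for an unrotated solution, following the column-wise and row-wise definitions above (psych package assumed):

# minimal sketch: eigenvalues and communalities from the loading matrix
items <- na.omit(psych::bfi[, 1:25])
efa <- psych::fa(items, nfactors = 5, rotate = "none")
L <- unclass(efa$loadings)   # loading matrix: items x factors
round(colSums(L^2), 2)       # eigenvalue of each factor (sum of squared loadings, column by column)
round(rowSums(L^2), 2)       # communality of each variable (sum of squared loadings, row by row)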

9.11 How does it work in R?

See the lecture slides on exploratory factor analysis:

You can also download the PDF of the slides here: Click here

9.12 Quiz

Decide whether each of the following statements is true or false:

  • The Kaiser’s criterion is a measure of whether the data is suitable for an exploratory factor analysis.
  • Both in EFA and CFA we specify the pattern of indicator-factor loadings.
  • Bartlett Test is useful to assess whether the sample came from a population in which the variables are uncorrelated.
  • Kaiser’s criterion and scree plot are alternative methods for determining how many factors to retain.

9.13 Example from the literature

The following article relies on EFA as a method of analysis:

Schulz, A., Müller, P., Schemer, C., Wirz, D. S., Wettstein, M., & Wirth, W. (2018). Measuring populist attitudes on three dimensions. International Journal of Public Opinion Research, 30(2), 316-326. Available here.

Please reflect on the following questions:

  • What is the research question of the study?
  • What are the research hypotheses?
  • Is EFA an appropriate method of analysis to answer the research question?
  • What are the main findings of the EFA?

9.14 Time to practice on your own

You can download the PDF of the EFA exercises here: Click here

9.14.1 Exercise 1: Big-5

To illustrate EFA, let us use the International Personality Item Pool data available in the psych package. It includes 25 personality self-report items following the Big Five personality structure.

The first step is to test if the dataset is suitable for conducting factor analysis. To do so, run the Bartlett’s Test of Sphericity and the Kaiser Meyer Olkin (KMO) measure.

Reminder: Bartlett’s Test of Sphericity tests whether a (correlation) matrix is significantly different from an identity matrix, i.e. whether the correlation matrix has significant correlations among at least some of the variables in a dataset. The KMO measure indicates the degree to which each variable in a set is predicted without error by the other variables (a KMO value close to 1 indicates that the sum of partial correlations is not large relative to the sum of correlations, so factor analysis should yield distinct and reliable factors).

Show the code
# load the data
data <- psych::bfi[, 1:25] # 25 first columns corresponding to the items
data <- na.omit(data)
# option 1: check suitability
psych::KMO(data)
## Kaiser-Meyer-Olkin factor adequacy
## Call: psych::KMO(r = data)
## Overall MSA =  0.85
## MSA for each item = 
##   A1   A2   A3   A4   A5   C1   C2   C3   C4   C5   E1   E2   E3   E4   E5   N1   N2   N3 
## 0.75 0.84 0.87 0.88 0.90 0.84 0.80 0.85 0.83 0.86 0.84 0.88 0.90 0.88 0.89 0.78 0.78 0.86 
##   N4   N5   O1   O2   O3   O4   O5 
## 0.89 0.86 0.86 0.78 0.84 0.77 0.76
bartlett = psych::cortest.bartlett(data)
## R was not square, finding R from data
print(paste0("Chi-2: ", round(bartlett[["chisq"]],2), 
             "; p-value: ", bartlett[["p.value"]]))
## [1] "Chi-2: 18146.07; p-value: 0"
# option 2: check suitability
# performance::check_factorstructure(data)

Once you are confident that the dataset is appropriate for factor analysis, you can explore a factor structure made of 5 (theoretically motivated) latent variables. Start by defining the model.

Show the code
# optional: eigenvalues
ev <- eigen(cor(data)) 
print(ev$values)
##  [1] 5.1343112 2.7518867 2.1427020 1.8523276 1.5481628 1.0735825 0.8395389 0.7992062
##  [9] 0.7189892 0.6880888 0.6763734 0.6517998 0.6232530 0.5965628 0.5630908 0.5433053
## [17] 0.5145175 0.4945031 0.4826395 0.4489210 0.4233661 0.4006715 0.3878045 0.3818568
## [25] 0.2625390
psych::scree(data, pc=FALSE)
# fit an EFA
efa <- psych::fa(data, nfactors = 5, rotate="oblimin")
## Loading required package: GPArotation
efa_para <- psych::fa(data, nfactors = 5, rotate="oblimin") |>
  parameters::model_parameters(sort = TRUE, threshold = "max")
efa_para
## # Rotated loadings from Factor Analysis (oblimin-rotation)
## 
## Variable | MR2  |  MR1  |  MR3  |  MR5  |  MR4  | Complexity | Uniqueness
## -------------------------------------------------------------------------
## N1       | 0.83 |       |       |       |       |    1.07    |    0.32   
## N2       | 0.78 |       |       |       |       |    1.03    |    0.39   
## N3       | 0.70 |       |       |       |       |    1.08    |    0.46   
## N5       | 0.48 |       |       |       |       |    2.00    |    0.65   
## N4       | 0.47 |       |       |       |       |    2.33    |    0.49   
## E2       |      | 0.67  |       |       |       |    1.08    |    0.45   
## E4       |      | -0.59 |       |       |       |    1.52    |    0.46   
## E1       |      | 0.55  |       |       |       |    1.22    |    0.65   
## E5       |      | -0.42 |       |       |       |    2.68    |    0.59   
## E3       |      | -0.41 |       |       |       |    2.65    |    0.56   
## C2       |      |       | 0.67  |       |       |    1.18    |    0.55   
## C4       |      |       | -0.64 |       |       |    1.13    |    0.52   
## C3       |      |       | 0.57  |       |       |    1.10    |    0.68   
## C5       |      |       | -0.56 |       |       |    1.41    |    0.56   
## C1       |      |       | 0.55  |       |       |    1.20    |    0.65   
## A3       |      |       |       | 0.68  |       |    1.06    |    0.46   
## A2       |      |       |       | 0.66  |       |    1.03    |    0.54   
## A5       |      |       |       | 0.54  |       |    1.48    |    0.53   
## A4       |      |       |       | 0.45  |       |    1.74    |    0.70   
## A1       |      |       |       | -0.44 |       |    1.88    |    0.80   
## O3       |      |       |       |       | 0.62  |    1.16    |    0.53   
## O5       |      |       |       |       | -0.54 |    1.21    |    0.70   
## O1       |      |       |       |       | 0.52  |    1.10    |    0.68   
## O2       |      |       |       |       | -0.47 |    1.68    |    0.73   
## O4       |      |       |       |       | 0.36  |    2.65    |    0.75   
## 
## The 5 latent factors (oblimin rotation) accounted for 42.36% of the total variance of the original data (MR2 = 10.31%, MR1 = 8.83%, MR3 = 8.39%, MR5 = 8.29%, MR4 = 6.55%).

What do you see? How can you interpret the output?

The 25 items spread on the 5 latent factors nicely - the famous big 5.

It is possible to visualize the results to ease the interpretation:

Show the code
loads <- efa$loadings
psych::fa.diagram(loads)

Based on this model, you could predict back the scores for each individual for these new variables. This could be useful for further analysis (e.g. regression analysis).

Tips: use the function predict() and give labels to the latent factors.

Show the code
# predictions <- predict(
#   efa_para,
#   names = c("Neuroticism", "Conscientiousness", "Extraversion", "Agreeableness", "Openness"),
#   verbose = FALSE
# )
# head(predictions)

9.14.2 Exercise 2: Environmental concerns

We will use survey data from the World Values Survey (WVS) website to investigate human beliefs and values, especially environmental concerns (EC). We will analyse Swiss data derived from the much larger WVS cross-country database (2007). The data can be downloaded here.

EC has been measured by a set of 8 items on a four-point Likert scale. These items are thought to load on three different facets of EC: concerns about one’s own community (water quality, air quality and sanitation), concerns about the world at large (fears about global warming, loss of biodiversity and ocean pollution), and willingness to pay/do more (the ‘taxes’ and ‘gov’ items). We want to answer the following question: does the structure of the items support the proposed three-factor model (or can we assume a one-factor structure)?

Let’s first prepare the data and get the table of correlations:

Show the code
db <- openxlsx::read.xlsx(paste0(getwd(),
                  "/data/WV5_Data_Switzerland_Excel_v20201117.xlsx"))
colnames(db) <- gsub(":.*","",colnames(db))
sel <- db |>
  dplyr::select(V108,V109,V110,
                V111,V112,V113,
                V106,V107
                ) |>
  dplyr::rename("water"="V108",
                "air"="V109",
                "sanitation"="V110",
                "warming"="V111",
                "biodiv"="V112",
                "pollution"="V113",
                "taxes"="V106",
                "gov"="V107") |>
  stats::na.omit()
sel = replace(sel, sel==-1, NA)
sel = sel[complete.cases(sel),]
# reverse scale
for(i in 1:ncol(sel)){sel[,i] <- (sel[,i]-5)*(-1)}
# correlation
round(cor(sel),2)
##            water  air sanitation warming biodiv pollution taxes   gov
## water       1.00 0.73       0.84    0.04   0.06      0.10  0.00  0.09
## air         0.73 1.00       0.74    0.14   0.14      0.15  0.07  0.03
## sanitation  0.84 0.74       1.00    0.06   0.08      0.12  0.00  0.11
## warming     0.04 0.14       0.06    1.00   0.49      0.39  0.20  0.01
## biodiv      0.06 0.14       0.08    0.49   1.00      0.45  0.13  0.09
## pollution   0.10 0.15       0.12    0.39   0.45      1.00  0.14  0.00
## taxes       0.00 0.07       0.00    0.20   0.13      0.14  1.00 -0.23
## gov         0.09 0.03       0.11    0.01   0.09      0.00 -0.23  1.00

The first step is to test if the dataset is suitable for conducting factor analysis. To do so, run the Bartlett’s Test of Sphericity and the Kaiser Meyer Olkin (KMO) measure.

Show the code
# option 1: check suitability
psych::KMO(sel)
## Kaiser-Meyer-Olkin factor adequacy
## Call: psych::KMO(r = sel)
## Overall MSA =  0.72
## MSA for each item = 
##      water        air sanitation    warming     biodiv  pollution      taxes        gov 
##       0.71       0.84       0.70       0.69       0.66       0.74       0.62       0.50
bartlett = psych::cortest.bartlett(sel)
## R was not square, finding R from data
print(paste0("Chi-2: ", round(bartlett[["chisq"]],2), 
             "; p-value: ", bartlett[["p.value"]]))
## [1] "Chi-2: 3331.64; p-value: 0"
# option 2: check suitability
# performance::check_factorstructure(sel)

Now, you can explore a factor structure made of 3 (theoretically motivated) latent variables. Start by defining the model.

Show the code
# optional: eigenvalues
ev <- eigen(cor(sel)) 
print(ev$values)
## [1] 2.6757147 1.8568003 1.2061976 0.7281643 0.6039547 0.4827758 0.2920477 0.1543450
psych::scree(sel, pc=FALSE)
# fit an EFA
efa <- psych::fa(sel, nfactors = 3, rotate="oblimin")
efa_para <- psych::fa(sel, nfactors = 3, rotate="oblimin") |>
  parameters::model_parameters(sort = TRUE, threshold = "max")
efa_para
## # Rotated loadings from Factor Analysis (oblimin-rotation)
## 
## Variable   | MR1  | MR2  |  MR3  | Complexity | Uniqueness
## ----------------------------------------------------------
## sanitation | 0.93 |      |       |    1.01    |    0.14   
## water      | 0.92 |      |       |    1.00    |    0.17   
## air        | 0.79 |      |       |    1.04    |    0.34   
## biodiv     |      | 0.79 |       |    1.02    |    0.40   
## warming    |      | 0.64 |       |    1.05    |    0.56   
## pollution  |      | 0.57 |       |    1.04    |    0.65   
## gov        |      |      | -0.53 |    1.14    |    0.72   
## taxes      |      |      | 0.48  |    1.20    |    0.73   
## 
## The 3 latent factors (oblimin rotation) accounted for 53.70% of the total variance of the original data (MR1 = 29.23%, MR2 = 17.66%, MR3 = 6.81%).

What do you see? How can you interpret the output?

The 8 items spread on the 3 latent factors nicely. Note that the ‘gov’ item has a negative loading.

It is possible to visualize the results to ease the interpretation:

Show the code
loads <- efa$loadings
psych::fa.diagram(loads)