11  Recap EFA and CFA

11.1 Recap on factor analysis: exploratory and confirmatory

In exploratory factor analysis (EFA), every measured variable is allowed to relate to every latent variable. EFA is used to reduce the data to a smaller set of summary variables and to explore the underlying theoretical structure of the phenomena. It asks which factors are present in the observed data and therefore requires interpreting how useful the resulting model is (the structure should then be confirmed with confirmatory factor analysis).

In confirmatory factor analysis (CFA), researchers specify in advance the number of factors and which measured variable is related to which latent variable. It asks how well a proposed model fits the data.

The differences between the two approaches can be summarized as follows:

11.1.1 Prerequisites

Prerequisites of factor analysis include:

  • Condition: There are several interval-scaled characteristics (items).
  • Rule of thumb: At least 50 people and at least 3 times as many people as variables (ideally 5 times as many).
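The rule of thumb above can be turned into a quick check. The sketch below is only an illustration: the function name `check_sample_size` and its arguments are hypothetical and not part of any package.

```r
# Hypothetical helper: checks the sample-size rule of thumb stated above.
check_sample_size <- function(n_people, n_items) {
  ratio <- n_people / n_items
  list(
    minimum_met = n_people >= 50 && ratio >= 3,  # at least 50 people, 3 per item
    ideal_met   = n_people >= 50 && ratio >= 5,  # ideally 5 people per item
    ratio       = ratio
  )
}

# Example: 120 respondents answering 30 items
res <- check_sample_size(n_people = 120, n_items = 30)
res$minimum_met  # TRUE  (120 >= 50 and 120 / 30 = 4 >= 3)
res$ideal_met    # FALSE (4 < 5)
```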

11.2 Recap slides on EFA and CFA

See the recap slides on EFA and CFA:

You can also download the PDF of the slides here: Click here

11.3 Additional optional exercises on CFA

11.3.1 Optional exercise: Anxiety (inspired by J. Lin’s UCLA presentation)

We will use a real world example of a questionnaire which Andy Field terms the SPSS Anxiety Questionnaire (SAQ). The first eight items consist of the following:

  • Statistics makes me cry
  • My friends will think I’m stupid for not being able to cope with SPSS
  • Standard deviations excite me
  • I dream that Pearson is attacking me with correlation coefficients
  • I don’t understand statistics
  • I have little experience with computers
  • All computers hate me
  • I have never been good at mathematics
library(foreign)
# read the SAQ data; keep numeric codes rather than value labels
dat <- read.spss("https://stats.idre.ucla.edu/wp-content/uploads/2018/05/SAQ.sav",
                 to.data.frame = TRUE, use.value.labels = FALSE)
# correlations
round(cor(dat[,1:8]),2)
##       q01   q02   q03   q04   q05   q06   q07   q08
## q01  1.00 -0.10 -0.34  0.44  0.40  0.22  0.31  0.33
## q02 -0.10  1.00  0.32 -0.11 -0.12 -0.07 -0.16 -0.05
## q03 -0.34  0.32  1.00 -0.38 -0.31 -0.23 -0.38 -0.26
## q04  0.44 -0.11 -0.38  1.00  0.40  0.28  0.41  0.35
## q05  0.40 -0.12 -0.31  0.40  1.00  0.26  0.34  0.27
## q06  0.22 -0.07 -0.23  0.28  0.26  1.00  0.51  0.22
## q07  0.31 -0.16 -0.38  0.41  0.34  0.51  1.00  0.30
## q08  0.33 -0.05 -0.26  0.35  0.27  0.22  0.30  1.00
# covariances
round(cov(dat[,1:8]),2)
##       q01   q02   q03   q04   q05   q06   q07   q08
## q01  0.69 -0.07 -0.30  0.34  0.32  0.20  0.28  0.24
## q02 -0.07  0.72  0.29 -0.09 -0.10 -0.07 -0.15 -0.04
## q03 -0.30  0.29  1.16 -0.39 -0.32 -0.27 -0.45 -0.24
## q04  0.34 -0.09 -0.39  0.90  0.37  0.30  0.43  0.29
## q05  0.32 -0.10 -0.32  0.37  0.93  0.28  0.36  0.23
## q06  0.20 -0.07 -0.27  0.30  0.28  1.26  0.64  0.22
## q07  0.28 -0.15 -0.45  0.43  0.36  0.64  1.22  0.29
## q08  0.24 -0.04 -0.24  0.29  0.23  0.22  0.29  0.76

The entries of the correlation table are the standardized covariances between pairs of items. In a correlation table, the diagonal elements are always one because an item is always perfectly correlated with itself.

In a typical variance-covariance matrix, the diagonal elements are the variances of the items and the off-diagonal elements are the covariances between pairs of items.
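The link between the two tables can be checked numerically. The following is a minimal sketch on simulated data (a stand-in, not the SAQ items; the variable names `a`, `b`, `c` are made up for illustration). It shows that dividing each covariance by the product of the two standard deviations reproduces the correlation matrix.

```r
set.seed(1)
# Simulated stand-in for three questionnaire items (not the SAQ data)
x <- data.frame(a = rnorm(200))
x$b <- 0.5 * x$a + rnorm(200)
x$c <- -0.3 * x$a + rnorm(200, sd = 2)

S <- cov(x)   # variance-covariance matrix
R <- cor(x)   # correlation matrix

# Standardize every covariance by the two standard deviations involved
sds <- sqrt(diag(S))
R_by_hand <- S / outer(sds, sds)

all.equal(R_by_hand, R)    # the hand-computed matrix matches cor()
all.equal(cov2cor(S), R)   # cov2cor() performs the same standardization
```

Applied to the rounded SAQ values, the covariance of q01 and q04 (0.34) divided by the square root of the product of their variances (0.69 and 0.90) gives about 0.43, which agrees with the reported correlation of 0.44 up to rounding of the inputs.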

We decide to use only Items 1, 3, 4, 5, and 8 as indicators of SPSS Anxiety and Items 6 and 7 as indicators of Attribution Bias. Thus, we will now proceed with a two-factor CFA where we assume uncorrelated (or orthogonal) factors. Having a two-item factor presents a special problem for identification. In order to identify a two-item factor there are two options:

  • Freely estimate the loadings of the two items on the same factor but constrain them to be equal, while fixing the variance of the factor at 1
  • Freely estimate the variance of the factor, using the marker method for the first item, but covary (correlate) the two-item factor with another factor

Since we are fitting an uncorrelated two-factor solution here, we are restricted to the first option.
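To make the two identification options concrete, here is a hedged sketch of how each could be written in lavaan syntax. It runs on simulated data rather than the SAQ file, and the population loadings in `pop` are arbitrary assumptions chosen only so that the example is self-contained.

```r
library(lavaan)

# Simulated data for illustration only (not the SAQ data); the population
# values below are arbitrary assumptions.
pop <- '
  f1 =~ 0.6*q01 + 0.6*q03 + 0.7*q04 + 0.6*q05 + 0.5*q08
  f2 =~ 0.7*q06 + 0.8*q07
  f1 ~~ 0.6*f2
'
sim <- simulateData(pop, sample.nobs = 1000, seed = 123)

# Option 1: equal loadings on the two-item factor, factor variances fixed
# at 1 (std.lv = TRUE), factors uncorrelated
m_opt1 <- '
  f1 =~ q01 + q03 + q04 + q05 + q08
  f2 =~ a*q06 + a*q07
  f1 ~~ 0*f2
'
fit_opt1 <- cfa(m_opt1, data = sim, std.lv = TRUE)

# Option 2: marker method (lavaan's default: first loading of each factor
# fixed at 1, factor variances free), factors allowed to covary
m_opt2 <- '
  f1 =~ q01 + q03 + q04 + q05 + q08
  f2 =~ q06 + q07
'
fit_opt2 <- cfa(m_opt2, data = sim)

# Degrees of freedom of the two specifications
fitMeasures(fit_opt1, "df")
fitMeasures(fit_opt2, "df")

# Under the marker method the loading of q06 on f2 is fixed at 1
pe <- parameterEstimates(fit_opt2)
pe[pe$lhs == "f2" & pe$op == "=~", c("lhs", "op", "rhs", "est")]
```

With seven items, the first specification has 15 degrees of freedom and the second has 13, because the second model additionally estimates the factor covariance and does not impose the equality constraint.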

How does this model compare to a one-factor model?

library(lavaan)
# one-factor model: all seven items load on a single factor
m1 <- 'f1 =~ q01 + q03 + q04 + q05 + q06 + q07 + q08'
onefac7items <- cfa(m1, data=dat, std.lv=TRUE)
summary(onefac7items, fit.measures=TRUE, standardized=TRUE)
#uncorrelated two factor solution, var std method
m <- 'f1 =~ q01+ q03 + q04 + q05 + q08
        f2 =~ a*q06 + a*q07
        f1 ~~ 0*f2 ' 
twofac7items <- cfa(m, data=dat, std.lv=TRUE) 
summary(twofac7items, fit.measures=TRUE, standardized=TRUE)
## lavaan 0.6.17 ended normally after 14 iterations
## 
##   Estimator                                         ML
##   Optimization method                           NLMINB
##   Number of model parameters                        14
##   Number of equality constraints                     1
## 
##   Number of observations                          2571
## 
## Model Test User Model:
##                                                       
##   Test statistic                               841.205
##   Degrees of freedom                                15
##   P-value (Chi-square)                           0.000
## 
## Model Test Baseline Model:
## 
##   Test statistic                              3876.345
##   Degrees of freedom                                21
##   P-value                                        0.000
## 
## User Model versus Baseline Model:
## 
##   Comparative Fit Index (CFI)                    0.786
##   Tucker-Lewis Index (TLI)                       0.700
## 
## Loglikelihood and Information Criteria:
## 
##   Loglikelihood user model (H0)             -23684.164
##   Loglikelihood unrestricted model (H1)     -23263.562
##                                                       
##   Akaike (AIC)                               47394.328
##   Bayesian (BIC)                             47470.405
##   Sample-size adjusted Bayesian (SABIC)      47429.101
## 
## Root Mean Square Error of Approximation:
## 
##   RMSEA                                          0.146
##   90 Percent confidence interval - lower         0.138
##   90 Percent confidence interval - upper         0.155
##   P-value H_0: RMSEA <= 0.050                    0.000
##   P-value H_0: RMSEA >= 0.080                    1.000
## 
## Standardized Root Mean Square Residual:
## 
##   SRMR                                           0.180
## 
## Parameter Estimates:
## 
##   Standard errors                             Standard
##   Information                                 Expected
##   Information saturated (h1) model          Structured
## 
## Latent Variables:
##                    Estimate  Std.Err  z-value  P(>|z|)   Std.lv  Std.all
##   f1 =~                                                                 
##     q01               0.539    0.017   31.135    0.000    0.539    0.651
##     q03              -0.573    0.023  -24.902    0.000   -0.573   -0.533
##     q04               0.652    0.020   33.032    0.000    0.652    0.687
##     q05               0.567    0.020   27.812    0.000    0.567    0.588
##     q08               0.431    0.019   22.862    0.000    0.431    0.494
##   f2 =~                                                                 
##     q06        (a)    0.797    0.017   46.329    0.000    0.797    0.710
##     q07        (a)    0.797    0.017   46.329    0.000    0.797    0.723
## 
## Covariances:
##                    Estimate  Std.Err  z-value  P(>|z|)   Std.lv  Std.all
##   f1 ~~                                                                 
##     f2                0.000                               0.000    0.000
## 
## Variances:
##                    Estimate  Std.Err  z-value  P(>|z|)   Std.lv  Std.all
##    .q01               0.395    0.015   26.280    0.000    0.395    0.576
##    .q03               0.827    0.027   30.787    0.000    0.827    0.716
##    .q04               0.474    0.020   24.230    0.000    0.474    0.527
##    .q05               0.608    0.021   29.043    0.000    0.608    0.654
##    .q08               0.575    0.018   31.760    0.000    0.575    0.756
##    .q06               0.623    0.027   22.916    0.000    0.623    0.495
##    .q07               0.580    0.026   21.925    0.000    0.580    0.477
##     f1                1.000                               1.000    1.000
##     f2                1.000                               1.000    1.000

Since we have 7 items, the variance-covariance matrix has 7(7+1)/2=28 unique elements. The model has 14 parameters: 7 loadings and 7 residual variances (the factor variances are fixed at 1 and the factor covariance is fixed at 0). On its own this would leave 28-14=14 degrees of freedom. However, to identify the two-indicator factor we constrained the loadings of Item 6 and Item 7 to be equal, so only 13 parameters are freely estimated and we end up with 28-13=15 degrees of freedom.
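The counting argument can be written out in a few lines of base R. This is only a bookkeeping sketch; the variable names are made up for illustration, and the log-likelihood is copied from the output above as a cross-check of the number of free parameters.

```r
p <- 7                                # number of observed items
n_moments <- p * (p + 1) / 2          # unique variances and covariances: 28

n_loadings  <- 7                      # one loading per item
n_residuals <- 7                      # one residual variance per item
n_equality  <- 1                      # loadings of q06 and q07 share the label a

# Factor variances are fixed at 1 (std.lv = TRUE) and the factor covariance
# is fixed at 0, so they add no free parameters.
n_free <- n_loadings + n_residuals - n_equality   # 13
df_uncorrelated <- n_moments - n_free             # 15
df_uncorrelated

# Cross-check: AIC = -2 * loglik + 2 * (number of free parameters)
aic <- -2 * (-23684.164) + 2 * n_free
aic   # 47394.328, the AIC reported in the output
```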

We can see that the uncorrelated two factor CFA solution gives us a higher chi-square (lower is better), higher RMSEA and lower CFI/TLI than the one-factor model, which means overall it is a poorer fitting model.
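To see where the fit indices come from, they can be recomputed by hand from the chi-square values printed for the uncorrelated two-factor model. The sketch below uses the standard textbook definitions of RMSEA, CFI and TLI; lavaan's internal computation may differ in small details (for example, using n or n - 1), so treat this as an approximation that happens to agree with the output to three decimals.

```r
# Values copied from the output of the uncorrelated two-factor model
chisq   <- 841.205;  df   <- 15    # user model
chisq_b <- 3876.345; df_b <- 21    # baseline model
n       <- 2571                    # number of observations

# Standard definitions (an assumption about the exact formula lavaan uses)
rmsea <- sqrt(max(chisq - df, 0) / (df * n))
cfi   <- 1 - max(chisq - df, 0) / max(chisq - df, chisq_b - df_b, 0)
tli   <- (chisq_b / df_b - chisq / df) / (chisq_b / df_b - 1)

round(c(rmsea = rmsea, cfi = cfi, tli = tli), 3)
# rmsea 0.146, cfi 0.786, tli 0.700
```

All three indices are driven by how far the chi-square exceeds its degrees of freedom, which is why a model with a much larger chi-square also shows a higher RMSEA and lower CFI and TLI.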

We decide to go with a correlated (oblique) two factor model:

#correlated two factor solution, var std method (std.lv=TRUE fixes the factor variances at 1)
m <- 'f1 =~ q01+ q03 + q04 + q05 + q08
        f2 =~ q06 + q07' 
twofac7items_n <- cfa(m, data=dat, std.lv=TRUE) 
summary(twofac7items_n, fit.measures=TRUE, standardized=TRUE)
## lavaan 0.6.17 ended normally after 18 iterations
## 
##   Estimator                                         ML
##   Optimization method                           NLMINB
##   Number of model parameters                        15
## 
##   Number of observations                          2571
## 
## Model Test User Model:
##                                                       
##   Test statistic                                66.768
##   Degrees of freedom                                13
##   P-value (Chi-square)                           0.000
## 
## Model Test Baseline Model:
## 
##   Test statistic                              3876.345
##   Degrees of freedom                                21
##   P-value                                        0.000
## 
## User Model versus Baseline Model:
## 
##   Comparative Fit Index (CFI)                    0.986
##   Tucker-Lewis Index (TLI)                       0.977
## 
## Loglikelihood and Information Criteria:
## 
##   Loglikelihood user model (H0)             -23296.945
##   Loglikelihood unrestricted model (H1)     -23263.562
##                                                       
##   Akaike (AIC)                               46623.891
##   Bayesian (BIC)                             46711.672
##   Sample-size adjusted Bayesian (SABIC)      46664.013
## 
## Root Mean Square Error of Approximation:
## 
##   RMSEA                                          0.040
##   90 Percent confidence interval - lower         0.031
##   90 Percent confidence interval - upper         0.050
##   P-value H_0: RMSEA <= 0.050                    0.952
##   P-value H_0: RMSEA >= 0.080                    0.000
## 
## Standardized Root Mean Square Residual:
## 
##   SRMR                                           0.021
## 
## Parameter Estimates:
## 
##   Standard errors                             Standard
##   Information                                 Expected
##   Information saturated (h1) model          Structured
## 
## Latent Variables:
##                    Estimate  Std.Err  z-value  P(>|z|)   Std.lv  Std.all
##   f1 =~                                                                 
##     q01               0.513    0.017   30.460    0.000    0.513    0.619
##     q03              -0.599    0.022  -26.941    0.000   -0.599   -0.557
##     q04               0.658    0.019   34.876    0.000    0.658    0.694
##     q05               0.567    0.020   28.676    0.000    0.567    0.588
##     q08               0.435    0.018   23.701    0.000    0.435    0.498
##   f2 =~                                                                 
##     q06               0.669    0.025   27.001    0.000    0.669    0.596
##     q07               0.949    0.027   35.310    0.000    0.949    0.861
## 
## Covariances:
##                    Estimate  Std.Err  z-value  P(>|z|)   Std.lv  Std.all
##   f1 ~~                                                                 
##     f2                0.676    0.020   33.023    0.000    0.676    0.676
## 
## Variances:
##                    Estimate  Std.Err  z-value  P(>|z|)   Std.lv  Std.all
##    .q01               0.423    0.014   29.157    0.000    0.423    0.617
##    .q03               0.796    0.026   31.025    0.000    0.796    0.689
##    .q04               0.466    0.018   25.824    0.000    0.466    0.518
##    .q05               0.608    0.020   30.173    0.000    0.608    0.654
##    .q08               0.572    0.018   32.332    0.000    0.572    0.752
##    .q06               0.811    0.030   27.187    0.000    0.811    0.644
##    .q07               0.314    0.040    7.815    0.000    0.314    0.258
##     f1                1.000                               1.000    1.000
##     f2                1.000                               1.000    1.000

Compared to the uncorrelated two-factor solution, the chi-square and RMSEA are both lower. The test of close fit for the RMSEA is not significant (p = 0.952), which means that we do not reject the null hypothesis that the RMSEA is less than or equal to 0.05. Additionally, the CFI and TLI are both higher and exceed the 0.95 threshold. This model also fits better than the one-factor solution. We therefore choose the correlated two-factor CFA model as our final model, as shown below: