Learn what confirmatory factor analysis (CFA) is and how to run the analysis in R
CFA is a model-data fit test based on multivariate regression. Its outputs are path coefficients and fit indices.
If the paths are significant and the fit indices indicate an acceptable or good degree of fit, the hypothesized structural model is supported by the data.
Practical note: it is not good practice to use a CFA to confirm the findings of an EFA. It is better to use an EFA to help determine the number of factors, but then to collect more data to test one or more competing models based on that general factorization.
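A minimal sketch of that workflow, here using a simple split-half approach as a stand-in for collecting new data, and assuming a hypothetical data frame items of numeric item responses (the variable names and the two-factor structure are purely illustrative):

# split the sample: one half for the exploratory step, the other half for the confirmatory step
set.seed(123)
idx <- sample(seq_len(nrow(items)), size = floor(nrow(items) / 2))
exploration <- items[idx, ]
confirmation <- items[-idx, ]

# exploratory step: suggest a number of factors (base-R factanal, varimax rotation)
efa_fit <- factanal(exploration, factors = 2, rotation = "varimax")
print(efa_fit$loadings, cutoff = 0.4)

# confirmatory step: test the suggested structure on the holdout half
cfa_syntax <- 'F1 =~ item1 + item2 + item3
               F2 =~ item4 + item5 + item6'
cfa_fit <- lavaan::cfa(cfa_syntax, data = confirmation)
summary(cfa_fit, fit.measures = TRUE)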
Concrete research applications of CFA
Latent constructs are not “directly” measurable (e.g. media usage motives, media or brand ratings, attitudes, emotions and empathy during media reception, trust, etc.).
Problem: a single manifest variable usually cannot capture such a construct adequately.
Therefore, we should use several “indicator variables” that allow conclusions to be drawn about the latent variable.
Like the EFA, the CFA is also based on the fundamental theorem of factor analysis:
\[ x_{ij} = a_{j1}f_{i1} + a_{j2}f_{i2} + ... + a_{jm}f_{im} + e_{ij} \]
where \(x_{ij}\) is the value of person \(i\) on manifest variable \(j\), \(a_{jk}\) the loading of variable \(j\) on factor \(k\), \(f_{ik}\) the factor score of person \(i\) on factor \(k\) (for \(m\) factors), and \(e_{ij}\) the error term.
It is about a comparison (or an adjustment) between the empirical variance-covariance matrix and the model-theoretical (model-implied) covariance matrix (both are described below).
Mathematical requirements:
Theoretical assumptions:
Problem: the question is whether the system of equations can be solved mathematically, i.e. whether the model parameters (free parameters) can be estimated from the empirical variances and covariances of the manifest variables.
Logic: all parameters to be estimated in the model (factor loadings, error terms and, if applicable, permitted correlations between latent variables) must be calculable with the help of the empirical parameters.
Identification: if all parameters of the model can be estimated from the empirical information, the model is identified. If this is not the case, there is (so to speak) an equation with too many unknowns. Such an equation is “unsolvable”, and such a model is “unidentified”.
Step 1: defining a metric for the latent constructs
Step 2: identifying the model structure
Note: the more complex a model is, the more parameters have to be estimated, and the more empirical information is required so that the model’s degrees of freedom remain non-negative and estimation is possible.
If \(n\) manifest variables are collected, then the empirical variances and covariances of these variables can be calculated:
\[ p = \frac{n(n+1)}{2} \]
\(p\) corresponds to the available “information” for calculating all free parameters of our model. The difference between the available empirical information (\(p\)) and the number of parameters to be estimated (\(q\)) gives the degrees of freedom of the model (\(df_M\)):
\[ df_M = p - q \]
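As a quick arithmetic check, using the four-indicator, one-factor model estimated at the end of this section: with \(n = 4\) manifest variables there are \(p = \frac{4 \cdot 5}{2} = 10\) pieces of empirical information. Under lavaan's default marker method (see below), the free parameters are 3 factor loadings, 4 error variances and 1 latent variance, so \(q = 8\) and

\[ df_M = 10 - 8 = 2 \]

which matches the df reported in the lavaan output below.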
Under- and over-identified models: a model with df < 0 is under-identified, a model with df = 0 is exactly (just) identified, and a model with df > 0 is over-identified.
Reference Variable (or Marker Method): the factor loading of one indicator (the reference or marker variable) is fixed to 1, so the latent variable takes on the metric of that indicator (this is lavaan's default).
Fixed Factor Loading (or Variance Standardization Method): the variance of the latent variable is fixed to 1, so all factor loadings can be estimated freely.
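A minimal lavaan sketch of the two scaling approaches, reusing the measurement model and the data object sel from the example further below:

# marker method (lavaan default): the loading of the first indicator is fixed to 1
fit_marker <- lavaan::cfa('RC1 =~ B9a + B9b + B9c + B9d', data = sel)

# fixed factor / variance standardization: the latent variance is fixed to 1,
# so all four loadings are estimated freely
fit_fixed <- lavaan::cfa('RC1 =~ B9a + B9b + B9c + B9d', data = sel, std.lv = TRUE)

# both parameterizations are equivalent and imply the same model fit
lavaan::fitMeasures(fit_marker, c("chisq", "df"))
lavaan::fitMeasures(fit_fixed, c("chisq", "df"))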
A model is under-identified (df < 0) when the information available from the empirical data is not sufficient to estimate all of its parameters.
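For instance (a generic illustration): a single factor measured by only two indicators provides \(p = \frac{2 \cdot 3}{2} = 3\) pieces of empirical information, but under the marker method requires 1 free loading, 2 error variances and 1 latent variance, i.e. \(q = 4\) parameters, so

\[ df_M = 3 - 4 = -1 < 0 \]

and the model cannot be estimated.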
Empirical variance-covariance matrix (in short: covariance matrix): calculated with the collected data (= empirical relationships in the data).
Model-theoretical covariance matrix: defined measurement model (= expected relationships in the data).
=> the model-theoretical covariance matrix should resemble the empirical covariance matrix as closely as possible.
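In matrix notation, the model-theoretical (model-implied) covariance matrix of a standard CFA model can be written as

\[ \Sigma(\theta) = \Lambda \Phi \Lambda^{\top} + \Theta_{\delta} \]

where \(\Lambda\) contains the factor loadings, \(\Phi\) the variances and covariances of the latent factors, and \(\Theta_{\delta}\) the error (residual) variances and covariances. Estimation searches for parameter values \(\theta\) that bring \(\Sigma(\theta)\) as close as possible to the empirical covariance matrix \(S\).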
Model quality: fit indices describe how well the model fits. For exactly identified models exactly one solution is possible; for over-identified models several solutions are possible, and the best one must be found.
The parameters can be estimated using various estimation methods. The most common is the so-called Maximum Likelihood (ML) method: the parameter estimates are chosen that are most likely to reproduce the observed data.
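Concretely, for multivariate normal data the ML method minimizes the discrepancy function

\[ F_{ML} = \log|\Sigma(\theta)| + \mathrm{tr}\left( S\,\Sigma(\theta)^{-1} \right) - \log|S| - n \]

where \(S\) is the empirical covariance matrix, \(\Sigma(\theta)\) the model-implied covariance matrix and \(n\) the number of manifest variables; the parameter values that minimize \(F_{ML}\) are those most likely to have produced the observed data.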
Once a solution has been found, it must be checked for quality.
Multi-stage process:
We must ensure that only “good” indicators are included in a model:
We should assess whether the constructs/factors are reliably and validly measured (e.g. via composite reliability and average variance extracted; see the sketch below).
At the model level, we check whether the empirical variance-covariance matrix is reproduced as well as possible by the model-theoretical variance-covariance matrix (e.g. via the Chi-square test statistic and fit indices such as RMSEA and SRMR).
The table can be found here.
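A minimal R sketch of the indicator- and construct-level checks, assuming a fitted single-factor lavaan object such as model_cfa from the example below (the cut-offs mentioned in the comments are common rules of thumb, not fixed requirements):

# indicator level: standardized loadings (often expected to be around .5 or higher)
std <- lavaan::standardizedSolution(model_cfa)
loads <- std[std$op == "=~", c("lhs", "rhs", "est.std")]
loads

# construct level (single-factor case; compute per factor for multi-factor models):
# average variance extracted (AVE, often expected to be >= .5) and
# composite reliability (CR, often expected to be >= .6/.7)
l <- loads$est.std                       # standardized loadings
e <- 1 - l^2                             # standardized error variances
c(AVE = mean(l^2),
  CR  = sum(l)^2 / (sum(l)^2 + sum(e)))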
There are several things we can do if the model fit is too bad (a sketch follows below):
Comparing different models can be useful when we have competing theoretical models.
Comparison rules:
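A sketch of both strategies in lavaan, assuming the fitted object model_cfa and the data object sel from the worked example that follows (the added residual covariance in the second model is purely illustrative):

# modification indices: which fixed parameters would improve the fit most if freed?
mi <- lavaan::modindices(model_cfa)
head(mi[order(-mi$mi), ], 5)

# a competing, nested model: allow one residual covariance
syntx2 <- 'RC1 =~ B9a + B9b + B9c + B9d
           B9a ~~ B9b'
model_cfa2 <- lavaan::cfa(syntx2, data = sel)

# likelihood-ratio (chi-square difference) test for nested models
lavaan::lavTestLRT(model_cfa, model_cfa2)

# information criteria for comparisons (also of non-nested models): lower is better
lavaan::fitMeasures(model_cfa,  c("aic", "bic"))
lavaan::fitMeasures(model_cfa2, c("aic", "bic"))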
We can use CFA to check a theoretical model for measuring negative campaigning behaviors of politicians using the following variables:
# default: marker method by lavaan
syntx <- 'RC1 =~ B9a + B9b + B9c + B9d'
model_cfa <- lavaan::cfa(syntx, data = sel)
summary(model_cfa)
lhs op rhs est pvalue
1 RC1 =~ B9a 1.0000000 NA
2 RC1 =~ B9b 1.0751968 0
3 RC1 =~ B9c 1.0159835 0
4 RC1 =~ B9d 0.7518455 0
5 B9a ~~ B9a 0.8377181 0
6 B9b ~~ B9b 0.8805613 0
7 B9c ~~ B9c 0.4378973 0
8 B9d ~~ B9d 0.5384798 0
9 RC1 ~~ RC1 0.7003171 0
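The comparison between the empirical and the model-theoretical covariance matrix described above can also be inspected directly (a sketch using the fitted model):

# empirical (observed) covariance matrix of the four indicators
lavaan::lavInspect(model_cfa, "sampstat")$cov

# model-implied (model-theoretical) covariance matrix
lavaan::lavInspect(model_cfa, "implied")$cov

# their difference (residual covariances); small values indicate good reproduction
lavaan::lavInspect(model_cfa, "sampstat")$cov - lavaan::lavInspect(model_cfa, "implied")$cov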
fits <- lavaan::fitMeasures(model_cfa)
data.frame(fits = round(fits[c("ntotal","df",
"chisq","pvalue",
"rmsea","rmsea.pvalue","srmr")], 2))
fits
ntotal 2074.00
df 2.00
chisq 300.49
pvalue 0.00
rmsea 0.27
rmsea.pvalue 0.00
srmr 0.06