Longitudinal data analysis presentation

Learn what longitudinal data analysis is and how to run the analysis in R

Longitudinal data analysis with panel data

A panel is usually characterized by multiple entries (rows) for the same entity (e.g. respondent, company, etc.) in a dataset.

The multiple entries are due to different time periods at which the entity was observed.
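As a minimal sketch of this structure, the toy data frame below stores one row per entity and time period. The entities A to C and all values are made up for illustration and do not come from a real dataset. Base R is enough to count the rows per entity:

```r
# Toy panel in long format: 3 hypothetical entities observed in 2 periods
panel <- data.frame(
  id   = rep(c("A", "B", "C"), each = 2),
  year = rep(c(2001, 2002), times = 3),
  y    = c(1.2, 1.5, 0.7, 0.9, 2.1, 2.0)
)

# Rows per entity: more than one row per id signals a panel
obs_per_entity <- table(panel$id)
print(obs_per_entity)

# Check whether every entity is observed exactly once in every period
is_balanced <- all(table(panel$id, panel$year) == 1)
print(is_balanced)
```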

The problem with OLS

OLS can be used to pool observations of the same entity recorded at different time points. However, observations of the same entity are then treated as if they originated from different, independent entities.

Important features of panel data, such as unobserved entity-specific heterogeneity and serial correlation of observations within the same entity, are ignored, which can bias the coefficient estimates and their standard errors.
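A small simulation can make the pooling problem concrete. This is a sketch under stated assumptions: the data are simulated, the true slope of 1 and the strength of the entity effect are arbitrary choices, and the comparison regression simply adds one dummy variable per entity.

```r
# Simulated panel (all numbers are illustrative assumptions, not real data)
set.seed(42)
n_entities <- 100
n_periods  <- 10
id <- rep(seq_len(n_entities), each = n_periods)

# Unobserved, time-invariant entity effect that is correlated with x
entity_effect <- rnorm(n_entities)
x <- entity_effect[id] + rnorm(n_entities * n_periods)

# True slope of x is 1; the entity effect enters y with weight 2
y <- 1 * x + 2 * entity_effect[id] + rnorm(n_entities * n_periods)

# Pooled OLS ignores the entity effect
pooled_slope <- unname(coef(lm(y ~ x))["x"])

# Entity dummies absorb the entity effect
dummy_slope <- unname(coef(lm(y ~ x + factor(id)))["x"])

print(round(c(pooled = pooled_slope, with_dummies = dummy_slope), 2))
```

In this setup the pooled slope should come out well above the true value of 1, while the slope from the regression with entity dummies should stay close to it.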

Fixed effect model

Fixed effect models (optionally including time dummy variables) are sometimes applied to remove omitted variable bias. By estimating changes within a specific entity (over time), all time-invariant differences between entities are controlled for.

The model removes characteristics that do not change over time, so that the estimated effects of the remaining independent variables on the dependent variable are not biased by omitted time-invariant characteristics.

Hence, if unobserved characteristics do not change over time, any change in the dependent variable must be due to influences other than these fixed characteristics, which are controlled for.

Note that the influence of time-invariant independent variables on the dependent variable cannot be examined with a fixed effect model.
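The reasoning above can be written out with the within transformation. The notation is the usual textbook one and is introduced here for illustration only: \(y_{it}\) is the dependent variable of entity \(i\) at time \(t\), \(x_{it}\) a time-varying independent variable, \(z_i\) a time-invariant one, \(\alpha_i\) the entity-specific intercept and \(\varepsilon_{it}\) the idiosyncratic error.

```latex
\begin{align}
y_{it} &= \alpha_i + \beta x_{it} + \gamma z_i + \varepsilon_{it} \\
\bar{y}_i &= \alpha_i + \beta \bar{x}_i + \gamma z_i + \bar{\varepsilon}_i \\
y_{it} - \bar{y}_i &= \beta\,(x_{it} - \bar{x}_i) + (\varepsilon_{it} - \bar{\varepsilon}_i)
\end{align}
```

Subtracting the entity mean (second line) from the model (first line) removes both \(\alpha_i\) and \(\gamma z_i\). This is why time-invariant differences are controlled for, and also why \(\gamma\) cannot be estimated.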

Random effect model (multilevel model)

Random effect models (with random intercept and/or slope) assume that any variation between entities is random and not correlated with the independent variables used in the estimation model.

This also means that time-invariant variables (like a person’s gender) can be taken into account as independent variables.

The entity’s error term (unobserved heterogeneity) is hence assumed to be uncorrelated with the independent variables.
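In the usual textbook notation (the symbols are introduced here for illustration and are not part of the original slides), a random intercept model and its key assumption can be sketched as follows, with \(u_i\) the entity's error term and \(\varepsilon_{it}\) the idiosyncratic error:

```latex
\begin{align}
y_{it} &= \alpha + \beta x_{it} + \gamma z_i + u_i + \varepsilon_{it} \\
\operatorname{Cov}(u_i, x_{it}) &= 0, \qquad \operatorname{Cov}(u_i, z_i) = 0
\end{align}
```

Because \(u_i\) is treated as part of the error rather than removed by demeaning, the coefficient \(\gamma\) of a time-invariant variable \(z_i\) remains estimable.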

Example

The plm package ships with Grunfeld’s investment data, a panel of 10 observational units (firms) observed yearly from 1935 to 1954. In the next examples, we are interested in regressing investment (inv) on capital.

We can start by exploring the panel data.

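# Load the Grunfeld data shipped with plm
data("Grunfeld", package = "plm")
# regLine is the argument name in car >= 3.0 (formerly reg.line)
car::scatterplot(inv ~ year | firm,
                 boxplots = FALSE, smooth = TRUE, regLine = FALSE,
                 data = Grunfeld)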

Heterogeneity across units and time

We can check the heterogeneity across firms (with a 95% confidence interval around the mean):

gplots::plotmeans(inv ~ firm,
                  data = Grunfeld,
                  text.n.label = NULL,
                  n.label = FALSE)

We can further check the heterogeneity across years:

gplots::plotmeans(inv ~ year,
                  data = Grunfeld,
                  text.n.label = NULL,
                  n.label = FALSE)

Simple linear regression

The regular linear regression does not consider heterogeneity across groups or time:

plot(Grunfeld$capital, Grunfeld$inv, xlab="capital", ylab="investments")
abline(lm(inv ~ capital, data=Grunfeld), col="red")

Fixed effect model

In the fixed effect model (including dummy variables for all firms), the coefficient of capital indicates how much investment changes over time, on average per firm, when capital increases by one unit:

library(plm)
# Within (fixed effect) estimator with firm-specific effects
fe_model <- plm(inv ~ capital, data = Grunfeld,
                index = c("firm", "year"),
                effect = "individual",
                model = "within")
stargazer::stargazer(fe_model, type = "text",
                     single.row = TRUE)

========================================
                 Dependent variable:    
             ---------------------------
                         inv            
----------------------------------------
capital           0.371*** (0.019)      
----------------------------------------
Observations             200            
R2                      0.660           
Adjusted R2             0.642           
F Statistic   366.446*** (df = 1; 189)  
========================================
Note:        *p<0.1; **p<0.05; ***p<0.01

Each firm has its own intercept:

library(ggplot2)
fe_model_lm <- lm(inv ~ capital + factor(firm), 
                  data = Grunfeld)
ggplot(data = broom::augment(fe_model_lm),
       aes(x = capital, y = .fitted)) +
  geom_point(aes(color = `factor(firm)`)) +
  geom_line(aes(color = `factor(firm)`)) +
  labs(x = "Capital stock",
       y = "Fitted Values (inv ~ capital)",
       color = "Firm") 
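The firm-specific intercepts can also be inspected numerically. The sketch below assumes that plm is installed; it refits the same within model, extracts the intercepts with fixef(), and checks that the dummy-variable regression used for the plot gives the same slope for capital. The names firm_intercepts and slope_gap are introduced here for illustration.

```r
library(plm)
data("Grunfeld", package = "plm")

# Same within model as in the text
fe_model <- plm(inv ~ capital, data = Grunfeld,
                index = c("firm", "year"),
                effect = "individual", model = "within")

# One estimated intercept per firm
firm_intercepts <- fixef(fe_model)
print(firm_intercepts)

# The dummy-variable regression should yield the same slope for capital
fe_model_lm <- lm(inv ~ capital + factor(firm), data = Grunfeld)
slope_gap <- abs(unname(coef(fe_model)["capital"]) -
                 unname(coef(fe_model_lm)["capital"]))
print(slope_gap)
```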

pFtest: pooled vs fixed effects

The function pFtest() compares the fixed effects model with the pooled model. The null hypothesis is that there are no individual (firm) effects, so that the pooled linear model is adequate. Here the null hypothesis is rejected in favor of the alternative that there are significant fixed effects:

pooled_ols <- plm(inv ~ capital, data = Grunfeld,
                  index = c("firm", "year"),
                  effect = "individual", model = "pooling")
pFtest(fe_model, pooled_ols)

    F test for individual effects

data:  inv ~ capital
F = 123.39, df1 = 9, df2 = 189, p-value < 2.2e-16
alternative hypothesis: significant effects
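The same comparison can be sketched with base R by testing a regression with one common intercept against a regression with one intercept per firm. This is an illustration of the idea behind the test rather than a description of how pFtest() is implemented; the object names are introduced here for illustration.

```r
data("Grunfeld", package = "plm")

# Restricted model: one common intercept for all firms
pooled_lm <- lm(inv ~ capital, data = Grunfeld)

# Unrestricted model: one intercept per firm
dummy_lm <- lm(inv ~ capital + factor(firm), data = Grunfeld)

# F test of the 9 additional firm intercepts
f_comparison <- anova(pooled_lm, dummy_lm)
print(f_comparison)
```

The degrees of freedom of this comparison (9 and 189) match those reported by pFtest() above.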

Random effects model

In the random effects model, the coefficient of capital indicates the average effect of capital on investment when capital changes by one unit, across time and between firms:

re_model <- plm(inv ~ capital, data = Grunfeld,
                index = c("firm", "year"),
                effect = "individual", model = "random")
stargazer::stargazer(re_model, type = "text", single.row = TRUE)

========================================
                 Dependent variable:    
             ---------------------------
                         inv            
----------------------------------------
capital           0.372*** (0.019)      
Constant           43.247 (51.411)      
----------------------------------------
Observations             200            
R2                      0.652           
Adjusted R2             0.650           
F Statistic          371.149***         
========================================
Note:        *p<0.1; **p<0.05; ***p<0.01

phtest: fixed vs random effects

A decision between a fixed and a random effects model can be made with the Hausman test. The null hypothesis states that the entity-specific effects are not correlated with the independent variables (thus, one should prefer the random effect model). The alternative hypothesis is that such a correlation exists (thus, one should go for the fixed effect model). The null hypothesis cannot be rejected here (p = 0.33), hence the RE model is retained:

phtest(fe_model, re_model)

    Hausman Test

data:  inv ~ capital
chisq = 0.93423, df = 1, p-value = 0.3338
alternative hypothesis: one model is inconsistent
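For reference, the Hausman statistic in its usual textbook form compares the two sets of coefficient estimates. The notation below is introduced here for illustration: \(\hat{\beta}_{FE}\) and \(\hat{\beta}_{RE}\) are the fixed and random effects estimates, and \(k\) is the number of coefficients compared.

```latex
\begin{equation}
H = (\hat{\beta}_{FE} - \hat{\beta}_{RE})^{\top}
    \left[\widehat{\operatorname{Var}}(\hat{\beta}_{FE}) - \widehat{\operatorname{Var}}(\hat{\beta}_{RE})\right]^{-1}
    (\hat{\beta}_{FE} - \hat{\beta}_{RE})
    \;\sim\; \chi^2_k \quad \text{under } H_0
\end{equation}
```

With capital as the only coefficient shared by both models, \(k = 1\), which matches the single degree of freedom in the output above.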

plmtest: random effects vs OLS

The Breusch-Pagan Lagrange multiplier test further helps to decide between a random effects model and a simple linear regression. The null hypothesis states that the variance across entities is zero, which means that there is no panel effect. The test rejects the null hypothesis, indicating significant differences across firms:

plmtest(pooled_ols, effect = "individual", type = "bp")

    Lagrange Multiplier Test - (Breusch-Pagan)

data:  inv ~ capital
chisq = 1285.1, df = 1, p-value < 2.2e-16
alternative hypothesis: significant effects

bptest: heteroskedasticity

We should further test for the presence of heteroskedasticity. The results show strong evidence of heteroskedasticity, which suggests using robust standard errors:

lmtest::bptest(inv ~ capital + factor(firm),
               studentize = FALSE, data = Grunfeld)

    Breusch-Pagan test

data:  inv ~ capital + factor(firm)
BP = 386.81, df = 10, p-value < 2.2e-16

pbgtest: serial correlation

For long time series, a test for serial correlation of the residuals should be performed, because serial correlation can lead to underestimated standard errors (too small) and an overestimated \(R^2\) (too large). The null hypothesis states that there is no serial correlation. The results show evidence that the residuals are serially correlated:

pbgtest(fe_model)

    Breusch-Godfrey/Wooldridge test for serial correlation in panel models

data:  inv ~ capital
chisq = 73.785, df = 20, p-value = 4.338e-08
alternative hypothesis: serial correlation in idiosyncratic errors

vcovHC: clustered standard errors

To account for serial correlation, clustered standard errors should be used. Clustering by entity (cluster = "group") gives standard errors that remain valid when the errors are heteroskedastic and correlated within an entity:

lmtest::coeftest(re_model, 
         vcov = vcovHC(re_model,
                       type = "sss",
                       cluster = "group"))

t test of coefficients:

             Estimate Std. Error t value  Pr(>|t|)    
(Intercept) 43.246697  37.815768  1.1436    0.2542    
capital      0.372120   0.065803  5.6551 5.389e-08 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
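The serial correlation test above was run on the fixed effect model, so the same correction can be sketched for fe_model in case that model is preferred. The block below assumes plm is installed and sets method = "arellano" explicitly; the names conventional_se and robust_se are introduced here for illustration.

```r
library(plm)
data("Grunfeld", package = "plm")

# Same within model as in the text
fe_model <- plm(inv ~ capital, data = Grunfeld,
                index = c("firm", "year"),
                effect = "individual", model = "within")

# Conventional standard errors
conventional_se <- sqrt(diag(vcov(fe_model)))

# Cluster-robust (Arellano) standard errors, clustered by firm
robust_vcov <- vcovHC(fe_model, method = "arellano",
                      type = "sss", cluster = "group")
robust_se <- sqrt(diag(robust_vcov))

print(cbind(conventional = conventional_se, clustered = robust_se))
```

Comparing the two columns shows how much the conventional standard error changes once correlation within firms is allowed for.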