5  Repeated Measurements

5.1 Repeated measurements analysis

The repeated measures ANOVA is used for analyzing data where the same subjects are measured more than once on the same outcome variable, under different time points or conditions. This test is also referred to as a within-subjects ANOVA (or ANOVA with repeated measures).

Repeated measures ANOVA is the equivalent of the one-way ANOVA, but for related, not independent groups, and is the extension of the dependent t-test.

In repeated measures ANOVA, the independent variable (also referred to as the within-subject factor) has categories (called levels or groups). We can analyse data using a repeated measures ANOVA for two types of study design:

  • investigating changes in mean scores over three or more time points (e.g., pre-, midway and post-intervention)
  • investigating differences in mean scores under three or more different conditions

5.1.1 Within- and between-subject variables

For one within-subject factor, we can consider the example where time is used to evaluate whether there is any difference in social media reliance across the different time points of data collection.

We can also consider adding age cohort as a between-subject factor in order to test the effect of age cohort on social media reliance, as well as the interaction effect between time and age cohort.

5.1.2 One-way and two-way repeated measurements analysis

An ANOVA with repeated measures is used to compare three or more group means where the participants are the same in each group. This usually occurs when participants are measured multiple times to assess changes in response to an intervention, or when participants are subjected to more than one condition/trial and the responses to these conditions are to be compared.

One-way repeated measures ANOVA is an extension of the paired-samples t-test for comparing the means of three or more levels of a within-subjects variable.

Two-way repeated measures ANOVA is used to evaluate simultaneously the effect of two within-subject factors on a continuous outcome variable.
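As a minimal sketch of how these two designs translate into R model formulas with the base function aov() (this is not the package used in the exercises below; the data frame d and its columns id, time, condition and score are hypothetical):

# one-way repeated measures ANOVA: one within-subject factor (time)
summary(aov(score ~ time + Error(id/time), data = d))
# two-way repeated measures ANOVA: two within-subject factors (time and condition)
summary(aov(score ~ time * condition + Error(id/(time * condition)), data = d))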

5.2 Assumptions

The repeated measures ANOVA makes the following assumptions:

  • No significant outliers
  • Normality of the dependent variable (at each time point)
    • We can use histograms and normality tests
  • Variances of the differences between all pairs of groups (time points) should be equal (sphericity assumption)
    • We can use Mauchly’s test of sphericity (if p > 0.05, sphericity can be assumed).

Note that, if the above assumptions are not met, there is a non-parametric alternative (the Friedman test) to the one-way repeated measures ANOVA. However, there are no non-parametric alternatives to the two-way (and the three-way) repeated measures ANOVA. Thus, in the situation where the assumptions are not met, you could consider running the two-way repeated measures ANOVA on the transformed and non-transformed data to see if there are any meaningful differences.
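As an illustration, a minimal sketch of the Friedman test with the rstatix package, assuming a long-format data frame d with hypothetical columns id, time and score:

# non-parametric alternative to the one-way repeated measures ANOVA
rstatix::friedman_test(d, score ~ time | id)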

5.3 Mauchly’s test of sphericity

Mauchly’s test of sphericity tests the null hypothesis (H0) that the variances of the differences between all pairs of conditions are equal, i.e. that they can be considered homogeneous. A significant result leads to the rejection of H0, since at least two of these variances are unequal. Violations of sphericity lead to a biased F-test and to biased post-hoc test results!

There are two different ways of calculating sphericity violation:

  • Greenhouse & Geisser (1959): rather conservative (the violation is overestimated, especially in the case of minor violations of sphericity)
  • Huynh & Feldt (1976): rather liberal (the violation is thus rather underestimated)

Stephen (2002) therefore recommends taking the arithmetic mean of the two estimates. The F value remains the same, but if sphericity is violated, the interpretation of the F value must be adjusted: the degrees of freedom are corrected downwards, which means that a larger F value is needed to reach significance.
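Concretely, with \(k\) measurement points and \(n\) subjects, the correction multiplies both degrees of freedom of the F-test by the estimated epsilon \(\varepsilon\):

\[df_1=\varepsilon (k-1), \qquad df_2=\varepsilon (n-1)(k-1)\]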

5.4 Logic of repeated measures ANOVA

The logic behind a repeated measures ANOVA is very similar to that of a between-subjects ANOVA. A between-subjects ANOVA partitions total variability into between-groups variability (\(SS_b\)) and within-groups variability (\(SS_w\)). Within-group variability (\(SS_w\)) is defined as the error variability (\(SS_{error}\)). Following division by the appropriate degrees of freedom, a mean sum of squares for between-groups (\(MS_b\)) and within-groups (\(MS_w\)) is determined and an F-statistic is calculated as the ratio of \(MS_b\) to \(MS_w\) (or \(MS_{error}\)).

\[ F=\frac{MS_b}{MS_w}=\frac{MS_b}{MS_{error}}\] A repeated measures ANOVA calculates an F-statistic in a similar way:

\[ F=\frac{MS_{conditions}}{MS_{error}}=\frac{MS_{time}}{MS_{error}}\] A repeated measures ANOVA can further partition the error term, thus reducing its size:

\[SS_{error}=SS_w-SS_{subjects}=SS_T-SS_{conditions}-SS_{subjects}\] This has the effect of increasing the value of the F-statistic due to the reduction of the denominator and leading to an increase in the power of the test to detect significant differences between means. With a repeated measures ANOVA, as we are using the same subjects in each group, we can remove the variability due to the individual differences between subjects, referred to as \(SS_{subjects}\), from the within-groups variability (SSw) by treating each subject as a block. That is, each subject becomes a level of a factor called subjects. The ability to subtract \(SS_{subjects}\) will leave us with a smaller \(SS_{error}\) term.

Source: https://statistics.laerd.com/statistical-guides/repeated-measures-anova-statistical-guide.php

Now that we have removed the between-subjects variability, our new \(SS_{error}\) only reflects individual variability in response to each condition. You might recognize this as the interaction effect of subjects by conditions (how subjects react to the different conditions).

The calculation of \(SS_{time}\) is the same as for \(SS_b\) in an independent ANOVA, and can be expressed as:

\[SS_{time}=SS_b=\sum_{i=1}^{k} n_i(\overline{x}_i-\overline{x})^2\] where \(k\) = number of conditions, \(n_i\) = number of subjects under each (ith) condition, \(\overline{x}_i\) = mean score for each (ith) condition, and \(\overline{x}\) = grand mean.

Within-groups variation (\(SS_w\)) is also calculated in the same way as in an independent ANOVA, expressed as follows:

\[SS_w=\sum_{i}(x_{i1}-\overline{x}_1)^2+\sum_{i}(x_{i2}-\overline{x}_2)^2+\dots+\sum_{i}(x_{ik}-\overline{x}_k)^2\] where \(k\) = number of conditions, and \(x_{i1}\) = score of the ith subject in group 1.

We treat each subject as a level of an independent factor called subjects. We can then calculate \(SS_{subjects}\) as follows:

\[SS_{subjects}=k\sum_{i=1}^{n}(\overline{x}_i-\overline{x})^2\] where \(k\) = number of conditions, \(n\) = number of subjects, \(\overline{x}_i\) = mean score of subject i, and \(\overline{x}\) = grand mean.

To determine the mean sum of squares for time (\(MS_{time}\)) we divide \(SS_{time}\) by its associated degrees of freedom \((k-1)\):

\[MS_{time}=\frac{SS_{time}}{(k-1)}\] We do the same for the mean sum of squares for error (\(MS_{error}\)), this time dividing by \((n-1)(k-1)\) degrees of freedom:

\[MS_{error}=\frac{SS_{error}}{(n-1)(k-1)}\]

5.5 F-statistic

The sums of squares depend on the number of measurement times and cases, so the mean squares (the sums of squares standardized by their degrees of freedom) are used to calculate the F-statistic.

We can calculate the F-statistic as:

\[F=\frac{MS_{time}}{MS_{error}}\] We can then ascertain the critical F-statistic for our F-distribution with our degrees of freedom for condition and error, and determine whether our F-statistic indicates a statistically significant result.
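To make this partitioning concrete, here is a minimal sketch in R that computes the repeated measures F-statistic “by hand” on a small, purely hypothetical dataset (the data frame d, the number of subjects and time points, and all object names are illustrative):

set.seed(1)
d <- data.frame(
  id   = factor(rep(1:5, each = 3)),           # 5 subjects
  time = factor(rep(c("t1", "t2", "t3"), 5)),  # 3 measurement points
  y    = rnorm(15, mean = rep(c(3, 4, 5), 5))  # hypothetical scores
)
grand <- mean(d$y)    # grand mean
k <- nlevels(d$time)  # number of conditions
n <- nlevels(d$id)    # number of subjects
SS_time     <- n * sum((tapply(d$y, d$time, mean) - grand)^2)
SS_w        <- sum((d$y - ave(d$y, d$time))^2)
SS_subjects <- k * sum((tapply(d$y, d$id, mean) - grand)^2)
SS_error    <- SS_w - SS_subjects
F_stat <- (SS_time / (k - 1)) / (SS_error / ((n - 1) * (k - 1)))
F_stat
# the same F value for "time" should appear in:
# summary(aov(y ~ time + Error(id/time), data = d))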

The figure below shows that the calculation of the F-statistic for repeated measures ANOVA offers very different results compared to those that would be obtained using the calculation for an independent ANOVA. You can click on the figure to download the Excel file.

Figure: F-statistic calculation

5.6 Two-way repeated measures ANOVA

A two-way repeated measures ANOVA compares the mean differences between groups that have been split on two within-subjects factors (also known as independent variables). A two-way repeated measures ANOVA is often used in studies where you have measured a dependent variable over two or more time points, or when subjects have undergone two or more conditions (e.g., “time” and “conditions”).

Remember that the two-way (repeated measures) ANOVA is an omnibus test statistic and cannot tell you which specific groups within each factor were significantly different from each other. It only tells you that at least two of the groups were different. To determine which groups differ from each other, you can use post hoc tests.

5.7 Effect size

The partial eta-squared is specific to the factor \(i\); if there are several factors, you cannot add the individual partial eta-squared values to form a total value, because the denominator does not contain the total sum of squares (total variance)!

Partial eta-squared is the effect size in which \(SS_{subjects}\) has been removed from the denominator:

\[\eta_{partial}^2=\frac{SS_{conditions}}{SS_{conditions}+SS_{error}}\]
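A minimal sketch of how partial eta-squared could be obtained in R with the rstatix package (the data frame d and its columns id, time and y are hypothetical; the argument effect.size = "pes" requests partial eta-squared instead of the default generalized eta squared):

# partial eta-squared for a one-way repeated measures design
rstatix::anova_test(data = d, dv = y, wid = id, within = time, effect.size = "pes")
# equivalently, from the sums of squares: SS_time / (SS_time + SS_error)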

Eta-squared overestimates the effect (i.e., the share of explained variance is estimated as too large), because in samples group differences are very likely to appear even if there are no differences in the total population (especially with small n)!

Therefore, many authors recommend not using the measured sums of squares (SS) as the basis for the calculation of effect sizes, but rather their estimates in the total population (which removes the bias). This parameter is called omega-squared.
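As an illustration of the idea, for the simpler one-way between-subjects case omega-squared is commonly estimated as follows (the repeated measures version uses an analogous correction):

\[\omega^2=\frac{SS_b-(k-1)MS_{error}}{SS_T+MS_{error}}\]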

5.8 Pairwise comparisons

There are several methods to conduct pairwise comparisons:

  • Contrasts: particularly useful for repeated measurement designs; each measurement time point is tested against the previous one: t2 vs. t1, t3 vs. t2, etc. (see the sketch below)
  • Post-hoc tests (the Bonferroni correction is recommended for violations of sphericity)
  • Simple effects analyses are also possible
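Here is a hedged sketch of such consecutive contrasts using the emmeans package on a base-R repeated measures model (the data frame d and its columns id, time and score are hypothetical):

# fit a repeated measures model and test each time point against the previous one
fit <- aov(score ~ time + Error(id/time), data = d)
emm <- emmeans::emmeans(fit, ~ time)
emmeans::contrast(emm, method = "consec")  # t2 vs. t1, t3 vs. t2, ...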

5.9 Recap on the models’ assumptions

Below is a recap of the assumptions of the different models that we have seen so far (linear regression, ANOVA, repeated measures ANOVA). The assumptions are defined, methods and tests are highlighted, and solutions are proposed.

Figure: Models’ assumptions

5.10 In a nutshell

In the ANOVA with repeated measures, a distinction is made between the between-participant and the within-participant variance.

The within-participant variance can be further subdivided into model explained variance and unexplained (error) variance. The F-value is calculated from their ratio.

A prerequisite for the ANOVA with repeated measurements is sphericity (i.e. homogeneous variances of the differences between the measurement times). Mauchly’s test checks this requirement; it should not be significant.

If there is no sphericity, the degrees of freedom for the critical F-value must be corrected: for small violations (epsilon > .75), use the Huynh & Feldt correction; for larger violations (epsilon < .75), use the Greenhouse & Geisser correction.

5.11 How does it work in R?

Important note: repeated measures ANOVA in R requires data in long format. For the long format, we need to stack the data from each individual into a single vector (the data for each time point in a single column). If the dataset is in wide format (the data for each time point in different columns), the function melt() from the R package reshape2 can be used. See the lecture slides on repeated measures ANOVA:

You can also download the PDF of the slides here: Click here
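As a complement, here is a minimal sketch of the wide-to-long conversion with reshape2::melt(); the data frame wide and its column names are hypothetical:

library(reshape2)
# wide: one row per respondent, one column per time point
long <- melt(wide,
             id.vars       = "id",
             measure.vars  = c("t1_score", "t2_score", "t3_score"),
             variable.name = "time",
             value.name    = "score")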

5.12 Quiz

For each of the following statements, indicate whether it is true or false:

  • When Mauchly’s test for equality of variances fails to show significance, you have evidence that the data are suitable for the application of the one-way repeated measures ANOVA.
  • When conducting a one-way repeated measures ANOVA test, a significance level of 0.506 indicates that the means are equal.
  • Conducting pairwise comparisons helps to assess which groups are statistically different.
  • In a two-way repeated measures ANOVA the distribution of the dependent variable in each combination of the related groups should be approximately normally distributed.

5.13 Example from the literature

The following article relies on repeated measurements ANOVA as a method of analysis:

Hameleers, M., Brosius, A., & de Vreese, C. H. (2021). Where’s the fake news at? European news consumers’ perceptions of misinformation across information sources and topics. Harvard Kennedy School Misinformation Review. Available here.

Please reflect on the following questions:

  • What is the research question of the study?
  • What are the research hypotheses?
  • Is repeated measurements ANOVA an appropriate method of analysis to answer the research question?
  • What are the main findings of the repeated measurements ANOVA analysis?

5.14 Time to practice on your own

You can download the PDF of the exercises here: Click here

5.14.1 Exercise 1: strengthening environmental protection over time

Use the data from the Selects 2019 Panel Survey and assess whether respondents’ stance towards strengthening environmental protection has increased over the first three waves (before, during and after the campaign).

Start by downloading the data and by selecting the variables.

Show the code
library(foreign)
db <- read.spss(file=paste0(getwd(),
                "/data/1184_Selects2019_Panel_Data_v4.0.sav"),
                use.value.labels = F, 
                to.data.frame = T)
sel <- db |>
  dplyr::select(id,
    # wave 1
    W1_f15340d,
    # wave 2
    W2_f15340d,
    # wave 3
    W3_f15340d) |>
  stats::na.omit()
# reverse the scale (x -> 6 - x)
sel$W1_f15340d=(sel$W1_f15340d-6)*(-1)
sel$W2_f15340d=(sel$W2_f15340d-6)*(-1)
sel$W3_f15340d=(sel$W3_f15340d-6)*(-1)

Next, reshape the data so that they are in long format.

Show the code
long <- reshape(as.data.frame(sel),
                direction="long",
                varying = c("W1_f15340d","W2_f15340d","W3_f15340d"),
                v.names = "pro_env",
                times =c("wave1","wave2","wave3"))

Then, we can check the normality of the dependent variable using the Shapiro-Wilk normality test:

Show the code
# Shapiro-Wilk test
long |>
  dplyr::group_by(time) |>
  rstatix::shapiro_test(pro_env)
## # A tibble: 3 × 4
##   time  variable statistic        p
##   <chr> <chr>        <dbl>    <dbl>
## 1 wave1 pro_env      0.773 8.31e-45
## 2 wave2 pro_env      0.743 8.74e-47
## 3 wave3 pro_env      0.786 6.03e-44

If the data is normally distributed, the p-value should be greater than 0.05, which is not the case here. Note that we need to test whether the data are normally distributed at each time point. You can also visualize the distribution over time using boxplots.

Nota bene: If your sample size is greater than 50, the normal QQ plot is preferred because at larger sample sizes the Shapiro-Wilk test becomes very sensitive even to a minor deviation from normality.

Show the code
# QQ plot
ggpubr::ggqqplot(long, "pro_env", facet.by = "time")

The assumption of sphericity will be automatically checked during the computation of the ANOVA test using the R function anova_test(). By using the function get_anova_table() to extract the ANOVA table, the Greenhouse-Geisser sphericity correction is automatically applied to factors violating the sphericity assumption. Now, we can check whether there are group differences:

Show the code
# group differences
res.aov <- rstatix::anova_test(data = long, 
                      dv = pro_env, 
                      wid = id, 
                      within = time)
res.aov
## ANOVA Table (type III tests)
## 
## $ANOVA
##   Effect DFn  DFd      F        p p<.05   ges
## 1   time   2 3682 37.718 6.09e-17     * 0.004
## 
## $`Mauchly's Test for Sphericity`
##   Effect     W     p p<.05
## 1   time 0.993 0.001     *
## 
## $`Sphericity Corrections`
##   Effect   GGe        DF[GG]    p[GG] p[GG]<.05   HFe        DF[HF]    p[HF] p[HF]<.05
## 1   time 0.993 1.99, 3655.01 7.76e-17         * 0.994 1.99, 3658.94 7.49e-17         *
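As mentioned above, the corrected ANOVA table can also be extracted directly from this object (a one-line sketch):

rstatix::get_anova_table(res.aov)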

The sphericity assumption is violated (W = 0.993 and p < 0.05). Therefore, we need to look at the Greenhouse-Geisser (GG) and Huynh-Feldt (HF) corrections. The corrected p-values (p[GG] and p[HF]) are significant, indicating that the observed F value remains significant after correction and that the means differ across time points.

The pro-environment score was statistically significantly different at the different time points: F = 37.72, p < 0.05. Furthermore, the value for “ges” (generalized eta squared) gives us the proportion of variability due to the within-subjects factor.

Finally, we can assess which group (or time) differences are statistically significant:

Show the code
# Post-hoc test to assess differences
pwc <- long |>
  rstatix::pairwise_t_test(
    pro_env ~ time, 
    paired = TRUE,
    p.adjust.method = "bonferroni"
    )
pwc[,c(2,3,6,8,10)]
## # A tibble: 3 × 5
##   group1 group2 statistic        p p.adj.signif
##   <chr>  <chr>      <dbl>    <dbl> <chr>       
## 1 wave1  wave2      -3.15 2   e- 3 **          
## 2 wave1  wave3       5.26 1.58e- 7 ****        
## 3 wave2  wave3       8.80 3.19e-18 ****

5.14.2 Exercise 2: two-way repeated measures ANOVA

Let’s create a dataset containing a score measured at three points in time. In a second step, we will investigate whether (frequently) working in a group induces a significant increase in the score over time.

Show the code
data <- data.frame(matrix(nrow = 200, ncol = 0)) 
set.seed(123)
data$score1 <- runif(nrow(data), min=2, max=4.5)
data$score2 <- runif(nrow(data), min=1.5, max=6)
data$score3 <- runif(nrow(data), min=3, max=5.5)
# assign id
data$id = rep(1:100, 2)
# assign group work variable
data$groupwork = c(rep(c("always"),100), rep(c("no"),100))
# copy of the data 
copy = data
# re-arrange the data
data <- data |>
  tidyr::gather(key = "time", value = "score", score1, score2, score3) |>
  rstatix::convert_as_factor(id, time)
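Before testing, we can summarise the score by condition (a short sketch using rstatix::get_summary_stats()):

data |>
  dplyr::group_by(groupwork, time) |>
  rstatix::get_summary_stats(score, type = "mean_sd")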

Now, we will test whether there is a significant interaction between group work and time on the score. We can first inspect boxplots of the score colored by group work:

Show the code
ggpubr::ggboxplot(
  data, x = "time", 
  y = "score",
  color = "groupwork"
  )

We can check whether there are outliers:

Show the code
data |>
  dplyr::group_by(groupwork, time) |>
  rstatix::identify_outliers(score)
## [1] groupwork  time       id         score      is.outlier is.extreme
## <0 rows> (or 0-length row.names)

We next compute the Shapiro-Wilk test to assess the normality assumption for each combination of factor levels:

Show the code
# Shapiro
data |>
  dplyr::group_by(groupwork, time) |>
  rstatix::shapiro_test(score)
## # A tibble: 6 × 5
##   groupwork time   variable statistic        p
##   <chr>     <fct>  <chr>        <dbl>    <dbl>
## 1 always    score1 score        0.952 0.00119 
## 2 always    score2 score        0.945 0.000418
## 3 always    score3 score        0.948 0.000592
## 4 no        score1 score        0.964 0.00736 
## 5 no        score2 score        0.950 0.000789
## 6 no        score3 score        0.945 0.000379

The output shows that the score is generally not normally distributed (p < 0.05).

Note that for large sample size, we can also use the QQ plot:

Show the code
# QQ plot
ggpubr::ggqqplot(data, "score") +
  ggplot2::facet_grid(time ~ groupwork, labeller = "label_both")

We can assess whether there is a statistically significant two-way interaction between group work and time:

Show the code
# We also need to convert id and time into factor variables 
# data$groupwork <- as.factor(data$groupwork)
data$time <- as.factor(data$time)
data$id <- as.factor(data$id)
res.aov <- rstatix::anova_test(
  data = data, 
  dv = score, 
  wid = id,
  within = c(groupwork, time)
  )
# rstatix::get_anova_table(res.aov)
res.aov
## ANOVA Table (type III tests)
## 
## $ANOVA
##           Effect DFn DFd      F        p p<.05      ges
## 1      groupwork   1  99  0.454 5.02e-01       0.000696
## 2           time   2 198 46.979 2.01e-17     * 0.153000
## 3 groupwork:time   2 198  0.050 9.51e-01       0.000163
## 
## $`Mauchly's Test for Sphericity`
##           Effect     W         p p<.05
## 1           time 0.791 0.0000101     *
## 2 groupwork:time 0.809 0.0000317     *
## 
## $`Sphericity Corrections`
##           Effect   GGe       DF[GG]    p[GG] p[GG]<.05   HFe       DF[HF]    p[HF]
## 1           time 0.827 1.65, 163.74 7.74e-15         * 0.839 1.68, 166.17 5.07e-15
## 2 groupwork:time 0.840 1.68, 166.31 9.28e-01           0.853 1.71, 168.85 9.30e-01
##   p[HF]<.05
## 1         *
## 2

There is no statistically significant two-way interaction between group work and time, F = 0.05, p > 0.05. Furthermore, the sphericity assumption is violated, thus suggesting to look at the GG and HF corrections.

Procedure for post-hoc test:

A significant two-way interaction indicates that the impact that one factor (e.g., group work) has on the outcome variable (e.g., score) depends on the level of the other factor (e.g., time), and vice versa. So, you can decompose a significant two-way interaction into:

    1. simple main effect (one-way model of the first variable at each level of the second variable: e.g., group work at each time point; see the sketch after this list);
    2. simple pairwise comparisons if the simple main effect is significant (pairwise comparisons to determine which groups are different: e.g., pairwise comparisons between categories of group work).
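Step 1 could look like the following sketch (only relevant if the interaction had been significant; the grouped anova_test() pattern is from the rstatix package):

# simple main effect of group work at each time point
data |>
  dplyr::group_by(time) |>
  rstatix::anova_test(dv = score, wid = id, within = groupwork) |>
  rstatix::get_anova_table()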

For a non-significant two-way interaction, you need to determine whether you have any statistically significant main effects from the ANOVA output (e.g., comparisons for group work and time variable).

Show the code
# Comparisons for group work
res1 = data |>
  rstatix::pairwise_t_test(
    score ~ groupwork, 
    paired = TRUE,
    p.adjust.method = "bonferroni"
)
res1[,c(2,3,6,8,10)]
## # A tibble: 1 × 5
##   group1 group2 statistic     p p.adj.signif
##   <chr>  <chr>      <dbl> <dbl> <chr>       
## 1 always no        -0.661 0.509 ns
# Comparisons for the time variable
res2 = data |>
  rstatix::pairwise_t_test(
    score ~ time, 
    paired = TRUE,
    p.adjust.method = "bonferroni"
)
res2[,c(2,3,6,8,10)]
## # A tibble: 3 × 5
##   group1 group2 statistic        p p.adj.signif
##   <chr>  <chr>      <dbl>    <dbl> <chr>       
## 1 score1 score2     -4.05 7.31e- 5 ***         
## 2 score1 score3    -13.5  5.28e-30 ****        
## 3 score2 score3     -5.04 1.02e- 6 ****