Recap: bivariate statistics

How to run bivariate analyses in R

1 Univariate statistics

1.1 Univariate recap

Univariate statistics

1.2 Reliability and validity

The Total Survey Error (TSE) framework (Biemer, 2010) accounts for the different sources of errors that occur at each stage of an investigation (e.g. coverage errors, selection errors, and measurement errors).

Two important concepts are: reliability and validity.

Reliability refers to the idea of replicability. It accounts for the degree of consistency of a measurement (by different observers, at different times).

Validity addresses the conclusions we can draw from a measure. For instance, internal validity concerns the fit between the concept and its measure.

1.3 Sample statistics

Sample statistics are estimators of population parameters. A particular point estimate cannot be expected to be exactly equal to the value of the parameter in the population. Therefore, the estimate always contains a margin of error.

Normal distribution (or Gaussian curve): the values of a variable cluster around an average value, with frequencies decreasing symmetrically and homogeneously on either side of this average.

1.4 Confidence intervals

The margin of error defines the confidence interval: the range in which, with a given probability, the population mean or proportion can be expected to lie.

For a mean: \(\overline{x}\pm1.96(\frac{\sigma(X)}{\sqrt{n}})\), where 1.96 is the z-value for a 95% confidence level

For a proportion: \(p\pm Z_{\alpha/2}\sqrt{\frac{p(1-p)}{n}}\)
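To make the formulas concrete, here is a minimal sketch in base R; the sample values, proportion, and sample size are made up for illustration, and the sample standard deviation is used as an estimate of \(\sigma\):

# 95% CI for a mean: x_bar +/- 1.96 * (s / sqrt(n))
x <- c(4.2, 5.1, 6.3, 5.8, 4.9, 5.5, 6.0, 5.2)  # hypothetical sample
x_bar <- mean(x)
me <- 1.96 * sd(x) / sqrt(length(x))
c(lower = x_bar - me, upper = x_bar + me)

# 95% CI for a proportion: p +/- 1.96 * sqrt(p * (1 - p) / n)
p <- 0.45  # hypothetical sample proportion
n <- 200   # hypothetical sample size
me <- 1.96 * sqrt(p * (1 - p) / n)
c(lower = p - me, upper = p + me)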

Confidence intervals for proportions

2 Bivariate statistics

2.1 Bivariate analyses and tests (recap)

Types of bivariate analyses

Types of bivariate tests

2.2 Cross-tables: relationship between two categorical variables

  • H0 (null hypothesis) states that there is no relationship between the two variables.
  • The Chi-2 distribution table gives the critical value used to assess this relationship, based on:
    • Degrees of freedom: df = (rows - 1) * (cols - 1)
    • p-value: typical threshold < 0.05 (the Chi-2 statistic must be greater than the critical value to reject H0)

2.3 Cross-tables: calculating Chi-2

For each cell of the table, we have to calculate the expected value under the null hypothesis. For a given cell, the expected value is calculated as follows:

\[ e = \frac{row.sum \times col.sum}{grand.total} \]

The Chi-square statistic is calculated as follows:

\[ \chi^2 = \sum\frac{(o-e)^2}{e} \]
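To make the calculation concrete, here is a sketch in base R that applies both formulas to the same 2x2 table used in the R example below:

# Observed counts
o <- matrix(c(7, 9, 12, 8), nrow = 2)
# Expected counts under H0: e = row.sum * col.sum / grand.total for each cell
e <- outer(rowSums(o), colSums(o)) / sum(o)
# Chi-square statistic: sum over all cells of (o - e)^2 / e
chi2 <- sum((o - e)^2 / e)
chi2
# Critical value at the 5% level for df = (2 - 1) * (2 - 1) = 1
qchisq(0.95, df = 1)

Note that chisq.test() applies Yates' continuity correction to 2x2 tables by default, so its statistic differs slightly from this uncorrected one.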

2.4 Cross-tables: Cramer’s V

  • Cramer’s V: assesses the strength of the relation: \[ V = \sqrt{\frac{\chi^2/n}{\min(r-1,\ c-1)}} \]
  • \(\chi^2\): Chi-square statistic
  • n: total sample size
  • r: number of rows
  • c: number of columns
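Continuing the sketch above, V can be computed directly from the formula:

# Cramer's V from the chi-square statistic computed in the previous sketch
n <- sum(o)                         # total sample size
k <- min(nrow(o) - 1, ncol(o) - 1)  # min(r - 1, c - 1)
sqrt((chi2 / n) / k)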

2.5 Chi-2 manual example

Chi-2 test and Cramer’s V

2.6 Chi-2 example in R

# Build a 2x2 contingency table (values fill the matrix column-wise)
data <- matrix(c(7, 9, 12, 8), nrow = 2)
data
     [,1] [,2]
[1,]    7   12
[2,]    9    8
# Chi-squared test of independence (Yates' correction applied by default for 2x2 tables)
chisq.test(data)

    Pearson's Chi-squared test with Yates' continuity correction

data:  data
X-squared = 0.40263, df = 1, p-value = 0.5257
# Effect size (requires the rcompanion package)
rcompanion::cramerV(data)
Cramer V 
  0.1617 

2.7 Comparing the means of two groups

General logic: Compare the distribution of values for a quantitative variable across different groups (different categories of the categorical independent variable).

  • First step: Examine the relationship between the two variables to see if they are related or independent (a short R sketch follows this list).
    • Variation between groups: Are the different groups distinct? (central tendency: means)
    • Variation within groups: Are the different groups homogeneous? (dispersion: standard deviation)
  • Second step: Determine the statistical significance of the relationship to see if it can be generalized to the population.
    • H0: The two variables are independent.
    • H1: The two variables are dependent.
    • If p < 0.05: Reject H0, indicating a statistically significant relationship.
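A minimal sketch of the first step in R, using the built-in iris data (Sepal.Length as the quantitative variable, Species as the grouping variable):

# Variation between groups: compare group means
aggregate(Sepal.Length ~ Species, data = iris, FUN = mean)
# Variation within groups: compare group standard deviations
aggregate(Sepal.Length ~ Species, data = iris, FUN = sd)
# The second step is the significance test itself (t-test or ANOVA, see below)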

2.8 t-test

  • Independent Samples: Used when comparing means from two different groups.
  • Paired Samples: Used when comparing means from the same group at different times or under different conditions.

Assumptions (same for ANOVA):

  • Normality: Data in each group should be approximately normally distributed.
  • Homogeneity of Variance: Variances in the two groups should be approximately equal.
  • Independence: Observations within and between groups should be independent.

\[ t = \frac{\bar{x}_1-\bar{x}_2}{\sqrt{\frac{s_1^2}{n_1}+\frac{s_2^2}{n_2}}} \] where \(\bar{x}_1\) and \(\bar{x}_2\) are the sample means, \(s_1^2\) and \(s_2^2\) the sample variances, and \(n_1\) and \(n_2\) the sample sizes.
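A minimal sketch in R with simulated data (the group means and sizes are arbitrary); note that t.test() computes Welch's t by default, which matches the unpooled-variance formula above:

set.seed(42)
g1 <- rnorm(30, mean = 5.0, sd = 1)  # simulated scores, group 1
g2 <- rnorm(30, mean = 5.6, sd = 1)  # simulated scores, group 2
# Independent samples (Welch's t-test; add var.equal = TRUE to assume equal variances)
t.test(g1, g2)
# Paired samples: t.test(x_before, x_after, paired = TRUE)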

2.9 ANOVA: Comparing the means of more than two groups

  • Unlike t-tests, ANOVA can handle multiple groups simultaneously.
  • It reduces the risk of Type I errors (false positives) that can occur when performing multiple t-tests.

Key Concepts:

  • Variance within samples (\(S^2_{within}\)) reflects individual differences and error variance.
  • Variance between sample means (\(S^2_{between}\)) measures the variation among the group means.
  • F-statistic: the ratio \(S^2_{between}/S^2_{within}\). Higher F-values indicate greater differences between group means relative to the variability within groups.
  • Eta squared: indicates the strength of the relation and is calculated as \(\eta^2 = SS_{between}/SS_{total}\).
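A minimal sketch in R, again with the built-in iris data:

# One-way ANOVA: does mean Sepal.Length differ across the three species?
fit <- aov(Sepal.Length ~ Species, data = iris)
summary(fit)  # F-statistic and p-value
# Eta squared = SS_between / SS_total, taken from the ANOVA table
ss <- summary(fit)[[1]][["Sum Sq"]]
ss[1] / sum(ss)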

2.10 ANOVA: FAQ

  • Question: Should I use the t-distribution table or the z-table to find the critical value?
  • Answer:
    • If the standard deviation of the population is unknown, use the t-table.
    • If it is known, or the sample size is greater than 30, the z-table can be used.
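The critical values themselves can be obtained in R instead of a table:

# Two-sided critical values at the 5% level
qnorm(0.975)          # z-table: 1.96
qt(0.975, df = 29)    # t-table, e.g. df = n - 1 = 29: about 2.05
qt(0.975, df = 1000)  # with large n, t converges to z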

2.11 Correlation

  • Pearson correlation (r): measures the linear dependence between two numeric variables (x and y) and should only be used when x and y come from a normal distribution
  • Kendall’s tau and Spearman’s rho: rank-based correlation coefficients (non-parametric)

\[ r = \frac{\sum(x-m_x)(y-m_y)}{\sqrt{\sum(x-m_x)^2\sum(y-m_y)^2}} \] where \(m_x\) and \(m_y\) are the means of x and y.
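All three coefficients are available through cor(); a sketch using the built-in mtcars data:

cor(mtcars$mpg, mtcars$wt, method = "pearson")
cor(mtcars$mpg, mtcars$wt, method = "kendall")
cor(mtcars$mpg, mtcars$wt, method = "spearman")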

2.12 Correlation test

The p-value can be determined by:

  • using the correlation coefficient table for the degrees of freedom (df = n−2)
  • calculating t (and determining the corresponding p-value using the t-table):

\[ t = \frac{r}{\sqrt{1-r^2}}\sqrt{n-2} \] If the p-value is < 5%, then the correlation between x and y is statistically significant.
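A sketch in R with the built-in mtcars data; cor.test() reports r, t, df = n - 2, and the p-value in one call:

cor.test(mtcars$mpg, mtcars$wt)
# Manual check of the t statistic from the formula above
r <- cor(mtcars$mpg, mtcars$wt)
n <- nrow(mtcars)
r / sqrt(1 - r^2) * sqrt(n - 2)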

3 Quiz

3.1 Can I answer these questions?

True or false?
