ANOVA: The analysis of variance

Motivation

During the first four weeks of the course, I introduced almost everything related to the sample mean of IID normal random variables \(N\left(\mu,\sigma^2\right)\). In particular, I showed how to use the techniques you already know from probability theory to evaluate claims about the population mean \(\mu\).

During the next four weeks of the course, we moved beyond the normal case, studying parameter estimation and hypothesis testing for other parametric models. Now, we are moving on to more complicated data analysis techniques that involve comparing population means, while returning to a baseline assumption of normality.

Multi-sample problems

Let \(Y_{ij}\) denote the \(i\)th observation in the \(j\)th sample, where \(j=1,\ldots,k\) and \(i=1,\ldots,n_j\). You could put this in tabular form as in LM Table 12.1. Now consider the following special cases:

  1. One-sample problem where \(k=1\), with sample size \(n=n_1\).
  2. Two-sample problem where \(k=2\), with sample sizes \(n_1\) and \(n_2\) such that \(n=n_1+n_2\).
  3. Multi-sample problem with sample sizes \(n_1, n_2, \ldots, n_k\) such that \(n=n_1+n_2+\cdots+n_k\).
  4. The case of equal sample sizes where \(n_1=n_2=\cdots=n_k\).

The total sample size is \[n=\sum_{j=1}^k n_j.\]

Algebraic aspects

Most of the statistical theory related to multi-sample problems involves decompositions of sums of squares. The decomposition is algebraic in nature and has nothing to do with statistics, but statistical theory gives meaning to the components of the decomposition. Below, you will be introduced to the notation and will work through the steps of the algebra.

  1. Define the \(k\) batch or treatment means as \[\overline{Y}_{\bullet j}=\frac{1}{n_j}\sum_{i=1}^{n_j} Y_{ij},\ \ j=1,\ldots,k\]
  2. Define the grand or overall mean as \[\overline{Y}_{\bullet\bullet}=\frac{1}{n}\sum_{j=1}^k \sum_{i=1}^{n_j}Y_{ij}=\frac{1}{n}\sum_{j=1}^k n_j \overline{Y}_{\bullet j}\]
  3. Define the total sum of squares or total variation, denoted by SSTOT in LM, to be \[\mathrm{SSTOT}=\sum_{j=1}^k\sum_{i=1}^{n_j} \left(Y_{ij}-\overline{Y}_{\bullet\bullet}\right)^2\]
  4. Define the treatment sum of squares or model sum of squares or batch sum of squares or between-samples variation, denoted by SSTR in LM, to be \[\mathrm{SSTR}=\sum_{j=1}^k \sum_{i=1}^{n_j} \left(\overline{Y}_{\bullet j}-\overline{Y}_{\bullet\bullet}\right)^2=\sum_{j=1}^k n_j\overline{Y}_{\bullet j}^2-n\overline{Y}_{\bullet\bullet}^2\]
  5. Define the error sum of squares or residual sum of squares or within-samples variation, denoted by SSE in LM, to be \[\mathrm{SSE}=\sum_{j=1}^k\sum_{i=1}^{n_j} \left(Y_{ij}-\overline{Y}_{\bullet j}\right)^2\]

Observe that the within sum of squares could be written as \[\mathrm{SSE}=\sum_{j=1}^k\sum_{i=1}^{n_j} \left(Y_{ij}-\overline{Y}_{\bullet j}\right)^2=\sum_{j=1}^k \left(n_j-1\right)S_j^2\] where \(S_j^2\) is the within-sample variance: \[S_j^2=\frac{1}{n_j-1} \sum_{i=1}^{n_j} \left(Y_{ij}-\overline{Y}_{\bullet j}\right)^2\]

Decomposition of sums of squares is key to understanding the statistical procedures. In fact, you have been seeing decompositions already. One example is the following algebraic relationship you have already seen before: \[\sum_{i=1}^{n}Y_{i}^2=\sum_{i=1}^{n} \left(Y_{i}-\overline{Y}\right)^2+n\overline{Y}^2,\] which may be considered as a special case of \[\sum_{j=1}^k \sum_{i=1}^{n_j}Y_{ij}^2 =\sum_{j=1}^k\sum_{i=1}^{n_j} \left(Y_{ij}-\overline{Y}_{\bullet\bullet}\right)^2+n\overline{Y}_{\bullet\bullet}^2\]

Another decomposition of interest is \[\sum_{j=1}^k\sum_{i=1}^{n_j} \left(Y_{ij}-\overline{Y}_{\bullet\bullet}\right)^2=\sum_{j=1}^k\sum_{i=1}^{n_j} \left(\overline{Y}_{\bullet j}-\overline{Y}_{\bullet\bullet}\right)^2+\sum_{j=1}^k\sum_{i=1}^{n_j} \left(Y_{ij}-\overline{Y}_{\bullet j}\right)^2\]

A similar decomposition is related to the previous one: \[\sum_{j=1}^k \sum_{i=1}^{n_j}Y_{ij}^2 = \left[\sum_{j=1}^k n_j\overline{Y}_{\bullet j}^2-n\overline{Y}_{\bullet\bullet}^2 \right]+ \sum_{j=1}^k\sum_{i=1}^{n_j} \left(Y_{ij}-\overline{Y}_{\bullet j}\right)^2 +n\overline{Y}_{\bullet\bullet}^2\]
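
To make the algebra concrete, here is a minimal R sketch on a small, made-up data frame (called toy here). It computes the batch means, the grand mean, SSTOT, SSTR, and SSE directly from the definitions above, and then verifies that SSTOT = SSTR + SSE, that SSE equals the weighted sum of the within-sample variances, and that the sum of the squared observations equals SSTR + SSE plus \(n\overline{Y}_{\bullet\bullet}^2\).

# a minimal sketch on made-up data: three batches of unequal size
toy <- data.frame(y = c(5, 7, 6, 9, 11, 10, 12, 4, 5),
                  group = factor(c(1, 1, 1, 2, 2, 2, 2, 3, 3)))
grand.mean <- mean(toy$y)                                  # grand mean
batch.means <- tapply(toy$y, toy$group, mean)              # batch means
batch.sizes <- tapply(toy$y, toy$group, length)            # n_j
sstot <- sum((toy$y - grand.mean)^2)                       # total variation
sstr <- sum(batch.sizes * (batch.means - grand.mean)^2)    # between-samples variation
sse <- sum((toy$y - batch.means[toy$group])^2)             # within-samples variation
# decomposition: SSTOT = SSTR + SSE
all.equal(sstot, sstr + sse)
# SSE as a weighted sum of the within-sample variances
all.equal(sse, sum((batch.sizes - 1) * tapply(toy$y, toy$group, var)))
# related identity: sum of squared observations = SSTR + SSE + n * (grand mean)^2
all.equal(sum(toy$y^2), sstr + sse + length(toy$y) * grand.mean^2)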

Statistical aspects

For the moment, assume that \(\left(Y_{1j}, Y_{2j}, \ldots, Y_{n_j,j}\right)\) is a random sample from some distribution. Assume also that these \(k\) random samples are mutually independent. Assume further that for all \(j=1,\ldots,k\) and \(i=1,\ldots,n_j\), we have \(\mathbb{E}\left(Y_{ij}\right)=\mu_j\) and \(\mathsf{Var}\left(Y_{ij}\right)=\sigma^2\). Let \(\mu\) denote the overall mean which, in this context, is the weighted average of the group means: \[\mu=\frac{1}{n}\sum_{j=1}^k n_j\mu_j\] We have the following results (a simulation check of the SSTR and SSE expectations appears after the list):

  1. The batch or treatment means have the following first two moments: \[\mathbb{E}\left(\overline{Y}_{\bullet j}\right)=\mu_j,\ \ \mathsf{Var}\left(\overline{Y}_{\bullet j}\right)=\frac{\sigma^2}{n_j}\]
  2. The grand mean has the following first two moments: \[\mathbb{E}\left(\overline{Y}_{\bullet\bullet}\right)=\frac{1}{n}\sum_{j=1}^k n_j\mu_j=\mu, \ \ \mathsf{Var}\left(\overline{Y}_{\bullet\bullet}\right)=\frac{\sigma^2}{n}\]
  3. LM Theorem 12.2.1 calculates the first moment of SSTR: \[\mathbb{E}\left(\mathrm{SSTR}\right)=\left(k-1\right)\sigma^2+\sum_{j=1}^k n_j\left(\mu_j-\mu\right)^2\]
  4. SSE has the following first moment: \[\mathbb{E}\left(\mathrm{SSE}\right)=\left(n-k\right)\sigma^2\]
  5. The pairs \(\left(\overline{Y}_{\bullet j}, S^2_j\right)\) formed by the \(j\)th batch mean and the \(j\)th sample variance are mutually independent across \(j\).
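
Here is a minimal simulation sketch, with made-up values of \(n_j\), \(\mu_j\), and \(\sigma\), that checks items 3 and 4 by comparing the Monte Carlo averages of SSTR and SSE against \(\left(k-1\right)\sigma^2+\sum_{j=1}^k n_j\left(\mu_j-\mu\right)^2\) and \(\left(n-k\right)\sigma^2\). The moment results do not require normality; normal draws are used here only as a convenient choice.

# simulation check of E(SSTR) and E(SSE) using made-up group sizes, means, and sigma
set.seed(123)
nj <- c(4, 6, 5)                         # group sizes n_j
muj <- c(10, 12, 15)                     # group means mu_j
sigma <- 2
n <- sum(nj); k <- length(nj)
mu <- sum(nj * muj) / n                  # weighted overall mean
one.draw <- function() {
  y <- rnorm(n, mean = rep(muj, nj), sd = sigma)
  g <- factor(rep(seq_len(k), nj))
  batch.means <- tapply(y, g, mean)
  sstr <- sum(nj * (batch.means - mean(y))^2)
  sse <- sum((y - batch.means[g])^2)
  c(sstr = sstr, sse = sse)
}
sims <- replicate(10000, one.draw())
rowMeans(sims)                                       # simulated E(SSTR) and E(SSE)
c((k - 1) * sigma^2 + sum(nj * (muj - mu)^2),        # theoretical E(SSTR)
  (n - k) * sigma^2)                                 # theoretical E(SSE)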

Parts of LM Chapters 7, 9, and 12 impose a normality assumption. Because of this distributional assumption, we have the following results:

  1. The \(j\)th batch mean and the \(j\)th sample variance are independent of each other. Therefore, \(\overline{Y}_{\bullet 1}, \ldots, \overline{Y}_{\bullet k}\) and SSE are mutually independent of each other.
  2. SSE and SSTR are independent of each other. The proof may be found in LM Theorem 12.2.3b.
  3. We know the full distribution of each of the batch means. For \(j=1,\ldots,k\), we have \[\overline{Y}_{\bullet j}=\frac{1}{n_j} \sum_{i=1}^{n_j} Y_{ij} \sim N\left(\mu_j, \frac{\sigma^2}{n_j}\right).\]
  4. We know the full distribution of the grand mean: \[\overline{Y}_{\bullet\bullet}=\frac{1}{n}\sum_{j=1}^k\sum_{i=1}^{n_j} Y_{ij} \sim N \left( \frac{1}{n}\sum_{j=1}^k n_j\mu_j, \frac{\sigma^2}{n}\right).\]
  5. We also know the full distribution of \(\mathrm{SSTR}/\sigma^2\), but this was relegated to the appendix of Chapter 12. Refer to the first part of the proof found in LM Theorem 12.A.2.2. \(\mathrm{SSTR}/\sigma^2\) has a non-central chi-square distribution with \(k-1\) degrees of freedom and non-centrality parameter \[\gamma=\frac{1}{\sigma^2}\sum_{j=1}^k n_j\left(\mu_j-\mu\right)^2.\]
  6. We know the full distribution of \(\mathrm{SSE}/\sigma^2\). The proof may be found in LM Theorem 12.2.3a. In particular, \(\mathrm{SSE}/\sigma^2\) has a chi-square distribution with \(n-k\) degrees of freedom.
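
As a quick numerical check of results 5 and 6, the sketch below compares a few simulated quantiles of \(\mathrm{SSTR}/\sigma^2\) and \(\mathrm{SSE}/\sigma^2\) with the corresponding non-central chi-square and chi-square quantiles. It assumes the objects nj, muj, sigma, n, k, mu, and sims from the simulation sketch in the previous subsection are still in the workspace.

# distributional check; reuses nj, muj, sigma, n, k, mu, and sims from the earlier sketch
gamma <- sum(nj * (muj - mu)^2) / sigma^2            # non-centrality parameter
probs <- c(0.25, 0.5, 0.75, 0.9)
# SSTR / sigma^2 versus the non-central chi-square with k - 1 degrees of freedom
rbind(simulated   = quantile(sims["sstr", ] / sigma^2, probs),
      theoretical = qchisq(probs, df = k - 1, ncp = gamma))
# SSE / sigma^2 versus the chi-square with n - k degrees of freedom
rbind(simulated   = quantile(sims["sse", ] / sigma^2, probs),
      theoretical = qchisq(probs, df = n - k))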

Using the theory so far, you can already work on special cases for one-sample and two-sample problems.

Hypothesis testing involving normal population means

One null hypothesis that LM focuses on is the equality of means, i.e. \(\mu_1=\cdots=\mu_k=\mu_0\), where \(\mu_0\) is a fixed value. This fixed value is typically unknown, but it may be pre-specified. Under this null and the normality assumption, we have the following further results (the \(F\) statistic in item 3 is computed by hand in the sketch after the list):

  1. \(\mathrm{SSTOT}/\sigma^2\) has a chi-square distribution with \(n-1\) degrees of freedom.
  2. \(\mathrm{SSTR}/\sigma^2\) has a chi-square distribution with \(k-1\) degrees of freedom. A proof which applies moment generating functions may be found in Appendix 12.A.1.
  3. The statistic \[\frac{\mathrm{SSTR}/\left(k-1\right)}{\mathrm{SSE}/\left(n-k\right)}\] has an \(F\)-distribution with \(k-1\) numerator degrees of freedom and \(n-k\) denominator degrees of freedom. The proof may be found in LM Theorem 12.2.5a.
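
To see the pieces in action, here is a minimal sketch on made-up data that computes the observed \(F\) statistic and its p-value directly from the sums of squares, and then checks the result against R's built-in anova().

# F test for equality of means computed by hand on made-up data
y <- c(5, 7, 6, 9, 11, 10, 12, 4, 5, 6)
g <- factor(c(1, 1, 1, 2, 2, 2, 2, 3, 3, 3))
n <- length(y); k <- nlevels(g)
batch.means <- tapply(y, g, mean)
nj <- tapply(y, g, length)
sstr <- sum(nj * (batch.means - mean(y))^2)          # between-samples variation
sse <- sum((y - batch.means[g])^2)                   # within-samples variation
f.obs <- (sstr / (k - 1)) / (sse / (n - k))          # observed F statistic
p.value <- pf(f.obs, df1 = k - 1, df2 = n - k, lower.tail = FALSE)
c(F = f.obs, p = p.value)
# should match the F value and Pr(>F) reported by anova()
anova(lm(y ~ g))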

Rejecting \(\mu_1=\cdots=\mu_k=\mu_0\) does not really tell you which of the \(\mu_j\)’s are responsible for the lack of equality among the means. This leads to “searching” for which \(\mu_j\) differ, which is the basis of Section 12.3 about multiple comparisons using Tukey’s method. These multiple comparisons go through all possible comparisons of pairs of individual means. Because they are carried out after seeing a rejection of the null, without a clear idea of which direction to search in, they are called post hoc comparisons.

On the other hand, if you do not plan on testing \(\mu_1=\cdots=\mu_k=\mu_0\), but instead focus on planned and pre-specified subhypotheses, then the desired contrast may be tested directly. Section 12.4 lays out a procedure for doing so which is valid under normality and homoscedasticity.
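
To make the contrast idea concrete: for a pre-specified contrast \(C=\sum_{j=1}^k c_j\mu_j\) with \(\sum_{j=1}^k c_j=0\), a standard test statistic is \(\widehat{C}/\widehat{\mathrm{SE}}\), where \(\widehat{C}=\sum_{j=1}^k c_j\overline{Y}_{\bullet j}\) and \(\widehat{\mathrm{SE}}=\sqrt{\mathrm{MSE}\sum_{j=1}^k c_j^2/n_j}\) with \(\mathrm{MSE}=\mathrm{SSE}/\left(n-k\right)\); it is referred to a \(t\) distribution with \(n-k\) degrees of freedom (its square is an \(F\) statistic with 1 and \(n-k\) degrees of freedom, matching the form used in LM Section 12.4). The sketch below continues with the made-up objects y, g, nj, batch.means, sse, n, and k from the previous block.

# t test for a single pre-specified contrast; reuses y, g, nj, batch.means, sse, n, k from above
cvec <- c(1, -1, 0)                                  # contrast coefficients, summing to zero
C.hat <- sum(cvec * batch.means)                     # estimated contrast
mse <- sse / (n - k)                                 # pooled estimate of sigma^2
se.C <- sqrt(mse * sum(cvec^2 / nj))                 # standard error of the contrast
t.obs <- C.hat / se.C
p.value <- 2 * pt(abs(t.obs), df = n - k, lower.tail = FALSE)
c(estimate = C.hat, t = t.obs, p = p.value)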

Modern interpretations of the analysis of variance

The biggest practical issues with the analysis of variance stem from the equal-variances assumption (called homoscedasticity) and the normality assumption. Without the equal-variances assumption, the distributional results you have seen fall apart. There is a large body of research and simulation studies on these aspects.

It is rare to see the analysis of variance taught to economics and finance audiences. You will see this technique applied more in other social sciences like psychology. For economics and finance audiences, analysis of variance typically “drops out” after computing a linear regression. There is growing recognition that the many tests of hypotheses (especially those discussed in LM Chapters 9 to 14) are special cases of a linear regression. So, what makes the analysis of variance special?

Statisticians have debated what the analysis of variance is all about. Consider Speed (1987) and Gelman (2005), both together with their published discussions. The latter paper probably distills the essence of the key idea behind the analysis of variance: it is a tool for summarizing groups of estimates when estimating complex models. In that sense, the focus is not on hypothesis testing at all. Gelman also wrote a compressed version of that paper for economists.

LM case studies implemented in R

Case Study 12.2.1

# setting up dataset
non <- c(69, 52, 71, 58, 59, 65)
light <- c(55, 60, 78, 58, 62, 66)
mod <- c(66, 81, 70, 77, 57, 79)
heavy <- c(91, 72, 81, 67, 95, 84)
# create data frame
smoking <- data.frame(rbind(cbind(non, 1), cbind(light, 2), cbind(mod, 3), cbind(heavy, 4)))
# names for the variables
names(smoking) <- c("heart.rate", "group")
# group-specific sample means
tapply(smoking$heart.rate, smoking$group, mean)
       1        2        3        4 
62.33333 63.16667 71.66667 81.66667 
# group-specific sample variances
tapply(smoking$heart.rate, smoking$group, var)
        1         2         3         4 
 52.66667  66.56667  83.86667 115.86667 
smoking$group
 [1] 1 1 1 1 1 1 2 2 2 2 2 2 3 3 3 3 3 3 4 4 4 4 4 4
# convert group variable to a factor
smoking$group <- factor(smoking$group)
# compare with previous results
smoking$group
 [1] 1 1 1 1 1 1 2 2 2 2 2 2 3 3 3 3 3 3 4 4 4 4 4 4
Levels: 1 2 3 4
# what the dataset could look like
smoking
   heart.rate group
1          69     1
2          52     1
3          71     1
4          58     1
5          59     1
6          65     1
7          55     2
8          60     2
9          78     2
10         58     2
11         62     2
12         66     2
13         66     3
14         81     3
15         70     3
16         77     3
17         57     3
18         79     3
19         91     4
20         72     4
21         81     4
22         67     4
23         95     4
24         84     4
# use least squares (approach for next semester)
model.lm <- lm(heart.rate ~ group, data = smoking)
# compute an anova based on lm()
anova(model.lm)
Analysis of Variance Table

Response: heart.rate
          Df Sum Sq Mean Sq F value   Pr(>F)   
group      3 1464.1  488.04  6.1203 0.003979 **
Residuals 20 1594.8   79.74                    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
# calculations for anova
model.aov <- aov(heart.rate ~ group, data = smoking)
# create anova table
summary(model.aov)
            Df Sum Sq Mean Sq F value  Pr(>F)   
group        3   1464   488.0    6.12 0.00398 **
Residuals   20   1595    79.7                   
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Case Study 12.3.1

# setting up dataset
g1 <- c(29.6, 24.3, 28.5, 32.0)
g2 <- c(27.3, 32.6, 30.8, 34.8)
g3 <- c(5.8, 6.2, 11.0, 8.3)
g4 <- c(21.6, 17.4, 18.3, 19.0)
g5 <- c(29.2, 32.8, 25.0, 24.2)
# create data frame
drug.data <- data.frame(rbind(cbind(g1, "Penicillin"), cbind(g2, "Tetracycline"), cbind(g3, "Streptomycin"), cbind(g4, "Erythromycin"), cbind(g5, "Chloramphenicol")))
# names for the variables
names(drug.data) <- c("bind.rate", "drug")
# group-specific sample means
tapply(drug.data$bind.rate, drug.data$drug, mean)
Warning in mean.default(X[[i]], ...): argument is not numeric or logical:
returning NA

Warning in mean.default(X[[i]], ...): argument is not numeric or logical:
returning NA

Warning in mean.default(X[[i]], ...): argument is not numeric or logical:
returning NA

Warning in mean.default(X[[i]], ...): argument is not numeric or logical:
returning NA

Warning in mean.default(X[[i]], ...): argument is not numeric or logical:
returning NA
Chloramphenicol    Erythromycin      Penicillin    Streptomycin    Tetracycline 
             NA              NA              NA              NA              NA 
# group-specific sample variances (bind.rate is stored as character here, so convert it first)
tapply(as.numeric(drug.data$bind.rate), drug.data$drug, var)
Chloramphenicol    Erythromycin      Penicillin    Streptomycin    Tetracycline 
       15.92000         3.26250        10.35333         5.68250        10.05583 
# convert group variable to a factor
drug.data$drug <- factor(drug.data$drug)
# what the dataset could look like
drug.data
   bind.rate            drug
1       29.6      Penicillin
2       24.3      Penicillin
3       28.5      Penicillin
4         32      Penicillin
5       27.3    Tetracycline
6       32.6    Tetracycline
7       30.8    Tetracycline
8       34.8    Tetracycline
9        5.8    Streptomycin
10       6.2    Streptomycin
11        11    Streptomycin
12       8.3    Streptomycin
13      21.6    Erythromycin
14      17.4    Erythromycin
15      18.3    Erythromycin
16        19    Erythromycin
17      29.2 Chloramphenicol
18      32.8 Chloramphenicol
19        25 Chloramphenicol
20      24.2 Chloramphenicol
# partly explain why tapply fails
drug.data$bind.rate
 [1] "29.6" "24.3" "28.5" "32"   "27.3" "32.6" "30.8" "34.8" "5.8"  "6.2" 
[11] "11"   "8.3"  "21.6" "17.4" "18.3" "19"   "29.2" "32.8" "25"   "24.2"
# no error
tapply(as.numeric(drug.data$bind.rate), drug.data$drug, mean)
Chloramphenicol    Erythromycin      Penicillin    Streptomycin    Tetracycline 
         27.800          19.075          28.600           7.825          31.375 
# convert bind.rate from character to numeric before fitting
drug.data$bind.rate <- as.numeric(drug.data$bind.rate)
# anova
model.aov <- aov(bind.rate ~ drug, data = drug.data)
summary(model.aov)
            Df Sum Sq Mean Sq F value   Pr(>F)    
drug         4 1480.8   370.2   40.88 6.74e-08 ***
Residuals   15  135.8     9.1                     
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
TukeyHSD(model.aov)
  Tukey multiple comparisons of means
    95% family-wise confidence level

Fit: aov(formula = bind.rate ~ drug, data = drug.data)

$drug
                                diff        lwr        upr     p adj
Erythromycin-Chloramphenicol  -8.725 -15.295401  -2.154599 0.0071611
Penicillin-Chloramphenicol     0.800  -5.770401   7.370401 0.9952758
Streptomycin-Chloramphenicol -19.975 -26.545401 -13.404599 0.0000010
Tetracycline-Chloramphenicol   3.575  -2.995401  10.145401 0.4737713
Penicillin-Erythromycin        9.525   2.954599  16.095401 0.0034588
Streptomycin-Erythromycin    -11.250 -17.820401  -4.679599 0.0007429
Tetracycline-Erythromycin     12.300   5.729599  18.870401 0.0003007
Streptomycin-Penicillin      -20.775 -27.345401 -14.204599 0.0000006
Tetracycline-Penicillin        2.775  -3.795401   9.345401 0.6928357
Tetracycline-Streptomycin     23.550  16.979599  30.120401 0.0000001
# dishonest: all comparisons
pairwise.t.test(as.numeric(drug.data$bind.rate), drug.data$drug, p.adjust.method = "none")

    Pairwise comparisons using t tests with pooled SD 

data:  as.numeric(drug.data$bind.rate) and drug.data$drug 

             Chloramphenicol Erythromycin Penicillin Streptomycin
Erythromycin 0.00095         -            -          -           
Penicillin   0.71220         0.00044      -          -           
Streptomycin 1.1e-07         9.1e-05      6.8e-08    -           
Tetracycline 0.11363         3.6e-05      0.21183    1.3e-08     

P value adjustment method: none 
# using multcomp package
require(multcomp)
Loading required package: multcomp
Loading required package: mvtnorm
Loading required package: survival
Loading required package: TH.data
Loading required package: MASS

Attaching package: 'TH.data'
The following object is masked from 'package:MASS':

    geyser
summary(glht(model.aov, linfct = mcp(drug = "Tukey")))

     Simultaneous Tests for General Linear Hypotheses

Multiple Comparisons of Means: Tukey Contrasts


Fit: aov(formula = bind.rate ~ drug, data = drug.data)

Linear Hypotheses:
                                    Estimate Std. Error t value Pr(>|t|)    
Erythromycin - Chloramphenicol == 0   -8.725      2.128  -4.101 0.007186 ** 
Penicillin - Chloramphenicol == 0      0.800      2.128   0.376 0.995275    
Streptomycin - Chloramphenicol == 0  -19.975      2.128  -9.388  < 1e-04 ***
Tetracycline - Chloramphenicol == 0    3.575      2.128   1.680 0.473714    
Penicillin - Erythromycin == 0         9.525      2.128   4.477 0.003458 ** 
Streptomycin - Erythromycin == 0     -11.250      2.128  -5.287 0.000800 ***
Tetracycline - Erythromycin == 0      12.300      2.128   5.781 0.000341 ***
Streptomycin - Penicillin == 0       -20.775      2.128  -9.764  < 1e-04 ***
Tetracycline - Penicillin == 0         2.775      2.128   1.304 0.692820    
Tetracycline - Streptomycin == 0      23.550      2.128  11.068  < 1e-04 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Adjusted p values reported -- single-step method)

Case Study 12.4.1

# setting up dataset
gA <- c(9, 9.5, 9.74, 10, 13, 9.5)
gB <- c(11, 10, 10, 11.75, 10.5, 15)
gC <- c(11.5, 12, 9, 11.5, 13.25, 13)
gD <- c(13.25, 11.5, 12, 13.5, 11.5)
# create data frame
infant <- data.frame(rbind(cbind(gA, "A"), cbind(gB, "B"), cbind(gC, "C"), cbind(gD, "D")))
# names for the variables
names(infant) <- c("age", "training.grp")
# age is stored as character after rbind/cbind, so convert it to numeric
infant$age <- as.numeric(infant$age)
# convert group variable to a factor
infant$training.grp <- factor(infant$training.grp)
# anova
model.aov <- aov(age ~ training.grp, data = infant)
# using multcomp package
require(multcomp)
# contrast C1
summary(glht(model.aov, linfct = mcp(training.grp = c(1, -1, 0, 0))))

     Simultaneous Tests for General Linear Hypotheses

Multiple Comparisons of Means: User-defined Contrasts


Fit: aov(formula = age ~ training.grp, data = infant)

Linear Hypotheses:
       Estimate Std. Error t value Pr(>|t|)
1 == 0  -1.2517     0.8756   -1.43    0.169
(Adjusted p values reported -- single-step method)
# contrast C2
summary(glht(model.aov, linfct = mcp(training.grp = c(0, 0, 1, -1))))

     Simultaneous Tests for General Linear Hypotheses

Multiple Comparisons of Means: User-defined Contrasts


Fit: aov(formula = age ~ training.grp, data = infant)

Linear Hypotheses:
       Estimate Std. Error t value Pr(>|t|)
1 == 0  -0.6417     0.9183  -0.699    0.493
(Adjusted p values reported -- single-step method)