Simple, but real data analysis

(looking to the future)

Andrew Pua

Checking advertised weight

Recall our very first data collection exercise where I asked you to weigh two cookies separately.
The company claims that every pack of two cookies weighs 21 grams.
We are evaluating this claim based on some purchased boxes of 16 packs.
This is clearly a hypothesis testing application.

What is the estimand?

The estimand or the parameter of interest here is the total weight (or if you wish the average weight) in grams of the two cookies in the population of all packs of two cookies across all boxes of 16 packs each.
We can call that estimand \(\mu\) and we are evaluating the company’s claim that \(\mu=10.5\) grams.
You can apply what you learned in Chapters 6 and 7, if you want.
Let us set the significance level at 5%.

Actual analysis

cookie1 <- read.csv("data_cookie.csv")
head(cookie1)

  X id box scale chipnum weight numchoc
1 1  1   3    NA       1  10.23      24
2 2  1   3    NA       2  10.65      18
3 3  2   1     7       1  10.50      16
4 4  2   1     7       2  11.03      18
5 5  3   1     5       1  10.82      19
6 6  3   1     5       2  10.57      14

table(cookie1$scale)


 1  2  3  4  5  6  7 
14 14  2 32 18  8 18

table(cookie1$box)


 1  2  3  4 
32 30 32 32

Notice that there are missing data. These missing observations may or may not matter.
Let us focus on testing the claim. Do you think the assumptions are satisfied?

wt.avg <- tapply(cookie1$weight, cookie1$id, mean)
wt.avg

     1      2      3      4      5      6      7      8      9     10     11 
10.440 10.765 10.695 10.270 11.340 10.600 10.080 10.820 10.715 10.410 10.585 
    12     13     14     15     16     17     18     19     20     21     22 
10.980 10.705 11.165 10.565 10.985 11.025 10.125 10.915 10.290 10.815 10.205 
    23     24     25     26     27     28     29     30     31     32     33 
10.745 10.650 10.420  9.730 10.475 10.750 10.595 10.220 10.840 10.560 10.300 
    34     35     36     37     38     39     40     41     42     43     44 
10.445  9.900 11.120 10.345 10.630 10.690 10.880 10.250 10.250 10.515 10.455 
    45     46     47     48     49     50     51     52     53     54     55 
10.325 13.915 11.335 10.890 10.995 10.900 10.275  9.800 10.500 10.615 10.155 
    56     57     58     59     60     61     62     63     64 
10.235 10.225 10.820  9.760 11.440 10.905 10.200 10.555 10.230

hist(wt.avg, freq = FALSE)
curve(dnorm(x, mean=mean(wt.avg), sd=sd(wt.avg)), add = TRUE, col = "hotpink2")

qqnorm(wt.avg)
qqline(wt.avg)

Investigating the “outlier”

cookie1[cookie1$id == 46, ]

    X id box scale chipnum weight numchoc
91 91 46   1     5       1  13.85      15
92 92 46   1     5       2  13.98      15

Should we remove the outlier? Depends and this clearly depends on judgment too.
We can examine what happens if you remove the outlier.

par(mfrow = c(1, 2))
hist(wt.avg[-46], freq = FALSE)
curve(dnorm(x, mean=mean(wt.avg[-46]), sd=sd(wt.avg[-46])), add = TRUE, col = "hotpink2")
qqnorm(wt.avg[-46])
qqline(wt.avg[-46])

A nicer qq-plot is available in the package car. The command is qqPlot().
You may have to install this if you want to run the commands here for yourself.
This plot has confidence bands which may help in deciding whether the “goodness-of-fit” to the normal is acceptable to some extent.

par(mfrow = c(1, 2))
hist(wt.avg[-46], freq = FALSE)
curve(dnorm(x, mean=mean(wt.avg[-46]), sd=sd(wt.avg[-46])), add = TRUE, col = "hotpink2")
library(car)
qqPlot(wt.avg[-46], id = FALSE)

Dropping the “outlier”

Normality seems to be compatible with the data.
Let us try to proceed with the approach in Chapter 7 and evaluate the company’s claim.

t.test(wt.avg[-46], mu = 10.5)


    One Sample t-test

data:  wt.avg[-46]
t = 1.3106, df = 62, p-value = 0.1948
alternative hypothesis: true mean is not equal to 10.5
95 percent confidence interval:
 10.46732 10.65713
sample estimates:
mean of x 
 10.56222

Retaining the “outlier”

We can retain the “outlier” if this is an important data point.
But the justification here is using the central limit theorem and IID, of course.

t.test(wt.avg, mu = 10.5)


    One Sample t-test

data:  wt.avg
t = 1.6326, df = 63, p-value = 0.1075
alternative hypothesis: true mean is not equal to 10.5
95 percent confidence interval:
 10.47432 10.75489
sample estimates:
mean of x 
 10.61461

Many issues with the previous analysis

Can we act as if we did not do any pre-processing?
We have done multiple tests and carried out different analyses (under different assumptions).
What should we report?
Although the problem we are answering looks simple, every research process goes through what we have just seen.

The temptation to dig more

Although we set out to actually test only the company’s claim, you might be tempted to explore beyond what you set out to do.
There are sometimes issues related to this kind of exploration, that is why some have started to look into pre-registration.
There are many other things to explore but…

In fact, you may be inclined to explore whether there is a difference in the weights of the two cookies in one pack.
This should be relatively easy to implement but it is a bit different from a two-sample \(t\)-test discussed in Chapter 9.
But pay attention to the fact that the data collection protocol was not very clear in the sense that the data collection was extremely disorganized. For example, we did not agree which is to be the first and the second cookie.

M&M data analysis

The task here was to evaluate a claim of whether the company’s color distribution in the US for 2008 is the same color distribution as in China.
We looked at the aggregate data.
This is again a hypothesis testing problem.

mandm <- read.csv("mandm-01.csv")
observed <- apply(mandm[, 2:7], 2, sum)
null.prob <- c(0.13, 0.13, 0.24, 0.2, 0.16, 0.14)
chisq.test(observed, p = null.prob)


    Chi-squared test for given probabilities

data:  observed
X-squared = 92.832, df = 5, p-value < 2.2e-16

We could have conducted an analysis for every student’s M&M pack.

p.val <- function(color.counts)
{
  temp <- chisq.test(color.counts, p = c(0.13, 0.13, 0.24, 0.2, 0.16, 0.14), correct = TRUE)
  return(temp$p.value)
}
pvals <- apply(mandm[,2:7], 1, p.val)
round(pvals, 2)

 [1] 0.00 0.09 0.97 0.02 0.53 0.37 0.45 0.90 0.91 0.31 0.01 0.57 0.30 0.70 0.04
[16] 0.84 0.62 0.01 0.10 0.75 0.08 0.42 0.00 0.03 0.47 0.67 0.14 0.19 0.03 0.22
[31] 0.42 0.05 0.71 0.17 0.09 0.01 0.22 0.00 0.21 0.65 0.41 0.54 0.65 0.73 0.01
[46] 0.23 0.22 0.00 0.87 0.69 0.31 0.17 0.29 0.25 0.90 0.12 0.49 0.75

hist(pvals, freq=FALSE)

Think of each student as a “lab” studying the same phenomenon, using almost the same materials.
If each of these students “publish” their results only when they reject the null, what will happen?

hist(pvals[pvals<=0.05], freq=FALSE)

Thankfully, it seems that the null hypothesis is actually false.
What I showed you is a constant theme in the so-called replication crisis. I encourage you to explore these themes further.

Coffee and chocolate chips

This part is based on data collected as part of Quiz 02.
One possible research question was to determine if there are differences in the number of coffee (similarly, chocolate) chips across different boxes.
You can think of the box as the “treatment” in the ANOVA setting. There was random assignment of students to each box.
As you know, there was a data collection protocol implemented for this research question.

As you also know, this protocol was contaminated by two of your classmates who introduced data not belonging to the study.
Thankfully, this was discovered. But it could be much worse in real life. This is a lesson to be vigilant (trust, but verify).
There were also two missing observations from students who did not ask for leave and were unable to do the collection. But this is less serious.

This is what the cookie dataset looks like:

cookie2 <- read.csv("data_cookie2.csv")
cookie2$name <- NULL
head(cookie2)

  cof choc cookie box id
1   4   15      1   3  1
2   5   13      2   3  1
3   7   10      1   4  2
4   7    9      2   4  2
5   9   14      1   4  3
6   8   10      2   4  3

cof.grouped <- aggregate(cookie2$cof, by=list(id = cookie2$id, box = cookie2$box), sum)
cof.grouped

   id box  x
1  10   1 13
2  12   1 13
3  14   1 26
4  16   1 23
5  29   1 20
6  35   1 17
7  38   1 31
8  39   1 11
9  40   1 21
10 43   1 20
11 45   1 22
12 46   1 24
13 50   1 16
14 52   1 24
15 54   1 13
16  5   2 25
17  7   2 20
18  9   2 20
19 11   2 12
20 18   2 18
21 22   2 11
22 24   2 17
23 25   2 18
24 27   2 19
25 28   2 15
26 33   2 16
27 42   2  0
28 48   2 17
29 53   2 15
30  1   3  9
31  6   3 14
32  8   3 19
33 13   3 14
34 17   3 14
35 23   3 12
36 31   3 28
37 34   3 21
38 36   3 24
39 37   3 18
40 47   3 19
41 51   3 20
42 58   3 14
43  2   4 14
44  3   4 17
45  4   4 12
46 15   4 17
47 19   4 12
48 20   4 23
49 21   4 17
50 26   4 47
51 30   4 12
52 32   4 34
53 41   4 14
54 44   4 14
55 49   4 31
56 55   4 14
57 56   4 32
58 57   4  9

plot(x ~ box, data = cof.grouped)

plot(x ~ factor(box), data = cof.grouped)

par(mfrow=c(2,2))
hist(cof.grouped$x[cof.grouped$box == 1], freq = FALSE)
hist(cof.grouped$x[cof.grouped$box == 2], freq = FALSE)
hist(cof.grouped$x[cof.grouped$box == 3], freq = FALSE)
hist(cof.grouped$x[cof.grouped$box == 4], freq = FALSE)

summary(aov(x ~ factor(box), data = cof.grouped))

            Df Sum Sq Mean Sq F value Pr(>F)
factor(box)  3  157.9   52.62   0.982  0.408
Residuals   54 2894.5   53.60

Many issues: normality? equal variance across boxes?
Clearly, there is a big problem which we have to fix for an observation in Box 2! Zero coffee chips??
You can repeat the analysis for chocolate chips.
You might also be interested in looking into a different estimand such as the coffee chip to chocolate chip mix. Even more interesting is to find if this mix varies across boxes!

# Apply Example 12.5.1
# Number of coffee chips should be Poisson under some assumptions, see HW01
cof.grouped$sqrt.x <- sqrt(cof.grouped$x)
plot(sqrt.x ~ factor(box), data = cof.grouped)

summary(aov(sqrt.x ~ factor(box), data = cof.grouped))

            Df Sum Sq Mean Sq F value Pr(>F)
factor(box)  3   2.82  0.9387   1.093   0.36
Residuals   54  46.39  0.8591

What to look forward to in the future

A lot!
What we have covered is a very small part of the Analysis portion of the Statistical Method.
Recall that we have Problem, Plan, Data, Analysis, and Conclusion.
There are already a lot of issues in Problem, Plan, Data which cause difficulties in choosing what type of analysis to pursue.

Topics I wish I could have taught

If this class is not part of the 统开课 system, I would have taught:

Chapter 11 on regression: I will show that Chapters 7, 9, 12 can be thought of as special cases of Chapter 11.
- Check my analysis of variance notes as to what will be a better way to think of ANOVA now and in the future.
- Check my method of moments notes for how to think about regression in the modern way.

Section 5.9 on the bootstrap: I share some notes on these in case you are interested.
Exact hypothesis tests in controlled experiments: Check one of the papers in the project option.

Advice for the future

Focus on true understanding rather than just research.
Find good friends you can trust and can be good to work with.
Always think about what you had done, what you are doing, and what you will be doing.
Communicate and write more in a different language.
Be vigilant and pay attention to what is happening around you.
Good luck to you and goodbye!