What we mean when we say “mean”

1 Motivation

Based on the submitted homework solutions and the questions I received in class (last March 14), I decided to write about the different concepts involving the variance. To understand these concepts and how they differ from each other, we may need to go back to different concepts involving the mean.

When we use the words “mean” and “variance”, there is usually a context behind them. It is important to get used to asking about this context, or to be experienced enough to recognize the context from the start. So far, you have seen the word mean used in many senses:

  • As the mean of a list of numbers
  • As the mean of a random variable
  • As the sample mean applied to a list of numbers or realizations of random variables
  • As the population mean applied to a list of fixed numbers
  • As an R command mean()

You can strike out the word “mean” and replace it with “variance”, but the R command will be var(). You could also replace it with “standard deviation”, but the R command will be sd(). The bottom line is that there are many ways of viewing the mean, variance, and standard deviation.

2 List of numbers

x <- c(1, 2, 2, 4, 6)
c(mean(x), var(x), sd(x)) # built-in mean, variance, and standard deviation
[1] 3 4 2
sum(x)/length(x) # the mean computed by hand
[1] 3
xbar <- (1+2+2+4+6)/5
xbar
[1] 3
sum((x-mean(x))^2)/(length(x)-1) # the variance computed by hand
[1] 4
((1-xbar)^2+(2-xbar)^2+(2-xbar)^2+(4-xbar)^2+(6-xbar)^2)/4
[1] 4
sqrt(sum((x-mean(x))^2)/(length(x)-1)) # the standard deviation computed by hand
[1] 2
sqrt(((1-xbar)^2+(2-xbar)^2+(2-xbar)^2+(4-xbar)^2+(6-xbar)^2)/4)
[1] 2

Here there is a fixed list of numbers \(\{1,2,2,4,6\}\). Let \(n\) be the number of elements in the list. The mean of those numbers was calculated as \(\overline{x}=\displaystyle\frac{1}{n}\sum_{i=1}^n x_i\). The variance of those numbers was calculated as \(s^2_x=\displaystyle\frac{1}{n-1}\sum_{i=1}^n \left(x_i-\overline{x}\right)^2\). The standard deviation of those numbers was calculated by taking the square root of the calculated variance.

You may have already noticed one thing that seemingly contradicts what you saw in Exercise C of Homework 01. You are looking at the same list of numbers, but in that exercise we treated the list of numbers as the population. In that situation, the number of elements in the list (or, if you want, container) was called \(N\) to distinguish it from the sample size in the exercise. Furthermore, I defined \(\mu=\displaystyle\frac{1}{N}\sum_{i=1}^N x_i\) and \(\sigma^2=\displaystyle\frac{1}{N}\sum_{i=1}^N \left(x_i-\mu\right)^2\).
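To make the contrast concrete, here is a small sketch of my own (not part of the homework; the object names N, mu, and sigma.sq are arbitrary) that applies the population formulas to the same list:

x <- c(1, 2, 2, 4, 6)
N <- length(x) # size of the finite population
mu <- sum(x)/N # population mean: divide by N
mu
[1] 3
sigma.sq <- sum((x - mu)^2)/N # population variance: divide by N, not N - 1
sigma.sq
[1] 3.2
sqrt(sigma.sq) # population standard deviation
[1] 1.788854

Notice that the population variance here is 3.2, not the 4 returned by var(x) above, precisely because we divide by \(N\) instead of \(n-1\).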

What can we learn from this demonstration? Both share the same list of numbers \(\{1,2,2,4,6\}\). But the formulas based on \(\overline{x}=\displaystyle\frac{1}{n}\sum_{i=1}^n x_i\) and \(\displaystyle\frac{1}{n-1}\sum_{i=1}^n \left(x_i-\overline{x}\right)^2\) treat the list as if it were a sample. The formulas based on \(\mu=\displaystyle\frac{1}{N}\sum_{i=1}^N x_i\) and \(\sigma^2=\displaystyle\frac{1}{N}\sum_{i=1}^N \left(x_i-\mu\right)^2\) treat the list as if it were a finite population.

What many of you may be asking now is the idea behind the formula for the variance. Why is it that sometimes we divide by the number of elements in the list, and sometimes we divide by the number of elements in the list minus 1? Answering this question requires us to ask why we do the things we do, or why we calculate things the way we are asked to calculate them.

3 One realization among many hypothetical realizations

A fixed list of numbers \(\{x_1,\ldots,x_n\}\) could be thought of as just one of the many possible hypothetical realizations of a set of random variables. Recall the following paragraph from Section 2.1 of the Notes on normal models:

“Consider a setting where we observe a sample \(\left(y_1,\ldots,y_n\right)\) which is a realization from the joint distribution of IID normal random variables \(\left(Y_1,\ldots, Y_n\right)\). Let \(\widehat{\theta}=\overline{Y}\). We can easily calculate the observed sample mean \(\overline{y}=\dfrac{1}{n}\displaystyle\sum_{i=1}^n y_i\) but what does probability theory tell us about \(\overline{Y}=\dfrac{1}{n}\displaystyle\sum_{i=1}^n Y_i\)?”

The formulas are still the same in spirit, but the context changes. We are adding a layer of make-believe. We think that the list of numbers we see is but one of many lists of numbers of length \(n\) drawn from the joint distribution of \(\left(Y_1,\ldots, Y_n\right)\). By the way, I know that I used the letter \(x\), but this is of no consequence. The notation is merely suggestive that the context has changed. \(\left(x_1,\ldots,x_n\right)\) could be a realization (or draw) from the joint distribution of \(\left(Y_1,\ldots, Y_n\right)\), just as \(\left(y_1,\ldots,y_n\right)\) could be another realization (or draw).

In this context, we have “new” formulas like \(\overline{Y}=\dfrac{1}{n}\displaystyle\sum_{i=1}^n Y_i\) for the sample mean and \(S^2=\dfrac{1}{n-1}\displaystyle\sum_{i=1}^n \left(Y_i-\overline{Y}\right)^2\) for the sample variance. These are not really new as the calculation aspect of the formulas is the same. Perhaps, it would be better to think of them as recipes:

  • Sample mean: Take your realization. Add all the numbers together. Divide by \(n\).
  • Sample variance: Take your realization. Compute the sample mean first. Subtract the sample mean from every number. Square every resulting number. Add them all together. Divide by \(n-1\).
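
If it helps, here is a minimal sketch (the function names my.mean and my.var are mine, not standard R) that follows these recipes literally and checks them against the built-in mean() and var():

my.mean <- function(y) sum(y)/length(y) # add all the numbers, divide by n
my.var <- function(y) sum((y - my.mean(y))^2)/(length(y) - 1) # subtract the mean, square, add, divide by n - 1
x <- c(1, 2, 2, 4, 6)
c(my.mean(x), mean(x))
[1] 3 3
c(my.var(x), var(x))
[1] 4 4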

In fact, these “new” formulas, when applied to a realization (a list of numbers of the form \(\left(y_1,\ldots,y_n\right)\)), will produce the corresponding \(\overline{y}\) and \(s^2_y\). Recall our code for \(n=5\) IID draws from \(N(1,4)\), but this time the sample variance of the five numbers is also calculated:

n <- 5 # sample size
mu <- 1 # common expected value 
sigma.sq <- 4 # common variance 
y <- rnorm(n, mu, sqrt(sigma.sq)) # Be careful of the R syntax here. 
y # a realization
[1] 1.4191078 3.1367193 0.2759058 0.8894976 1.6472649
ybar <- mean(y) # observed sample mean
ybar
[1] 1.473699
s2y <- var(y) # observed sample variance
s2y
[1] 1.143685
# Obtain another realization
y <- rnorm(n, mu, sqrt(sigma.sq)) # Be careful of the R syntax here. 
y # another realization
[1]  0.8178101  0.2484874 -3.7686550  1.2817904 -2.4121712
ybar <- mean(y) # observed sample mean
ybar
[1] -0.7665477
s2y <- var(y) # observed sample variance
s2y
[1] 4.864225

Because these “new” formulas add a new layer, there must be a good reason for this new layer; otherwise, it would be pointless to introduce it and to even make distinctions! Return to our code from before on \(N(1,4)\). I changed the code a bit so that I can focus on the sample mean and the sample variance separately; this way, I also do not have to introduce new code for looking at the sample mean and the sample variance together.

nsim <- 10^4 # number of realizations to be obtained
# repeatedly obtain realizations
ybars <- replicate(nsim, mean(rnorm(n, mu, sqrt(sigma.sq))))
s2ys <- replicate(nsim, var(rnorm(n, mu, sqrt(sigma.sq))))
c(mean(ybars), var(ybars), sd(ybars))
[1] 0.9855684 0.8085306 0.8991833
c(mean(s2ys), var(s2ys), sd(s2ys))
[1] 3.981322 7.939487 2.817710

You have already seen, or at least know, what you should see when you run c(mean(ybars), var(ybars), sd(ybars)), but the new line c(mean(s2ys), var(s2ys), sd(s2ys)) is a bit different in the sense that you do not yet know what you should see here. Take some time to connect this to the finding in LM Example 5.4.4 (ignore what maximum likelihood estimator means for the time being). In that example, \(\sigma^2\) is the estimand and we have two estimators for it. One is the formula for the sample variance \(S^2\). The other is what the book calls \(\widehat{\sigma}^2\), i.e. \[\widehat{\sigma}^2=\frac{1}{n}\sum_{i=1}^n\left(Y_i-\overline{Y}\right)^2=\frac{n-1}{n}S^2\]

# alternative estimator 
s2ys.alt <- replicate(nsim, (n-1)/n*var(rnorm(n, mu, sqrt(sigma.sq))))
c(mean(s2ys.alt), var(s2ys.alt), sd(s2ys.alt))
[1] 3.193200 5.236598 2.288361

Focus on the results of mean(s2ys) and mean(s2ys.alt). I think the bottom line is that what I have shown you here is a partial explanation of why we divide by \(n-1\) in the sample variance.
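
To connect the two simulated averages to theory (an aside not worked out in these notes, but a standard fact): over many realizations, \(S^2\) hits \(\sigma^2\) on average, while the alternative estimator falls short on average. Under the \(N(1,4)\) setup with \(n=5\), \[\mathbb{E}\left(S^2\right)=\sigma^2=4, \qquad \mathbb{E}\left(\widehat{\sigma}^2\right)=\frac{n-1}{n}\,\sigma^2=\frac{4}{5}\cdot 4=3.2,\] which is roughly what mean(s2ys) and mean(s2ys.alt) returned.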

4 “Infinite” population

We now cover another version of a population mean (or variance, or standard deviation; adjust accordingly). The expected value of a random variable is sometimes called a population mean. The variance of a random variable is also sometimes called a population variance (in fact, the variance is itself an expected value of a transformed random variable: \(\mathsf{Var}\left(X\right)=\mathbb{E}\left[\left(X-\mathbb{E}\left(X\right)\right)^2\right]\)). You might wonder why.

Consider the familiar example of rolling a fair die. Let \(X\) be the number on the face of the die after you roll the die. It is almost routine for you to answer questions like:

  • What is the distribution of \(X\)?
  • What are \(\mathbb{E}\left(X\right)\) and \(\mathsf{Var}\left(X\right)\)?
  • Find \(\mathbb{P}\left(X<2\right)\).

Now, we are going to dig a bit deeper into what \(\mathbb{E}\left(X\right)\) and \(\mathsf{Var}\left(X\right)\) mean beyond the definitions you know. Here the exact or actual values are given by \(\mathbb{E}\left(X\right)=7/2=3.5\) and \(\mathsf{Var}\left(X\right)=91/6-49/4\approx 2.916667\). One interpretation of these quantities is that they represent a population mean and a population variance, respectively. In the notes, I also talk about an alternative interpretation of the expected value and variance.
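
As a quick check of these exact values (a small sketch of my own; the object names faces, probs, EX, and EX2 are arbitrary), you can compute them directly from the PMF before doing any simulation:

faces <- 1:6
probs <- rep(1/6, 6) # PMF of a fair die
EX <- sum(faces*probs) # E(X) = 7/2
EX2 <- sum(faces^2*probs) # E(X^2) = 91/6
c(EX, EX2 - EX^2) # expected value and variance
[1] 3.500000 2.916667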

Here is some R code to illustrate the idea. By now, you should be able to simulate die rolls.

1:6 # to show you what this means: same as c(1,2,3,4,5,6)
[1] 1 2 3 4 5 6
# Roll a fair die 10000 times and record every outcome
rolls <- sample(1:6, 10^4, replace = TRUE)
# Calculate the sample mean and the sample variance of the outcomes
c(mean(rolls), var(rolls))
[1] 3.514000 2.946299
# Increase the number of rolls, and calculate the sample mean and the sample variance of the outcomes
rolls <- sample(1:6, 10^6, replace = TRUE)
c(mean(rolls), var(rolls))
[1] 3.500429 2.917987
# Increase the number of rolls even further, and calculate the sample mean and the sample variance of the outcomes
# WARNING: Your computer might die. Be careful here.
rolls <- sample(1:6, 10^8, replace = TRUE) 
c(mean(rolls), var(rolls))
[1] 3.499832 2.916544

Therefore, you can think of the expected value of a random variable \(X\) as the mean of a very large number of realizations of \(X\). In some way, the very large number can be thought of as creating an “infinite” population because we are pushing the number of realizations to the limit.

In fact, the result from the R code should not be surprising given what you have been learning in the past two to three weeks. In a way, rolls really stands for a realization from the joint distribution of \(X_1,\ldots,X_n\), where \(n\) could be \(10^4\), \(10^6\), or \(10^8\). Here \(X_1,\ldots,X_n\) are IID random variables with mean \(\mathbb{E}\left(X\right)=7/2=3.5\), variance \(\mathsf{Var}\left(X\right)=91/6-49/4\approx 2.916667\), and a common PMF given by \(\mathbb{P}\left(X=k\right)=1/6\) for \(k=1,\ldots,6\). You can think of \(X\) as the parent and \(X_1,\ldots,X_n\) as IID “copies” of \(X\).
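
As a small follow-up sketch (assuming rolls still holds the \(10^8\) simulated rolls from above), you can also check that the relative frequency of each face settles near the common PMF value \(1/6 \approx 0.1667\):

# relative frequency of each face; every entry should be close to 1/6
table(rolls)/length(rolls)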

The bottom line is that \(\mathbb{E}\left(X\right)\) and \(\mathsf{Var}\left(X\right)\) can be thought of as the population mean and the population variance, respectively. But the population here is an imagined “infinite” population.