What is special about normal models?

1 Introduction

I am going to set up a review of some ideas and skills you should have picked up from your first course in probability theory. I connect some of the main ideas of this course to that review so that you can see how probability theory is applied to statistics. The resulting statistical theory becomes our basis for what to do in practice when you encounter real data, which are often messier.

Larsen and Marx, henceforth LM, already point out statistical applications of probability theory when studying random variables in Chapter 3. They state on page 175, “In statistics, the most important combination of a set of random variables is often their sum, so we continue this section with the problem of finding the pdf of \(X + Y\).” Another important example in that chapter is Example 3.9.11 on page 191, where the expected value and variance of an average computed from a random sample of \(n\) observations are derived.

In the process of reviewing probability theory, we will be transitioning from probability models to what LM call data models. This phrase shows up in the Preface of LM and its first occurrence in the main text is on page 224. There is no explicit definition of what LM call data models, but they state the following:

  1. Page viii: “The initial step in teaching someone how to analyze data is helping them learn how to recognize those eight ‘data models’: One-sample data, Two-sample data, \(k\)-sample data, Paired data, Randomized block data, Regression/Correlation data, Categorical data, and Factorial data.”
  2. Page viii: “Identifying data models, of course, is not a difficult skill to acquire. Anyone involved in analyzing data learns it quickly. But for students taking their first course in statistics, ignoring the topic leaves them without a sense of where the subject is going and why.”
  3. Page 224: “Today the function \(p_X(k) = e^{−\lambda}\lambda^k / k!\) is universally recognized as being among the three or four most important data models in all of statistics.”
  4. Page 285: “In trying to model a set of data with a Poisson pdf, \(p_X(k) = e^{−\lambda}\lambda^k / k!\), \(k = 0, 1, 2, \ldots\), it will sometimes be the case that the value \(k = 0\) cannot occur, or there are too many zeroes for a good data model (called zero-inflated data). Accommodating that restriction leads to a definition of a truncated Poisson pdf – call it \(p_{X_T} (k)\) – where the values of \(k\) are limited to the integers \(1, 2, 3,\ldots\)”

Based on these passages, let me spell out what I think LM mean by “data model”. They seem to use it in two senses: one as a statistical model for the data, and the other as the data format that reflects the overall structure and experimental design which produced the data we observe.

Definition 1  

  1. A parameter \(\theta\) is an unknown fixed quantity or feature of some distribution. A parameter may be a number, a vector, a matrix, or a function.
  2. A statistical model is a family or set \(\mathfrak{F}\) of distributions. Statistical models are sometimes classified as parametric and nonparametric.
  3. A parametric statistical model or data model in LM’s terminology is a family or set \(\mathfrak{F}\) of distributions which is known up to (or indexed by) a finite number of parameters.
  4. A nonparametric statistical model is a family or set \(\mathfrak{F}\) of distributions which cannot be indexed by a finite number of parameters.

In this course, we will mostly focus on parametric statistical models, but there will be occasions where we encounter a mix of a nonparametric and a parametric statistical model. Statistical models are really just probability models with a different focus. The probability models you encountered in your first course in probability theory show up again in this course. These probability models depend on constants which, once known, enable you to calculate probabilities. In contrast, we now do not know these constants and seek to learn their values from the data.

LM actually introduce you to parametric statistical models (without explicitly saying so) in Chapter 4. Two examples of how probability models can be used as models for the data are illustrated in Case Study 4.2.2 and Case Study 4.2.3. In those case studies, \[\mathfrak{F}=\left\{p_X\left(k\right)=\frac{e^{−\lambda}\lambda^k}{k!}:\ \lambda>0\right\},\] where \(\lambda\) is a one-dimensional parameter of the model. The form of the distribution is known (Poisson). We do not know \(\lambda\), but we may observe some values of \(X\). This contrasts with your first course in probability theory, where you computed quantities like \(p_X\left(0\right)\) with \(\lambda\) given to you.

In the two case studies, notice that what we actually observe is the data:

  1. The numbers and proportions of times that \(\alpha\)-particles were detected in a given eighth-minute in Case Study 4.2.2
  2. The number of wars starting in a given year in Case Study 4.2.3

The Poisson pdf depends on an unknown \(\lambda\), and a value for \(\lambda\) was plugged in to see if there is a “good match” between the relative frequencies (which we observe) and the probabilities provided by the Poisson pdf with a plug-in value for \(\lambda\) (which are an abstraction). The value plugged in for \(\lambda\) is the sample mean, which was calculated from the data. The two case studies suggest that there was a “good match”; a small R sketch of this comparison appears after the questions below. These case studies should encourage you to ponder the following questions:

  1. Why choose the Poisson pdf as the data model?
  2. Why plug in the sample mean for \(\lambda\)? Are there other options?
  3. How do we decide if there is a “good match”?
  4. Why was it important to have a “good match”? What purpose does it serve?
  5. If there is not a “good match”, what do we do?

Answering these questions hopefully motivates you to study statistical theory.
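To make these questions concrete, here is a minimal sketch of the plug-in comparison using counts simulated with rpois() (not the actual case-study data): the sample mean is plugged in for \(\lambda\), and the resulting Poisson probabilities are placed next to the observed relative frequencies.

k.obs <- rpois(200, lambda = 1.2) # simulated counts standing in for observed data
lambda.hat <- mean(k.obs) # plug-in value for lambda: the sample mean
k.vals <- 0:max(k.obs)
rel.freq <- as.numeric(table(factor(k.obs, levels = k.vals)))/length(k.obs)
round(cbind(k = k.vals, rel.freq = rel.freq, pois.prob = dpois(k.vals, lambda.hat)), 3)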

Exercise 1  

  1. The tables found in the two case studies are summaries of the data. What do you think the original data looked like?
  2. If \(Y_1,\ldots,Y_n\) are IID random variables having a Poisson distribution with parameter \(\lambda\), what are \(\mathbb{E}\left(Y_i\right)\) and \(\mathsf{Var}\left(Y_i\right)\)? Can you speculate as to why the sample mean was used as a plug-in for \(\lambda\)?
  3. In both case studies, try a different plug-in value for \(\lambda\). What do you notice? Can you speculate as to why the sample mean was used as a plug-in for \(\lambda\)?

Another early example involving the exponential density can be found in Case Study 4.2.4. In this case study, \[\mathfrak{F}=\left\{f_Y\left(y\right)=\lambda e^{−\lambda y}:\ \lambda>0\right\},\] where \(\lambda\) is a one-dimensional parameter of the model. You may find yourself asking similar questions as in the Poisson case studies.

Another example which has financial applications is discussed in the following business case study and Section 19.1 of Ruppert and Matteson (2015) (available only within XMU) about the concept of VaR or value-at-risk.

The last instances of the phrase “data model” show up in LM Chapter 8. LM state that, “… every set of measurements also has a certain overall structure, or design.” In this course, we pay attention to the following experimental designs: one-sample data, two-sample data, \(k\)-sample data, and paired data. For the current set of notes, I will focus on one-sample data.

I am now moving forward to three major topics – a study of the sample mean and its distribution, how this study of the sample mean specializes to normal models, and extending normal models to multivariate cases.

2 Statistics and sampling distributions

The word “statistics” can refer to the field of study or research or it can also refer to quantities we can calculate from the data.

Definition 2 A statistic or estimator \(\widehat{\theta}\) is a function of the data.

Compare this definition to the Comment on page 280. It is not crucial that a statistic be a function of a random sample; the emphasis on approximating a parameter was already discussed in the Introduction. Why should we even consider statistics or estimators in the first place?

To answer this question, we will explore a very common statistic: the sample mean. When given data, it is easy to compute the sample mean. But what do we learn from computing this sample mean? We need to understand that this statistic is a procedure or an algorithm which could be applied to data we have and other data we might not have on hand. The latter is more hypothetical.

2.1 The sample mean \(\overline{Y}\)

Consider a setting where we observe a sample \(\left(y_1,\ldots,y_n\right)\) which is a realization from the joint distribution of IID normal random variables \(\left(Y_1,\ldots, Y_n\right)\). Let \(\widehat{\theta}=\overline{Y}\). We can easily calculate the observed sample mean \(\overline{y}=\dfrac{1}{n}\displaystyle\sum_{i=1}^n y_i\) but what does probability theory tell us about \(\overline{Y}=\dfrac{1}{n}\displaystyle\sum_{i=1}^n Y_i\)?

Refer to Corollary 4.3.1. Let \(\mu\) be the common expected value of \(Y_i\) and \(\sigma^2\) be the common variance of \(Y_i\). Under the setting described, we can conclude that \[\overline{Y}\sim N\left(\mu,\frac{\sigma^2}{n}\right).\] The previous result describes the sampling distribution of the statistic \(\overline{Y}\). But what does the sampling distribution tell us?

Consider an example where \(n=5\), \(\mu=1\), and \(\sigma^2=4\). We are going to use R to obtain realizations from IID normal random variables. What you are going to see is what is called a Monte Carlo simulation. When we design and conduct Monte Carlo simulations in statistics, we generate artificial data in order to evaluate the performance of statistical procedures.

n <- 5 # sample size
mu <- 1 # common expected value 
sigma.sq <- 4 # common variance 
y <- rnorm(n, mu, sqrt(sigma.sq)) # note: rnorm() takes the standard deviation, not the variance
y # a realization
[1]  2.2669399  1.0548249 -0.2347173 -0.2792628  3.4235385
ybar <- mean(y) # observed sample mean
ybar
[1] 1.246265
# Obtain another realization
y <- rnorm(n, mu, sqrt(sigma.sq)) # note: rnorm() takes the standard deviation, not the variance
y # another realization
[1]  2.215230 -3.549275 -2.405212  2.235449  2.759766
ybar <- mean(y) # observed sample mean
ybar
[1] 0.2511917

As you may have noticed, the sample mean is used in two senses: one as an observed sample mean \(\overline{y}\) (or simply, the sample mean of a list of numbers) and the other as a procedure \(\overline{Y}\). Conceptually, we can repeatedly apply the procedure of calculating the sample mean to different realizations of IID normal random variables.

nsim <- 10^4 # number of realizations to be obtained
# repeatedly obtain realizations
ymat <- replicate(nsim, rnorm(n, mu, sqrt(sigma.sq)))
dim(ymat)
[1]     5 10000
ybars <- colMeans(ymat)
par(mfrow=c(2,2))
hist(ymat[, sample(1:10^4, 1)], freq = FALSE)
curve(dnorm(x, mu, sqrt(sigma.sq)), add = TRUE, col = "blue")
hist(ymat[, sample(1:10^4, 1)], freq = FALSE)
curve(dnorm(x, mu, sqrt(sigma.sq)), add = TRUE, col = "blue")
hist(ymat[, sample(1:10^4, 1)], freq = FALSE)
curve(dnorm(x, mu, sqrt(sigma.sq)), add = TRUE, col = "blue")
hist(ybars, freq = FALSE)
curve(dnorm(x, mu, sqrt(sigma.sq/n)), add = TRUE, col = "red")

The blue curves represent the density function of \(N(1,4)\). The red curve overlaid on the histogram of the ybars is the density function of \(N\left(1, 4/5\right)\). This is the sampling distribution of \(\overline{Y}\). This is what we theoretically expect to see when we look at the possible values \(\overline{Y}\) generates. But what we actually see in this simulation are the \(10^4\) observed sample means computed from realizations of IID \(N(1,4)\) random variables. You can think of this as a simulated sampling distribution or a computer-generated sampling distribution meant to verify what we expect to see theoretically.

c(mean(ybars), var(ybars), sd(ybars))
[1] 1.0027555 0.8043694 0.8968665

Finally, notice that we should see \(\mathbb{E}\left(\overline{Y}\right)=\mu\) and \(\mathsf{Var}\left(\overline{Y}\right)=\sigma^2/n\), which in this setting equal 1 and 4/5, respectively. The simulated values are not exactly equal to these, but they are close. A question to ponder is why we do not get the values exactly and what kind of accuracy and precision we can expect.

Exercise 2  

  1. Holding everything else fixed, rerun the R code after modifying it so that n equals \(100\).
  2. Set n back to 5, but now change nsim to \(10^6\). What do you notice? Compare with the case where nsim is equal to \(10^4\).
  3. Try running the code a number of times. You may notice that some of the digits are not changing, but others are. How can you use this finding to guide how you report results?

Exercise 3 From the result \[\overline{Y}\sim N\left(\mu,\frac{\sigma^2}{n}\right),\] you should be able to calculate probabilities involving \(\overline{Y}\) if you know \(\mu\), \(\sigma^2\), and \(n\). You can use the standard normal tables and you can also use R to calculate these probabilities.

  1. To use standard normal tables, you need a standard normal random variable. How do you transform \(\overline{Y}\) so that it will have a \(N(0,1)\) distribution? The transformed random variable is called a standardized statistic. In our simulation, we have \(\mu=1\), \(\sigma^2=4\), and \(n=5\). Calculate \(\mathbb{P}\left(\overline{Y}< 2\right)\), \(\mathbb{P}\left(\left|\overline{Y}-1\right|\geq 4/\sqrt{5} \right)\).
  2. Try modifying the pnorm() commands below so that you can compute the exact values of \(\mathbb{P}\left(\overline{Y}< 2\right)\) and \(\mathbb{P}\left(\left|\overline{Y}-1\right|\geq 4/\sqrt{5} \right)\) using R. Verify whether this matches what you found in the previous item. As a demonstration, consider \(X\sim N\left(2, 5\right)\). To find \(\mathbb{P}\left(X\leq 1\right)\) and \(\mathbb{P}\left(X> 1\right)\), you execute the following commands:
pnorm(1, mean = 2, sd = sqrt(5))
[1] 0.3273604
pnorm(1, mean = 2, sd = sqrt(5), lower.tail = FALSE)
[1] 0.6726396

Exercise 4 Let \(\left(Y_1,\ldots, Y_n\right)\) be IID Bernoulli random variables with a common probability of success \(p\). You are going to explore the sampling distribution of \(\overline{Y}\).

  1. What is the distribution of \(\displaystyle\sum_{i=1}^n Y_i\) where \(n\) is fixed?
  2. What would be the distribution of \(\overline{Y}\)?
  3. Compute the expected value and variance of both \(\displaystyle\sum_{i=1}^n Y_i\) and \(\overline{Y}\).
  4. Modify the code used to simulate IID normal random variables so that it handles the Bernoulli case, and compute the mean, variance, and standard deviation of the sampling distribution of \(\overline{Y}\). Try different values of \(n\) and \(p\). To simulate a Bernoulli random variable, we use rbinom(). For example,
# Toss 1 fair coin 10 times
rbinom(10, 1, 0.5)
 [1] 0 0 0 0 0 1 0 1 0 1
# Toss 2 fair coins independently 10 times and record the total successes
rbinom(10, 2, 0.5)
 [1] 1 0 2 2 2 2 2 1 2 2
  5. Plotting the simulated sampling distribution of \(\overline{Y}\) is straightforward. What is not straightforward is to superimpose the theoretical pmf of the sampling distribution of \(\overline{Y}\). Can you explain why?

2.2 Why should we find the sampling distribution of \(\overline{Y}\)?

From the previous discussion and the exercises, we see that the sampling distribution of \(\overline{Y}\) allows us to make probability statements. Are these actually useful?

At the minimum, the previous discussion allows us to say something about the moments of the sampling distribution of \(\overline{Y}\). As you recall, \(\mathbb{E}\left(\overline{Y}\right)=\mu\) and \(\mathsf{Var}\left(\overline{Y}\right)=\sigma^2/n\).

Exercise 5 We found the moments of the sampling distribution of \(\overline{Y}\) using Corollary 4.3.1 in LM. Determine the minimal set of assumptions needed to obtain the moments of the sampling distribution of \(\overline{Y}\). Do you need normality? Do you need IID?

Let us take a moment to understand the implications of \(\mathbb{E}\left(\overline{Y}\right)=\mu\). Note that \(\overline{Y}\) can be computed from the data you have (and, hypothetically, from other realizations). In contrast, \(\mu\) may be unknown to us; that is, \(\mu\) may be an unknown parameter.

One task of statistics is parameter estimation. The idea is to use the data to estimate a parameter. Recall that \(\overline{Y}\) can be thought of as a procedure. What makes this procedure desirable is that we can evaluate its behavior under repeated sampling (meaning looking at hypothetical realizations) and ask how this procedure enables us to learn unknown parameters. We proceed on two fronts.

2.2.1 Consistency

We are going to apply Chebyshev’s inequality, found in Theorem 5.7.1 of LM. Applying this inequality to \(\overline{Y}\), we obtain \[\mathbb{P}\left(|\overline{Y}-\mathbb{E}\left(\overline{Y}\right)|\geq \varepsilon\right) \leq \frac{\mathsf{Var}\left(\overline{Y}\right)}{\varepsilon^2} \Rightarrow \mathbb{P}\left(|\overline{Y}-\mu|\geq \varepsilon\right) \leq \frac{\sigma^2}{n\varepsilon^2} \] for all \(\varepsilon>0\). So for any choice of \(\varepsilon>0\), the sum of the probability that \(\overline{Y}\geq \mu +\varepsilon\) and the probability that \(\overline{Y}\leq \mu -\varepsilon\) is bounded above, and this upper bound becomes smaller as we make \(\varepsilon\) larger. In other words, for fixed \(\mu\), \(\sigma\), and \(n\), the event \(\left|\overline{Y}-\mu\right|\geq \varepsilon\), which is the event that \(\overline{Y}\) produces a realization outside of the interval \(\left(\mu-\varepsilon, \mu+\varepsilon\right)\), becomes less and less likely as \(\varepsilon\) grows.

Notice also that the upper bound is controlled by \(\mathsf{Var}\left(\overline{Y}\right)\). Furthermore, this upper bound goes to zero as \(n\to\infty\), holding everything else fixed. In other words, for any choice of \(\varepsilon>0\), \[\begin{eqnarray}\lim_{n\to\infty} \mathbb{P}\left(|\overline{Y}-\mu|\geq \varepsilon\right) \leq \lim_{n\to\infty}\frac{\sigma^2}{n\varepsilon^2} &\Rightarrow & \lim_{n\to\infty} \mathbb{P}\left(|\overline{Y}-\mu|\geq \varepsilon\right) \leq 0 \\ &\Rightarrow & \lim_{n\to\infty} \mathbb{P}\left(|\overline{Y}-\mu|\geq \varepsilon\right) = 0,\end{eqnarray}\] where the last implication uses the fact that probabilities are nonnegative. Thus, for any choice of \(\varepsilon>0\), the probability of the event that \(\overline{Y}\) produces a realization outside of the region \(\left(\mu-\varepsilon, \mu+\varepsilon\right)\) becomes negligible as \(n\to\infty\). This means that “\(\overline{Y}\) practically gives you \(\mu\)”, the latter being an unknown parameter. In this sense, we learn \(\mu\) by applying \(\overline{Y}\) once the sample size is large enough.
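As a quick illustration of how conservative Chebyshev’s bound can be (this comparison is mine, not LM’s), the sketch below uses the earlier setting \(\mu=1\), \(\sigma^2=4\), \(n=5\) and one particular choice of \(\varepsilon\).

eps <- 1.5
ybars <- colMeans(replicate(10^4, rnorm(5, 1, 2)))
mean(abs(ybars - 1) >= eps) # simulated P(|Ybar - mu| >= eps)
4/(5*eps^2) # Chebyshev upper bound sigma^2/(n*eps^2)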

What you have just encountered is sometimes phrased in different ways:

  1. \(\overline{Y}\) is a consistent estimator of \(\mu\) or \(\overline{Y}\) converges in probability to \(\mu\) (denoted as \(\overline{Y}\overset{p}{\to} \mu\)). Refer to Definition 5.7.1 of LM. Observe that the definition uses the complement of the event \(|\overline{Y}-\mu|\geq \varepsilon\).
  2. The weak law of large numbers: Let \(Y_1,\ldots, Y_n\) be IID random variables with finite mean \(\mu\). Then \(\overline{Y}\overset{p}{\to} \mu\).

Let us demonstrate the consistency property of \(\overline{Y}\) using R. Return to our example where \(\mu=1\), \(\sigma^2=4\), but this time we allow \(n\) to change from 5 to larger values. Although the histograms look the same, pay attention to the horizontal and vertical axes! If you push the sample size to even larger values, what do you think will happen?

par(mfrow=c(2,2))
hist(colMeans(replicate(nsim, rnorm(5, mu, sqrt(sigma.sq)))), freq = FALSE, xlab = "sample means", main = "Histogram of ybars")
curve(dnorm(x, mu, sqrt(sigma.sq/5)), add = TRUE, col = "red")
hist(colMeans(replicate(nsim, rnorm(20, mu, sqrt(sigma.sq)))), freq = FALSE, xlab = "sample means", main = "Histogram of ybars")
curve(dnorm(x, mu, sqrt(sigma.sq/20)), add = TRUE, col = "red")
hist(colMeans(replicate(nsim, rnorm(80, mu, sqrt(sigma.sq)))), freq = FALSE, xlab = "sample means", main = "Histogram of ybars")
curve(dnorm(x, mu, sqrt(sigma.sq/80)), add = TRUE, col = "red")
hist(colMeans(replicate(nsim, rnorm(320, mu, sqrt(sigma.sq)))), freq = FALSE, xlab = "sample means", main = "Histogram of ybars")
curve(dnorm(x, mu, sqrt(sigma.sq/320)), add = TRUE, col = "red")

Exercise 6 Under normality, we can have an exact value for \(\mathbb{P}\left(|\overline{Y}-\mu|\geq \varepsilon\right)\) provided that \(\varepsilon\) is specified in advance. Suppose you set \(\varepsilon=\sigma/\sqrt{n}\), then \(\mathbb{P}\left(|\overline{Y}-\mu|\geq \sigma/\sqrt{n}\right)\) can be computed exactly.

  1. Compute \(\mathbb{P}\left(|\overline{Y}-\mu|\geq \sigma/\sqrt{n}\right)\) under the assumption of normality of IID random variables \(Y_1,\ldots, Y_n\). Do you need to know the value of \(\mu\) and \(\sigma\) to compute the desired probability?
  2. What if \(\varepsilon\) was not set equal to \(\sigma/\sqrt{n}\)? How would you calculate \(\mathbb{P}\left(|\overline{Y}-\mu|\geq \varepsilon\right)\) under the same normality assumption?

2.2.2 Unbiasedness

We now think of what \(\mathbb{E}\left(\overline{Y}\right)= \mu\) means. We are going to use a result found in LM Exercise 3.6.13. In that exercise, we have a random variable \(X\) with finite mean \(\mathbb{E}\left(X\right)\) and a function \[\begin{eqnarray}g\left(a\right) &=& \mathbb{E}\left[\left(X-a\right)^2\right]\\ &=& \mathbb{E}\left[\left(X-\mathbb{E}\left(X\right)\right)^2\right]+\left(\mathbb{E}\left(X\right)-a\right)^2\end{eqnarray}.\] This function is the mean squared error or MSE of \(X\) about \(a\).

Exercise 7 Show that the minimizer of \(g\left(a\right)\) is \(a^*=\mathbb{E}\left(X\right)\) and that the minimized value of \(g\left(a\right)\) is \(g\left(a^*\right)=\mathsf{Var}\left(X\right)\).

The previous exercise provides an alternative interpretation of the expected value. It is the best or optimal prediction of some random variable \(X\) under a mean squared error criterion. That means that if you are asked to “guess” the value of a random variable and you suffer “losses” from not making a perfect guess according to MSE, then the best guess is the expected value. More importantly, the smallest “loss” is actually the variance.
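A quick numerical illustration of this interpretation, assuming the \(N(1,4)\) example used earlier: the Monte Carlo estimate of \(g(a)\) over a grid of guesses \(a\) is smallest near \(\mathbb{E}(X)=1\), and the minimized value is close to \(\mathsf{Var}(X)=4\).

x <- rnorm(10^5, 1, 2) # draws from N(1, 4)
a.grid <- seq(-1, 3, by = 0.1) # candidate guesses
g.hat <- sapply(a.grid, function(a) mean((x - a)^2)) # Monte Carlo estimate of g(a)
a.grid[which.min(g.hat)] # minimizer: close to E(X) = 1
min(g.hat) # minimized value: close to Var(X) = 4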

Applying Exercise 3.6.13 along with the alternative interpretation of the expected value, we can make sense of the expression \(\mathbb{E}\left(\overline{Y}\right)= \mu\). There are two parts to this expression. One is that \(\overline{Y}\) is a random variable and our best guess of its value is its expected value \(\mathbb{E}\left(\overline{Y}\right)\). Two, it so happens that it is equal to the common mean \(\mu\).

Had our best prediction \(\mathbb{E}\left(\overline{Y}\right)\) been larger than \(\mu\), then we “overshot” or systematically guessed something larger than what we are interested in, which is \(\mu\). Had our best prediction \(\mathbb{E}\left(\overline{Y}\right)\) been smaller than \(\mu\), then we “undershot” or systematically guessed something smaller than what we are interested in, which is \(\mu\). Therefore, it should be preferable to have \(\mathbb{E}\left(\overline{Y}\right)= \mu\).

There is also a connection between the finding \(\mathbb{E}\left(\overline{Y}\right)= \mu\) and the idea of unbiased estimation. Here, \(\overline{Y}\) is called an unbiased estimator of \(\mu\). Refer to Definition 5.4.1 on page 310 of LM. Note that the random sample requirement is not really needed in the definition, as long as the expected value of the estimator is equal to the unknown parameter of interest. \(\overline{Y}\) has a sampling distribution, and its “center” \(\mathbb{E}\left(\overline{Y}\right)\) should be at \(\mu\).

In general, unbiased estimators are not necessarily consistent and consistent estimators are not necessarily unbiased. But, there is a slight connection between unbiasedness and consistency (based on LM Exercise 5.7.4):

Definition 3 An estimator \(\widehat{\theta}\) is squared-error consistent for the parameter \(\theta\) if \[\lim_{n\to\infty} \mathbb{E}\left[\left(\widehat{\theta}-\theta\right)^2\right]=0\]

Recall the quantity \(g(a)\), which is the MSE of \(X\) about \(a\). This quantity is similar to \(\mathbb{E}\left[\left(\widehat{\theta}-\theta\right)^2\right]\), which is the MSE of \(\widehat{\theta}\) about \(\theta\). The context is slightly different because \(\theta\) is something we want to learn, but the underlying idea is the same. We can also write \[\mathbb{E}\left[\left(\widehat{\theta}-\theta\right)^2\right]=\mathbb{E}\left[\left(\widehat{\theta}-\mathbb{E}\left(\widehat{\theta}\right)\right)^2\right]+\left(\mathbb{E}\left(\widehat{\theta}\right)-\theta\right)^2=\mathsf{Var}\left(\widehat{\theta}\right)+\left(\mathbb{E}\left(\widehat{\theta}\right)-\theta\right)^2.\] The second term \(\mathbb{E}\left(\widehat{\theta}\right)-\theta\) is the bias of the estimator \(\widehat{\theta}\).

To show that an estimator is squared-error consistent, we just have to show that \[\lim_{n\to\infty}\mathsf{Var}\left(\widehat{\theta}\right)=0,\ \ \ \lim_{n\to\infty} \left(\mathbb{E}\left(\widehat{\theta}\right)-\theta\right)=0\] The latter expression is another way of saying that \(\widehat{\theta}\) is asymptotically unbiased for \(\theta\).
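The simulation sketch below (not a substitute for the derivations asked for in Exercise 8 below) estimates the MSE \(\mathbb{E}\left[\left(\overline{Y}-\mu\right)^2\right]\) for increasing \(n\) in the \(N(1,4)\) setting; the estimates shrink toward zero roughly like \(\sigma^2/n\).

mse.ybar <- function(n, nsim = 10^4, mu = 1, sigma = 2)
{
  ybars <- colMeans(replicate(nsim, rnorm(n, mu, sigma)))
  mean((ybars - mu)^2) # Monte Carlo estimate of E[(Ybar - mu)^2]
}
sapply(c(5, 20, 80, 320), mse.ybar) # compare with sigma^2/n = 4/n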

Exercise 8  

  1. Show that whenever \(Y_1,\ldots, Y_n\) are IID random variables with mean \(\mu\) and finite variance, then \(\overline{Y}\) is squared-error consistent for \(\mu\).
  2. Using Chebyshev’s inequality, show that if an estimator \(\widehat{\theta}\) is squared-error consistent for the parameter \(\theta\), then \(\widehat{\theta}\overset{p}{\to} \theta\).

2.3 Measuring uncertainty

From a prediction angle, the smallest “loss” we incur from not knowing \(\overline{Y}\) for sure, when our best guess is \(\mathbb{E}\left(\overline{Y}\right)=\mu\), is \(\mathsf{Var}\left(\overline{Y}\right)\), which happens to be \(\sigma^2/n\). Even though we get to see the observed sample mean \(\overline{y}\), we only know the statistical properties of \(\overline{Y}\).

From a sampling distribution angle, \(\mathsf{Var}\left(\overline{Y}\right)\) is a measure of spread. By definition, the variance of \(\overline{Y}\) is the “average” spread of \(\overline{Y}\) about its mean: \[\mathsf{Var}\left(\overline{Y}\right)=\mathbb{E}\left[\left(\overline{Y}-\mathbb{E}\left(\overline{Y}\right)\right)^2\right]\] Therefore, it measures a “typical” distance of \(\overline{Y}\) relative to \(\mathbb{E}\left(\overline{Y}\right)=\mu\). But this distance is in squared units, so it is not on the same scale as the \(Y\)’s. Thus, taking a square root puts things back on the original scale.

As a result, the quantity \(\sqrt{\mathsf{Var}\left(\overline{Y}\right)}=\sigma/\sqrt{n}\) is so important that it is given a specific name: the standard error of the sample mean. This is nothing but the standard deviation of the sampling distribution of \(\overline{Y}\). So why give it a new name? The reason is that there is also a standard deviation for a list of numbers, just like there is a mean for a list of numbers. Suppose you have a list of numbers \(\{y_1,\ldots,y_n\}\). The standard deviation of this list is computed as \[\sqrt{\frac{1}{n-1}\sum_{i=1}^n \left(y_i-\overline{y}\right)^2}.\] The difference is that the preceding formula is for numbers that you have, while \(\sigma/\sqrt{n}\) is for hypothetical sample means we could have gotten. Be very aware of this distinction, which the sketch below also illustrates.
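To see the distinction concretely, here is a small sketch in the same \(N(1,4)\), \(n=5\) setting: sd() applied to one realization measures the spread of the numbers you have, while the spread of many simulated sample means approximates the standard error \(\sigma/\sqrt{n}=2/\sqrt{5}\approx 0.894\).

y <- rnorm(5, 1, 2) # one realization
sd(y) # standard deviation of the list of numbers on hand
ybars <- colMeans(replicate(10^4, rnorm(5, 1, 2)))
sd(ybars) # close to the standard error 2/sqrt(5)
2/sqrt(5)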

Chebyshev’s inequality is also tied to \(\mathsf{Var}\left(\overline{Y}\right)\) as you have seen earlier. If we set \(\varepsilon=c\sigma/\sqrt{n}\) where \(c\) is a positive constant (a factor of a standard error), then \[\begin{eqnarray}\mathbb{P}\left(\left|\overline{Y}-\mathbb{E}\left(\overline{Y}\right)\right|\geq \frac{c\sigma}{\sqrt{n}}\right) \leq \frac{\mathsf{Var}\left(\overline{Y}\right)}{c^2\sigma^2/n} &\Rightarrow & \mathbb{P}\left(\left|\frac{\overline{Y}-\mu}{\sigma/\sqrt{n}}\right|\geq c\right) \leq \frac{1}{c^2} \\ &\Rightarrow & \mathbb{P}\left(\left|\frac{\overline{Y}-\mu}{\sigma/\sqrt{n}}\right|\leq c\right) \geq 1-\frac{1}{c^2}.\end{eqnarray}\] Notice that we have probabilities involving a standardized quantity \(\dfrac{\overline{Y}-\mu}{\sigma/\sqrt{n}}\).

If we take \(c=2\), the probability that \(\overline{Y}\) will be within 2 standard errors of \(\mu\) is greater than or equal to 75%. If we take \(c=3\), the probability that \(\overline{Y}\) will be within 3 standard errors of \(\mu\) is greater than or equal to \(1-1/9\approx 88.9\%\). Therefore, \(\sigma/\sqrt{n}\) can be thought of as an appropriate scale to use to measure how far \(\overline{Y}\) is from \(\mu\). It becomes a way to measure our uncertainty about \(\overline{Y}\).
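Under the additional assumption of normality we can compare these Chebyshev lower bounds with the exact probabilities, since the standardized quantity is \(N(0,1)\); the short sketch below is just this comparison.

c.vals <- c(2, 3)
1 - 1/c.vals^2 # Chebyshev lower bounds: 0.75 and about 0.889
pnorm(c.vals) - pnorm(-c.vals) # exact normal probabilities: about 0.954 and 0.997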

Exercise 9 Revisit Exercise 6. We can actually calculate the exact probability, rather than just an upper bound, in the case of IID normal random variables, because the standardized quantity \(\dfrac{\overline{Y}-\mu}{\sigma/\sqrt{n}}\) has a standard normal distribution. Because this distribution does not depend on unknown parameters, we call \(\dfrac{\overline{Y}-\mu}{\sigma/\sqrt{n}}\) a pivotal quantity or a pivot.

Exercise 10 Recall Exercise 3. You are going to add lines of code to the simulation and use the simulated sample means found in ybars to approximate \(\mathbb{P}\left(\overline{Y}< 2\right)\), \(\mathbb{P}\left(\left|\overline{Y}-1\right|\geq 4/\sqrt{5} \right)\). You computed the exact probabilities in Exercise 3, but, for the moment, pretend you do not know these exact probabilities. Report an estimate and a standard error.

The prediction angle and the sampling distribution angle actually coincide but come from different starting points. More importantly, both quantify how much we “lose” because of the randomness of \(\overline{Y}\). For sure, we can learn about the unknown parameter \(\mu\) under certain conditions, but we pay a price for using the data to learn it. More importantly, the price you pay really depends on the conditions you impose.

The standard error formula \(\sigma/\sqrt{n}\) only works in the IID case. Once you depart from the IID case, things become more complicated. Let \(Z_1,Z_2,\ldots,Z_n\) be random variables where \(Z_t=Y_t+0.5Y_{t-1}\) and \(Y_0,Y_1,\ldots Y_n\) are IID normal random variables \(N\left(\mu,\sigma^2\right)\). Will it be the case that \[\overline{Z} \sim N\left(\mu,\frac{\sigma^2}{n}\right)?\] Let us look at a simulation where \(\mu=0\), \(\sigma^2=4\), and \(n=20\).

# Create function to generate realizations of Z
generate.z <- function(n, mu, sigma.sq)
{
  y0 <- rnorm(1, mu, sqrt(sigma.sq))
  y <- rnorm(n, mu, sqrt(sigma.sq))
  z <- numeric(n)
  z[1] <- y[1]+0.5*y0
  for(j in 2:n)
  {
    z[j] <- y[j]+0.5*y[(j-1)]
  }
  return(z)
}
nsim <- 10^4 # number of realizations to be obtained
# repeatedly obtain realizations
zmat <- replicate(nsim, generate.z(20, 0, 4))
zbars <- colMeans(zmat)
c(mean(zbars), var(zbars), sd(zbars))
[1] -0.001614658  0.440703143  0.663854760
hist(zbars, freq = FALSE)
curve(dnorm(x, 0, sqrt(4/20)), add = TRUE, col = "red")

Notice that the mean of the sampling distribution of \(\overline{Z}\) matches the mean of each of the \(Z_i\)’s. However, if you blindly apply the “formulas”, the standard error based on IID conditions, \(\sigma/\sqrt{n}\approx 0.45\), is far off from the standard deviation of the simulated sampling distribution, which is about 0.66: the correct value is roughly 48% larger than the IID formula suggests. Therefore, we have to think about how to adjust the formula for the standard error of the sample mean to accommodate non-IID situations. We do not cover how to make the adjustment in this course, but you will learn more about this in future courses.

Exercise 11  

  1. Adjust the code so that you simulate IID normal random variables with \(\mu=1\), holding everything else constant.

  2. Explore how things change with a larger sample size. Will a larger \(n\) make the issue with the incorrect standard error formula magically disappear?

  3. Let \(Y_0,Y_1,\ldots Y_n\) be IID normal random variables \(N\left(\mu,\sigma^2\right)\). You are now going to derive some theory.

    1. Are \(Z_1,\ldots,Z_n\) IID random variables? Explain.
    2. Find \(\mathbb{E}\left(Z_t\right)\) and \(\mathsf{Var}\left(Z_t\right)\).
    3. Find \(\mathsf{Cov}\left(Z_t,Z_{t-1}\right)\) and show that \(\mathsf{Cov}\left(Z_t,Z_{t-j}\right)=0\) for all \(|j|> 1\).
    4. Find \(\mathbb{E}\left(\overline{Z}\right)\) and \(\mathsf{Var}\left(\overline{Z}\right)\).

2.4 Looking forward to generalizing our main example

The bottom line from the previous discussions is that the sampling distribution of some statistic is the distribution of the statistic under repeated sampling. Repeated sampling does not necessarily mean random sampling. What this means is that we have multiple realizations of a set of random variables and we apply the statistic to these multiple realizations. In a sense, these multiple realizations are really hypothetical. We only get to see one realization and that is the observed data. The core idea of understanding a statistic as a procedure which has properties under repeated sampling is the essence of frequentist inference.

We now contemplate a setting slightly more general than our main example: we observe a sample \(\left(y_1,\ldots,y_n\right)\) which is a realization from the random variables \(\left(Y_1,\ldots, Y_n\right)\). These random variables have some joint distribution \(f\left(y_1,\ldots,y_n;\theta\right)\) where \(\theta\in\Theta\) and \(\Theta\) is the parameter space.

What makes this generalization harder is that even if we specify what \(f\) is, we still have to figure out what parameter to estimate. After that, we have to either obtain the sampling distribution of the estimator \(\widehat{\theta}\) or establish whether the estimator \(\widehat{\theta}\) is unbiased for \(\theta\) and/or consistent for \(\theta\). We also have to derive a standard error so that we can measure our uncertainty about \(\widehat{\theta}\).

In this course, I will discuss general-purpose methods of estimation which will allow us to estimate \(\theta\) for any \(f\) you specify in advance. These general-purpose methods allow for a unified theory, so that we need not demonstrate good statistical properties on a case-by-case basis. But in this set of notes, we will dig a bit deeper into the IID normal case.

3 Digging deeper into normal models

I am now going to dig deeper into normal random variables and their related extensions for the following reasons:

  1. To introduce you to classical results about the behavior of statistical procedures when the sample size \(n\) is fixed or finite: normal models form the basis of what are called “exact results”.
  2. To prepare you for studying the asymptotic behavior (sample size \(n\to\infty\)) of statistical procedures.
  3. To show you the advantages and disadvantages of using the normal distribution as a modeling device.
  4. To point out that normal random variables are convenient to work with, but that you need to know at what cost.
  5. To introduce the distributions that arise from the normal.

3.1 Exact results: A preview

3.1.1 An example of a confidence interval

Another task of statistics is interval estimation. Interval estimation differs from parameter estimation in that we want a sense of the precision of our estimators. A fair question to ask is why we need interval estimation when we already have a standard error. An interval estimate carries more content: it provides a range of plausible values for the unknown parameter of interest. Not surprisingly, standard errors will play a crucial role in interval estimation.

We return to our context where we observe a sample \(\left(y_1,\ldots,y_n\right)\) which is a realization from the IID normal random variables \(\left(Y_1,\ldots, Y_n\right)\). What does probability theory tell us about \(\overline{Y}\)? Recall that \[\overline{Y}\sim N\left(\mu,\frac{\sigma^2}{n}\right) \Rightarrow \frac{\overline{Y}-\mu}{\sigma/\sqrt{n}} \sim N\left(0,1\right).\]

From the normality of the sampling distribution of \(\overline{Y}\) (see Exercise 3 and Exercise 6), you have seen that you can make exact probability statements involving \(\overline{Y}\), rather than just bounds from Chebyshev’s inequality. For example, \[\mathbb{P}\left(\left|\frac{\overline{Y}-\mu}{\sigma/\sqrt{n}}\right| \geq 2 \right)\approx 0.05.\] This probability statement gives us another way to quantify our uncertainty about not knowing \(\mu\) for sure. In particular, we can rewrite the previous probability statement as \[\begin{eqnarray}\mathbb{P}\left(\left|\frac{\overline{Y}-\mu}{\sigma/\sqrt{n}}\right| \leq 2 \right) \approx 0.95 &\Leftrightarrow & \mathbb{P}\left(-2\cdot\frac{\sigma}{\sqrt{n}} \leq \overline{Y}-\mu\leq 2\cdot\frac{\sigma}{\sqrt{n}} \right) \approx 0.95 \\ &\Leftrightarrow & \mathbb{P}\left(\overline{Y}-2\cdot\frac{\sigma}{\sqrt{n}} \leq \mu\leq \overline{Y}+2\cdot\frac{\sigma}{\sqrt{n}} \right) \approx 0.95. \end{eqnarray}\] We have to be careful in expressing the last probability statement in words. The rewritten probability statement is saying that the interval \(\left(\overline{Y}-2\cdot\dfrac{\sigma}{\sqrt{n}}, \overline{Y}+2\cdot\dfrac{\sigma}{\sqrt{n}}\right)\), whose endpoints are random variables, is able to “capture” or “cover” the unknown parameter \(\mu\) with 95% probability.

We could have rewritten the probability statement as \[\mathbb{P}\left(\mu \in \left(\overline{Y}-2\cdot\frac{\sigma}{\sqrt{n}}, \overline{Y}+2\cdot\frac{\sigma}{\sqrt{n}}\right) \right) \approx 0.95\] but writing the statement this way introduces potential misinterpretation. You might read the statement as “there is a 95% probability that \(\mu\) is inside the interval \(\left(\overline{Y}-2\cdot\dfrac{\sigma}{\sqrt{n}}, \overline{Y}+2\cdot\dfrac{\sigma}{\sqrt{n}}\right)\)”. This is incorrect, and we will explore why in the following simulation.

We will be constructing these intervals repeatedly for different hypothetical realizations. Set \(n=5\), \(\mu=1\), and \(\sigma^2=4\). We are going to pretend to know the value of \(\sigma^2\) when constructing the interval. It may feel weird, but it is a simplification so that we can directly illustrate the source of misinterpretation.

n <- 5
mu <- 1
sigma.sq <- 4
# A realization of IID normal random variables
y <- rnorm(n, mu, sqrt(sigma.sq))
# Construct the interval
c(mean(y)-2*2/sqrt(n), mean(y)+2*2/sqrt(n))
[1] -1.604906  1.972803

Is \(\mu=1\) in the interval? So where does the probability 95% come in? Simulate another realization and construct another interval of the same form.

# A realization of IID normal random variables
y <- rnorm(n, mu, sqrt(sigma.sq))
# Construct the interval
c(mean(y)-2*2/sqrt(n), mean(y)+2*2/sqrt(n))
[1] -0.2860722  3.2916366

Is \(\mu=1\) in the interval? Let us repeat the construction of the interval 10000 times.

# Create a function depending on sample size
cons.ci <- function(n)
{
  y <- rnorm(n, mu, sqrt(sigma.sq))
  c(mean(y)-2*2/sqrt(n), mean(y)+2*2/sqrt(n))
}
# Construct interval 10000 times
results <- replicate(10^4, cons.ci(5))
# Calculate the proportion of intervals that contain mu = 1
mean(results[1,] < 1 & 1 < results[2,] )
[1] 0.9545

Let us visualize the first 100 intervals.

require(plotrix)
Loading required package: plotrix
center <- (results[1,]+results[2,])/2
# Plot of the first 100 confidence intervals
plotCI(1:100, center[1:100], li = results[1,1:100], ui = results[2,1:100], xlab = "", ylab = "", sfrac = 0.001, pch = 20, cex = 0.5)
abline(h = mu, col = "red")

Notice that sometimes the interval computed for a particular realization is able to “capture” \(\mu\), but there are times when the interval is not able to “capture” \(\mu\). We found from the simulation that roughly 95% of the computed intervals contain \(\mu\). This is where the 95% gets its meaning. Furthermore, any particular computed interval either contains \(\mu\) or it does not. More importantly, for any specific interval you calculate using the data on hand, you will never know whether or not it has “captured” \(\mu\).

3.1.2 Constructing confidence intervals for \(\mu\) when \(\sigma^2\) is known

What you have seen so far is a statistical procedure called a confidence interval.

Definition 4 Let \(\theta\) be an unknown parameter of interest and \(\alpha\in (0,1)\). Let \(L\) and \(U\) be statistics. \((L, U)\) is a \(100\left(1-\alpha\right)\%\) confidence interval for \(\theta\) if \[\mathbb{P}\left(L\leq \theta \leq U\right)= 1-\alpha\] for every \(\theta\in\Theta\).

When \(Y_1,\ldots, Y_n\) are IID normal random variables, we can construct a \(100\left(1-\alpha\right)\%\) confidence interval for \(\mu\) in a more general way. So we need to find constants (not depending on parameters or the data) \(c_l\) and \(c_u\) such that \[ \mathbb{P}\left(c_l \leq \frac{\overline{Y}-\mu }{\sigma/\sqrt{n}}\leq c_u\right) = 1-\alpha.\] These two constants exist because \(\dfrac{\overline{Y}-\mu }{\sigma/\sqrt{n}}\) is a pivotal quantity. Thus, a \(100\left(1-\alpha\right)\%\) confidence interval for \(\mu\) is \[ \left(\overline{Y}-c_u\frac{\sigma}{\sqrt{n}} , \overline{Y} - c_l\frac{\sigma}{\sqrt{n}} \right).\]

A popular choice is to fix \(c_l\) and \(c_u\) so that you have a symmetric \(100\left(1-\alpha\right)\)% confidence interval. What this means is that \[\mathbb{P}\left(c_l \geq \frac{\overline{Y}-\mu }{\sigma/\sqrt{n}}\right)=\alpha/2, \ \ \ \mathbb{P}\left( \frac{\overline{Y}-\mu }{\sigma/\sqrt{n}}\geq c_u\right)=\alpha/2.\] So, we can set \(c_u=z_{\alpha/2}\) and \(c_l=-z_{\alpha/2}\) where \(z_{\alpha/2}\) is the value for which \(\mathrm{P}\left(Z\geq z_{\alpha/2}\right)=\alpha/2\) with \(Z\sim N(0,1)\). In other words, \(z_{\alpha/2}\) is the \((1- \alpha/2)\)-quantile of the standard normal. Choosing \(c_l\) and \(c_u\) to have a symmetric \(100\left(1-\alpha\right)\)% confidence interval ensures that \(c_u-c_l\) is at its shortest.
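In R, \(z_{\alpha/2}\) is obtained with qnorm(). The sketch below constructs a symmetric 95% interval for one realization from the earlier setting (known \(\sigma^2=4\), \(n=5\)); it matches the intervals computed before, except that the rough multiplier 2 is replaced by the exact value \(z_{0.025}\approx 1.96\).

alpha <- 0.05
z <- qnorm(1 - alpha/2) # z_{alpha/2}, approximately 1.96
y <- rnorm(5, 1, 2) # a realization with n = 5, mu = 1, sigma = 2
c(mean(y) - z*2/sqrt(5), mean(y) + z*2/sqrt(5)) # symmetric 95% CI with known sigma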

The exact confidence statement can be made under the assumption of normality and a known variance \(\sigma^2\). This exact statement is of limited practical use because it asks us to pretend that \(\sigma^2\) is known even though \(\mu\) is unknown.

  1. If you lose normality (whether or not the variance is known), the interval is no longer exactly valid, but it may be approximately justified when sample sizes are large. You may also have to find a different pivotal quantity.
  2. If you lose the known variance (provided that normality still holds), the interval is also no longer valid, and \(c_l\) and \(c_u\) have to be adjusted to compensate.

Exercise 12 (Based on Stine and Foster (2004)) Hoping to lure more shoppers downtown, a city builds a new public parking garage in the central business district. The city plans to pay for the structure through parking fees. In their study of past performance of other parking garages in other shopping areas, the city records indicate that the standard deviation of daily fees is 150 dollars. During the past 44 weekdays, daily fees collected averaged 1264 dollars.

  1. What assumptions and conditions would you need to make in order to use these statistics for inference?
  2. Describe the parameter of interest to the city. Be specific.
  3. Compute a 95% confidence interval for the parameter in item 2.
  4. Do you think the true value of the parameter in item 2 is in the confidence interval you computed in item 3?

3.1.3 Testing claims about \(\mu\) if \(\sigma^2\) is known

Another task of statistics is to test claims or hypotheses we may have about the parameters of a model. There are situations where we may have an idea about the value of some parameters in a model. In quality control, for instance, firms have a record of past production and may want to check whether a measurable aspect of a recent batch of products is in line with previously established quality guidelines. Another situation is where we want to know whether the observed data are consistent with a model whose parameters are set to some known values. This model might be based on subject-matter knowledge.

Suppose we want to test the claim that \(\mu=\mu_0\) where \(\mu_0\) is known to us in advance (meaning before we actually see the data). We get to observe \(\overline{y}\) and ask whether what we observed is compatible with a model with IID normal random variables \(N\left(\mu_0,\sigma^2\right)\), where \(\sigma^2\) is known. One way to answer this question is to calculate the probability of observing values of \(\overline{Y}\) as extreme or more extreme than what we have observed under the assumption that \(\mu=\mu_0\) is true. This probability is called the observed level of significance or \(p\)-value. A small \(p\)-value means that what we have observed is very unlikely under the assumption that \(\mu=\mu_0\) is true. A large \(p\)-value means that what we have observed is compatible with the assumption that \(\mu=\mu_0\) is true.

We will return to hypothesis testing later, as we need to understand why this procedure works and when it could be applied. For example, you should be wondering how small is small and how large is large.

We end this short introduction to the idea of hypothesis testing with an illustration and simulation. Consider LM Example 6.2.2. In that example, we are supposed to evaluate the claim that \(\mu=494\), meaning that the results of the new curriculum (which produced exam scores with an observed sample mean of 502) are compatible with past performance without the new curriculum. The example assumes normality of exam scores and that \(\sigma=124\) is known. The \(p\)-value was calculated to be 0.55. In the calculation, the meaning of “as extreme or more extreme” allows for two directions.

We can also obtain a similar \(p\)-value using simulation. What you are seeing is a simple form of what some call simulation-based inference.

nsim <- 10^4
# Take note that we created artificial data assuming mu = 494
scoremat <- replicate(nsim, rnorm(86, mean = 494, sd = 124))
# Calculate the simulated sample means
sim.scores <- colMeans(scoremat)
# Count how many of the simulated sample means exceed 502
mean(sim.scores >= 502)
[1] 0.2718
# The simulated p-value which should be close to the calculation
mean(abs((sim.scores-494)/(124/sqrt(86)))>(502-494)/(124/sqrt(86)))
[1] 0.5506
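For comparison, the exact two-sided \(p\)-value can be computed directly from the standard normal distribution; this reproduces the 0.55 reported in LM Example 6.2.2 and approximated by the simulation above.

z.obs <- (502 - 494)/(124/sqrt(86)) # observed standardized statistic
2*pnorm(z.obs, lower.tail = FALSE) # exact two-sided p-value, about 0.55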

The bottom line from this illustration is that an observed sample mean of 502 is not statistically distinguishable from 494. In other words, 502 is “typical” if we believe that \(\mu=494\) is true.

3.2 Exact results: The more realistic case

3.2.1 What happens if \(\sigma^2\) is unknown?

Under IID normal random variables with \(\sigma^2\) known, a \(100\left(1-\alpha\right)\%\) confidence interval for \(\mu\) is given by \[\left(\overline{Y}-z_{\alpha/2}\frac{\sigma}{\sqrt{n}} , \overline{Y} +z_{\alpha/2}\frac{\sigma}{\sqrt{n}} \right).\]

If we do not know \(\sigma\), we need to plug in a value for it. Like \(\mu\), \(\sigma\) is an unknown parameter. So, just as we searched for a statistic to learn \(\mu\), we have to search for a statistic which could help us learn \(\sigma\).

A natural starting point is to understand what \(\sigma\) is. It is the square root of the common variance \(\mathsf{Var}\left(Y_i\right)=\sigma^2\). By the definition of the variance, \(\mathsf{Var}\left(Y_i\right)=\mathbb{E}\left[\left(Y_i-\mathbb{E}\left(Y_i\right)\right)^2\right]\). The weak law of large numbers suggests a possible way to estimate \(\sigma^2\). Recall that to estimate \(\mathbb{E}\left(Y_i\right)\), we can use the sample mean of the \(Y_i\)’s. Extending this idea to \(\sigma^2\), we can use the sample mean of the \(\left(Y_i-\mathbb{E}\left(Y_i\right)\right)^2\)’s. The only problem is that \(\mathbb{E}\left(Y_i\right)\) is unknown, but it can be estimated quite well. Taking all of this together, a possible candidate to estimate \(\sigma^2\) is \[\frac{1}{n}\sum_{i=1}^n \left(Y_i-\overline{Y}\right)^2.\] For the development of the theory for normal models, we are going to use a related candidate, which is \[S^2=\frac{1}{n-1}\sum_{i=1}^n \left(Y_i-\overline{Y}\right)^2.\] The question is whether \(S^2\) and \(S\) help us to learn \(\sigma^2\) and \(\sigma\).
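As a first numerical check (a sketch in the \(N(1,4)\), \(n=5\) setting, not a proof), the simulation below compares the averages of the two candidate estimators across many realizations with the true value \(\sigma^2=4\); the divisor-\(n-1\) candidate \(S^2\) averages close to 4, while the divisor-\(n\) candidate averages closer to \((n-1)\sigma^2/n=3.2\).

ymat <- replicate(10^4, rnorm(5, 1, 2))
s2 <- apply(ymat, 2, var) # var() uses the divisor n - 1, so this is S^2
mean(s2) # close to sigma^2 = 4
mean(s2*(5 - 1)/5) # the divisor-n candidate: closer to 3.2 than to 4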

We can use simulation to compare the sampling distributions of \(\dfrac{\overline{Y}-\mu}{\sigma/\sqrt{n}}\) and \(\dfrac{\overline{Y}-\mu}{S/\sqrt{n}}\). We already know that under the assumption of IID normal random variables, \(\dfrac{\overline{Y}-\mu}{\sigma/\sqrt{n}}\sim N(0,1)\). But we do not know anything about the distribution of \(\dfrac{\overline{Y}-\mu}{S/\sqrt{n}}\). The difference is that \(\sigma/\sqrt{n}\) is not random, but \(S/\sqrt{n}\) is a random variable.

n <- 5
mu <- 1
sigma.sq <- 4
nsim <- 10^4 # number of realizations to be obtained
# repeatedly obtain realizations
ymat <- replicate(nsim, rnorm(n, mu, sqrt(sigma.sq)))
ybars <- colMeans(ymat)
sds <- apply(ymat, 2, sd)
par(mfrow=c(1,2))
hist((ybars-mu)/(sqrt(sigma.sq)/sqrt(n)), freq = FALSE, main = "Known sigma.sq")
curve(dnorm(x, 0, 1), add = TRUE, col = "red")
hist((ybars-mu)/(sds/sqrt(n)), freq = FALSE, main = "Plug-in")
curve(dnorm(x, 0, 1), add = TRUE, col = "red")

The red curve is the density function for \(N(0,1)\). If you compare the two histograms, you would notice that the red curve “fits” better for \(\dfrac{\overline{Y}-\mu}{\sigma/\sqrt{n}}\) than \(\dfrac{\overline{Y}-\mu}{S/\sqrt{n}}\). There is less concentration around 0 for the histogram on the right and more mass on the extremes. This means that extreme positive and negative realizations of \(\dfrac{\overline{Y}-\mu}{S/\sqrt{n}}\) are more likely for the histogram on the right compared to the histogram on the left.

We can explore what happens when we increase sample size \(n=5\) to \(n=20\).

n <- 20
mu <- 1
sigma.sq <- 4
nsim <- 10^4 # number of realizations to be obtained
# repeatedly obtain realizations
ymat <- replicate(nsim, rnorm(n, mu, sqrt(sigma.sq)))
ybars <- colMeans(ymat)
sds <- apply(ymat, 2, sd)
par(mfrow=c(1,2))
hist((ybars-mu)/(sqrt(sigma.sq)/sqrt(n)), freq = FALSE, main = "Known sigma.sq")
curve(dnorm(x, 0, 1), add = TRUE, col = "red")
hist((ybars-mu)/(sds/sqrt(n)), freq = FALSE, main = "Plug-in")
curve(dnorm(x, 0, 1), add = TRUE, col = "red")

Exercise 13 Modify the previous code and explore what happens when \(n=80\) and \(n=320\).

You will notice from the previous exercise that when \(n\to\infty\), the approximation to the histogram of standardized sample means by the red curve gets better. But we still have to address what happens for finite \(n\). There are two approaches to pursue:

  1. Exploit the normality assumption and obtain the exact finite-sample distribution of \(\dfrac{\overline{Y}-\mu}{S/\sqrt{n}}\).
  2. Derive a central limit theorem for \(\dfrac{\overline{Y}-\mu}{S/\sqrt{n}}\).

3.2.2 Deriving the distribution of \(\dfrac{\overline{Y}-\mu}{S/\sqrt{n}}\)

3.2.2.1 A detour to some linear algebra

We will discuss the first approach, but it requires a brief detour to introduce a special type of linear transformation called an orthogonal transformation. Notice that \(\overline{Y}\) and \(S\) algebraically depend on each other. Applying an orthogonal transformation to IID normally distributed random variables allows us to show that these two quantities do not statistically depend on each other.

Consider the case where we are linearly transforming a vector \(\left(Y_1,Y_2\right)\) to another vector \(\left(V_1,V_2\right)\). We can write the transformation as a system of equations: \[V_1=a_{11}Y_1+a_{12}Y_2, \ \ V_2=a_{21}Y_1+a_{22} Y_2\] or in matrix form: \[\begin{bmatrix}V_1\\ V_2\end{bmatrix}=\underbrace{\begin{bmatrix}a_{11} & a_{12} \\ a_{21} & a_{22}\end{bmatrix}}_{\mathbf{A}}\begin{bmatrix} Y_1 \\ Y_2 \end{bmatrix}.\] What makes a linear transformation an orthogonal transformation? In other words, what restrictions should be placed on \(\mathbf{A}\) so that it represents an orthogonal transformation? I now introduce some linear algebra concepts in two-dimensional Euclidean space \(\mathbb{R}^2\), but these extend to \(n\)-dimensional Euclidean space \(\mathbb{R}^n\).

Definition 5  

  1. The length or the Euclidean norm of a vector \(\left(Y_1,Y_2\right)\) is given by \(\sqrt{Y_1^2+Y_2^2}\).
  2. The dot product of two vectors \(\left(Y_1,Y_2\right)\) and \(\left(X_1,X_2\right)\) is given by \(X_1Y_1+X_2Y_2\).
  3. \(\left(Y_1,Y_2\right)\) and \(\left(X_1,X_2\right)\) are orthogonal to each other if and only if their dot product is zero.
  4. A \(2\times 2\) matrix \(\mathbf{A}\) is orthogonal if and only if it preserves lengths, i.e., for every vector \(\left(Y_1,Y_2\right)\), the length of \(\left(V_1,V_2\right)\) equals the length of \(\left(Y_1,Y_2\right)\): \[Y_1^2+Y_2^2=V_1^2+V_2^2.\]
  5. A \(2\times 2\) matrix \(\mathbf{A}\) is the identity matrix if \(a_{ii}=1\) for all \(i\) and \(a_{ij}=0\) for all \(i\neq j\). We denote the identity matrix by \(\mathbf{I}\).
  6. A transpose \(\mathbf{A}^\prime\) of a \(2\times 2\) matrix \(\mathbf{A}\) is formed by changing columns into rows and rows into columns so that the \((i,j)\)th entry of \(\mathbf{A}^\prime\) is the \((j,i)\)th entry of \(\mathbf{A}\).
  7. The inverse of a \(2\times 2\) matrix \(\mathbf{A}\) is a \(2\times 2\) matrix \(\mathbf{A}^{-1}\) which satisfies \(\mathbf{A}\mathbf{A}^{-1}=\mathbf{A}^{-1}\mathbf{A}=\mathbf{I}\).

Based on the previous definition, we have the following restrictions on an orthogonal matrix \(\mathbf{A}\): \[\begin{eqnarray}a_{11}^2+a_{21}^2 &=& 1\\ a_{12}^2+a_{22}^2 &=& 1 \\ a_{11}a_{12}+a_{21}a_{22} &=& 0\end{eqnarray}\] We can rewrite these restrictions as restrictions involving the columns of the matrix \(\mathbf{A}\). The first and second restrictions imply that the columns of \(\mathbf{A}\) have length equal to 1. The third restriction implies that the columns of \(\mathbf{A}\) are orthogonal to each other.
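As a concrete instance, a \(2\times 2\) rotation matrix satisfies all three restrictions; the sketch below (my example, not LM’s) checks them numerically for one choice of angle.

theta <- pi/6
A <- matrix(c(cos(theta), sin(theta), -sin(theta), cos(theta)), nrow = 2) # a rotation matrix
sum(A[, 1]^2) # a11^2 + a21^2 = 1
sum(A[, 2]^2) # a12^2 + a22^2 = 1
sum(A[, 1]*A[, 2]) # a11*a12 + a21*a22 = 0
t(A) %*% A # the identity matrix, up to rounding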

Exercise 14 Let \(\mathbf{A}\) be an orthogonal matrix.

  1. Show that the entries of \(\mathbf{A}\) will satisfy the restrictions mentioned earlier.
  2. Show that \(\mathbf{A}^\prime \mathbf{A}=\mathbf{I}\). What is the inverse of an orthogonal matrix \(\mathbf{A}\)?

3.2.2.2 \(\overline{Y}\) and \(S\) are independent under normality

How do orthogonal transformations interact with IID normal random variables \(N\left(\mu,\sigma^2\right)\)? To answer this, we have to find the expected value, variance, and joint density \(f_{V_1,V_2}\left(v_1,v_2\right)\). One thing to note is that the variance stays the same after the orthogonal transformation.

Exercise 15 Show that \[\mathbb{E}\left(V_1\right)=\left(a_{11}+a_{12}\right)\mu, \ \ \ \mathbb{E}\left(V_2\right)=\left(a_{21}+a_{22}\right)\mu, \ \ \ \mathsf{Var}\left(V_1\right)=\sigma^2 , \ \ \ \mathsf{Var}\left(V_2\right)=\sigma^2\]

We start from the joint density of IID normal random variables \(\left(Y_1, Y_2\right)\). In this case, we have \[f_{Y_1,Y_2}\left(y_1, y_2\right)=\frac{1}{2\pi\sigma^2}\exp\left(-\frac{1}{2\sigma^2} \left((y_1-\mu)^2+(y_2-\mu)^2\right)\right).\] We have to express \(Y\)’s into \(V\)’s. Since we are working with orthogonal transformations, we assume that \(\mathbf{A}\) is an orthogonal matrix whose entries satisfy certain restrictions. Based on Exercise 14, \[\begin{bmatrix}V_1\\ V_2\end{bmatrix}=\begin{bmatrix} a_{11} & a_{12} \\ a_{21} & a_{22}\end{bmatrix} \begin{bmatrix} Y_1 \\ Y_2 \end{bmatrix} \Rightarrow \begin{bmatrix}a_{11} & a_{21} \\ a_{12} & a_{22}\end{bmatrix}\begin{bmatrix}V_1\\ V_2\end{bmatrix}=\begin{bmatrix} Y_1 \\ Y_2 \end{bmatrix}.\] From a geometric perspective, orthogonal transformations preserve lengths of vectors. Therefore, the transformation from \(Y\) to \(V\) does not change volume. The joint density of \(\left(V_1, V_2\right)\) for this case becomes \[\begin{eqnarray}f_{V_1,V_2}\left(v_1, v_2\right) &=&\frac{1}{2\pi\sigma^2}\exp\left(-\frac{1}{2\sigma^2} \left((a_{11}v_1+a_{21}v_2-\mu)^2+(a_{12}v_1+a_{22}v_2-\mu)^2\right)\right)\\ &=& \frac{1}{2\pi\sigma^2}\exp\left(-\frac{1}{2\sigma^2} \left((v_1-(a_{11}+a_{12})\mu)^2+(v_2-(a_{21}+a_{22})\mu)^2\right)\right) \\ &=& \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left(-\frac{1}{2\sigma^2} (v_1-(a_{11}+a_{12})\mu)^2\right)\frac{1}{\sqrt{2\pi\sigma^2}}\exp\left(-\frac{1}{2\sigma^2} (v_2-(a_{21}+a_{22})\mu)^2\right)\end{eqnarray}\] Therefore, the joint density of \(\left(V_1, V_2\right)\) is a product of the density of a \(N\left((a_{11}+a_{12})\mu,\sigma^2\right)\) and the density of a \(N\left((a_{21}+a_{22})\mu,\sigma^2\right)\).

From a statistical perspective, applying an orthogonal transformation to IID normal random variables preserves normality, independence, and variance. Note that the means are not identical. The derivation presented extends to \(n\) dimensions.
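
A small simulation can make this concrete. The following sketch (my own check; the seed, \(\mu\), \(\sigma\), and rotation angle are arbitrary choices) applies an orthogonal transformation to IID \(N\left(\mu,\sigma^2\right)\) pairs and inspects the resulting means, variances, and correlation.

set.seed(1234)   # arbitrary seed, for reproducibility only
mu <- 2
sigma <- 3
theta <- pi/6
A <- matrix(c(cos(theta), sin(theta), -sin(theta), cos(theta)), nrow = 2)
ymat <- rbind(rnorm(10^4, mean = mu, sd = sigma), rnorm(10^4, mean = mu, sd = sigma)) # columns are (Y1, Y2)
vmat <- A %*% ymat                                    # columns are (V1, V2)
rowMeans(vmat)                                        # simulated means of V1 and V2
c((A[1, 1] + A[1, 2]) * mu, (A[2, 1] + A[2, 2]) * mu) # theoretical means
apply(vmat, 1, var)                                   # both should be close to sigma^2 = 9
cor(vmat[1, ], vmat[2, ])                             # should be close to 0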

Now that we know the statistical effect of applying an orthogonal transformation, the next question is how to construct an orthogonal transformation which will help us show that \(\overline{Y}\) and \(S\) do not statistically depend on each other. Let \(Y_1,\ldots, Y_n\) be IID normal random variables \(N(\mu,\sigma^2)\). Since \(S\) depends on the deviations from the mean \(Y_i-\overline{Y}\), we consider the following linear transformation: \[\begin{bmatrix}\overline{Y} \\ Y_2-\overline{Y} \\ Y_3-\overline{Y} \\ \vdots \\ Y_n-\overline{Y}\end{bmatrix} = \underbrace{\begin{bmatrix}\dfrac{1}{n} & \dfrac{1}{n} & \dfrac{1}{n} & \cdots & \dfrac{1}{n} \\ -\dfrac{1}{n} & 1-\dfrac{1}{n} & -\dfrac{1}{n} & \cdots & -\dfrac{1}{n} \\ -\dfrac{1}{n} & -\dfrac{1}{n} & 1-\dfrac{1}{n} & \cdots & -\dfrac{1}{n} \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ -\dfrac{1}{n} & -\dfrac{1}{n} & -\dfrac{1}{n} & \cdots & 1-\dfrac{1}{n} \\ \end{bmatrix}}_{\mathbf{B}}\begin{bmatrix}Y_1\\Y_2\\Y_3 \\ \vdots \\ Y_n\end{bmatrix}\] Unfortunately, the matrix \(\mathbf{B}\) is not an orthogonal matrix. Its first row is orthogonal to the remaining rows, but that row has length \(1/\sqrt{n}\) rather than 1. We can rescale \(\overline{Y}\) to \(V_1=\sqrt{n}\overline{Y}\) so that \[\begin{bmatrix}\sqrt{n}\overline{Y} \\ V_2 \\ V_3 \\ \vdots \\ V_n\end{bmatrix} = \underbrace{\begin{bmatrix}\dfrac{1}{\sqrt{n}} & \dfrac{1}{\sqrt{n}} & \dfrac{1}{\sqrt{n}} & \cdots & \dfrac{1}{\sqrt{n}} \\ ? & ? & ? & \cdots & ? \\ ? & ? & ? & \cdots & ? \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ ? & ? & ? & \cdots & ? \\ \end{bmatrix}}_{\mathbf{A}}\begin{bmatrix}Y_1\\Y_2\\Y_3 \\ \vdots \\ Y_n\end{bmatrix}\] It is possible to complete this matrix so that it is an orthogonal matrix, but we do not need its complete form yet.

Why? Because we can already obtain a set of results, provided that \(\mathbf{A}\) is an orthogonal matrix:

  1. \(\mathbb{E}\left(V_1\right)=\sqrt{n}\mu\), which recovers the earlier result that \(\mathbb{E}\left(\overline{Y}\right)=\mu\).
  2. \(V_1,\ldots,V_n\) are still independent random variables after an orthogonal transformation, even if we do not know explicitly what the transformation actually looks like.
  3. The sum of squares of the random variables \(Y_1,\ldots, Y_n\) is equal to the sum of squares of the random variables \(V_1,\ldots, V_n\), i.e. \[\sum_{i=1}^n V_i^2 = \sum_{i=1}^n Y_i^2 \Rightarrow n\overline{Y}^2+\sum_{i=2}^n V_i^2=\sum_{i=1}^n Y_i^2 \Rightarrow \sum_{i=2}^n V_i^2 = \sum_{i=1}^n Y_i^2-n\overline{Y}^2 \Rightarrow \sum_{i=2}^n V_i^2 = \sum_{i=1}^n \left(Y_i-\overline{Y}\right)^2.\] This algebraic result allows us to separate the sum of squares \(\displaystyle\sum_{i=1}^n Y_i^2\) into two parts: a part that depends only on \(\overline{Y}\), namely \(V_1^2=n\overline{Y}^2\), and a part that depends only on \(S\), namely \(\displaystyle\sum_{i=2}^n V_i^2=(n-1)S^2\). More importantly, these two parts are statistically independent of each other. Therefore \(\overline{Y}\) is independent of \(S^2\), and hence of \(S\). A quick numerical check of the algebraic identity appears after this list.
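
The identity itself is purely algebraic and does not require normality. Here is a minimal check using an arbitrary vector of numbers:

y <- c(2.3, -1.1, 0.7, 4.2, 3.5)     # arbitrary numbers
n <- length(y)
sum(y^2)                             # total sum of squares
n * mean(y)^2 + sum((y - mean(y))^2) # n * ybar^2 plus the sum of squared deviations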

If you do want to complete the entries of the orthogonal matrix \(\mathbf{A}\), we can say even more, although there are alternative ways to prove these additional statements. The following matrix is an orthogonal matrix: \[\mathbf{A}=\begin{bmatrix}\dfrac{1}{\sqrt{n}} & \dfrac{1}{\sqrt{n}} & \dfrac{1}{\sqrt{n}} & \cdots & \dfrac{1}{\sqrt{n}} \\ -\dfrac{1}{\sqrt{2}} & \dfrac{1}{\sqrt{2}} & 0 & \cdots & 0 \\ -\dfrac{1}{\sqrt{6}} & -\dfrac{1}{\sqrt{6}} & \dfrac{2}{\sqrt{6}} & \cdots & 0 \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ -\dfrac{1}{\sqrt{n(n-1)}} & -\dfrac{1}{\sqrt{n(n-1)}} & -\dfrac{1}{\sqrt{n(n-1)}} & \cdots & \dfrac{n-1}{\sqrt{n(n-1)}} \\ \end{bmatrix}\] As a result, we obtain:

  1. \(\mathbb{E}\left(V_i\right)=0\), for \(i>1\).
  2. \(\mathbb{E}\left(S^2\right)=\sigma^2\): since \(\left(n-1\right)S^2=\displaystyle\sum_{i=2}^n V_i^2\) and \(\mathbb{E}\left(V_i^2\right)=\mathsf{Var}\left(V_i\right)+\left[\mathbb{E}\left(V_i\right)\right]^2=\sigma^2\) for \(i>1\), it follows that \(S^2\) is an unbiased estimator of \(\sigma^2\).

As you may have noticed, sums of squares play a crucial role. Applying an orthogonal transformation to IID normal random variables enabled us to partition the sum of squares of those random variables into a square involving only the sample mean and a sum of squares involving only the sample variance. This breakdown is the essence of an analysis of variance.
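
If you want to see the completed matrix in action, the sketch below builds the displayed orthogonal matrix for a chosen \(n\) (the value of \(n\), the seed, and the simulated sample are arbitrary), checks that \(\mathbf{A}\mathbf{A}^\prime=\mathbf{I}\), and verifies that \(V_1=\sqrt{n}\,\overline{Y}\) and \(\displaystyle\sum_{i=2}^n V_i^2=(n-1)S^2\).

n <- 6
A <- matrix(0, n, n)
A[1, ] <- 1/sqrt(n)
for (i in 2:n) {
  A[i, 1:(i - 1)] <- -1/sqrt(i * (i - 1))
  A[i, i] <- (i - 1)/sqrt(i * (i - 1))
}
round(A %*% t(A), 10)               # should be the n x n identity matrix
set.seed(5678)                      # arbitrary seed, for reproducibility only
y <- rnorm(n, mean = 10, sd = 2)
v <- as.vector(A %*% y)
c(v[1], sqrt(n) * mean(y))          # V1 equals sqrt(n) * ybar
c(sum(v[-1]^2), (n - 1) * var(y))   # sum of V_i^2 for i >= 2 equals (n - 1) * S^2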

3.2.2.3 The distribution of \(S^2\)

Theorem 7.3.1 in LM page 384 states that the sum of squares of \(m\) independent standard normal random variables has a gamma distribution with \(r=m/2\) and \(\lambda=1/2\). We will demonstrate the same result by using a moment generating function approach.

First, I show that if \(Z_j\sim N(0,1)\), then \(M_{Z_j^2}\left(t\right)=\left(1-2t\right)^{-1/2}\) for \(|t|<1/2\). To see this, start with the definition of a moment generating function and do some integration: \[\begin{eqnarray}M_{Z_j^2}\left(t\right)&=&\mathbb{E}\left(\exp\left(t Z_j^2\right)\right)\\ &=&\int_{-\infty}^\infty \frac{1}{\sqrt{2\pi}}\exp\left(z^2 t\right)\exp\left(-\frac{1}{2}z^2\right)\,\, dz \\ &=&\int_{-\infty}^\infty \frac{1}{\sqrt{2\pi}}\exp\left(z^2\left(t-\frac{1}{2}\right)\right)\,\, dz\\ &=&\left(1-2t\right)^{-1/2}\int_{-\infty}^\infty \underbrace{\frac{1}{\sqrt{2\pi\left(1-2t\right)^{-1}}}\exp\left(-\frac{z^2}{2\left(1-2t\right)^{-1}}\right)}_{\mathrm{pdf\ of\ }N\left(0,\left(1-2t\right)^{-1}\right)}\,\, dz \\ &=& \left(1-2t\right)^{-1/2}\end{eqnarray}\]
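
As a quick numerical sanity check (my own, not from LM), we can compare the defining integral with \(\left(1-2t\right)^{-1/2}\) at a particular value of \(t\) inside \((-1/2,1/2)\); the choice \(t=0.2\) is arbitrary.

t0 <- 0.2
integrate(function(z) exp(t0 * z^2) * dnorm(z), lower = -Inf, upper = Inf)$value
(1 - 2 * t0)^(-1/2)    # both values should agree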

Next, we have to find the moment generating function of \(U=\displaystyle\sum_{j=1}^m Z_j^2\) where \(Z_1,\ldots,Z_m\) are independent standard normal random variables and \(m\) is fixed. By Theorem 3.12.3b in LM page 213, \[M_U\left(t\right)=M_{Z_1^2}\left(t\right)M_{Z_2^2}\left(t\right)\cdot \ldots \cdot M_{Z_m^2}\left(t\right)=\left(1-2t\right)^{-m/2}.\] By Theorem 4.6.5 in LM page 270, along with Theorem 3.12.2 in LM page 213, this is the moment generating function of a gamma pdf with \(r=m/2\) and \(\lambda=1/2\).

Because of the central role played by sums of squares, a gamma distribution with \(r=m/2\) and \(\lambda=1/2\) is also called a chi-squared distribution with \(m\) degrees of freedom (denoted as \(\chi^2_m\)).
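
In R, this relabeling can be seen by comparing the built-in chi-squared and gamma densities directly; the degrees of freedom and the grid below are arbitrary choices.

m <- 5
x <- seq(0.1, 20, by = 0.1)
max(abs(dchisq(x, df = m) - dgamma(x, shape = m/2, rate = 1/2))) # essentially zero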

The problem with applying the preceding result directly to \(S^2\) is that although \(S^2\) has the form of a sum of squares, it is not a sum of squares of independent standard normal random variables yet.

One approach is to use the consequence of applying orthogonal transformations to IID normal random variables. In particular, we have \[S^2=\frac{1}{n-1}\sum_{i=2}^n V_i^2 =\frac{1}{n-1}\sum_{i=2}^n \left(\frac{V_i-\mathbb{E}\left(V_i\right)+\mathbb{E}\left(V_i\right)}{\sqrt{\mathsf{Var}\left(V_i\right)}}\right)^2 \mathsf{Var}\left(V_i\right)=\frac{1}{n-1}\sum_{i=2}^n \left(\frac{V_i}{\sigma}\right)^2 \sigma^2\Rightarrow \frac{\left(n-1\right)S^2}{\sigma^2}=\sum_{i=2}^n \left(\frac{V_i}{\sigma}\right)^2\] Therefore, we have to rescale \(S^2\) to \(\left(n-1\right)S^2/\sigma^2\) and the latter is now a sum of squares of \(n-1\) independent standard normal random variables. So, we have an exact finite-sample distributional result: \[\frac{\left(n-1\right)S^2}{\sigma^2}=\sum_{i=2}^n \left(\frac{V_i}{\sigma}\right)^2\sim \chi^2_{n-1}.\]
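
The following simulation, in the same spirit as the earlier Monte Carlo code (the values of \(n\), \(\mu\), and \(\sigma\) are arbitrary), compares the histogram of \(\left(n-1\right)S^2/\sigma^2\) across repeated normal samples with the \(\chi^2_{n-1}\) density.

n <- 10
mu <- 5
sigma <- 2
nsim <- 10^4
ymat <- replicate(nsim, rnorm(n, mean = mu, sd = sigma))
scaled.s2 <- (n - 1) * apply(ymat, 2, var) / sigma^2
hist(scaled.s2, freq = FALSE, main = "(n-1)S^2/sigma^2")
curve(dchisq(x, df = n - 1), add = TRUE, col = "red")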

Exercise 16  

  1. (Exercise 7.3.2) Let \(U\sim \chi^2_m\). Use the moment generating function for a chi-squared distributed random variable to show that \(\mathbb{E}\left(U\right)=m\) and \(\mathsf{Var}\left(U\right)=2m\).
  2. Use the distributional result to show the following (a simulation check of these claims appears after this exercise):
      1. \(S^2\) is an unbiased estimator of \(\sigma^2\).
      2. (Exercise 7.3.4) \(\mathsf{Var}\left(S^2\right)=2\sigma^4/(n-1)\).
      3. (Exercise 7.3.5) \(S^2\) is a consistent estimator of \(\sigma^2\).
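
The simulation below does not replace the proofs asked for in the exercise, but it makes the claimed values plausible; the seed, \(n\), \(\sigma\), and number of replications are arbitrary choices.

set.seed(2024)   # arbitrary seed, for reproducibility only
n <- 8
sigma <- 2
nsim <- 10^5
s2 <- replicate(nsim, var(rnorm(n, mean = 0, sd = sigma)))
c(mean(s2), sigma^2)                 # E(S^2) versus sigma^2
c(var(s2), 2 * sigma^4 / (n - 1))    # Var(S^2) versus 2 sigma^4 / (n - 1)
u <- rchisq(nsim, df = n - 1)
c(mean(u), n - 1)                    # E(U) versus m
c(var(u), 2 * (n - 1))               # Var(U) versus 2m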

3.2.2.4 Putting all of these results together

So far, we know that \[\overline{Y}\sim N\left(\mu, \frac{\sigma^2}{n}\right),\ \ \ \frac{\left(n-1\right)S^2}{\sigma^2}\sim \chi^2_{n-1}\] and that \(\overline{Y}\) and \(S\) are independent. The last ingredient to complete the derivation is to determine the distribution of \[\frac{\overline{Y}-\mu}{S/\sqrt{n}}=\frac{\dfrac{\overline{Y}-\mu}{\sigma/\sqrt{n}}}{\sqrt{\dfrac{\left(n-1\right)S^2}{\sigma^2}\cdot \dfrac{1}{\left(n-1\right)}}}.\] Theorem 7.3.5 in LM page 387 shows that this random variable has a Student \(t\) distribution with \(n-1\) degrees of freedom (denoted by \(T_{n-1}\)).
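
A simulation in the style of the earlier Monte Carlo code (with arbitrary choices of \(n\), \(\mu\), and \(\sigma\)) compares the studentized sample means with the \(T_{n-1}\) density.

n <- 10
mu <- 5
sigma <- 2
nsim <- 10^4
ymat <- replicate(nsim, rnorm(n, mean = mu, sd = sigma))
tstats <- (colMeans(ymat) - mu) / (apply(ymat, 2, sd) / sqrt(n))
hist(tstats, freq = FALSE, breaks = 50, main = "Studentized sample means")
curve(dt(x, df = n - 1), add = TRUE, col = "red")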

Alternatively, we can put the results together using an analysis of variance. The total sum of squares \(\displaystyle\sum_{i=1}^n Y_i^2\) can be split into two parts, \(V_1^2\) and \(\displaystyle\sum_{i=2}^n V_i^2\). These two parts are statistically independent of each other, provided that \(Y_1,\ldots, Y_n\) are IID normal random variables. Recall our analysis of variance:

| Source | Sum of squares (SS) | Degrees of freedom (df) | Mean sum of squares (MS) | Expected MS |
|---|---|---|---|---|
| Mean | \(V_1^2=n\overline{Y}^2\) | \(1\) | \(n\overline{Y}^2\) | \(\sigma^2+n\mu^2\) |
| Deviations | \(\displaystyle\sum_{i=2}^n V_i^2=\displaystyle\sum_{i=1}^n\left(Y_i-\overline{Y}\right)^2\) | \(n-1\) | \(S^2\) | \(\sigma^2\) |
| Total | \(\displaystyle\sum_{i=1}^n V_i^2=\displaystyle\sum_{i=1}^n Y_i^2\) | \(n\) | | |

Consider the ratio \[\frac{\left(\dfrac{\left(V_1-\sqrt{n}\mu\right)^2}{\sigma^2}\right)/1}{\left(\dfrac{\displaystyle\sum_{i=2}^n V_i^2}{\sigma^2}\right)/(n-1)}=\frac{\left(\dfrac{\overline{Y}-\mu}{\sigma/\sqrt{n}}\right)^2/1}{\left(\dfrac{(n-1)S^2}{\sigma^2}\right)/(n-1)}.\] Theorem 7.3.3 in LM page 385 states that a ratio of independent mean squares (sums of squares divided by their degrees of freedom) has an \(F\) distribution. In this particular situation, the resulting random variable has an \(F\) distribution with 1 numerator degree of freedom and \(n-1\) denominator degrees of freedom (denoted as \(F_{1,n-1}\)).
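
Continuing the simulation idea (again with arbitrary choices of \(n\), \(\mu\), and \(\sigma\)), the ratio above can be simulated directly and compared with the \(F_{1,n-1}\) density.

n <- 10
mu <- 5
sigma <- 2
nsim <- 10^4
ymat <- replicate(nsim, rnorm(n, mean = mu, sd = sigma))
num <- ((colMeans(ymat) - mu) / (sigma / sqrt(n)))^2       # chi-squared with 1 df, divided by 1
den <- ((n - 1) * apply(ymat, 2, var) / sigma^2) / (n - 1) # chi-squared with n-1 df, divided by n-1
hist(num / den, freq = FALSE, breaks = 50, main = "Simulated F ratios")
curve(df(x, df1 = 1, df2 = n - 1), add = TRUE, col = "red")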

3.3 Slight detour: asymptotic distribution of \(\dfrac{\overline{Y}-\mu}{S/\sqrt{n}}\)

I now provide an explanation of the second approach to obtaining the distribution of \(\dfrac{\overline{Y}-\mu}{S/\sqrt{n}}\). We already saw that under the normality of IID random variables \(Y_1,Y_2, \ldots, Y_n\), \(\dfrac{\overline{Y}-\mu}{S/\sqrt{n}}\sim T_{n-1}\). What happens if we remove the normality assumption? In Exercise 13, along with its preceding Monte Carlo simulation, we saw that the histogram of simulated sample means is closely approximated by the red curve as \(n\) becomes larger. Although normal random variables were simulated in that exercise, the simulation below, which uses uniformly distributed random variables, suggests that the normality assumption can be removed.

n <- 80
nsim <- 10^4 # number of realizations to be obtained
# repeatedly obtain realizations
ymat <- replicate(nsim, runif(n, min = 0, max = 1))
ybars <- colMeans(ymat)
sds <- apply(ymat, 2, sd)
mu <- 1/2 # the expected value of a uniformly distributed random variable
sigma.sq <- 1/12 # the variance of a uniformly distributed random variable
par(mfrow=c(1,2))
hist((ybars-mu)/(sqrt(sigma.sq)/sqrt(n)), freq = FALSE, main = "Known sigma.sq")
curve(dnorm(x, 0, 1), add = TRUE, col = "red")
hist((ybars-mu)/(sds/sqrt(n)), freq = FALSE, main = "Plug-in")
curve(dnorm(x, 0, 1), add = TRUE, col = "red")

I am going to introduce another convergence concept to allow us to capture the ability of the density function of the \(N(0,1)\) (shown as the red curve) to “approximate” the histogram of standardized sample means.

Definition 6 Let \(X_1,X_2,\ldots\) be a sequence of random variables and let \(X\) be another random variable. Let \(F_n\) be the cdf of \(X_n\) and \(F\) be the cdf of \(X\). \(X_n\) converges in distribution to \(X\), denoted as \(X_n\overset{d}{\to}X\), if \[\lim_{n\to\infty} F_n\left(t\right)=F\left(t\right)\ \ \mathrm{or} \ \ \lim_{n\to\infty} \mathbb{P}\left(X_n\leq t\right) = \mathbb{P}\left(X\leq t\right) \] at all \(t\) for which \(F\) is continuous.

This convergence concept is useful to approximate probabilities involving random variables \(X_1, X_2, \ldots,\) whose distribution is unknown to us. Of course, you should be able to compute probabilities involving the limiting random variable \(X\); otherwise, the concept becomes less useful. The key idea is that when \(n\) is large enough, the hard-to-compute \(\mathbb{P}\left(X_n\leq t\right)\) can be made as close as we please to the easier-to-compute \(\mathbb{P}\left(X\leq t\right)\).

This convergence concept is most prominent in the central limit theorem found in Theorem 4.3.2 in LM page 244. What it says is that if we have an IID sequence \(Y_1,Y_2,\ldots\) of random variables with finite variance, then \[\lim_{n\to\infty} \mathbb{P}\left(\dfrac{\overline{Y}-\mu}{\sigma/\sqrt{n}}\leq z\right)=\mathbb{P}\left(Z\leq z\right)\] where \(Z\sim N(0,1)\). Another way to write the result is \[\dfrac{\overline{Y}-\mu}{\sigma/\sqrt{n}}\overset{d}{\to} N(0,1).\]

The distribution of each of the random variables \(Y_1,Y_2,\ldots\) is unknown to us apart from each of them having finite variance. Therefore, it is very unlikely that you can actually compute probabilities involving the sampling distribution of \(\overline{Y}\). Even if we did know the distribution of each of the random variables \(Y_1,Y_2,\ldots\), it might still be extremely difficult to obtain an expression for the distribution of \(\overline{Y}\). If \(Y_1,Y_2,\ldots\) are IID normal random variables, then we have an exact and known sampling distribution of \(\overline{Y}\) and no urgent need for a central limit theorem. But this is more of an exception.

The next thing to hope for is that if we plug in a good estimator of \(\sigma\), then we also obtain another central limit theorem that says: \[\dfrac{\overline{Y}-\mu}{S/\sqrt{n}}\overset{d}{\to} N(0,1).\] The estimator we plug in is \(S\), which is a consistent estimator of \(\sigma\).

3.4 Extending univariate normal distributions to higher dimensions

3.4.1 The Standard Bivariate Normal Distribution

We are now going to extend the standard normal distribution to two dimensions. We have to think about the more interesting case where two standard normal random variables are not independent of each other. We will be introducing a parameter \(\rho\) to reflect this dependence.

Let \[Z_1 = U_1,\quad Z_2=\rho U_1+\sqrt{1-\rho^2} U_2\] where \(U_1\sim N\left(0,1\right)\), \(U_2\sim N\left(0,1\right)\) and \(U_1\) is independent of \(U_2\). As a result, we have \[\mathbb{E}\left(Z_1\right)=\mathbb{E}\left(Z_2\right)=0, \quad \mathsf{Var}\left(Z_1\right)=\mathsf{Var}\left(Z_2\right)=1, \quad \mathsf{Cov}\left(Z_1,Z_2\right)=\rho.\]
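
A brief simulation of this construction (the seed and the value of \(\rho\) are arbitrary choices) confirms the moments stated above.

set.seed(42)    # arbitrary seed, for reproducibility only
rho <- 0.6
u1 <- rnorm(10^4)
u2 <- rnorm(10^4)
z1 <- u1
z2 <- rho * u1 + sqrt(1 - rho^2) * u2
c(mean(z1), mean(z2))   # both close to 0
c(var(z1), var(z2))     # both close to 1
cov(z1, z2)             # close to rho = 0.6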

Definition 7 Let \(Z_1\) and \(Z_2\) be random variables with joint pdf \[f\left(z_1,z_2\right)=\frac{1}{2\pi\sqrt{1-\rho^2}} \exp\left(-\frac{1}{2}\frac{z_1^2+z_2^2-2\rho z_1z_2}{1-\rho^2}\right).\] \(Z_1\) and \(Z_2\) are said to have a standard bivariate normal distribution with parameter \(\rho\), i.e., \(\left(Z_1,Z_2\right) \sim \mathsf{SBVN}\left(\rho\right)\).

Consider the special cases where \(\rho=1\) and \(\rho=-1\). In these cases \(Z_2=Z_1\) or \(Z_2=-Z_1\) with probability 1, so the joint distribution degenerates onto a line and the joint pdf above is no longer defined. At the other extreme, \(\rho=0\) implies that \(Z_1\) and \(Z_2\) are independent. The standard bivariate normal distribution with \(\rho=0\) is therefore a case where uncorrelatedness implies independence.

In finance, some practitioners “paste” two normal marginals together with a copula. This allows one to produce distributions that have normal marginals but the joint distribution is not necessarily bivariate normal.

Next, we will visualize the joint pdf of the standard bivariate normal distribution for \(\rho \in \{-0.9,-0.6,0,0.6,0.9,0.99\}\).

par(mfrow=c(2,3))
sbvn <- function(z1, z2, rho) { 1/(2*pi*sqrt(1-rho^2))* exp(-0.5*(z1^2+z2^2-2*rho* z1*z2)/(1-rho^2)) }
z1 <- seq(-3, 3, length= 40)
z2 <- seq(-3, 3, length= 40)
# draw one perspective plot per value of rho
for (rho in c(-0.9, -0.6, 0, 0.6, 0.9, 0.99)) {
  f <- outer(z1, z2, sbvn, rho = rho)
  persp(z1, z2, f,
        zlab = "f(z1,z2)",
        theta = 30, phi = 15,
        col = "springgreen", shade = 0.5, main = paste("rho =", rho))
}

Next, we will also see pictures of contour plots. These are plots where you find all \(\left(z_1,z_2\right)\) so that \(f\left(z_1,z_2\right) = c\), where \(c\) is some constant.

par(mfrow=c(2,3))
# draw one contour plot per value of rho, reusing z1, z2, and sbvn from above
for (rho in c(-0.9, -0.6, 0, 0.6, 0.9, 0.99)) {
  f <- outer(z1, z2, sbvn, rho = rho)
  contour(z1, z2, f)
}

3.4.2 Bivariate normal distributions

We can now extend to bivariate normal distributions with general means and variances. The idea is the same as when we extended the standard normal distribution to \(N\left(\mu,\sigma^2\right)\): apply linear transformations.

Let \(W_1=\mu_1+\sigma_1 Z_1\) and \(W_2=\mu_2+\sigma_2 Z_2\), where \(\left(Z_1,Z_2\right)\sim \mathsf{SBVN}\left(\rho\right)\) and \(\sigma_1>0\), \(\sigma_2>0\). As a result, we have \[\mathbb{E}\left(W_1\right)=\mu_1, \quad \mathbb{E}\left(W_2\right)=\mu_2, \quad \mathsf{Var}\left(W_1\right)=\sigma^2_1, \quad \mathsf{Var}\left(W_2\right)=\sigma^2_2,\] and \[\mathsf{Cov}\left(W_1,W_2\right)=\mathsf{Cov}\left(Z_1,Z_2\right)\sigma_1\sigma_2=\rho\sigma_1\sigma_2.\]

Definition 8 Let \(W_1\) and \(W_2\) be random variables with joint pdf \[f_{W_1, W_2}\left(w_1,w_2\right)=\frac{1}{2\pi\sigma_1\sigma_2\sqrt{1-\rho^2}} \exp\left(-\frac{1}{2}\frac{\left(\frac{w_1-\mu_1}{\sigma_1}\right)^2+\left(\frac{w_2-\mu_2}{\sigma_2}\right)^2-2\rho \left(\frac{w_1-\mu_1}{\sigma_1}\right)\left(\frac{w_2-\mu_2}{\sigma_2}\right)}{1-\rho^2}\right)\] for all \(w_1\) and \(w_2\). \(W_1\) and \(W_2\) are said to have a bivariate normal distribution with five parameters, i.e., \(\left(W_1,W_2\right) \sim \mathsf{BVN}\left(\mu_1,\mu_2, \sigma_1^2, \sigma_2^2, \rho\right)\).

This definition matches Definition 11.5.1 in LM page 572.

You can show that the marginal pdf of \(W_1\) is \(N\left(\mu_1,\sigma^2_1\right)\) and the marginal pdf of \(W_2\) is \(N\left(\mu_2,\sigma^2_2\right)\).
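
The sketch below (my own check; the values of \(\rho\), \(\mu_1\), \(\mu_2\), \(\sigma_1\), and \(\sigma_2\) are arbitrary) simulates \(\left(Z_1,Z_2\right)\) using the earlier construction, applies the linear transformation, and compares the simulated moments and the marginal of \(W_1\) with the stated results.

set.seed(42)    # arbitrary seed, for reproducibility only
rho <- 0.6
mu1 <- 1; mu2 <- -2
sigma1 <- 2; sigma2 <- 0.5
u1 <- rnorm(10^4)
u2 <- rnorm(10^4)
z1 <- u1
z2 <- rho * u1 + sqrt(1 - rho^2) * u2
w1 <- mu1 + sigma1 * z1
w2 <- mu2 + sigma2 * z2
c(mean(w1), mean(w2))                  # close to mu1 = 1 and mu2 = -2
c(var(w1), var(w2))                    # close to sigma1^2 = 4 and sigma2^2 = 0.25
c(cov(w1, w2), rho * sigma1 * sigma2)  # sample covariance versus rho * sigma1 * sigma2
hist(w1, freq = FALSE, main = "Marginal of W1")
curve(dnorm(x, mean = mu1, sd = sigma1), add = TRUE, col = "red")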

3.4.3 Multivariate normal distributions

Just like how we extended the standard normal to two dimensions, it is possible to extend to the multivariate case by noticing one aspect of the joint pdf in the bivariate case.

Observe that \[z_1^2+z_2^2-2\rho z_1z_2=\begin{bmatrix} z_1 & z_2 \end{bmatrix} \begin{bmatrix} 1 & -\rho \\ -\rho & 1 \end{bmatrix}\begin{bmatrix} z_1 \\ z_2\end{bmatrix}\] and that \[\frac{1}{1-\rho^2}\begin{bmatrix} 1 & -\rho \\ -\rho & 1 \end{bmatrix}=\begin{bmatrix} 1 & \rho \\ \rho & 1 \end{bmatrix}^{-1}\] and \[\mathsf{det}\left(\begin{bmatrix} 1 & \rho \\ \rho & 1 \end{bmatrix}\right)=1-\rho^2\] Thus, the joint pdf for the standard bivariate normal distribution can be rewritten as \[f\left(z_1,z_2\right)=\frac{1}{2\pi\sqrt{\mathsf{det}\left(\begin{bmatrix} 1 & \rho \\ \rho & 1 \end{bmatrix}\right)}} \exp\left(-\frac{1}{2}\begin{bmatrix} z_1 & z_2 \end{bmatrix} \begin{bmatrix} 1 & \rho \\ \rho & 1 \end{bmatrix}^{-1}\begin{bmatrix} z_1 \\ z_2\end{bmatrix}\right).\]
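
To check the matrix rewriting numerically (and, with \(n=2\), to preview Definition 9 below), we can evaluate the scalar form and the matrix form of the density at an arbitrary point; the values of \(\rho\) and \(\left(z_1,z_2\right)\) below are arbitrary.

rho <- 0.6
Sigma <- matrix(c(1, rho, rho, 1), nrow = 2)
z <- c(0.5, -1.2)    # arbitrary evaluation point
# scalar form of the standard bivariate normal pdf
1/(2*pi*sqrt(1-rho^2)) * exp(-0.5*(z[1]^2 + z[2]^2 - 2*rho*z[1]*z[2])/(1-rho^2))
# matrix form: (2 pi)^(-1) * det(Sigma)^(-1/2) * exp(-0.5 * z' Sigma^{-1} z)
drop((2*pi)^(-1) * det(Sigma)^(-1/2) * exp(-0.5 * t(z) %*% solve(Sigma) %*% z))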

Definition 9 The \(n\times 1\) random vector \(\mathbf{Y}=\begin{bmatrix} Y_1 \\ Y_2 \\ \vdots \\ Y_n\end{bmatrix}\) has a multivariate normal distribution \(\mathbf{Y}\sim \mathsf{MVN}\left(\boldsymbol{\mu}, \boldsymbol{\Sigma}\right)\) if it has joint pdf of the form \[f_{\mathbf{Y}}(\mathbf{y})=\left(2\pi\right)^{-n/2}\left(\mathsf{det}\ \boldsymbol{\Sigma}\right)^{-1/2}\exp\left[ -\frac{1}{2}\left(\mathbf{y}-\boldsymbol{\mu}\right)^{\prime}\boldsymbol{\Sigma}^{-1}\left(\mathbf{y}-\boldsymbol{\mu}\right)\right].\]

The mean vector is given by \[\boldsymbol{\mu}=\mathbb{E}\left(\mathbf{Y}\right)=\begin{bmatrix} \mathbb{E}\left(Y_{1}\right)\\ \mathbb{E}\left(Y_{2}\right)\\ \vdots\\ \mathbb{E}\left(Y_{n}\right) \end{bmatrix}=\begin{bmatrix} \mu_1\\ \mu_2\\ \vdots\\ \mu_n \end{bmatrix}\] and the covariance matrix is given by \[\boldsymbol{\Sigma}=\mathsf{Var}\left(\mathbf{Y}\right)=\begin{bmatrix}\mathsf{Var}(Y_{1}) & \mathsf{Cov}(Y_{1},Y_{2}) & \ldots & \mathsf{Cov}(Y_{1},Y_{n})\\ \mathsf{Cov}(Y_{2},Y_{1}) & \mathsf{Var}(Y_{2}) & \ldots & \mathsf{Cov}(Y_{2},Y_{n})\\ \vdots & \vdots & \ddots & \vdots\\ \mathsf{Cov}(Y_{n},Y_{1}) & \mathsf{Cov}(Y_{n},Y_{2}) & \ldots & \mathsf{Var}(Y_{n}) \end{bmatrix}.\]