I am going to set up a review of some ideas and skills you should have picked up from your first course in probability theory. I connect some of the main ideas in this course to that review so that you can see how probability theory is applied to statistics. The resulting statistical theory becomes our basis for what to do in practice when you encounter real data, which are often messier.
Larsen and Marx, henceforth LM, already point out statistical applications of probability theory when studying random variables in Chapter 3. They state on page 175, “In statistics, the most important combination of a set of random variables is often their sum, so we continue this section with the problem of finding the pdf of $X + Y$.” Another important example in that chapter is Example 3.9.11 on page 191, where the expected value and variance of an average computed from a random sample of $n$ observations are derived.
In the process of reviewing probability theory, we will be transitioning from probability models to what LM call data models. This phrase shows up in the Preface of LM and its first occurrence in the main text is on page 224. There is no explicit definition of what LM call data models, but they state the following:
Page viii: “The initial step in teaching someone how to analyze data is helping them learn how to recognize those eight “data models”: One-sample data, Two-sample data, $k$-sample data, Paired data, Randomized block data, Regression/Correlation data, Categorical data, and Factorial data.”
Page viii: “Identifying data models, of course, is not a difficult skill to acquire. Anyone involved in analyzing data learns it quickly. But for students taking their first course in statistics, ignoring the topic leaves them without a sense of where the subject is going and why.”
Page 224: “Today the function $p_X(k) = e^{-\lambda}\dfrac{\lambda^k}{k!}$ is universally recognized as being among the three or four most important data models in all of statistics.”
Page 285: “In trying to model a set of data with a Poisson pdf, $p_X(k) = e^{-\lambda}\dfrac{\lambda^k}{k!}$, $k = 0, 1, 2, \ldots$, it will sometimes be the case that the value $k = 0$ cannot occur, or there are too many zeroes for a good data model (called zero-inflated data). Accommodating that restriction leads to a definition of a truncated Poisson pdf – call it $p_{X_T}(k)$ – where the values of $k$ are limited to the integers $1, 2, 3, \ldots$”
Based on these passages, let me spell out what I think LM mean when they say “data model”. I think they use it in two senses: one as a statistical model for the data, and the other as the resulting data format representing the implications of the overall structure and experimental design which produced the data we observe.
Definition 1
A parameter is an unknown fixed quantity or feature of some distribution. A parameter may be a number, a vector, a matrix, or a function.
A statistical model is a family or set of distributions. Statistical models are sometimes classified as parametric and nonparametric.
A parametric statistical model or data model in LM’s terminology is a family or set of distributions which is known up to (or indexed by) a finite number of parameters.
A nonparametric statistical model is a family or set of distributions which cannot be indexed by a finite number of parameters.
In this course, we will mostly focus on parametric statistical models, but there will be occasions where we encounter a mix of a nonparametric and a parametric statistical model. Statistical models are really just probability models with a different focus. Probability models you have encountered in your first course in probability theory show up again in this course. These probability models depend on constants, which, once known, enable you to calculate probabilities. In contrast, we now do not know these constants and seek to learn their value from the data.
LM actually introduce you to parametric statistical models (without explicitly saying so) in Chapter 4. Two examples of how probability models can be used as models for the data are illustrated in Case Study 4.2.2 and Case Study 4.2.3. In those case studies, $X \sim \text{Poisson}(\lambda)$, where $\lambda$ is a one-dimensional parameter of the model. The form of the distribution is known (Poisson). We do not know $\lambda$, but we may observe some values of $X$. In contrast, in your first course in probability theory you find quantities like $P(X = k)$ where $\lambda$ is given to you.
In the two case studies, notice that what we actually observe is the data:
The numbers and proportions of times that $\alpha$-particles were detected in a given eighth-minute in Case Study 4.2.2
The number of wars starting in a given year in Case Study 4.2.3
The Poisson pdf depends on an unknown $\lambda$, and a value for $\lambda$ was plugged in to see if there is a “good match” between the relative frequencies (which we observe) and the probabilities provided by the Poisson pdf with a plug-in value for $\lambda$ (which are an abstraction). The value plugged in for $\lambda$ is the sample mean, which was calculated from the data. The two case studies suggest that there was a “good match”. These case studies should encourage you to ponder the following questions:
Why choose the Poisson pdf as the data model?
Why plug in the sample mean for $\lambda$? Are there other options?
How do we decide if there is a “good match”?
Why was it important to have a “good match”? What purpose does it serve?
If there is not a “good match”, what do we do?
Answering these questions hopefully motivates you to study statistical theory.
Exercise 1
The tables found in the two case studies are summaries of the data. What do you think the original data looked like?
If $X_1, X_2, \ldots, X_n$ are IID random variables having a Poisson distribution with parameter $\lambda$, what are $E\left(\overline{X}\right)$ and $\mathrm{Var}\left(\overline{X}\right)$? Can you speculate as to why the sample mean was used as a plug-in for $\lambda$?
In both case studies, try a different plug-in value for $\lambda$. What do you notice? Can you speculate as to why the sample mean was used as a plug-in for $\lambda$?
Another early example involving the exponential density can be found in Case Study 4.2.4. In this case study, $Y \sim \text{Exponential}(\lambda)$, where $\lambda$ is a one-dimensional parameter of the model. You may find yourself asking questions similar to those in the Poisson case studies.
Another example which has financial applications is discussed in the following business case study and Section 19.1 of Ruppert and Matteson (2015) (available only within XMU) about the concept of VaR or value-at-risk.
The last instances of the phrase “data model” show up in LM Chapter 8. LM state that, “… every set of measurements also has a certain overall structure, or design.” In this course, we pay attention to the following experimental designs: one-sample data, two-sample data, $k$-sample data, and paired data. For the current set of notes, I will focus on one-sample data.
I am now moving forward to three major topics – a study of the sample mean and its distribution, how this study of the sample mean specializes to normal models, and extending normal models to multivariate cases.
2 Statistics and sampling distributions
The word “statistics” can refer to the field of study or research or it can also refer to quantities we can calculate from the data.
Definition 2 A statistic or estimator is a function of the data.
Compare this definition to the Comment on page 280. It is not crucial that a statistic be a function of a random sample, but the emphasis on approximating a parameter has been discussed in the Introduction. Why should we even consider statistics or estimators in the first place?
To answer this question, we will explore a very common statistic: the sample mean. When given data, it is easy to compute the sample mean. But what do we learn from computing this sample mean? We need to understand that this statistic is a procedure or an algorithm which could be applied to data we have and other data we might not have on hand. The latter is more hypothetical.
2.1 The sample mean
Consider a setting where we observe a sample $y_1, y_2, \ldots, y_n$ which is a realization from the joint distribution of IID normal random variables $Y_1, Y_2, \ldots, Y_n \sim N(\mu, \sigma^2)$. Let $\overline{Y} = \dfrac{1}{n}\sum_{i=1}^n Y_i$. We can easily calculate the observed sample mean $\overline{y}$, but what does probability theory tell us about $\overline{Y}$?
Refer to Corollary 4.3.1. Let $\mu$ be the common expected value of the $Y_i$'s and $\sigma^2$ be the common variance of the $Y_i$'s. Under the setting described, we can conclude that $$\overline{Y} \sim N\!\left(\mu, \frac{\sigma^2}{n}\right).$$ The previous result describes the sampling distribution of the statistic $\overline{Y}$. But what does the sampling distribution tell us?
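For reference, the two moments behind this result can be computed directly:
$$E(\overline{Y}) = \frac{1}{n}\sum_{i=1}^n E(Y_i) = \frac{1}{n}\cdot n\mu = \mu,
\qquad
\mathrm{Var}(\overline{Y}) = \frac{1}{n^2}\sum_{i=1}^n \mathrm{Var}(Y_i) = \frac{1}{n^2}\cdot n\sigma^2 = \frac{\sigma^2}{n},$$
where the variance calculation uses the independence of the $Y_i$'s. Normality is what pins down the exact normal shape of the sampling distribution.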
Consider an example where $n = 5$, $\mu = 1$, and $\sigma^2 = 4$. We are going to use R to obtain realizations from these IID normal random variables. What you are going to see is what is called a Monte Carlo simulation. When we design and conduct Monte Carlo simulations in statistics, we generate artificial data in order to evaluate the performance of statistical procedures.
n <- 5          # sample size
mu <- 1         # common expected value
sigma.sq <- 4   # common variance
y <- rnorm(n, mu, sqrt(sigma.sq))   # Be careful of the R syntax here.
y   # a realization
As you may have noticed, the sample mean is used in two senses: one as an observed sample mean $\overline{y}$ (or simply, the sample mean of a list of numbers) and the other as a procedure $\overline{Y}$. Conceptually, we can repeatedly apply the procedure of calculating the sample mean to different realizations of IID normal random variables.
nsim <- 10^4   # number of realizations to be obtained
# repeatedly obtain realizations
ymat <- replicate(nsim, rnorm(n, mu, sqrt(sigma.sq)))
dim(ymat)
[1] 5 10000
ybars <- colMeans(ymat)
par(mfrow = c(2, 2))
hist(ymat[, sample(1:10^4, 1)], freq = FALSE)
curve(dnorm(x, mu, sqrt(sigma.sq)), add = TRUE, col = "blue")
hist(ymat[, sample(1:10^4, 1)], freq = FALSE)
curve(dnorm(x, mu, sqrt(sigma.sq)), add = TRUE, col = "blue")
hist(ymat[, sample(1:10^4, 1)], freq = FALSE)
curve(dnorm(x, mu, sqrt(sigma.sq)), add = TRUE, col = "blue")
hist(ybars, freq = FALSE)
curve(dnorm(x, mu, sqrt(sigma.sq/n)), add = TRUE, col = "red")
The blue curves represent the density function of $N(\mu, \sigma^2)$. The red curve overlaid on the histogram of the ybars is the density function of $N(\mu, \sigma^2/n)$. This is the sampling distribution of $\overline{Y}$. This is what we theoretically expect to see when we look at the possible values $\overline{Y}$ generates. But what we actually see in this simulation are the observed sample means computed from $10^4$ realizations of $n = 5$ IID random variables. You can think of this as a simulated sampling distribution or a computer-generated sampling distribution meant to verify what we expect to see theoretically.
c(mean(ybars), var(ybars), sd(ybars))
[1] 1.0027555 0.8043694 0.8968665
Finally, notice that we should see $E(\overline{Y}) = \mu$ and $\mathrm{Var}(\overline{Y}) = \sigma^2/n$, which in the simulation setup are equal to 1 and 4/5, respectively. We do not get these values exactly, but we are close. A question to ponder is why we do not get the values exactly and what kind of accuracy and precision we get.
Exercise 2
Holding everything else fixed, rerun the R code but after modifying the code to allow for n to be equal to .
Return n to be 5, but now change nsim to be equal to . What do you notice? Compare with the case where nsim is equal to .
Try running the code a number of times. You may notice that some of the digits are not changing, but others are. How can you use this finding to guide how you report results?
Exercise 3 From the result $\overline{Y} \sim N(\mu, \sigma^2/n)$, you should be able to calculate probabilities involving $\overline{Y}$ if you know $\mu$, $\sigma^2$, and $n$. You can use the standard normal tables and you can also use R to calculate these probabilities.
To use standard normal tables, you need a standard normal random variable. How do you transform $\overline{Y}$ so that it will have a $N(0, 1)$ distribution? The transformed random variable is called a standardized statistic. In our simulation, we have $n = 5$, $\mu = 1$, and $\sigma^2 = 4$. Calculate the probabilities asked for, which are of the form $P(\overline{Y} \leq c)$ and $P(\overline{Y} > c)$.
Try modifying these commands so that you can compute the exact values of these probabilities using R. Verify that they match what you found in the previous exercise. To use R, you can use the pnorm() command. As a demonstration, consider $Y \sim N(2, 5)$. To find $P(Y \leq 1)$ and $P(Y > 1)$, you execute the following commands:
pnorm(1, mean = 2, sd = sqrt(5))
[1] 0.3273604
pnorm(1, mean = 2, sd = sqrt(5), lower.tail = FALSE)
[1] 0.6726396
Exercise 4 Let $Y_1, Y_2, \ldots, Y_n$ be IID Bernoulli random variables with a common probability of success $p$. You are going to explore the sampling distribution of $\overline{Y}$.
What is the distribution of $\sum_{i=1}^n Y_i$ where $n$ is fixed?
What would be the distribution of $\overline{Y}$?
Compute the expected value and variance of both $\sum_{i=1}^n Y_i$ and $\overline{Y}$.
Modify the code used to simulate IID normal random variables for the Bernoulli case and compute the mean, variance, and standard deviation of the sampling distribution of $\overline{Y}$. Try different values of $n$ and $p$. To simulate a Bernoulli random variable, we use rbinom(). For example,
# Toss 1 fair coin 10 times
rbinom(10, 1, 0.5)
[1] 0 0 0 0 0 1 0 1 0 1
# Toss 2 fair coins independently 10 times and record the total successes
rbinom(10, 2, 0.5)
[1] 1 0 2 2 2 2 2 1 2 2
Plotting the simulated sampling distribution of $\overline{Y}$ is straightforward. What is not straightforward is to superimpose the theoretical pmf of the sampling distribution of $\overline{Y}$. Can you explain why?
2.2 Why should we find the sampling distribution of $\overline{Y}$?
From the previous discussion and the exercises, the sampling distribution of $\overline{Y}$ allows us to make probability statements. Are these actually useful?
At the minimum, the previous discussion allows us to say something about the moments of the sampling distribution of $\overline{Y}$. As you recall, $E(\overline{Y}) = \mu$ and $\mathrm{Var}(\overline{Y}) = \sigma^2/n$.
Exercise 5 We found the moments of the sampling distribution of $\overline{Y}$ using Corollary 4.3.1 in LM. Determine the minimal set of assumptions needed to obtain the moments of the sampling distribution of $\overline{Y}$. Do you need normality? Do you need IID?
We take a moment to understand the implications of $E(\overline{Y}) = \mu$. Note that $\overline{Y}$ can be computed from the data you have (along with other hypothetical realizations). But it is possible that $\mu$ is unknown to us, meaning that $\mu$ may be an unknown parameter.
One task of statistics is parameter estimation. The idea is to use the data to estimate a parameter. Recall that $\overline{Y}$ can be thought of as a procedure. What makes this procedure desirable is that we can evaluate its behavior under repeated sampling (meaning looking at hypothetical realizations) and ask how this procedure enables us to learn unknown parameters. We proceed on two fronts.
2.2.1 Consistency
We are going to apply Chebyshev’s inequality found in Theorem 5.7.1 of LM. Applying this inequality to $\overline{Y}$, we obtain $$P\left(\left|\overline{Y} - \mu\right| \geq \varepsilon\right) \leq \frac{\sigma^2}{n\varepsilon^2}$$ for all $\varepsilon > 0$. So for any choice of $\varepsilon$, the sum of the probability that $\overline{Y} \geq \mu + \varepsilon$ and the probability that $\overline{Y} \leq \mu - \varepsilon$ is bounded above, and this upper bound becomes smaller as we make $n$ large. This means that for fixed $\varepsilon$, $\mu$, and $\sigma^2$, the sum of these two probabilities is getting smaller and smaller as $n$ becomes large. In a sense, the event that $\overline{Y}$ produces a realization outside of the region $(\mu - \varepsilon, \mu + \varepsilon)$ is getting less and less likely.
Notice also that the upper bound is controlled by $\sigma^2/(n\varepsilon^2)$. Furthermore, this upper bound goes to zero as $n \to \infty$, holding everything else fixed. In other words, for any choice of $\varepsilon > 0$, $$\lim_{n \to \infty} P\left(\left|\overline{Y} - \mu\right| \geq \varepsilon\right) = 0.$$ Thus, for any choice of $\varepsilon > 0$, the probability of the event that $\overline{Y}$ produces a realization outside of the region $(\mu - \varepsilon, \mu + \varepsilon)$ is negligible as $n \to \infty$. This means that “$\overline{Y}$ practically gives you $\mu$”, the latter being an unknown parameter. In this sense, we learn $\mu$ as the sample size becomes large enough by applying $\overline{Y}$.
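To see how conservative the Chebyshev bound can be, here is a minimal sketch using the running example; the choice $\varepsilon = 1$ is my own illustrative value, and the exact probability is available only because we are assuming normality.

n <- 5
mu <- 1
sigma.sq <- 4
eps <- 1
sigma.sq / (n * eps^2)                                    # Chebyshev upper bound
2 * pnorm(mu - eps, mean = mu, sd = sqrt(sigma.sq / n))   # exact P(|Ybar - mu| >= eps) under normality

The bound (0.8) is much larger than the exact probability (about 0.26), but it has the virtue of not requiring normality.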
What you have just encountered is sometimes phrased in different ways:
$\overline{Y}$ is a consistent estimator of $\mu$, or $\overline{Y}$ converges in probability to $\mu$ (denoted as $\overline{Y} \overset{p}{\to} \mu$). Refer to Definition 5.7.1 of LM. Observe that the definition uses the complement of the event $\left\{\left|\overline{Y} - \mu\right| \geq \varepsilon\right\}$.
The weak law of large numbers: Let $Y_1, Y_2, \ldots, Y_n$ be IID random variables with finite mean $\mu$. Then $\overline{Y} \overset{p}{\to} \mu$.
Let us demonstrate the consistency property of $\overline{Y}$ using R. Return to our example where $\mu = 1$ and $\sigma^2 = 4$, but this time we allow $n$ to change from 5 to larger values. Although the histograms look the same, pay attention to the horizontal and vertical axes! If you push the sample size to even larger values, what do you think will happen?
par(mfrow = c(2, 2))
hist(colMeans(replicate(nsim, rnorm(5, mu, sqrt(sigma.sq)))), freq = FALSE, xlab = "sample means", main = "Histogram of ybars")
curve(dnorm(x, mu, sqrt(sigma.sq/5)), add = TRUE, col = "red")
hist(colMeans(replicate(nsim, rnorm(20, mu, sqrt(sigma.sq)))), freq = FALSE, xlab = "sample means", main = "Histogram of ybars")
curve(dnorm(x, mu, sqrt(sigma.sq/20)), add = TRUE, col = "red")
hist(colMeans(replicate(nsim, rnorm(80, mu, sqrt(sigma.sq)))), freq = FALSE, xlab = "sample means", main = "Histogram of ybars")
curve(dnorm(x, mu, sqrt(sigma.sq/80)), add = TRUE, col = "red")
hist(colMeans(replicate(nsim, rnorm(320, mu, sqrt(sigma.sq)))), freq = FALSE, xlab = "sample means", main = "Histogram of ybars")
curve(dnorm(x, mu, sqrt(sigma.sq/320)), add = TRUE, col = "red")
Exercise 6 Under normality, we can have an exact value for $P\left(\left|\overline{Y} - \mu\right| \geq \varepsilon\right)$ provided that $\varepsilon$ is specified in advance. Suppose you set $\varepsilon$ to such a prespecified value; then $P\left(\left|\overline{Y} - \mu\right| \geq \varepsilon\right)$ can be computed exactly.
Compute $P\left(\left|\overline{Y} - \mu\right| \geq \varepsilon\right)$ under the assumption of normality of the IID random variables $Y_1, \ldots, Y_n$. Do you need to know the value of $\mu$ and of $\sigma^2$ to compute the desired probability?
What if $\varepsilon$ was not set equal to that prespecified value? How would you calculate $P\left(\left|\overline{Y} - \mu\right| \geq \varepsilon\right)$ under the same normality assumption?
2.2.2 Unbiasedness
We now think of what $E(\overline{Y}) = \mu$ means. We are going to use a result found in LM Exercise 3.6.13. In that exercise, we have a random variable $W$ with finite mean and a function $$g(c) = E\left[(W - c)^2\right].$$ This function is the mean squared error or MSE of $W$ about $c$.
Exercise 7 Show that the minimizer of $g(c)$ is $c = E(W)$ and that the minimized value of $g$ is $\mathrm{Var}(W)$.
The previous exercise provides an alternative interpretation of the expected value. It is the best or optimal prediction of some random variable under a mean squared error criterion. That means that if you are asked to “guess” the value of a random variable and you suffer “losses” from not making a perfect guess according to MSE, then the best guess is the expected value. More importantly, the smallest “loss” is actually the variance.
Applying Exercise 3.6.13 along with the alternative interpretation of the expected value, we can make sense of the expression $E(\overline{Y}) = \mu$. There are two parts to this expression. One is that $\overline{Y}$ is a random variable and our best guess of its value is its expected value $E(\overline{Y})$. Two, it so happens that this expected value is equal to the common mean $\mu$.
Had our best prediction been larger than $\mu$, then we “overshot” or systematically guessed something larger than what we are interested in, which is $\mu$. Had our best prediction been smaller than $\mu$, then we “undershot” or systematically guessed something smaller than what we are interested in, which is $\mu$. Therefore, it should be preferable to have $E(\overline{Y}) = \mu$.
There is also a connection between the finding $E(\overline{Y}) = \mu$ and the idea of unbiased estimation. Here, $\overline{Y}$ is called an unbiased estimator of $\mu$. Refer to Definition 5.4.1 in LM page 310. Note that the random sample requirement is not really needed in the definition, as long as the expected value of some estimator is equal to the unknown parameter of interest. $\overline{Y}$ has a sampling distribution and its “center” should be at $\mu$.
In general, unbiased estimators are not necessarily consistent and consistent estimators are not necessarily unbiased. But, there is a slight connection between unbiasedness and consistency (based on LM Exercise 5.7.4):
Definition 3 An estimator $\widehat{\theta}_n$ is squared-error consistent for the parameter $\theta$ if $$\lim_{n \to \infty} E\left[\left(\widehat{\theta}_n - \theta\right)^2\right] = 0.$$
Recall the quantity $E\left[(W - c)^2\right]$, which is the MSE of $W$ about $c$. This quantity is similar to $E\left[\left(\widehat{\theta}_n - \theta\right)^2\right]$, which is the MSE of $\widehat{\theta}_n$ about $\theta$. The context is slightly different because $\theta$ is something we want to learn, but the underlying idea is the same. We can also write $$E\left[\left(\widehat{\theta}_n - \theta\right)^2\right] = \mathrm{Var}\left(\widehat{\theta}_n\right) + \left[E\left(\widehat{\theta}_n\right) - \theta\right]^2.$$ The second term is the square of the bias of the estimator $\widehat{\theta}_n$, where the bias is $E\left(\widehat{\theta}_n\right) - \theta$.
To show that an estimator is squared-error consistent, we just have to show that $$\lim_{n \to \infty} \mathrm{Var}\left(\widehat{\theta}_n\right) = 0 \quad\text{and}\quad \lim_{n \to \infty} E\left(\widehat{\theta}_n\right) = \theta.$$ The latter expression is another way of saying that $\widehat{\theta}_n$ is asymptotically unbiased for $\theta$.
Exercise 8
Show that whenever $Y_1, \ldots, Y_n$ are IID random variables with mean $\mu$ and finite variances, then $\overline{Y}$ is squared-error consistent for $\mu$.
Using Chebyshev’s inequality, show that if an estimator $\widehat{\theta}_n$ is squared-error consistent for the parameter $\theta$, then $\widehat{\theta}_n \overset{p}{\to} \theta$.
2.3 Measuring uncertainty
From a prediction angle, the smallest “loss” incurred from not knowing $\overline{Y}$ for sure, relying on its property that the best guess satisfies $E(\overline{Y}) = \mu$, is $\mathrm{Var}(\overline{Y})$, which happens to be $\sigma^2/n$. Even though we get to see the observed sample mean $\overline{y}$, we only know the statistical properties of $\overline{Y}$.
From a sampling distribution angle, $\mathrm{Var}(\overline{Y})$ is a measure of spread. By definition, the variance of $\overline{Y}$ is the “average” spread of $\overline{Y}$ about its mean: $$\mathrm{Var}(\overline{Y}) = E\left[\left(\overline{Y} - E(\overline{Y})\right)^2\right] = E\left[\left(\overline{Y} - \mu\right)^2\right].$$ Therefore, it measures a “typical” distance of $\overline{Y}$ relative to $\mu$. But this distance is in squared units, so it is not on the same scale as the original scale of the $Y_i$'s. Thus, taking a square root puts things back on the same scale.
As a result, the quantity $\sqrt{\mathrm{Var}(\overline{Y})} = \sigma/\sqrt{n}$ is so important that it is given a specific name: the standard error of the sample mean. This is nothing but the standard deviation of the sampling distribution of $\overline{Y}$. So why give it a new name? The reason is that there is also a standard deviation for a list of numbers, just like there is a mean for a list of numbers. Suppose you have a list of numbers $y_1, y_2, \ldots, y_n$. The standard deviation of this list is computed as $$\sqrt{\frac{1}{n}\sum_{i=1}^n \left(y_i - \overline{y}\right)^2}.$$ The difference lies in the fact that the preceding formula is for numbers that you have, while $\sigma/\sqrt{n}$ is for hypothetical sample means we could have gotten. Be very aware of this distinction.
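To make the distinction concrete, here is a minimal sketch reusing the running example: the standard deviation of one observed list, the standard error of the sample mean, and the standard deviation of many simulated sample means are three different computations.

n <- 5
mu <- 1
sigma.sq <- 4
y <- rnorm(n, mu, sqrt(sigma.sq))
sd(y)                # standard deviation of the list of numbers we have (R's sd() divides by n - 1)
sqrt(sigma.sq / n)   # standard error of the sample mean when sigma.sq is known
sd(colMeans(replicate(10^4, rnorm(n, mu, sqrt(sigma.sq)))))   # simulated counterpart of the standard error

Only the last two quantities should be close to each other; the first describes the spread of one sample, not the spread of the sampling distribution.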
Chebyshev’s inequality is also tied to $\sigma/\sqrt{n}$ as you have seen earlier. If we set $\varepsilon = k\,\sigma/\sqrt{n}$, where $k$ is a positive constant (a factor of a standard error), then $$P\left(\left|\overline{Y} - \mu\right| \geq k\frac{\sigma}{\sqrt{n}}\right) \leq \frac{1}{k^2} \quad\Longleftrightarrow\quad P\left(\left|\frac{\overline{Y} - \mu}{\sigma/\sqrt{n}}\right| < k\right) \geq 1 - \frac{1}{k^2}.$$ Notice that we have probabilities involving a standardized quantity $\dfrac{\overline{Y} - \mu}{\sigma/\sqrt{n}}$.
If we take $k = 2$, the probability that $\overline{Y}$ will be within 2 standard errors of $\mu$ is greater than or equal to 75%. If we take $k = 3$, the probability that $\overline{Y}$ will be within 3 standard errors of $\mu$ is greater than or equal to 88%. Therefore, $\sigma/\sqrt{n}$ can be thought of as an appropriate scale to use to measure how far $\overline{Y}$ is from $\mu$. It becomes a way to measure our uncertainty about $\mu$.
Exercise 9 Revisit Exercise 6. We can actually calculate the exact probability, rather than just an upper bound, in the case of IID normal random variables, because the standardized quantity $$\frac{\overline{Y} - \mu}{\sigma/\sqrt{n}}$$ has a standard normal distribution. Because this distribution does not depend on unknown parameters, we call $\dfrac{\overline{Y} - \mu}{\sigma/\sqrt{n}}$ a pivotal quantity or a pivot.
Exercise 10 Recall Exercise 3. You are going to add lines of code to the simulation and use the simulated sample means found in ybars to approximate the probabilities of Exercise 3. You computed the exact probabilities in Exercise 3, but, for the moment, pretend you do not know these exact probabilities. Report an estimate and a standard error.
The prediction angle and the sampling distribution angle actually coincide but come from different starting points. More importantly, both quantify how much we “lose” because of the randomness of $\overline{Y}$. For sure, we can learn the unknown parameter $\mu$ under certain conditions, but we pay a price for using the data to learn it, and the price you pay really depends on the conditions you impose.
The standard error formula $\sigma/\sqrt{n}$ only works in the IID case. Once you depart from the IID case, things become more complicated. Let $Z_1, \ldots, Z_n$ be random variables where $Z_i = Y_i + 0.5\,Y_{i-1}$ and $Y_0, Y_1, \ldots, Y_n$ are IID normal random variables $N(\mu, \sigma^2)$. Will it be the case that $$\mathrm{Var}\left(\overline{Z}\right) = \frac{\sigma^2}{n}?$$ Let us look at a simulation where $n = 20$, $\mu = 0$, and $\sigma^2 = 4$.
# Create function to generate realizations of Z
generate.z <- function(n, mu, sigma.sq) {
  y0 <- rnorm(1, mu, sqrt(sigma.sq))
  y <- rnorm(n, mu, sqrt(sigma.sq))
  z <- numeric(n)
  for (j in 2:n) {
    z[1] <- y[1] + 0.5 * y0
    z[j] <- y[j] + 0.5 * y[(j - 1)]
  }
  return(z)
}
nsim <- 10^4   # number of realizations to be obtained
# repeatedly obtain realizations
zmat <- replicate(nsim, generate.z(20, 0, 4))
zbars <- colMeans(zmat)
c(mean(zbars), var(zbars), sd(zbars))
[1] -0.001614658 0.440703143 0.663854760
hist(zbars, freq = FALSE)
curve(dnorm(x, 0, sqrt(4/20)), add = TRUE, col = "red")
Notice that the mean of the sampling distribution of $\overline{Z}$ matches the mean of each of the $Z_i$'s. If you blindly apply the “formulas”, the standard error based on IID conditions, $\sqrt{4/20} \approx 0.45$, is far off compared to what you see in the simulation as the standard deviation of the sampling distribution: 0.66. The formula is off by about 48%! Therefore, we have to think about how to adjust the formula for the standard error of the sample mean to accommodate non-IID situations. We do not cover how to make the adjustment in this course, but you will learn more about this in future courses.
Exercise 11
Adjust the code so that you simulate IID normal random variables with the same $\mu$ and $\sigma^2$, holding everything else constant.
Explore how things change with a larger sample size. Will a larger $n$ make the issue with the incorrect standard error formula magically disappear?
Let $Y_0, Y_1, \ldots, Y_n$ be IID normal random variables $N(\mu, \sigma^2)$ and define $Z_i = Y_i + 0.5\,Y_{i-1}$ for $i = 1, \ldots, n$. You are now going to derive some theory.
Are $Z_1, Z_2, \ldots, Z_n$ IID random variables? Explain.
Find $E(Z_i)$ and $\mathrm{Var}(Z_i)$.
Find $\mathrm{Cov}(Z_i, Z_{i+1})$ and show that $\mathrm{Cov}(Z_i, Z_{i+j}) = 0$ for all $j \geq 2$.
Find $E\left(\overline{Z}\right)$ and $\mathrm{Var}\left(\overline{Z}\right)$.
2.4 Looking forward to generalizing our main example
The bottom line from the previous discussions is that the sampling distribution of some statistic is the distribution of the statistic under repeated sampling. Repeated sampling does not necessarily mean random sampling. What this means is that we have multiple realizations of a set of random variables and we apply the statistic to these multiple realizations. In a sense, these multiple realizations are really hypothetical. We only get to see one realization and that is the observed data. The core idea of understanding a statistic as a procedure which has properties under repeated sampling is the essence of frequentist inference.
We now contemplate what will happen in a setting slightly more general than our main example. We now move on to the following: Consider a setting where we observe a sample $y_1, \ldots, y_n$ which is a realization from IID random variables $Y_1, \ldots, Y_n$. These random variables have some joint distribution $f\left(y_1, \ldots, y_n; \theta\right)$, where $\theta \in \Theta$ and $\Theta$ is the parameter space.
What makes this generalization harder is that even if we specify what $f$ is, we still have to figure out what parameter $\theta$ to estimate. After that, we have to obtain either the sampling distribution of the estimator or establish whether the estimator is unbiased for $\theta$ and/or consistent for $\theta$. We also have to derive a standard error so that we can measure our uncertainty about $\theta$.
In this course, I will discuss general-purpose methods of estimation which will allow us to estimate $\theta$ for any data model you specify in advance. These general-purpose methods allow for a unified theory, so that we need not demonstrate good statistical properties on a case-by-case basis. But in this set of notes, we will dig a bit deeper into the IID normal case.
3 Digging deeper into normal models
I am now going to study normal random variables and their related extensions in greater detail for the following reasons:
To introduce you to classical results about the behavior of statistical procedures when the sample size is fixed or finite: Normal models form the basis of what are called “exact results”.
To prepare you for studying the asymptotic behavior (sample size $n \to \infty$) of statistical procedures
To show you the advantages and disadvantages of using the normal distribution as a modeling device
Normal random variables are convenient to work with but you need to know at what cost.
How about the distributions that arise from the normal?
3.1 Exact results: A preview
3.1.1 An example of a confidence interval
Another task of statistics is interval estimation. Interval estimation differs from parameter estimation in that we want a sense of the precision of our estimators. A fair question to ask is why we need interval estimation when we already have a standard error. Interval estimation carries more information about plausible values for the unknown parameter of interest. Not surprisingly, standard errors will play a crucial role in interval estimation.
We return to our context where we observe a sample $y_1, \ldots, y_n$ which is a realization from the IID normal random variables $Y_1, \ldots, Y_n \sim N(\mu, \sigma^2)$. What does probability theory tell us about $\overline{Y}$? Recall that $$\overline{Y} \sim N\!\left(\mu, \frac{\sigma^2}{n}\right) \quad\Longleftrightarrow\quad \frac{\overline{Y} - \mu}{\sigma/\sqrt{n}} \sim N(0, 1).$$
From the normality of the sampling distribution of $\overline{Y}$, Exercise 3, and Exercise 6, you have seen that you can make exact probability statements involving $\overline{Y}$, rather than just bounds from Chebyshev’s inequality. For example, $$P\left(-1.96 \leq \frac{\overline{Y} - \mu}{\sigma/\sqrt{n}} \leq 1.96\right) = 0.95.$$ But this probability statement gives us another way to quantify our uncertainty about not knowing $\mu$ for sure. In particular, we can rewrite the previous probability statement as $$P\left(\overline{Y} - 1.96\frac{\sigma}{\sqrt{n}} \leq \mu \leq \overline{Y} + 1.96\frac{\sigma}{\sqrt{n}}\right) = 0.95.$$ We have to be careful in expressing the last probability statement in words. The rewritten probability statement is saying that the interval $\left(\overline{Y} - 1.96\frac{\sigma}{\sqrt{n}},\ \overline{Y} + 1.96\frac{\sigma}{\sqrt{n}}\right)$, whose endpoints are random variables, is able to “capture” or “cover” the unknown parameter $\mu$ with 95% probability.
We could have rewritten the probability statement as $$P\left(\mu \in \left(\overline{Y} - 1.96\frac{\sigma}{\sqrt{n}},\ \overline{Y} + 1.96\frac{\sigma}{\sqrt{n}}\right)\right) = 0.95,$$ but writing the statement this way introduces potential misinterpretation. You might read the statement as “there is a 95% probability that $\mu$ is inside the interval”. This is incorrect, and we will explore why in the following simulation.
We will be constructing these intervals repeatedly for different hypothetical realizations. Set $n = 5$, $\mu = 1$, and $\sigma^2 = 4$. We are going to pretend to know the value of $\sigma^2$ when constructing the interval. It may feel weird, but it is a simplification so that we can directly illustrate the source of misinterpretation.
n <- 5
mu <- 1
sigma.sq <- 4
# A realization of IID normal random variables
y <- rnorm(n, mu, sqrt(sigma.sq))
# Construct the interval
c(mean(y) - 2 * 2 / sqrt(n), mean(y) + 2 * 2 / sqrt(n))
[1] -1.604906 1.972803
Is $\mu = 1$ in the interval? So where does the probability 95% come in? Simulate another realization and construct another interval of the same form.
# A realization of IID normal random variables
y <- rnorm(n, mu, sqrt(sigma.sq))
# Construct the interval
c(mean(y) - 2 * 2 / sqrt(n), mean(y) + 2 * 2 / sqrt(n))
[1] -0.2860722 3.2916366
Is $\mu = 1$ in the interval? Let us repeat the construction of the interval 10000 times.
# Create a function depending on sample size
cons.ci <- function(n) {
  y <- rnorm(n, mu, sqrt(sigma.sq))
  c(mean(y) - 2 * 2 / sqrt(n), mean(y) + 2 * 2 / sqrt(n))
}
# Construct interval 10000 times
results <- replicate(10^4, cons.ci(5))
# Calculate how many intervals contain mu = 1
mean(results[1, ] < 1 & 1 < results[2, ])
[1] 0.9545
Let us visualize the first 100 intervals.
require(plotrix)
Loading required package: plotrix
center <- (results[1, ] + results[2, ]) / 2
# Plot of the first 100 confidence intervals
plotCI(1:100, center[1:100], li = results[1, 1:100], ui = results[2, 1:100],
       xlab = "", ylab = "", sfrac = 0.001, pch = 20, cex = 0.5)
abline(h = mu, col = "red")
Notice that sometimes the interval computed for a particular realization is able to “capture” $\mu$, but there are times when the interval is not able to “capture” $\mu$. We found from the simulation that roughly 95% of the computed intervals contain $\mu$. This is where the 95% gets its meaning. Furthermore, either a computed interval contains $\mu$ or it does not. More importantly, for any specific interval you calculate using the data on hand, you will never know whether or not it has “captured” $\mu$.
3.1.2 Constructing confidence intervals for $\mu$ when $\sigma^2$ is known
What you have seen so far is a statistical procedure called a confidence interval.
Definition 4 Let $\theta$ be an unknown parameter of interest and $\alpha \in (0, 1)$. Let $L$ and $U$ be statistics. The interval $(L, U)$ is a $100(1-\alpha)\%$ confidence interval for $\theta$ if $P(L \leq \theta \leq U) = 1 - \alpha$ for every $\theta \in \Theta$.
When $Y_1, \ldots, Y_n$ are IID normal random variables $N(\mu, \sigma^2)$ with $\sigma^2$ known, we can construct a confidence interval for $\mu$ in a more general way. We need to find constants (not depending on parameters or the data) $a$ and $b$ such that $$P\left(a \leq \frac{\overline{Y} - \mu}{\sigma/\sqrt{n}} \leq b\right) = 1 - \alpha.$$ These two constants exist because $\dfrac{\overline{Y} - \mu}{\sigma/\sqrt{n}}$ is a pivotal quantity. Thus, a $100(1-\alpha)\%$ confidence interval for $\mu$ is $$\left(\overline{Y} - b\,\frac{\sigma}{\sqrt{n}},\ \overline{Y} - a\,\frac{\sigma}{\sqrt{n}}\right).$$
A popular choice is to fix $a = -z_{\alpha/2}$ and $b = z_{\alpha/2}$ so that you have a symmetric $100(1-\alpha)\%$ confidence interval. What this means is that $$P\left(\frac{\overline{Y} - \mu}{\sigma/\sqrt{n}} < -z_{\alpha/2}\right) = P\left(\frac{\overline{Y} - \mu}{\sigma/\sqrt{n}} > z_{\alpha/2}\right) = \frac{\alpha}{2},$$ where $z_{\alpha/2}$ is the value for which $P(Z > z_{\alpha/2}) = \alpha/2$ with $Z \sim N(0, 1)$. In other words, $z_{\alpha/2}$ is the $(1 - \alpha/2)$-quantile of the standard normal. Choosing $a$ and $b$ to have a symmetric $100(1-\alpha)\%$ confidence interval ensures that the interval $\left(\overline{Y} - z_{\alpha/2}\frac{\sigma}{\sqrt{n}},\ \overline{Y} + z_{\alpha/2}\frac{\sigma}{\sqrt{n}}\right)$ is at its shortest.
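A minimal sketch of this construction in R, reusing the running example: the quantile comes from qnorm(), whereas the simulation code earlier used the rounded factor 2 in place of $z_{0.025} \approx 1.96$.

n <- 5
mu <- 1
sigma.sq <- 4
y <- rnorm(n, mu, sqrt(sigma.sq))
z <- qnorm(0.975)   # the 0.975-quantile of the standard normal, about 1.96
c(mean(y) - z * sqrt(sigma.sq) / sqrt(n), mean(y) + z * sqrt(sigma.sq) / sqrt(n))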
The exact confidence statement can be made under the assumption of normality and a known variance $\sigma^2$. This exact confidence statement is likely to be extremely limited because it asks us to pretend that $\sigma^2$ is known, yet $\mu$ is unknown.
If you lose normality (whether or not variance is known), the interval is no longer valid but it may be justified when sample sizes are large. You may have to find a different pivotal quantity.
If you lose known variance (provided that normality still holds), the interval is also no longer valid, but $a$ and $b$ may have to be adjusted to compensate.
Exercise 12 (Based on Stine and Foster (2004)) Hoping to lure more shoppers downtown, a city builds a new public parking garage in the central business district. The city plans to pay for the structure through parking fees. In their study of past performance of other parking garages in other shopping areas, the city records indicate that the standard deviation of daily fees is 150 dollars. During the past 44 weekdays, daily fees collected averaged 1264 dollars.
What assumptions and conditions would you need to make in order to use these statistics for inference?
Describe the parameter of interest to the city. Be specific.
Compute a 95% confidence interval for the parameter in item 2.
Do you think the true value of the parameter in item 2 is in the confidence interval you computed in item 3?
3.1.3 Testing claims about $\mu$ if $\sigma^2$ is known
Another task of statistics is to test claims or hypotheses we may have about the parameters of a model. There are situations where we may have an idea about the value of some parameters in a model. In quality control, firms have a record of past production and they may want to check whether a measurable aspect of a recent batch of products is in line with previously established quality guidelines. Another situation is where we want to know whether the observed data are consistent with a model whose parameters are set to some known values. This model might be based on subject-matter knowledge.
Suppose we want to test the claim that $\mu = \mu_0$, where $\mu_0$ is known to us in advance (meaning before we actually see the data). We get to observe $y_1, \ldots, y_n$ and ask whether what we observed is compatible with a model of IID normal random variables $N(\mu_0, \sigma^2)$, where $\sigma^2$ is known. One way to answer this question is to calculate the probability of observing values of $\overline{Y}$ as extreme or more extreme than what we have observed, under the assumption that $\mu = \mu_0$ is true. This probability is called the observed level of significance or $p$-value. A small $p$-value means that what we have observed is very unlikely under the assumption that $\mu = \mu_0$ is true. A large $p$-value means that what we have observed is compatible with the assumption that $\mu = \mu_0$ is true.
We will return to hypothesis testing later, as we need to understand why this procedure works and when it could be applied. For example, you should be wondering how small is small and how large is large.
We end this short introduction to the idea of hypothesis testing with an illustration and a simulation. Consider LM Example 6.2.2. In that example, we are supposed to evaluate the claim that $\mu = 494$, meaning that the results of the new curriculum (which produced exam scores with an observed sample mean of 502) are compatible with past performance without the new curriculum. The example assumes normality of exam scores and that $\sigma = 124$ is known. The $p$-value was calculated to be 0.55. In the calculation, the meaning of “as extreme or more extreme” allows for two directions.
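A minimal sketch of the exact calculation, using the numbers quoted above (sample size 86, hypothesized mean 494, known standard deviation 124, observed sample mean 502):

# Two-sided p-value from the standardized statistic
z.obs <- (502 - 494) / (124 / sqrt(86))
2 * pnorm(abs(z.obs), lower.tail = FALSE)   # about 0.55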
We can also obtain a similar -value using simulation. What you are seeing is a simple form of what some call simulation-based inference.
nsim <- 10^4
# Take note that we created artificial data assuming mu = 494
scoremat <- replicate(nsim, rnorm(86, mean = 494, sd = 124))
# Calculate the simulated sample means
sim.scores <- colMeans(scoremat)
# Count how many of the simulated sample means exceed 502
mean(sim.scores >= 502)
[1] 0.2718
# The simulated p-value which should be close to the calculation
mean(abs((sim.scores - 494) / (124 / sqrt(86))) > (502 - 494) / (124 / sqrt(86)))
[1] 0.5506
The bottom line from this illustration is that seeing a sample mean of 502 is not statistically different from 494. So, 502 is something that is “typical” if we believe that is true.
3.2 Exact results: The more realistic case
3.2.1 What happens if $\sigma^2$ is unknown?
Under IID normal random variables with $\sigma^2$ known, a $100(1-\alpha)\%$ confidence interval for $\mu$ is given by $$\left(\overline{Y} - z_{\alpha/2}\frac{\sigma}{\sqrt{n}},\ \overline{Y} + z_{\alpha/2}\frac{\sigma}{\sqrt{n}}\right).$$
If we do not know $\sigma$, we need to plug in a value for it. $\sigma$ is, just like $\mu$, an unknown parameter. So, just like before when we were trying to estimate $\mu$, we have to search for a statistic which could help us learn $\sigma$.
A possible starting point is to understand first what $\sigma$ is. It is the square root of the common variance $\sigma^2$. By the definition of the variance, $\sigma^2 = E\left[(Y_i - \mu)^2\right]$. The weak law of large numbers suggests a possible way to estimate $\sigma^2$. Recall that to estimate $\mu = E(Y_i)$, we can use the sample mean of the $Y_i$'s. Extending this idea to $\sigma^2 = E\left[(Y_i - \mu)^2\right]$, we can use the sample mean of the $(Y_i - \mu)^2$'s. The only problem is that $\mu$ is unknown, but it can be estimated quite well. Taking all of these together, a possible candidate to estimate $\sigma^2$ is $$\widehat{\sigma}^2 = \frac{1}{n}\sum_{i=1}^n \left(Y_i - \overline{Y}\right)^2.$$ For the development of the theory for normal models, we are going to use a related candidate, which is $$S^2 = \frac{1}{n-1}\sum_{i=1}^n \left(Y_i - \overline{Y}\right)^2.$$ The question is whether $S^2$ and $S$ help us to learn $\sigma^2$ and $\sigma$.
We can use simulation to compare the sampling distributions of $\dfrac{\overline{Y} - \mu}{\sigma/\sqrt{n}}$ and $\dfrac{\overline{Y} - \mu}{S/\sqrt{n}}$. We already know that under the assumption of IID normal random variables, $\dfrac{\overline{Y} - \mu}{\sigma/\sqrt{n}} \sim N(0, 1)$. But we do not know anything yet about the distribution of $\dfrac{\overline{Y} - \mu}{S/\sqrt{n}}$. The difference is that $\sigma$ is not random, but $S$ is a random variable.
n <- 5
mu <- 1
sigma.sq <- 4
nsim <- 10^4   # number of realizations to be obtained
# repeatedly obtain realizations
ymat <- replicate(nsim, rnorm(n, mu, sqrt(sigma.sq)))
ybars <- colMeans(ymat)
sds <- apply(ymat, 2, sd)
par(mfrow = c(1, 2))
hist((ybars - mu) / (sqrt(sigma.sq) / sqrt(n)), freq = FALSE, main = "Known sigma.sq")
curve(dnorm(x, 0, 1), add = TRUE, col = "red")
hist((ybars - mu) / (sds / sqrt(n)), freq = FALSE, main = "Plug-in")
curve(dnorm(x, 0, 1), add = TRUE, col = "red")
The red curve is the density function of $N(0, 1)$. If you compare the two histograms, you will notice that the red curve “fits” better for $\dfrac{\overline{Y} - \mu}{\sigma/\sqrt{n}}$ than for $\dfrac{\overline{Y} - \mu}{S/\sqrt{n}}$. There is less concentration around 0 for the histogram on the right and more mass in the extremes. This means that extreme positive and negative realizations of $\dfrac{\overline{Y} - \mu}{S/\sqrt{n}}$ are more likely than the standard normal would suggest.
We can explore what happens when we increase the sample size to $n = 20$.
n <- 20
mu <- 1
sigma.sq <- 4
nsim <- 10^4   # number of realizations to be obtained
# repeatedly obtain realizations
ymat <- replicate(nsim, rnorm(n, mu, sqrt(sigma.sq)))
ybars <- colMeans(ymat)
sds <- apply(ymat, 2, sd)
par(mfrow = c(1, 2))
hist((ybars - mu) / (sqrt(sigma.sq) / sqrt(n)), freq = FALSE, main = "Known sigma.sq")
curve(dnorm(x, 0, 1), add = TRUE, col = "red")
hist((ybars - mu) / (sds / sqrt(n)), freq = FALSE, main = "Plug-in")
curve(dnorm(x, 0, 1), add = TRUE, col = "red")
Exercise 13 Modify the previous code and explore what happens for even larger values of $n$.
You will notice from the previous exercise that as $n$ grows, the approximation of the histogram of standardized sample means by the red curve is getting better. But we still have to address what happens with finite $n$. There are two approaches to pursue:
Exploit the normality assumption and obtain the exact finite-sample distribution of $\dfrac{\overline{Y} - \mu}{S/\sqrt{n}}$.
Derive a central limit theorem for $\dfrac{\overline{Y} - \mu}{S/\sqrt{n}}$.
3.2.2 Deriving the distribution of $\dfrac{\overline{Y} - \mu}{S/\sqrt{n}}$
3.2.2.1 A detour to some linear algebra
We will discuss the first approach, but it requires a brief detour to introduce a special type of linear transformation called an orthogonal transformation. Notice that $\overline{Y}$ and $S^2$ algebraically depend on each other. Applying an orthogonal transformation to IID normally distributed random variables allows us to show that these two quantities do not statistically depend on each other.
Consider the case where we are linearly transforming a vector $(Y_1, Y_2)$ to another vector $(Z_1, Z_2)$. We can write the transformation as a system of equations: $$\begin{aligned} Z_1 &= a_{11} Y_1 + a_{12} Y_2 \\ Z_2 &= a_{21} Y_1 + a_{22} Y_2 \end{aligned}$$ or in matrix form: $$\begin{pmatrix} Z_1 \\ Z_2 \end{pmatrix} = \begin{pmatrix} a_{11} & a_{12} \\ a_{21} & a_{22} \end{pmatrix} \begin{pmatrix} Y_1 \\ Y_2 \end{pmatrix} = A \begin{pmatrix} Y_1 \\ Y_2 \end{pmatrix}.$$ What makes a linear transformation an orthogonal transformation? In other words, what restrictions should be placed on $A$ so that it represents an orthogonal transformation? I now introduce some linear algebra concepts in two-dimensional Euclidean space $\mathbb{R}^2$, but these extend to $n$-dimensional Euclidean space $\mathbb{R}^n$.
Definition 5
The length or the Euclidean norm of a vector $\mathbf{z} = (z_1, z_2)$ is given by $\|\mathbf{z}\| = \sqrt{z_1^2 + z_2^2}$.
The dot product of two vectors $\mathbf{w} = (w_1, w_2)$ and $\mathbf{z} = (z_1, z_2)$ is given by $\mathbf{w} \cdot \mathbf{z} = w_1 z_1 + w_2 z_2$.
$\mathbf{w}$ and $\mathbf{z}$ are orthogonal to each other if and only if their dot product is zero.
A matrix $A$ is orthogonal if and only if the length of $A\mathbf{z}$ is the same as the length of $\mathbf{z}$, i.e., $\|A\mathbf{z}\| = \|\mathbf{z}\|$ for all $\mathbf{z}$.
A matrix is the identity matrix if its $(i, i)$ entries are equal to 1 for all $i$ and its $(i, j)$ entries are equal to 0 for all $i \neq j$. We denote the identity matrix by $I$.
A transpose of a matrix $A$ is formed by changing columns into rows and rows into columns so that the $(i, j)$th entry of $A^T$ is the $(j, i)$th entry of $A$.
The inverse of a matrix $A$ is a matrix $A^{-1}$ which satisfies $A A^{-1} = A^{-1} A = I$.
Based on the previous definition, we have the following restrictions on an orthogonal matrix $A$: $$a_{11}^2 + a_{21}^2 = 1, \qquad a_{12}^2 + a_{22}^2 = 1, \qquad a_{11}a_{12} + a_{21}a_{22} = 0.$$ We can rewrite these restrictions as restrictions involving the columns of the matrix $A$. The first and second restrictions imply that the columns of $A$ have length equal to 1. The third restriction implies that the columns of $A$ are orthogonal to each other.
Exercise 14 Let $A$ be an orthogonal matrix.
Show that the entries of $A$ will satisfy the restrictions mentioned earlier.
Show that $A^T A = I$. What is the inverse of an orthogonal matrix $A$?
3.2.2.2 $\overline{Y}$ and $S^2$ are independent under normality
How do orthogonal transformations interact with IID normal random variables $Y_1, \ldots, Y_n \sim N(\mu, \sigma^2)$? To answer this, we have to find the expected value, variance, and joint density of the transformed random variables $Z = AY$. One thing to note is that the variance stays the same after the orthogonal transformation.
Exercise 15 Show that the variance of each component of the transformed vector is unchanged, i.e., $\mathrm{Var}(Z_i) = \sigma^2$.
We start from the joint density of IID normal random variables $Y_1, Y_2 \sim N(\mu, \sigma^2)$. In this case, we have $$f_{Y_1, Y_2}(y_1, y_2) = \frac{1}{2\pi\sigma^2}\exp\left(-\frac{1}{2\sigma^2}\left[(y_1 - \mu)^2 + (y_2 - \mu)^2\right]\right).$$ We have to express the $y$'s in terms of the $z$'s. Since we are working with orthogonal transformations, we assume that $A$ is an orthogonal matrix whose entries satisfy the restrictions above. Based on Exercise 14, $$\begin{pmatrix} y_1 \\ y_2 \end{pmatrix} = A^{-1}\begin{pmatrix} z_1 \\ z_2 \end{pmatrix} = A^T\begin{pmatrix} z_1 \\ z_2 \end{pmatrix}.$$ From a geometric perspective, orthogonal transformations preserve lengths of vectors. Therefore, the transformation from $(y_1, y_2)$ to $(z_1, z_2)$ does not change volume. The joint density of $(Z_1, Z_2)$ for this case becomes $$f_{Z_1, Z_2}(z_1, z_2) = \frac{1}{2\pi\sigma^2}\exp\left(-\frac{1}{2\sigma^2}\left[(z_1 - \mu_1^*)^2 + (z_2 - \mu_2^*)^2\right]\right), \qquad \mu_1^* = (a_{11} + a_{12})\mu, \quad \mu_2^* = (a_{21} + a_{22})\mu.$$ Therefore, the joint density of $(Z_1, Z_2)$ is a product of the density of a $N(\mu_1^*, \sigma^2)$ and the density of a $N(\mu_2^*, \sigma^2)$.
From a statistical perspective, applying an orthogonal transformation to IID normal random variables preserves normality, independence, and the variance. Note that the means are not necessarily identical anymore. The derivation presented extends to $n$ dimensions.
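A minimal simulation check of this statement (the 4-dimensional case and the use of qr.Q() to produce a random orthogonal matrix are my own illustrative choices):

n <- 4
mu <- 1
sigma.sq <- 4
Q <- qr.Q(qr(matrix(rnorm(n^2), n, n)))   # a random orthogonal matrix
round(t(Q) %*% Q, 10)                     # should be the identity matrix
ymat <- replicate(10^4, rnorm(n, mu, sqrt(sigma.sq)))
zmat <- Q %*% ymat                        # each column is Z = QY
round(cov(t(zmat)), 2)                    # approximately sigma.sq times the identity
rowMeans(zmat)                            # approximately mu * rowSums(Q): the means change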
Now that we know the statistical effect of applying an orthogonal transformation, the next question is how to construct an orthogonal transformation which will help us show that $\overline{Y}$ and $S^2$ do not statistically depend on each other. Let $Y_1, \ldots, Y_n$ be IID normal random variables $N(\mu, \sigma^2)$. Since $S^2$ depends on the deviations from the mean, $Y_i - \overline{Y}$, we consider the linear transformation which maps $(Y_1, \ldots, Y_n)$ to $\left(\overline{Y},\ Y_2 - \overline{Y},\ \ldots,\ Y_n - \overline{Y}\right)$. Unfortunately, the matrix representing this transformation is not an orthogonal matrix. It is the case that the first row is orthogonal to the remaining rows of the matrix, but its length is not 1. We can rescale the first row from $(1/n, \ldots, 1/n)$ to $(1/\sqrt{n}, \ldots, 1/\sqrt{n})$ so that the first element of the transformed vector becomes $Z_1 = \sqrt{n}\,\overline{Y}$. It is possible to complete this matrix to ensure that it is an orthogonal matrix, but it is not necessary to see its complete form.
Why? Because we can already obtain a set of results, provided that the completed matrix $A$ is an orthogonal matrix:
$Z_1 = \sqrt{n}\,\overline{Y} \sim N\left(\sqrt{n}\,\mu, \sigma^2\right)$, which implies a previous result, namely that $\overline{Y} \sim N(\mu, \sigma^2/n)$.
$Z_1, Z_2, \ldots, Z_n$ are still independent random variables after an orthogonal transformation, even if we do not know explicitly what the transformation actually looks like.
The sum of squares of the random variables $Y_1, \ldots, Y_n$ is equal to the sum of squares of the random variables $Z_1, \ldots, Z_n$, i.e., $$\sum_{i=1}^n Y_i^2 = \sum_{i=1}^n Z_i^2.$$ This algebraic result allows us to separate the sum of squares into two parts: a function of $\overline{Y}$ and a function of the deviations $Y_i - \overline{Y}$: $$\sum_{i=1}^n Y_i^2 = n\overline{Y}^2 + \sum_{i=1}^n \left(Y_i - \overline{Y}\right)^2 = Z_1^2 + \sum_{i=2}^n Z_i^2.$$ More importantly, these two parts are statistically independent of each other. Therefore $\overline{Y}$ is independent of $S^2$.
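A quick simulation check of this independence claim: compute the sample mean and sample variance across many realizations and look at their correlation. Near-zero correlation is of course weaker than independence, but it is consistent with the result above.

n <- 5
mu <- 1
sigma.sq <- 4
ymat <- replicate(10^4, rnorm(n, mu, sqrt(sigma.sq)))
ybars <- colMeans(ymat)
s2 <- apply(ymat, 2, var)
cor(ybars, s2)   # close to zero under normality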
In case you want to complete the entries of the orthogonal matrix $A$, we can say much more, but there are alternative ways to prove these other statements. The following matrix is an orthogonal matrix: $$A = \begin{pmatrix} \frac{1}{\sqrt{n}} & \frac{1}{\sqrt{n}} & \frac{1}{\sqrt{n}} & \cdots & \frac{1}{\sqrt{n}} \\ \frac{1}{\sqrt{2}} & -\frac{1}{\sqrt{2}} & 0 & \cdots & 0 \\ \frac{1}{\sqrt{6}} & \frac{1}{\sqrt{6}} & -\frac{2}{\sqrt{6}} & \cdots & 0 \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ \frac{1}{\sqrt{n(n-1)}} & \frac{1}{\sqrt{n(n-1)}} & \frac{1}{\sqrt{n(n-1)}} & \cdots & -\frac{n-1}{\sqrt{n(n-1)}} \end{pmatrix}.$$ As a result, we obtain:
$Z_i \sim N(0, \sigma^2)$, for $i = 2, \ldots, n$.
$E(S^2) = \sigma^2$, which means that $S^2$ is an unbiased estimator of $\sigma^2$.
As you may have noticed, sums of squares play a crucial role. Applying an orthogonal transformation to IID normal random variables enabled us to partition the sum of squares of those IID normal random variables into a square specific to the sample mean and a sum of squares specific to the sample variance. This breakdown is the essence of an analysis of variance.
3.2.2.3 The distribution of $\dfrac{(n-1)S^2}{\sigma^2}$
Theorem 7.3.1 in LM page 384 states that the sum of squares of $n$ independent standard normal random variables has a gamma distribution with $r = n/2$ and $\lambda = 1/2$. We will demonstrate the same result by using a moment generating function approach.
First, I show that if $Z \sim N(0, 1)$, then $E\left(e^{tZ^2}\right) = (1 - 2t)^{-1/2}$ for $t < 1/2$. To see this, start with the definition of a moment generating function and do some integration: $$E\left(e^{tZ^2}\right) = \int_{-\infty}^{\infty} e^{tz^2}\frac{1}{\sqrt{2\pi}}e^{-z^2/2}\,dz = \frac{1}{\sqrt{1-2t}}\int_{-\infty}^{\infty} \frac{\sqrt{1-2t}}{\sqrt{2\pi}}e^{-\frac{(1-2t)z^2}{2}}\,dz = (1-2t)^{-1/2},$$ where the last step follows because the integrand is the density of a $N\left(0, (1-2t)^{-1}\right)$ random variable.
Next, we have to find the moment generating function of $U = Z_1^2 + Z_2^2 + \cdots + Z_n^2$, where $Z_1, \ldots, Z_n$ are independent standard normal random variables and $n$ is fixed. By Theorem 3.12.3b in LM page 213, $$M_U(t) = \prod_{i=1}^n M_{Z_i^2}(t) = (1 - 2t)^{-n/2}.$$ By Theorem 4.6.5 in LM page 270, along with Theorem 3.12.2 in LM page 213, this is the moment generating function of a gamma pdf with $r = n/2$ and $\lambda = 1/2$.
Because of the central role played by sums of squares, a gamma distribution with $r = n/2$ and $\lambda = 1/2$ is also called a chi-squared distribution with $n$ degrees of freedom (denoted as $\chi^2_n$).
The problem with applying the preceding result directly to $(n-1)S^2 = \sum_{i=1}^n \left(Y_i - \overline{Y}\right)^2$ is that although it has the form of a sum of squares, it is not a sum of squares of independent standard normal random variables yet.
One approach is to use the consequence of applying orthogonal transformations to IID normal random variables. In particular, we have $$\sum_{i=1}^n \left(Y_i - \overline{Y}\right)^2 = \sum_{i=2}^n Z_i^2, \qquad Z_i \sim N(0, \sigma^2) \text{ independently}.$$ Therefore, we have to rescale $\sum_{i=2}^n Z_i^2$ to $\sum_{i=2}^n \left(Z_i/\sigma\right)^2$, and the latter is now a sum of squares of $n - 1$ independent standard normal random variables. So, we have an exact finite-sample distributional result: $$\frac{(n-1)S^2}{\sigma^2} \sim \chi^2_{n-1}.$$
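A minimal simulation check of this exact result, reusing the running example:

n <- 5
mu <- 1
sigma.sq <- 4
s2 <- apply(replicate(10^4, rnorm(n, mu, sqrt(sigma.sq))), 2, var)
hist((n - 1) * s2 / sigma.sq, freq = FALSE, breaks = 40, main = "Scaled sample variances")
curve(dchisq(x, df = n - 1), add = TRUE, col = "red")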
Exercise 16
(Exercise 7.3.2) Let $U \sim \chi^2_n$. Use the moment generating function of a chi-squared distributed random variable to show that $E(U) = n$ and $\mathrm{Var}(U) = 2n$.
Use the distributional result $\dfrac{(n-1)S^2}{\sigma^2} \sim \chi^2_{n-1}$ to show that
$S^2$ is an unbiased estimator of $\sigma^2$.
(Exercise 7.3.4) $\mathrm{Var}\left(S^2\right) = \dfrac{2\sigma^4}{n-1}$.
(Exercise 7.3.5) $S^2$ is a consistent estimator of $\sigma^2$.
3.2.2.4 Putting all of these results together
So far, we know that $$\frac{\overline{Y} - \mu}{\sigma/\sqrt{n}} \sim N(0, 1), \qquad \frac{(n-1)S^2}{\sigma^2} \sim \chi^2_{n-1},$$ and that $\overline{Y}$ and $S^2$ are independent. The last ingredient to complete the derivation is to determine the distribution of $$\frac{\overline{Y} - \mu}{S/\sqrt{n}} = \frac{\dfrac{\overline{Y} - \mu}{\sigma/\sqrt{n}}}{\sqrt{\dfrac{(n-1)S^2}{\sigma^2}\Big/(n-1)}}.$$ Theorem 7.3.5 in LM page 387 shows that this random variable has a Student $t$ distribution with $n - 1$ degrees of freedom (denoted by $t_{n-1}$).
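A minimal simulation check of this exact result; the heavier tails of the $t_{n-1}$ density relative to the dashed standard normal are what the earlier "Plug-in" histograms were hinting at.

n <- 5
mu <- 1
sigma.sq <- 4
ymat <- replicate(10^4, rnorm(n, mu, sqrt(sigma.sq)))
tstat <- (colMeans(ymat) - mu) / (apply(ymat, 2, sd) / sqrt(n))
hist(tstat, freq = FALSE, breaks = 60, main = "Studentized sample means")
curve(dt(x, df = n - 1), add = TRUE, col = "red")
curve(dnorm(x, 0, 1), add = TRUE, col = "blue", lty = 2)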
Alternatively, we can put the results together using an analysis of variance. The total sum of squares $\sum_{i=1}^n Y_i^2$ could be split up into two parts, $n\overline{Y}^2$ and $\sum_{i=1}^n \left(Y_i - \overline{Y}\right)^2$. These two parts are statistically independent of each other, provided that $Y_1, \ldots, Y_n$ are IID normal random variables. Recall our analysis of variance:
| Source | Sum of squares (SS) | Degrees of freedom (df) | Mean sum of squares (MS) | Expected MS |
|---|---|---|---|---|
| Mean | $n\overline{Y}^2$ | $1$ | $n\overline{Y}^2$ | $\sigma^2 + n\mu^2$ |
| Deviations | $\sum_{i=1}^n \left(Y_i - \overline{Y}\right)^2$ | $n - 1$ | $S^2$ | $\sigma^2$ |
| Total | $\sum_{i=1}^n Y_i^2$ | $n$ | | |
Consider the ratio $$\frac{n\left(\overline{Y} - \mu\right)^2}{S^2} = \frac{n\left(\overline{Y} - \mu\right)^2/\sigma^2}{S^2/\sigma^2}.$$ Theorem 7.3.3 in LM page 385 states that a ratio of independent mean squares (each a sum of squares of independent standard normals divided by its degrees of freedom) has an $F$ distribution. In this particular situation, the resulting random variable has an $F$ distribution with 1 numerator degree of freedom and $n - 1$ denominator degrees of freedom (denoted as $F_{1, n-1}$).
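A minimal simulation check of this $F$ result; note that the simulated ratio is just the square of the $t$ statistic from the previous sketch.

n <- 5
mu <- 1
sigma.sq <- 4
ymat <- replicate(10^4, rnorm(n, mu, sqrt(sigma.sq)))
fstat <- n * (colMeans(ymat) - mu)^2 / apply(ymat, 2, var)
hist(fstat, freq = FALSE, breaks = 60, main = "Simulated F ratios")
curve(df(x, df1 = 1, df2 = n - 1), add = TRUE, col = "red")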
3.3 Slight detour: asymptotic distribution of $\overline{Y}$
I now provide an explanation of the second approach to obtaining the distribution of $\dfrac{\overline{Y} - \mu}{S/\sqrt{n}}$. We already saw that under the normality of the IID random variables $Y_1, \ldots, Y_n$, $\dfrac{\overline{Y} - \mu}{\sigma/\sqrt{n}} \sim N(0, 1)$. What happens if we remove the normality assumption? In Exercise 13, along with its preceding Monte Carlo simulation, we saw that the histogram of simulated standardized sample means is closely approximated by the red curve as $n$ becomes larger. Although normal random variables were simulated in that exercise, we can also show that we can remove the normality assumption.
n <- 80
nsim <- 10^4   # number of realizations to be obtained
# repeatedly obtain realizations
ymat <- replicate(nsim, runif(n, min = 0, max = 1))
ybars <- colMeans(ymat)
sds <- apply(ymat, 2, sd)
mu <- 1/2         # the expected value of a uniformly distributed random variable
sigma.sq <- 1/12  # the variance of a uniformly distributed random variable
par(mfrow = c(1, 2))
hist((ybars - mu) / (sqrt(sigma.sq) / sqrt(n)), freq = FALSE, main = "Known sigma.sq")
curve(dnorm(x, 0, 1), add = TRUE, col = "red")
hist((ybars - mu) / (sds / sqrt(n)), freq = FALSE, main = "Plug-in")
curve(dnorm(x, 0, 1), add = TRUE, col = "red")
I am going to introduce another convergence concept to allow us to capture the ability of the density function of the $N(0, 1)$ (shown as the red curve) to “approximate” the histogram of standardized sample means.
Definition 6 Let $W_1, W_2, \ldots$ be a sequence of random variables and let $W$ be another random variable. Let $F_{W_n}$ be the cdf of $W_n$ and $F_W$ be the cdf of $W$. $W_n$ converges in distribution to $W$, denoted as $W_n \overset{d}{\to} W$, if $$\lim_{n \to \infty} F_{W_n}(w) = F_W(w)$$ at all $w$ for which $F_W$ is continuous.
This convergence concept is useful for approximating probabilities involving random variables whose distribution is unknown to us. Of course, you should be able to compute probabilities involving the limiting random variable $W$; otherwise, the concept becomes less useful. The key idea is that when $n$ is large enough, the hard-to-compute $P(W_n \leq w)$ can be made as close as we please to the easier-to-compute $P(W \leq w)$.
This convergence concept is most prominent in the central limit theorem found in Theorem 4.3.2 in LM page 244. What it says is that if we have an IID sequence of random variables $Y_1, Y_2, \ldots$ with finite variance, then $$\frac{\overline{Y} - \mu}{\sigma/\sqrt{n}} \overset{d}{\to} Z,$$ where $Z \sim N(0, 1)$. Another way to write the result is $$\lim_{n \to \infty} P\left(\frac{\overline{Y} - \mu}{\sigma/\sqrt{n}} \leq z\right) = \int_{-\infty}^{z} \frac{1}{\sqrt{2\pi}} e^{-t^2/2}\,dt.$$
The distribution of each of the random variables $Y_i$ is unknown to us apart from each of them having finite variance. Therefore, it is very unlikely that you can actually compute probabilities involving the sampling distribution of $\overline{Y}$. Even if we did know the distribution of each of the random variables $Y_i$, it might still be extremely difficult to obtain an expression for the distribution of $\overline{Y}$. If $Y_1, \ldots, Y_n$ are IID normal random variables, then we have an exact and known sampling distribution of $\overline{Y}$ and no urgent need for a central limit theorem. But this is more of an exception.
The next thing to hope for is that if we plug in a good value for $\sigma$, then we also obtain another central limit theorem that says: $$\frac{\overline{Y} - \mu}{S/\sqrt{n}} \overset{d}{\to} N(0, 1).$$ We will plug in a consistent estimator of $\sigma$, which happens to be $S$.
3.4 Extending univariate normal distributions to higher dimensions
3.4.1 The Standard Bivariate Normal Distribution
We are now going to extend the standard normal distribution to two dimensions. We have to think about the more interesting case where two standard normal random variables are not independent of each other. We will be introducing a parameter $\rho$ to reflect this dependence.
Let $X \sim N(0, 1)$ and $Y = \rho X + \sqrt{1 - \rho^2}\,Z$, where $Z \sim N(0, 1)$, $-1 < \rho < 1$, and $Z$ is independent of $X$. As a result, we have $E(Y) = 0$, $\mathrm{Var}(Y) = \rho^2 + (1 - \rho^2) = 1$, and $\mathrm{Cov}(X, Y) = \rho$.
Definition 7 Let $X$ and $Y$ be random variables with joint pdf $$f_{X,Y}(x, y) = \frac{1}{2\pi\sqrt{1 - \rho^2}}\exp\left(-\frac{x^2 - 2\rho x y + y^2}{2(1 - \rho^2)}\right)$$ for all $(x, y)$. $X$ and $Y$ are said to have a standard bivariate normal distribution with parameter $\rho$.
Consider the special cases where $\rho = 0$ and where $\rho$ is close to $\pm 1$. These cases produce noticeably different shapes for the joint pdf when plotted in 3D. Here $\rho = 0$ implies that $X$ and $Y$ are independent. The bivariate normal distribution with $\rho = 0$ is a case where uncorrelatedness implies independence.
In finance, some practitioners “paste” two normal marginals together with a copula. This allows one to produce distributions that have normal marginals but the joint distribution is not necessarily bivariate normal.
Next, we will visualize the joint pdf of the standard bivariate normal distribution for different values of $\rho$.
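A hedged sketch of such a visualization; the particular values $\rho = 0$ and $\rho = 0.8$ are my own illustrative choices.

# Joint pdf of the standard bivariate normal with parameter rho
sbvn.pdf <- function(x, y, rho) {
  1 / (2 * pi * sqrt(1 - rho^2)) *
    exp(-(x^2 - 2 * rho * x * y + y^2) / (2 * (1 - rho^2)))
}
x <- y <- seq(-3, 3, length.out = 50)
par(mfrow = c(1, 2))
contour(x, y, outer(x, y, sbvn.pdf, rho = 0), main = "rho = 0")
contour(x, y, outer(x, y, sbvn.pdf, rho = 0.8), main = "rho = 0.8")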
3.4.2 Bivariate normal distributions
We can now extend to bivariate normal distributions with general means and variances. The idea is to use an approach similar to the one we used to extend the standard normal distribution to $N(\mu, \sigma^2)$: apply linear transformations.
Let $X = \mu_X + \sigma_X X_0$ and $Y = \mu_Y + \sigma_Y Y_0$, where $(X_0, Y_0)$ have a standard bivariate normal distribution with parameter $\rho$. As a result, we have $E(X) = \mu_X$, $\mathrm{Var}(X) = \sigma_X^2$, $E(Y) = \mu_Y$, $\mathrm{Var}(Y) = \sigma_Y^2$, and $\mathrm{Cov}(X, Y) = \rho\,\sigma_X\sigma_Y$.
Definition 8 Let $X$ and $Y$ be random variables with joint pdf $$f_{X,Y}(x, y) = \frac{1}{2\pi\sigma_X\sigma_Y\sqrt{1 - \rho^2}}\exp\left(-\frac{1}{2(1 - \rho^2)}\left[\left(\frac{x - \mu_X}{\sigma_X}\right)^2 - 2\rho\left(\frac{x - \mu_X}{\sigma_X}\right)\left(\frac{y - \mu_Y}{\sigma_Y}\right) + \left(\frac{y - \mu_Y}{\sigma_Y}\right)^2\right]\right)$$ for all $(x, y)$, where $-1 < \rho < 1$, $\sigma_X > 0$, and $\sigma_Y > 0$. $X$ and $Y$ are said to have a bivariate normal distribution with five parameters, i.e., $\left(\mu_X, \mu_Y, \sigma_X^2, \sigma_Y^2, \rho\right)$.
This definition matches Definition 11.5.1 in LM page 572.
You can show that the marginal pdf of $X$ is $N(\mu_X, \sigma_X^2)$ and the marginal pdf of $Y$ is $N(\mu_Y, \sigma_Y^2)$.
3.4.3 Multivariate normal distributions
Just like how we extended the standard normal to two dimensions, it is possible to extend to the multivariate case by noticing one aspect of the joint pdf in the bivariate case.
Observe that $$\begin{pmatrix} x & y \end{pmatrix}\begin{pmatrix} 1 & \rho \\ \rho & 1 \end{pmatrix}^{-1}\begin{pmatrix} x \\ y \end{pmatrix} = \frac{x^2 - 2\rho x y + y^2}{1 - \rho^2}$$ and that $$\det\begin{pmatrix} 1 & \rho \\ \rho & 1 \end{pmatrix} = 1 - \rho^2 \quad\text{and}\quad \begin{pmatrix} 1 & \rho \\ \rho & 1 \end{pmatrix}^{-1} = \frac{1}{1 - \rho^2}\begin{pmatrix} 1 & -\rho \\ -\rho & 1 \end{pmatrix}.$$ Thus, the joint pdf for the standard bivariate normal distribution can be rewritten as $$f_{X,Y}(x, y) = \frac{1}{2\pi\sqrt{\det\Sigma}}\exp\left(-\frac{1}{2}\begin{pmatrix} x & y \end{pmatrix}\Sigma^{-1}\begin{pmatrix} x \\ y \end{pmatrix}\right), \qquad \Sigma = \begin{pmatrix} 1 & \rho \\ \rho & 1 \end{pmatrix}.$$
Definition 9 The random vector $\mathbf{Y} = (Y_1, \ldots, Y_d)^T$ has a multivariate normal distribution if it has a joint pdf of the form $$f_{\mathbf{Y}}(\mathbf{y}) = \frac{1}{(2\pi)^{d/2}\sqrt{\det\boldsymbol{\Sigma}}}\exp\left(-\frac{1}{2}\left(\mathbf{y} - \boldsymbol{\mu}\right)^T\boldsymbol{\Sigma}^{-1}\left(\mathbf{y} - \boldsymbol{\mu}\right)\right).$$
The mean vector is given by $\boldsymbol{\mu} = \left(E(Y_1), \ldots, E(Y_d)\right)^T$ and the covariance matrix is given by $\boldsymbol{\Sigma}$, whose $(i, j)$th entry is $\mathrm{Cov}(Y_i, Y_j)$.
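A minimal sketch of simulating from a multivariate normal distribution and checking the two ingredients above; the use of the MASS package and the particular mean vector and covariance matrix are my own illustrative choices.

library(MASS)
mu.vec <- c(1, 2)
Sigma <- matrix(c(4, 1.5, 1.5, 1), 2, 2)   # an illustrative covariance matrix
w <- mvrnorm(10^4, mu = mu.vec, Sigma = Sigma)
colMeans(w)   # close to mu.vec
cov(w)        # close to Sigma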