Calibrating decision rules for testing claims

Motivation

During the first four weeks of the course, I introduced almost everything related to the sample mean of IID normal random variables \(N\left(\mu,\sigma^2\right)\). In particular, I showed how to use the techniques you already know from probability theory to evaluate claims about the population mean \(\mu\).

We looked into LM Examples 6.2.1 and 6.2.2 and introduced the concept of a \(p\)-value, which is the only new statistical idea (so far) behind testing claims about \(\mu\). We calculated it directly from the knowledge of the sampling distribution of the sample mean and also used a simulation-based approach.

There were a few things I intentionally left out because they are details that may distract from the key idea behind testing claims: use a statistic that can help you evaluate a claim about a population parameter, whose distribution under the null is pivotal and known, and then calculate a measure of “support” or “discomfort” (however you choose to look at it) called a \(p\)-value. So what did I leave out?

  1. How do we know whether a calculated \(p\)-value is small enough? Or large enough?
  2. How do we set things up when the claim is NOT about the normal population mean?
  3. Given a context, how do we decide what claim is to be tested?

You have already asked some of these questions. Now we are going to be more specific about the details. It was more important for you to understand first what hypothesis testing is all about, without being bogged down by the other details. I hope that, at the very least, you understood that we have NEVER PROVED that a claim is either true or false.

The setup of a hypothesis testing problem

Whenever we evaluate a claim about a parameter \(\theta\), we actually partition the parameter space \(\Theta\) (the set of possible values the parameter can take) into two disjoint sets \(\Theta_0\) and \(\Theta_1\). It may not be obvious, but the parameter \(\theta\) is always part of a statistical model. The task is to use the data \(Y_1,\ldots,Y_n\), coming from some joint distribution \(f\left(y_1,\ldots,y_n;\theta\right)\), to figure out whether they support \(\theta\in\Theta_0\) or \(\theta\in\Theta_1\). We call \(\theta\in\Theta_0\) the null hypothesis \(H_0\) and \(\theta\in\Theta_1\) the alternative hypothesis \(H_1\).

The way hypothesis testing is practiced is that the null is typically the status quo or the “chance” explanation for what we observe, while the alternative is the “not due to chance” explanation for what we observe. How do we actually use the data to determine whether there is support for the null or the alternative? Because we have to decide on the basis of the data, we would have to find a test statistic \(T\left(Y_1,\ldots,Y_n\right)\) whose behavior we know a lot about under the null.

How to set up a problem, how to choose the appropriate standard of evidence, and whether testing is actually needed are all crucial to the correct application of hypothesis testing.

LM Examples 6.2.1 and 6.2.2

If you revisit the example we discussed before, we could set the context up as follows:

  1. The parameter \(\theta\) is \(\mu\), which is the true average math SAT score that the new curriculum is expected to produce.
  2. \(\Theta_0 =\{494\}\), \(\Theta_1=(494,\infty)\), \(f\left(y_1,\ldots,y_n;\theta\right)\) is IID \(N(\mu,\sigma^2)\), where \(\sigma^2\) is known.
  3. A test statistic \(T\left(Y_1,\ldots,Y_n\right)\) could be \(\dfrac{\overline{Y}-494}{124/\sqrt{n}}\).

We could also have set the context up in a different way:

  1. The parameter \(\theta\) is \(\mu\), which is the true average math SAT score that the new curriculum is expected to produce.
  2. \(\Theta_0 =\{494\}\), \(\Theta_1=(-\infty,494) \cup (494,\infty)\), \(f\left(y_1,\ldots,y_n;\theta\right)\) is IID \(N(\mu,\sigma^2)\), where \(\sigma^2\) is known.
  3. A test statistic \(T\left(Y_1,\ldots,Y_n\right)\) could be \(\left|\dfrac{\overline{Y}-494}{124/\sqrt{n}}\right|\).

Which should you choose? We can only answer based on the context, as statistical theory alone is not enough.

LM mostly concentrates on the situation where \(\Theta_0\) is a singleton, meaning there is only one point in \(\Theta_0\). This is usually called a simple hypothesis or a point null. The alternative hypothesis takes on a range of values and is usually called a composite hypothesis.

It is definitely possible to have a null that is composite. For example, had we removed the assumption that \(\sigma\) is known to be 124 in Examples 6.2.1 and 6.2.2, the setup would have to change to something like:

  1. The parameter \(\theta\) is actually \((\mu,\sigma^2)\), where \(\mu\) is the true average math SAT score that the new curriculum is expected to produce and \(\sigma^2\) is the true variance of the math SAT scores it is expected to produce. The former is of “interest”, while the latter is a “nuisance”.
  2. \(\Theta_0 =\{\left(494,\sigma^2\right): \sigma^2>0\}\), \(\Theta_1=\{\left(\mu,\sigma^2\right): \mu >494, \ \sigma^2 >0\}\), \(f\left(y_1,\ldots,y_n;\theta\right)\) is IID \(N(\mu,\sigma^2)\), where \(\sigma^2\) is unknown.
  3. A statistic \(T\left(Y_1,\ldots,Y_n\right)\) could be \(\dfrac{\overline{Y}-494}{S/\sqrt{n}}\), where \(S\) is the sample standard deviation.

What is missing? We now have to construct an appropriate threshold for deciding whether the data support the null or the alternative.

Decision rules

When I introduced you to the \(p\)-value for the first time in Examples 6.2.1 and 6.2.2, I used the setup:

  1. The parameter \(\theta\) is \(\mu\), which is the true average math SAT score that the new curriculum is expected to produce.
  2. \(\Theta_0 =\{494\}\), \(\Theta_1=(494,\infty)\), \(f\left(y_1,\ldots,y_n;\theta\right)\) is IID \(N(\mu,\sigma^2)\), where \(\sigma^2\) is known.
  3. A test statistic \(T\left(Y_1,\ldots,Y_n\right)\) could be \(\dfrac{\overline{Y}-494}{124/\sqrt{n}}\).

Therefore, the null is given by \(\mu=494\) and the alternative by \(\mu > 494\). Intuitively, we would reject the null if the observed value of \(\overline{Y}\) is “large” enough. The context and the form of the alternative hypothesis should help you decide in which direction to look in order to reject the null. The question that remains is how “large”.

Recall our calculation of the \(p\)-value for the example: \[\begin{eqnarray}\mathbb{P}\left(\overline{Y} \geq 502|\mu=494\right) &= & \mathbb{P}\left(\frac{\overline{Y}-494}{124/\sqrt{86}} \geq \frac{502-494}{124/\sqrt{86}} \bigg| \mu=494\right) \\ &=&\mathbb{P}\left(\frac{\overline{Y}-494}{124/\sqrt{86}} \geq 0.6 \bigg| \mu=494\right)=0.2743.\end{eqnarray}\]
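The calculation above can be reproduced in R, both exactly via `pnorm()` and approximately by simulating sample means under the null (the numbers 86, 494, 124, and 502 come from the example; the unrounded \(z\)-value gives a \(p\)-value slightly different from the 0.2743 obtained after rounding to 0.6):

```r
# Exact p-value: P(Ybar >= 502 | mu = 494) with sigma = 124 known and n = 86
n <- 86; mu0 <- 494; sigma <- 124; ybar_obs <- 502
z_obs <- (ybar_obs - mu0) / (sigma / sqrt(n))
p_exact <- pnorm(z_obs, lower.tail = FALSE)

# Simulation-based approximation: draw many sample means under the null
set.seed(1)
ybar_sim <- rnorm(1e5, mean = mu0, sd = sigma / sqrt(n))
p_sim <- mean(ybar_sim >= ybar_obs)
c(p_exact, p_sim)
```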

Controlling the probability of a Type I error

The calculation you have just seen is much more general than you think. For example, you can form a decision rule of the form \(\overline{Y}\geq c\) such that \(\mathbb{P}\left(\overline{Y} \geq c|\mu=494\right)\) is equal to some \(\alpha\in (0,1)\), i.e.,

\[\mathbb{P}\left(\overline{Y} \geq c|\mu=494\right) = \mathbb{P}\left(\frac{\overline{Y}-494}{124/\sqrt{86}} \geq \frac{c-494}{124/\sqrt{86}} \bigg| \mu=494\right) =\alpha.\]

It is critical that you know the distribution of \(\dfrac{\overline{Y}-494}{124/\sqrt{86}}\) under the null. Otherwise, you cannot even do the calculations! In this situation, it has a standard normal distribution. Therefore, we have a pivotal quantity, or a pivot, under the null. Below you will find the threshold \(c\) for several values of \(\alpha\). For example, we have \(\mathbb{P}\left(\overline{Y} \geq 516|\mu=494\right)\approx 0.05\), and there are others reported below.

# Target alpha
alpha <- c(0.9, 0.6, 0.3, 0.1, 0.05, 0.01)
# Upper quantiles of the standard normal corresponding to alphas
qnorm(alpha, lower.tail = FALSE)
[1] -1.2815516 -0.2533471  0.5244005  1.2815516  1.6448536  2.3263479
# Threshold values
124/sqrt(86)*qnorm(alpha, lower.tail = FALSE)+494
[1] 476.8640 490.6124 501.0119 511.1360 515.9938 525.1062

Notice that we can reject the null at different thresholds and every threshold has a probability attached. What is the meaning of that probability? Well, it is the probability of rejecting the null when the null is true. Ideally, we would want to keep this as small as possible, but we cannot make it equal to zero. The error we make in using a rule \(\overline{Y}\geq c\) when the null is true is called a Type I error and the probability of it happening is called the significance level \(\alpha\). The \(p\)-value is sometimes called the observed significance level.

Controlling the probability of a Type II error

Even if we adopt a rule \(\overline{Y}\geq c\) so that \(\alpha\) is set to a low value, we can still make another type of error: we could fail to reject the null when the null is false. Why? We simply do not know whether the null is true or false. Therefore, we have to account for the possibility of committing what is called a Type II error.

What makes the control of Type II error much more difficult is that the null being false opens up a range of alternatives. In particular, we now have to think about probabilities of the form \(\mathbb{P}\left(\overline{Y} \leq c|\mu > 494 \right)\). These are not necessarily hard to compute, but controlling all these probabilities at once may be more difficult.

Suppose we tried to control Type I error so that \(\alpha=0.05\). Then we reject the null whenever \(\overline{Y}\geq 516\). What would be the probability of a Type II error for this rule?

# Probability of a Type II error, P(Ybar < 516 | mu), as a function of mu
curve(pnorm((516 - mu)/(124/sqrt(86))), from = 494, to = 800, xname = "mu",
      xlab = expression(mu), ylab = "P(Type II error)")

Observe that the probability of a Type II error cannot be controlled at a low value simultaneously across different values for \(\mu\).
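To make the curve concrete, here is the Type II error probability of the rule \(\overline{Y}\geq 516\) evaluated at a few specific alternatives (the values 500, 520, and 550 are chosen purely for illustration):

```r
# P(Type II error) = P(Ybar < 516 | mu) for the rule "reject when Ybar >= 516"
n <- 86; sigma <- 124; threshold <- 516
type2 <- function(mu) pnorm((threshold - mu) / (sigma / sqrt(n)))
round(type2(c(500, 520, 550)), 4)
```

The probability is large for alternatives close to the null (around 0.88 at \(\mu=500\)) and only becomes small far from it, which is exactly what the curve shows.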

We can adjust the rule from \(\overline{Y} \geq 516\) to \(\overline{Y} \geq 525\). The probability of Type I error is now roughly 0.01, but what about the probability of a Type II error?

# Type II error curves for the rules Ybar >= 516 (black) and Ybar >= 525 (blue)
curve(pnorm((516 - mu)/(124/sqrt(86))), from = 494, to = 800, xname = "mu",
      xlab = expression(mu), ylab = "P(Type II error)")
curve(pnorm((525 - mu)/(124/sqrt(86))), from = 494, to = 800, xname = "mu",
      col = "#0072B2", add = TRUE)

Observe that because the rule became “stricter”, in the sense of making it harder to reject the null whether it is true or false, the probability of a Type II error increased uniformly across all alternatives. This example shows that it may be difficult to reduce the probabilities of Type I and Type II errors simultaneously. It can also be difficult (perhaps even more so) to first construct a rule that controls the Type II error, especially if the range of alternatives is very large, and only afterwards control the Type I error.

As a result, the practice has become to fix the probability of a Type I error and then make sure that the test has a low probability of a Type II error.

Exercise 1 This modified exercise is based on Abramovich and Ritov (2013).

The advertisement of the fast-food chain “FastBurger” claims that the average waiting time for food in its branches is 30 seconds, as opposed to the 50 seconds of its competitors. Mr. Skeptic does not put much faith in advertising and decides to test the claim as follows: he will go to one of “FastBurger’s” branches and measure the waiting time; if it is less than 40 seconds, he will believe the advertisement. Otherwise, he will conclude that the service in “FastBurger” is no faster than in other fast-food companies. Mr. Skeptic also assumes that the waiting time is exponentially distributed.

  1. What are the hypotheses Mr. Skeptic tests? Calculate the probabilities of errors of both types for his test.
  2. Mrs. Skeptic, Mr. Skeptic’s wife, agrees in general with the test her husband has used but decided instead to fix a threshold for the waiting time which will minimize the sum of the probabilities of both error types. What will be her decision rule?
  3. Calculate the probabilities of the errors for Mrs. Skeptic’s test and compare them with those of her husband.
  4. Mr. Student, having learned mathematical statistics, decides to choose a threshold so that the probability of a Type I error is 0.05, following convention. What would have been Mr. Student’s decision rule? What would be the probability of a Type II error? What do you notice?

What you should notice

The setup of the problem, the decision rule of the form \(\overline{Y}\geq c\), and the choice of the probability of a Type I error \(\alpha\) did not involve seeing the data. Only the context of Examples 6.2.1 and 6.2.2 matters. In particular, the observed value \(\overline{y}=502\) did not influence any of these calculations.

The only way this approach to testing claims will work is when we DO NOT peek at the data first. If you did and made adjustments to any step of the approach, then you would have violated the spirit of the approach and all its theoretical guarantees no longer hold. Of course, you can PRETEND that you did not see the data while making adjustments and still calculate what needs to be calculated. But these calculations become essentially meaningless, unless you explicitly account for the modifications to the approach.

Other cases, possibly more general

LM Examples 6.2.1 and 6.2.2 focus on the IID \(N\left(\mu,\sigma^2\right)\) case where \(\sigma^2\) is known. LM treats other cases of testing claims about a population mean, such as:

  1. IID \(\mathsf{Ber}\left(p\right)\) (Section 6.3, which has its own complications because of the discrete nature of the data)
  2. IID \(N\left(\mu,\sigma^2\right)\) when \(\sigma^2\) is not known (Section 7.4)

We can also test claims about other parameters beyond the population mean. LM treats the following cases:

  1. IID random variables outside of the normal and binomial (Section 6.4 on decision rules for non-normal data)
  2. Testing claims about \(\sigma^2\) for the case of IID \(N\left(\mu,\sigma^2\right)\) (Section 7.5)
  3. Testing claims about the difference of two population means \(\mu_X-\mu_Y\) (Sections 9.2 and 9.3)
  4. Testing claims about the difference of two population variances \(\sigma^2_X-\sigma^2_Y\) (Section 9.3)
  5. Testing claims about the form of the parametric statistical model, for example, testing whether the data are compatible with IID \(N\left(\mu,\sigma^2\right)\) (Sections 10.3 and 10.4)
  6. Testing claims about the independence of random variables (Section 10.5)
  7. Testing claims about differences across multiple population means (Section 12.2)

All of these cases share the same framework. But there are technical details which may be unique to each of them. Furthermore, the null hypothesis may not necessarily be just a singleton.

Does likelihood play a role?

Since the likelihood function is formed from the joint distribution of the data, you may already suspect, at some intuitive level, that the likelihood can be used for testing claims. Suppose we want to test the claim that \(\theta=\theta_0\), where \(\theta_0\) is not necessarily the true value but a claimed value.

When we studied maximum likelihood estimation, we found an asymptotic normality result of the form \[\frac{\widehat{\theta}-\theta_0}{\widehat{\mathsf{SE}}} \overset{d}{\to} N\left(0,1\right)\] under the null \(\theta=\theta_0\). This means that, as \(n\to\infty\), we have a pivotal quantity with which we can test claims about \(\theta\), provided we can form the likelihood function. What you have just seen is usually called a Wald test. It extends to multi-dimensional parameters with suitable changes to the asymptotic distribution (but it would still be pivotal, at least in large samples).
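As a concrete sketch (my own illustrative example, not from LM), here is a Wald test for the rate of an exponential distribution; the data, sample size, and null value below are all invented:

```r
# Illustrative Wald test: H0: lambda = 2 for IID exponential data with rate lambda.
# MLE: lambda_hat = 1/mean(y); Fisher information: n/lambda^2,
# so the estimated standard error is lambda_hat/sqrt(n).
set.seed(42)
n <- 200
y <- rexp(n, rate = 2)               # simulated data; the true rate is 2
lambda0 <- 2                         # claimed value under the null
lambda_hat <- 1 / mean(y)
se_hat <- lambda_hat / sqrt(n)
wald <- (lambda_hat - lambda0) / se_hat
p_value <- 2 * pnorm(abs(wald), lower.tail = FALSE)   # two-sided p-value
```

Across repeated samples drawn under the null, rejecting when this \(p\)-value falls below 0.05 should happen about 5% of the time, in line with the discussion of Type I error above.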

We actually have other objects related to the likelihood. For example, the score function has mean zero and variance equal to the Fisher information. If a standardization of this score function can be shown to be pivotal with a known distribution, then it can also be used as a basis for constructing tests to evaluate claims about \(\theta\). Unsurprisingly, what I described forms the basis for a score test. It is sometimes called a Lagrange multiplier (LM) test.
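As a concrete sketch (again my own illustrative example, not from LM), consider the score test for IID \(\mathsf{Ber}(p)\) data: the score at \(p_0\) is \(\left(\sum Y_i - np_0\right)\big/\left(p_0(1-p_0)\right)\) and the Fisher information is \(n\big/\left(p_0(1-p_0)\right)\), so standardizing the score recovers the familiar one-sample proportion statistic \(\left(\widehat{p}-p_0\right)\big/\sqrt{p_0(1-p_0)/n}\):

```r
# Score test for H0: p = p0 with IID Bernoulli data (illustrative)
score_test <- function(y, p0) {
  n <- length(y)
  score <- (sum(y) - n * p0) / (p0 * (1 - p0))  # derivative of the log-likelihood at p0
  info <- n / (p0 * (1 - p0))                   # Fisher information at p0
  z <- score / sqrt(info)                       # approximately N(0, 1) under the null
  2 * pnorm(abs(z), lower.tail = FALSE)         # two-sided p-value
}
set.seed(123)
y <- rbinom(50, size = 1, prob = 0.5)   # simulated data
score_test(y, p0 = 0.5)
```

Unlike the Wald test, the standard error here is evaluated at the null value \(p_0\) rather than at the MLE.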

Yet another object we can look at is the likelihood function itself. Recall that the approximately quadratic shape of the log-likelihood is key to establishing the properties of the MLE. Consider a second-order Taylor series expansion of the log-likelihood about the MLE \(\widehat{\theta}\), evaluated at the point \(\theta_0\), i.e., \[\begin{eqnarray}\ell\left(\theta_0\right) & \approx & \ell\left(\widehat{\theta}\right)+\ell^\prime\left(\widehat{\theta}\right)\left(\theta_0-\widehat{\theta}\right)+\frac{1}{2}\ell^{\prime\prime}\left(\widehat{\theta}\right)\left(\theta_0-\widehat{\theta}\right)^2 \\ \Rightarrow \ell\left(\widehat{\theta}\right) -\ell\left(\theta_0\right)&\approx & \ell^\prime\left(\widehat{\theta}\right)\left(\widehat{\theta}-\theta_0\right)+\frac{1}{2}\left[-\ell^{\prime\prime}\left(\widehat{\theta}\right)\right]\left(\widehat{\theta}-\theta_0\right)^2 \end{eqnarray}\]

If you can establish that, under the null, you have

  1. \(\ell^\prime\left(\widehat{\theta}\right)=0\), which is simply the first-order condition defining the MLE
  2. \(\widehat{\theta}\) behaves like \(N(\theta_0, \left[I\left(\theta_0\right)\right]^{-1})\)
  3. \(-\ell^{\prime\prime}\left(\widehat{\theta}\right)\) behaves like a constant \(I\left(\theta_0\right)\)

then you will roughly have

\[2\left[\ell\left(\widehat{\theta}\right) -\ell\left(\theta_0\right)\right] \approx \left[I\left(\theta_0\right)^{1/2}\left(\widehat{\theta}-\theta_0\right)\right]^2 \sim \left(N\left(0,1\right)\right)^2.\] Later, in Chapter 7, we will learn more about the square of a standard normal.

Generalized likelihood ratio test

The difference of log-likelihoods can be used to construct a pivotal quantity under the null. Algebraically, it may be rewritten as a monotonic function of a likelihood ratio, i.e., \[2\left[\ell\left(\widehat{\theta}\right) -\ell\left(\theta_0\right)\right]=-2\log\left[\frac{L\left(\theta_0\right)}{L\left(\widehat{\theta}\right)}\right].\] Note that the likelihood ratio satisfies \[0\leq \dfrac{L\left(\theta_0\right)}{L\left(\widehat{\theta}\right)} \leq 1,\] so the statistic on the left-hand side is always nonnegative.

When this likelihood ratio is close to 1, then the data would signal support for the null. When this likelihood ratio is close to 0, then the data would signal support for the alternative. The problem now is to find an appropriate threshold to make a decision.
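A quick simulated check (an illustrative example of mine, not from LM): for IID exponential data under the null, the statistic \(-2\log\left[L\left(\theta_0\right)/L\left(\widehat{\theta}\right)\right]\) should behave approximately like the square of a standard normal, so rejecting when it exceeds the 0.95 quantile of that distribution should give a Type I error probability of about 0.05:

```r
# -2 log of the likelihood ratio for IID exponential data under the null.
# The exponential log-likelihood is n*log(rate) - rate*sum(y), maximized at 1/mean(y).
set.seed(1)
n <- 100; rate0 <- 2
lr_stat <- replicate(5000, {
  y <- rexp(n, rate = rate0)
  loglik <- function(r) n * log(r) - r * sum(y)
  -2 * (loglik(rate0) - loglik(1 / mean(y)))
})
# qchisq(0.95, df = 1) is the 0.95 quantile of a squared standard normal
mean(lr_stat > qchisq(0.95, df = 1))   # should be close to 0.05
```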

Actually, this likelihood ratio may be thought of more generally in terms of a generalized likelihood ratio: \[\lambda = \frac{\displaystyle\sup_{\theta\in \Theta_0} L\left(\theta\right)}{\displaystyle\sup_{\theta\in \Theta} L\left(\theta\right)},\] where \(\Theta=\Theta_0\cup\Theta_1\) is the full parameter space. If it happens that \(\Theta_0=\{\theta_0\}\) is a singleton, as in LM, then the numerator is just \(L\left(\theta_0\right)\); it is the value of the likelihood function for the restricted model. The denominator is the likelihood function evaluated at the (unrestricted) MLE \(\widehat{\theta}\); it is the value of the likelihood function for the unrestricted model.

LM Section 6.5 introduces the idea of a generalized likelihood ratio test (GLRT) and starts from this definition of a generalized likelihood ratio. LM does not elaborate on why the likelihood ratio is a good starting point, except to state that, “… some of the particular hypothesis tests that statisticians most often use in dealing with real-world problems. All of these have the same conceptual heritage … the generalized likelihood ratio is a working criterion for actually suggesting test procedures.” Earlier, I provided a different angle to motivate why you might be interested in a likelihood ratio. But if you follow the discussion here and in LM, you will realize that the key difficulty is finding the appropriate critical value so that the probability of a Type I error is controlled at \(\alpha\).

Take note that the generalized likelihood ratio here is quite different from the ratio of likelihoods found in the discussion of sufficiency. For the former, the numerator and denominator use the same data. For the latter, you have two different datasets.

I am going to illustrate the calculation of the generalized likelihood ratio test using the setting of LM Examples 6.2.1 and 6.2.2. Recall that we are testing \(\mu=\mu_0\), where \(\mu_0\) is a fixed, pre-specified value. Let \(Y=\left(Y_1,\ldots,Y_n\right)\). The likelihood function for \(\mu\) in the case where \(Y_1,\ldots,Y_n\) are IID \(N\left(\mu,\sigma^2\right)\) where \(\sigma^2\) is known is given by

\[L\left(\mu; Y\right)=\left(\frac{1}{\sqrt{2\pi\sigma^2}}\right)^n\exp\left(-\frac{(n-1)S^2}{2\sigma^2}\right)\exp\left(-\frac{1}{2\sigma^2}n\left(\overline{Y}-\mu\right)^2\right)\]

The MLE for \(\mu\) is \(\widehat{\mu}=\overline{Y}\). Therefore, the generalized likelihood ratio, as defined in LM Definition 6.5.1, is \[\begin{eqnarray}\frac{L\left(\mu_0; Y\right)}{L\left(\widehat{\mu}; Y\right)} &=& \frac{\displaystyle\left(\frac{1}{\sqrt{2\pi\sigma^2}}\right)^n\exp\left(-\frac{(n-1)S^2}{2\sigma^2}\right)\exp\left(-\frac{1}{2\sigma^2}n\left(\overline{Y}-\mu_0\right)^2\right)}{\displaystyle\left(\frac{1}{\sqrt{2\pi\sigma^2}}\right)^n\exp\left(-\frac{(n-1)S^2}{2\sigma^2}\right)\exp\left(-\frac{1}{2\sigma^2}n\left(\overline{Y}-\widehat{\mu}\right)^2\right)} \\ &=& \exp\left(-\frac{1}{2\sigma^2}n\left(\overline{Y}-\mu_0\right)^2\right) \\ &=& \exp\left(-\frac{1}{2}\left(\frac{\overline{Y}-\mu_0}{\sigma/\sqrt{n}}\right)^2\right)\end{eqnarray}\]

We now construct a decision rule based on the generalized likelihood ratio. We reject the null when this ratio is “small” enough, i.e., under the null (which we suppress in the notation, but is there in the background) we have \[\begin{eqnarray}\mathbb{P}\left(\frac{L\left(\mu_0; Y\right)}{L\left(\widehat{\mu}; Y\right)} \leq c\right)=\alpha &\Leftrightarrow & \mathbb{P}\left(\exp\left(-\frac{1}{2}\left(\frac{\overline{Y}-\mu_0}{\sigma/\sqrt{n}}\right)^2\right) \leq c\right)=\alpha \\ &\Leftrightarrow & \mathbb{P}\left(\left|\frac{\overline{Y}-\mu_0}{\sigma/\sqrt{n}}\right| \geq \sqrt{-2\log c}\right)=\alpha \\ &\Leftrightarrow & \mathbb{P}\left(\frac{\overline{Y}-\mu_0}{\sigma/\sqrt{n}}\geq \sqrt{-2\log c}\right)+\mathbb{P}\left(\frac{\overline{Y}-\mu_0}{\sigma/\sqrt{n}}\leq -\sqrt{-2\log c}\right)=\alpha \\ &\Leftrightarrow & 1-\Phi\left(\sqrt{-2\log c}\right)+\Phi\left(-\sqrt{-2\log c}\right)=\alpha \\ &\Leftrightarrow & 2\Phi\left(-\sqrt{-2\log c}\right)=\alpha\end{eqnarray}\] By choosing \(\displaystyle c=\exp\left(-\frac{1}{2}\left(-\Phi^{-1}\left(\frac{\alpha}{2}\right)\right)^2\right)\), we now have a threshold to decide whether the data signal support for the null or the alternative and control the probability of a Type I error to a pre-specified level \(\alpha\). Therefore, we can rationalize the test developed in Theorem 6.2.1(c) as a GLRT.

Observe that this threshold is a function of \(\Phi^{-1}\left(\alpha/2\right)\). This is called \(-z_{\alpha/2}\) in LM Theorem 6.2.1.
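As a numerical check of this threshold, using the numbers from LM Examples 6.2.1 and 6.2.2 (\(n=86\), \(\sigma=124\), \(\overline{y}=502\), \(\mu_0=494\)):

```r
alpha <- 0.05
c_thresh <- exp(-0.5 * qnorm(alpha / 2)^2)   # threshold for the likelihood ratio
sqrt(-2 * log(c_thresh))                     # recovers z_{alpha/2}, about 1.96

# Observed generalized likelihood ratio in the SAT example
n <- 86; sigma <- 124; ybar <- 502; mu0 <- 494
lambda_obs <- exp(-0.5 * ((ybar - mu0) / (sigma / sqrt(n)))^2)
lambda_obs > c_thresh   # TRUE: fail to reject the (two-sided) null at alpha = 0.05
```

Failing to reject here is consistent with the earlier \(p\)-value of about 0.27, which is well above 0.05.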

As you have seen, the GLRT can be used as a basis for constructing tests. More importantly, sometimes the generalized likelihood ratio depends on a statistic whose distribution is known under the null. In the previous case, the statistic \(\displaystyle\frac{\overline{Y}-\mu_0}{\sigma/\sqrt{n}}\) has a standard normal distribution under the null.

We can also construct tests outside cases like LM Examples 6.2.1 and 6.2.2. In fact, LM illustrates how to construct a GLRT for the uniform case, a case where the “regularity” conditions do not apply! Here we consider a very special case of LM Exercise 6.5.2:

Exercise 2 This exercise is a continuation of Exercise 1. Can you suggest a better test to Mr. Skeptic with the same significance level as he has?