Goodness-of-fit testing

Motivation

During the first week of the course, you have encountered case studies where a parametric statistical model is used to “fit” the data. For example, you encountered Case Study 4.2.2 involving the Poisson distribution and Case Study 4.2.4 involving the exponential distribution. In both cases, a parametric statistical model was specified and then “checked” against the data. The specification of a parametric statistical model has to come from somewhere: usually the problem or research question, the scientific context, and subject-matter theories or conjectures play a big role.

The “checking” is really a form of diagnostic testing or specification testing. LM Chapter 10 focuses on a particular form of diagnostic testing which relies on comparing observed counts against “expected” counts. There are other approaches but these are beyond the scope of the textbook.

LM Sections 10.3 and 10.4 focus on the case where the data are from one categorical variable. LM Section 10.5 focuses on the case where the data are from two categorical variables. The notation for Chapter 10 is slightly different from usual, so pay attention to the indices and what they mean.

Where does the goodness-of-fit test statistic come from?

LM Theorem 10.3.1 proposes a test statistic which can be used to test whether a proposed statistical model is compatible with the observed data. The idea is to compare what we observe against what we expect to see if the proposed statistical model is appropriate. This test statistic has the following general form: \[\sum_{\mathsf{all}\ \mathsf{categories}} \frac{\left(\mathsf{observed}\ \mathsf{counts}-\mathsf{expected}\ \mathsf{counts}\right)^2}{\mathsf{expected}\ \mathsf{counts}}\]
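To see the formula in action, here is a minimal sketch in R with made-up counts and null probabilities (nothing here comes from the textbook):

# Minimal sketch with made-up counts (not from the textbook)
observed <- c(18, 55, 27)              # observed counts in each category
null.prob <- c(0.25, 0.50, 0.25)       # hypothesized category probabilities
expected <- sum(observed) * null.prob  # expected counts under the proposed model
sum((observed - expected)^2 / expected)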

Let \(r_1\) and \(r_2\) be the possible outcomes (or, better, mutually exclusive categories) associated with each of \(n\) independent trials. Let \(p_i=\mathbb{P}\left(r_i\right)\) be the probability associated with each outcome. Let \(X_i\) be the number of times \(r_i\) occurs. Let \(p_{10}\) and \(p_{20}\) be pre-specified values determined by the proposed statistical model. The task is to test \(H_0:\ p_1=p_{10}, p_2=p_{20}\).

An example which matches the described situation is to test whether a coin is fair. Here we have:

  1. \(r_1\) and \(r_2\) represent heads and tails, respectively
  2. \(n\) represents the total number of tosses
  3. \(p_1\) and \(p_2\) represent the probability of obtaining heads and tails, respectively
  4. \(X_1\sim\mathsf{Bin}\left(n,p_1\right)\) and \(X_2\sim\mathsf{Bin}\left(n,p_2\right)\)
  5. A fair coin is the situation where \(p_{10}=1/2\) and \(p_{20}=1/2\)
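To make the coin example concrete, here is a minimal sketch with simulated tosses (the sample size and seed are arbitrary choices) that computes the statistic by hand and checks it against chisq.test():

# Sketch: testing fairness of a coin with simulated data (not from the textbook)
set.seed(1)
tosses <- rbinom(200, size = 1, prob = 0.5)      # 1 = heads, 0 = tails
X <- c(sum(tosses), sum(1 - tosses))             # observed counts (X1, X2)
p0 <- c(0.5, 0.5)                                # null probabilities (p10, p20)
expected <- 200 * p0
sum((X - expected)^2 / expected)                 # the statistic computed by hand
chisq.test(X, p = p0, correct = FALSE)$statistic # same value from chisq.test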

Let us return to the task, which covers the previous example, and form the GLRT. The parameter space under the null is \(\Theta_0=\{\left(p_{10},p_{20}\right): p_{10}+p_{20}=1\}\). The parameter space under the alternative is \(\Theta_1=\{\left(p_{1},p_{2}\right): p_{1}+p_{2}=1\}\). Let \(\widehat{p}_1=X_1/n\) and \(\widehat{p}_2=X_2/n\) be the MLE of \(p_1\) and \(p_2\) under the alternative. The form of the generalized likelihood ratio is given by \[\begin{eqnarray*}\dfrac{\displaystyle\sup_{\left(p_1,p_2\right)\in\Theta_0} L\left(p_1,p_2\right)}{\displaystyle\sup_{\left(p_1,p_2\right)\in\Theta_1} L\left(p_1,p_2\right)} &=& \dfrac{\displaystyle\sup_{\left(p_1,p_2\right)\in\Theta_0}\dfrac{n!}{X_1!X_2!} p_{10}^{X_1} p_{20}^{X_2}}{\displaystyle\sup_{\left(p_1,p_2\right)\in\Theta_1}\dfrac{n!}{X_1!X_2!} p_{1}^{X_1} p_{2}^{X_2}} \\ &=& \dfrac{p_{10}^{X_1} p_{20}^{X_2}}{\widehat{p}_{1}^{X_1} \widehat{p}_{2}^{X_2}}\end{eqnarray*}\]

Thus, we reject the null whenever \[\Lambda=\left(\frac{p_{10}}{\widehat{p}_{1}}\right)^{X_1}\left(\frac{p_{20}}{\widehat{p}_{2}}\right)^{X_2} \leq \lambda^*.\] Taking logarithms and doing some algebra, we have \[\log\Lambda =X_1\log \left(1+\frac{p_{10}-\widehat{p}_1}{\widehat{p}_1}\right) + X_2\log \left(1+\frac{p_{20}-\widehat{p}_2}{\widehat{p}_2}\right).\]

Using a second-order Taylor expansion, we can approximate \(\log\left(1+x\right) \approx x-x^2/2\). This approximation works for \(x\) small. In our case, this will happen when \(n\) is large. Therefore, we can roughly write \[\log\Lambda \approx X_1\left[\frac{p_{10}-\widehat{p}_1}{\widehat{p}_1}-\frac{1}{2}\left(\frac{p_{10}-\widehat{p}_1}{\widehat{p}_1}\right)^2\right]+X_2\left[\frac{p_{20}-\widehat{p}_2}{\widehat{p}_2}-\frac{1}{2}\left(\frac{p_{20}-\widehat{p}_2}{\widehat{p}_2}\right)^2\right]\]

Observe that \[n\left(p_{10}-\widehat{p}_1\right)+n\left(p_{20}-\widehat{p}_2\right)=n\left(p_{10}+p_{20}\right)-\left(X_1+X_2\right)=0.\] As a result, we have \[\begin{eqnarray*}-2\log\Lambda &\approx & n\widehat{p}_1\left(\frac{p_{10}-\widehat{p}_1}{\widehat{p}_1}\right)^2+n\widehat{p}_2\left(\frac{p_{20}-\widehat{p}_2}{\widehat{p}_2}\right)^2 \\ &=& \frac{\left(X_1-np_{10}\right)^2}{n\widehat{p}_1} + \frac{\left(X_2-np_{20}\right)^2}{n\widehat{p}_2}\end{eqnarray*}\]

There is a connection between \(D\) and tests involving a population proportion, at least for the special case described here. Noting that \(p_{10}+p_{20}=1\) and \(\widehat{p}_1+\widehat{p}_2=1\), we have \[\begin{eqnarray*}-2\log\Lambda &\approx & \frac{n\left(p_{10}-\widehat{p}_1\right)^2}{\widehat{p}_1\left(1-\widehat{p}_1\right)} \\ &=&\frac{n\left(p_{10}-\widehat{p}_1\right)^2}{p_{10}\left(1-p_{10}\right)}\times \frac{p_{10}\left(1-p_{10}\right)}{\widehat{p}_1\left(1-\widehat{p}_1\right)} \\ &=& \underbrace{\left(\frac{\widehat{p}_1-p_{10}}{\sqrt{\dfrac{p_{10}\left(1-p_{10}\right)}{n}}}\right)^2}_{\overset{d}{\to} \left(N(0,1)\right)^2}\underbrace{\frac{p_{10}\left(1-p_{10}\right)}{\widehat{p}_1\left(1-\widehat{p}_1\right)}}_{\overset{p}{\to} 1} \end{eqnarray*}\] We have now established that in large samples, \(-2\log\Lambda\) has an approximate \(\chi^2_1\) distribution. Notice that although there are two parameters \(p_1\) and \(p_2\), effectively, once we know \(p_{1}\), we automatically know \(p_{2}\). As a result, the degrees of freedom is reduced from 2 to 1.

The expression for \(-2\log\Lambda\) does not exactly match \(D\) in LM Theorem 10.3.1. But the difference between these two quantities disappears when \(n\to\infty\). From the sketch of the proof of LM Theorem 10.3.1, \(D\) could be written as \[D=\frac{\left(X_1-np_{10}\right)^2}{np_{10}\left(1-p_{10}\right)}=\left(\frac{\widehat{p}_1-p_{10}}{\sqrt{\dfrac{p_{10}\left(1-p_{10}\right)}{n}}}\right)^2.\] You can show that the difference between \(-2\log\Lambda\) and \(D\) converges in probability to zero as \(n\to\infty\).
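You can also see the closeness of \(-2\log\Lambda\) and \(D\) numerically. The sketch below uses simulated data with arbitrary choices of \(n\) and \(p_{10}\):

# Sketch: -2 log Lambda versus D on simulated data (arbitrary n and p10)
set.seed(2)
n <- 500; p10 <- 0.5; p20 <- 1 - p10
X1 <- rbinom(1, n, p10); X2 <- n - X1
p1.hat <- X1 / n; p2.hat <- X2 / n
minus2logLambda <- -2 * (X1 * log(p10 / p1.hat) + X2 * log(p20 / p2.hat))
D <- (X1 - n * p10)^2 / (n * p10) + (X2 - n * p20)^2 / (n * p20)
c(minus2logLambda, D)   # the difference shrinks as n grows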

What happens if there are multiple categories?

Previously, we looked into testing the null \(H_0:\ p_1=p_{10}, p_2=p_{20}\). In that context, there were two categories and each of the counts of the occurrences \(X_1\) and \(X_2\) is binomial. In addition, \(X_1+X_2=n\).

Now, we are extending the test to \(t\) categories, so the null becomes \(H_0:\ p_1=p_{10}, p_2=p_{20}, \ldots, p_t=p_{t0}\) and each of the counts of the occurrences \(X_1,\ldots,X_t\) is binomial. In addition, \(X_1+\cdots+X_t=n\).

The vector of random variables \(\left(X_1,\ldots,X_t\right)\) has a joint distribution called the multinomial distribution. This distribution directly extends the binomial distribution, which is a multinomial with two categories. Refer to LM Section 10.2 for more details. Whenever you encounter a new distribution, everything related to manipulating distributions, calculating probabilities and moments, and so on is on the table, just as with the new distributions connected to the normal.
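R has the multinomial distribution built in through dmultinom() and rmultinom(). A minimal sketch with made-up probabilities:

# Sketch: multinomial probabilities and random draws (made-up probabilities)
p <- c(0.2, 0.3, 0.5)                       # category probabilities, summing to 1
dmultinom(c(2, 3, 5), size = 10, prob = p)  # P(X1 = 2, X2 = 3, X3 = 5) when n = 10
rmultinom(1, size = 10, prob = p)           # one simulated vector of counts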

How do we test the null \(H_0:\ p_1=p_{10}, p_2=p_{20}, \ldots, p_t=p_{t0}\)? We proceed in pretty much the same way. The test statistic \(D\) you encountered in the case of two categories will be extended to \(t\) categories and will have a \(\chi^2_{t-1}\) distribution under the null.

Where would the null probabilities come from?

In the discrete case, the proposed statistical model will be in terms of a probability mass function. In the continuous case, the proposed statistical model will be in terms of a probability density function. Things are simpler in both cases if there are no other unknown parameters in the proposed statistical model (see LM Section 10.3). If there are unknown parameters, then you have to estimate these parameters first and the null distribution of the goodness-of-fit statistic will change (see LM Section 10.4).

There is a slight problem with the continuous case, even if the parameters of the proposed statistical model are known. The probability density function does not directly have information about probabilities. You need to do some integration over some interval. Therefore, there is some arbitrary element in the setup of the null hypothesis if you are checking the goodness-of-fit in the continuous case.

The practice is to divide the support of the continuous random variable into intervals (or categories or cells) so that the \(\chi^2\) approximation is a good one. This means that you have to choose the intervals so that none of them is “empty” or has a very small expected count. A rule of thumb used in the textbook is to ensure that \(np_{i0} > 5\) for all \(i=1,\ldots, t\). Sometimes this may be hard to achieve in the tails of the distribution, so having a few intervals with \(np_{i0} < 5\) is acceptable provided that the number of intervals is not too small. In short, you do not want to pool too much (so few intervals that you throw away information), nor do you want to pool too little (so many intervals that the expected counts become small).

Another approach is to construct the intervals in such a way that \(p_{10}=\cdots=p_{t0}\). This kind of partitioning is sometimes called equiprobable partitioning. But this partitioning still requires you to specify the number of intervals. In addition, there are more computations required to guarantee equiprobable partitioning.
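As an illustration of equiprobable partitioning, here is a sketch that assumes (purely for the sake of the example) that the proposed model is an exponential distribution with rate 1 and that we want five equiprobable intervals; the endpoints are quantiles of the proposed distribution:

# Sketch: equiprobable partitioning for a hypothesized Exp(1) model, t = 5 intervals
t <- 5
breaks <- qexp(seq(0, 1, length.out = t + 1), rate = 1)  # endpoints: 0, ..., Inf
null.prob <- rep(1 / t, t)                               # each interval has probability 1/5
set.seed(3)
x <- rexp(100, rate = 1)                                 # simulated data to be binned
observed <- table(cut(x, breaks = breaks))               # observed counts per interval
chisq.test(observed, p = null.prob, correct = FALSE)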

What happens if the parametric model depends on unknown parameters?

An extra layer of computation is necessary if the parametric statistical model depends on other unknown parameters, on top of those computations mentioned earlier. Consider Case Study 4.2.2. The Poisson distribution was the proposed statistical model, but the Poisson probabilities depend on an unknown parameter \(\lambda\). In particular, we have 12 Poisson probabilities \[p_{10}\left(\lambda\right), p_{20}\left(\lambda\right), \ldots, p_{110}\left(\lambda\right), p_{120}\left(\lambda\right)\] and all these probabilities sum up to 1.

Let \(r_1\) be the outcome where no particles are detected, \(r_2\) be the outcome where 1 particle is detected, and so on, with \(r_{12}\) collecting the cases where 11 or more particles are detected. Let \(X_i\) be the number of times the outcome \(r_i\) occurs. Pay attention to the definition of \(X_i\), as this is very different from our usual setup where \(X_i\) represents the \(i\)th observation. The test statistic will now have the form \[D=\sum_{i=1}^{12} \frac{\left(X_i-np_{i0}\left(\lambda\right)\right)^2}{np_{i0}\left(\lambda\right)}\] So we need to find the value of \(\lambda\) which minimizes \(D\).

We can write \[\begin{eqnarray*}D &=& \sum_{i=1}^{12} \frac{X_i^2}{np_{i0}\left(\lambda\right)} -2\sum_{i=1}^{12} X_i +\sum_{i=1}^{12} np_{i0}\left(\lambda\right) \\ &=& \sum_{i=1}^{12} \frac{X_i^2}{np_{i0}\left(\lambda\right)} -2\times n +n \\ &=& \sum_{i=1}^{12} \frac{X_i^2}{np_{i0}\left(\lambda\right)} -n \end{eqnarray*}\] The derivation uses the facts that \(X_1+X_2+\cdots+X_{12}=n\) and \(p_{10}+\cdots+p_{120}=1\). The derivation also extends to a finite number of categories and also applies to the case where the parameters are known.
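A quick numerical check of this identity, using made-up counts and null probabilities:

# Sketch: check that sum((X - n*p0)^2/(n*p0)) equals sum(X^2/(n*p0)) - n (made-up counts)
X <- c(12, 30, 45, 13)            # made-up counts
p0 <- c(0.1, 0.3, 0.4, 0.2)       # made-up null probabilities summing to 1
n <- sum(X)
c(sum((X - n * p0)^2 / (n * p0)), sum(X^2 / (n * p0)) - n)  # the two agree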

The first derivative for \(\lambda\) could be written as \[\begin{eqnarray*}\frac{\partial D}{\partial \lambda} &=& -\sum_{i=1}^{12} \frac{X_i^2}{n\left(p_{i0}\left(\lambda\right)\right)^2} \frac{\partial p_{i0}\left(\lambda\right)}{\partial \lambda} \\ &=& -\sum_{i=1}^{12} \frac{X_i^2}{np_{i0}\left(\lambda\right)} \frac{\partial \log p_{i0}\left(\lambda\right)}{\partial \lambda} \\ &=& -\sum_{i=1}^{12} \frac{X_i}{n}\frac{1}{p_{i0}\left(\lambda\right)} X_i\frac{\partial \log p_{i0}\left(\lambda\right)}{\partial \lambda} \end{eqnarray*}\]

If the null that the Poisson distribution is a good fit is true, then \[ \frac{X_i}{n}\frac{1}{p_{i0}\left(\lambda\right)} \approx 1\] Therefore, the first derivative for \(\lambda\) could be approximated by \[\frac{\partial D}{\partial \lambda}\approx -\sum_{i=1}^{12} X_i\frac{\partial \log p_{i0}\left(\lambda\right)}{\partial \lambda}\] Interestingly, the expression on the right-hand side should remind you of the score of the log-likelihood for the parameter of a binomial distribution if there are two categories. What you see here is the score of the log-likelihood for the parameters of a multinomial distribution if there are 12 categories. The idea can be generalized to \(t\) categories.

What the procedure above tells you is that the unknown parameters (in our goodness-of-fit testing context) have to be estimated by maximum likelihood using the “grouped” data rather than the “ungrouped” data. LM do not use this approach, and I think in practice most would use easily computed consistent estimators of the unknown parameters; in particular, they would simply estimate the parameters of the pre-specified distribution using ordinary maximum likelihood.
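For completeness, here is a sketch of the grouped-data maximum likelihood approach for the Poisson example; it maximizes \(\sum_i X_i \log p_{i0}\left(\lambda\right)\) numerically with optimize() over an arbitrarily chosen bracketing interval, using the counts from Case Study 4.2.2 that reappear below:

# Sketch: grouped-data MLE for lambda in the Poisson example (counts from Case Study 4.2.2)
observed <- c(57, 203, 383, 525, 532, 408, 273, 139, 45, 27, 10, 6)
grouped.loglik <- function(lambda) {
  p0 <- c(dpois(0:10, lambda), 1 - ppois(10, lambda))  # the 12 category probabilities
  sum(observed * log(p0))                              # multinomial log-likelihood (up to a constant)
}
optimize(grouped.loglik, interval = c(1, 10), maximum = TRUE)$maximum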

Estimating the one-dimensional parameter \(\lambda\) and plugging this into \(D\) will change the asymptotic distribution. We will lose 1 degree of freedom because we estimated one parameter. In general, you will lose as many degrees of freedom as there are parameters estimated.

Applications

Benford’s law and too good to be true

LM have two case studies which feature tools for detecting potential fabrication of data. Pay attention to Case Study 10.3.2, where the first digits of a broad class of datasets have been found not to be uniformly distributed. This finding has been applied to detect tax fraud in international trade. A nontechnical summary can be found here. A technical but readable paper on detecting fraudulent trade transactions can be found here. Check the citations of the previous paper for applications in other areas like accounting.

# Reproduce calculations in Case Study 10.3.2
observed <- c(111, 60, 46, 29, 26, 22, 21, 20, 20)
# Pay attention that the log in the book is log base 10!
null.prob <- log10(1:9 + 1) - log10(1:9)
# Chi-squared test with null probabilities known
test.case <- chisq.test(observed, p = null.prob, correct = FALSE)
test.case$expected
[1] 106.86565  62.51240  44.35325  34.40305  28.10934  23.76611  20.58714
[8]  18.15915  16.24391
test.case$statistic
X-squared 
 2.523567 
qchisq(0.95, 8)
[1] 15.50731
test.case

    Chi-squared test for given probabilities

data:  observed
X-squared = 2.5236, df = 8, p-value = 0.9606

The other case study is Case Study 10.3.3, where the data provided by a study seem too good to be true in the sense that the data “fit” the theory a bit too well. Typically, when we do a goodness-of-fit test, we reject the null for large values of the test statistic; that is, we only look at the right tail. But if there is a suspicion that the data “fit” the null a bit too well, then the test statistic might be too small. Therefore, it makes sense to look at the left tail as well.
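In R, the left-tail probability is just pchisq() evaluated at the observed statistic. A minimal sketch with hypothetical values of the statistic and the degrees of freedom:

# Sketch: left-tail p-value for a "too good to be true" check (hypothetical values)
D <- 1.1        # hypothetical observed statistic
df <- 9         # hypothetical degrees of freedom
pchisq(D, df)   # P(chi-squared_df <= D); a tiny value suggests an implausibly good fit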

Revisiting Case Study 4.2.2

Recall that the unknown \(\lambda\) was estimated using maximum likelihood on the grouped data (but it was already grouped for you in the case study).

observed <- c(57, 203, 383, 525, 532, 408, 273, 139, 45, 27, 10, 6)
lambda.mle <- as.numeric(observed %*% (0:11)/sum(observed))
null.prob <- c(dpois(0:10, lambda.mle), 1-ppois(10, lambda.mle))
test.case <- chisq.test(observed, p = null.prob, correct = FALSE)
test.case$expected
 [1]  54.418655 210.580164 407.433861 525.539688 508.411286 393.472906
 [7] 253.765885 140.282938  67.855416  29.175054  11.289672   5.774475
test.case$statistic
X-squared 
 12.97809 
qchisq(0.95, 10)
[1] 18.30704
# p-value here is incorrect
test.case

    Chi-squared test for given probabilities

data:  observed
X-squared = 12.978, df = 11, p-value = 0.2948
# correct p-value 
1-pchisq(test.case$statistic, 10)
X-squared 
0.2248992 

Testing independence

We are going to consider two categorical variables instead of just one. Suppose one of the categorical variables has two mutually exclusive outcomes \(A_1\) and \(A_2\), and the other categorical variable also has two mutually exclusive outcomes \(B_1\) and \(B_2\). Our task is to design a test of the null that these two categorical variables are independent. Under the null of independence, we should have \[\begin{eqnarray*} \mathbb{P}\left(A_1\cap B_1\right) &=& \mathbb{P}\left(A_1\right)\mathbb{P}\left(B_1\right) \\ \mathbb{P}\left(A_1\cap B_2\right) &=& \mathbb{P}\left(A_1\right)\mathbb{P}\left(B_2\right) \\ \mathbb{P}\left(A_2\cap B_1\right) &=& \mathbb{P}\left(A_2\right)\mathbb{P}\left(B_1\right) \\ \mathbb{P}\left(A_2\cap B_2\right) &=& \mathbb{P}\left(A_2\right)\mathbb{P}\left(B_2\right) \end{eqnarray*}\] Of course, once any three of these equalities hold, the remaining equality also holds; in fact, given the marginal probabilities, any one of them implies the other three.

To apply the idea of goodness-of-fit testing for one categorical variable, it may be instructive to think of \(A_1\cap B_1\), \(A_1\cap B_2\), \(A_2\cap B_1\), \(A_2\cap B_2\) as four categories of one categorical variable. Let \(X_{ij}\) be the number of times \(A_i \cap B_j\) occurs. Observe that

  1. \(X_{11}+X_{12}+X_{21}+X_{22}=n\), where \(n\) is the total number of observations or the sample size
  2. Each of the categories \(A_i \cap B_j\) has a probability of occurring. Let \(p_{ij}=\mathbb{P}\left(A_i\cap B_j\right)\). We should also have \(p_{11}+p_{12}+p_{21}+p_{22}=1\).
  3. Let \(p_i=\mathbb{P}\left(A_i\right)\) and \(q_j=\mathbb{P}\left(B_j\right)\). We should have \(p_1+p_2=1\) and \(q_1+q_2=1\). Under the null of independence, \(p_{ij}=p_i\cdot q_j\).

Typically the information above is summarized into two tables – a table we get to observe and a table we expect to see under the null. The first table is sometimes called a contingency table or a two-way table. To be more specific, the table we get to observe is

               \(B_1\)      \(B_2\)      Row totals
\(A_1\)        \(X_{11}\)   \(X_{12}\)   \(R_1\)
\(A_2\)        \(X_{21}\)   \(X_{22}\)   \(R_2\)
Column totals  \(C_1\)      \(C_2\)      \(n\)

The row totals and column totals are random, but the total number of observations \(n\) is not. Although the row and column totals are random, when we think about what the table should roughly look like under the null, we “fix” the values of \(R_1, R_2, C_1, C_2\).

The table we expect to see under the null is as follows:

               \(B_1\)        \(B_2\)        Row totals
\(A_1\)        \(np_1q_1\)    \(np_1q_2\)    \(R_1\)
\(A_2\)        \(np_2q_1\)    \(np_2q_2\)    \(R_2\)
Column totals  \(C_1\)        \(C_2\)        \(n\)

If we knew the values of \(p_1, p_2, q_1, q_2\), then we would be in the case of a goodness-of-fit test involving four categories with known category probabilities. The goodness-of-fit statistic would then have a \(\chi^2_3\) distribution under the null. The 3 degrees of freedom come from the four categories minus one, because one of the categories is redundant (the four probabilities sum to 1). But knowing the probabilities \(p_1, p_2, q_1, q_2\) is definitely unrealistic.

If we do not know the values of \(p_1, p_2, q_1, q_2\), then we have to estimate them from the data. In this case, we are in the setting of a goodness-of-fit test involving four categories whose probabilities are unknown but estimated. The resulting goodness-of-fit statistic will have a \(\chi^2_1\) distribution under the null. The 1 degree of freedom comes from four categories, minus one for the redundant category, minus two for the estimated parameters (one of \(p_1\) and \(p_2\), together with one of \(q_1\) and \(q_2\)).

The preceding setup extends to the case where there are \(r\) row outcomes \(A_1,A_2,\ldots, A_r\) and \(c\) column outcomes \(B_1,B_2,\ldots,B_c\). The degrees of freedom are equal to \(rc-1-(r-1)-(c-1)=(r-1)(c-1)\).
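A sketch of the computation by hand for a 2 by 2 table with made-up counts: the estimated expected counts are \(R_iC_j/n\), and the resulting statistic matches chisq.test() without the continuity correction:

# Sketch: chi-squared test of independence by hand for a 2 x 2 table (made-up counts)
X <- matrix(c(30, 20, 10, 40), nrow = 2)        # observed counts X_ij
n <- sum(X)
expected <- outer(rowSums(X), colSums(X)) / n   # estimated expected counts R_i * C_j / n
sum((X - expected)^2 / expected)                # the statistic computed by hand
chisq.test(X, correct = FALSE)$statistic        # same value from chisq.test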

Connection to testing the equality of two proportions

Sections 9.4 and 9.5 (specifically LM Theorem 9.5.3) are about tests of the equality of two proportions and confidence intervals for the difference of two proportions. The development of these tests and confidence intervals is relatively straightforward, especially if you are searching for an asymptotically pivotal quantity. Furthermore, the calculation of the GLR in Section 9.4 should remind you of the GLR for testing the equality of three normal means in our in-class exercise.

More importantly, there is a mathematical connection between a chi-squared test of independence of two categorical variables (each with two mutually exclusive categories) and a test of equality of two proportions.

Under the null of independence, we should have \[\begin{eqnarray*} \mathbb{P}\left(A_1| B_1\right) &=& \mathbb{P}\left(A_1| B_2\right) \\ \mathbb{P}\left(A_2 | B_1\right) &=& \mathbb{P}\left(A_2|B_2\right) \end{eqnarray*}\] But one of these equalities is redundant. We could also have looked into \[\begin{eqnarray*} \mathbb{P}\left(B_1| A_1\right) &=& \mathbb{P}\left(B_1| A_2\right) \\ \mathbb{P}\left(B_2 | A_1\right) &=& \mathbb{P}\left(B_2|A_2\right) \end{eqnarray*}\] Again, one of these equalities is redundant. Therefore, we can think of testing the null of independence as a test of equality of two proportions, where the proportions are conditional probabilities.

Despite the mathematical similarity, the objectives of the test of equality of two proportions and the test of independence are very different from each other. They only share a similarity with respect to the goodness-of-fit statistic to be computed.
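You can verify the mathematical connection numerically. In the sketch below (made-up counts again), prop.test() without the continuity correction reports the same X-squared value as the chi-squared test of independence on the corresponding two-way table:

# Sketch: equality of two proportions versus independence (made-up counts)
X <- matrix(c(30, 20, 10, 40), nrow = 2)  # rows: A1, A2; columns: B1, B2
chisq.test(X, correct = FALSE)$statistic
prop.test(x = X[, 1], n = rowSums(X), correct = FALSE)$statistic  # same value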

Sex bias in graduate admissions

I provide this famous application, which features a chi-squared test of independence. This application shows that a chi-squared test is just the beginning of an exploration and does not by itself prove sex bias. This application is also a very nice illustration of what some call Simpson’s paradox. The idea behind this paradox is that two variables may show a relationship in one direction in the aggregate, but once you look at subgroups, the relationship may get reversed.

Here I only illustrate the calculation of the chi-squared test they presented. But rejection of the null of independence does not mean that there is sex bias, given what you already know about what hypothesis testing actually does.

# Load admissions dataset
admissions <- read.csv("https://waf.cs.illinois.edu/discovery/berkeley.csv")
# Create a two-way table similar to Table 1 of the paper
table(admissions$Gender, admissions$Admission)
   
    Accepted Rejected
  F     1494     2827
  M     3738     4704
# Chi-squared test of independence in Table 1, matches the paper
chisq.test(table(admissions$Gender, admissions$Admission))

    Pearson's Chi-squared test with Yates' continuity correction

data:  table(admissions$Gender, admissions$Admission)
X-squared = 110.85, df = 1, p-value < 2.2e-16
# Chi-squared test of independence without the correction, matches calculation based on LM
chisq.test(table(admissions$Gender, admissions$Admission), correct = FALSE)

    Pearson's Chi-squared test

data:  table(admissions$Gender, admissions$Admission)
X-squared = 111.25, df = 1, p-value < 2.2e-16
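If you want to see the Simpson’s paradox angle, a sketch along the following lines tabulates admission rates within departments. I am assuming the department column in this CSV is named Major; check names(admissions) first.

# Sketch of the Simpson's paradox angle: admission rates within departments.
# Assumes the department column in this CSV is named "Major"; check names(admissions) first.
prop.table(table(admissions$Gender, admissions$Admission), margin = 1)  # aggregate rates
rates.by.dept <- tapply(admissions$Admission == "Accepted",
                        list(admissions$Major, admissions$Gender), mean)
round(rates.by.dept, 2)   # compare within-department rates with the aggregate rates above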