Properties of MLE

Motivation

LM does not really show that maximum likelihood estimators have desirable properties. Although the “algorithmic” aspect of maximum likelihood is illustrated in the examples and the exercises, there is no discussion of its theoretical properties or of the computational issues.

In this set of notes, I focus on the theoretical properties of maximum likelihood estimators. I list them here for convenience:

  1. The MLE \(\widehat{\theta}\) is consistent for \(\theta_0\), where \(\theta_0\) is the true value of the parameter \(\theta\).
  2. If \(\widehat{\theta}\) is the MLE, then \(g\left(\widehat{\theta}\right)\) is the MLE of \(g\left(\theta\right)\).
  3. The MLE \(\widehat{\theta}\) is asymptotically normal, meaning that \[\frac{\widehat{\theta}-\theta_0}{\widehat{\mathsf{SE}}} \overset{d}{\to} N\left(0,1\right).\]
  4. The MLE \(\widehat{\theta}\) is asymptotically efficient among the class of “regular” estimators. This roughly means that among all estimators of \(\theta\) which obey certain “regularity” conditions, the MLE \(\widehat{\theta}\) is the one with the smallest asymptotic variance.

I will be more specific later about what asymptotic variance means. The ideas of consistency and asymptotic normality, however, have already come up to some extent in our discussion of the sample mean.

I will also restrict attention to the case of a one-dimensional parameter. There are extensions to the multi-dimensional case, but they involve matrix expressions. I have written the notes in such a way that the transition to matrix expressions should feel natural.

Some asymptotic tools

You have already seen that maximum likelihood estimators can be biased in finite samples. We will therefore spend more time using asymptotic theory (or large-sample theory) to justify the use of maximum likelihood estimators.

You have already encountered the idea of squared-error consistency when we discussed the notions of unbiasedness and consistency, and in Homework 02. Although the idea showed up during our discussion of estimators, the concept applies to sequences of random variables in general, and there are different names for the same idea. We now give the following more generic definition:

Definition 1 Let \(X_1,X_2,\ldots\) be a sequence of random variables. The sequence \(\{X_n\}\) converges in mean-square or converges in \(L^2\) or converges in quadratic mean to a random variable \(X\), denoted by \(X_n \overset{ms}{\to} X\), if \[\lim_{n\to\infty}\mathbb{E}\left[\left(X_n-X\right)^2\right] = 0.\]

Typically, the random variable \(X\) we consider is a constant \(c\).

You have also already encountered the idea of convergence in probability when we discussed the notion of consistency of the sample mean. We now give the following more generic definition:

Definition 2 Let \(X_1,X_2,\ldots\) be a sequence of random variables. The sequence \(\{X_n\}\) converges in probability to a random variable \(X\), denoted by \(X_n \overset{p}{\to} X\), if, for any \(\varepsilon>0\), we have \[\lim_{n\to\infty} \mathbb{P}\left(|X_n-X|\geq \varepsilon\right) = 0.\]

Typically, the random variable \(X\) we consider is a constant \(c\).
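To build intuition for Definition 2, the sketch below (my own Monte Carlo illustration, not from LM; the Uniform(0, 1) model and the value of \(\varepsilon\) are arbitrary choices) estimates \(\mathbb{P}\left(|\overline{Y}_n-\mu|\geq\varepsilon\right)\) for growing \(n\), where \(\overline{Y}_n\) is the mean of \(n\) IID Uniform(0, 1) draws, so \(\mu = 0.5\):

```python
import numpy as np

rng = np.random.default_rng(3)
eps = 0.1

# Estimate P(|Ybar_n - mu| >= eps) by simulation for growing n;
# Ybar_n is the mean of n IID Uniform(0, 1) draws (mu = 0.5).
probs = []
for n in [10, 100, 1000]:
    ybars = rng.uniform(size=(5000, n)).mean(axis=1)
    probs.append(np.mean(np.abs(ybars - 0.5) >= eps))

print(probs)  # the estimated probabilities shrink toward 0
```

The shrinking probabilities are exactly what convergence in probability to the constant \(0.5\) requires.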

Finally, you have also already encountered the idea of convergence in distribution when we looked into the Poisson approximation to a Binomial random variable when \(n \to\infty\), \(p\to 0\) such that \(\lambda=np\) remains constant (Refer to LM Theorem 4.2.1 and Homework 01). You have also encountered this idea when you studied the central limit theorem in your first course in probability theory (Refer to LM Theorem 4.3.1 for the normal approximation to the binomial and LM Theorem 4.3.2). All these examples have a common form as seen in the following definition:

Definition 3 Let \(X_1,X_2,\ldots\) be a sequence of random variables and let \(X\) be another random variable. Let \(F_n\) be the cdf of \(X_n\) and \(F\) be the cdf of \(X\). The sequence \(\{X_n\}\) converges in distribution to \(X\), which we denote by \(X_n\overset{d}{\to}X\), if \[\lim_{n\to\infty} F_n\left(t\right)=F\left(t\right)\ \ \mathrm{or} \ \ \lim_{n\to\infty} \mathbb{P}\left(X_n\leq t\right) = \mathbb{P}\left(X\leq t\right) \] at all \(t\) for which \(F\) is continuous.

Typically, the limiting random variable \(X\) we consider is a standard normal random variable.
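As a quick numerical check of the Poisson approximation to the Binomial from LM Theorem 4.2.1 (a sketch I am adding here; the specific values of \(n\) and \(p\) are arbitrary), one can compare the two cdfs directly over a grid of points:

```python
from math import comb, exp, factorial

# Compare Binomial(n, p) and Poisson(lambda = n*p) cdfs for
# large n and small p (the setting of LM Theorem 4.2.1).
n, p = 1000, 0.003
lam = n * p

def binom_cdf(t):
    return sum(comb(n, k) * p**k * (1 - p)**(n - k) for k in range(t + 1))

def pois_cdf(t):
    return sum(exp(-lam) * lam**k / factorial(k) for k in range(t + 1))

# Largest discrepancy between the two cdfs over t = 0, ..., 10.
max_gap = max(abs(binom_cdf(t) - pois_cdf(t)) for t in range(11))
print(round(max_gap, 4))  # a small gap
```

The gap shrinks further as \(n\) grows and \(p\) shrinks with \(np\) held fixed, which is the content of the limit statement.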

I state without proof that these three convergence concepts are related: convergence in mean-square implies convergence in probability, and convergence in probability implies convergence in distribution. The reverse implications do not hold in general. However, if you have convergence in distribution to a constant, then you have also shown convergence in probability to that same constant.

Just like limits you have encountered in calculus, there is some “algebra” you may apply for some of these convergence concepts. I collect most of them below as a theorem:

Theorem 1 Let \(X_1,X_2,\ldots\) be a sequence of random variables. Let \(Y_1,Y_2,\ldots\) be another sequence of random variables. Let \(X\) be a random variable. Let \(a\), \(b\), and \(c\) be constants. Let \(g(\cdot)\) be a continuous function.

  1. If \(X_n\overset{p}{\to} a\), then \(cX_n\overset{p}{\to} ca\).
  2. If \(X_n\overset{p}{\to} a\), then \(X_n + c\overset{p}{\to} a + c\).
  3. If \(X_n\overset{p}{\to} a\) and \(Y_n\overset{p}{\to} b\), then \(X_n + Y_n\overset{p}{\to} a + b\).
  4. If \(X_n\overset{p}{\to} a\) and \(Y_n\overset{p}{\to} b\), then \(X_nY_n\overset{p}{\to} ab\).
  5. If \(Y_n\overset{p}{\to} b\) and \(b\neq 0\), then \(\dfrac{1}{Y_n} \overset{p}{\to} \dfrac{1}{b}\).
  6. If \(X_n\overset{p}{\to} a\), then \(g\left(X_n\right)\overset{p}{\to} g\left(a\right)\).
  7. If \(X_n\overset{d}{\to} X\) and \(Y_n\overset{p}{\to} b\), then \(X_n + Y_n\overset{d}{\to} X + b\).
  8. If \(X_n\overset{d}{\to} X\) and \(Y_n\overset{p}{\to} b\), then \(X_nY_n\overset{d}{\to} bX\).
  9. If \(X_n\overset{d}{\to} X\), \(Y_n\overset{p}{\to} b\), and \(b\neq 0\), then \(\dfrac{X_n}{Y_n} \overset{d}{\to} \dfrac{X}{b}\).
  10. If \(X_n\overset{d}{\to} X\), then \(g\left(X_n\right)\overset{d}{\to} g\left(X\right)\).

From these asymptotic tools, along with the law of large numbers and the central limit theorem, we can already derive the asymptotic distribution of \(\dfrac{\overline{Y}-\mu}{S/\sqrt{n}}\). The steps are as follows. First, prove that \(S^2 \overset{p}{\to} \sigma^2\) in the IID case with common mean \(\mu\) and variance \(\sigma^2\). Next, prove that \(S \overset{p}{\to} \sigma\). Finally, show that \[\frac{\overline{Y}-\mu}{S/\sqrt{n}} = \frac{\sigma}{S}\cdot\frac{\overline{Y}-\mu}{\sigma/\sqrt{n}}\] and show that \(\dfrac{\overline{Y}-\mu}{S/\sqrt{n}} \overset{d}{\to} N\left(0,1\right)\).
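The steps above can be checked by simulation. The sketch below (my own illustration; Exponential(1) data is an arbitrary choice, with \(\mu=\sigma=1\)) computes the studentized statistic across many replications; if the limit is indeed \(N(0,1)\), the simulated statistics should have mean near 0 and standard deviation near 1:

```python
import numpy as np

rng = np.random.default_rng(12345)
n, reps = 500, 2000
mu = 1.0  # Exponential(1) has mean 1 and standard deviation 1

tstats = np.empty(reps)
for r in range(reps):
    y = rng.exponential(scale=1.0, size=n)
    ybar, s = y.mean(), y.std(ddof=1)   # sample mean and sample sd
    tstats[r] = (ybar - mu) / (s / np.sqrt(n))

# If the studentized mean is approximately N(0, 1), the simulated
# statistics should have mean near 0 and standard deviation near 1.
print(round(tstats.mean(), 2), round(tstats.std(), 2))
```

Note that the data here are far from normal; it is the limit theorems, not normality of the \(Y_i\), doing the work.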

Why does maximum likelihood estimation work?

Let \(\theta\) be the parameter and \(\theta_0\) be the true value of \(\theta\). Let the likelihood function be either the joint pmf or the joint pdf of the observed data. We treat the observed data as random variables \(Y=\left(Y_1,\ldots,Y_n\right)\) with a joint distribution denoted by \(f\). So, the likelihood function in this case is given by \(L\left(\theta\right)=f\left(Y;\theta\right)\).

The key ideas: the score function and the Fisher information

The first two derivatives of the log-likelihood are some of the key ingredients of the theory. The first derivative of the log-likelihood \(\mathcal{l}^\prime\left(\theta\right)\) is called the score function.

Theorem 2 Under “regularity” conditions, \(\mathbb{E}\left(\mathcal{l}^\prime\left(\theta\right)\right)=0\) for all \(\theta\).

To prove this, observe that \[\int f\left(y;\theta\right) dy=1.\] Take the first derivative of both sides of the equation with respect to \(\theta\) and interchange the differentiation and integration operations in order to obtain \[\int \frac{d}{d\theta}f\left(y;\theta\right)dy=0 \Rightarrow \int \left[\frac{1}{f\left(y;\theta\right)}\frac{d}{d\theta}f\left(y;\theta\right)\right]f\left(y;\theta\right)dy=0.\] We can rewrite the previous expression in terms of the likelihood function and produce the score function, i.e., \[\int \left[\frac{1}{L\left(\theta\right)}\frac{d}{d\theta}L\left(\theta\right)\right]f\left(y;\theta\right)dy=0 \Rightarrow \int \mathcal{l}^\prime\left(\theta\right) f\left(y;\theta\right)dy=0\Rightarrow \mathbb{E}\left(\mathcal{l}^\prime\left(\theta\right)\right)=0.\]
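As a sanity check of Theorem 2 in a concrete model (my own illustration, assuming a single Poisson(\(\lambda\)) observation, for which the score is \(y/\lambda - 1\)), we can verify by Monte Carlo that the score has mean zero:

```python
import numpy as np

rng = np.random.default_rng(0)
lam = 3.0
y = rng.poisson(lam, size=200_000)

# Score of a single Poisson(lambda) observation:
# d/dlambda [y ln(lambda) - lambda - ln(y!)] = y/lambda - 1.
score = y / lam - 1.0
print(round(score.mean(), 3))  # close to 0
```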

Definition 4 The Fisher information \(I\left(\theta\right)\) is the variance of the score function, i.e. \[I\left(\theta\right)=\mathsf{Var}\left(\mathcal{l}^\prime\left(\theta\right)\right)\]

Theorem 3 Under “regularity” conditions, \(I\left(\theta\right)=\mathbb{E}\left(-\mathcal{l}^{\prime\prime}\left(\theta\right)\right)\) for all \(\theta\).

To prove this, we start from the conclusion of Theorem 2. Recall that \[\int \mathcal{l}^\prime\left(\theta\right) f\left(y;\theta\right)dy=0.\] Take derivatives of both sides of the equation once more with respect to \(\theta\), interchange the differentiation and integration operations, and apply the product rule in order to obtain \[\frac{d}{d\theta}\int \mathcal{l}^\prime\left(\theta\right) f\left(y;\theta\right)dy=0 \Rightarrow \int \mathcal{l}^{\prime\prime}\left(\theta\right)f\left(y;\theta\right)dy+\int \mathcal{l}^\prime\left(\theta\right)\frac{d}{d\theta}f\left(y;\theta\right)dy=0.\] Applying the same idea from the proof of Theorem 2, we can express the second expression as \[\int \mathcal{l}^\prime\left(\theta\right)\frac{d}{d\theta}f\left(y;\theta\right)dy=\int\mathcal{l}^\prime\left(\theta\right)\left[\frac{1}{f\left(y;\theta\right)}\frac{d}{d\theta}f\left(y;\theta\right)\right]f\left(y;\theta\right)dy=\int \left[\mathcal{l}^\prime\left(\theta\right)\right]^2f\left(y;\theta\right)dy.\] Therefore, we now have \[\int \mathcal{l}^{\prime\prime}\left(\theta\right)f\left(y;\theta\right)dy+\int \left[\mathcal{l}^\prime\left(\theta\right)\right]^2f\left(y;\theta\right)dy=0 \Rightarrow \mathbb{E}\left(-\mathcal{l}^{\prime\prime}\left(\theta\right)\right)=\mathbb{E}\left(\left[\mathcal{l}^\prime\left(\theta\right)\right]^2\right).\] By Theorem 2, \(\mathbb{E}\left(\mathcal{l}^\prime\left(\theta\right)\right)=0\). By Definition 4, we obtain the desired conclusion: \[I\left(\theta\right)=\mathbb{E}\left(\left[\mathcal{l}^\prime\left(\theta\right)\right]^2\right)=\mathbb{E}\left(-\mathcal{l}^{\prime\prime}\left(\theta\right)\right)\]
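Continuing the Poisson(\(\lambda\)) illustration from before (again my own sketch, not from LM), Theorem 3 says the variance of the score should match the expected negative second derivative; both should approximate \(I\left(\lambda\right)=1/\lambda\):

```python
import numpy as np

rng = np.random.default_rng(1)
lam = 3.0
y = rng.poisson(lam, size=200_000)

score = y / lam - 1.0      # l'(lambda) for one observation
neg_hess = y / lam**2      # -l''(lambda) for one observation

# Both Monte Carlo estimates should approximate I(lambda) = 1/lambda.
print(round(score.var(), 3), round(neg_hess.mean(), 3), round(1 / lam, 3))
```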

The derivations you have seen can actually be extended to higher-order derivatives and are actually the basis of the so-called Bartlett identities. These identities are typically used to check whether the assumed parametric statistical model is compatible with the data, but they are also used in theoretical work in improving maximum likelihood estimators.

The key ideas: quadratic shape of the log-likelihood

Consider the situation where the log-likelihood \(\mathcal{l}\left(\theta\right)=\ln L\left(\theta\right)\) is quadratic in \(\theta\), i.e., \[\mathcal{l}\left(\theta\right)=a_1 +a_2 \theta + \frac{1}{2}a_3 \theta^2.\] Note that \(a_1\), \(a_2\), \(a_3\) are all functions of \(Y\) and I suppress this detail in the notation.

The MLE of \(\theta\) is given by \(\widehat{\theta}= -a_3^{-1} a_2\). Observe that taking the first and second derivatives of the log-likelihood with respect to \(\theta\) gives \(\mathcal{l}^\prime\left(\theta\right)=a_2+a_3\theta\) and \(\mathcal{l}^{\prime\prime}\left(\theta\right)=a_3\).

By a first-order Taylor series approximation of \(\mathcal{l}^\prime\left(\widehat{\theta}\right)\) around \(\theta_0\), there exists \(\overline{\theta}\) such that \[\mathcal{l}^\prime\left(\widehat{\theta}\right) = \mathcal{l}^\prime\left(\theta_0\right)+\mathcal{l}^{\prime\prime}\left(\overline{\theta}\right)\left(\widehat{\theta}-\theta_0\right).\] Since \(\mathcal{l}^{\prime\prime}\left(\theta\right)=a_3\) for any \(\theta\) and \(\mathcal{l}^\prime\left(\widehat{\theta}\right)=0\), we have \[\widehat{\theta}-\theta_0=\left[-\mathcal{l}^{\prime\prime}\left(\overline{\theta}\right)\right]^{-1}\mathcal{l}^\prime\left(\theta_0\right)=\left[-\mathcal{l}^{\prime\prime}\left(\theta_0\right)\right]^{-1}\mathcal{l}^\prime\left(\theta_0\right).\]

If you can establish that \(\mathcal{l}^\prime\left(\theta_0\right)\) behaves like \(N\left(0, I\left(\theta_0\right)\right)\) and \(-\mathcal{l}^{\prime\prime}\left(\theta_0\right)\) behaves like a constant \(I\left(\theta_0\right)\), then \(\widehat{\theta}\) would behave like \(N(\theta_0, \left[I\left(\theta_0\right)\right]^{-1})\).

The previous result is, intuitively, the fundamental result justifying maximum likelihood estimation. In more general situations, the log-likelihood is only approximately quadratic, but the essence of how the MLE works is preserved. In a sense, what makes maximum likelihood work is not simply that the sample size \(n\) is large. The treatment you have just seen is based on Geyer (2013), who makes clearer what really makes maximum likelihood work.

The IID case

In the IID case, things become more familiar: the relevant quantities are sums of IID terms, which allows you to invoke limit theorems such as the law of large numbers and the central limit theorem. The treatment you are going to see next is found in typical mathematical statistics textbooks.

In particular, the score function is really a sum of IID terms, i.e. \[\mathcal{l}^\prime\left(\theta\right)=\sum_{i=1}^n \frac{d}{d\theta}\ln f\left(Y_i;\theta\right)\] and the Fisher information can also be written as a sum of IID terms, i.e., \[\mathcal{l}^{\prime\prime}\left(\theta\right) =\sum_{i=1}^n \frac{d^2}{d\theta^2}\ln f\left(Y_i;\theta\right)\Rightarrow I\left(\theta\right)=-n\mathbb{E}\left[\frac{d^2}{d\theta^2}\ln f\left(Y_i;\theta\right)\right].\] The latter may be obtained from the fact that if \(Y_1,\ldots, Y_n\) are IID, then \(\dfrac{d^2}{d\theta^2}\ln f\left(Y_1;\theta\right), \dfrac{d^2}{d\theta^2}\ln f\left(Y_2;\theta\right), \ldots, \dfrac{d^2}{d\theta^2}\ln f\left(Y_n;\theta\right)\) are also IID. Therefore, all of them share a common mean which is \(\displaystyle\mathbb{E}\left[\frac{d^2}{d\theta^2}\ln f\left(Y_i;\theta\right)\right]\).
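As a concrete illustration of these IID formulas (a standard textbook example, chosen here for simplicity), let \(Y_1,\ldots,Y_n\) be IID Bernoulli(\(p\)), so that \(\ln f\left(y;p\right)=y\ln p+\left(1-y\right)\ln\left(1-p\right)\). The score function is the sum \[\mathcal{l}^\prime\left(p\right)=\sum_{i=1}^n\left(\frac{Y_i}{p}-\frac{1-Y_i}{1-p}\right),\] and since \[-\frac{d^2}{dp^2}\ln f\left(Y_i;p\right)=\frac{Y_i}{p^2}+\frac{1-Y_i}{\left(1-p\right)^2},\] taking expectations with \(\mathbb{E}\left(Y_i\right)=p\) gives \[I\left(p\right)=n\left[\frac{1}{p}+\frac{1}{1-p}\right]=\frac{n}{p\left(1-p\right)}.\]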

By applying the asymptotic tools mentioned earlier, we can establish the following result for the IID case:

Theorem 4 Let \(Y_1,\ldots, Y_n\) be IID draws from a common distribution \(f\left(y;\theta\right)\). Let \(\widehat{\theta}\) be the MLE of \(\theta\). Under “regularity” conditions, \[\sqrt{n}\left(\widehat{\theta}-\theta_0\right)\overset{d}{\to} N\left(0,\left(\mathbb{E}\left[-\frac{d^2}{d\theta^2}\ln f\left(Y_i;\theta\right)\right]\bigg|_{\theta=\theta_0}\right)^{-1}\right)\]

In practice, the result is usually interpreted as the MLE \(\widehat{\theta}\) being “approximately normal” with mean \(\theta_0\) and variance \[\dfrac{1}{n\mathbb{E}\left[-\dfrac{d^2}{d\theta^2}\ln f\left(Y_i;\theta\right)\right]\bigg|_{\theta=\theta_0}}=I\left(\theta_0\right)^{-1}.\]
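Theorem 4 can be visualized by simulation. The sketch below (my own illustration, assuming an Exponential model with rate \(\theta\), for which the MLE is \(1/\overline{Y}\) and the per-observation Fisher information is \(1/\theta^2\)) standardizes the MLE as in the theorem; the result should look approximately standard normal:

```python
import numpy as np

rng = np.random.default_rng(2)
theta0, n, reps = 2.0, 400, 2000

# Exponential(theta) with density theta * exp(-theta * y):
# the MLE is 1/ybar, and the per-observation Fisher information
# is 1/theta^2, so sqrt(n)(thetahat - theta0) -> N(0, theta0^2).
z = np.empty(reps)
for r in range(reps):
    y = rng.exponential(scale=1 / theta0, size=n)
    thetahat = 1.0 / y.mean()
    z[r] = np.sqrt(n) * (thetahat - theta0) / theta0  # standardized

# Standardized MLEs should have mean near 0 and sd near 1.
print(round(z.mean(), 2), round(z.std(), 2))
```

Even though \(1/\overline{Y}\) is biased in finite samples, the standardized statistic is close to standard normal at this sample size, which is the sense in which the approximation is used in practice.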

We will return to the consequences of the previous theorem on computation. At this moment, the computational illustration in the first set of notes on likelihood functions already hints at the use of the second derivative of the log-likelihood.

The expression for the Fisher information in the IID case features prominently in the Cramér-Rao inequality found in LM Theorem 5.5.1, which provides another angle for motivating the use of maximum likelihood with respect to its efficiency.