Cramér-Rao, sufficiency, and exponential families

A more general version of Theorem 5.5.1

Let \(Y=\left(Y_1,\ldots, Y_n\right)\) have a joint distribution \(f\left(y;\theta\right)\), where \(\theta\) is a one-dimensional parameter. Let \(T\left(Y\right)\) be an estimator of \(g\left(\theta\right)\). Assume that

  • \(f\left(y;\theta\right)\) is twice differentiable in \(\theta\)
  • the support of the random variables does not depend on \(\theta\)
  • differentiation with respect to \(\theta\) and integration with respect to \(y\) can be interchanged
  • \(T\left(Y\right)\) has finite variance
  • \(g\left(\theta\right)\) is differentiable with respect to \(\theta\)
  • the bias \(b\left(\theta\right)=\mathbb{E}\left(T\left(Y\right)\right)-g\left(\theta\right)\) is also differentiable with respect to \(\theta\)

Then, \[\mathsf{Var}\left(T\left(Y\right)\right) \geq \frac{\left(g^\prime\left(\theta\right)+b^\prime\left(\theta\right)\right)^2}{I\left(\theta\right)}.\]

It is possible to extend the idea to parameters of more than one dimension (as long as the dimension is finite). The expressions will then be in matrix form instead.

In the IID case, with estimand \(g\left(\theta\right)=\theta\) and \(T\left(Y\right)\) unbiased for \(\theta\) so that \(b\left(\theta\right)=0\), we obtain LM Theorem 5.5.1.
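To make the bound concrete, here is a minimal simulation sketch (in Python, assuming NumPy is available; the model and the constants are illustrative choices, not from LM). For IID \(N\left(\mu,\sigma^2\right)\) data with \(\sigma^2\) known, \(\overline{Y}\) is unbiased for \(\mu\) and \(I\left(\mu\right)=n/\sigma^2\), so the lower bound \(\sigma^2/n\) should be attained by the sample mean.

```python
import numpy as np

rng = np.random.default_rng(12345)
mu, sigma, n, reps = 2.0, 1.5, 50, 100_000   # true mean, known sd, sample size, replications

# Each row is one simulated dataset; the sample mean is an unbiased estimator of mu.
samples = rng.normal(loc=mu, scale=sigma, size=(reps, n))
estimates = samples.mean(axis=1)

crlb = sigma**2 / n                          # 1 / I(mu) when sigma^2 is known
print("Monte Carlo variance of the sample mean:", estimates.var())
print("Cramer-Rao lower bound:                 ", crlb)
```

The Monte Carlo variance should be very close to the bound, reflecting that \(\overline{Y}\) is efficient in this model.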

Sketch of a proof of the Cramér-Rao lower bound

The idea starts from \[\mathbb{E}\left(T\left(Y\right)\right)=g\left(\theta\right)+b\left(\theta\right).\] Take the derivative of both sides with respect to \(\theta\) in order to obtain \[\frac{d}{d\theta}\int T\left(y\right)f\left(y;\theta\right)\, dy=g^\prime\left(\theta\right)+b^\prime\left(\theta\right).\] Using the same approach as in deriving the properties of the score function, we can simplify the left-hand side to \[\mathbb{E}\left(T\left(Y\right)\ell^\prime\left(\theta\right)\right)=g^\prime\left(\theta\right)+b^\prime\left(\theta\right).\] Because the score function has zero mean, the left-hand side can be written as a covariance: \[\mathsf{Cov}\left(T\left(Y\right),\ell^\prime\left(\theta\right)\right)=g^\prime\left(\theta\right)+b^\prime\left(\theta\right).\] We can express the covariance in terms of correlation: \[\mathsf{Corr}^2\left(T\left(Y\right),\ell^\prime\left(\theta\right)\right) \cdot \mathsf{Var}\left(T\left(Y\right)\right) \cdot \mathsf{Var}\left(\ell^\prime\left(\theta\right)\right)= \left[g^\prime\left(\theta\right)+b^\prime\left(\theta\right)\right]^2.\] Since squared correlations are at most one, we have \[\mathsf{Var}\left(T\left(Y\right)\right) \cdot\mathsf{Var}\left(\ell^\prime\left(\theta\right)\right) \geq \left[g^\prime\left(\theta\right)+b^\prime\left(\theta\right)\right]^2.\] Finally, the variance of the score is the Fisher information, \(\mathsf{Var}\left(\ell^\prime\left(\theta\right)\right)=I\left(\theta\right)\), so dividing both sides by \(I\left(\theta\right)\) gives the result.
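To see the key covariance identity in action, consider the IID \(\mathsf{Poisson}\left(\lambda\right)\) case with \(T\left(Y\right)=\overline{Y}\) (a standard illustration, not taken from LM). The score is \[\ell^\prime\left(\lambda\right)=\frac{1}{\lambda}\sum_{i=1}^n Y_i - n,\] so \[\mathsf{Cov}\left(\overline{Y},\ell^\prime\left(\lambda\right)\right)=\frac{n}{\lambda}\mathsf{Var}\left(\overline{Y}\right)=\frac{n}{\lambda}\cdot\frac{\lambda}{n}=1=g^\prime\left(\lambda\right)+b^\prime\left(\lambda\right),\] and \(\mathsf{Var}\left(\ell^\prime\left(\lambda\right)\right)=n/\lambda=I\left(\lambda\right)\). The lower bound is therefore \(\lambda/n\), which is exactly \(\mathsf{Var}\left(\overline{Y}\right)\), so the sample mean attains the Cramér-Rao lower bound in this model.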

The different notions of sufficiency

An early example

Informally, a statistic is sufficient for \(\theta\) if it contains all the information about \(\theta\) in the data. Another way to frame the idea is to think of a statistic as data compression or data reduction. If you have a sufficient statistic, then the data \(y_1,\ldots,y_n\) can be reduced to a lower-dimensional function of the data without losing any information about \(\theta\).

We actually encountered this idea before when we formed the likelihood function for \(\left(\mu,\sigma^2\right)\) in the case of IID normal random variables. Recall that \[L\left(\mu, \sigma^2; y_1,\ldots,y_n\right)=\left(\frac{1}{\sqrt{2\pi\sigma^2}}\right)^n\exp\left(-\frac{(n-1)s^2}{2\sigma^2}\right)\exp\left(-\frac{1}{2\sigma^2}n\left(\overline{y}-\mu\right)^2\right).\] Thus, the likelihood function can be formed any time you have access to \(\overline{y}\) and \(s^2\), regardless of what values the data \(y_1,\ldots,y_n\) take.

Intuitively, \(\left(\overline{Y},S^2\right)\) contain all the information about \(\left(\mu,\sigma^2\right)\) present in the data. Notice that \(\displaystyle\left(\sum_{i=1}^n Y_i,\sum_{i=1}^n Y_i^2\right)\) can also play the same role. You can actually think of others, but the bottom line is that if I give you a sufficient statistic, I should be able to use that to form the likelihood function as if I had access to the full data.
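Here is a minimal numerical sketch of this point (in Python, assuming NumPy; the simulated data and the candidate parameter values are arbitrary illustrations). The log-likelihood computed from the full data matches the one reconstructed from \(\left(\overline{y},s^2\right)\) alone.

```python
import numpy as np

rng = np.random.default_rng(0)
y = rng.normal(loc=2.0, scale=1.5, size=30)    # pretend this is the observed data
mu, sigma2 = 1.8, 2.0                          # an arbitrary candidate value of (mu, sigma^2)

n = y.size
ybar, s2 = y.mean(), y.var(ddof=1)             # the sufficient statistics

# Log-likelihood computed from the full data y_1, ..., y_n
loglik_full = -0.5 * n * np.log(2 * np.pi * sigma2) - np.sum((y - mu) ** 2) / (2 * sigma2)

# Log-likelihood reconstructed from (ybar, s2) only, using the displayed factorization
loglik_suff = (-0.5 * n * np.log(2 * np.pi * sigma2)
               - (n - 1) * s2 / (2 * sigma2)
               - n * (ybar - mu) ** 2 / (2 * sigma2))

print(np.isclose(loglik_full, loglik_suff))    # True: the two agree
```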

Based on the preceding discussion, sufficient statistics are not unique. Furthermore, LM Exercise 5.6.3 asks you to show that if \(\widehat{\theta}\) is sufficient for \(\theta\), then any one-to-one function of \(\widehat{\theta}\) is also sufficient for \(\theta\).

LM Definition 5.6.1 provides one characterization of what makes an estimator \(\widehat{\theta}\) sufficient for \(\theta\). It is actually difficult to use this definition directly to prove that an estimator is sufficient, especially in the continuous case. The Binomial example used to motivate the key idea behind sufficiency can be difficult to adapt to the continuous case: you just have to remember every student’s nightmare involving Jacobians of transformations in a first course in probability theory.

Characterizing sufficiency

Let \(Y=\left(Y_1,\ldots,Y_n\right)\) be the data treated as random variables and \(y=\left(y_1,\ldots, y_n\right)\) be the actual data observed. I am going to write \(\widehat{\theta}\) as an explicit function of the data, \(\widehat{\theta}\left(Y\right)\), to make the ideas clearer. Let \(\theta_e\) denote the value of the estimate computed from the observed data, so that \(\widehat{\theta}\left(y\right)=\theta_e\). We have four equivalent versions of what it means for an estimator \(\widehat{\theta}\left(Y\right)\) to be sufficient for \(\theta\):

  1. (Demonstrated through the Binomial example in LM, but works more generally) The conditional distribution of \(Y\) given \(\widehat{\theta}\left(Y\right)\) does not depend on \(\theta\).
  2. (LM Definition 5.6.1) The likelihood function for \(\theta\) could be written as a product of two components: a function of the data alone and the distribution of \(\widehat{\theta}\left(Y\right)\) which depends on \(\theta\), i.e., \[L\left(\theta; y\right)=f_{\widehat{\theta}}\left(\theta_e;\theta\right)b\left(y\right).\]
  3. (LM Theorem 5.6.1 or the Fisher-Neyman factorization theorem) The likelihood function for \(\theta\) could be written as a product of two components: a function of the data alone and a function of \(\widehat{\theta}\left(y\right)\) and \(\theta\), i.e., \[L\left(\theta; y\right)=g\left(\widehat{\theta}\left(y\right), \theta\right)h\left(y\right).\]
  4. (Not in LM explicitly) For any two datasets \(y^{(1)}=\left(y^{(1)}_1,\ldots,y_n^{(1)}\right)\) and \(y^{(2)}=\left(y^{(2)}_1,\ldots,y_n^{(2)}\right)\) which produce the same estimate \(\theta_e\), i.e., \(\widehat{\theta}\left(y^{(1)}\right)=\widehat{\theta}\left(y^{(2)}\right)=\theta_e\), the ratio of the likelihood function evaluated at dataset \(y^{(1)}\) and the likelihood function evaluated at another dataset \(y^{(2)}\) \[\frac{L\left(\theta; y^{(1)}\right)}{L\left(\theta; y^{(2)}\right)}=\frac{f_{\widehat{\theta}}\left(\theta_e;\theta\right)b\left(y^{(1)}\right)}{f_{\widehat{\theta}}\left(\theta_e;\theta\right)b\left(y^{(2)}\right)}=\frac{b\left(y^{(1)}\right)}{b\left(y^{(2)}\right)}\] does not depend on \(\theta\).

It may appear that the second and third characterizations are the same, but pay attention! The third characterization implies that it is unnecessary to actually find the distribution of \(\widehat{\theta}\left(Y\right)\) to construct the decomposition of the likelihood function.
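As a quick illustration of the third characterization (a standard example, not from LM), suppose \(Y_1,\ldots,Y_n\) are IID exponential with rate \(\lambda\). Then \[L\left(\lambda; y\right)=\prod_{i=1}^n \lambda e^{-\lambda y_i}=\underbrace{\lambda^n \exp\left(-\lambda\sum_{i=1}^n y_i\right)}_{g\left(\sum_{i=1}^n y_i,\ \lambda\right)}\cdot\underbrace{1}_{h\left(y\right)},\] so \(\displaystyle\sum_{i=1}^n Y_i\) (and hence \(\overline{Y}\)) is sufficient for \(\lambda\), and at no point did we need the distribution of \(\displaystyle\sum_{i=1}^n Y_i\).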

It may seem like a lot, but you actually have four ways to show whether or not an estimator is sufficient. Choose whichever is easiest to use; of course, this will depend on the setting.

Exponential families

We spent a lot of time talking about normally distributed random variables. Although we have not yet dug deep into normal models (in the sense of exploring Chapter 7), you already have a sense of the neatness of the analytical results provided by this parametric family.

There is actually another parametric family, identified in the 1930s and including the normal, that has been the subject of extensive research from around the late 1970s until now. This family has a structure which allows for “almost” exact results that parallel the exact results provided by the normal family. It is called the exponential family, or the Koopman-Darmois-Pitman family. Bradley Efron actually states in his recent book on exponential families that “My own experience has been that when I can put a problem, applied or theoretical, into an exponential family framework, a solution is often imminent.”

LM Exercise 5.6.9 has a definition of an exponential family. This family is special enough that it has links to both sufficiency and the Cramér-Rao lower bound. The exercise also asks you to show that a sufficient statistic exists for an exponential family. Try checking whether some of the special distributions you know are actually members of the exponential family. Finally, it can be shown that the Cramér-Rao lower bound is attained in exponential families.
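As a quick illustration (using the standard one-parameter form \(f\left(y;\theta\right)=h\left(y\right)\exp\left(\eta\left(\theta\right)K\left(y\right)-A\left(\theta\right)\right)\); LM's notation in Exercise 5.6.9 may differ slightly), take IID Bernoulli trials with success probability \(p\): \[f\left(y;p\right)=p^y\left(1-p\right)^{1-y}=\left(1-p\right)\exp\left(y\log\frac{p}{1-p}\right),\] so \(K\left(y\right)=y\) and \(\displaystyle\sum_{i=1}^n Y_i\) is sufficient for \(p\). Moreover, \(\overline{Y}\) is unbiased for \(p\) with \(\mathsf{Var}\left(\overline{Y}\right)=p\left(1-p\right)/n\), which equals \(1/I\left(p\right)\) since \(I\left(p\right)=n/\left[p\left(1-p\right)\right]\): the Cramér-Rao lower bound is attained.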