The method of moments

The method of moments before 1982

One-dimensional parameter

Consider an example where \(X_{1},X_{2},\ldots,X_{n}\) are IID \(\mathsf{Exp}\left(\lambda\right)\) and the estimand is the parameter \(\lambda>0\). The method of moments proceeds from the following observation: \[\mathbb{E}\left[X_{i}^{k}\right]=\dfrac{k!}{\lambda^{k}},\qquad i=1,\ldots,n,\quad k=1,2,\ldots\]

If we focus on the first moment\[\mathbb{E}\bigg[\underbrace{X_{i}-\dfrac{1}{\lambda}}_{m_{1i}\left(\lambda\right)}\bigg]=0.\] We can construct a method of moments estimator by replacing \(\mathbb{E}\left(\cdot\right)\), which is an operation indicating a population mean, with an operation indicating a sample mean \(\displaystyle\frac{1}{n}\sum_{i=1}^n \left(\cdot\right)\). The resulting estimator of \(\lambda\) is going to be the solution to the equation \[\frac{1}{n}\sum_{i=1}^{n}\left[X_{i}-\dfrac{1}{\lambda}\right]=0\quad\Rightarrow\quad\widehat{\lambda}=\left(\frac{1}{n}\sum_{i=1}^{n}X_{i}\right)^{-1}=\dfrac{1}{\overline{X}}.\]

You could have chosen to use other moments, for example, the second moment:\[\mathbb{E}\bigg[\underbrace{X_{i}^{2}-\dfrac{2}{\lambda^{2}}}_{m_{2i}\left(\lambda\right)}\bigg]=0.\] The resulting estimator of \(\lambda\) is going to be the solution to the equation \[\frac{1}{n}\sum_{i=1}^{n}\left[X_{i}^{2}-\dfrac{2}{\lambda^{2}}\right]=0\quad\Rightarrow\quad\widehat{\lambda}=\sqrt{2\left(\frac{1}{n}\sum_{i=1}^{n}X_{i}^{2}\right)^{-1}}.\]
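
To see these two estimators side by side, here is a minimal sketch in R (the seed, sample size, and true rate below are arbitrary choices made only for illustration):

```r
# Simulate IID draws from Exp(lambda) with an arbitrarily chosen true rate
set.seed(42)
n <- 500
lambda <- 2
x <- rexp(n, rate = lambda)

# Method of moments estimator based on the first moment: 1 / sample mean
lambda_hat_1 <- 1 / mean(x)

# Method of moments estimator based on the second moment
lambda_hat_2 <- sqrt(2 / mean(x^2))

c(first_moment = lambda_hat_1, second_moment = lambda_hat_2)
```

Both estimates should be close to the true rate, although they will typically not coincide in finite samples.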

Observe that one moment was needed to form an estimator for a one-dimensional parameter.

Two-dimensional parameters

Normal distribution with unknown mean and variance

Consider an example where \(X_{1},X_{2},\ldots,X_{n}\) are IID \(N\left(\mu,\sigma^2\right)\), where the estimand is the parameter \(\left(\mu,\sigma^2\right)\). Observe that we have the first two moments: \[\mathbb{E}\left(X_i\right)=\mu,\ \ \ \mathbb{E}\left(X_i^2\right)=\sigma^2 + \mu^2.\] We can rewrite these as: \[\mathbb{E}\bigg[\underbrace{X_{i}-\mu}_{m_{1i}\left(\mu,\sigma^2\right)}\bigg]=0, \ \ \ \mathbb{E}\bigg[\underbrace{X^2_{i}-\mu^2-\sigma^2}_{m_{2i}\left(\mu,\sigma^2\right)}\bigg]=0\]

Applying the idea behind the method of moments, we have \[\begin{eqnarray}\frac{1}{n}\sum_{i=1}^n \left(X_i- \widehat{\mu}\right) &=& 0 \\ \frac{1}{n}\sum_{i=1}^n \left(X_i^2- \widehat{\mu}^2-\widehat{\sigma}^2\right) &=& 0 \end{eqnarray}\] The solutions, which incidentally match the corresponding maximum likelihood estimators, will be \[\widehat{\mu}=\overline{X},\ \ \ \widehat{\sigma}^2=\frac{1}{n}\sum_{i=1}^n X_i^2 - \left(\overline{X}\right)^2\]
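
As a quick illustration, here is a minimal sketch in R of these two estimators on simulated data (the seed, sample size, and true values of \(\mu\) and \(\sigma\) are arbitrary choices):

```r
# Simulate IID draws from N(mu, sigma^2) with arbitrarily chosen true values
set.seed(42)
n <- 500
x <- rnorm(n, mean = 1, sd = 2)

mu_hat     <- mean(x)                  # solves the first sample moment equation
sigma2_hat <- mean(x^2) - mean(x)^2    # solves the second sample moment equation

c(mu_hat = mu_hat, sigma2_hat = sigma2_hat)
```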

Best linear predictor

I will first introduce the idea of a best linear predictor, which may be thought of as an extension of the idea of best prediction that I discussed when we were looking into unbiasedness.

Suppose we have two random variables \(X_{1}\) and \(Y\), which follow a joint distribution \(f_{X_{1},Y}\left(x_{1},y\right)\). Suppose you draw a unit at random from the subpopulation of units with \(X_{1}=x_{1}\). Your task is to predict this unit’s \(Y\). How do we accomplish this task optimally?

Consider a prediction rule of the form \(\beta_{0}+\beta_{1}X_{1}\). The task is now to find the unique solution to the following optimization problem: \[\min_{\beta_{0},\beta_{1}}\mathbb{E}\left[\left(Y-\beta_{0}-\beta_{1}X_{1}\right)^{2}\right].\] Provided that \(\mathsf{Var}\left(X_{1}\right)>0\), the optimal solution \(\left(\beta_{0}^{*},\beta_{1}^{*}\right)\) solves the following first-order conditions: \[\begin{eqnarray} \mathbb{E}\left(Y\right)-\beta_{0}^{*}-\beta_{1}^{*}\mathbb{E}\left(X_{1}\right) &=& 0,\\ \mathbb{E}\left(X_{1}Y\right)-\beta_{0}^{*}\mathbb{E}\left(X_{1}\right)-\beta_{1}^{*}\mathbb{E}\left(X_{1}^{2}\right) &=& 0. \end{eqnarray}\]

Solving the first condition for \(\beta_{0}^{*}\) and substituting it into the second, we have \[\beta_{0}^{*} = \mathbb{E}\left(Y\right)-\beta_{1}^{*}\mathbb{E}\left(X_{1}\right),\qquad\beta_{1}^{*}=\dfrac{\mathsf{Cov}\left(X_{1},Y\right)}{\mathsf{Var}\left(X_{1}\right)}.\]

Therefore, the optimal prediction rule under a minimum MSE criterion is \(\beta_0^*+\beta_1^* X_1\). Let us revisit the question at the beginning. Suppose you draw a unit at random from the subpopulation of units with \(X_{1}=x_{1}\). Your task is to predict this unit’s \(Y\). The optimal prediction rule will then be \(\beta^*_0+\beta_1^* x_1\).

To use this optimal prediction rule in practice, we need to estimate \(\left(\beta_0^*, \beta_1^*\right)\). How do we do this? Use the method of moments! After some rewriting, we can express the first-order conditions as \[\begin{eqnarray} \mathbb{E}\bigg(\underbrace{Y-\beta_{0}^{*}-\beta_{1}^{*}X_{1}}_{m_{1i}\left(\beta_0^*, \beta_1^*\right)}\bigg) &=& 0,\\ \mathbb{E}\bigg(\underbrace{X_{1}\left(Y-\beta_{0}^{*}-\beta_{1}^{*}X_{1}\right)}_{m_{2i}\left(\beta_0^*, \beta_1^*\right)}\bigg) &=& 0. \end{eqnarray}\]

Replace the population mean with the sample mean in order to obtain \[\begin{eqnarray}\frac{1}{n}\sum_{i=1}^n \left(Y_i-\widehat{\beta}_0-\widehat{\beta}_1 X_{1i}\right) &=& 0 \\ \frac{1}{n}\sum_{i=1}^n X_{1i}\left(Y_i-\widehat{\beta}_0-\widehat{\beta}_1 X_{1i}\right) &=& 0 \end{eqnarray}\] Solving these two equations for \(\left(\widehat{\beta}_0,\widehat{\beta}_1\right)\), we will have \[\widehat{\beta}_0=\overline{Y}-\widehat{\beta}_1 \overline{X}_1,\ \ \ \widehat{\beta}_1=\frac{\displaystyle\frac{1}{n}\sum_{i=1}^n X_{1i}Y_i-\overline{X}_1\overline{Y}}{\displaystyle\frac{1}{n}\sum_{i=1}^n X_{1i}^2-\left(\overline{X}_1\right)^2}=\dfrac{\dfrac{1}{n}\displaystyle\sum_{i=1}^{n}\left(X_{1i}-\overline{X}_{1}\right)\left(Y_{i}-\overline{Y}\right)}{\dfrac{1}{n}\displaystyle\sum_{i=1}^{n}\left(X_{1i}-\overline{X}_{1}\right)^{2}}.\]
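
As a sanity check, here is a minimal sketch in R that computes \(\left(\widehat{\beta}_0,\widehat{\beta}_1\right)\) directly from the formulas above and compares them with the coefficients reported by lm() (the data-generating process below is an arbitrary choice made only for illustration):

```r
# Simulate data from an arbitrary joint distribution of (X1, Y)
set.seed(42)
n <- 500
x1 <- runif(n)
y  <- 1 + 2 * x1 + rnorm(n)

# Method of moments estimators of the best linear predictor coefficients
beta1_hat <- (mean(x1 * y) - mean(x1) * mean(y)) / (mean(x1^2) - mean(x1)^2)
beta0_hat <- mean(y) - beta1_hat * mean(x1)

# These coincide with the least squares coefficients from lm()
c(beta0_hat = beta0_hat, beta1_hat = beta1_hat)
coef(lm(y ~ x1))
```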

These quantities may be familiar to students who have studied finding the “line of best fit”.

Observe again that you need at least two moments to form an estimator for a two-dimensional parameter.

The setup for the method of moments

All of our examples featured equations of the form \(\mathbb{E}\left(m_{i}\left(\theta\right)\right)=0\). These equations effectively restrict the function \(m_{i}\) to have zero mean. Therefore, these equations are sometimes called moment conditions or moment restrictions or orthogonality conditions. The latter terminology is probably not very obvious, but I don’t think you should be worried about it now. What is important is that the number of equations should match the dimension of \(\theta\).

The function \(m_{i}\left(\theta\right)\), which depends on the data, is called a moment function. The true parameter value \(\theta_0\) is usually assumed to be the only value of \(\theta\) which satisfies the moment conditions. The expectation operator \(\mathbb{E}\) is with respect to the distribution of the data, which is indexed by the parameter \(\theta_0\). But, in contrast to the likelihood, we do not specify the complete distribution of the data!

The sample counterparts of the moment conditions are given by \[\widehat{m}\left(\theta\right)=\frac{1}{n}\sum_{i=1}^{n}m_{i}\left(\theta\right).\] To obtain the method of moments estimator \(\widehat{\theta}\), we have to solve a system of equations of the form \(\widehat{m}\left(\widehat{\theta}\right)=0\).

Why use the method of moments if maximum likelihood is available?

Sometimes it is easier to compute method of moments estimators than maximum likelihood estimators. They also seem to be natural estimators because there is sometimes a tangible link between the moments of a random variable and the parameter of interest.

More importantly, the method of moments does not require us to specify the full distribution of the data. This may feel strange, especially in the example where we have IID random variables from \(\mathsf{Exp}\left(\lambda\right)\): there, the moments were calculated by exploiting the information from the exponential distribution! Perhaps the method of moments is most interesting in the example about the best linear predictor, where no parametric distribution was specified at all.

Finally, the method of moments estimator can be used as a starting point when maximizing the log-likelihood. Recall that the approximate quadraticity of the log-likelihood allows us to write \[\widehat{\theta}-\theta_0 \approx \left[-\ell^{\prime\prime}\left(\theta_0\right)\right]^{-1}\ell^\prime\left(\theta_0\right).\] Suppose we have a starting point \(\theta^{(1)}\). We can use the expression as a way to create an “update”, i.e., \[\theta^{(2)} \approx \theta^{(1)} + \left[-\ell^{\prime\prime}\left(\theta^{(1)}\right)\right]^{-1}\ell^\prime\left(\theta^{(1)}\right).\] We can iterate this repeatedly and hope for the best. But if you choose a consistent and asymptotically normal estimator of \(\theta_0\) as a starting point, then a single update step already produces an estimator that is asymptotically equivalent to the MLE.
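
Here is a minimal sketch in R of this one-step update for the IID \(\mathsf{Exp}\left(\lambda\right)\) example, using the second-moment method of moments estimate as the starting point (the simulated data are an arbitrary choice; the MLE of the exponential rate is \(1/\overline{X}\), which happens to coincide with the first-moment method of moments estimator):

```r
# One Newton-type update of the log-likelihood l(lambda) = n*log(lambda) - lambda*sum(x)
set.seed(42)
n <- 500
x <- rexp(n, rate = 2)

score    <- function(lambda) n / lambda - sum(x)   # l'(lambda)
neg_hess <- function(lambda) n / lambda^2          # -l''(lambda)

# Start from the (consistent, asymptotically normal) second-moment estimate
start    <- sqrt(2 / mean(x^2))
one_step <- start + score(start) / neg_hess(start)

c(start = start, one_step = one_step, mle = 1 / mean(x))
```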

The method of moments: key idea as to why it works

We focus on the case where \(\theta\) and \(\widehat{m}\left(\cdot\right)\) are both one-dimensional. The multi-dimensional case is slightly more complicated in notation but follows a similar argument.

In a similar spirit as what we did with the score function, we take a first-order Taylor series approximation of \(\widehat{m}\left(\widehat{\theta}\right)\) about \(\theta_0\). In particular, \[\widehat{m}\left(\widehat{\theta}\right)\approx \widehat{m}\left(\theta_0\right)+\frac{\partial \widehat{m}\left(\theta\right)}{\partial \theta}\bigg|_{\theta=\theta_0} \left(\widehat{\theta}-\theta_0\right).\] Since \(\widehat{m}\left(\widehat{\theta}\right)=0\) by construction, \[\widehat{\theta}-\theta_0\approx \left[-\frac{\partial \widehat{m}\left(\theta\right)}{\partial \theta}\bigg|_{\theta=\theta_0}\right]^{-1}\widehat{m}\left(\theta_0\right).\]

If you can establish that \(\displaystyle -\frac{\partial \widehat{m}\left(\theta\right)}{\partial \theta}\bigg|_{\theta=\theta_0}\) behaves like a constant \(G\left(\theta_0\right)\) and \(\widehat{m}\left(\theta_0\right)\) behaves like \(N\left(0,V\left(\theta_0\right)\right)\), then \(\widehat{\theta}\) would behave like \(N\left(\theta_0, \left(G\left(\theta_0\right)\right)^{-1}V\left(\theta_0\right)\left(G\left(\theta_0\right)\right)^{-1}\right)\).
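
To make this concrete, here is an illustrative calculation for the IID \(\mathsf{Exp}\left(\lambda\right)\) example with the first moment condition, using the fact that \(\mathsf{Var}\left(X_{i}\right)=1/\lambda_{0}^{2}\): \[\widehat{m}\left(\lambda\right)=\frac{1}{n}\sum_{i=1}^{n}\left(X_{i}-\dfrac{1}{\lambda}\right)\quad\Rightarrow\quad G\left(\lambda_{0}\right)=-\frac{\partial \widehat{m}\left(\lambda\right)}{\partial \lambda}\bigg|_{\lambda=\lambda_{0}}=-\dfrac{1}{\lambda_{0}^{2}},\qquad V\left(\lambda_{0}\right)=\mathsf{Var}\left(\widehat{m}\left(\lambda_{0}\right)\right)=\dfrac{1}{n\lambda_{0}^{2}}.\] Therefore, \[\left(G\left(\lambda_{0}\right)\right)^{-1}V\left(\lambda_{0}\right)\left(G\left(\lambda_{0}\right)\right)^{-1}=\left(-\lambda_{0}^{2}\right)\cdot\dfrac{1}{n\lambda_{0}^{2}}\cdot\left(-\lambda_{0}^{2}\right)=\dfrac{\lambda_{0}^{2}}{n},\] so \(\widehat{\lambda}=1/\overline{X}\) behaves like \(N\left(\lambda_{0},\lambda_{0}^{2}/n\right)\), which matches what the delta method applied to \(1/\overline{X}\) would give.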

Computation

Some programming is also needed to use the method of moments, especially if you have to implement it from scratch. Just as in maximum likelihood, you have to set up the moment equations and then simultaneously solve for the roots of those equations. For one-dimensional problems, the function uniroot() could be used. But the construction of standard errors is not automatic! For multi-dimensional problems, the package nleqslv may be useful. I have used it to calculate rough estimates by the method of moments, but I have not used it to construct standard errors. Finally, the R package momentfit provides a whole framework meant to offer something close to an automatic procedure to generate estimates and standard errors, construct confidence intervals, and test claims.
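
As a rough sketch of what this looks like in practice, here is how one might solve the exponential and normal examples from above numerically (the simulated data, search interval, and starting values are arbitrary choices; the second part assumes the nleqslv package is installed):

```r
# One-dimensional problem: solve the first sample moment condition for Exp(lambda)
set.seed(42)
x <- rexp(500, rate = 2)
m_hat <- function(lambda) mean(x) - 1 / lambda
uniroot(m_hat, lower = 1e-6, upper = 100)$root   # should be close to 1 / mean(x)

# Two-dimensional problem: solve both sample moment conditions for N(mu, sigma^2)
# theta = c(mu, sigma^2); requires the nleqslv package
y <- rnorm(500, mean = 1, sd = 2)
m_hat2 <- function(theta) {
  c(mean(y) - theta[1],
    mean(y^2) - theta[1]^2 - theta[2])
}
nleqslv::nleqslv(x = c(0, 1), fn = m_hat2)$x
```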

The method of moments after 1982

Economic models typically have parameters of interest that can be tied to moment restrictions, especially once uncertainty is taken into account. Just think back to simpler economic models without uncertainty. For example, the first-order condition for cost minimization ties parameters describing the production technology to observable quantities such as labor, capital, wages, and the interest rate. In more complicated models with uncertainty, some version of marginal cost being equal to marginal benefit holds on average. Economic models also tend to generate many moment restrictions, often far more than the dimension of the parameters.

Hansen (1982) shows how to exploit the extra moment restrictions that are available. The idea is to put weights on the moments and allow for correlation between moment restrictions! The underlying idea can be understood in terms of our IID \(\mathsf{Exp}\left(\lambda\right)\) example. The moment function is now given by \[m_{i}\left(\lambda\right)=\left(\begin{array}{c} m_{1i}\left(\lambda\right)\\ m_{2i}\left(\lambda\right) \end{array}\right),\] where the components were already defined earlier. We cannot, in general, solve both equations in \(\widehat{m}\left(\lambda\right)=0\) exactly for a one-dimensional \(\lambda\). Instead, we want both sample moments to be as close to zero as possible.

What I described leads to the generalized method of moments (GMM) estimator. At least in the IID \(\mathsf{Exp}\left(\lambda\right)\) context, this estimator is defined as the solution to the following minimization problem: \[\widehat{\lambda}_{GMM}=\arg\min_{\lambda}\widehat{m}\left(\lambda\right)^{\prime}\widehat{W}^{-1}\widehat{m}\left(\lambda\right),\] where \(\widehat{W}\) is a weighting matrix that has to satisfy some conditions and could be chosen optimally (in the sense of delivering the estimator with the lowest asymptotic variance among those based on the same set of moment restrictions).
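
Here is a minimal sketch in R of this objective function for the exponential example, using the identity matrix as the weighting matrix (an arbitrary, non-optimal choice) and a one-dimensional numerical minimizer:

```r
# GMM sketch for the IID Exp(lambda) example with two moment conditions
set.seed(42)
x <- rexp(500, rate = 2)

m_hat <- function(lambda) {
  c(mean(x) - 1 / lambda,         # first moment condition
    mean(x^2) - 2 / lambda^2)     # second moment condition
}

W <- diag(2)   # arbitrary weighting matrix (the identity)
gmm_objective <- function(lambda) {
  m <- m_hat(lambda)
  drop(t(m) %*% solve(W) %*% m)
}

optimize(gmm_objective, lower = 1e-6, upper = 100)$minimum
```

The resulting estimate balances the two moment conditions instead of setting either one exactly to zero.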

If you want to learn more about the discovery and how it relates to economics and finance, read the background information provided by the Nobel Prize website.