Recall that we spent a lot of time looking into IID normal random variables \(N\left(\mu,\sigma^2\right)\). We specifically looked at the sample mean \(\overline{Y}\) and asked what we can learn by using it. We found that
the sample mean is an unbiased estimator of \(\mu\)
the sample mean is a consistent estimator of \(\mu\)
we can attach a measure of uncertainty to this estimator
we can make probability statements involving \(\mu\)
We also found that, in a variety of situations, the assumption of normality may not be crucial for learning \(\mu\). Everything we did focused on the sample mean: we studied its performance and provided some guidelines for its use.
We are going to venture into similar developments, but you may find it too repetitive to go through the same process every time we have to derive an arbitrary estimator’s performance and develop guidelines for its use. So, we have to dig into the commonalities and hopefully develop a more unified theory.
Joint density of IID normal random variables
Let \(Y_1,\ldots,Y_n\) be IID normal random variables \(N\left(\mu,\sigma^2\right)\). Let us construct the joint density of \(Y_1,\ldots,Y_n\). It is given by \[\left(\frac{1}{\sqrt{2\pi\sigma^2}}\right)^n\exp\left(-\frac{1}{2\sigma^2}\sum_{i=1}^n\left(y_i-\mu\right)^2\right)\] Now, observe that \[\sum_{i=1}^n\left(y_i-\mu\right)^2=\sum_{i=1}^n\left(y_i-\overline{y}\right)^2+n\left(\overline{y}-\mu\right)^2.\] So, the joint density of \(Y_1,\ldots,Y_n\) may be rewritten as \[\begin{eqnarray}&&\left(\frac{1}{\sqrt{2\pi\sigma^2}}\right)^n\exp\left(-\frac{1}{2\sigma^2}\sum_{i=1}^n\left(y_i-\overline{y}\right)^2\right)\exp\left(-\frac{1}{2\sigma^2}n\left(\overline{y}-\mu\right)^2\right)\\ &=& \left(\frac{1}{\sqrt{2\pi\sigma^2}}\right)^n\exp\left(-\frac{(n-1)s^2}{2\sigma^2}\right)\exp\left(-\frac{1}{2\sigma^2}n\left(\overline{y}-\mu\right)^2\right)\end{eqnarray}\] where \(s^2=\frac{1}{n-1}\sum_{i=1}^n\left(y_i-\overline{y}\right)^2\) is the observed sample variance.
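If the decomposition above looks suspicious, a quick numerical check in R may help. The simulated values below are arbitrary choices of my own and are not used elsewhere in these notes.
# Check sum((y - mu)^2) against sum((y - ybar)^2) + n * (ybar - mu)^2
set.seed(20)
y.check <- rnorm(8, mean = 1, sd = 2)
mu.check <- 0.5
c(sum((y.check - mu.check)^2),
  sum((y.check - mean(y.check))^2) + length(y.check) * (mean(y.check) - mu.check)^2)
The two printed numbers should coincide.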
From your probability theory course, you can already make probability statements involving \(Y_1,\ldots,Y_n\), provided that \(\mu\) and \(\sigma^2\) are known. Furthermore, rewriting the joint density is not necessary for making such statements. But in our course, we are interested in estimating \(\mu\) (and sometimes \(\sigma^2\), or \(\sigma\)) using the data.
Setting up a likelihood function
One approach to estimating \(\mu\) and \(\sigma^2\) using the observed data \(\left(y_1,\ldots, y_n\right)\) is to find estimates for \(\mu\) and \(\sigma^2\) so that the probability of observing the data \(\left(y_1,\ldots, y_n\right)\) is maximized. So what is the probability of observing the data? In the discrete case, this is easy to answer as the joint density contains this information. In the continuous case, the probability of observing the data is not just the joint density. We have to look at a small neighborhood around the observed data \(y_1,\ldots, y_n\), i.e., \[\begin{eqnarray}&&\mathbb{P}\left(y_1-\varepsilon_1\leq Y_1 \leq y_1+\varepsilon_1, y_2-\varepsilon_2\leq Y_2 \leq y_2+\varepsilon_2, \ldots, y_n-\varepsilon_n\leq Y_n \leq y_n+\varepsilon_n\right) \\ &=& \mathbb{P}\left(y_1-\varepsilon_1\leq Y_1 \leq y_1+\varepsilon_1\right)\cdot \ldots \cdot \mathbb{P}\left(y_n-\varepsilon_n\leq Y_n \leq y_n+\varepsilon_n\right)\\ &=& \left[\int_{y_1-\varepsilon_1}^{y_1+\varepsilon_1} f_{Y_1}\left(s\right) ds \right] \cdot \left[\int_{y_2-\varepsilon_2}^{y_2+\varepsilon_2} f_{Y_2}\left(s\right) ds \right] \cdot \ldots \cdot \left[\int_{y_n-\varepsilon_n}^{y_n+\varepsilon_n} f_{Y_n}\left(s\right) ds \right] \\ &\approx & f_{Y_1}\left(y_1\right)f_{Y_2}\left(y_2\right)\cdot \ldots \cdot f_{Y_n}\left(y_n\right)\left(2\varepsilon_1\right)\left(2\varepsilon_2\right)\cdot \ldots \cdot \left(2\varepsilon_n\right) \\
&=& \left(\frac{1}{\sqrt{2\pi\sigma^2}}\right)^n\exp\left(-\frac{(n-1)s^2}{2\sigma^2}\right)\exp\left(-\frac{1}{2\sigma^2}n\left(\overline{y}-\mu\right)^2\right)\left(2\varepsilon_1\right)\left(2\varepsilon_2\right)\cdot \ldots \cdot \left(2\varepsilon_n\right)\end{eqnarray}\]
Provided that the \(\varepsilon_i\)’s (think of these as the neighborhood dimensions) are not affected by \(\mu\) and \(\sigma^2\), we can maximize the probability of observing the data \(y_1,\ldots, y_n\) by looking solely at the quantity \[\left(\frac{1}{\sqrt{2\pi\sigma^2}}\right)^n\exp\left(-\frac{(n-1)s^2}{2\sigma^2}\right)\exp\left(-\frac{1}{2\sigma^2}n\left(\overline{y}-\mu\right)^2\right)\] as a function of \(\mu\) and \(\sigma^2\) rather than just as a joint density of IID normal random variables. We will call it a likelihood function, denoted as \(L\left(\mu,\sigma^2\right)\). We then maximize this likelihood function where the choice variables are \(\mu\) and \(\sigma^2\). This approach is called the method of maximum likelihood.
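To make this concrete in R, here is a small sketch with simulated values of my own (the names y.check, lik.direct, and lik.from.summaries are not from these notes): the likelihood can be computed either as a product of normal densities or purely from \(\overline{y}\) and \(s^2\), and the two agree.
# The likelihood evaluated directly versus through ybar and s^2 alone
set.seed(20)
y.check <- rnorm(6, mean = 1, sd = 2)
lik.direct <- function(mu, sigma.sq) {
  prod(dnorm(y.check, mean = mu, sd = sqrt(sigma.sq)))
}
lik.from.summaries <- function(mu, sigma.sq) {
  n <- length(y.check); ybar <- mean(y.check); s.sq <- var(y.check)
  (2 * pi * sigma.sq)^(-n / 2) *
    exp(-(n - 1) * s.sq / (2 * sigma.sq)) *
    exp(-n * (ybar - mu)^2 / (2 * sigma.sq))
}
c(lik.direct(0, 3), lik.from.summaries(0, 3))  # the two evaluations should match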
Maximum likelihood: The algorithm
I am now going to illustrate the method of maximum likelihood. Maximizing the likelihood function directly can be cumbersome because it is a product of exponentials, so we take its logarithm to form the log-likelihood: \[\ln\ L\left(\mu,\sigma^2\right)=-\frac{n}{2}\ln 2\pi-\frac{n}{2}\ln\sigma^2-\frac{(n-1)s^2}{2\sigma^2}-\frac{1}{2\sigma^2}n\left(\overline{y}-\mu\right)^2.\] We have the following first derivatives: \[\begin{eqnarray} \frac{\partial \ln\ L\left(\mu,\sigma^2\right)}{\partial \mu} &=& \frac{1}{\sigma^2}n\left(\overline{y}-\mu\right)\\ \frac{\partial \ln\ L\left(\mu,\sigma^2\right)}{\partial \sigma^2} &=& -\frac{n}{2\sigma^2}+\frac{1}{2\sigma^4}(n-1)s^2+\frac{1}{2\sigma^4}n\left(\overline{y}-\mu\right)^2\end{eqnarray}\] Now, set these derivatives to zero and solve the resulting system of equations for \(\widehat{\mu}\) and \(\widehat{\sigma^2}\): \[\begin{eqnarray} \frac{1}{\widehat{\sigma^2}}n\left(\overline{y}-\widehat{\mu}\right) &=& 0 \\ -\frac{n}{2\widehat{\sigma^2}}+\frac{1}{2\left(\widehat{\sigma^2}\right)^2}(n-1)s^2 +\frac{1}{2\left(\widehat{\sigma^2}\right)^2}n\left(\overline{y}-\widehat{\mu}\right)^2&=& 0 \end{eqnarray}\]
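The first equation forces \(\widehat{\mu}=\overline{y}\). Substituting this into the second equation eliminates its last term, leaving \[-\frac{n}{2\widehat{\sigma^2}}+\frac{(n-1)s^2}{2\left(\widehat{\sigma^2}\right)^2}=0 \quad\Longleftrightarrow\quad \widehat{\sigma^2}=\frac{n-1}{n}s^2.\]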
Therefore, the estimates based on maximum likelihood are given by \[\widehat{\mu}=\overline{y}\ \ \ , \ \ \ \widehat{\sigma^2}= \frac{n-1}{n}s^2=\frac{1}{n}\sum_{i=1}^n \left(y_i-\overline{y}\right)^2\] These estimates should be familiar: they are the observed sample mean and the observed sample variance with \(n\) rather than \(n-1\) in the denominator.
We can also compute the complete set of second derivatives: \[\begin{eqnarray} \frac{\partial^2 \ln\ L\left(\mu,\sigma^2\right)}{\partial \mu^2} &=& -\frac{n}{\sigma^2}\\ \frac{\partial^2 \ln\ L\left(\mu,\sigma^2\right)}{\partial \mu\partial\sigma^2} &=& -\frac{1}{\sigma^4}n\left(\overline{y}-\mu\right)\\ \frac{\partial^2 \ln\ L\left(\mu,\sigma^2\right)}{\partial \left(\sigma^2\right)^2} &=& \frac{n}{2\sigma^4}-\frac{1}{\sigma^6}(n-1)s^2-\frac{1}{\sigma^6}n\left(\overline{y}-\mu\right)^2\end{eqnarray}\]
To check whether you have maximized the log-likelihood, you have to check whether the matrix of second derivatives \[\begin{bmatrix}\frac{\partial^2 \ln\ L\left(\mu,\sigma^2\right)}{\partial \mu^2} & \frac{\partial^2 \ln\ L\left(\mu,\sigma^2\right)}{\partial \mu\partial\sigma^2} \\ \frac{\partial^2 \ln\ L\left(\mu,\sigma^2\right)}{\partial \mu\partial\sigma^2} & \frac{\partial^2 \ln\ L\left(\mu,\sigma^2\right)}{\partial \left(\sigma^2\right)^2} \end{bmatrix}\] evaluated at the maximum likelihood estimates is negative definite.
In the one-dimensional case, you just have to check whether the second derivative is negative. In the case of more than one dimension, you either check whether all the eigenvalues are negative or use the leading principal minors approach. As a starting point, I refer you to page 2 of this note about checking negative definiteness of \(2\times 2\) symmetric matrices. For the more general case, you have to check whether the leading principal minors (determinants of the upper-left square submatrices) alternate in sign, with the odd-order ones negative and the even-order ones positive.
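For our running example, this check can be done in closed form. Plugging \(\widehat{\mu}=\overline{y}\) and \(\widehat{\sigma^2}=\frac{n-1}{n}s^2\) (so that \((n-1)s^2=n\widehat{\sigma^2}\)) into the second derivatives above gives \[\begin{bmatrix} -\dfrac{n}{\widehat{\sigma^2}} & 0 \\ 0 & -\dfrac{n}{2\left(\widehat{\sigma^2}\right)^2} \end{bmatrix}\] which is negative definite because both diagonal entries, which are also its eigenvalues, are negative.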
Let us also illustrate the algorithm using R. There is no choice but to create an R function specific to the log-likelihood that you intend to maximize. Typically, optimization routines in software are minimization routines, so we create a negative log-likelihood instead.
# Generate simulated data from N(1, 4)
n <- 5
mu <- 1
sigma.sq <- 4
y <- rnorm(n, mu, sqrt(sigma.sq))
# Set up MINUS the log-likelihood
mlnl <- function(par) {
  sum(-dnorm(y, mean = par[1], sd = sqrt(par[2]), log = TRUE))
}
# nlm() is one of the numerical optimization routines available in R
# Print out things completely
output <- nlm(mlnl, c(1, 4), hessian = TRUE, print.level = 2)
iteration = 0
Step:
[1] 0 0
Parameter:
[1] 1 4
Function Value
[1] 10.20053
Gradient:
[1] -1.69631110 0.08997376
iteration = 1
Step:
[1] 1.69631110 -0.08997376
Parameter:
[1] 2.696311 3.910026
Function Value
[1] 9.089022
Gradient:
[1] 0.4338373 0.3617702
iteration = 2
Step:
[1] -0.3004950 -0.2328004
Parameter:
[1] 2.395816 3.677226
Function Value
[1] 8.927049
Gradient:
[1] 0.05271355 0.38698750
iteration = 3
Step:
[1] -0.3370561 -0.8933966
Parameter:
[1] 2.058760 2.783829
Function Value
[1] 8.655397
Gradient:
[1] -0.5357518 0.3588093
iteration = 4
Step:
[1] -0.04318385 -0.51074264
Parameter:
[1] 2.015576 2.273087
Function Value
[1] 8.516353
Gradient:
[1] -0.7511201 0.2776783
iteration = 5
Step:
[1] 0.02666247 -0.20560357
Parameter:
[1] 2.042239 2.067483
Function Value
[1] 8.444024
Gradient:
[1] -0.7613357 0.2256362
iteration = 6
Step:
[1] 0.09089516 -0.27169687
Parameter:
[1] 2.133134 1.795786
Function Value
[1] 8.331294
Gradient:
[1] -0.6234442 0.1264109
iteration = 7
Step:
[1] 0.3101654 -0.4862101
Parameter:
[1] 2.443299 1.309576
Function Value
[1] 8.304333
Gradient:
[1] 0.3293090 -0.4088195
iteration = 8
Step:
[1] -0.08299555 0.41447228
Parameter:
[1] 2.360304 1.724048
Function Value
[1] 8.251265
Gradient:
[1] 0.009441289 0.118973571
iteration = 9
Step:
[1] -0.01430545 -0.10061427
Parameter:
[1] 2.345998 1.623434
Function Value
[1] 8.243337
Gradient:
[1] -0.03403284 0.03863188
iteration = 10
Step:
[1] 0.01320633 -0.04946619
Parameter:
[1] 2.359205 1.573968
Function Value
[1] 8.242388
Gradient:
[1] 0.006849961 -0.008701111
iteration = 11
Step:
[1] -0.002191843 0.009111298
Parameter:
[1] 2.357013 1.583079
Function Value
[1] 8.242344
Gradient:
[1] -0.0001121887 0.0004923618
iteration = 12
Step:
[1] 2.466291e-05 -4.960310e-04
Parameter:
[1] 2.357037 1.582583
Function Value
[1] 8.242344
Gradient:
[1] -3.430417e-05 -2.454779e-06
iteration = 13
Step:
[1] 1.230243e-05 2.687410e-06
Parameter:
[1] 2.357050 1.582586
Function Value
[1] 8.242344
Gradient:
[1] 4.564018e-06 2.278552e-07
iteration = 14
Parameter:
[1] 2.357049 1.582586
Function Value
[1] 8.242344
Gradient:
[1] -1.130454e-10 4.826488e-10
Relative gradient close to zero.
Current iterate is probably solution.
# Comparing with what we think would be natural estimators of mu and sigma.sq
c(mean(y), var(y), (n-1)/n*var(y))
[1] 2.357049 1.978233 1.582586
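Since nlm() minimized MINUS the log-likelihood, the Hessian it returns should be positive definite at the solution, which is the flip side of the negative definiteness check discussed earlier. A quick way to inspect this from the output object (the eigen() check below is my own addition to the chunk above):
# The returned Hessian of MINUS the log-likelihood should have positive eigenvalues.
output$estimate
eigen(output$hessian)$values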
Suggestive points to note and to look forward to
The maximum likelihood estimates exist and are available in closed form, meaning that we have an explicit formula expressing the unknowns in terms of the knowns.
The maximum likelihood estimates locally maximize \(L\left(\mu,\sigma^2\right)\). Is it a global maximum?
We did not really address why maximizing the likelihood function or the log-likelihood is a good idea. Why would it work at a basic level? One clue is given by Markov’s inequality. Let \(\left(\mu_0,\sigma^2_0\right)\) be the “true” values of \(\left(\mu,\sigma^2\right)\); you can think of \(\left(\mu_0,\sigma^2_0\right)\) as the estimand. By Markov’s inequality, for any \(\left(\mu,\sigma^2\right) \neq \left(\mu_0,\sigma^2_0\right)\) and any \(c>0\), we have \[\mathbb{P}\left(\frac{L\left(\mu,\sigma^2\right)}{L\left(\mu_0,\sigma^2_0\right)}\geq c\right) \leq \frac{1}{c}\times \mathbb{E}\left(\frac{L\left(\mu,\sigma^2\right)}{L\left(\mu_0,\sigma^2_0\right)}\right)=\frac{1}{c},\] where the last equality follows from noticing that \[ \mathbb{E}\left(\frac{L\left(\mu,\sigma^2\right)}{L\left(\mu_0,\sigma^2_0\right)}\right)= \int_{\mathbb{R}} \cdots \int_{\mathbb{R}} \frac{L\left(\mu,\sigma^2\right)}{L\left(\mu_0,\sigma^2_0\right)}\, L\left(\mu_0,\sigma^2_0\right)\, dy_1\, dy_2 \cdots dy_n=1.\] Therefore, the probability that the likelihood at values different from the “ground truth” exceeds \(c\) times the likelihood at the “ground truth” is at most \(1/c\).
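To get a feel for this bound, here is a small simulation of my own; the truth \(N(1,4)\), the alternative \(\left(\mu,\sigma^2\right)=(2,4)\), and \(c=2\) are arbitrary choices. The observed frequency of the likelihood ratio exceeding \(c\) should stay below \(1/c\).
# Simulate the likelihood ratio L(mu, sigma.sq) / L(mu0, sigma0.sq) under the truth N(1, 4)
set.seed(20)
ratio <- replicate(10000, {
  y.sim <- rnorm(5, mean = 1, sd = 2)
  prod(dnorm(y.sim, mean = 2, sd = 2)) / prod(dnorm(y.sim, mean = 1, sd = 2))
})
c(mean(ratio >= 2), 1/2)  # observed exceedance frequency versus the Markov bound 1/c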
We can think of maximum likelihood estimates as realizations of random variables. So we can think of \(\widehat{\mu}\) and \(\widehat{\sigma^2}\) as random variables, i.e., as maximum likelihood estimators: \[\widehat{\mu}=\overline{Y}\ \ \ , \ \ \ \widehat{\sigma^2}= \frac{n-1}{n}S^2=\frac{1}{n}\sum_{i=1}^n \left(Y_i-\overline{Y}\right)^2\] It so happens that \(\widehat{\mu}\) is unbiased for \(\mu\), but \(\widehat{\sigma^2}\) is biased for \(\sigma^2\). For the latter, refer to LM Example 5.4.4. This example already tells you that maximum likelihood estimators need not be unbiased for the parameters they estimate.
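A small simulation of my own (reusing \(n=5\), \(\mu=1\), \(\sigma^2=4\) from the earlier chunk) makes the bias visible: the average of \(\widehat{\sigma^2}\) over many samples should sit near \(\frac{n-1}{n}\sigma^2=3.2\) rather than \(4\).
# Average the MLE of sigma^2 over many simulated samples of size n = 5 from N(1, 4)
set.seed(20)
sig2.hat <- replicate(10000, {
  y.sim <- rnorm(5, mean = 1, sd = 2)
  (5 - 1) / 5 * var(y.sim)
})
c(mean(sig2.hat), (5 - 1) / 5 * 4)  # simulated mean of the MLE versus (n-1)/n * sigma^2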
Note, however, that \(\widehat{\mu}\) is consistent for \(\mu\). It can also be shown that \(\widehat{\sigma}^2\) is consistent for \(\sigma^2\). The proof for the latter may proceed using Chebyshev’s inequality, but you need \(\mathsf{Var}\left(\widehat{\sigma}^2\right)\). There is a lot of algebra involved in this calculation, but it can be done. In fact, LM Exercise 5.7.2 looks at an extremely special case which avoids these algebraic calculations. Although LM Exercise 5.7.2 focuses on a special case, it pretty much gives you an idea of the problems you will encounter in more general cases (no normality and \(\mu\neq0\)). Do the simpler exercise first and pay attention to what happens in more general cases.
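For a rough sense of consistency without any algebra, here is another illustration of my own: the sampling variability of \(\widehat{\sigma^2}\), measured by its simulated standard deviation, shrinks as \(n\) grows.
# Standard deviation of the MLE of sigma^2 across simulated samples, for growing n
set.seed(20)
sapply(c(10, 100, 1000), function(n.sim) {
  sd(replicate(2000, {
    y.sim <- rnorm(n.sim, mean = 1, sd = 2)
    (n.sim - 1) / n.sim * var(y.sim)
  }))
})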
What about standard errors? Pay attention to one of the second derivatives of the log-likelihood function. Do you notice that there is \(n/\sigma^2\)? Some clue is found in LM Section 5.5. Perhaps, we can also make probability statements?
Interestingly, the rewritten form of the joint density depends on the data only through \(\overline{y}\) and \(s^2\). Therefore, realizations of IID normal random variables with the same \(\overline{y}\) and \(s^2\) will all have the same likelihood function. Furthermore, knowing \(\overline{y}\) and \(s^2\) alone lets you compute the maximum likelihood estimates without having access to all the observed values in the data.
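A quick way to see this in R (my own illustration; reflecting each observation about the sample mean is just one convenient way to build a second data set with the same \(\overline{y}\) and \(s^2\)):
# Two different data sets with the same ybar and s^2 give the same likelihood value
set.seed(20)
y1 <- rnorm(6, mean = 1, sd = 2)
y2 <- 2 * mean(y1) - y1                     # reflect about the sample mean
c(mean(y1), mean(y2), var(y1), var(y2))     # same ybar and s^2
c(prod(dnorm(y1, mean = 0, sd = sqrt(3))),  # likelihood at an arbitrary (mu, sigma.sq)
  prod(dnorm(y2, mean = 0, sd = sqrt(3))))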
We could also have converted the problem of maximizing \(L\left(\mu, \sigma^2\right)\) over \(\mu\) and \(\sigma^2\) jointly into a sequential maximization problem. This approach can help both computationally and visually. What this means is that we could fix \(\sigma^2\) first and maximize \(L\left(\mu, \sigma^2\right)\) with respect to \(\mu\) alone. Let us call this maximizer \(\widehat{\mu}\left(\sigma^2\right)\). The second step is to maximize \(L\left(\widehat{\mu}\left(\sigma^2\right), \sigma^2\right)\) with respect to \(\sigma^2\) alone. You can show that \(\widehat{\mu}\left(\sigma^2\right)=\overline{y}\) and \[\ln L\left(\widehat{\mu}\left(\sigma^2\right), \sigma^2\right)=-\frac{n}{2}\ln 2\pi-\frac{n}{2}\ln\sigma^2-\frac{(n-1)s^2}{2\sigma^2}.\] Maximizing the latter with respect to \(\sigma^2\) gives \(\widehat{\sigma}^2=\dfrac{n-1}{n}s^2\). Try working out the other order: fix \(\mu\) first and maximize with respect to \(\sigma^2\) to get \(\widehat{\sigma}^2\left(\mu\right)\), then maximize over \(\mu\).
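Here is one way to check the second step numerically (a sketch that reuses y and n from the simulation chunk above; the function name profile.ll and the grid bounds are my own choices): evaluate the profiled log-likelihood over a grid of \(\sigma^2\) values and compare the grid maximizer with \(\frac{n-1}{n}s^2\). They should agree up to the grid resolution.
# Profiled log-likelihood in sigma^2 after plugging in muhat(sigma^2) = ybar
profile.ll <- function(sig.sq) {
  -n / 2 * log(2 * pi) - n / 2 * log(sig.sq) - (n - 1) * var(y) / (2 * sig.sq)
}
sig.sq.grid <- seq(0.01, 10, by = 0.001)
c(sig.sq.grid[which.max(profile.ll(sig.sq.grid))], (n - 1) / n * var(y))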
What happens when you maximize with respect to \(\left(\mu,\sigma\right)\) instead?
What happens if the data did not come from \(N\left(\mu,\sigma^2\right)\) and you still used the likelihood function \(L\left(\mu,\sigma^2\right)\)?