This case study shows that computations in R are not always straightforward. A closed-form solution for the MLE is available, but nlm() does not give the right answer. The problem can be traced to how R defines the geometric distribution, not to any malfunction in nlm().
# Create the dataset based on Table 5.2.1
# This is the actual dataset up to a reshuffling
x <- c(rep(1, 72), rep(2, 35), rep(3, 11), rep(4, 6), rep(5, 2), rep(6, 2))
# Closed form solution
1/mean(x)
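To see why nlm() trips up here, recall that dgeom() in R treats a geometric random variable as the number of failures before the first success (support \(0,1,2,\ldots\)), while the closed form \(1/\bar{x}\) comes from counting the number of trials up to and including the first success (support \(1,2,3,\ldots\)). Below is a minimal sketch of the comparison (my own illustration; mlnl.raw and mlnl.shift are names I made up): feeding the data to dgeom() as-is pushes nlm() toward \(1/(1+\bar{x})\), while shifting the data by one recovers the closed form.

# MINUS log-likelihood using R's failure-count parameterization on the raw data
mlnl.raw <- function(par) -sum(dgeom(x, par, log = TRUE))
# MINUS log-likelihood after converting trial counts to failure counts (x - 1)
mlnl.shift <- function(par) -sum(dgeom(x - 1, par, log = TRUE))
nlm(mlnl.raw, 0.5)$estimate    # drifts to 1/(1 + mean(x)) instead of the closed form
nlm(mlnl.shift, 0.5)$estimate  # agrees with the closed form 1/mean(x)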
There is probably an R package with functions for the zero-truncated Poisson distribution, but I avoid relying on one here. Instead, I coded the log-likelihood directly. The result matches what is reported in LM.
# Report estimate and standard error
c(output$estimate, sqrt(1/output$hessian))
[1] 1.3202176 0.1983568
It is also a good idea to give the R function that codes the negative log-likelihood an explicit argument for the data. Doing so is useful in Monte Carlo simulations. For example,
# Table 5.2.2 dataset
x <- c(rep(1, 27), rep(2, 10), rep(3, 8), rep(4, 1), rep(5, 2), rep(6, 1))
mlnl <- function(par, x){
  -sum(dpois(x, par, log = TRUE) - log(1 - exp(-par)))
}
# nlm() only cares about minimizing over par, but if the log-likelihood needs data,
# the extra argument has to be passed through nlm().
# mlnl() has an argument x which points to the data;
# nlm() should only optimize over the argument par.
# x = x means that whatever x happens to be in memory gets passed to the x argument of mlnl().
output <- nlm(mlnl, 1, hessian = TRUE, x = x)
output
# Report estimate and standard error
c(output$estimate, sqrt(1/output$hessian))
[1] 1.3984803 0.1992687
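To see how the extra data argument pays off in Monte Carlo work, here is a small sketch: the same mlnl() is reused on every simulated dataset simply by changing what is passed to x. The helper rztpois() below is my own quick rejection sampler (written just for this sketch), and the sample size 49 and \(\lambda = 1.4\) roughly mirror the fit above.

# Helper written only for this sketch: rejection sampler for the zero-truncated Poisson
rztpois <- function(n, lambda){
  out <- numeric(0)
  while (length(out) < n){
    draw <- rpois(n, lambda)
    out <- c(out, draw[draw > 0])
  }
  out[1:n]
}
# Reuse the same mlnl() on each simulated dataset via the x argument of nlm()
sim.mles <- replicate(500, nlm(mlnl, 1, x = rztpois(49, 1.4))$estimate)
c(mean(sim.mles), sd(sim.mles))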
Estimating a correlation coefficient
Consider IID observations \((X_i,Y_i)\), \(i=1,\ldots,n\) from a standard bivariate normal distribution, i.e. \(\left(X,Y\right)\sim\mathsf{SBVN}\left(\rho\right)\). The estimand is the correlation coefficient \(\rho\). Recall that by definition \[\mathsf{Corr}\left(X,Y\right)=\dfrac{\mathsf{Cov}\left(X,Y\right)}{\sqrt{\mathsf{Var}\left(X\right)\mathsf{Var}\left(Y\right)}}\] In this case, \(\mathsf{Corr}\left(X,Y\right)=\rho\).
# Load mvtnorm package to simulate from bivariate normal
# To install this package run install.packages("mvtnorm")
library(mvtnorm)
# IID draws from SBVN
n <- 10
rho <- 0.5
mean.vec <- c(0, 0)
cov.mat <- matrix(c(1, rho, rho, 1), ncol = 2)
draws <- rmvnorm(n, mean = mean.vec, sigma = cov.mat)
plot(draws)
# Set up MINUS the log-likelihood
mlnl <- function(par){
  sum(-dmvnorm(draws, mean = mean.vec, sigma = matrix(c(1, par, par, 1), ncol = 2), log = TRUE))
}
# nlm() is one of the numerical optimization routines available in R
# Print out things completely
output <- nlm(mlnl, 0.5, hessian = TRUE)
Warning in nlm(mlnl, 0.5, hessian = TRUE): NA/Inf replaced by maximum positive value
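The warning arises because nlm() searches over the whole real line and can try values of \(\rho\) outside \((-1,1)\), where the covariance matrix is no longer positive definite and the log-density is undefined. One common workaround, sketched below purely as an illustration, is to reparameterize with \(\rho=\tanh(\theta)\) so that the search over \(\theta\) is unconstrained while the implied correlation stays inside \((-1,1)\).

# Sketch: optimize over an unconstrained theta with rho = tanh(theta)
mlnl.theta <- function(theta){
  rho.try <- tanh(theta)
  sum(-dmvnorm(draws, mean = mean.vec,
               sigma = matrix(c(1, rho.try, rho.try, 1), ncol = 2), log = TRUE))
}
output.theta <- nlm(mlnl.theta, atanh(0.5), hessian = TRUE)
# Transform the optimum back to the rho scale
tanh(output.theta$estimate)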
If a random variable has a Cauchy distribution, then its first moment does not exist. The pdf of a Cauchy random variable has the form \[f\left(x;\mu,\sigma\right)=\frac{1}{\pi\sigma \left[1+\left(\dfrac{x-\mu}{\sigma}\right)^2\right]}.\] The two parameters in the pdf are the location parameter \(\mu\) and the scale parameter \(\sigma\). Although these symbols are usually reserved for the population mean and the population standard deviation, neither of those quantities exists for this distribution; the symbols are the same, but they mean something else in this context.
Notice from the simulation below that the sampling distribution of the sample mean of IID Cauchy distributed random variables does not have a well-defined center and is affected by extreme values at the tails. Although you can calculate a sample mean, it does not mean that you learn something about the population mean of a Cauchy. Why? Because the population mean does not even exist!
n <- 20
loc <- 0
scale <- 1
# Simulate the sampling distribution of the sample mean
sample.means <- replicate(10^4, mean(rcauchy(n, location = loc, scale = scale)))
hist(sample.means, freq = FALSE)
c(mean(sample.means), var(sample.means))
[1] -3.889575 261854.693336
# Repeat and see what happens
sample.means <- replicate(10^4, mean(rcauchy(n, location = loc, scale = scale)))
hist(sample.means, freq = FALSE)
c(mean(sample.means), var(sample.means))
[1] 4.380531 48108.662001
Suppose you are interested in estimating the location parameter \(\mu\) of a Cauchy distributed random variable and you have IID draws from that distribution. Assume that the scale parameter \(\sigma\) is known. You should set up the log-likelihood by hand and obtain the first-order condition. You will find that you cannot obtain a closed form for the MLE. In addition, the first-order condition is a polynomial in \(\mu\) whose degree grows with the sample size, so multiple roots are a real possibility. If you want to learn more about this, refer to Section 2.2 of Small, Wang, and Yang (2000).
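To get a feel for the multiple-root issue, here is a small numerical sketch (the sample size, seed, and grid are arbitrary choices made only for illustration): plot the Cauchy log-likelihood in \(\mu\) for a handful of draws and run nlm() from different starting values.

# A sketch: the Cauchy log-likelihood in mu for a small simulated sample, scale fixed at 1
set.seed(42)   # arbitrary seed, only for reproducibility
z <- rcauchy(5, location = 0, scale = 1)
mlnl.cauchy <- function(par) -sum(dcauchy(z, location = par, scale = 1, log = TRUE))
# Plot the log-likelihood over a grid of mu values; it may have several local maxima
curve(sapply(x, function(m) -mlnl.cauchy(m)), from = -10, to = 10,
      xlab = "mu", ylab = "log-likelihood")
# nlm() started from different values may stop at different local optima
c(nlm(mlnl.cauchy, 0)$estimate, nlm(mlnl.cauchy, 8)$estimate)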
What you are going to see is a very old version of what is called a linear regression model. Modern versions of the linear regression model as applied in economics do not require strong parametric assumptions. We will encounter this more modern version when we discuss the method of moments.
Consider the following setting where \(Y_1,\ldots,Y_n\) are independently distributed with \(Y_i \sim N\left(\alpha+\beta x_i,\sigma^2\right)\). The \(x_i\)’s are treated as fixed in repeated samples. This may be appropriate for experimental settings or as a bridge towards conditioning on \(x_i\)’s. The estimand is \(\left(\alpha,\beta,\sigma^2\right)\). This is very similar to the IID \(N\left(\mu,\sigma^2\right)\) example, but the population mean now is different across observations.
Given the parametric statistical model, the joint pdf of \(Y_1,\ldots,Y_n\) is given by \[
\begin{aligned}
f\left(y_1,\ldots,y_n\right) &= f_{Y_1}\left(y_1;\alpha,\beta,\sigma^2\right)f_{Y_2}\left(y_2;\alpha,\beta,\sigma^2\right)\cdots f_{Y_n}\left(y_n;\alpha,\beta,\sigma^2\right) \\ & =\prod_{i=1}^n f_{Y_i}\left(y_i;\alpha,\beta,\sigma^2\right) \\
&= \prod_{i=1}^n \frac{1}{\sqrt{2\pi\sigma^2}}\exp\left(-\frac{\left(y_i-\alpha-\beta x_i\right)^2}{2\sigma^2}\right) \\ & = \left(\frac{1}{\sqrt{2\pi\sigma^2}}\right)^n \exp\left(-\frac{\sum\left(y_i-\alpha-\beta x_i\right)^2}{2\sigma^2}\right)
\end{aligned}
\]
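Taking logarithms gives the log-likelihood \[\ell\left(\alpha,\beta,\sigma^2\right)=-\frac{n}{2}\log\left(2\pi\sigma^2\right)-\frac{1}{2\sigma^2}\sum_{i=1}^n\left(y_i-\alpha-\beta x_i\right)^2,\] and its negative is what gets coded in R below.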
You can now code the negative log-likelihood in R. We will be using a dataset from Section 17.4 of Dekking et al (2005). The problem is to predict the hardness of wood from its density. The data come from measurements of the hardness (\(y\)) and density (\(x\)) of 36 Australian eucalypt hardwoods. The model used there was a classical linear regression model but without the normality assumption. I impose normality here to illustrate the use of maximum likelihood methods in the multi-dimensional case and the computational issues you could encounter.
# Loading a text file instead of a CSV, open the text file for yourself to see what it looks like
jhdat <- read.table("jankahardness.txt", header = FALSE)
names(jhdat) <- c("x", "y")
n <- nrow(jhdat)
y <- jhdat$y
x <- jhdat$x
# Scatterplot
plot(x, y)
# MINUS log-likelihood under normality
mlnl <- function(par){
  sum(-dnorm(y, mean = (par[1] + par[2]*x), sd = sqrt(par[3]), log = TRUE))
}
output <- nlm(mlnl, c(0, 1, 1), hessian = TRUE)
# WARNING!!! Code is 5! Something may be wrong.
output
# You will learn about this in the next semester
# But lm() is used for estimating linear regressions
summary(lm(y ~ x))
Call:
lm(formula = y ~ x)
Residuals:
Min 1Q Median 3Q Max
-3.3840 -0.9698 -0.1571 0.9271 6.2506
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -11.60500 1.08580 -10.69 2.07e-12 ***
x 0.57507 0.02279 25.24 < 2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 1.831 on 34 degrees of freedom
Multiple R-squared: 0.9493, Adjusted R-squared: 0.9478
F-statistic: 637 on 1 and 34 DF, p-value: < 2.2e-16
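The code of 5 returned by nlm() above means it repeatedly hit its maximum allowed step size, which here is a symptom of poor scaling and crude starting values rather than a genuine optimum. One possible remedy, sketched below purely as an illustration, is to optimize over \(\log\sigma^2\) so the variance parameter is unconstrained and to start the search near the lm() estimates reported above.

# Sketch: log-variance parameterization plus starting values taken from lm()
ols <- lm(y ~ x)
mlnl2 <- function(par){
  -sum(dnorm(y, mean = par[1] + par[2]*x, sd = sqrt(exp(par[3])), log = TRUE))
}
start <- c(coef(ols), log(summary(ols)$sigma^2))
output2 <- nlm(mlnl2, start, hessian = TRUE)
# Transform the third component back to the variance scale
c(output2$estimate[1:2], exp(output2$estimate[3]))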
Logistic regression
What you are going to see is a typical tool in machine learning used for classification. This tool is called logistic regression.
Consider the following simplified setting. Let \(Y_1,\ldots,Y_n\) be independently distributed with \(Y_i \sim \mathsf{Ber}\left(p_i\right)\), where \[p_i=\frac{1}{1+\exp\left(-\left(\beta_0+\beta_1 x_i\right)\right)}.\] Assume that the \(x_i\)’s are fixed in repeated samples. You should try setting up the log-likelihood for yourself (by hand) in this case.
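If you do, you should arrive at \[\ell\left(\beta_0,\beta_1\right)=\sum_{i=1}^n\left[y_i\log p_i+\left(1-y_i\right)\log\left(1-p_i\right)\right],\] where each \(p_i\) depends on \(\beta_0\) and \(\beta_1\) through the expression above. This is exactly what the dbinom() call in the code below computes term by term.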
I will be using four artificial datasets based on Zorn (2005) to illustrate some issues with applying maximum likelihood to estimate \(\beta_0\) and \(\beta_1\). Gelman, in one of his blog posts in 2011 (!!), considers an even simpler setup where \(\beta_1=0\).
# Setup MINUS log-likelihood
mlnl <- function(par, x, y){
  -sum(dbinom(y, 1, 1/(1 + exp(-(par[1] + par[2]*x))), log = TRUE))
}
# Number of observations
n <- 100
# x drawn from standard normal, treated as fixed across simulations
x <- rnorm(n)
# Create a function which depends on alpha as defined in Zorn (2005):
# it generates data according to Zorn (2005),
# collects results from glm() -> a built-in function for logistic regression,
# and uses nlm() to compute results with your own log-likelihood
estimates <- function(alpha){
  # Generate data
  ystar <- x + alpha*rnorm(n)
  y <- (ystar >= 0)
  # Apply built-in logistic regression routine
  temp <- glm(y ~ x, family = binomial(link = "logit"))
  # Manually search for optima using nlm()
  output <- nlm(mlnl, c(0, 0), x = x, y = y, hessian = TRUE)
  # Produce a scatterplot
  plot(x, y)
  # Superimpose the estimated p
  curve(1/(1 + exp(-(temp$coefficients[[1]] + temp$coefficients[[2]]*x))), lty = "dotted", add = TRUE)
  # Make a list of what to report:
  # what the built-in command produces, what nlm() produces
  list(summary(temp), output, output$estimate, sqrt(diag(solve(output$hessian))))
}
# Try different values of alpha
# As alpha goes to zero, what is happening?
estimates(1)
[[1]]
Call:
glm(formula = y ~ x, family = binomial(link = "logit"))
Deviance Residuals:
Min 1Q Median 3Q Max
-1.9325 -0.7972 -0.2736 0.8168 2.2131
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -0.2645 0.2512 -1.053 0.292
x 1.6377 0.3422 4.786 1.7e-06 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 138.469 on 99 degrees of freedom
Residual deviance: 98.988 on 98 degrees of freedom
AIC: 102.99
Number of Fisher Scoring iterations: 5
[[2]]
[[2]]$minimum
[1] 49.494
[[2]]$estimate
[1] -0.2645412 1.6377060
[[2]]$gradient
[1] 1.278977e-06 -5.171666e-06
[[2]]$hessian
[,1] [,2]
[1,] 16.397259 2.193747
[2,] 2.193747 8.832247
[[2]]$code
[1] 1
[[2]]$iterations
[1] 10
[[3]]
[1] -0.2645412 1.6377060
[[4]]
[1] 0.2511613 0.3422180
estimates(0.5)
[[1]]
Call:
glm(formula = y ~ x, family = binomial(link = "logit"))
Deviance Residuals:
Min 1Q Median 3Q Max
-2.4341 -0.4379 0.0263 0.4096 2.1451
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -0.2281 0.3291 -0.693 0.488
x 3.6642 0.7651 4.789 1.68e-06 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 138.589 on 99 degrees of freedom
Residual deviance: 60.054 on 98 degrees of freedom
AIC: 64.054
Number of Fisher Scoring iterations: 6
[[2]]
[[2]]$minimum
[1] 30.02702
[[2]]$estimate
[1] -0.2281068 3.6642138
[[2]]$gradient
[1] 4.561684e-06 -2.736128e-06
[[2]]$hessian
[,1] [,2]
[1,] 9.4588689 0.6318131
[2,] 0.6318131 1.7498058
[[2]]$code
[1] 1
[[2]]$iterations
[1] 11
[[3]]
[1] -0.2281068 3.6642138
[[4]]
[1] 0.3291409 0.7652556
estimates(0.1)
Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
[[1]]
Call:
glm(formula = y ~ x, family = binomial(link = "logit"))
Deviance Residuals:
Min 1Q Median 3Q Max
-1.36516 -0.00502 0.00000 0.00176 1.97006
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -0.2513 0.7457 -0.337 0.73613
x 18.2272 7.0400 2.589 0.00962 **
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 138.47 on 99 degrees of freedom
Residual deviance: 12.16 on 98 degrees of freedom
AIC: 16.16
Number of Fisher Scoring iterations: 10
[[2]]
[[2]]$minimum
[1] 6.079906
[[2]]$estimate
[1] -0.2512855 18.2271961
[[2]]$gradient
[1] 1.776357e-09 -4.872820e-11
[[2]]$hessian
[,1] [,2]
[1,] 1.80537620 -0.01193291
[2,] -0.01193291 0.02024778
[[2]]$code
[1] 1
[[2]]$iterations
[1] 16
[[3]]
[1] -0.2512855 18.2271961
[[4]]
[1] 0.7456992 7.0413963
# Problems starting to show up here
estimates(0.01)
Warning: glm.fit: algorithm did not converge
Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
Warning in sqrt(diag(solve(output$hessian))): NaNs produced
[[1]]
Call:
glm(formula = y ~ x, family = binomial(link = "logit"))
Deviance Residuals:
Min 1Q Median 3Q Max
-5.415e-04 -2.000e-08 2.000e-08 2.000e-08 5.205e-04
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 1.71 1154.21 0.001 0.999
x 1594.41 115513.42 0.014 0.989
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 1.3847e+02 on 99 degrees of freedom
Residual deviance: 5.6418e-07 on 98 degrees of freedom
AIC: 4
Number of Fisher Scoring iterations: 25
[[2]]
[[2]]$minimum
[1] 3.858529e-08
[[2]]$estimate
[1] 4.896653 2008.059639
[[2]]$gradient
[1] 3.830533e-08 -4.218093e-10
[[2]]$hessian
[,1] [,2]
[1,] 3.753404e-08 -4.202387e-10
[2,] -4.202387e-10 4.602800e-12
[[2]]$code
[1] 1
[[2]]$iterations
[1] 35
[[3]]
[1] 4.896653 2008.059639
[[4]]
[1] NaN NaN
More examples
If you want to learn more about other examples which may be relevant in finance, I suggest that you consult the relevant sections of Chapters 5 and 19 of Ruppert and Matteson (2015) and Section 2.2.2 of Carmona (2014).
An economic example which may interest you applies maximum likelihood methods to estimate bid functions in auction settings (see Paarsch (1992)). You will also see the Weibull family (LM Exercise 5.2.15) showing up.
If you want to learn more about non-existence, non-uniqueness, and multiple-root issues in MLE beyond the context of toy models, you may want to check out this paper on factor analysis and this other paper on mixture modelling (which can be thought of as a clustering technique).