Computational examples

LM Case Study 5.2.1

This case study shows that computations in R are not always straightforward. A closed-form solution for the MLE is available, but nlm() does not reproduce it at first. The problem can be traced to the definition of the geometric distribution in R: dgeom() counts the number of failures before the first success (so its support starts at 0), whereas LM counts the number of trials until the first success (so its support starts at 1). It is not a case of nlm() malfunctioning.

# Create the dataset based on Table 5.2.1
# This is the actual dataset up to a reshuffling
x <- c(rep(1, 72), rep(2, 35), rep(3, 11), rep(4, 6), rep(5, 2), rep(6, 2))
# Closed form solution
1/mean(x)
[1] 0.5791855
# WARNING!!!
mlnl <- function(par)
{
  sum(-dgeom(x, prob = par, log = TRUE))
}  
output <- nlm(mlnl, 0.5, hessian = TRUE)
Warning in dgeom(x, prob = par, log = TRUE): NaNs produced
Warning in nlm(mlnl, 0.5, hessian = TRUE): NA/Inf replaced by maximum positive
value
Warning in dgeom(x, prob = par, log = TRUE): NaNs produced
Warning in nlm(mlnl, 0.5, hessian = TRUE): NA/Inf replaced by maximum positive
value
Warning in dgeom(x, prob = par, log = TRUE): NaNs produced
Warning in nlm(mlnl, 0.5, hessian = TRUE): NA/Inf replaced by maximum positive
value
# Something is wrong!
output
$minimum
[1] 229.3663

$estimate
[1] 0.3667617

$gradient
[1] 1.847411e-06

$hessian
         [,1]
[1,] 1502.363

$code
[1] 1

$iterations
[1] 5
# Code from scratch!
mlnl <- function(par)
{
  -sum((x-1)*log(1-par)+log(par))
}
output <- nlm(mlnl, 0.5, hessian = TRUE)
Warning in log(1 - par): NaNs produced

Warning in log(1 - par): NA/Inf replaced by maximum positive value
Warning in log(1 - par): NaNs produced
Warning in nlm(mlnl, 0.5, hessian = TRUE): NA/Inf replaced by maximum positive
value
Warning in log(1 - par): NaNs produced
Warning in nlm(mlnl, 0.5, hessian = TRUE): NA/Inf replaced by maximum positive
value
output
$minimum
[1] 150.4023

$estimate
[1] 0.579185

$gradient
[1] -1.421085e-07

$hessian
         [,1]
[1,] 906.8597

$code
[1] 1

$iterations
[1] 4
# MLE and standard error
c(output$estimate, sqrt(1/output$hessian))
[1] 0.57918502 0.03320702
# Obtain expected frequency column
exp.freq <- function(k, est)
{
  (1-est)^(k-1)*est
}
sapply(1:6, exp.freq, est = output$estimate)*128
[1] 74.1356826 31.1974058 13.1283357  5.5246003  2.3248346  0.9783252
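
Once the difference in definitions is recognized, there is no need to code the log-likelihood from scratch: shifting the data by one converts counts of trials into counts of failures before the first success, which is what dgeom() expects. The sketch below is my own variant (not in LM); it should reproduce the closed-form value 1/mean(x) up to numerical tolerance, although warnings can still appear if nlm() steps outside the unit interval.

# Alternative fix: shift the data so it matches R's parameterization
# (number of failures before the first success)
mlnl <- function(par)
{
  sum(-dgeom(x - 1, prob = par, log = TRUE))
}
output <- nlm(mlnl, 0.5, hessian = TRUE)
c(output$estimate, sqrt(1/output$hessian))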

LM Example 5.2.3

There is probably an R package with a function for the zero-truncated Poisson distribution, but I avoid relying on one here. Instead, I directly code the log-likelihood. The result matches what is reported in LM.
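
For reference, the zero-truncated Poisson pmf being fit here is \[P\left(X=k\right)=\frac{e^{-\lambda}\lambda^{k}/k!}{1-e^{-\lambda}}=\frac{\lambda^{k}}{k!\left(e^{\lambda}-1\right)},\qquad k=1,2,\ldots\] so each observation contributes \(x_i\log\lambda-\log\left(e^{\lambda}-1\right)-\log x_i!\) to the log-likelihood, which is exactly what the code below negates and sums.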

# Table 5.2.2 dataset
x <- c(rep(1, 27), rep(2, 10), rep(3, 8), rep(4, 1), rep(5, 2), rep(6, 1))
mlnl <- function(par)
{
  -sum(x*log(par)-log(exp(par)-1)-lfactorial(x))
}
output <- nlm(mlnl, 1, hessian = TRUE)
output
$minimum
[1] 64.70424

$estimate
[1] 1.39848

$gradient
[1] 5.88359e-06

$hessian
         [,1]
[1,] 25.18383

$code
[1] 1

$iterations
[1] 5
# Report estimate and standard error
c(output$estimate, sqrt(1/output$hessian))
[1] 1.3984803 0.1992687

An alternative is to use dpois() and then make an adjustment for the truncation.

# Table 5.2.2 dataset
x <- c(rep(1, 27), rep(2, 10), rep(3, 8), rep(4, 1), rep(5, 2), rep(6, 1))
mlnl <- function(par)
{
  -sum(dpois(x, par, log = TRUE) - log(1-exp(-par)))
}
output <- nlm(mlnl, 1, hessian = TRUE)
output
$minimum
[1] 64.70424

$estimate
[1] 1.39848

$gradient
[1] 5.903913e-06

$hessian
         [,1]
[1,] 25.18383

$code
[1] 1

$iterations
[1] 5
# Report estimate and standard error
c(output$estimate, sqrt(1/output$hessian))
[1] 1.3984803 0.1992687

If you want to fuse the outcomes 4, 5, and 6 together, then you may have to code the log-likelihood from scratch.

# Table 5.2.2 dataset
x <- c(rep(1, 27), rep(2, 10), rep(3, 8), rep(4, 1), rep(5, 2), rep(6, 1))
mlnl <- function(par)
{
  -sum((x == 1)*(dpois(1, par, log = TRUE) - log(1-exp(-par))) 
       + (x == 2)*(dpois(2, par, log = TRUE) - log(1-exp(-par))) 
       + (x == 3)*(dpois(3, par, log = TRUE) - log(1-exp(-par))) 
       + (x >= 4)*(ppois(3, par, lower.tail = FALSE, log.p = TRUE) - log(1-exp(-par))))
}
output <- nlm(mlnl, 1, hessian = TRUE)
output
$minimum
[1] 58.11873

$estimate
[1] 1.320218

$gradient
[1] 2.583366e-07

$hessian
         [,1]
[1,] 25.41591

$code
[1] 1

$iterations
[1] 5
# Report estimate and standard error
c(output$estimate, sqrt(1/output$hessian))
[1] 1.3202176 0.1983568

It may also be a good idea to include an argument for the data in the R function where you code the negative log-likelihood. Doing this can be useful in Monte Carlo simulations (a small sketch of that use follows the example below). For example,

# Table 5.2.2 dataset
x <- c(rep(1, 27), rep(2, 10), rep(3, 8), rep(4, 1), rep(5, 2), rep(6, 1))
mlnl <- function(par, x)
{
  -sum(dpois(x, par, log = TRUE) - log(1-exp(-par)))
}
# nlm() only cares about minimizing over its first argument, but if you need
# the data to form your log-likelihood, the extra argument has to be passed
# through nlm().
# mlnl() has an argument x which points to the data.
# nlm() should only optimize over the argument par.
# x = x means that whatever x happens to be in memory is passed to the x
# argument of mlnl().
output <- nlm(mlnl, 1, hessian = TRUE, x = x)
output
$minimum
[1] 64.70424

$estimate
[1] 1.39848

$gradient
[1] 5.903913e-06

$hessian
         [,1]
[1,] 25.18383

$code
[1] 1

$iterations
[1] 5
# Report estimate and standard error
c(output$estimate, sqrt(1/output$hessian))
[1] 1.3984803 0.1992687
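
As a small illustration of the Monte Carlo point, here is a sketch of my own (not from LM): simulate zero-truncated Poisson data by drawing Poisson values and discarding the zeros, then pass each simulated dataset through the x argument of mlnl(). The choices n = 49 (the number of observations in Table 5.2.2) and lambda = 1.4 (close to the estimate above) are mine.

# Re-estimate lambda over many simulated zero-truncated Poisson datasets,
# reusing mlnl(par, x) defined above
sim.once <- function(n, lambda)
{
  draws <- numeric(0)
  # Draw Poisson values and discard zeros until at least n observations remain
  while (length(draws) < n) {
    draws <- c(draws, rpois(n, lambda))
    draws <- draws[draws > 0]
  }
  nlm(mlnl, 1, x = draws[1:n])$estimate
}
sim.est <- replicate(10^3, sim.once(n = 49, lambda = 1.4))
# Monte Carlo mean and standard deviation of the MLE
c(mean(sim.est), sd(sim.est))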

Estimating a correlation coefficient

Consider IID observations \((X_i,Y_i)\), \(i=1,\ldots,n\) from a standard bivariate normal distribution, i.e. \(\left(X,Y\right)\sim\mathsf{SBVN}\left(\rho\right)\). The estimand is the correlation coefficient \(\rho\). Recall that by definition \[\mathsf{Corr}\left(X,Y\right)=\dfrac{\mathsf{Cov}\left(X,Y\right)}{\sqrt{\mathsf{Var}\left(X\right)\mathsf{Var}\left(Y\right)}}\] In this case, \(\mathsf{Corr}\left(X,Y\right)=\rho\).
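
For reference, the density that dmvnorm() evaluates below, specialized to the standard bivariate normal, is \[f\left(x,y;\rho\right)=\frac{1}{2\pi\sqrt{1-\rho^{2}}}\exp\left(-\frac{x^{2}-2\rho xy+y^{2}}{2\left(1-\rho^{2}\right)}\right),\] so the only unknown entering the log-likelihood is \(\rho\).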

# Load mvtnorm package to simulate from bivariate normal
# To install this package run install.packages("mvtnorm")
library(mvtnorm)
# IID draws from SBVN
n <- 10
rho <- 0.5
mean.vec <- c(0, 0)
cov.mat <- matrix(c(1, rho, rho, 1), ncol=2)
draws <- rmvnorm(n, mean = mean.vec, sigma = cov.mat)
plot(draws)

# Set up MINUS the log-likelihood
mlnl <- function(par)
{
  sum(-dmvnorm(draws, mean = mean.vec, sigma = matrix(c(1, par, par, 1), ncol=2), log = TRUE))
}
# nlm() is one of the numerical optimization routines available in R
# Print out things completely
output <- nlm(mlnl, 0.5, hessian = TRUE)
Warning in nlm(mlnl, 0.5, hessian = TRUE): NA/Inf replaced by maximum positive
value
output
$minimum
[1] 28.38906

$estimate
[1] 0.6161043

$gradient
[1] -6.757261e-06

$hessian
        [,1]
[1,] 43.6081

$code
[1] 1

$iterations
[1] 5
# Report estimate and standard error
c(output$estimate, sqrt(1/output$hessian))
[1] 0.6161043 0.1514316
# Compare with what we think would be the natural estimator of rho: the sample correlation
cor(draws)
          [,1]      [,2]
[1,] 1.0000000 0.7269308
[2,] 0.7269308 1.0000000

Estimating the location of a Cauchy distribution

If a random variable has a Cauchy distribution, then its first moment does not exist. The pdf of a Cauchy random variable has the form \[f\left(x;\mu,\sigma\right)=\frac{1}{\pi\sigma \left[1+\left(\dfrac{x-\mu}{\sigma}\right)^2\right]}.\] The two parameters in the pdf are the location parameter \(\mu\) and the scale parameter \(\sigma\). Although these symbols are often used to denote the population mean and the population standard deviation, those two concepts do not exist for this particular distribution. The symbols may look the same, but they mean something else in this context.

Notice from the simulation below that the sampling distribution of the sample mean of IID Cauchy distributed random variables does not have a well-defined center and is heavily affected by extreme values in the tails. Although you can always calculate a sample mean, doing so does not teach you anything about the population mean of a Cauchy. Why? Because the population mean does not even exist!

n <- 20
loc <- 0
scale <- 1
# Simulate the sampling distribution of the sample mean
sample.means <- replicate(10^4, mean(rcauchy(n, location = loc, scale = scale)))
hist(sample.means, freq = FALSE)

c(mean(sample.means), var(sample.means))
[1]     -3.889575 261854.693336
# Repeat and see what happens
sample.means <- replicate(10^4, mean(rcauchy(n, location = loc, scale = scale)))
hist(sample.means, freq = FALSE)

c(mean(sample.means), var(sample.means))
[1]     4.380531 48108.662001

Suppose you are interested in estimating the location parameter \(\mu\) of a Cauchy distributed random variable and you have IID draws from that distribution. Assume that the scale parameter \(\sigma\) is known. You should set up the log-likelihood by hand and obtain the first-order condition. You will find that you cannot obtain a closed form for the MLE. In addition, the first-order condition can be rewritten as a polynomial equation in \(\mu\) whose degree grows with the sample size, so it is definitely possible to get multiple roots in this situation. If you want to learn more about this, refer to Section 2.2 of Small, Wang, and Yang (2000).

n <- 20
loc <- 0
scale <- 1
x <- rcauchy(n, location = loc, scale = scale)
mlnl <- function(par)
{
  sum(-dcauchy(x, location = par, scale = scale, log = TRUE))
}  
output <- nlm(mlnl, 0, hessian = TRUE)
output
$minimum
[1] 45.60868

$estimate
[1] 0.8574567

$gradient
[1] 1.895728e-05

$hessian
         [,1]
[1,] 9.971876

$code
[1] 1

$iterations
[1] 3
# Report estimate and standard error
c(output$estimate, sqrt(1/output$hessian))
[1] 0.8574567 0.3166734
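
To see why multiple roots are a real concern, it can help to plot the negative log-likelihood over a grid of candidate locations. The sketch below is my addition and reuses mlnl() and output from above; with Cauchy data, especially in samples containing large outliers, the curve can have several local minima, so the answer nlm() reports can depend on the starting value.

# Visual check for multiple local minima in the negative log-likelihood
mu.grid <- seq(-10, 10, by = 0.01)
plot(mu.grid, sapply(mu.grid, mlnl), type = "l",
     xlab = "location parameter", ylab = "negative log-likelihood")
# Mark the solution reported by nlm()
abline(v = output$estimate, lty = "dotted")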

Estimating the location and scale of a Cauchy distribution

n <- 20
loc <- 0
scale <- 1
x <- rcauchy(n, location = loc, scale = scale)
mlnl <- function(par)
{
  sum(-dcauchy(x, location = par[1], scale = par[2], log = TRUE))
}  
output <- nlm(mlnl, c(0, 1), hessian = TRUE)
output
$minimum
[1] 57.97463

$estimate
[1] 0.7216396 1.1176053

$gradient
[1] 7.815970e-08 1.653008e-07

$hessian
          [,1]      [,2]
[1,]  8.799451 -0.712296
[2,] -0.712296  7.210299

$code
[1] 1

$iterations
[1] 7
# Check positive definiteness of Hessian of negative log-likelihood
eigen(output$hessian)
eigen() decomposition
$values
[1] 9.071982 6.937769

$vectors
           [,1]       [,2]
[1,] -0.9339721 -0.3573458
[2,]  0.3573458 -0.9339721
# Obtain covariance matrix
solve(output$hessian)
           [,1]       [,2]
[1,] 0.11455955 0.01131719
[2,] 0.01131719 0.13980851
# Report estimates and standard errors
rbind(output$estimate, sqrt(diag(solve(output$hessian))))
          [,1]      [,2]
[1,] 0.7216396 1.1176053
[2,] 0.3384665 0.3739098

Classical normal linear regression

What you are going to see is a very old version of what is called a linear regression model. Modern versions of the linear regression model as applied in economics do not require strong parametric assumptions. We will encounter this more modern version when we discuss the method of moments.

Consider the following setting where \(Y_1,\ldots,Y_n\) are independently distributed with \(Y_i \sim N\left(\alpha+\beta x_i,\sigma^2\right)\). The \(x_i\)’s are treated as fixed in repeated samples. This may be appropriate for experimental settings or as a bridge towards conditioning on the \(x_i\)’s. The estimand is \(\left(\alpha,\beta,\sigma^2\right)\). This is very similar to the IID \(N\left(\mu,\sigma^2\right)\) example, but the population mean now differs across observations.

Given the parametric statistical model, the joint pdf of \(Y_1,\ldots,Y_n\) is given by \[ \begin{aligned} f\left(y_1,\ldots,y_n\right) &= f_{Y_1}\left(y_1;\alpha,\beta,\sigma^2\right)f_{Y_2}\left(y_2;\alpha,\beta,\sigma^2\right)\cdots f_{Y_n}\left(y_n;\alpha,\beta,\sigma^2\right) \\ & =\prod_{i=1}^n f_{Y_i}\left(y_i;\alpha,\beta,\sigma^2\right) \\ &= \prod_{i=1}^n \frac{1}{\sqrt{2\pi\sigma^2}}\exp\left(-\frac{\left(y_i-\alpha-\beta x_i\right)^2}{2\sigma^2}\right) \\ & = \left(\frac{1}{\sqrt{2\pi\sigma^2}}\right)^n \exp\left(-\frac{\sum\left(y_i-\alpha-\beta x_i\right)^2}{2\sigma^2}\right) \end{aligned} \]

From what you have seen, you can now form the log-likelihood by hand or the negative log-likelihood in R. We will be using a dataset from Section 17.4 of Dekking et al. (2005). The problem was to make a prediction about the hardness of wood from the density of wood. The data were obtained from measurements of the hardness (\(y\)) and density (\(x\)) of 36 Australian eucalypt hardwoods. The model used was a classical linear regression model but without the normality assumption. I impose normality here to illustrate the use of maximum likelihood methods in the multi-dimensional case and the possible computational issues you could encounter.
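
For reference, taking logs of the joint pdf above gives \[\ell\left(\alpha,\beta,\sigma^2\right)=-\frac{n}{2}\log\left(2\pi\sigma^2\right)-\frac{1}{2\sigma^2}\sum\left(y_i-\alpha-\beta x_i\right)^2,\] and the R function below is simply \(-\ell\), with par holding \(\left(\alpha,\beta,\sigma^2\right)\).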

# Loading a text file instead of a CSV; open the text file yourself to see what it looks like
jhdat <- read.table("jankahardness.txt", header=FALSE)
names(jhdat) <- c("x", "y")
n <- nrow(jhdat)
y <- jhdat$y 
x <- jhdat$x
# Scatterplot
plot(x, y)

# MINUS log-likelihood under normality
mlnl <- function(par)
{
  sum(-dnorm(y, mean = (par[1] + par[2]*x), sd = sqrt(par[3]), log = TRUE))
}
output <- nlm(mlnl, c(0, 1, 1), hessian = TRUE)
# WARNING!!! The convergence code is 5 (maximum step size exceeded repeatedly), so something may be wrong.
output
$minimum
[1] 248.5967

$estimate
[1] -1131.40516    57.04168 12502.93050

$gradient
[1]  0.022542989  0.791022694 -0.002216119

$hessian
              [,1]          [,2]          [,3]
[1,]  2.879119e-03  0.1316811207 -1.802651e-06
[2,]  1.316811e-01  6.5384687194 -6.473710e-05
[3,] -1.802651e-06 -0.0000647371  4.694901e-07

$code
[1] 5

$iterations
[1] 50
# Adjust the "scale" of y (divide by 100) so the parameters have more comparable magnitudes
y <- jhdat$y/100
output <- nlm(mlnl, c(0, 1, 1), hessian = TRUE)
Warning in sqrt(par[3]): NaNs produced
Warning in nlm(mlnl, c(0, 1, 1), hessian = TRUE): NA/Inf replaced by maximum
positive value
Warning in sqrt(par[3]): NaNs produced
Warning in nlm(mlnl, c(0, 1, 1), hessian = TRUE): NA/Inf replaced by maximum
positive value
output
$minimum
[1] 71.82001

$estimate
[1] -11.6049970   0.5750667   3.1649069

$gradient
[1] -5.350042e-08 -2.359357e-06  6.196384e-09

$hessian
              [,1]          [,2]          [,3]
[1,]  1.137474e+01   520.2048754 -0.0001796053
[2,]  5.202049e+02 25830.1499841 -0.4080296624
[3,] -1.796053e-04    -0.4080297  1.7962919685

$code
[1] 1

$iterations
[1] 49
# Check positive definiteness of Hessian of negative log-likelihood
eigen(output$hessian)
eigen() decomposition
$values
[1] 25840.626992     1.796357     0.897670

$vectors
              [,1]          [,2]        [,3]
[1,] -2.013606e-02  0.0089413947  0.99975727
[2,] -9.997972e-01 -0.0001642901 -0.02013540
[3,]  1.578827e-05  0.9999600114 -0.00894289
# Obtain covariance matrix
solve(output$hessian)
             [,1]          [,2]          [,3]
[1,]  1.113498914 -0.0224253284 -0.0049826025
[2,] -0.022425328  0.0004903498  0.0001091413
[3,] -0.004982602  0.0001091413  0.5567266655
# Report estimates and standard errors
rbind(output$estimate, sqrt(diag(solve(output$hessian))))
           [,1]       [,2]      [,3]
[1,] -11.604997 0.57506675 3.1649069
[2,]   1.055225 0.02214384 0.7461412
# You will learn about this in the next semester
# But lm() is used for estimating linear regressions
summary(lm(y ~ x))

Call:
lm(formula = y ~ x)

Residuals:
    Min      1Q  Median      3Q     Max 
-3.3840 -0.9698 -0.1571  0.9271  6.2506 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept) -11.60500    1.08580  -10.69 2.07e-12 ***
x             0.57507    0.02279   25.24  < 2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1.831 on 34 degrees of freedom
Multiple R-squared:  0.9493,    Adjusted R-squared:  0.9478 
F-statistic:   637 on 1 and 34 DF,  p-value: < 2.2e-16

Logistic regression

What you are going to see is a typical tool in machine learning used for classification. This tool is called logistic regression.

Consider the following simplified setting. Let \(Y_1,\ldots,Y_n\) be independently distributed with \(Y_i \sim \mathsf{Ber}\left(p_i\right)\), where \[p_i=\frac{1}{1+\exp\left(-\left(\beta_0+\beta_1 x_i\right)\right)}.\] Assume that the \(x_i\)’s are fixed in repeated samples. You should try setting up the log-likelihood for yourself (by hand) in this case.

I will be using four artificial datasets based on Zorn (2005) to illustrate some issues with applying maximum likelihood to estimate \(\beta_0\) and \(\beta_1\). Gelman, in one of his blog posts in 2011 (!!), considers an even simpler setup where \(\beta_1=0\).
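
If you want to check your own derivation, the log-likelihood works out to \[\ell\left(\beta_0,\beta_1\right)=\sum_{i=1}^{n}\left[y_i\log p_i+\left(1-y_i\right)\log\left(1-p_i\right)\right]=\sum_{i=1}^{n}\left[y_i\left(\beta_0+\beta_1 x_i\right)-\log\left(1+\exp\left(\beta_0+\beta_1 x_i\right)\right)\right],\] which is what the dbinom() call in the code below computes term by term (before the sign flip).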

# Setup MINUS log-likelihood
mlnl <- function(par, x, y)
{
  -sum(dbinom(y, 1, 1/(1+exp(-(par[1]+par[2]*x))), log = TRUE))
}
# Number of observations
n <- 100
# x drawn from standard normal, treat as fixed across simulations
x <- rnorm(n)
# Create a function, indexed by the alpha defined in Zorn (2005), which:
# - generates data according to Zorn (2005),
# - collects results from glm(), the built-in function for logistic regression,
# - uses nlm() to compute results with the hand-coded log-likelihood.
estimates <- function(alpha)
{
  # Generate data
  ystar <- x + alpha*rnorm(n)
  y <- (ystar >= 0)
  # Apply built-in logistic regression routine
  temp <- glm(y ~ x, family = binomial(link = "logit"))
  # Manually search for optima using nlm()
  output <- nlm(mlnl, c(0, 0), x = x, y = y, hessian = TRUE)
  # Produce a scatterplot
  plot(x, y)
  # Superimpose the estimated p
  curve(1/(1+exp(-(temp$coefficients[[1]]+temp$coefficients[[2]]*x))), lty = "dotted", add = TRUE)
  # Make a list of what to report
  # What the built-in command produces, what nlm() produces
  list(summary(temp), output, output$estimate, sqrt(diag(solve(output$hessian))))
}
# Try different values of alpha
# As alpha goes to zero, what is happening?
estimates(1)

[[1]]

Call:
glm(formula = y ~ x, family = binomial(link = "logit"))

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-1.9325  -0.7972  -0.2736   0.8168   2.2131  

Coefficients:
            Estimate Std. Error z value Pr(>|z|)    
(Intercept)  -0.2645     0.2512  -1.053    0.292    
x             1.6377     0.3422   4.786  1.7e-06 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 138.469  on 99  degrees of freedom
Residual deviance:  98.988  on 98  degrees of freedom
AIC: 102.99

Number of Fisher Scoring iterations: 5


[[2]]
[[2]]$minimum
[1] 49.494

[[2]]$estimate
[1] -0.2645412  1.6377060

[[2]]$gradient
[1]  1.278977e-06 -5.171666e-06

[[2]]$hessian
          [,1]     [,2]
[1,] 16.397259 2.193747
[2,]  2.193747 8.832247

[[2]]$code
[1] 1

[[2]]$iterations
[1] 10


[[3]]
[1] -0.2645412  1.6377060

[[4]]
[1] 0.2511613 0.3422180
estimates(0.5)

[[1]]

Call:
glm(formula = y ~ x, family = binomial(link = "logit"))

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-2.4341  -0.4379   0.0263   0.4096   2.1451  

Coefficients:
            Estimate Std. Error z value Pr(>|z|)    
(Intercept)  -0.2281     0.3291  -0.693    0.488    
x             3.6642     0.7651   4.789 1.68e-06 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 138.589  on 99  degrees of freedom
Residual deviance:  60.054  on 98  degrees of freedom
AIC: 64.054

Number of Fisher Scoring iterations: 6


[[2]]
[[2]]$minimum
[1] 30.02702

[[2]]$estimate
[1] -0.2281068  3.6642138

[[2]]$gradient
[1]  4.561684e-06 -2.736128e-06

[[2]]$hessian
          [,1]      [,2]
[1,] 9.4588689 0.6318131
[2,] 0.6318131 1.7498058

[[2]]$code
[1] 1

[[2]]$iterations
[1] 11


[[3]]
[1] -0.2281068  3.6642138

[[4]]
[1] 0.3291409 0.7652556
estimates(0.1)
Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

[[1]]

Call:
glm(formula = y ~ x, family = binomial(link = "logit"))

Deviance Residuals: 
     Min        1Q    Median        3Q       Max  
-1.36516  -0.00502   0.00000   0.00176   1.97006  

Coefficients:
            Estimate Std. Error z value Pr(>|z|)   
(Intercept)  -0.2513     0.7457  -0.337  0.73613   
x            18.2272     7.0400   2.589  0.00962 **
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 138.47  on 99  degrees of freedom
Residual deviance:  12.16  on 98  degrees of freedom
AIC: 16.16

Number of Fisher Scoring iterations: 10


[[2]]
[[2]]$minimum
[1] 6.079906

[[2]]$estimate
[1] -0.2512855 18.2271961

[[2]]$gradient
[1]  1.776357e-09 -4.872820e-11

[[2]]$hessian
            [,1]        [,2]
[1,]  1.80537620 -0.01193291
[2,] -0.01193291  0.02024778

[[2]]$code
[1] 1

[[2]]$iterations
[1] 16


[[3]]
[1] -0.2512855 18.2271961

[[4]]
[1] 0.7456992 7.0413963
# Problems starting to show up here
estimates(0.01)
Warning: glm.fit: algorithm did not converge

Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
Warning in sqrt(diag(solve(output$hessian))): NaNs produced

[[1]]

Call:
glm(formula = y ~ x, family = binomial(link = "logit"))

Deviance Residuals: 
       Min          1Q      Median          3Q         Max  
-5.415e-04  -2.000e-08   2.000e-08   2.000e-08   5.205e-04  

Coefficients:
             Estimate Std. Error z value Pr(>|z|)
(Intercept)      1.71    1154.21   0.001    0.999
x             1594.41  115513.42   0.014    0.989

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 1.3847e+02  on 99  degrees of freedom
Residual deviance: 5.6418e-07  on 98  degrees of freedom
AIC: 4

Number of Fisher Scoring iterations: 25


[[2]]
[[2]]$minimum
[1] 3.858529e-08

[[2]]$estimate
[1]    4.896653 2008.059639

[[2]]$gradient
[1]  3.830533e-08 -4.218093e-10

[[2]]$hessian
              [,1]          [,2]
[1,]  3.753404e-08 -4.202387e-10
[2,] -4.202387e-10  4.602800e-12

[[2]]$code
[1] 1

[[2]]$iterations
[1] 35


[[3]]
[1]    4.896653 2008.059639

[[4]]
[1] NaN NaN

More examples

If you want to learn more about other examples which may be relevant in finance, I suggest that you consult the relevant sections of Chapters 5 and 19 of Ruppert and Matteson (2015) and Section 2.2.2 of Carmona (2014).

An economic example which may interest you applies maximum likelihood methods to estimate bid functions in auction settings (see Paarsch (1992)). You will also see the Weibull family (LM Exercise 5.2.15) showing up.

If you want to learn more about non-existence, non-uniqueness, and multiple-root issues in maximum likelihood estimation beyond toy models, you may want to check out this paper on factor analysis and this other paper on mixture modelling (which can be thought of as a clustering technique).