Homework 02 Solution

Exercise A: ID numbers and contributions

1 bonus point earned for doing this exercise properly.

Exercise B: Chips Ahoy, Part 2

You have personally observed that the number of chocolate chips on a Chips Ahoy cookie is not the same across all cookies. In other words, there is variability in the number of chocolate chips even though these cookies are mass-produced and have quality control standards. You will now go through an argument meant to motivate the choice of a parametric statistical model to describe the variability of \(X\).

Chips Ahoy, Part 1 gave you a chance to work through what could be a suitable model for the number of chocolate chips on a cookie. You have reached the conclusion that \(X\sim \mathrm{Poi}\left(\lambda\right)\), where \(\lambda=m/n\). Both \(m\) and \(n\) are unobservable from the point of view of the consumer. In this exercise, you will be constructing a rough confidence interval for \(\lambda\) from the point of view of the consumer. Let \(Y_1,\ldots, Y_J\) be IID draws from \(\mathrm{Poi}\left(\lambda\right)\), where \(J\) is the sample size¹.

  1. (3 points) Write down a proof, along with your reasoning, of the following statement: \[\mathbb{P}\left(\left|\frac{\overline{Y}-\lambda}{\sqrt{\lambda/J}}\right| \leq c \right) \geq 1-\frac{1}{c^2}\]

ANSWER: Observe that \(Y_1,\ldots, Y_J\) are IID draws from \(\mathrm{Poi}\left(\lambda\right)\). Therefore, \(\mathbb{E}\left(\overline{Y}\right)=\lambda\) and \(\mathsf{Var}\left(\overline{Y}\right)=\lambda/J\). By Chebyshev’s inequality, we have for any \(\varepsilon > 0\), \[\mathbb{P}\left(\left|\overline{Y}-\mathbb{E}\left(\overline{Y}\right)\right| \leq \varepsilon \right)\ \geq 1-\frac{\mathsf{Var}\left(\overline{Y}\right)}{\varepsilon^2} \Rightarrow \mathbb{P}\left(\left|\overline{Y}-\lambda\right| \leq \varepsilon \right)\ \geq 1-\frac{\lambda}{J\varepsilon^2}. \] If we choose \(\varepsilon=c\sqrt{\lambda/J}\), we obtain \[\mathbb{P}\left(\left|\frac{\overline{Y}-\lambda}{\sqrt{\lambda/J}}\right| \leq c \right) \geq 1-\frac{1}{c^2}.\]

  2. (1 point) Note that \[\left|\frac{\overline{Y}-\lambda}{\sqrt{\lambda/J}}\right| \leq c \Leftrightarrow \left(\frac{\overline{Y}-\lambda}{\sqrt{\lambda/J}}\right)^2 \leq c^2.\] Do some algebra and show that \[\left(\frac{\overline{Y}-\lambda}{\sqrt{\lambda/J}}\right)^2 \leq c^2 \Leftrightarrow \lambda^2+\lambda\left(-2\overline{Y}-\frac{c^2}{J}\right)+\left(\overline{Y}\right)^2\leq 0\]

I do not show a detailed solution for this part, as it is straightforward algebra; a brief sketch is included below for completeness.
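One route the algebra can take (just a sketch, since the graded answer omits it) is to multiply through by \(\lambda/J>0\), expand the square, and collect terms in \(\lambda\): \[\left(\frac{\overline{Y}-\lambda}{\sqrt{\lambda/J}}\right)^2 \leq c^2 \Leftrightarrow \left(\overline{Y}-\lambda\right)^2 \leq \frac{c^2\lambda}{J} \Leftrightarrow \lambda^2-2\lambda\overline{Y}+\left(\overline{Y}\right)^2-\frac{c^2}{J}\lambda \leq 0 \Leftrightarrow \lambda^2+\lambda\left(-2\overline{Y}-\frac{c^2}{J}\right)+\left(\overline{Y}\right)^2\leq 0.\]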

  3. (4 points) Consider a modified version of the definition of a confidence interval found here. In particular, let \(\theta\) be an unknown parameter of interest or estimand and \(c>0\). Let \(L\) and \(U\) be statistics. \((L, U)\) is a finite-sample conservative \(100\left(1-1/c^2\right)\%\) confidence interval for \(\theta\) if \[\mathbb{P}\left(L\leq \theta \leq U\right) \geq 1-\frac{1}{c^2}\] for every \(\theta\). Use the result in Item 2 to construct a finite-sample conservative \(100\left(1-1/c^2\right)\%\) confidence interval for \(\lambda\). You should give me expressions for \(L\) and \(U\), which will achieve the lower bound \(1-1/c^2\).

First, solve the quadratic equation \[\lambda^2+\lambda\left(-2\overline{Y}-\frac{c^2}{J}\right)+\left(\overline{Y}\right)^2= 0.\] The solutions are given by \[L=\frac{2\overline{Y}+\dfrac{c^2}{J}-\sqrt{\dfrac{4\overline{Y}c^2}{J}+\dfrac{c^4}{J^2}}}{2}, \ \ \ U=\frac{2\overline{Y}+\dfrac{c^2}{J}+\sqrt{\dfrac{4\overline{Y}c^2}{J}+\dfrac{c^4}{J^2}}}{2}.\] Therefore, the solutions to the inequality \[\lambda^2+\lambda\left(-2\overline{Y}-\frac{c^2}{J}\right)+\left(\overline{Y}\right)^2\leq 0\] are given by \(L\leq \lambda \leq U\). Next, show that the desired guarantee is attained: \[\mathbb{P}\left(L\leq \lambda \leq U\right)=\mathbb{P}\left(\left(\frac{\overline{Y}-\lambda}{\sqrt{\lambda/J}}\right)^2 \leq c^2\right)=\mathbb{P}\left(\left|\frac{\overline{Y}-\lambda}{\sqrt{\lambda/J}}\right| \leq c \right) \geq 1-\dfrac{1}{c^2}.\]
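As a side note, these endpoints can be packaged into a small R helper; the function name poisson_cheb_ci below is my own illustrative choice and not part of the assignment.

poisson_cheb_ci <- function(ybar, J, c)
{
  # endpoints L and U of the conservative interval derived above
  half <- sqrt(4*ybar*c^2/J + c^4/J^2)
  c(L = (2*ybar + c^2/J - half)/2, U = (2*ybar + c^2/J + half)/2)
}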

  4. (1 point) Recall we constructed a confidence interval for \(\mu\) in the IID normal case. We used an approach where we pretended to know \(\sigma^2\). Do we have to pretend to know \(\sigma^2\) for the Poisson case? Explain.

We cannot even pretend to know \(\sigma^2\) in the Poisson case: here \(\sigma^2=\lambda\), so pretending to know the variance would amount to pretending to know the estimand itself. Instead, the construction in Item 3 handles the unknown variance directly by solving the quadratic inequality in \(\lambda\).

  5. (2 points) You will now apply the interval found in Item 3 to obtain a finite-sample conservative 95% confidence interval for \(\lambda\) using our cookies dataset. You may have to use read.csv(). Report the R code you have used and your finding.

We set \(c^2=20\) because a finite-sample conservative 95% confidence interval for \(\lambda\) requires \(1-1/c^2=0.95\), which gives \(c^2=20\). Below, we produce the calculations which apply the interval obtained in Item 3.

cookie <- read.csv("data_cookie.csv")  # load the cookies dataset
ybar <- mean(cookie$numchoc)           # sample mean number of chips per cookie
J <- length(cookie$numchoc)            # sample size
c <- sqrt(20)                          # c^2 = 20 gives 1 - 1/c^2 = 0.95
c(2*ybar + c^2/J - sqrt(4*ybar*c^2/J + c^4/J^2), 2*ybar + c^2/J + sqrt(4*ybar*c^2/J + c^4/J^2))/2
[1] 13.64160 16.71778
  6. (4 points) You will be using a Monte Carlo simulation to evaluate the performance of the finite-sample conservative \(100\left(1-1/c^2\right)\%\) confidence interval for \(\lambda\). Consider the R code we used to demonstrate what a confidence interval means in the IID normal case here. I created a modified version below. You have to fill in the missing parts accordingly for the case of a finite-sample conservative \(100\left(1-1/c^2\right)\%\) confidence interval for \(\lambda\). Report the code you have used and your finding for mean(results[1,] < lambda & lambda < results[2,]). Discuss whether your finding matches the theoretical guarantee developed in Item 3.

ANSWER: Below you will find the R code used. The simulated coverage from mean(results[1,] < lambda & lambda < results[2,]) is about 0.955, well above the guaranteed lower bound \(1-1/c^2=3/4\) when \(c=2\), so the finding is consistent with the theoretical guarantee developed in Item 3. In fact, the lower bound is quite conservative.

# Set a value for lambda to generate artificial datasets following a Poisson distribution
lambda <- 14
# Set c for the desired coverage level 1 - 1/c^2
c <- 2
# Create a function depending on sample size J
cons.ci <- function(J)
{
  y <- rpois(J, lambda)
  ybar <- mean(y)
  c(2*ybar + c^2/J - sqrt(4*ybar*c^2/J + c^4/J^2), 2*ybar + c^2/J + sqrt(4*ybar*c^2/J + c^4/J^2))/2
}
# Construct the interval 10^4 times; you can change 10^4 into something else
results <- replicate(10^4, cons.ci(J))
# Calculate the proportion of intervals that contain lambda
mean(results[1,] < lambda & lambda < results[2,])
[1] 0.9553

NOTE: It is not necessary to match the simulation design to the Chips Ahoy example, but you may do so. For one thing, we do not actually know the true value of \(\lambda\).

R NOTE: The object J in the call cons.ci(J) refers to the J available in memory, which should match J <- length(cookie$numchoc). The J in the function definition of cons.ci is only a placeholder (a formal argument).
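As a quick illustration of this point (a sketch only, not part of the graded answer, assuming cookie and cons.ci are already in memory), the placeholder J is simply bound to whatever value appears in the call:

# the J in memory, as in Item 5
J <- length(cookie$numchoc)
# one interval from an artificial Poisson sample of size J
cons.ci(J)
# same function, but now the artificial sample has size 10
cons.ci(10)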

Exercise C: The IID Uniform case

This exercise should force you to start working on the examples and exercises in the book. This exercise is a version of LM Examples 5.4.2, 5.4.6, 5.7.1, and LM Exercises 5.4.18, 5.7.4, 5.7.5. You may also need to use integration by parts at some point for the mathematical derivations.

Let \(Y_1,\ldots,Y_n\) be IID \(\mathsf{U}\left(0,\theta\right)\), where \(\theta>0\). The common pdf of these random variables is given by \[f\left(y\right)=\begin{cases}\dfrac{1}{\theta} & \ \ \ 0\leq y\leq \theta\\ 0 & \ \ \ \mathrm{otherwise} \end{cases},\] the common mean is \(\theta/2\), and the common variance is \(\theta^2/12\). The estimand for this exercise is \(\theta\).

You will be considering three estimators for \(\theta\): \[\widehat{\theta}_1=\left(n+1\right)Y_{\mathsf{min}}, \ \ \ \widehat{\theta}_2=2\overline{Y}, \ \ \ \widehat{\theta}_3=\dfrac{n+1}{n}Y_{\mathsf{max}}.\] You will learn later that \(\widehat{\theta}_1\), which depends on the smallest order statistic², \(Y_{\mathsf{min}}\), is a bias-adjusted estimator that applies the plug-in principle. You will also learn later that \(\widehat{\theta}_2=2\overline{Y}\) is a method of moments estimator, and that \(\widehat{\theta}_3\), which depends on the largest order statistic \(Y_{\mathsf{max}}\), is a bias-adjusted maximum likelihood estimator.

ANSWER: First, let me write down all the calculus-based results I am going to need. The cdf for \(\mathsf{U}\left(0,\theta\right)\) is given by \[F\left(y\right)=\int_{-\infty}^y f\left(s\right)\,\,ds=\begin{cases} 0 & \mathrm{if}\ y<0\\ \displaystyle\int_{-\infty}^0 f\left(s\right)\,\,ds+\int_0^y f\left(s\right)\,\,ds = \int_0^y \frac{1}{\theta}\,\,ds=\frac{y}{\theta} & \mathrm{if}\ 0\leq y \leq \theta\\ \displaystyle\int_{-\infty}^\theta f\left(s\right)\,\,ds+\int_\theta^y f\left(s\right)\,\,ds = 1 & \mathrm{if}\ y > \theta \end{cases}\] Next, by Theorem 3.10.1, the pdfs of the largest and smallest order statistics are \(n\left[F\left(y\right)\right]^{n-1}f\left(y\right)\) and \(n\left[1-F\left(y\right)\right]^{n-1}f\left(y\right)\), respectively, so here the pdfs of \(Y_{\mathsf{max}}\) and \(Y_{\mathsf{min}}\) are given by \[f_{Y_{\mathsf{max}}}\left(y\right)=\frac{ny^{n-1}}{\theta^n}, \ \ f_{Y_{\mathsf{min}}}\left(y\right)=\dfrac{n}{\theta}\left(1-\frac{y}{\theta}\right)^{n-1}\] for \(0\leq y\leq \theta\). Finally, I collect the following integration results below to tidy up the calculations later.

  • For positive integers \(k\), we have \[\int \left(1-\frac{s}{\theta}\right)^k \,\, ds = -\frac{\theta}{k+1}\left(1-\frac{s}{\theta}\right)^{k+1}\] This result is obtained using integration by substitution.
  • For positive integers \(m\) and \(k\), we have \[\int s^m\left(1-\frac{s}{\theta}\right)^k \,\, ds = -\frac{\theta s^m}{k+1}\left(1-\frac{s}{\theta}\right)^{k+1}+\frac{m\theta}{k+1} \int s^{m-1}\left(1-\frac{s}{\theta}\right)^{k+1}\,\, ds\] So, if \(m=1\) and \(k=n-1\), \(\displaystyle\int_0^\theta s^m\left(1-\frac{s}{\theta}\right)^k \,\, ds = \frac{\theta^2}{n\left(n+1\right)}\). If \(m=2\) and \(k=n-1\), \[\int_0^\theta s^m\left(1-\frac{s}{\theta}\right)^k \,\, ds = \frac{2\theta^3}{n\left(n+1\right)\left(n+2\right)}.\]
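The two definite integrals above can also be spot-checked numerically; the sketch below uses integrate() with illustrative values \(n=7\) and \(\theta=2\), which are my own choices and not part of the exercise.

# Numerical spot-check of the two integration results, with n = 7 and theta = 2
n <- 7
theta <- 2
integrate(function(s) s*(1 - s/theta)^(n - 1), lower = 0, upper = theta)$value
theta^2/(n*(n + 1))                      # should agree with the value above
integrate(function(s) s^2*(1 - s/theta)^(n - 1), lower = 0, upper = theta)$value
2*theta^3/(n*(n + 1)*(n + 2))            # should agree with the value above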
  1. (6 points) Find \(\mathbb{E}\left(\widehat{\theta}_1\right)\), \(\mathbb{E}\left(\widehat{\theta}_2\right)\), and \(\mathbb{E}\left(\widehat{\theta}_3\right)\). One of the calculations is much easier. Can you explain why?

ANSWER:

We need to derive the pdf of \(Y_{\mathsf{min}}\) and to directly calculate some integrals. These results were collected earlier. So, we have

\[\mathbb{E}\left(\widehat{\theta}_1\right)=\left(n+1\right)\mathbb{E}\left(Y_{\mathsf{min}}\right)=\displaystyle\frac{\left(n+1\right)n}{\theta}\int_0^\theta s\left(1-\frac{s}{\theta}\right)^{n-1}\,\, ds= \frac{\left(n+1\right)n}{\theta} \cdot \frac{\theta^2}{n\left(n+1\right)} =\theta\] Because \(Y_1,\ldots, Y_n\) are IID \(\mathsf{U}\left(0,\theta\right)\), \(\mathbb{E}\left(\overline{Y}\right)=\theta/2\). Therefore,

\[\mathbb{E}\left(\widehat{\theta}_2\right)=2\mathbb{E}\left(\overline{Y}\right)=2\theta/2=\theta\] We also need to derive the pdf of \(Y_{\mathsf{max}}\) and to directly calculate some integrals. So, we have \[\displaystyle \mathbb{E}\left(\widehat{\theta}_3\right)=\frac{n+1}{n}\mathbb{E}\left(Y_{\mathsf{max}}\right)=\frac{n+1}{\theta^n}\int_0^\theta s^n\,\, ds = \frac{n+1}{\theta^n}\cdot\frac{\theta^{n+1}}{n+1}=\theta\]

The easiest to work out was \(\mathbb{E}\left(\widehat{\theta}_2\right)\) because you do not need the distribution of \(\overline{Y}\): linearity of expectation gives \(\mathbb{E}\left(\overline{Y}\right)=\theta/2\) directly, which substantially simplifies the calculation compared to the other two estimators.

NOTE: \(Y_{\mathsf{min}}\) took a lot more time to work out compared to \(Y_{\mathsf{max}}\), as the latter does not require integration by parts.

  2. (6 points) Find \(\mathsf{Var}\left(\widehat{\theta}_1\right)\), \(\mathsf{Var}\left(\widehat{\theta}_2\right)\), and \(\mathsf{Var}\left(\widehat{\theta}_3\right)\). Which estimator has a variance that goes to zero more rapidly?

ANSWER: Using the previously collected results, we have

\[\mathbb{E}\left(\widehat{\theta}_1^2\right)= \left(n+1\right)^2\mathbb{E}\left(Y_{\mathsf{min}}^2\right)=\frac{\left(n+1\right)^2 n}{\theta} \cdot \frac{2\theta^3}{n\left(n+1\right)\left(n+2\right)}=\frac{2\left(n+1\right)\theta^2}{n+2}.\] As a result, \[\mathsf{Var}\left(\widehat{\theta}_1\right)=\frac{2\left(n+1\right)\theta^2}{n+2}-\theta^2=\frac{n}{n+2}\theta^2.\] Next, \(Y_1,\ldots, Y_n\) are IID \(\mathsf{U}\left(0,\theta\right)\), so \(\mathsf{Var}\left(\overline{Y}\right)=\theta^2/\left(12n\right)\). Therefore, \[\mathsf{Var}\left(\widehat{\theta}_2\right)=4\mathsf{Var}\left(\overline{Y}\right)=\frac{1}{3n}\theta^2.\] Finally, \[\mathbb{E}\left(\widehat{\theta}_3^2\right)=\left(\frac{n+1}{n}\right)^2\mathbb{E}\left(Y_{\mathsf{max}}^2\right)=\left(\frac{n+1}{n}\right)^2 \cdot \frac{n}{\theta^n}\int_0^\theta s^{n+1}\,\, ds=\frac{\left(n+1\right)^2}{n\left(n+2\right)}\theta^2.\] Therefore, \[\mathsf{Var}\left(\widehat{\theta}_3\right)=\frac{\left(n+1\right)^2}{n\left(n+2\right)}\theta^2-\theta^2=\frac{1}{n\left(n+2\right)}\theta^2\]

\(\widehat{\theta}_3\) has a variance that goes to zero more rapidly: it is of order \(1/n^2\), while \(\mathsf{Var}\left(\widehat{\theta}_2\right)\) is of order \(1/n\) and \(\mathsf{Var}\left(\widehat{\theta}_1\right)\) does not go to zero at all.

  3. (3 points) Read the portion of the notes related to squared-error consistency here. There you will find the definition of squared-error consistency, bias, asymptotically unbiased, and MSE. Using what you already have in Item 1, what would be the bias of \(Y_{\mathsf{min}}\), \(2\overline{Y}\), and \(Y_{\mathsf{max}}\)? Are these estimators unbiased for \(\theta\)? Asymptotically unbiased for \(\theta\)? Why do you think \(\widehat{\theta}_1\) and \(\widehat{\theta}_3\) are labeled bias-adjusted?

ANSWER: The bias of \(Y_{\mathsf{min}}\) for \(\theta\) is \(\displaystyle\mathbb{E}\left(Y_{\mathsf{min}}\right)-\theta=-\frac{n}{n+1}\theta\). The bias of \(2\overline{Y}\) for \(\theta\) is 0. The bias of \(Y_{\mathsf{max}}\) for \(\theta\) is \(\displaystyle\mathbb{E}\left(Y_{\mathsf{max}}\right)-\theta=-\frac{1}{n+1}\theta\). Therefore, only \(2\overline{Y}\) is unbiased for \(\theta\).

The limits of the biases of \(2\overline{Y}\) and \(Y_{\mathsf{max}}\) as \(n\to\infty\) are both equal to zero, so these two estimators are asymptotically unbiased for \(\theta\). In contrast, the bias of \(Y_{\mathsf{min}}\) tends to \(-\theta\) as \(n\to\infty\), so \(Y_{\mathsf{min}}\) is neither unbiased nor asymptotically unbiased for \(\theta\).

Since \(\displaystyle\mathbb{E}\left(Y_{\mathsf{max}}\right)=\frac{n}{n+1}\theta\), we must have \(\displaystyle\mathbb{E}\left(\frac{n+1}{n}Y_{\mathsf{max}}\right)=\theta\). Observe that \(\displaystyle\frac{n+1}{n}Y_{\mathsf{max}}\) is exactly \(\widehat{\theta}_3\).

Similarly, since \(\displaystyle\mathbb{E}\left(Y_{\mathsf{min}}\right)=\frac{1}{n+1}\theta\), we must have \(\displaystyle\mathbb{E}\left(\left(n+1\right)Y_{\mathsf{min}}\right)=\theta\). Observe that \(\displaystyle\left(n+1\right)Y_{\mathsf{min}}\) is exactly \(\widehat{\theta}_1\). Therefore, in both cases, we can justify the label bias-adjusted estimators.

  4. (3 points) Determine whether or not \(\widehat{\theta}_1\), \(\widehat{\theta}_2\), and \(\widehat{\theta}_3\) are squared-error consistent for \(\theta\). Show your work and reasoning.

Since \(\widehat{\theta}_1\), \(\widehat{\theta}_2\), and \(\widehat{\theta}_3\) are all unbiased for \(\theta\), they are also asymptotically unbiased for \(\theta\).

From Item 2, \[\lim_{n\to\infty}\mathsf{Var}\left(\widehat{\theta}_1\right)=\lim_{n\to\infty}\frac{n}{n+2}\theta^2=\theta^2 \neq 0.\] Therefore, \(\widehat{\theta}_1\) is not squared-error consistent for \(\theta\).

From Item 2, observe that both \(\widehat{\theta}_2\) and \(\widehat{\theta}_3\) have variances that go to zero as \(n\to\infty\). Because \(\widehat{\theta}_2\) and \(\widehat{\theta}_3\) are asymptotically unbiased and have variances that go to zero asymptotically, both these estimators are squared-error consistent for \(\theta\).
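In MSE terms (a compact restatement of Items 1 and 2, using \(\mathsf{MSE}=\mathrm{bias}^2+\mathrm{variance}\)): \[\mathsf{MSE}\left(\widehat{\theta}_1\right)=\frac{n}{n+2}\theta^2 \to \theta^2, \ \ \ \mathsf{MSE}\left(\widehat{\theta}_2\right)=\frac{\theta^2}{3n} \to 0, \ \ \ \mathsf{MSE}\left(\widehat{\theta}_3\right)=\frac{\theta^2}{n\left(n+2\right)} \to 0\] as \(n\to\infty\), which matches the conclusions above.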

Below you see sample code to demonstrate the apply() function. To learn more about it, type ?apply. The idea is to “apply” a function (either built-in or user-defined) to either the rows or the columns of an array. This is a convenient way to repeatedly do something to the columns or rows of an array without explicitly writing a for loop. For example, you encountered colMeans(), which may be formulated in terms of apply(); see the short check after the demo below.

n <- 5
mu <- 1
sigma.sq <- 4
nsim <- 8 # number of realizations to be obtained
# repeatedly obtain realizations
ymat <- replicate(nsim, rnorm(n, mu, sqrt(sigma.sq)))
ymat
           [,1]        [,2]      [,3]     [,4]       [,5]      [,6]      [,7]
[1,]  1.3904051 -0.03111589  2.369097 2.213403 2.22615523 -2.934541 2.3109174
[2,]  0.3284039  2.29255777  1.778515 3.661879 1.00847223 -2.319194 3.6014871
[3,] -2.1726322  0.03044256 -3.045596 1.200483 3.73690601 -1.535897 3.0837312
[4,]  0.6038961  0.26164353  5.022951 2.845416 0.01033531  2.566453 0.4153617
[5,]  0.7250735  1.18166540  2.054126 2.044209 2.23500607  3.795071 1.8485274
           [,8]
[1,]  3.9297191
[2,]  2.2290166
[3,]  1.7287033
[4,] -0.6197381
[5,] -1.8608037
apply(ymat, 2, mean)
[1]  0.17502928  0.74703867  1.63581857  2.39307784  1.84337497 -0.08562156
[7]  2.25200495  1.08137944
apply(ymat, 1, mean)
[1] 1.4342550 1.5726422 0.3782676 1.3882898 1.5028593
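As a quick check of the point about colMeans() (assuming the ymat from the demo above is still in memory), the two approaches agree:

# colMeans() is equivalent to applying mean() over the columns
all.equal(colMeans(ymat), apply(ymat, 2, mean))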
  5. (6 points) You are going to repeatedly generate \(10^4\) artificial datasets from \(\mathsf{U}\left(0,\theta\right)\). Set \(n=5\) and \(n=500\), but you can choose any valid value of \(\theta\). You may have to use the apply(), min(), max(), and runif() commands for this item. Your tasks are (a) to apply the estimators \(\widehat{\theta}_1\), \(\widehat{\theta}_2\), and \(\widehat{\theta}_3\) to the artificial datasets, (b) to produce histograms of the sampling distributions of \(\widehat{\theta}_1\), \(\widehat{\theta}_2\), and \(\widehat{\theta}_3\), and (c) to report your own R code by modifying the code below. What do you notice about the sampling distributions of each estimator?
theta <- 1
nsim <- 10^4 # number of realizations to be obtained
# repeatedly obtain realizations
ymat <- replicate(nsim, runif(5, min = 0, max = theta))
# calculate estimates: (n+1) = 6 and (n+1)/n = 6/5 when n = 5
theta1hat <- 6*apply(ymat, 2, min)
theta2hat <- 2*colMeans(ymat)
theta3hat <- 6/5*apply(ymat, 2, max)
# put a 1x3 canvas for the case of n=5
par(mfrow=c(1,3))
hist(theta1hat, freq = FALSE)
hist(theta2hat, freq = FALSE)
hist(theta3hat, freq = FALSE)

c(mean(theta1hat), mean(theta2hat), mean(theta3hat))
[1] 0.9935800 0.9930326 0.9979351
c(var(theta1hat), var(theta2hat), var(theta3hat))
[1] 0.70735197 0.06603466 0.02905609
# n=500
ymat <- replicate(nsim, runif(500, min = 0, max = theta))
# calculate estimates: (n+1) = 501 and (n+1)/n = 501/500 when n = 500
theta1hat <- 501*apply(ymat, 2, min)
theta2hat <- 2*colMeans(ymat)
theta3hat <- 501/500*apply(ymat, 2, max)
# put a 1x3 canvas for the case of n=500
par(mfrow=c(1,3))
hist(theta1hat, freq = FALSE)
hist(theta2hat, freq = FALSE)
hist(theta3hat, freq = FALSE)

c(mean(theta1hat), mean(theta2hat), mean(theta3hat))
[1] 1.0046896 1.0000454 0.9999998
c(var(theta1hat), var(theta2hat), var(theta3hat))
[1] 9.782793e-01 6.656409e-04 3.967086e-06
  6. (1 point) Report your findings for c(mean(theta1hat), mean(theta2hat), mean(theta3hat)) and c(var(theta1hat), var(theta2hat), var(theta3hat)) when \(n=5\) and \(n=500\). Compare with the theoretical results obtained in Items 1 and 2.

The findings from the simulation can be found in Item 5, where we fixed \(\theta=1\) by design. We should see values close to 1 for c(mean(theta1hat), mean(theta2hat), mean(theta3hat)) regardless of whether \(n=5\) or \(n=500\).

For c(var(theta1hat), var(theta2hat), var(theta3hat)), we should obtain values roughly around 0.714, 0.067, 0.029, respectively, when \(n=5\). When \(n=500\), we should obtain values roughly around 1, \(6.7\times 10^{-4}\), and \(4\times 10^{-6}\).
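These theoretical values come directly from the variance formulas in Item 2 with \(\theta=1\); for reference, they can be computed in R as follows.

# Theoretical variances n/(n+2), 1/(3n), and 1/(n(n+2)), times theta^2 with theta = 1
theta <- 1
n <- 5
c(n/(n + 2), 1/(3*n), 1/(n*(n + 2)))*theta^2
n <- 500
c(n/(n + 2), 1/(3*n), 1/(n*(n + 2)))*theta^2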

The theoretical and simulation results match very well for the case where \(\theta=1\).

Footnotes

  1. I used \(J\) for sample size instead of \(n\), which is what we usually use in class. The main reason is that I have already used up \(n\) for the exercise.↩︎

  2. If you forgot this concept, consult LM Section 3.10.↩︎