You have personally observed that the number of chocolate chips on a Chips Ahoy cookie is not the same across all cookies. In other words, there is variability in the number of chocolate chips even though these cookies are mass-produced and have quality control standards. You will now go through an argument meant to motivate the choice of a parametric statistical model to describe the variability of \(X\).
Chips Ahoy, Part 1 gave you a chance to work through what could be a suitable model for the number of chocolate chips on a cookie. You have reached the conclusion that \(X\sim \mathrm{Poi}\left(\lambda\right)\), where \(\lambda=m/n\). Both \(m\) and \(n\) are unobservable from the point of view of the consumer. In this exercise, you will be constructing a rough confidence interval for \(\lambda\) from the point of view of the consumer. Let \(Y_1,\ldots, Y_J\) be IID draws from \(\mathrm{Poi}\left(\lambda\right)\), where \(J\) is the sample size. (1 bonus point earned for doing this exercise properly.)
ANSWER: Observe that \(Y_1,\ldots, Y_J\) are IID draws from \(\mathrm{Poi}\left(\lambda\right)\). Therefore, \(\mathbb{E}\left(\overline{Y}\right)=\lambda\) and \(\mathsf{Var}\left(\overline{Y}\right)=\lambda/J\). By Chebyshev’s inequality, we have for any \(\varepsilon > 0\), \[\mathbb{P}\left(\left|\overline{Y}-\mathbb{E}\left(\overline{Y}\right)\right| \leq \varepsilon \right) \geq 1-\frac{\mathsf{Var}\left(\overline{Y}\right)}{\varepsilon^2} \Rightarrow \mathbb{P}\left(\left|\overline{Y}-\lambda\right| \leq \varepsilon \right) \geq 1-\frac{\lambda}{J\varepsilon^2}. \] If we choose \(\varepsilon=c\sqrt{\lambda/J}\), we obtain \[\mathbb{P}\left(\left|\frac{\overline{Y}-\lambda}{\sqrt{\lambda/J}}\right| \leq c \right) \geq 1-\frac{1}{c^2}.\]
I do not show the solution for this part, as this is straightforward algebra.
First, solve the quadratic equation \[\lambda^2+\lambda\left(-2\overline{Y}-\frac{c^2}{J}\right)+\left(\overline{Y}\right)^2= 0.\] The solutions are given by \[L=\frac{2\overline{Y}+\dfrac{c^2}{J}-\sqrt{\dfrac{4\overline{Y}c^2}{J}+\dfrac{c^4}{J^2}}}{2}, \ \ \ U=\frac{2\overline{Y}+\dfrac{c^2}{J}+\sqrt{\dfrac{4\overline{Y}c^2}{J}+\dfrac{c^4}{J^2}}}{2}.\] Therefore, the solutions to the inequality \[\lambda^2+\lambda\left(-2\overline{Y}-\frac{c^2}{J}\right)+\left(\overline{Y}\right)^2\leq 0\] are given by \(L\leq \lambda \leq U\). Next, show that the desired guarantee is attained: \[\mathbb{P}\left(L\leq \lambda \leq U\right)=\mathbb{P}\left(\left(\frac{\overline{Y}-\lambda}{\sqrt{\lambda/J}}\right)^2 \leq c^2\right)=\mathbb{P}\left(\left|\frac{\overline{Y}-\lambda}{\sqrt{\lambda/J}}\right| \leq c \right) \geq 1-\dfrac{1}{c^2}.\]
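To make the formula concrete, here is a minimal R sketch; the function name poisson.cheb.ci and the illustrative values ybar = 15, J = 30, c = 2 are ours (not part of the exercise). It computes \(L\) and \(U\) and checks that both endpoints are roots of the quadratic above.

# Minimal sketch: compute (L, U) from ybar, J, and c using the formula derived above
poisson.cheb.ci <- function(ybar, J, c) {
  disc <- sqrt(4*ybar*c^2/J + c^4/J^2)
  c(L = (2*ybar + c^2/J - disc)/2, U = (2*ybar + c^2/J + disc)/2)
}
# Quick check: both endpoints should make the quadratic (approximately) zero
ci <- poisson.cheb.ci(ybar = 15, J = 30, c = 2)
ci^2 + ci*(-2*15 - 2^2/30) + 15^2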
We cannot even pretend to know \(\sigma^2\) for the Poisson case because \(\sigma^2=\lambda\) and \(\lambda\) is unknown.
Load the data using read.csv(). Report the R code you have used and your finding. ANSWER: We must set \(c^2=20\) because we want a finite-sample conservative 95% confidence interval for \(\lambda\): solving \(1-1/c^2=0.95\) gives \(c^2=20\). Below are the calculations which apply the interval obtained in Item 3.
cookie <- read.csv("data_cookie.csv")
ybar <- mean(cookie$numchoc)
J <- length(cookie$numchoc)
c <- sqrt(20)
c(2*ybar + c^2/J - sqrt(4*ybar*c^2/J + c^4/J^2), 2*ybar + c^2/J + sqrt(4*ybar*c^2/J + c^4/J^2))/2
[1] 13.64160 16.71778
Compute mean(results[1,] < lambda & lambda < results[2,]). Discuss whether your finding matches the theoretical guarantee developed in Item 3. ANSWER: Below you will find the R code used. The simulated coverage is consistent with the theoretical guarantee developed in Item 3; in fact, the lower bound of 3/4 (when \(c=2\)) is quite conservative.
# Set a value for lambda to generate artificial datasets following a Poisson distribution
lambda <- 14
# Set c for the desired coverage guarantee 1 - 1/c^2
c <- 2
# Create a function depending on the sample size J
cons.ci <- function(J)
{
  y <- rpois(J, lambda)
  ybar <- mean(y)
  c(2*ybar + c^2/J - sqrt(4*ybar*c^2/J + c^4/J^2), 2*ybar + c^2/J + sqrt(4*ybar*c^2/J + c^4/J^2))/2
}
# Construct the interval 10^4 times; you can change J into something else
results <- replicate(10^4, cons.ci(J))
# Calculate the proportion of intervals that contain lambda
mean(results[1,] < lambda & lambda < results[2,])
[1] 0.9553
NOTE: It is not necessary to match the simulation design to the Chips Ahoy example, but you may do so. For one thing, we really do not know the real value of \(\lambda\).
R NOTE: The object J in cons.ci(J) points to the \(J\) available in memory, which should match J <- length(cookie$numchoc). The J in the function definition for cons.ci is only a placeholder.
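A minimal sketch of this scoping behavior, using throwaway names of our own choosing (f, g) rather than anything from the exercise:

J <- 10                   # a "global" J, as in the cookie example
f <- function(J) J + 1    # this J is only a placeholder (formal argument)
f(3)                      # returns 4: the argument 3 is used, not the global J
f(J)                      # returns 11: here the global J is passed explicitly
g <- function() J + 1     # no argument: J is looked up in the global environment
g()                       # returns 11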
This exercise should force you to start working on the examples and exercises in the book. This exercise is a version of LM Examples 5.4.2, 5.4.6, 5.7.1, and LM Exercises 5.4.18, 5.7.4, 5.7.5. You may also need to use integration by parts at some point for the mathematical derivations.
Let \(Y_1,\ldots,Y_n\) be IID \(\mathsf{U}\left(0,\theta\right)\), where \(\theta>0\). The common pdf of these random variables is given by \[f\left(y\right)=\begin{cases}\dfrac{1}{\theta} & \ \ \ 0\leq y\leq \theta\\ 0 & \ \ \ \mathrm{otherwise} \end{cases},\] the common mean is \(\theta/2\), and the common variance is \(\theta^2/12\). The estimand for this exercise is \(\theta\).
You will be considering three estimators for \(\theta\): \[\widehat{\theta}_1=\left(n+1\right)Y_{\mathsf{min}}, \ \ \ \widehat{\theta}_2=2\overline{Y}, \ \ \ \widehat{\theta}_3=\dfrac{n+1}{n}Y_{\mathsf{max}}.\] You will learn later that \(\widehat{\theta}_1\), which depends on the smallest order statistic \(Y_{\mathsf{min}}\), is a bias-adjusted estimator that applies the plug-in principle. You will also learn later that \(\widehat{\theta}_2=2\overline{Y}\) is a method of moments estimator. Finally, you will learn later that \(\widehat{\theta}_3\), which depends on the largest order statistic \(Y_{\mathsf{max}}\), is a bias-adjusted maximum likelihood estimator.
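As a quick illustration of how the three estimators are computed in practice (a sketch of ours; the choices \(n=5\) and \(\theta=1\) are arbitrary and not part of the exercise), the snippet below evaluates all three on a single simulated sample. The full simulation study appears in a later item.

# One simulated sample from U(0, theta) with n = 5 and theta = 1 (illustrative values)
n <- 5
theta <- 1
y <- runif(n, min = 0, max = theta)
theta1hat <- (n + 1) * min(y)        # (n+1) * Y_min
theta2hat <- 2 * mean(y)             # 2 * Ybar
theta3hat <- (n + 1) / n * max(y)    # (n+1)/n * Y_max
c(theta1hat, theta2hat, theta3hat)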
ANSWER: First, let me write down all the calculus-based results I am going to need. The cdf for \(\mathsf{U}\left(0,\theta\right)\) is given by \[F\left(y\right)=\int_{-\infty}^y f\left(s\right)\,\,ds=\begin{cases} 0 & \mathrm{if}\ y<0\\ \displaystyle\int_{-\infty}^0 f\left(s\right)\,\,ds+\int_0^y f\left(s\right)\,\,ds = \int_0^y \frac{1}{\theta}\,\,ds=\frac{y}{\theta} & \mathrm{if}\ 0\leq y \leq \theta\\ \displaystyle\int_{-\infty}^\theta f\left(s\right)\,\,ds+\int_\theta^y f\left(s\right)\,\,ds = 1 & \mathrm{if}\ y > \theta \end{cases}\] Next, by Theorem 3.10.1, the pdfs of \(Y_{\mathsf{max}}\) and \(Y_{\mathsf{min}}\) are given by \[f_{Y_{\mathsf{max}}}\left(y\right)=\frac{ny^{n-1}}{\theta^n}, \ \ f_{Y_{\mathsf{min}}}\left(y\right)=\dfrac{n}{\theta}\left(1-\frac{y}{\theta}\right)^{n-1}\] for \(0\leq y\leq \theta\). Finally, I collect the following integration results below to tidy up the calculations later.
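For completeness, the integration results being referred to, which are exactly the values used in the calculations that follow, are \[\int_0^\theta s\left(1-\frac{s}{\theta}\right)^{n-1}\,ds=\frac{\theta^2}{n\left(n+1\right)}, \ \ \ \int_0^\theta s^2\left(1-\frac{s}{\theta}\right)^{n-1}\,ds=\frac{2\theta^3}{n\left(n+1\right)\left(n+2\right)},\] \[\int_0^\theta s^n\,ds=\frac{\theta^{n+1}}{n+1}, \ \ \ \int_0^\theta s^{n+1}\,ds=\frac{\theta^{n+2}}{n+2}.\] The first two follow from the substitution \(u=s/\theta\) together with integration by parts; the last two are direct power-rule integrals.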
ANSWER:
We need to derive the pdf of \(Y_{\mathsf{min}}\) and to directly calculate some integrals. These results were collected earlier. So, we have
\[\mathbb{E}\left(\widehat{\theta}_1\right)=\left(n+1\right)\mathbb{E}\left(Y_{\mathsf{min}}\right)=\displaystyle\frac{\left(n+1\right)n}{\theta}\int_0^\theta s\left(1-\frac{s}{\theta}\right)^{n-1}\,\, ds= \frac{\left(n+1\right)n}{\theta} \cdot \frac{\theta^2}{n\left(n+1\right)} =\theta\] Because \(Y_1,\ldots, Y_n\) are IID \(\mathsf{U}\left(0,\theta\right)\), \(\mathbb{E}\left(\overline{Y}\right)=\theta/2\). Therefore,
\[\mathbb{E}\left(\widehat{\theta}_2\right)=2\mathbb{E}\left(\overline{Y}\right)=2\theta/2=\theta\] We also need to derive the pdf of \(Y_{\mathsf{max}}\) and to directly calculate some integrals. So, we have \[\displaystyle \mathbb{E}\left(\widehat{\theta}_3\right)=\frac{n+1}{n}\mathbb{E}\left(Y_{\mathsf{max}}\right)=\frac{n+1}{\theta^n}\int_0^\theta s^n\,\, ds = \frac{n+1}{\theta^n}\cdot\frac{\theta^{n+1}}{n+1}=\theta\]
The easiest to work out was \(\mathbb{E}\left(\widehat{\theta}_2\right)\) because you do not need the distribution of \(\overline{Y}\): the linearity of expectation applied to \(\overline{Y}\) substantially simplifies the calculations compared to the other two estimators.
NOTE: \(Y_{\mathsf{min}}\) took a lot more time to work out compared to \(Y_{\mathsf{max}}\), as the latter does not require integration by parts.
ANSWER: Using the previously collected results, we have
\[\mathbb{E}\left(\widehat{\theta}_1^2\right)= \left(n+1\right)^2\mathbb{E}\left(Y_{\mathsf{min}}^2\right)=\frac{\left(n+1\right)^2 n}{\theta} \cdot \frac{2\theta^3}{n\left(n+1\right)\left(n+2\right)}=\frac{2\left(n+1\right)\theta^2}{n+2}.\] As a result, \[\mathsf{Var}\left(\widehat{\theta}_1\right)=\frac{2\left(n+1\right)\theta^2}{n+2}-\theta^2=\frac{n}{n+2}\theta^2.\] Next, \(Y_1,\ldots, Y_n\) are IID \(\mathsf{U}\left(0,\theta\right)\), so \(\mathsf{Var}\left(\overline{Y}\right)=\theta^2/\left(12n\right)\). Therefore, \[\mathsf{Var}\left(\widehat{\theta}_2\right)=4\mathsf{Var}\left(\overline{Y}\right)=\frac{1}{3n}\theta^2.\] Finally, \[\mathbb{E}\left(\widehat{\theta}_3^2\right)=\left(\frac{n+1}{n}\right)^2\mathbb{E}\left(Y_{\mathsf{max}}^2\right)=\left(\frac{n+1}{n}\right)^2 \cdot \frac{n}{\theta^n}\int_0^\theta s^{n+1}\,\, ds=\frac{\left(n+1\right)^2}{n\left(n+2\right)}\theta^2.\] Therefore, \[\mathsf{Var}\left(\widehat{\theta}_3\right)=\frac{\left(n+1\right)^2}{n\left(n+2\right)}\theta^2-\theta^2=\frac{1}{n\left(n+2\right)}\theta^2\]
\(\widehat{\theta}_3\) has a variance that goes to zero more rapidly compared to \(\widehat{\theta}_1\) and \(\widehat{\theta}_2\).
ANSWER: The bias of \(Y_{\mathsf{min}}\) for \(\theta\) is \(\displaystyle\mathbb{E}\left(Y_{\mathsf{min}}\right)-\theta=-\frac{n}{n+1}\theta\). The bias of \(2\overline{Y}\) for \(\theta\) is 0. The bias of \(Y_{\mathsf{max}}\) for \(\theta\) is \(\displaystyle\mathbb{E}\left(Y_{\mathsf{max}}\right)-\theta=-\frac{1}{n+1}\theta\). Therefore, only \(2\overline{Y}\) is unbiased for \(\theta\).
The limits of the biases of \(2\overline{Y}\) and \(Y_{\mathsf{max}}\) as \(n\to\infty\) are both equal to zero. Therefore, these two estimators are asymptotically unbiased for \(\theta\).
Since \(\displaystyle\mathbb{E}\left(Y_{\mathsf{max}}\right)=\frac{n}{n+1}\theta\), we must have \(\displaystyle\mathbb{E}\left(\frac{n+1}{n}Y_{\mathsf{max}}\right)=\theta\). Observe that \(\displaystyle\frac{n+1}{n}Y_{\mathsf{max}}\) is exactly \(\widehat{\theta}_3\).
In a similar fashion, since \(\displaystyle\mathbb{E}\left(Y_{\mathsf{min}}\right)=\frac{1}{n+1}\theta\), we must have \(\displaystyle\mathbb{E}\left(\left(n+1\right)Y_{\mathsf{min}}\right)=\theta\). Observe that \(\displaystyle\left(n+1\right)Y_{\mathsf{min}}\) is exactly \(\widehat{\theta}_1\). Therefore, in both cases, we can justify the label bias-adjusted estimators.
Since \(\widehat{\theta}_1\), \(\widehat{\theta}_2\), and \(\widehat{\theta}_3\) are all unbiased for \(\theta\), they are also asymptotically unbiased for \(\theta\).
From Item 2, \[\lim_{n\to\infty}\mathsf{Var}\left(\widehat{\theta}_1\right)=\lim_{n\to\infty}\frac{n}{n+2}\theta^2=\theta^2 \neq 0.\] Therefore, \(\widehat{\theta}_1\) is not squared-error consistent for \(\theta\).
From Item 2, observe that both \(\widehat{\theta}_2\) and \(\widehat{\theta}_3\) have variances that go to zero as \(n\to\infty\). Because \(\widehat{\theta}_2\) and \(\widehat{\theta}_3\) are asymptotically unbiased and have variances that go to zero asymptotically, both these estimators are squared-error consistent for \(\theta\).
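To spell out the squared-error consistency argument using the results from Items 1 and 2: all three estimators are unbiased, so the mean squared error of each equals its variance, \[\mathsf{MSE}\left(\widehat{\theta}_1\right)=\frac{n}{n+2}\theta^2 \to \theta^2, \ \ \ \mathsf{MSE}\left(\widehat{\theta}_2\right)=\frac{\theta^2}{3n} \to 0, \ \ \ \mathsf{MSE}\left(\widehat{\theta}_3\right)=\frac{\theta^2}{n\left(n+2\right)} \to 0\] as \(n\to\infty\). Only the last two limits are zero, which is why \(\widehat{\theta}_2\) and \(\widehat{\theta}_3\) are squared-error consistent while \(\widehat{\theta}_1\) is not.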
Below you see sample code to demonstrate the apply() function. To learn more about it, type ?apply. The idea is to “apply” a function (either built-in or user-defined) to either the rows or columns of an array. This is a convenient way to repeatedly do something to the columns or rows of an array without explicitly writing a for loop. For example, you encountered colMeans(), which may be formulated in terms of apply().
n <- 5
mu <- 1
sigma.sq <- 4
nsim <- 8 # number of realizations to be obtained
# repeatedly obtain realizations
ymat <- replicate(nsim, rnorm(n, mu, sqrt(sigma.sq)))
ymat
[,1] [,2] [,3] [,4] [,5] [,6] [,7]
[1,] 1.3904051 -0.03111589 2.369097 2.213403 2.22615523 -2.934541 2.3109174
[2,] 0.3284039 2.29255777 1.778515 3.661879 1.00847223 -2.319194 3.6014871
[3,] -2.1726322 0.03044256 -3.045596 1.200483 3.73690601 -1.535897 3.0837312
[4,] 0.6038961 0.26164353 5.022951 2.845416 0.01033531 2.566453 0.4153617
[5,] 0.7250735 1.18166540 2.054126 2.044209 2.23500607 3.795071 1.8485274
[,8]
[1,] 3.9297191
[2,] 2.2290166
[3,] 1.7287033
[4,] -0.6197381
[5,] -1.8608037
apply(ymat, 2, mean)
[1] 0.17502928 0.74703867 1.63581857 2.39307784 1.84337497 -0.08562156
[7] 2.25200495 1.08137944
apply(ymat, 1, mean)
[1] 1.4342550 1.5726422 0.3782676 1.3882898 1.5028593
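As a follow-up to the colMeans() remark above, you can verify on the ymat just created that apply() over columns reproduces colMeans():

# Should return TRUE: apply(ymat, 2, mean) and colMeans(ymat) compute the same thing
all.equal(apply(ymat, 2, mean), colMeans(ymat))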
Use the apply(), min(), max(), and runif() commands for this item. Your tasks are (a) to apply the estimators \(\widehat{\theta}_1\), \(\widehat{\theta}_2\), and \(\widehat{\theta}_3\) to the artificial datasets, (b) to produce histograms of the sampling distributions of \(\widehat{\theta}_1\), \(\widehat{\theta}_2\), and \(\widehat{\theta}_3\), and (c) to report your own R code by modifying the code below. What do you notice about the sampling distributions of each estimator?
theta <- 1
nsim <- 10^4 # number of realizations to be obtained
# repeatedly obtain realizations
ymat <- replicate(nsim, runif(5, min = 0, max = theta))
# calculate estimates
theta1hat <- 6*apply(ymat, 2, min)
theta2hat <- 2*colMeans(ymat)
theta3hat <- 6/5*apply(ymat, 2, max)
# put a 1x3 canvas for the case of n=5
par(mfrow=c(1,3))
hist(theta1hat, freq = FALSE)
hist(theta2hat, freq = FALSE)
hist(theta3hat, freq = FALSE)
c(mean(theta1hat), mean(theta2hat), mean(theta3hat))
[1] 0.9935800 0.9930326 0.9979351
c(var(theta1hat), var(theta2hat), var(theta3hat))
[1] 0.70735197 0.06603466 0.02905609
# n=500
ymat <- replicate(nsim, runif(500, min = 0, max = theta))
# calculate estimates
theta1hat <- 501*apply(ymat, 2, min)
theta2hat <- 2*colMeans(ymat)
theta3hat <- 501/500*apply(ymat, 2, max)
# put a 1x3 canvas for the case of n=500
par(mfrow=c(1,3))
hist(theta1hat, freq = FALSE)
hist(theta2hat, freq = FALSE)
hist(theta3hat, freq = FALSE)
c(mean(theta1hat), mean(theta2hat), mean(theta3hat))
[1] 1.0046896 1.0000454 0.9999998
c(var(theta1hat), var(theta2hat), var(theta3hat))
[1] 9.782793e-01 6.656409e-04 3.967086e-06
Report c(mean(theta1hat), mean(theta2hat), mean(theta3hat)) and c(var(theta1hat), var(theta2hat), var(theta3hat)) when \(n=5\) and \(n=500\). Compare with the theoretical results obtained in Items 1 and 2. ANSWER: Findings from the simulation can be found in Item 5, where we fixed \(\theta=1\) by design. We should see values close to 1 for c(mean(theta1hat), mean(theta2hat), mean(theta3hat)) regardless of whether \(n=5\) or \(n=500\). For c(var(theta1hat), var(theta2hat), var(theta3hat)), we should obtain values roughly around 0.714, 0.067, and 0.029, respectively, when \(n=5\). When \(n=500\), we should obtain values roughly around 1, \(6.7\times 10^{-4}\), and \(4\times 10^{-6}\).
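As a cross-check (a small sketch of ours, not part of the reported solution), the theoretical variances from Item 2 can be evaluated in R for \(\theta=1\) and compared with the simulated values above:

# Theoretical variances n/(n+2), 1/(3n), and 1/(n(n+2)) for theta = 1
theo.var <- function(n) c(n/(n + 2), 1/(3*n), 1/(n*(n + 2)))
theo.var(5)    # approximately 0.714, 0.067, 0.029
theo.var(500)  # approximately 0.996, 6.7e-04, 4.0e-06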
The theoretical and simulation results match very well for the case where \(\theta=1\).