Homework 01 Solution

Exercise A: ID numbers and contributions

1 bonus point earned for doing this exercise properly.

Exercise B: Chips Ahoy, Part 1

You have personally observed that the number of chocolate chips on a Chips Ahoy cookie is not the same across all cookies. In other words, there is variability in the number of chocolate chips even though these cookies are mass-produced and have quality control standards. You will now go through an argument meant to motivate the choice of a parametric statistical model to describe the variability of \(X\).

Let \(m\) be the total number of cookies produced from a well-mixed batch of cookie dough. Let \(n\) be the total number of chocolate chips present in that batch of cookie dough. Let \(X_i\) be a Bernoulli random variable which records whether or not chip \(i\) is placed on a given cookie, and let \(X\) be the total number of chocolate chips on that cookie. The probability that any one chip is placed on that cookie is \(1/m\).

  1. (2 points) What distribution would be suitable for the random variable \(X\)? Do you need additional assumptions to ensure the suitability of this distribution?

    • Based on the description provided, we could choose \(X_i\sim \mathrm{Ber}\left(1/m\right)\) for all \(i\). If \(X_1,\ldots,X_n\) are additionally mutually independent, we have \(X\sim \mathrm{Bin}\left(n,1/m\right)\).
  2. (3 points) Given your distribution in Item 1, what would be \(\mathbb{E}\left(X\right)\)? Why would this quantity be of interest to a consumer of Chips Ahoy and to the factory managing the production of Chips Ahoy?

    • \(\mathbb{E}\left(X\right)=n/m\). Consumers would definitely be interested in the magnitude of \(n/m\) as they are buying chocolate chip cookies. They would likely prefer more chocolate chips to fewer. The factory managing the production of Chips Ahoy would also be interested in the magnitude of \(n/m\) as it can be used to make approximate cost calculations and potentially decide on pricing.
    • NOTE: There may be many plausible answers here.
  3. (2 points) Do you think \(m\) or \(n\) or both would be available to the consumer? to the factory management? Explain.

    • It is very unlikely that either \(m\) or \(n\) would be available to the consumer, but both would be available to the factory management. Consumers simply do not have access to information on the factory's purchases of chocolate chips or the number of cookies produced.
    • NOTE: There may be many plausible answers here.
  4. (1 point) Suppose \(n\) and \(m\) are both large. Use LM Theorem 4.2.1 to suggest an alternative distribution in case information about both \(m\) and \(n\) is not available. Why would this alternative distribution be more attractive compared to the distribution in Item 1?

    • When both \(n\) and \(m\) are large, and provided that \(n/m\) remains constant, LM Theorem 4.2.1 allows us to consider \(X\sim \mathrm{Poi}\left(n/m\right)\) as an approximation. This alternative distribution may be more attractive because it depends only on \(n/m\), so it can be used without knowing \(n\) and \(m\) separately. Consumers are in exactly this situation, and they can use the cookies produced to provide evidence for or against factory management claims (for example).
    • NOTE: There may be many plausible answers here.
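
The quality of the Poisson approximation in Item 4 can be checked numerically. The following is a minimal R sketch, not part of the original solution; the values n = 20000 and m = 1000 are hypothetical choices made purely for illustration.

```r
# Hypothetical batch sizes (assumed for illustration, not from the exercise)
n <- 20000  # total chips in the batch
m <- 1000   # total cookies made from the batch

k <- 0:60   # chip counts per cookie to compare

binom_pmf <- dbinom(k, size = n, prob = 1 / m)  # Bin(n, 1/m) from Item 1
pois_pmf  <- dpois(k, lambda = n / m)           # Poi(n/m) from Item 4

# Largest pointwise gap between the two pmfs over the chosen k
max_gap <- max(abs(binom_pmf - pois_pmf))
print(max_gap)

# Side-by-side view for k = 15, ..., 25, around the mean n/m = 20
print(round(cbind(k, binom_pmf, pois_pmf)[16:26, ], 4))
```

Note that the Poisson column needs only the ratio n/m, whereas the binomial column needs n and m separately.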

Exercise C: How do you actually ensure IID conditions in real life?

Imagine you have a container and you cannot see the inside of this container. Inside this container, you have 5 ping-pong balls which have no discernible features apart from a numerical label:

[Figure: Numerical labels on balls in an urn (1, 2, 2, 4, 6)]

While blindfolded, draw two balls from this container with replacement. What this means is that after drawing the first ball, you return the ball before drawing the second ball. Assume that someone is right by your side recording the outcome of the two draws. This procedure is called simple random sampling. Define two random variables \(Y_1\) and \(Y_2\) to be the recorded outcome of the first and second draws, respectively. Further define another random variable \[\overline{Y}=\frac{Y_1+Y_2}{2}\] You can assume that all possible outcomes of the pair \((Y_1, Y_2)\) are equally likely to occur.

  1. (2 points) Obtain the probability mass function of \(\overline{Y}\).

    \(\overline{y}\)                                       1     1.5   2     2.5   3     3.5   4     5     6
    \(\mathbb{P}\left(\overline{Y}=\overline{y}\right)\)   1/25  4/25  4/25  2/25  4/25  2/25  5/25  2/25  1/25

  2. (2 points) Using what you have found in Item 1, calculate \(\mathbb{E}\left(\overline{Y}\right)\) and \(\mathsf{Var}\left(\overline{Y}\right)\).

    • \(\mathbb{E}\left(\overline{Y}\right)=\dfrac{1(1)+1.5(4)+2(4)+2.5(2)+3(4)+3.5(2)+4(5)+5(2)+6(1)}{25}=3\)
    • \(\mathbb{E}\left[\left(\overline{Y}\right)^2\right]=\dfrac{1^2(1)+1.5^2(4)+2^2(4)+2.5^2(2)+3^2(4)+3.5^2(2)+4^2(5)+5^2(2)+6^2(1)}{25}=10.6\), so \(\mathsf{Var}\left(\overline{Y}\right)=10.6-3^2=1.6\)
  3. (1 point) \(Y_1\) and \(Y_2\) are independent random variables. Can you explain why?

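    • Yes, they are independent because the balls are drawn with replacement: the first ball is returned before the second draw, so the contents of the container, and hence the distribution of \(Y_2\), do not depend on the outcome of \(Y_1\).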
  4. (1 point) Do you think \(Y_1\) and \(Y_2\) are identically distributed random variables? Show why or why not.

    • Yes, because both have a common distribution with the following pmf:

    \(y\)                             1     2     4     6
    \(\mathbb{P}\left(Y=y\right)\)    1/5   2/5   1/5   1/5

  5. (1 point) Can you think of a way to change the conditions of the experiment so that you can break the independence of \(Y_1\) and \(Y_2\)?

    • Drawing the balls without replacement would break the independence of \(Y_1\) and \(Y_2\), because the ball removed in the first draw is no longer available for the second draw.
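
The probability mass function, mean, and variance of \(\overline{Y}\), as well as the independence of \(Y_1\) and \(Y_2\), can be verified by listing all 25 equally likely outcomes. The following is a minimal R sketch of that enumeration; the variable names are chosen here for illustration.

```r
y <- c(1, 2, 2, 4, 6)  # labels on the five balls

# All 25 equally likely ordered pairs (Y1, Y2) when drawing with replacement
draws <- expand.grid(y1 = y, y2 = y)
ybar  <- (draws$y1 + draws$y2) / 2

# Exact pmf of Ybar: number of outcomes per value, divided by 25
pmf <- table(ybar) / nrow(draws)
print(pmf)

# Exact mean and variance of Ybar
e_ybar   <- mean(ybar)
var_ybar <- mean(ybar^2) - e_ybar^2
print(c(e_ybar, var_ybar))

# Independence: the joint pmf equals the product of the marginal pmfs
joint   <- table(draws$y1, draws$y2) / nrow(draws)
product <- outer(rowSums(joint), colSums(joint))
print(all.equal(as.vector(joint), as.vector(product)))
```

The last line compares the joint probabilities of \((Y_1, Y_2)\) with the products of the marginal probabilities, which is the definition of independence for discrete random variables.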

Now take the perspective of someone who knows the contents of the container. From this perspective, you actually have a list of numbers \(\{1,2,2,4,6\}\). You can think of this list as a population of size \(N=5\). Recall that if you have a list of numbers \(\{x_1,\ldots,x_N\}\), the population mean and the population variance are given by \(\mu=\displaystyle\frac{1}{N}\sum_{i=1}^N x_i\) and \(\sigma^2=\displaystyle\frac{1}{N}\sum_{i=1}^N \left(x_i-\mu\right)^2\). Applying these formulas gives \(\mu=3\) and \(\sigma^2=16/5\), which you can check for yourself.

  6. (2 points) Discuss how simple random sampling is able to produce the properties of the sample mean under IID conditions.

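    • Observe that \(\mathbb{E}\left(\overline{Y}\right)=\mu=3\) and \(\mathsf{Var}\left(\overline{Y}\right)=\sigma^2/n=(16/5)/2=1.6\), where \(n=2\) is the sample size. These agree with the values computed earlier from the probability mass function of \(\overline{Y}\). Therefore, simple random sampling with replacement reproduces the mean and variance that the sample mean has under IID conditions.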
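
The two identities relating \(\overline{Y}\) to \(\mu\) and \(\sigma^2\) can be derived in a few lines. The sketch below assumes only that each draw is equally likely to be any of the five balls, so that \(\mathbb{E}\left(Y_i\right)=\mu\) and \(\mathsf{Var}\left(Y_i\right)=\sigma^2\), and that \(Y_1\) and \(Y_2\) are independent:

\[\begin{aligned}
\mathbb{E}\left(\overline{Y}\right) &= \frac{\mathbb{E}\left(Y_1\right)+\mathbb{E}\left(Y_2\right)}{2} = \frac{\mu+\mu}{2} = \mu = 3,\\
\mathsf{Var}\left(\overline{Y}\right) &= \frac{\mathsf{Var}\left(Y_1\right)+\mathsf{Var}\left(Y_2\right)}{4} = \frac{2\sigma^2}{4} = \frac{\sigma^2}{2} = \frac{16/5}{2} = 1.6.
\end{aligned}\]

Independence is used only in the variance line, where it removes the covariance term \(2\,\mathsf{Cov}\left(Y_1,Y_2\right)/4\).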

Finally, you are going to use R and simulate the distribution of \(\overline{Y}\) based on the earlier description. Adapt the following code to suit the situation. Try out the code for yourself first.

# Your list of numbers, represented as a vector in R using c() 
x <- c(2, 5, 1, 3, 1, 5, 7, 6)
# To draw a simple random sample from a list of numbers, use sample(). 
# Produce a simple random sample of size 3 with replacement. Try running this code a few times. 
sample(x, 3, replace = TRUE)
# To learn more about the sample command, see the help file. 
?sample 
  7. (3 points) Write down R code to simulate 10000 draws from the distribution of \(\overline{Y}\) based on the description at the beginning of the exercise. Use table() to display the frequency distribution of your simulated distribution. How well does your simulated distribution match the theoretical distribution found in Item 1?
y <- c(1, 2, 2, 4, 6)
ymat <- replicate(10^4, sample(y, 2, replace = TRUE))
ybars <- colMeans(ymat)
table(ybars) # Frequency
#> ybars
#>    1  1.5    2  2.5    3  3.5    4    5    6 
#>  385 1634 1582  782 1665  757 2011  802  382 
table(ybars)/10^4 # Relative frequencies based on simulated distribution
#> ybars
#>      1    1.5      2    2.5      3    3.5      4      5      6 
#> 0.0385 0.1634 0.1582 0.0782 0.1665 0.0757 0.2011 0.0802 0.0382 
c(1, 4, 4, 2, 4, 2, 5, 2, 1)/25 # Theoretical distribution 
#> [1] 0.04 0.16 0.16 0.08 0.16 0.08 0.20 0.08 0.04
c(mean(ybars), var(ybars)) # should be close to 3 and 1.6
#> [1] 2.994550 1.584154
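
The same simulation approach can illustrate what happens when the balls are drawn without replacement, the change discussed earlier as a way to break independence. The following is a minimal R sketch; the seed and the variable names are arbitrary choices made here, and the reference values come from standard finite-population results rather than from the exercise.

```r
set.seed(1)  # arbitrary seed, assumed here only for reproducibility

y <- c(1, 2, 2, 4, 6)

# Each column is one pair (Y1, Y2) drawn WITHOUT replacement
ymat_wor <- replicate(10^4, sample(y, 2, replace = FALSE))

# Under independence the correlation would be 0; without replacement,
# finite-population theory gives -1/(N - 1) = -0.25 for N = 5
cor_wor <- cor(ymat_wor[1, ], ymat_wor[2, ])
print(cor_wor)

# The variance of the sample mean is now (sigma^2 / 2) * (N - 2)/(N - 1) = 1.2,
# smaller than the value 1.6 obtained with replacement
var_wor <- var(colMeans(ymat_wor))
print(var_wor)
```

A clearly negative correlation between the first and second draws indicates that \(Y_1\) and \(Y_2\) are no longer independent under this sampling scheme.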