Homework 01

Instructions

For this homework, you may choose to work with another classmate enrolled in the course. If you work as a pair, only one member of the pair should submit at SPOC. If you work alone, submit your solutions in the usual way.
Submit your solutions in PDF format to our SPOC website. The file name should be of the form IDNumber1_IDNumber2_HW01.pdf, where IDNumber1 and IDNumber2 are the corresponding university ID numbers of the pair who worked on this homework. If you worked alone, then use the file name IDNumber_HW01.pdf. The file size should be less than 20 MB.
The deadline is March 12, 2023 at 17:00 Beijing time. This deadline is also indicated in the submission form at SPOC. Make sure to click on the Submit button after you have uploaded your assignment, so that it does not stay in Draft mode.
You are allowed to discuss the exercises with other classmates, but you have to write up your own solutions. This means you are not allowed to directly copy or even look at solutions from any other source.
Some of the solutions may be chosen for presentation in class. If you or your partner are unable to explain your answers in class, then both of you will get zero credit for this homework.
If you do not provide the table in Exercise A, you will automatically get zero credit for this homework. If you do provide the table in Exercise A, you get at least a 1 point bonus right away.

Exercise A: ID numbers and contributions

Create a table which lists every classmate involved in the development of your solutions. Only put down ID numbers. Indicate the contributions made by each member of the pair. Also indicate what you have discussed with other classmates, along with their ID numbers. If you work alone, then there is no need to be explicit about your own contributions. But if you discussed the exercises with other classmates, make sure to indicate what you have discussed with them.

The table has the following form:

ID number	Write YES if ID number belongs to submitter.	Contribution

Here is an example, which is not very complete, but conveys the general idea.

ID number	Write YES if ID number belongs to submitter.	Contribution
ID number 1	YES	Worked on R code.
ID number 2	YES	Wrote up the solutions.
ID number 3		Discussed Exercise B Item 3.

If you are working alone, your table might look like:

ID number	Write YES if ID number belongs to submitter.	Contribution
ID number 1	YES
ID number 2		Discussed Exercise B Item 3.
ID number 3		Discussed Exercise C Item 4.

Exercise B: Chips Ahoy, Part 1

You have personally observed that the number of chocolate chips on a Chips Ahoy cookie is not the same across all cookies. In other words, there is variability in the number of chocolate chips even though these cookies are mass-produced and have quality control standards. You will now go through an argument meant to motivate the choice of a parametric statistical model to describe the variability of the number of chocolate chips on a cookie, which is denoted by \(X\) below.

Let \(m\) be the total number of cookies produced from a well-mixed batch of cookie dough. Let \(n\) be the total number of chocolate chips present in that batch of cookie dough. Fix one cookie from the batch. For \(i=1,\ldots,n\), let \(X_i\) be a Bernoulli random variable which records whether or not the \(i\)th chip is placed on that cookie, and let \(X\) be the total number of chocolate chips on that cookie. The probability that any particular chip ends up on that cookie is \(1/m\).
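To make the setup concrete, here is a minimal R sketch of one batch. The values of `m` and `n` are hypothetical and chosen only for illustration, and the sketch assumes that each chip lands on a cookie independently of the other chips, which is an assumption of this sketch rather than something stated above. The variable names are also made up for this illustration.

```r
# Hypothetical batch sizes, chosen only for illustration.
m <- 50    # number of cookies in the batch
n <- 500   # number of chocolate chips in the batch

set.seed(1)  # fix the random seed so that the run is reproducible

# Assumption: each chip lands on one of the m cookies, each with
# probability 1/m, independently of the other chips.
cookie_of_chip <- sample(m, n, replace = TRUE)

# X_i for cookie 1: indicator that chip i landed on cookie 1.
x_i <- as.integer(cookie_of_chip == 1)

# X for cookie 1: total number of chips on that cookie.
x_cookie1 <- sum(x_i)

# Chip counts for every cookie in the batch; these always add up to n.
chips_per_cookie <- tabulate(cookie_of_chip, nbins = m)
chips_per_cookie
```

Running the sketch a few times with different seeds gives a feel for how much the chip count on a single cookie varies from batch to batch.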

1. (2 points) What distribution would be suitable for the random variable \(X\)? Do you need additional assumptions to ensure the suitability of this distribution?
2. (3 points) Given your distribution in Item 1, what would be \(\mathbb{E}\left(X\right)\)? Why would this quantity be of interest to a consumer of Chips Ahoy and to the factory managing the production of Chips Ahoy?
3. (2 points) Do you think \(m\) or \(n\) or both would be available to the consumer? To the factory management? Explain.
4. (1 point) Suppose \(n\) and \(m\) are both large. Use LM Theorem 4.2.1 to suggest an alternative distribution in case information about both \(m\) and \(n\) is not available. Why would this alternative distribution be more attractive compared to the distribution in Item 1?

Exercise C: How do you actually ensure IID conditions in real life?

Imagine you have a container and you cannot see the inside of this container. Inside this container, you have 5 ping-pong balls which have no discernible features apart from a numerical label:

[Figure: Numerical labels on balls in an urn]

While blindfolded, draw two balls from this container with replacement. What this means is that after drawing the first ball, you return the ball before drawing the second ball. Assume that someone is right by your side recording the outcome of the two draws. This procedure is called simple random sampling. Define two random variables \(Y_1\) and \(Y_2\) to be the recorded outcome of the first and second draws, respectively. Further define another random variable \[\overline{Y}=\frac{Y_1+Y_2}{2}\] You can assume that all possible outcomes of the pair \((Y_1, Y_2)\) are equally likely to occur.

1. (2 points) Obtain the probability mass function of \(\overline{Y}\).
2. (2 points) Using what you have found in Item 1, calculate \(\mathbb{E}\left(\overline{Y}\right)\) and \(\mathsf{Var}\left(\overline{Y}\right)\).
3. (1 point) \(Y_1\) and \(Y_2\) are independent random variables. Can you explain why?
4. (1 point) Do you think \(Y_1\) and \(Y_2\) are identically distributed random variables? Show why or why not.
5. (1 point) Can you think of a way to change the conditions of the experiment so that you can break the independence of \(Y_1\) and \(Y_2\)?

Now take the perspective of someone who knows the contents of the container. From this perspective, you actually have a list of numbers \(\{1,2,2,4,6\}\). You can think of this list as a population of size \(N=5\). Recall that if you have a list of numbers \(\{x_1,\ldots,x_N\}\), the population mean and the population variance are given by \(\mu=\displaystyle\frac{1}{N}\sum_{i=1}^N x_i\) and \(\sigma^2=\displaystyle\frac{1}{N}\sum_{i=1}^N \left(x_i-\mu\right)^2\). Applying these formulas gives \(\mu=3\) and \(\sigma^2=16/5\), which you can check for yourself.
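One way to carry out the check is with a few lines of R. This is only a sketch that applies the two formulas above directly; the variable names are chosen for illustration. Note that the built-in `var()` divides by \(N-1\) rather than \(N\), so it does not return the population variance defined above.

```r
x <- c(1, 2, 2, 4, 6)   # the list of numbers in the container
N <- length(x)

mu <- sum(x) / N                 # population mean
sigma2 <- sum((x - mu)^2) / N    # population variance, dividing by N

mu       # 3
sigma2   # 3.2, which is 16/5

# var() divides by N - 1, so it gives 16/4 = 4 instead of 16/5.
var(x)
```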

6. (2 points) Discuss how simple random sampling is able to produce the properties of the sample mean under IID conditions.

Finally, you are going to use R and simulate the distribution of \(\overline{Y}\) based on the earlier description. Adapt the following code to suit the situation. Try out the code for yourself first.

# Your list of numbers, represented as a vector in R using c()
x <- c(2, 5, 1, 3, 1, 5, 7, 6)
# To draw a simple random sample from a list of numbers, use sample().
# The line below produces a simple random sample of size 3 with replacement.
# Try running it a few times.
sample(x, 3, replace = TRUE)
# To learn more about the sample command, see the help file.
?sample
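Because `sample()` gives different results each time it is run, it can help to fix the random seed while experimenting. The sketch below reuses the example vector above; the seed value 2023 is arbitrary and the names `first` and `second` are chosen only for illustration.

```r
x <- c(2, 5, 1, 3, 1, 5, 7, 6)

set.seed(2023)                         # arbitrary seed, chosen for illustration
first <- sample(x, 3, replace = TRUE)

set.seed(2023)                         # resetting the seed replays the same draws
second <- sample(x, 3, replace = TRUE)

identical(first, second)               # TRUE

# With replacement, the sample size may exceed length(x).
# Without replacement, sample(x, 9) would stop with an error,
# because x has only 8 elements.
length(sample(x, 9, replace = TRUE))   # 9
```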

7. (3 points) Write down R code to simulate 10000 draws from the distribution of \(\overline{Y}\) based on the description at the beginning of the exercise. Use table() to display the frequency distribution of your simulated distribution. How well does your simulated distribution match the theoretical distribution found in Item 1?