Welcome to Mathematical Statistics course webpage!

Course description

Mathematical statistics is one language you can use to discuss statistical topics and applications. In this course, we discuss how to construct estimation and inference procedures that can be applied to real data, and what properties would be desirable in the design of these procedures. Along the way, we discuss how these procedures can shed light on research questions you may have.

This course webpage will serve as a “living” class syllabus. Course materials (notes, homework assignments, etc.) will be linked from here and updated regularly.

Goals and prerequisites

The main goal of the course is for you to gain a deeper conceptual, theoretical, and computational understanding of how statistics is applied in scientific, professional, and industrial contexts. I want you to know what statistical methods are available to you and to understand in which contexts you should apply them. Although some of the methods featured in this course may seem too simple, understanding these simpler methods is ultimately a step towards gaining confidence in properly using more complex ones.

For the prerequisites of the course, you should have already taken and passed a first course in probability theory and a course in differential and integral calculus. We will review some aspects of the prerequisites by applying what you have learned to what are called normal models.

About the course instructor and teaching assistant

My name is Andrew Adrian Pua. I am teaching this course to economics and finance majors, along with Huihui Li (data science majors) and Wei Zhong (statistics majors). Wei Zhong is the main coordinator of the course.

Consultations

I am available to answer questions in four ways:

  • in class: Ask immediately before, during, or after the lecture.

  • via the public mathstat2023 DingTalk group: Asking questions in the DingTalk group is for the benefit of everyone.

  • emailing me at andrewypua at outlook dot com: Expect responses within two working days. If I have not responded, please remind me through email or in class. You may also choose to send an email to set up an appointment. How do you write an email properly? Here is some good advice.

  • physically at the Economics Building B405 during office hours: Tuesdays and Thursdays, 13:00 to 14:00 and 17:00 to 19:00.

Teaching assistant

Your teaching assistant (TA) is Zheng Zhesheng. His office hours are TBA.

Course textbook and references

The coordinator of the course requires the use of the following textbook (I will call it LM):

Larsen, R. J., & Marx, M. L. (2018). An Introduction to Mathematical Statistics and Its Applications (6th ed.). Pearson.

The book is available as a reprinted Chinese edition from the China Machine Press 机械工业出版社. The Chinese title is 数理统计及其应用(英文版,原书第6版). I bought a secondhand copy at 多抓鱼 for 43 yuan, but secondhand copies might be rare. Other copies are available at 京东 and 淘宝.

The course focuses on Chapters 5, 6, 7, 9, 10, and 12. In the notes and during the lecture, I will make references to other chapters of the book.

The coordinator also suggests the following main English language references:

  • (More graduate level) Casella, G., & Berger, R. L. (2002). Statistical Inference (2nd ed.). Brooks/Cole.
  • (Undergraduate to graduate level) Hogg, R. V., McKean, J. W., & Craig, A. T. (2013). Introduction to Mathematical Statistics. Pearson College Division.

There are other references out there and the choice of reference depends on your taste. Follow Roger Koenker’s advice: “In my experience it is always better to find a book that seems slightly below your comfort level and then try to conscientiously read it – by which I mean fill in the details of the arguments along the way and do a reasonable selection of the problems.” Personally, I learned the hard way that not everything written in a book is always right.

  • (Best read as a supplement, a reviewer comments that this book “demonstrates that statistics is not merely a branch of mathematics”) Cox, D. R. and Donnelly, C. A. (2011). Principles of Applied Statistics. Cambridge University Press.

  • (Thinner than standard books, with coverage of more recent topics, author writes the book for the “mathematically” literate and completely avoids normal models and parametric families) Arias-Castro, E. (2022). Principles of Statistical Analysis: Learning from Randomized Experiments. Cambridge University Press. (Free, legal pre-publication version here, R notebook here)

  • (My personal favorite, undergraduate introductory level but makes you think more) Freedman, D. A., Pisani, R., and Purves, R. (1998). Statistics (4th ed.). W. W. Norton & Co.

  • (Undergraduate introductory level, but very different from the usual business statistics book) Stine, R. A. and D. P. Foster. (2011). Statistics for Business: Decision Making and Analysis (3rd ed.). Pearson.

  • (My personal favorite, short chapters, undergraduate to graduate level) Wasserman, L. A. (2004). All of Statistics: A Concise Course in Statistical Inference. Springer-Verlag.

  • (Undergraduate level, first third of the book may be relevant) Goldberger, A. S. (1998). Introductory Econometrics. Harvard University Press.

  • (My personal favorite, undergraduate to graduate level targeted towards economics, first half of the book may be relevant) Goldberger, A. S. (1991). A Course in Econometrics. Harvard University Press.

  • (Short chapters, meant for undergraduate students) Dekking, F. M., Kraaikamp, C., Lopuhaä, H. P., and Meester, L. E. (2005). A Modern Introduction to Probability and Statistics. Springer-Verlag. (Available within XMU only.)

  • (Standard, undergraduate to graduate level, personally prefer this over LM) DeGroot, M. H. and Schervish, M. J. (2011). Probability and Statistics (4th ed.). Pearson.

  • (Standard, undergraduate to graduate level) Devore, J. L., K. N. Berk, and M. A. Carlton. (2021). Modern Mathematical Statistics with Applications (3rd ed.). Springer-Verlag. (Available within XMU only.)

  • (Thinner than standard books, undergraduate to graduate level) Abramovich, F. and Ritov, Y. (2013). Statistical Theory: A Concise Introduction. CRC Press.

  • (The first half of Part I directly tied to the course, but the remainder may be of applied interest) Kenett, R. S. and Zacks, S. (2021). Modern Industrial Statistics: With Applications in R, MINITAB and JMP (3rd ed.). John Wiley & Sons.

  • (Undergraduate to graduate level targeted towards economics) Amemiya, T. (1994). Introduction to Statistics and Econometrics. Harvard University Press.

  • (Graduate level but tailored to the social sciences, suitable for undergraduates with stronger backgrounds) Aronow, P. M. and Miller, B. T. (2019). Foundations of Agnostic Statistics. Cambridge University Press.

  • (Graduate level but targeted towards economics and finance) Gallant, A. R. (1997). An Introduction to Econometric Theory: Measure Theoretic Probability and Statistics with Applications to Economics. Princeton University Press.

  • (Not for everyone, unconventional but dated) Fraser, D. A. S. (1958). Statistics: An Introduction. John Wiley & Sons.

  • (Not for everyone, can be demanding even for the familiar, with emphasis on the likelihood) Fraser, D. A. S. (1976). Probability & Statistics: Theory and Applications. Duxbury.

  • (Not very standard, heavily uses linear algebra, undergraduate to graduate level) Stone, C. J. (1996). A Course in Probability and Statistics. Wadsworth.

  • (Graduate level) Bickel, P. J. and Doksum, K. A. (2015). Mathematical Statistics: Basic Ideas and Selected Topics Volume I. CRC Press.

Course grading

There are five components of the assessment and grading:

  1. Closed-book final exam: This component is worth 40% of the final grade. Coverage is comprehensive. Bring a non-programmable scientific calculator.

  2. Closed-book midterm exam: This component is worth 30% of the final grade. Coverage is likely Chapters 5 and 6, but this is yet to be confirmed. Bring a non-programmable scientific calculator.

  3. Homework, in-class work, work in pursuit of the final project, and statistics diary: This component is a catch-all for the work you put into the course regularly. The large number of items in this component should reduce the pressure on you to copy someone else’s homework solutions (or to let someone else copy yours). I encourage you to make your own mistakes on the homework. This component is worth 15% of the final grade.

    1. Homework is based on material from the textbook and other references. It is to be handed in by the specified date and time.
    2. In-class work is based on activities and worksheets to be submitted during class time.
    3. Work in pursuit of the final project involves documenting the work leading to the final project. Examples include selecting a topic, writing up your understanding of the topic, preliminary results, etc. More details to follow later.
    4. The statistics diary is a modified version of Andrew Gelman’s suggestion in his blog. His blog has examples of students’ work on their diaries. Another example of a diary can be found here. Submit entries at our SPOC website.
  4. Final project: The final project is a short English-language paper on a topic agreed upon between you and the instructor. This component is worth 10% of the final grade. Details about the project can be found here.

  5. Quiz: Quizzes are unannounced but are designed to be short to facilitate immediate feedback. You are allowed to refer to the notes and the textbook. This component is worth 5% of the final grade.

Course policies

  1. Follow instructions. Ask for clarification when something is unclear.
  2. I do not take attendance regularly, but if you are marked or found absent six or more times, then you will automatically not be allowed to take the final exam. You are invited to sleep in class if necessary. If you need to ask for leave, fill out the relevant form here completely, scan it along with supporting evidence/documents, and email it to me; I will take care of the rest. Do this as early as possible. If that is not possible, send me an email letting me know the situation and we can take care of the paperwork later on.
  3. You may use electronics such as a laptop or a tablet but not phones. It is easy to be distracted so do not open anything else with your laptop or tablet aside from class-related material. If I catch you doing something else unrelated to class, then you will be marked absent immediately.
  4. With new technologies making it easier for anyone to avoid hard work, it is becoming more difficult to determine whether a student actually knows or understands anything related to the course. It also becomes harder to assess a student’s individual contribution to submitted work. Therefore, using AI-based tools, translation software, solution manuals (especially those explicitly for instructors only), or other explicitly forbidden material will have severe consequences. As always, cheating in any form will also have severe consequences.
  5. In her syllabi, Deirdre McCloskey writes, “All grades are final. No amount of pleading will change your grade unless I make a mistake in adding up grades. Life is much more unfair than this!” Sending me a message pleading or begging will not help.

Information about using the materials

Ownership and citations

Lecture Materials for Mathematical Statistics (2023 version) by Andrew Adrian Pua is licensed under Attribution-ShareAlike 4.0 International

To cite these slides, please use

Pua, Andrew Adrian (Year, Month Day). Lecture Materials for Mathematical Statistics (2023 Version). https://mathstat.neocities.org/.

Finding typos or unclear portions

If you find typos or unclear portions in the notes, please let me know. I will be monitoring your contributions during the semester and I will acknowledge you in these notes. If you make substantial contributions, I will treat you to some non-alcoholic drinks at Cafe Avion located inside the university.

Resources on time management and learning to learn

I ask you to take this opportunity to reexamine how you learn and study. It does not matter whether your motivation is only to pass the exam or something greater, though it would be good for society if you studied for something much greater. I have found the following resources helpful to students I have taught in the past. Of course, I cannot be sure they will work for you, but do keep an open mind.

Course diary

June 6 and 8

  1. Going back to our first and second cookie datasets, our M&M dataset, slides with analysis
  2. Practice exercises, hints and partial solutions
  3. Notes on the bootstrap
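The mechanics of the nonparametric bootstrap can be sketched in a few lines of R. This is only an illustrative sketch with made-up numbers, not one of our class datasets:

```r
# Nonparametric bootstrap of the sample mean (illustrative data)
set.seed(1)
y <- c(12, 15, 9, 14, 11, 10, 13, 16)
B <- 2000
# Resample the data with replacement B times, recomputing the mean each time
boot.means <- replicate(B, mean(sample(y, replace = TRUE)))
# Bootstrap standard error and a simple percentile interval
sd(boot.means)
quantile(boot.means, c(0.025, 0.975))
```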

May 30, June 1, and 6

Topics:

  1. Extending the binomial distribution
  2. Testing equality of proportions
  3. Testing whether a pre-specified distribution is compatible with the data
  4. Testing for independence

Assigned readings: Testing \(H_0: \ p_X=p_Y\) (Section 9.3), Introduction (Section 10.1) The Multinomial Distribution (Section 10.2), Goodness-of-Fit Tests: All Parameters Known (Section 10.3), Goodness-of-Fit Tests: All Parameters Unknown (Section 10.4), Contingency Tables (Section 10.5), Notes on goodness-of-fit testing

Activities: Does the color distribution of M&M’s in China match the color distribution in the US (at least based on company information from 2008)?

Exercises:

  • Test the goodness-of-fit of the Poisson for Case Study 4.2.2 and the goodness-of-fit of the exponential for Case Study 4.2.4.
  • All exercises in LM Section 10.2
  • Pay attention to Case Study 10.3.2 and Example 10.3.1 (for the latter, you encountered something similar for the median test in Example 5.3.2)
  • All exercises in LM Section 10.3: Pay attention to the computational shortcut in 10.3.1, the different ways of setting up a test in 10.3.4 and 10.3.5.
  • All exercises in LM Section 10.4: Pay attention to 10.4.14 as the distribution here may not be very familiar to you.
  • Pay attention to Case Study 10.4.3 as the data shows up as discrete yet the goodness of fit test is for checking whether normality is compatible with the data.
  • Pay attention to Case Study 10.5.2 as the data are originally continuous variables then converted into two categorical variables!
  • All exercises in LM Section 10.5: Most questions here are really very typical. Perhaps the slightly different one is 10.5.5.
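For the two-sample setting of Section 9.3, base R’s prop.test() carries out the large-sample test of \(H_0:\ p_X=p_Y\). The counts below are hypothetical, purely for illustration:

```r
# Hypothetical counts: 34 successes out of 100 in one sample, 25 out of 100 in the other
successes <- c(34, 25)
totals <- c(100, 100)
# correct = FALSE gives the plain large-sample chi-square version
prop.test(successes, totals, correct = FALSE)
```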

May 30:

  • Examples of goodness-of-fit situations: features, what to pay attention to, similarities and differences
  • The form of the goodness-of-fit statistic and its distribution in large samples
  • In-class exercise to test whether the color distribution of M&M’s in China matches the color distribution in the US (at least based on company information from 2008). M&M data: every student had one dataset and was asked to calculate the test statistic, along with the critical value at the 5% level, and, if possible, a \(p\)-value.
  • R code based on one of your classmates’ dataset:
observed <- c(1, 3, 4, 2, 4, 1)
null.prob <- c(0.13, 0.13, 0.24, 0.2, 0.16, 0.14)
chisq.test(observed, p = null.prob) # Note there is a warning
qchisq(0.95, 5) # critical value which should match the table from textbook
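To see where chisq.test() gets its numbers (and why it warns), you can compute the statistic and the expected counts by hand for the same dataset:

```r
observed <- c(1, 3, 4, 2, 4, 1)
null.prob <- c(0.13, 0.13, 0.24, 0.2, 0.16, 0.14)
# Expected counts under the null; several are below 5, which triggers the warning
expected <- sum(observed) * null.prob
expected
# The chi-square statistic by hand, matching chisq.test()$statistic
stat <- sum((observed - expected)^2 / expected)
stat
# p-value by hand from the chi-square distribution with k - 1 degrees of freedom
1 - pchisq(stat, df = length(observed) - 1)
```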

June 1:

  • Why are there warnings when we applied goodness-of-fit testing to the M&M dataset? We can also apply the test to the totals for the entire class. Complete dataset here
# Load M&M dataset
mandm <- read.csv("mandm-01.csv")
# obtain totals
observed <- apply(mandm[, 2:7], 2, sum)
null.prob <- c(0.13, 0.13, 0.24, 0.2, 0.16, 0.14)
chisq.test(observed, p = null.prob)
  • Applying chi-square testing to test goodness-of-fit for continuous distributions: Not very straightforward and many arbitrary choices are involved.
  • How to do equiprobable partitioning: computationally more involved, but may be preferable because it helps you achieve \(np_{i0}\geq 5\)
# Example 10.3.1 equiprobable partitioning
# finding right endpoint to guarantee probability equal to 0.2
f <- function(c) 3*c^2-2*c^3-0.2 
# use uniroot() to find a root in an interval which is known to us 
# check Table 10.3.5 as to why this interval is a good place to search
uniroot(f, c(0.2, 0.4))
  • Choosing an appropriate plug-in when the proposed statistical model depends on unknown parameters: Ideally, you use a consistent estimator with low variance as a plug-in. Typically, the MLE would be the right choice. But as discussed in class, it should be the MLE applied to the grouped data rather than the ungrouped data. In practice, people use the latter!
  • You also pay a price for not knowing the parameters! The degrees of freedom are reduced by the number of parameters estimated.
  • Applications: Benford’s law (a nice elementary explanation of why the law exists in the first place can be found here; a nice example of using Benford’s law to detect corruption in election campaigns, which also discusses what kinds of numbers Benford’s law can potentially apply to, can be found here), Case Study 10.3.3 (data which are too good to be true)
  • Application on testing independence of two categorical variables: Why is this an application? Convert the problem into a goodness-of-fit testing problem involving one categorical variable. Finished with an introduction to the sex bias article.
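As a sketch of the Benford’s law application: the first-digit probabilities are \(\log_{10}(1+1/d)\) for \(d=1,\ldots,9\), and a goodness-of-fit test then proceeds exactly as with the M&M colors. The digit counts below are made up for illustration:

```r
# Benford first-digit probabilities for digits 1 to 9
benford <- log10(1 + 1/(1:9))
sum(benford)  # the nine probabilities sum to 1
# Made-up first-digit counts from some hypothetical dataset of 200 numbers
digit.counts <- c(61, 35, 25, 19, 16, 13, 12, 10, 9)
chisq.test(digit.counts, p = benford)
```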

May 4, 9, 11, 16, 18, 23, and 25

Topic:

  1. Distributions connected to the normal
  2. Normal one-sample model, again
  3. Normal multi-sample models
  4. Analysis of variance
  5. Multiple comparisons vs contrasts

Assigned readings:

  1. Section 3.2.2 of Notes on normal models, Notes on the analysis of variance
  2. Deriving the Distribution of \(\dfrac{\overline{Y}-\mu}{S/\sqrt{n}}\) (Section 7.3), Drawing Inferences About \(\mu\) (Section 7.4), Testing \(H_0: \ \mu_X=\mu_Y\) (Section 9.2), Introduction (Section 12.1), The \(F\)-Test (Section 12.2), Testing Subhypotheses with Contrasts (Section 12.4)
  3. Drawing Inferences About \(\sigma^2\) (Section 7.5), Testing \(H_0: \ \sigma^2_X=\sigma^2_Y\) – The \(F\)-Test (Section 9.3)
  4. Confidence Intervals for the Two-Sample Problem (Section 9.5), Multiple Comparisons: Tukey’s Method (Section 12.3)
  5. Contrasts (Section 12.4) and data transformations (Section 12.5)

Exercises:

  • Homework 04, almost complete solutions to HW04
  • Repeat the in-class exercise but this time for \(V_2=\dfrac{1}{2}Y_1-\dfrac{1}{2}Y_2\) and everything else held constant. What happens to your answers? Students often forget the Jacobian, but you can indirectly recover it if you follow the exercise. Try it!
  • (Optional) Look back at the in-class exercise. Is the orthogonal transformation unique?
  • (Extremely painful, but only requires perseverance and an extension of the normal distribution to three dimensions) Repeat the in-class exercise but this time for \(n=3\) and consider the transformation of \(Y_1,Y_2,Y_3\) into \(V_1,V_2,V_3\) as follows: \[\begin{eqnarray*}V_1 &=& \frac{1}{\sqrt{3}}Y_1+\frac{1}{\sqrt{3}}Y_2+\frac{1}{\sqrt{3}}Y_3 \\ V_2 &=&-\frac{1}{\sqrt{2}}Y_1+\frac{1}{\sqrt{2}}Y_2 \\ V_3 &=&-\frac{1}{\sqrt{6}}Y_1-\frac{1}{\sqrt{6}}Y_2+\frac{2}{\sqrt{6}}Y_3 \end{eqnarray*}\] You also have to think about a multivariate normal distribution instead of a bivariate normal.
  • LM Exercises 7.3.1, 7.3.14, 7.3.15: More about practicing quick “tricks” with integration
  • LM Exercises 7.3.7, 7.3.8, 7.3.9, 7.3.11, 7.3.12, 7.4.1, 7.4.2, 7.5.1, 7.5.2, 7.5.3, 7.5.4: More about reading tables, practice this too!
  • LM Exercises 7.3.2 to 7.3.6, 7.5.5 to 7.5.8, 7.5.11 to 7.5.14: These are about the theoretical connections provided by knowing the chi-squared distribution or random variables having a chi-squared distribution. 7.5.5 is linked to the work of Wilson and Hilferty (1931). Related to 7.5.5, ask yourself why this exercise was relevant in the past, when software was not widely available!
  • LM Exercises 7.4.4 to 7.4.6, 7.4.12, 7.4.13, 7.4.15: confidence intervals from a theoretical point of view
  • LM Exercises 7.4.7 to 7.4.11, 7.4.14, 7.4.16: These exercises are about calculation of confidence intervals using the data. But pay attention to interesting questions like 7.4.9 and 7.4.16.
  • LM Exercises 7.4.17 to 7.4.22: hypothesis testing exercises
  • LM Exercise 7.4.23 and 7.4.27: probably the most interesting exercises related to hypothesis testing
  • LM Exercises 7.4.24 to 7.4.26: a good place to apply R, how do you simulate from the indicated distributions?
  • LM Exercises 7.5.9, 7.5.10, 7.5.15 to 7.5.17: calculating confidence intervals and testing claims for \(\sigma^2\) using data
  • LM Theorem 7.A.2.1 in the Appendix to Chapter 7: Work out the details.
  • LM Appendix 7.A.3: Work out the details, especially the form of the GLRT.
  • LM Exercises 12.2.1 to 12.2.6: All computations
  • LM Exercises 12.2.7, 12.2.9: Exercises to check whether you understood the algebraic relationships
  • LM Exercise 12.2.8: An interesting question meant for you to think about the assumptions
  • LM Exercise 12.2.10: We did this in class for the case of 3 groups.
  • LM Exercises 12.2.11 to 12.2.13: Connections to Section 9.2
  • all exercises in Chapter 9
  • all exercises in LM Section 12.3, 12.4
  • LM Exercise 12.5.2: presentation here is a bit different, pay attention to the assumptions of ANOVA
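Exercises 7.4.24 to 7.4.26 ask how to simulate from the indicated distributions. A sketch in R (the sample size and number of draws are arbitrary choices):

```r
set.seed(1)
n <- 15
# Draw directly from the t distribution with n - 1 degrees of freedom
draws <- rt(10^4, df = n - 1)
# Or build t from its definition: Z / sqrt(chi-square / df)
z <- rnorm(10^4)
w <- rchisq(10^4, df = n - 1)
draws2 <- z / sqrt(w / (n - 1))
# The two samples should have similar quantiles
quantile(draws, c(0.05, 0.95))
quantile(draws2, c(0.05, 0.95))
```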

May 4:

  • Mostly the big picture – Why should we care about finding the distribution of \(\dfrac{\overline{Y}-\mu}{S/\sqrt{n}}\)?
  • Summarizing the key results: Under IID normality, we have a pivotal quantity \(\dfrac{\overline{Y}-\mu}{S/\sqrt{n}} \sim T_{n-1}\) which can be used to construct confidence intervals, the distributional result \(\dfrac{\left(n-1\right)S^2}{\sigma^2}\sim \chi^2_{n-1}\), and the independence of \(\overline{Y}\) and \(S\). The latter is NOT intuitive at all and is central to the statistical analysis of normal data.
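A quick Monte Carlo check of the pivotal quantity: under IID normality, the t-based 95% interval should cover \(\mu\) about 95% of the time. The values of \(\mu\), \(\sigma\), and \(n\) below are arbitrary choices for illustration:

```r
set.seed(1)
mu <- 10; sigma <- 3; n <- 12
# For each simulated sample, record whether the t interval covers mu
covers <- replicate(5000, {
  y <- rnorm(n, mu, sigma)
  half <- qt(0.975, n - 1) * sd(y) / sqrt(n)
  (mean(y) - half <= mu) && (mu <= mean(y) + half)
})
mean(covers)  # should be close to 0.95
```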

May 9:

  • In-class exercise meant to provide a way to prove that \(\overline{Y}\) and \(S^2\) are statistically independent under IID normality, at least for \(n=2\). The more general idea for the proof may be found in the Section 3.2.2 of Notes on normal models or in the Appendix for Chapter 7.
  • The beginnings of the analysis of variance (ANOVA) table: Decompose the sum of squares for the data \(\sum_{i=1}^n Y_i^2\) into two independent parts \(V_1^2\) and \(\displaystyle\sum_{i=2}^n V_i^2\). The algebraic decomposition does not require normality, but the independence does. In addition, \(\displaystyle\sum_{i=2}^n V_i^2\) is a sum of independent random variables as well.
  • How to make sense of an ANOVA table in terms of the degrees of freedom and the \(F\)-ratio (or \(t\)-ratio)?
  • In terms of the in-class exercise, \(V_1=\sqrt{2}\cdot\overline{Y}\) and \(V_2^2=S^2\). Furthermore, \(V_1\sim N\left(\sqrt{2}\mu,\sigma^2\right)\), \(V_2\sim N\left(0,\sigma^2\right)\), and \(V_1\) and \(V_2\) are independent. We can write \[\left(\dfrac{\overline{Y}-\mu}{S/\sqrt{2}}\right)^2=\dfrac{\left(\dfrac{V_1-\sqrt{2}\mu}{\sigma}\right)^2/1}{\left(\dfrac{V_2^2}{\sigma^2}\right)/1}\] which has an \(F\) distribution with 1 numerator degree of freedom and 1 denominator degree of freedom.
  • The previous idea extends to \(n>2\), with a similar table and a similar distributional result. In particular, we can write \[\left(\dfrac{\overline{Y}-\mu}{S/\sqrt{n}}\right)^2=\dfrac{\left(\dfrac{V_1-\sqrt{n}\mu}{\sigma}\right)^2/1}{\left(\dfrac{\sum_{i=2}^n V_i^2}{\sigma^2}\right)/\left(n-1\right)}\] which has an \(F\) distribution with 1 numerator degree of freedom and \(n-1\) denominator degrees of freedom.
  • A \(t\)-distributed random variable is related to an \(F\)-distributed random variable. In particular, the square of a \(t\)-distributed random variable with \(n\) degrees of freedom has the same distribution as an \(F\)-distributed random variable with 1 numerator degree of freedom and \(n\) denominator degrees of freedom.
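This relationship between the \(t\) and \(F\) distributions is easy to verify numerically in R: the squared upper \(\alpha/2\) quantile of \(t_n\) equals the upper \(\alpha\) quantile of \(F_{1,n}\) (the degrees of freedom below are an arbitrary choice):

```r
n <- 10
qt(0.975, n)^2  # squared t quantile...
qf(0.95, 1, n)  # ...equals the F quantile
# The CDFs agree as well: P(F <= x) = P(-sqrt(x) <= T <= sqrt(x))
x <- 2.5
pf(x, 1, n)
2 * pt(sqrt(x), n) - 1
```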

May 11:

  • Summarizing Chapter 7 once more: Pay attention to the key results and the pivotal quantities which are now available. Pay attention to the new distributions connected to the normal – are they symmetric? what is their support? what are their moments? how do we calculate probabilities?
  • Bridge to next chapters: the ANOVA table
  • In-class exercise about testing the equality of means for three groups: The main point is for you to become more confident about your understanding of the topics connecting Chapters 7, 9 and 12. The algebra, which may be tedious, is for improving your ability to interpret expressions and get a sense of where things could be going.

May 16:

  • Finished the in-class exercise: intuition for the form of the pivotal quantity, how to show the distribution of the pivotal quantity
  • Connect with the notation in Sections 12.1 and 12.2: focus on the meaning of the symbols rather than the symbols themselves
  • Assigned data collection task which will be Quiz 02

May 18:

  • Review of what we have done and connecting to the book
  • Key ideas behind ANOVA: testing the null of equality of means is really looking into different “cuts” of the sum of squares of every observation; emphasized the relative nature of the test statistic; how the numerator reflects between-sample variation and how the denominator reflects within-sample variation
  • Different ways to think about the ANOVA tables and computational shortcuts
  • Why don’t economics and finance use ANOVA nowadays?
  • What “fishing” and/or data snooping does to hypothesis tests: What is the problem and how do we make adjustments?

May 23:

  • How do we exactly adjust for data snooping (in the sense of testing multiple hypotheses or constructing multiple confidence intervals all using the same data)? Introduced how to make \(p\)-value adjustments using the most basic and general approach by Bonferroni. Bonferroni corrections work in situations even beyond ANOVA, but they are extremely conservative.
  • Key aspects of the Bonferroni correction are the probability of a union of events being upper bounded by the sum of individual probabilities and that \(p\)-values have a \(\mathsf{U}(0,1)\) distribution under the null.
  • LM Section 12.3 introduces an alternative called Tukey’s method. The key is to construct a studentized range. But this studentized range has to have a very particular form and is restricted to the normal case. Worked on those details, studied and explained the proof of LM Theorem 12.3.1.
  • Constructing simultaneous confidence intervals for all pairwise differences is similar in spirit to the confidence intervals in Section 9.5. But they differ in the sense that Tukey’s approach is “honest” because it accounts for the fact that you went “fishing” for a significant difference between any two groups.
  • Worked on Case Study 12.3.1: Pay attention to the quantiles of the Tukey distribution, especially the notation. If you are using unfamiliar tables, you have to check how the tables were created! How do you create Figure 12.3.2?
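R implements Tukey’s method through TukeyHSD() applied to an aov() fit. Here is a sketch with made-up data for three groups (not the actual data of Case Study 12.3.1):

```r
set.seed(1)
# Made-up data: three groups of 10 observations with different means
dat <- data.frame(
  y = c(rnorm(10, 5), rnorm(10, 5.5), rnorm(10, 7)),
  group = factor(rep(c("A", "B", "C"), each = 10))
)
fit <- aov(y ~ group, data = dat)
# Simultaneous 95% intervals for all pairwise differences of group means
TukeyHSD(fit, conf.level = 0.95)
```

Each row of the output is one pairwise difference with its “honest” simultaneous interval, which is exactly the adjustment for having fished across all pairs.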

May 25:

  • Reminders about the project
  • How are the confidence intervals in LM Section 12.3 different from LM Section 9.5?
  • Reminder about the special case of the test statistic for LM Section 12.2 when there are only two groups
  • Discussed the R implementation of Case Study 12.3.1: documented some odd behavior of certain commands in R
  • What if you are not “fishing” and you actually already know, even before seeing the data, what subhypotheses to test? Enter the idea of contrasts. The distribution of the contrast estimator under the null repeats what you have seen before for the simplest case of testing \(\mu=\mu_0\) where \(\sigma^2\) is unknown. There is a \(t\)-distributed and an \(F\)-distributed version of the test.
  • Ended with an activity about the color distribution of M&M’s: How do you set up the null, and what is the intuitive comparison you have to make to determine whether the data are compatible with a pre-specified distribution?

Apr 18, 20, 23, 25, 27, and May 4

Topics:

  1. Testing claims and calibrating decision rules
  2. The \(p\)-value, again
  3. The likelihood ratio test

Activities: Predicting the number and suit of 12 cards randomly chosen from a deck of 52 cards

Assigned readings: The Decision Rule (Section 6.2), Type I and II Errors (Section 6.4), A Notion of Optimality: The Generalized Likelihood Ratio (Section 6.5), Taking a Second Look at Hypothesis Testing (Statistical Significance versus “Practical” Significance) (Section 6.6), Some notes on hypothesis testing

Exercises in LM:

  • 6.2.2: What are the null and alternative hypotheses here? Seeing \(\alpha=0.06\) may feel strange.
  • 6.2.3, 6.2.5: To check whether you understood what changing \(\alpha\) or changing the type of alternative could mean, try to explain your answer rather than just giving a yes or no.
  • 6.2.4: Typical question similar to Examples 6.2.1 and 6.2.2, but nicely articulates (what I also did in class) what are held constant when evaluating a claim like \(\mu=\mu_0\).
  • 6.2.6: Look carefully at the critical region which was proposed.
  • 6.2.7: Probably the most interesting exercise available for this section! Do it, especially (b).
  • 6.2.9: May feel mindless.
  • 6.2.10 and 6.2.11: You need to set up the appropriate null and alternative hypotheses.
  • 6.3.1: More often than not (not just in the exam but in actual practice), you would not even see the setup in (a); you will be asked to set things up in a way that reflects (a).
  • 6.3.2: This is an interesting context for an exercise.
  • 6.3.5: This is an exercise meant to give some connection between confidence intervals and hypothesis testing, but both have different use cases.
  • Revisit Case Study 4.3.1, Examples 5.3.2 and 5.3.3. You have seen these formulated as confidence interval problems, but these could be formulated as hypothesis testing problems. Try the reformulation for yourself! Connect to Exercises 6.3.3 and 6.3.5.
  • Case Study 6.3.2 and Exercise 6.3.6 should be done together.
  • 6.3.7 and 6.3.9 are typical questions where the decision rule is already provided. Your job is to assess the probabilities of both errors.
  • 6.3.8 is very interesting but not used a lot in practice. It gives you a way to deal with the discrete nature of the data.
  • All exercises in Section 6.4 should be done. I highlight some here: 6.4.10 (similar to FastBurger), 6.4.11, 6.4.12 (is this binomial?), 6.4.15 and 6.4.19 (curious why Type II error was emphasized), 6.4.16 (two sample binomial case). Perhaps the most interesting exercises are 6.4.21 and 6.4.22.
  • All exercises in Section 6.5 should be done. 6.5.1 and 6.5.2 are typical exercises which also deal with non-normal, non-binomial cases. 6.5.3 and 6.5.4 are partially solved in the notes. Solutions for 6.5.5 and 6.5.6 are relatively hard to write down.

Apr 18 (remaining 45 minutes):

  • Recap of your first encounter with testing claims and a \(p\)-value calculation
  • How do we actually set up a hypothesis testing problem? Be very aware that there is always a model in the setup!
  • We need to determine the null (the status quo), the alternative (something has changed), a test statistic whose behavior is known under the null (preferably one whose distribution under the null is pivotal), and a standard for deciding whether there is support for the null or the alternative.
  • This standard needs to be designed properly. I introduced a way to design this standard through a decision rule which allows you to control the probability of a Type I error to a pre-specified level.
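Using the numbers from LM Examples 6.2.1 and 6.2.2 (\(\mu_0=494\), \(\sigma=124\), \(n=86\)), a sketch of calibrating such a decision rule, and then evaluating the Type II error probability at one particular alternative (the alternative value 510 is an arbitrary choice for illustration):

```r
# Decision rule with alpha = 0.05 for H0: mu = 494 when sigma = 124 is known
mu0 <- 494; sigma <- 124; n <- 86
crit <- qnorm(0.975)  # reject when |(ybar - mu0)/(sigma/sqrt(n))| >= crit
se <- sigma / sqrt(n)
# Type II error depends on the alternative: e.g. at mu = 510, it is the
# probability that ybar falls inside the acceptance region
beta <- pnorm(mu0 + crit * se, mean = 510, sd = se) -
        pnorm(mu0 - crit * se, mean = 510, sd = se)
beta  # probability of failing to reject when mu = 510
```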

Apr 20:

  • Recap of the ideas behind designing a decision rule for hypothesis testing: The observed data did not play a role in the design of the decision rule.
  • What is a significance level and why is the \(p\)-value called the observed significance level?
  • It is difficult to make both the Type I and Type II error probabilities small simultaneously. The Type II error depends on the alternative (which covers a much wider range than the null). The treatment of the null and the alternative is not symmetric!
  • When do you actually use hypothesis testing? Usually done in the context of trying to make discoveries or for probing what could be present amidst the noise.
  • It is easy to manipulate the “template” for hypothesis testing. What happens when you design a decision rule that guarantees \(\alpha=0.05\) but you “peek” into the data to fish for a “discovery”? The decision rule designed was “Reject the null when \(\bigg| \dfrac{\overline{Y}-494}{124/\sqrt{86}} \bigg| \geq 1.96\)” so that \(\alpha=0.05\). But then, if you do not reject the null, you “peek” into the data once again and check whether \(\dfrac{\overline{Y}-494}{124/\sqrt{86}} > 1.65\). If the latter condition is satisfied, you count that as a rejection. The original guarantee of \(\alpha=0.05\) no longer holds! Check the simulation below.
# Function with no arguments, but could be modified
# This code is likely to be slow
# Context is LM Example 6.2.1 and 6.2.2
peeking <- function()
{
  # The null is true
  data <- rnorm(86, 494, 124)
  # Calculate test statistic
  test.stat <- (mean(data)-494)/(124/sqrt(86))
  # Two-tailed test first
  if(abs(test.stat) > qnorm(0.975))
  {
    return(1) # 1 means reject null
  } else
  {
    # Then one-tailed test
    if(test.stat > qnorm(0.95))
    {
      return(1) # 1 means reject null
    } else
    {
      return(0) # 0 means fail to reject null
    }
  }
}
# collect all results
results <- replicate(10^4, peeking())
mean(results)
  • We did an activity where you predicted the number and the suit of 12 cards drawn randomly from a deck of 52 cards.

    • The drawn cards were: 9 clubs, 9 diamonds, 9 clubs (again), 6 clubs, A diamonds, 3 spades, 9 diamonds (again), 3 diamonds, K diamonds, Q diamonds, 5 spades, K clubs.
    • Some students misheard ace of diamonds and thought it was eight of diamonds.
    • I was not very specific about what counts as a correct prediction. So I am going to be specific now. The ordering of the drawn cards should not matter. For example, if six of clubs came out on the 4th draw and you placed six of clubs as your 12th prediction, that still counts as a correct prediction. We had duplicates in the draw. If you wrote 9 clubs twice, then that counts as two correct predictions. If you only wrote it once, then that just counts as one correct prediction.
    • The activity is to determine whether you have ESP or not. What is the statistical model under the null? What is the statistical model under the alternative? Try calculating a \(p\)-value for the test.
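As a hedged sketch of the null model (no ESP), one could simulate the matching count. The guesses below are hypothetical, and drawing with replacement is my assumption based on the repeated cards in the actual draw:

```r
# Simulation sketch of the null (no ESP) for the card activity
# Assumptions: guesses are hypothetical; draws are with replacement
set.seed(1)
deck <- paste(rep(c("A", 2:10, "J", "Q", "K"), times = 4),
              rep(c("clubs", "diamonds", "hearts", "spades"), each = 13))
guesses <- sample(deck, 12)  # one student's hypothetical predictions
match.count <- function() {
  draw <- sample(deck, 12, replace = TRUE)
  # Order does not matter; duplicates count per the rule above
  sum(pmin(table(factor(guesses, levels = deck)),
           table(factor(draw, levels = deck))))
}
matches <- replicate(10^4, match.count())
# Simulated p-value for observing at least, say, 5 correct predictions
mean(matches >= 5)
```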

Apr 23:

  • Recap of Examples 6.2.1 and 6.2.2: What is a hypothesis test really doing? What do the definitions in the book look like in the example?
  • It is possible to consider two-sided alternatives, but be careful about how you allocate \(\alpha\) in the lower and the upper tails. For the normal case, the symmetry makes things easy.
  • Weird example about FastBurger: How do you set up something outside the normal and the binomial cases? Sometimes, the problem or context will make a decision rule available. Minimizing the sum of the probabilities of both types of errors may be difficult.
  • Just like in estimation, there could be many decision rules out there. We typically will choose the one that has good control over Type I error and has the highest power for a broad range of alternatives.
  • Pay attention to Example 6.4.1, which I think is a very useful way to think about how to plan and design experiments or studies for the purpose of discovery and detection.
  • In some sense, the course is almost finished because the remaining chapters for the course involve different types of hypothesis tests. The most unique involve tests about the goodness of fit and independence of random variables.
  • Try the following exercise, where \(\sigma\) is unknown. Here you have to use an asymptotically pivotal quantity like \(\dfrac{\overline{Y}-\mu}{S/\sqrt{n}}\overset{d}{\to} N\left(0,1\right)\) to test the claim: Suppose a senator introduces a bill to simplify the tax code. The senator claims that his bill is revenue-neutral. This means that tax revenues will stay the same. Suppose the Treasury Department will evaluate the senator’s claim. The Treasury Department has a representative file of more than a million tax returns. An employee from the Treasury Department chooses a random sample of 100 tax returns from this tax file. The employee will then recompute the taxes to be paid under the simplified tax code and compare them with the taxes paid under the old tax code. The employee finds that the sample average of the differences obtained from the 100 tax files was -219 dollars and that the sample standard deviation of the differences is 752 dollars. How would you use the data to evaluate whether there is support for the senator’s claim or for the employee’s claim?
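A minimal sketch of the calculation for this exercise, assuming the standard normal approximation is adequate for \(n=100\):

```r
# Hedged sketch for the senator exercise: large-sample test of
# H0: mean difference = 0 (revenue-neutral) vs H1: mean difference != 0
n <- 100; ybar <- -219; s <- 752
test.stat <- (ybar - 0)/(s/sqrt(n))
test.stat
# Two-sided p-value from the standard normal approximation
2*pnorm(-abs(test.stat))
```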
  • Revisited our activity: How do you setup the problem? Every student has their own \(p\)-value. With enough students, we can find a rejection of the null of no ESP even if most would agree that ESP is hard to believe.

Apr 25:

  • Large-sample tests in the binomial case are easy to implement but suffer from approximation problems given the discrete nature of the data (hence the need for a continuity correction), and some combinations of \(n\) and \(p\) may produce problems (see Brown, Cai, and DasGupta (2001)). One way to check is to use simulation. Don’t rely too much on rules of thumb.
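A minimal simulation sketch for such a check (the \(n\) and \(p_0\) below are illustrative):

```r
# Hedged sketch: simulate the actual Type I error rate of the
# large-sample two-sided binomial test (illustrative n and p0)
set.seed(1)
n <- 35; p0 <- 0.67; alpha <- 0.05
reject <- replicate(10^4, {
  x <- rbinom(1, n, p0)
  z <- (x/n - p0)/sqrt(p0*(1 - p0)/n)
  abs(z) >= qnorm(1 - alpha/2)
})
mean(reject)  # may differ noticeably from alpha for this n and p0
```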
  • Exact tests in the binomial case are more complicated: Constructing a decision rule for one-sided or two-sided tests requires indirect calculations (specifically, you need a tabulation of the probability mass function under the null). It is also difficult to exactly attain the desired Type I error rate (see Example 6.3.1; Exercise 6.3.2, where you have two possible decision rules; Exercise 6.3.8 for a randomized decision rule, though the idea is harder to use for Exercise 6.3.2). R code used for Exercise 6.3.2:
# Tabulate pmf under the null
dbinom(0:35, 35, 0.67)
# Figure out possible critical values
# Lower threshold
sum(dbinom(0:17, 35, 0.67)) 
# or 
pbinom(17, 35, 0.67)
# Upper threshold
sum(dbinom(29:35, 35, 0.67)) 
# or
pbinom(28, 35, 0.67, lower.tail = FALSE)
# Another set of thresholds
pbinom(18, 35, 0.67)
pbinom(29, 35, 0.67, lower.tail = FALSE)
  • Exact tests in the binomial case are more complicated: The one-sided \(p\)-value may be direct to calculate, but the two-sided \(p\)-value is not, as explained in class. You simply do not have the symmetry you enjoy in the normal case.
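A hedged sketch using built-in functions (the observed count is illustrative, not taken from the book): the upper-tail exact \(p\)-value is a direct pbinom() call, while binom.test() implements one common two-sided convention, summing \(P(X=k)\) over all outcomes no more likely than the observed one:

```r
# Exact binomial p-values (illustrative numbers: 29 successes, n = 35, p0 = 0.67)
obs <- 29; n <- 35; p0 <- 0.67
# One-sided (upper-tail) exact p-value: P(X >= obs) under the null
pbinom(obs - 1, n, p0, lower.tail = FALSE)
# Two-sided exact p-value: binom.test() sums P(X = k) over outcomes
# whose null probability is no larger than that of obs
binom.test(obs, n, p0)$p.value
```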
  • Exact tests for non-normal, non-binomial cases are even more complicated. That is why, as you may have noticed, the exercises always work with sample sizes equal to 1. Work on Examples 6.4.2, 6.4.3, and 6.4.4 so that you can see the ingredients you need to construct a decision rule. Example 6.4.4 is more curious for reasons discussed in class.
  • Pay attention to the gamma distribution whose story extends the exponential case. We will see this distribution again in Chapter 7.
  • Pay attention to Exercises 6.4.21 and 6.4.22: These require your ability to construct the distribution of the sum and the product of two IID random variables. But answering these using the computer is actually much easier! Try it.
  • Introduce the idea behind why the likelihood function could be a reasonable starting point for constructing decision rules.

Apr 27:

  • Took some time explaining the value and possible complications of hand calculations for Exercises 6.4.21 and 6.4.22. The hardest part is calculating the integrals properly. Pay attention to cases. But these two exercises are simpler to answer with R, which leads to simulated critical values. The important thing is to know how to simulate the null distribution and then find the suitable critical value for your test. R code can be found below:
# Implement Exercise 6.4.21 in R
# Number of simulations
nsim <- 10^4
# Simulated distribution of Y1+Y2 under the null theta=2
# If you do not impose the null, then it would be difficult to simulate!
sumy <- replicate(nsim, sum(runif(2, 0, 2)))
# Show histogram 
# You need the point at which 5% of the probability is below that point based on this histogram
hist(sumy, freq = FALSE)
# Manual calculations
temp <- hist(sumy, freq = FALSE)
# Points on the horizontal axis
temp$breaks
# Densities (heights)
temp$density
# We did trial and error in class
# Simulated critical value will be different every time (but close enough to each other)
# Here is a shortcut for the simulated critical value
quantile(sumy, 0.05)
  • Motivate once again why the likelihood function is a good starting point for testing hypotheses: Likelihood function is the probability of observing the data as a function of the parameter. When there is a claim about a parameter, we can evaluate whether this claim makes it more likely to observe the data we have. Higher likelihood values would indicate support for the claim. But it can be harder to implement this intuition when the hypotheses are not simple, for example, if the alternative is an interval of values.

  • Three objects are related to the likelihood function: the MLE, the score, and the likelihood function itself. There is a testing approach for each of these. The simplest to implement is possibly a direct test using the MLE. It is almost automatic. The score approach uses the fact that the score has zero mean and variance equal to the Fisher information. The likelihood function is the one emphasized in the book. Under approximate quadraticity of the log-likelihood, \(-2\) times the difference between the log-likelihood at the MLE \(\widehat{\theta}\) and the log-likelihood at the claimed value \(\theta_0\) (NOT necessarily the true value!) is asymptotically pivotal. In particular, the asymptotic distribution is actually the square of a standard normal. This is a convenient motivation to study Chapter 7!
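As a hedged illustration (the Bernoulli model and the numbers below are mine, not from the book), the three approaches can be computed side by side and should roughly agree in large samples:

```r
# Hedged sketch: Wald, score, and likelihood ratio statistics
# for IID Bernoulli data with null p = p0 (illustrative numbers)
set.seed(1)
n <- 200; p0 <- 0.5
x <- rbinom(n, 1, 0.6)  # data generated away from the null
S <- sum(x); phat <- S/n
# Wald: standardize the MLE using the Fisher information at phat
wald <- (phat - p0)/sqrt(phat*(1 - phat)/n)
# Score: standardize the score using the Fisher information at p0
score <- (phat - p0)/sqrt(p0*(1 - p0)/n)
# Likelihood ratio: -2 times the difference of log-likelihoods,
# asymptotically the square of a standard normal
loglik <- function(p) S*log(p) + (n - S)*log(1 - p)
lr <- 2*(loglik(phat) - loglik(p0))
c(wald, score, sqrt(lr))  # the three should be close for large n
```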

  • More importantly, the difference between log-likelihoods is ultimately related to the likelihood ratio. This led to the discussion of the generalized likelihood ratio test (GLRT). The test statistic involves a ratio of the supremum of the likelihood function under the null parameter space to the supremum of the likelihood function under the alternative parameter space. They can be complicated to compute, as illustrated in class and as seen in the repeated focus of the exercises on two-sided alternatives.

  • I used supremum instead of maximum (LM sneaks in an equal sign under the alternative, but that is ok). This is to cover situations where the maximizer is at the boundary.

  • We worked on the uniform case discussed in the book. This example is interesting because you have to be careful with constructing the likelihood function and the likelihood ratio. In addition, the likelihood ratio is a function of the fraction \(W=Y_{\mathsf{max}}/\theta_0\). It turns out that this random variable is a pivotal quantity. \(W\) has density equal to \(f_W\left(w\right)=nw^{n-1}\) where \(0<w<1\). Notice that the density does not have any unknown parameters! Thus, you can use the distribution of \(W\) to find the appropriate threshold. This distribution has shown up many times in the exercises of the book! It is actually a member of the family of Beta distributions. To learn more, see how these univariate distributions are related to each other.
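A quick simulation check of this pivotal claim (the \(n\) and \(\theta_0\) below are illustrative): under the null, \(W=Y_{\mathsf{max}}/\theta_0\) should follow a Beta\((n,1)\) distribution with density \(nw^{n-1}\):

```r
# Hedged check: W = Ymax/theta0 should be Beta(n, 1) under the null
set.seed(1)
n <- 8; theta0 <- 3
w <- replicate(10^4, max(runif(n, 0, theta0))/theta0)
# Compare simulated and theoretical 5% quantiles
quantile(w, 0.05)
qbeta(0.05, n, 1)  # equals 0.05^(1/n)
```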

  • We ended with an in-class exercise on Exercise 6.5.2. Time yourself! Here is my solution:

    • A likelihood function is given by \[L\left(\lambda\right)=\prod_{i=1}^{10}\lambda\exp\left(-\lambda Y_i\right)=\lambda^{10}\exp\left(-\lambda\sum_{i=1}^{10}Y_i\right).\]
    • The related log-likelihood function is given by \[\mathcal{l}\left(\lambda\right)=10\ln\lambda-\lambda\sum_{i=1}^{10}Y_i.\] You can choose to simplify things a bit if you want: \[\mathcal{l}\left(\lambda\right)=10\ln\lambda-10\lambda \overline{Y}.\]
    • The maximized likelihood function under the null \(\lambda=\lambda_0\) (a singleton) is given by \[L\left(\lambda_0\right)=\lambda_0^{10}\exp\left(-10\lambda_0 \overline{Y}\right).\]
    • The MLE \(\widehat{\lambda}_{\mathsf{MLE}}\) is the solution to \[\mathcal{l}^\prime\left(\widehat{\lambda}_{\mathsf{MLE}}\right)=0 \Rightarrow \frac{10}{\widehat{\lambda}_{\mathsf{MLE}}}-10\overline{Y}=0.\] So, \(\widehat{\lambda}_{\mathsf{MLE}}=\dfrac{1}{\overline{Y}}\).
    • The maximized likelihood function under the alternative \(\lambda\neq\lambda_0\) is given by \[L\left(\widehat{\lambda}_{\mathsf{MLE}}\right)=\widehat{\lambda}_{\mathsf{MLE}}^{10}\exp\left(-10\widehat{\lambda}_{\mathsf{MLE}}\overline{Y}\right)=\left(\dfrac{1}{\overline{Y}}\right)^{10}\exp(-10).\]
    • Thus, \[\Lambda = \left(\lambda_0\overline{Y}\right)^{10}\exp\left(-10\left(\lambda_0\overline{Y}-1\right)\right)\] is the required generalized likelihood ratio. Determining the critical value \(\lambda^*\) requires evaluating an integral, which takes more work. But the idea is to find \(\lambda^*\) solving the following equality: \[\mathbb{P}\left(\Lambda \leq \lambda^*\big|\lambda=\lambda_0\right)=0.05\]
  • For Exercise 6.5.2, that was all that was required. But it is a good idea to explore the random variable \(W=\lambda_0\overline{Y}\). In a similar manner as what you have seen for the uniform case, try to determine the distribution of \(W\). It is connected to Chapter 7 as well and uses Section 4.6.

  • Note that for Exercise 6.5.2, we don’t have to worry too much here because we are in the two-sided case. Try the one-sided alternative \(\lambda < \lambda_0\) as an exercise.
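One way to avoid the integral is to simulate the null distribution of \(\Lambda\), in the spirit of Exercises 6.4.21 and 6.4.22; a sketch with an illustrative \(\lambda_0\):

```r
# Hedged sketch: simulate the null distribution of the generalized
# likelihood ratio from Exercise 6.5.2 (lambda0 is illustrative)
set.seed(1)
nsim <- 10^4; lambda0 <- 1
w <- replicate(nsim, lambda0*mean(rexp(10, lambda0)))
Lambda <- w^10*exp(-10*(w - 1))
# Simulated critical value: reject the null when Lambda <= lambda.star
quantile(Lambda, 0.05)
```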

May 4:

  • Showed some of the difficulties encountered to really answer Exercise 6.5.2: Finding the required integral takes work. The idea is to let \(W=\lambda_0\overline{Y}\) in the generalized likelihood ratio. Afterwards, you need to obtain the distribution of \(W\). It can be shown that \(W\) has a gamma distribution. For some reason, I did not show this in class. I am not sure why. I will fix this when we see each other again.
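A simulation sketch consistent with this claim (the \(\lambda_0\) is illustrative): the sum of 10 IID \(\mathsf{Exp}(\lambda_0)\) variables is Gamma with shape 10 and rate \(\lambda_0\), so \(W=\lambda_0\overline{Y}\) should be Gamma with shape 10 and rate 10, free of \(\lambda_0\):

```r
# Hedged check: W = lambda0*Ybar from 10 exponential observations
# should be Gamma(shape = 10, rate = 10), free of lambda0
set.seed(1)
lambda0 <- 2
w <- replicate(10^4, lambda0*mean(rexp(10, lambda0)))
# Compare simulated and theoretical quantiles
quantile(w, c(0.05, 0.5, 0.95))
qgamma(c(0.05, 0.5, 0.95), shape = 10, rate = 10)
```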
  • We had a quiz and the answers are here.

Mar 23, 28, 30, Apr 4, 6, 11, and 13

Topics:

  1. General-purpose estimation principles: Maximum likelihood, method of moments, and least squares
  2. Sufficiency as another guiding principle for estimation
  3. Maximum likelihood estimation in R

Activities: Regrettably, none for the moment.

Assigned readings: Estimating Parameters (Section 5.2), Minimum-Variance Estimators: The Cramér-Rao Lower Bound (Section 5.5), Sufficient Estimators (Section 5.6), Notes on likelihood functions, Properties of MLE, Computational examples and issues, Method of moments, Other topics related to the likelihood, Example 5.4.2

Exercises:

  • Work out Examples 5.2.1 (Poisson case), 5.2.2 (Gamma case), 5.2.5 (Normal case). The last one has also been worked out in the notes.
  • The special examples are Examples 5.2.3 (truncated Poisson with fused outcomes) and 5.2.4 (looks exponential, but the support depends on the unknown parameter!). For the latter, make sure to put it in your “zoo” of weird examples like the uniform case. For 5.2.3, fused outcomes mean that 4, 5, and 6 are taken together. How does this affect the log-likelihood?
  • Work out Case Study 5.2.1 (Geometric case) on modeling ups and downs of a financial market.
  • Exercises 5.2.1, 5.2.3: Pretty standard, plot the log-likelihood using R, use R to find optima as well.
  • Exercise 5.2.2: Not very standard, but checks whether you understood what maximum likelihood really is. \(p\) takes on only two values, which is different from 5.2.1 where \(p\) could take on any value between 0 and 1.
  • Exercise 5.2.4: Looks like a Poisson?
  • Exercises 5.2.5, 5.2.6, 5.2.7: Pretty standard, but is there a special name for this density? If there is, determine whether you can use a built-in function in R. If there is not, then you have to code the log-likelihood from scratch. Also ask yourself: How does one generate random draws from this distribution? Pay attention to 5.2.7, as it is an example where you can set up a model for data which involve proportions.
  • Exercises 5.2.8 and 5.2.9: These are similar to the examples in LM. But pay attention to how to account for fused outcomes. Try finding optima in R.
  • Exercises 5.2.10 to 5.2.12: Look at the support.
  • Exercises 5.2.13 to 5.2.16: These exercises really involve two parameters, but the problem fixes one of the parameters and takes it as known.
  • For each of the exercises involving MLE: calculate the expected value of the score and the Fisher information.
  • Homework 03 (typo fixed, thanks to Ziyi!), suggested solutions
  • Exercises 5.2.17, 5.2.18, 5.2.21, 5.2.23, 5.2.25, 5.2.26: Here you might have to derive the required moments first and make sure it is linked to \(\theta\). Also pay attention to how many moments you would need. For 5.2.18, the distribution may be familiar. Look up the Beta distribution.
  • Exercise 5.2.19, 5.2.20, 5.2.22: Compare with the corresponding MLE.
  • Exercise 5.2.24: Answered already in the notes.
  • All exercises in Section 5.5: Notice the common things about them (notice that the sample mean shows up again and again). Exercise 5.5.7 is already answered in the notes for MLE.
  • Exercise 5.6.2: Use what we did in class and find a particular dataset which will produce a conditional probability which still depends on \(p\).
  • Exercise 5.6.6: Can you use the conditioning argument here? Consider \(W=\exp\left(\log W \right)\). Pay attention to the supports!!
  • Exercises 5.6.1, 5.6.4, 5.6.5 are typical exercises. If possible, use all the approaches found in the notes so that you can practice all of them to determine which are easy and which are difficult approaches.
  • Exercises 5.6.7 and 5.6.8 are the type of the exercises where the support depends on the unknown parameter.
  • Exercises 5.6.9 to 5.6.11: 5.6.9 is extremely important in how statistics developed in the 70s and 80s. 5.6.10 and 5.6.11 are specific cases of 5.6.9. Try them!

Mar 23:

  • Review of joint densities and the specific case of IID normal random variables (15 minutes)
  • Point out the sum of squares algebra (connections with Chapter 12) and the joint density depending solely on \(\overline{y}\) and \(s^2\) (connections with Section 5.6 and privacy) (5 minutes)
  • What is a likelihood function? Key idea, the tricky part in the continuous case, maximizing the log-likelihood instead of the likelihood directly (25 minutes)

Mar 28:

  • Why Homework 02 was assigned (15 minutes)
  • Recap of the setup and the algorithm for calculating MLE by hand (15 minutes)
  • Calculating MLE using R: How does the computer look for optima? Numerical issues abound, especially in higher dimensions. If interested, look into numerical optimization and specifically focus on gradient descent methods. (15 minutes)
  • From the simple IID \(N\left(\mu,\sigma^2\right)\) example, there are many points of note which make MLE an attractive approach to estimation but also pitfalls of the approach. (25 minutes)
  • Visual demonstration for the case of IID \(N\left(\mu,\sigma^2\right)\) where \(\sigma^2\) is known: R code used in class is displayed below (new commands involve sapply(), plot(), seq(), abline()). The first and second derivatives of the log-likelihood play a big role, especially in establishing the nice statistical properties of MLE! (15 minutes)
  • Pointing out a problem with a comment in LM about the likelihood not being a function of the data, some of the examples to work out (5 minutes)
# Draw random numbers from N(1, 4)
n <- 5
mu <- 1
sigma.sq <- 4 # change this to 40 if you want to look at the curvature of the log-likelihood
y <- rnorm(n, mu, sqrt(sigma.sq))
# Set up MINUS the log-likelihood (reused code from example)
# BUT sigma.sq is known, rather than a parameter to be estimated
mlnl <- function(par)
{
  sum(-dnorm(y, mean = par, sd = sqrt(sigma.sq), log = TRUE))
}
# New part where I want a plot of the log-likelihood for mu
# Place a grid of values for mu, adjust length.out if you wish
mu.val <- seq(-10, 10, length.out = 1000)
# Compute the log-likelihood at every value in mu.val
# The MINUS sign is to enable me to display the log-likelihood rather than the negative of the log-likelihood
log.like <- -sapply(mu.val, mlnl)
# Create a plot: vertical axis are the log-likelihood values, horizontal axis are the values of mu
# type = "l" is to connect the dots, try removing it
plot(mu.val, log.like, type = "l")
# Draw a vertical line at 1, hence v = 1
# Make sure the line is colored red and is dotted
abline(v = 1, col = "red", lty = "dotted")

Mar 30:

  • Weird example discussed in LM Example 5.2.4, be careful of geometric distribution in LM Case Study 5.2.1, pay attention to fused outcomes (LM Example 5.2.3, LM Exercises 5.2.8 and 5.2.9) when forming the likelihood, practical examples where using nlm() blindly may become problematic (45 minutes)
  • R code from Mar 28 was cleaned up to make things look nicer. Discuss what the R code is doing and what the picture is trying to tell you. Differentiate between the log-likelihood and the MLE (which are both random) versus the expected log-likelihood and the truth (which are fixed). Introduce new notation and intuition for looking at the first two derivatives of the log-likelihood. (15 minutes)
  • Look into the properties of the score function and the concept of Fisher information. (15 minutes)
  • How the quadratic shape of the log-likelihood is responsible for what makes MLE work in a broad number of settings (15 minutes)
# Draw random numbers from N(1, 4)
n <- 5
mu <- 1
sigma.sq <- 4 # change this to 40 if you want to look at the curvature of the log-likelihood
y <- rnorm(n, mu, sqrt(sigma.sq))
# Set up MINUS the log-likelihood (reused code from example)
# BUT sigma.sq is known, rather than a parameter to be estimated
mlnl <- function(par)
{
  sum(-dnorm(y, mean = par, sd = sqrt(sigma.sq), log = TRUE))
}
# New part where I want a plot of the log-likelihood for mu
# Place a grid of values for mu, adjust length.out if you wish
mu.val <- seq(-10, 10, length.out = 1000)
# Compute the log-likelihood at every value in mu.val
# The MINUS sign is to enable me to display the log-likelihood rather than the negative of the log-likelihood
log.like <- -sapply(mu.val, mlnl)
# Create a plot: vertical axis are the log-likelihood values, horizontal axis are the values of mu
# type = "l" is to connect the dots, try removing it
# fix the vertical axis for a nicer effect
plot(mu.val, log.like, type = "l", ylim = c(-200, 0))
# Draw a vertical line at the MLE for mu
# Make sure the line is colored blue-ish and dotted
abline(v = nlm(mlnl, 0)$estimate, col = "#0072B2", lty = "dotted")
# Draw a vertical line at mu, hence v = mu
# Make sure the line is colored orange-ish and is dotted
abline(v = mu, col = "#D55E00", lty = "dotted")
# Draw a curve representing the expected log-likelihood
curve(-n/2*log(2*pi)-n/2*log(sigma.sq)-n/2-n/(2*sigma.sq)*(1-x)^2, add = TRUE, col = "#CC79A7", lty = "dashed", lwd = 3)

Apr 4:

  • Recap: How to apply the whole toolkit provided by MLE and focused on two examples (IID \(N\left(\mu,\sigma^2\right)\) with \(\sigma^2\) known and the usual IID \(N\left(\mu,\sigma^2\right)\))
  • Ultimately, the key result for MLE is that the distribution of the MLE can be approximated very well by \(N\left(\theta_0, \left[I\left(\theta_0\right)\right]^{-1}\right)\). Take note that this does not mean that MLE is unbiased for \(\theta_0\) (recall our IID \(N\left(\mu,\sigma^2\right)\) example).
  • Spent a lot of time using asymptotic tools such as the law of large numbers (consistency of the sample mean of IID random variables for the population mean which should exist) and the central limit theorem (LM Theorem 4.3.2)
  • Laid out the argument for \(S^2 \overset{p}{\to} \sigma^2\) and \(\dfrac{\overline{Y}-\mu}{S/\sqrt{n}}\overset{d}{\to} N\left(0,1\right)\). Emphasized the similarity of the latter to the distribution of the standardized MLE \(\dfrac{\widehat{\theta}-\theta_0}{\sqrt{\left[I\left(\widehat{\theta}\right)\right]^{-1}}} \overset{d}{\to} N\left(0,1\right)\).
  • Introduced what the phrase asymptotic variance means and in what contexts you see this phrase.
  • What does the result \(\widehat{\theta} \approx N\left(\theta_0, \left[I\left(\theta_0\right)\right]^{-1}\right)\) look like in the IID case? Try to sketch how to obtain the asymptotic distribution of \(\sqrt{n}\left(\widehat{\theta}-\theta_0\right)\) in the IID case. Start from \[\sqrt{n}\left(\widehat{\theta}-\theta_0\right) \approx \left[-\frac{1}{n}\mathcal{l}^{\prime\prime}\left(\theta_0\right)\right]^{-1}\frac{1}{\sqrt{n}}\mathcal{l}^\prime\left(\theta_0\right)\] Lay out the argument to show that \[\begin{eqnarray}\left[-\frac{1}{n}\mathcal{l}^{\prime\prime}\left(\theta_0\right)\right]^{-1} &\overset{p}{\to}& \left(\mathbb{E}\left[-\frac{d^2}{d\theta^2}\ln f\left(Y_i;\theta\right)\right]\bigg|_{\theta=\theta_0}\right)^{-1} \\ \frac{1}{\sqrt{n}}\mathcal{l}^\prime\left(\theta_0\right) &\overset{d}{\to}& N\left(0,\mathbb{E}\left[-\frac{d^2}{d\theta^2}\ln f\left(Y_i;\theta\right)\right]\bigg|_{\theta=\theta_0}\right)\end{eqnarray}\] and the result for the IID case follows.
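A hedged simulation sketch of the standardized-MLE result in a concrete IID model (exponential with rate \(\lambda_0\); the \(n\) and \(\lambda_0\) are illustrative). Here the MLE is \(1/\overline{Y}\) and the per-observation Fisher information is \(1/\lambda^2\):

```r
# Hedged check that the standardized MLE is approximately N(0,1):
# IID exponential case, MLE = 1/Ybar, plug-in Fisher information
set.seed(1)
n <- 500; lambda0 <- 2
z <- replicate(10^4, {
  lhat <- 1/mean(rexp(n, lambda0))
  (lhat - lambda0)/sqrt(lhat^2/n)
})
# Should resemble a standard normal sample
c(mean(z), sd(z))
```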

Apr 6:

  • Finished IID case for MLE asymptotics using asymptotic tools and properties of the log-likelihood derivatives: Note that the properties of the score function and the Fisher information were derived for whatever the form of the log-likelihood (need not be IID), but for the IID case, things become simpler as you can focus on each data point’s contribution to the log-likelihood (which is \(\ln f\left(Y_i;\theta\right)\)).
  • Introduced the key idea for the method of moments as a procedure, skimmed through some of the other more interesting points relevant for economics and finance (but these may have to be revisited after the midterm)
  • Big picture once again: Where could estimators come from, what properties should they have, how to express preferences for certain estimators
  • It may be desirable to have minimum variance and efficient estimators in the context of unbiased estimators. It may also be desirable to have the sufficiency property. Focused on Definition 5.6.1 and the deeper meaning behind the decomposition of the likelihood function (leading to Theorem 5.6.1). What I said in class is not exactly Definition 5.6.1 but will be equivalent to that definition through Theorem 5.6.1.
  • A question after class was to give an example of an estimator that is not sufficient. Here is a simplified version of what I gave as an example: Let \(X_1,X_2\) be IID \(N\left(\mu,1\right)\). \(\widehat{\theta}=X_1\) is not sufficient for \(\theta=\mu\). Write down the likelihood for \(\mu\) and see if you can do the required decomposition as in Definition 5.6.1.

Apr 11:

  • Why Exercise B of Homework 03 was assigned to you: First, I want you to get acquainted with how to apply the statistical method to answering research questions (Problem, plan, data, model/analysis, conclusion), in particular: What is a plausible range of values for what the manufacturer has set for the average number of chocolate chips to be placed in one piece of Chips Ahoy? Second, there is also a link to the “weird” oscillating coverage rates of the so-called Wald interval, see Brown, Cai, and DasGupta (2003). In particular, the Poisson case was treated in Examples 1 and 2. Now, you should be able to gain confidence in reading more technical papers. Both these points are to prepare you for your eventual project.
  • Cramér-Rao lower bound: What is it good for? It helps narrow the search for the MVUE and reduces the computational burden of comparing relative variances of estimators. The CRLB is the inverse of the Fisher information, so it is linked to the behavior of MLE and the curvature of the log-likelihood function. Gave a realistic example of a case where you have zero Fisher information.
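As a hedged check in a case you already know (the numbers below are illustrative): for IID Poisson data the Fisher information is \(n/\lambda\), so the CRLB is \(\lambda/n\), and the sample mean attains it:

```r
# Hedged check: in the IID Poisson case the sample mean attains the
# CRLB, which is lambda/n (the inverse of the Fisher information n/lambda)
set.seed(1)
n <- 50; lambda <- 3
means <- replicate(10^4, mean(rpois(n, lambda)))
var(means)   # simulated variance of the sample mean
lambda/n     # the CRLB
```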
  • Pointed out Example 5.5.2: This is not a counterexample.
  • The definition of sufficiency in LM is not based on the motivating example of the IID \(\mathsf{Ber}\left(p\right)\) where it was shown that once we condition on \(\widehat{p}\), the probability of observing the data does not depend on \(p\) anymore.
  • Illustrate the case where \(X_1,X_2,X_3\) are IID \(\mathsf{Ber}\left(p\right)\). Derive the conditional distribution of \(X_1,X_2,X_3\) given \(\widehat{p}\) directly by enumerating the possible datasets one could observe. Make a connection to the general case illustrated in the book. Points to note here are: (1) We look back to what data could have produced a particular value of \(\widehat{p}\). (2) We knew how to derive the distribution of \(\widehat{p}\).
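The enumeration can be mirrored in R as a hedged sketch (the value of \(p\) below is arbitrary; the point is that the conditional probabilities do not change with it):

```r
# Hedged sketch: enumerate all 8 datasets for X1, X2, X3 IID Ber(p)
# and show that P(data | phat) does not involve p
p <- 0.3  # arbitrary; change it and the conditional probabilities stay put
datasets <- expand.grid(x1 = 0:1, x2 = 0:1, x3 = 0:1)
s <- rowSums(datasets)  # the sufficient statistic X1 + X2 + X3
prob <- p^s*(1 - p)^(3 - s)
# Conditional probability of each dataset given phat = s/3
cond <- prob/ave(prob, s, FUN = sum)
cbind(datasets, s, cond)  # cond is 1 for s = 0, 3 and 1/3 otherwise
```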
  • Illustrate the difficulties of using this conditioning argument in the continuous case. Refer to LM Exercise 5.6.6. Make an honest attempt at deriving the distribution of \(W\). You will find that it can be extremely difficult! This motivates the need for other versions of sufficiency.

Apr 13:

  • Recap of our \(X_1,X_2,X_3\) are IID \(\mathsf{Ber}\left(p\right)\) example. Worked out the conditional distribution of \(X_1,X_2,X_3\) given \(\widehat{p}\) completely. Emphasized the data compression provided by \(\widehat{p}\). Considered \(X_1+X_2+X_3\) as a one-to-one transformation of \(\widehat{p}\) and \(X_1+X_2+X_3\) is still sufficient for \(p\). Connection with how to do Exercise 5.6.3 (the equivalence classes induced by \(\widehat{p}\) and \(X_1+X_2+X_3\) are the same).
  • Showed that the data compression provided by \(\widehat{p}^*\) in Exercise 5.6.2 is “not enough”, especially when compared to \(\widehat{p}\).
  • The extension of “\(\mathbb{P}\left(\mathrm{observing}\ \ \mathrm{data} | \mathrm{statistic}=\mathrm{fixed}\ \ \mathrm{value}\right)\) does not depend on the unknown parameter” to the continuous case is much more difficult, primarily because you need the distribution of the statistic you want to show is sufficient. Consider Exercise 5.6.6.
  • Theorem 5.6.1 is a powerful characterization of sufficiency, but it may be best used to show that a statistic is sufficient for an unknown parameter rather than showing that a statistic is not sufficient. Why?
  • To show a statistic is not sufficient for an unknown parameter, it might be better to use an approach similar to what we did in Exercise 5.6.2 or perhaps an approach sketched in LM page 321. The latter is more formalized in the notes in terms of likelihood ratios. Try applying this approach to the example in LM page 321. Revisit Exercise 5.6.2 using this approach as well.
  • Note that the discussion in LM page 321 about the true value of \(\theta\) being 4 is incorrect!
  • One has to be careful about constructing the likelihood function, especially when the support depends on the unknown parameter like in Example 5.6.2. Connect the importance of setting up a likelihood function correctly to Example 5.2.4. The MLE algorithm does not work here because the likelihood function is not set up correctly. If it were set up correctly, then we would know that we cannot apply derivatives at the very beginning! Be careful of this point.
  • A question was raised by one of your classmates (thanks to Yi!) at the end about Example 5.6.2. They say that the “critical fact” \[\prod_{i=1}^n I_{[0,\theta]}\left(y_i\right)=I_{[0,\theta]}\left(y_{max}\right)\] is not very obvious if, for example, there exists \(y_1\) such that \(y_1<0\) but \(0\leq y_i\leq \theta\) is satisfied for \(i\neq 1\). Here the equation presented does not hold. How do we resolve this? When your classmate asked this, my initial reaction was that \(y_1<0\) should give a likelihood equal to zero anyway, but your classmate is correct that the equation presented is not always true. I now illustrate a better approach: The joint pdf of \(Y_1,\ldots, Y_n\) in this case is \[f\left(y_1,\ldots,y_n;\theta\right)=\begin{cases} \displaystyle \prod_{i=1}^n \frac{2y_i}{\theta^2} & \mathrm{if}\ 0\leq y_i\leq \theta,\ i=1,\ldots,n \\ 0 & \mathrm{otherwise}\end{cases}\] Therefore, if there is a \(y_1<0\) but \(0\leq y_i\leq \theta\) is satisfied for \(i\neq 1\), then \(f\left(y_1,\ldots,y_n;\theta\right)=0\). In this sense, the factorization theorem can be applied directly to the case where \(y_i\geq 0\) for all \(i=1,\ldots,n\). The “critical fact” is true in this situation and we can rule out the situation where any of the \(y_i<0\).

Apr 18 (45 minutes):

  • Wrap up sufficiency: recap of past examples and how sufficiency is related to data compression that is “lossless” in terms of learning about an unknown parameter. (We talked about examples of “too much” and “too little” compression when learning \(p\) in the IID Bernoulli case. We also talked about the non-uniqueness of sufficient statistics.)

  • Pros and cons of the different characterizations of sufficiency: Practice doing each characterization when you try the exercises. That way, you get a feel for which characterizations are easy to work with.

  • Why should anyone care about sufficiency? Although the idea is not as useful in practice (especially in economics and finance, where the likelihood is not the starting point), the idea is useful in thinking about deeper questions about statistics in a privacy-respecting environment and separation of nuisance parameters from parameters of interest in complex models.

  • A more classical case for the usefulness of sufficiency arises from actually constructing minimum-variance estimators starting from an unbiased estimator.

  • A statement in the book (LM page 325) gives a reason why MLE is attractive. In particular, if a sufficient statistic exists, then, because of the factorization theorem (LM Theorem 5.6.1), MLEs have to be functions of sufficient statistics. The intuition is that sufficient statistics “squeeze out” all the information about the unknown parameter in the data, so an MLE that is a function of these statistics becomes very attractive indeed.

  • One of your classmates (thanks to Xiangqing!) asked about the previous point. They were concerned about how to show that the MLE is sufficient. Based on the remark in the book, the MLE is a function of a sufficient statistic. But this leaves open the question “Is MLE a sufficient statistic?”. Here you can use Exercise 5.6.3. But can MLE be not sufficient despite it being a function of sufficient statistics? This turns out to be a difficult question.

    • The remark in the book is not fully correct. Moore (1971) gives a counterexample where \(X_1,\ldots,X_n\) is IID \(\mathsf{U}\left(\theta-0.5,\theta+0.5\right)\). This example also shows that the MLE could be nonunique. Furthermore, the example could be a good exercise to show the sufficiency of the minimum and the maximum for \(\theta\). Levy (1985) also has a simpler example and could be used as an exercise as well. Although Levy (1985) looks at minimal sufficiency, you can focus on sufficiency alone.
    • We can impose uniqueness of the MLE so that the remark in the book becomes correct. The question “Can MLE be not sufficient despite it being a function of sufficient statistics?” is still unanswered. I give the example of a Cauchy distribution with location \(\mu\) and scale \(\sigma\). Copas (1975) has shown that the MLE of the location and scale is unique. This is an example of a distribution where all of the order statistics (in fact, the entire sample) form sufficient statistics. In this sense, there is no data reduction. The MLE is not a one-to-one transformation of the order statistics. Intuitively, once you get the MLE, you cannot “undo” it to produce the \(n\) order statistics. So, I think this is an example of a situation where the MLE is not sufficient.
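A quick R sketch of the Moore (1971) example (my own simulation, not from the paper): for IID \(\mathsf{U}\left(\theta-0.5,\theta+0.5\right)\), the likelihood equals 1 for every \(\theta\) in \([y_{\max}-0.5,\ y_{\min}+0.5]\) and 0 otherwise, so every point in that interval maximizes the likelihood and the MLE is not unique.

```r
set.seed(1)
theta <- 3
y <- runif(20, theta - 0.5, theta + 0.5)
# Likelihood of th for IID U(th - 0.5, th + 0.5):
# equals 1 if all y_i lie in [th - 0.5, th + 0.5], and 0 otherwise
lik <- function(th) as.numeric(all(y >= th - 0.5 & y <= th + 0.5))
lo <- max(y) - 0.5
hi <- min(y) + 0.5
lik(lo); lik(hi); lik((lo + hi) / 2)   # all equal to 1: the MLE is not unique
lik(hi + 0.01)                         # 0: outside the maximizing interval
```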

Mar 9, 14, 16, and 21

Topics:

  1. Standard errors
  2. Confidence intervals for a normal population mean
  3. Testing claims about a normal population mean
  4. The plug-in principle
  5. Introduction to simulation-based inference using R

Activities: Regrettably, none for this time.

Assigned readings: Properties of Estimators (Section 5.4), Consistency (Section 5.7), Interval Estimation (Section 5.3), The Decision Rule (Section 6.2, specifically pages 351-352 and LM Examples 6.2.1, 6.2.2), Comparing \(\dfrac{\overline{Y}-\mu}{\sigma/\sqrt{n}}\) and \(\dfrac{\overline{Y}-\mu}{S/\sqrt{n}}\) (Section 7.2), Sections 2.3 to 3.2.1 of Notes on normal models, What do we mean when we say “mean”?

Exercises and resources:

  • Try to draw random numbers from other parent distributions aside from \(N\left(\mu,\sigma^2\right)\). Visualize the sampling distribution of \(\overline{Y}\). For a list of possible distributions, have fun with distributions in R (focus on base functionality, as these should be enough).
  • If you have not done so, start fasteR Lessons 1 to 9.
  • We have discussed aspects of Exercises 1 to 7, 9, 10 in Notes on normal models. But you could already work on Exercises 8, 11, and 12.
  • Exercise 4 involving IID Bernoulli random variables is also a very useful special case, as there is a lot we know about the total \(\displaystyle\sum_{i=1}^n Y_i\). Most of the findings here will show up in some form in Interval Estimation (Section 5.3) and Testing Binomial Data – \(H_0:\ \ p=p_0\) (Section 6.3). This is parallel to the results for IID normal random variables.
  • Homework 02, suggested solutions
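As a small supplement to the Bernoulli point above (my own sketch, with made-up parameter values): simulate totals of IID Bernoulli draws and compare the simulated frequencies with the Binomial pmf.

```r
set.seed(42)
n <- 10; p <- 0.3; nsim <- 10^4
# Each column is an IID Bernoulli(p) sample of size n; sum each column
totals <- colSums(replicate(nsim, rbinom(n, 1, p)))
# Simulated frequency of the event {total = 3} versus the Binomial(n, p) pmf
mean(totals == 3)
dbinom(3, n, p)
```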

Exercises in LM:

  • 5.4.11: This is an exercise to show you that asking for unbiasedness is sometimes too much.
  • 5.4.4, 5.7.1: Although Exercise 5.7.1 is in the section on consistency, you do not really need the concept to answer this question. Try it.
  • 5.7.2: One way to do this is to apply Chebyshev’s inequality to \(S_n^2\) as the random variable. After that, let \(n\to\infty\).
  • 5.7.3 (a): Calculate \(\mathbb{P}\left(|Y_1-\lambda|\leq \varepsilon\right)\) directly since you know the distribution of \(Y_1\).
  • Examples 5.4.2 and 5.7.1: Compare the difficulty of these examples to how we developed the properties of \(\overline{Y}\). Of course, try working out these examples as it allows you to review how to compute the distribution of order statistics. Continue with Exercises 5.4.2 and 5.4.6.
  • 5.4.7: Do this after reviewing order statistics. For now, there is no need to refer to Example 5.2.4.
  • 5.4.9: This is a direct calculation. You need to calculate integrals here.
  • 5.4.10: This exercise makes clear that unbiasedness is a finite-sample property. In my opinion, the hint may be confusing. If it helps, replace it with “What is \(\mathbb{E}\left(Y^2\right)\)?”
  • 5.4.12: Study Example 5.4.4 first, then do this particular exercise.
  • 5.3.1: There are two ways to answer this. One is to assume normality which was not given in the exercise. The other is to use Chebyshev’s inequality.
  • 5.3.2: This is a direct calculation problem.
  • 5.3.5: This is an exercise about minimum sample sizes when the goal is to narrow the range of “plausible” values for the unknown \(\mu\).
  • 5.3.6, 5.3.7, 5.3.8: Exercise 5.3.6 uses the sampling distribution of \(\overline{Y}\) under IID normality. In contrast to what we did in class, the intervals here are not symmetric.
  • 5.3.4 and 5.3.5: Both these exercises are somewhat related to testing claims. These exercises show that confidence intervals and testing claims may be connected.
  • 5.3.9: This may feel strange, but it asks you to look at the assumptions necessary for constructing a confidence interval for \(\mu\).
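Related to Exercise 5.7.2 and the role of Chebyshev's inequality in the lectures below: a short simulation (my own, with made-up parameter values) comparing the Chebyshev bound \(\mathbb{P}\left(|\overline{Y}-\mu|\geq \varepsilon\right)\leq \sigma^2/(n\varepsilon^2)\) with the simulated probability.

```r
set.seed(7)
n <- 25; mu <- 1; sigma.sq <- 4; eps <- 0.5; nsim <- 10^4
# Simulate nsim sample means of size n under IID normality
ybars <- colMeans(replicate(nsim, rnorm(n, mu, sqrt(sigma.sq))))
mean(abs(ybars - mu) >= eps)   # simulated P(|Ybar - mu| >= eps)
sigma.sq / (n * eps^2)         # Chebyshev upper bound: 0.64
```

Note how conservative the bound is here: Chebyshev holds for any distribution with finite variance, so it cannot exploit normality.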

Mar 9:

  • Recap: Estimand, estimator, and estimate; visualizing the sampling distribution of \(\overline{Y}\) under IID normality, visualizing random draws from distributions, R code in notes (20 minutes)
  • The importance of the change in terminology from probability distribution to sampling distribution (5 minutes)
  • Making probability statements about \(\overline{Y}\) (5 minutes)
  • What results will survive if we remove normality, identical distribution, and independence? Not all assumptions/conditions are made equal. (15 minutes)
  • Can we still make probability statements about \(\overline{Y}\) if we only know that \(Y_1,\ldots,Y_n\) are IID with mean \(\mu\) and variance \(\sigma^2\)? Markov’s inequality and Chebyshev’s inequality play crucial roles here. (15 minutes)
  • Establishing consistency of \(\overline{Y}\) for \(\mu\) (10 minutes)
  • Distinguishing between finite-sample (small-sample) results and asymptotic (large-sample) results (5 minutes)
  • An alternative interpretation of the expected value and how it relates to the unbiasedness of an estimator (15 minutes)

Mar 14:

  • Long sermon: following instructions, draft mode, clicking submit, etc. (10 minutes)
  • Revisiting incorrect answers in Homework 01: What is the point of Homework 01? (35 minutes)
  • What is the meaning of \(\mathsf{Var}\left(\overline{Y}\right)=\sigma^2/n\)? How do we visualize it? How is it linked to Chebyshev’s inequality, where \(\varepsilon\) is chosen carefully? Frame Chebyshev’s inequality as a guarantee about the behavior of the sample mean in hypothetical samples. (25 minutes)
  • Why do we call \(\sigma/\sqrt{n}\) the standard error of the sample mean? Why give a new name? (10 minutes)
  • The standardized sample mean \(\dfrac{\overline{Y}-\mu}{\sigma/\sqrt{n}}\): Where does it show up in Chebyshev’s inequality? Notice also the link to the central limit theorem for IID random variables. Note the presence of the standard error of the sample mean. Also note that in the future, we will be considering general estimators which when appropriately standardized as \(\dfrac{\widehat{\theta}-\theta}{\sqrt{\mathsf{Var}\left(\widehat{\theta}\right)}}\) will have a similar behavior as the usual central limit theorem. (10 minutes)
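The last point above can be visualized with a short simulation (my own sketch): standardized sample means from a non-normal parent already look close to standard normal for moderate \(n\).

```r
set.seed(123)
n <- 50; nsim <- 10^4
lambda <- 0.015                  # exponential rate, so mu = sigma = 1/lambda
ymat <- replicate(nsim, rexp(n, lambda))
# Standardized sample means (Ybar - mu)/(sigma/sqrt(n))
z <- (colMeans(ymat) - 1/lambda) / ((1/lambda) / sqrt(n))
mean(z); sd(z)                   # close to 0 and 1
hist(z, freq = FALSE, main = "Standardized sample means, Exp parent")
curve(dnorm(x), add = TRUE, lwd = 2)   # superimpose the N(0,1) pdf
```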

Mar 16:

  • Reminders: diary entries, exercises (5 minutes)
  • Recap of everything we have so far in a tabular form (15 minutes)
  • The probability statements we can make about \(\overline{Y}\) are restricted to a particular format (or, if you wish, particular events), especially if we do not know \(\mu\) and \(\sigma^2\). Contrast probability statements using Chebyshev’s inequality with the probability statements based on knowing that the sampling distribution of \(\overline{Y}\) is normal (25 minutes)
  • R code to give a sense of the probability statements (10 minutes)
  • Why is the restriction to a particular format (or event) good enough? What do we gain? We gain a way to produce intervals that “capture” \(\mu\) with a pre-specified probability. Imagine me being blind and killing an insect staying at the same location on the board using my hand. (10 minutes)
  • What a confidence interval actually says and what the guarantee actually means (10 minutes)
  • So far, we pretended to know \(\sigma\) when constructing a confidence interval for \(\mu\). What makes the construction of a confidence interval different from calculating the exact value of \(\displaystyle\mathbb{P}\left(\left|\frac{\overline{Y}-\mu}{\sigma/\sqrt{n}}\right|\geq c\right)\) under IID normality? (5 minutes)
  • The idea behind testing a claim involving a population mean (5 minutes)
# Code used to demonstrate the exact results under IID normality
n <- 5
mu <- 1 # common expected value 
sigma.sq <- 4 # common variance 
nsim <- 10^4 # number of realizations to be obtained
# repeatedly obtain realizations
ymat <- replicate(nsim, rnorm(n, mu, sqrt(sigma.sq)))
# collect all the possible realizations of the sample mean
ybars <- colMeans(ymat)
# create a temporary object called temp to store an indicator
# of whether the event below happened (TRUE/FALSE)
temp <- (abs(ybars-1) < 2/sqrt(5))
# Look at the last three sample means as an example
ybars[9998:10000]
# Look at the corresponding values of temp
temp[9998:10000]
# Count how many in total
# What do you expect to see here?
sum(temp) 
# Change the event to 
temp <- (abs(ybars-1) < 4/sqrt(5))
# Count how many in total
# What do you expect to see here?
sum(temp) 
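In the same spirit as the code above (my own addition, using the same settings n = 5, mu = 1, sigma.sq = 4): the “capture” interpretation of a confidence interval can be checked by counting how often the interval \(\overline{Y}\pm 1.96\,\sigma/\sqrt{n}\) contains \(\mu\) across simulated samples.

```r
set.seed(2023)
n <- 5; mu <- 1; sigma.sq <- 4; nsim <- 10^4
# Simulate nsim sample means under IID normality
ybars <- colMeans(replicate(nsim, rnorm(n, mu, sqrt(sigma.sq))))
# Half-width of the 95% interval when sigma is known
half <- 1.96 * sqrt(sigma.sq) / sqrt(n)
# Indicator of whether each interval captures mu
covered <- (ybars - half <= mu) & (mu <= ybars + half)
mean(covered)   # should be close to 0.95
```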

Mar 21:

  • Recap of confidence intervals for a normal population mean \(\mu\): what it is, how to construct it when \(\sigma^2\) is known (20 minutes)
  • What if we do not want a 95% confidence interval? Then you just have to make adjustments. Do we really want the 5% to be split equally between the two tails? Not necessarily, but there are benefits to this equal-tails choice. There are also situations where unequal tail areas may be a good idea. (10 minutes)
  • What happens if \(\sigma^2\) is not known? Well, you now have a new estimand \(\sigma^2\). LM Example 5.4.4 suggests two possible estimators \(\widehat{\sigma}^2\) and \(S^2\). \(\widehat{\sigma}^2\) is not unbiased for \(\sigma^2\), but \(S^2\) is unbiased for \(\sigma^2\). The problem is that for constructing a confidence interval for \(\mu\), you really need a plug-in for \(\sigma\), not \(\sigma^2\). Is this a big deal? Yes and no. Yes, especially when sample sizes are small. No when sample sizes are very large. Recall our pictures for the simulated distribution of \(\dfrac{\overline{Y}-\mu}{\sigma/\sqrt{n}}\) and \(\dfrac{\overline{Y}-\mu}{S/\sqrt{n}}\). When \(n\) is small, the latter has fatter tails than normal even if we assume IID normality. (15 minutes)
  • How do we test claims? We are going to focus on the key ideas behind the \(p\)-value first, before we think about all the other details. At its very core, how are we to evaluate a claim using probability theory? (30 minutes)
  • Work on a direct calculation of the \(p\)-value for LM Examples 6.2.1 and 6.2.2 and a simulation-based \(p\)-value (15 minutes)
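A generic sketch of a simulation-based \(p\)-value (my own, with hypothetical numbers, not those of LM Examples 6.2.1/6.2.2): test \(H_0:\ \mu=\mu_0\) against a two-sided alternative, assuming IID normality with known \(\sigma\), by simulating sample means under \(H_0\) and counting those at least as extreme as the observed one.

```r
set.seed(10)
n <- 30; mu0 <- 100; sigma <- 15
ybar.obs <- 106                 # hypothetical observed sample mean
nsim <- 10^4
# Simulate nsim sample means under H0: mu = mu0
ybars0 <- colMeans(replicate(nsim, rnorm(n, mu0, sigma)))
# Simulation-based p-value: proportion at least as extreme as observed
mean(abs(ybars0 - mu0) >= abs(ybar.obs - mu0))
# Exact p-value under IID normality with known sigma, for comparison
2 * pnorm(-abs(ybar.obs - mu0) / (sigma / sqrt(n)))
```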

Mar 23:

  • Recap of LM Examples 6.2.1 and 6.2.2 (30 minutes)
  • Point out the IID Bernoulli case in LM Section 5.3, Case Study 4.3.1 (testing the claim that \(p=1/5\)) (10 minutes)
  • What is mathematical statistics? What have we been doing so far? (5 minutes)

Feb 28, Mar 2 and 7

Topics:

  1. Probability vs mathematical statistics
  2. A baseline statistical model: IID normal random variables
  3. Introduction to R: Your first dataset and your first Monte Carlo simulation

Activities: Collecting data on chocolate chip cookies

Assigned readings: Introduction (Section 8.1), Classifying Data (Section 8.2, pages 427 to 431 only), Introduction (Section 5.1), Introduction (Section 7.1), Introduction (Section 6.1), Introduction (Section 10.1), Sections 1 to 2.2 of Notes on normal models

Exercises and resources: Homework 01 and suggested solutions, fasteR Lessons 1 to 9, Notes on normal models, Distributions in R (focus on base functionality, as these should be enough), Fix the typo in Example 19.1 of Ruppert and Matteson (2015).

Exercises from LM:

  • 3.12.19 and 4.3.18(b): You can implement a Monte Carlo simulation in R to roughly verify your answer in 4.3.18(b).
  • 4.2.12 and 4.2.14: Feel free to use R to help you answer these questions related to applications of the Poisson distribution.
  • Case Study 4.2.1: Read the case study as this is an application of the binomial distribution. Verify the computed values using R.
  • Case Study 4.3.1 and Exercise 4.3.12: Read the case study and work out the details. The exercise is related to the case study.
  • 4.2.29: This is similar to Case Study 4.2.3. Feel free to use R to help you solve the exercise.
  • 4.3.32, 4.3.33, 4.3.34, 4.3.38: For 4.3.38, you can implement a Monte Carlo simulation in R to roughly verify your answer.
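For exercises like 4.3.18(b) and 4.3.38 that ask for a Monte Carlo check, here is a generic sketch (my own, with made-up numbers rather than the exercises' actual ones): compare a simulated binomial probability with the exact value from pbinom().

```r
set.seed(5)
n <- 20; p <- 0.4; nsim <- 10^4
x <- rbinom(nsim, n, p)   # nsim realizations of a Binomial(n, p) count
mean(x <= 6)              # simulated P(X <= 6)
pbinom(6, n, p)           # exact value for comparison
```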

Feb 28: Introductions (45 minutes), data collection activity and discussion (45 minutes)

Mar 2:

  • Recap of data collection activity, the cookie dataset, and R demonstration (what is a csv file, loading data in csv format using read.csv(), the assignment operator <-, making a histogram using hist() and its function as a data summary) (25 minutes)
  • What is the statistical method? Refer to Figure 8 of MacKay and Oldford (2000). (20 minutes)
  • What is a model? (30 minutes)
  • LM Case Study 4.2.2 and where did they get the estimate for \(\lambda\) (15 minutes)
# Code used in class, modify accordingly for your situation
cookie <- read.csv("E:/data_cookie.csv")
# Shows the entire dataset
cookie
# Presents histogram of number of chocolate chip cookies
hist(cookie$numchoc)
# Store the histogram as an object
temp <- hist(cookie$numchoc)
# What is stored in temp?
temp 

Mar 7:

  • Reminders and announcements (10 minutes)
  • Recap of Case Study 4.2.2: mentioned that although \(\mathbb{E}\left(X\right)=\lambda\) whenever \(X\sim \mathrm{Poi}\left(\lambda\right)\), we also have \(\mathsf{Var}\left(X\right)=\lambda\). Note that the idea behind the general method used to combine these two sources of information to learn about \(\lambda\) was recognized with the Sveriges Riksbank Prize in Economic Sciences (popularly, the “Nobel” in economics). (20 minutes)
  • Covered Case Study 4.2.3 and contrasted with Case Study 4.2.2 where, in the former, the reciprocal of the sample mean is used as a “guess” for \(\lambda\) in the exponential case (15 minutes)
  • Introduced Monte Carlo simulation using R and Case Study 4.2.3 as the example (10 minutes, new commands include ?, dexp(), pexp(), qexp(), rexp())
  • Introduced the parameter of interest in value-at-risk application which is the upper \(\alpha\)-quantile of the loss distribution (5 minutes)
  • The distinction among parameter (or estimand), estimator, and estimate (5 minutes, picture from WomenInStat twitter account)
  • Recalled sums of IID random variables, for example, is the sum of independent exponential random variables also exponential? (5 minutes)
  • Sampling distribution of the sample mean under IID normality: why and Monte Carlo demonstration (20 minutes, new commands include sqrt(), mean(), var(), dim(), rnorm(), replicate(), colMeans())
# Code used in class for Case Study 4.2.3, other code in notes
# Draw 218 random numbers from Exp(0.015) and display as a histogram on a density scale
# freq = FALSE is to convert vertical axis into density scale
# The default is freq = TRUE, so if there is no freq setting, then you get histogram on a frequency scale
hist(rexp(218, 0.015), freq = FALSE)
# Designing the histogram a bit to make it look nicer
# You can repeatedly run the next two commands to see how the histogram fluctuates
hist(rexp(218, 0.015), freq = FALSE, xlim = c(0, 800), ylim = c(0, 0.015), main = "Draws from Exp(0.015)", xlab = "x")
# Superimpose pdf of Exp(0.015), add = TRUE is used for this. col, lty, lwd are graphical parameter options, check ?par. 
curve(dexp(x, 0.015), from = 0, to = 800, add = TRUE, col = "#CC79A7", lty = 2, lwd = 3)
# Computing P(X <= 30) when X is distributed as Exp(0.015)
# Compare this with the area of the first rectangle in the figure found in Case Study 4.2.3
pexp(30, 0.015)
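Tying this back to the value-at-risk point from the Mar 7 outline (my own sketch, reusing the Exp(0.015) loss distribution from Case Study 4.2.3): the parameter of interest is the upper \(\alpha\)-quantile, which qexp() delivers directly.

```r
alpha <- 0.05
# Upper alpha-quantile of Exp(0.015): the loss level exceeded with probability alpha
var.alpha <- qexp(1 - alpha, 0.015)
var.alpha
# Check: P(X > var.alpha) = alpha when X is distributed as Exp(0.015)
1 - pexp(var.alpha, 0.015)
```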