Individual paper

The project: big picture

You have about 6 weeks to pursue a project which will eventually lead to a written report with some supplemental material. Because of the time frame, you are NOT expected to produce a perfect project or report. In addition, because this is not a thesis, you are NOT expected to create something entirely new. But you are expected to use everything you have learned in the course (along with whatever you bring as part of your own experience and your own reading) to produce an informative report.

Aim for something in between “a simple project done very well” and “a complicated project with flaws”.

The report, along with any supplemental material, has to be in English. You are not expected to have perfect grammar and diction, but you are expected to do your best to communicate your own thoughts and understanding to the reader. You are required to use R for computations. The supplemental material may include R code, the datasets used, and the Quarto document (qmd file).

The project is done on an individual basis. The report, written in Quarto, will document and discuss what you have done. You will be applying everything that you have learned in the course and perhaps more, depending on your interests and learning goals. Because it is a formal report, references and citations are required. Furthermore, you are to write for an audience that has already finished a mathematical statistics course like the one you are currently taking.

Depending on the type of project you have chosen, there may be different sets of requirements, on top of those just mentioned.

The project will be divided into stages.

  • The third stage is the report you will submit around the final exam week (more definite time will be available soon, as it depends on arrangements beyond my control).

  • The second stage, as of 2023-05-25, is where you will be asked to provide your plan and progress report for the project based on your chosen project option. You get 0 if you do not submit on time. If there are no citations and there are indications of plagiarism, you also automatically get 0 and you will be reported to the administration. Three elements matter at this stage: your level of understanding of your chosen project option, the maturity of your plan and progress, and the overall writing. Every element that is extremely lacking will lead to a deduction of 10 from a maximum of 100. The resulting grade becomes “work in pursuit of the final project” (see course grading for more). The deadline is 2023-06-04, 2300 Beijing time at SPOC. You will submit the following:

    • Quarto (qmd) file which has a Preamble (if needed), Introduction and Motivation (answer the questions, put in your Plan, and document your Progress): Refer to the template for the structure. Make sure your qmd file can be rendered from your end. I should be able to render it on my computer using the submitted qmd file.
    • The bib file, since you will be including citations
    • The html file you rendered or generated
  • The first stage, as of 2023-05-14, is the selection of a project option.

A template you can use for the project is available as a qmd file (a text file you have to render in RStudio, after installing RStudio and Quarto) and as an HTML file. You also need a bib file, which is also a text file containing the references used for the template. Both the bib and qmd files have to be in the same directory.

Restrictions

You are not to use any AI assistance (for example, but not limited to, ChatGPT or its variants) for your project. The project options are sufficiently narrowed down and have enough background. The expectations about the project are also calibrated enough. Both these aspects are designed so that it would be a personal and authentic learning experience for you. Therefore, you are free to make not-so-serious mistakes along the way.

But you will be made to face the consequences of violating the spirit of a personal and authentic learning experience. Examples include, but are not limited to, attempting to reuse past projects, using someone else’s work and claiming it as your own, falsifying data, and other dishonest acts covered in the academic rules of the university.

You are free to discuss your project with your classmates, but not with anyone else. If you decide to discuss it with your classmates, you have to acknowledge them in your report.

Project options

Special topics in LM

This project is for students who are interested in applying what they have learned in the course to topics that are available in the book. This project is also for students who are interested in explaining to other readers how they have understood material they have tried to process.

The aim is to write a set of your own lecture notes or tutorial articles covering one of the following chapters of LM. You only have to choose one chapter.

  • Chapter 11: Regression
  • Chapter 13: Randomized Block Designs
  • Chapter 14: Nonparametric Statistics

The expected project output here is a report which demonstrates your understanding of the chapter and how you convey your understanding to the specified audience. In effect, you will be drafting notes (similar to what I have done for the class) to accompany the chapters. The case studies and exercises from the book should be used and solutions provided as part of your illustrations. Furthermore, you have to provide code in R so that a reader could see how to apply the methods discussed in the chapter. You are also encouraged to explicitly discuss how your chosen chapter is connected to what you have seen in other chapters of LM and the lectures in class.

Note that you are NOT supposed to provide PowerPoint presentations (PPTs).

Projects based on papers from The American Statistician

The journal called The American Statistician is a good starting point for a student who will eventually apply statistics in their work. The articles in this journal are written for practitioners with varying levels of exposure to statistics. Since you are a beginner in mathematical statistics, I have chosen a subset of articles over the past 20 years which I think a beginner may enjoy working on (hopefully).

If you choose this type of project, you should choose a topic of interest and then choose a paper belonging to that topic. This project is for students who want to start reading a serious, but short paper where mathematical statistics is used to guide practice.

The expected project output here is a report which demonstrates your understanding of the article and how it is connected to what you have seen in LM and in class. You are to focus on key aspects of the article, provide more complete derivations and other illustrations, reproduce some of the simulations and calculations, and perhaps provide alternative applications sharing a similar context.

The papers may be accessed within the university. If you need off-campus access, consult the university library website for more.

Sample size and power calculations

This project is for students who are interested in sample size and power calculations. The articles here extend the sample size and power calculations for the normal and binomial cases found in LM.
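As a rough sketch of the kind of calculation this option starts from, the following R code computes the sample size for a one-sided test about a normal mean with known standard deviation, and then checks the power attained with that sample size. All numbers (`delta`, `sigma`, `alpha`, `target_power`) are made up for illustration and are not taken from LM or from any of the articles.

```r
# Illustrative values only (assumptions, not taken from LM or the articles)
delta <- 0.5          # difference between the alternative and null means
sigma <- 1            # known population standard deviation
alpha <- 0.05         # significance level of the one-sided test
target_power <- 0.80  # desired probability of rejecting H0 under the alternative

# Smallest n satisfying the usual normal-theory sample size formula
z_alpha <- qnorm(1 - alpha)
z_beta <- qnorm(target_power)
n <- ceiling(((z_alpha + z_beta) * sigma / delta)^2)

# Power actually attained with that n
attained_power <- pnorm(sqrt(n) * delta / sigma - z_alpha)

n
attained_power
```

Because `n` is rounded up to an integer, the attained power is slightly above the target. The articles in this option consider settings where such a closed-form expression may not be available.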

Confidence intervals

This project is for students interested in pursuing more realistic constructions of a confidence interval for other estimands.

Hypothesis testing

This project is for students curious about more realistic cases of hypothesis testing compared to those encountered in LM. Perhaps these students are also curious about whether researchers agree about the approaches learned in the classroom.

Projects based on applied papers from The American Statistician

These projects are for students interested in applied statistical projects. The word “applied” here refers to real-life cases which are similar to the Case Studies found in LM. But the papers here are longer and more detailed than the case studies you saw in LM.

In this type of project, you will be creating a case study similar to the Case Studies found in LM. Choosing one of the papers here means that you should find something very similar in the Chinese context. You must imagine yourself writing a case study which could have the potential to be included as part of (maybe) a new edition of our textbook. Ultimately, you have to gather some relevant data which will allow you to proceed with the approaches presented in your chosen paper but using data obtained from a Chinese context.

Below you will find a subset of articles from The American Statistician. The articles involve themes related to legal cases, over-collection of fees, and giving ratings.

Projects based on longer papers

This project option is likely to be much more challenging, but may be very rewarding. I only provide three options as many aspects of these longer papers could be turned into a small project.

The expected project output is a report which demonstrates your understanding of the article and how it is connected to what you have seen in LM and in class. You are to focus on a subset of the key aspects of the article, provide more complete derivations and other illustrations, reproduce or even create some of the simulations and calculations, and perhaps provide alternative applications sharing a similar context.

Projects based on your diary entries

This project option is likely to be slightly more challenging, because you have to do everything from scratch. The objective is to formulate, motivate, and answer a research question by applying the techniques you have learned in this mathematical statistics course to secondary data sources. You are expected to determine whether or not your analysis techniques are suitable for answering your research question.

Many students have ideas which require primary data collection. Primary data collection means that you have to propose a data collection protocol, implement a trial of the data collection protocol, revise the protocol, and then implement it more widely under a random sampling mechanism. Furthermore, there may be ethical concerns with whether the data may be collected in the first place. Therefore, I would advise against choosing primary data collection to pursue ideas based on your diary entries.

Use secondary data sources instead. In particular, use datasets belonging to IPUMS. The datasets available at IPUMS cover a wide enough variety of topics and contexts (refer to the following overview). You have to create an account to use IPUMS, and for some datasets it may take more time before an account is approved (pay attention to this!).

Let me use the example of IPUMS USA. You can create an account at IPUMS USA now or at the end (just like when you are doing online shopping).

  1. Go to the website of IPUMS. Give yourself some time to get acquainted with IPUMS itself. Take note of the other datasets covered by IPUMS.
  2. Next, visit the site of IPUMS USA. Give yourself some time to get acquainted with IPUMS USA.
  3. Click on the button “Get Data”. You will be taken to a “shopping” interface which will allow you to browse and check out data extracts.
  4. Uncheck the box labeled “Default sample from each year”. Select samples first. You can choose one of the years. Take some time to get acquainted with the IPUMS samples. Once you are done, click on “SUBMIT SAMPLE SELECTIONS”.
  5. You can explore the variables available. You can also look into the questionnaires which were used to obtain the available data.

If you choose this project option, you will be creating a dataset from scratch based on how you are going to answer your research question. You would have to really dig into data cleaning, learn a lot about the dataset you are using, and perhaps learn more R commands depending on your situation.

All processing, cleaning, and analysis have to be done in R and built into your Quarto document. This means that anyone should be able to start from the rawest data (that is, I should be able to go to IPUMS and download your rawest data by following your documentation) and then follow explicit commands that trace every step from the rawest data to the data used in your analysis.
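A minimal sketch of what such a traceable sequence of commands might look like is given below. It is only an illustration: the file `raw_extract.csv` is a tiny made-up stand-in for a downloaded extract, the variable names `AGE` and `INCWAGE` are used hypothetically, and the value `999999` is assumed to be a "not applicable" code. In a real project, check the codebook of your own extract for the actual variable names and codes.

```r
# Hypothetical stand-in for the raw extract. In a real project this file
# would be downloaded from IPUMS by following the steps documented in the report.
raw_path <- file.path(tempdir(), "raw_extract.csv")
write.csv(
  data.frame(
    AGE = c(25, 40, 17, 52, 33),
    INCWAGE = c(30000, 999999, 0, 45000, 52000)
  ),
  raw_path,
  row.names = FALSE
)

# Step 1: load the rawest data exactly as downloaded
raw <- read.csv(raw_path)

# Step 2: recode the assumed "not applicable" code as missing
cleaned <- raw
cleaned$INCWAGE[cleaned$INCWAGE == 999999] <- NA

# Step 3: keep adults with a recorded wage income
analysis <- subset(cleaned, AGE >= 18 & !is.na(INCWAGE))

# Step 4: the analysis itself, here a 95% confidence interval for the mean
result <- t.test(analysis$INCWAGE, conf.level = 0.95)
result$conf.int
```

Each step creates a new object (`raw`, `cleaned`, `analysis`) rather than overwriting the previous one, so a reader can inspect the data at every stage when rendering the Quarto document.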