The bootstrap

Preliminary reading and motivation

Read LM Section 5.9 first. The main point of that section is to distinguish between deriving the standard error of an estimator analytically (meaning that there is an explicit expression or formula) and using the computer to obtain a standard error. The latter does not mean applying a formula you have already derived analytically; it means using the computer to obtain the standard error without ever knowing what the formula looks like.

Recall once again that the standard error is the standard deviation of the sampling distribution of an estimator. Recall that we obtained these standard errors either through direct calculation using formulas or as the output of an algorithm (such as looking at the inverse of the Fisher information in the case of the MLE). Finally, recall that we have been using Monte Carlo simulations to get a sense of whether the formulas obtained by direct analytical calculation are accurate. I refer you to our Homework 02, where the standard errors of different estimators of \(\theta\) for IID \(\mathsf{U}\left(0,\theta\right)\) were derived explicitly and then matched against the output of a Monte Carlo simulation.
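
If you want a quick refresher on that workflow, here is a minimal Monte Carlo sketch in the spirit of Homework 02. The values theta = 5 and n = 20 are hypothetical, and the two estimators shown (the MME \(2\overline{X}\) and the MLE, the sample maximum) are my illustrative choices; the exact estimators in the homework may differ.

# Simulate the sampling distributions of two estimators of theta under U(0, theta)
theta <- 5
n <- 20
mc.mme <- replicate(10^4, 2 * mean(runif(n, 0, theta)))  # method of moments estimator
mc.mle <- replicate(10^4, max(runif(n, 0, theta)))       # sample maximum (MLE)
# Monte Carlo standard errors against the analytical formulas
c(sd(mc.mme), theta/sqrt(3 * n))
c(sd(mc.mle), sqrt(n/(n + 2) - n^2/(n + 1)^2) * theta)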

In LM Section 5.9, you are actually exposed to a form of the bootstrap called the parametric bootstrap. It is called a parametric bootstrap because a parametric statistical model was used. I will reproduce the steps here using R:

# Step 1: Draw 15 random observations from a gamma pdf with r=2 and theta=10.
# Imagine that this is the data we actually have. 
# Note that R implements the gamma distribution with some slight differences.
step1.data <- rgamma(15, shape = 2, rate = 1/10)
round(step1.data, 3)
 [1] 13.242  5.647  9.786  8.128 26.727  0.835  9.156 19.762 30.411 22.882
[11] 14.800 18.911 28.066  7.075  4.667
# Step 2: Imagine that theta is unknown and you estimate it. 
# Define an R function to calculate the estimator. 
theta.est <- function(data) 1/2*mean(data) 
# Compute the estimate for the dataset we actually have. 
est <- theta.est(step1.data)
est
[1] 7.336476
# Step 3: Imagine 200 resamples of size 15 from a gamma pdf with r=2 but with theta = est
step3.resample <- replicate(200, rgamma(15, shape = 2, rate = 1/est))
# step3.resample has 200 columns and 15 rows
# every column is a resample of size 15 from the indicated gamma pdf
# Show what the 200 resamples of size 15 look like
# Just like Table 5.9.2, I show the first two rows and the last row
# t() is to take the transpose to match the orientation of Table 5.9.2
round(t(step3.resample[, c(1, 2, 200)]), 3)
       [,1]   [,2]   [,3]   [,4]   [,5]   [,6]   [,7]   [,8]   [,9]  [,10]
[1,]  7.468  4.735 19.956 14.347  3.261 11.948  5.158 48.441 14.330 29.080
[2,] 19.192 15.240 19.011 20.099 12.602 36.465 16.707  7.664  2.990 12.346
[3,]  5.327 18.393  6.402 26.334  4.004  5.557  5.649 35.311 14.722  9.756
     [,11]  [,12]  [,13]  [,14]  [,15]
[1,] 5.393 23.593 23.891 15.959  5.609
[2,] 2.081  4.847 30.030 17.336 32.135
[3,] 1.874 21.941 23.472 52.190 27.642
# Step 4: Compute the estimator of theta for every resample
collect.est <- apply(step3.resample, 2, theta.est)
# What it looks like
# 200 estimates of theta based on the resampled observations
collect.est 
  [1]  7.772299  8.291484  9.031640  6.513187  7.870433  6.998364  8.455264
  [8] 10.873682  9.220911  8.036163  9.126621  6.724047  6.392733  8.434900
 [15]  6.480775  6.359551  7.475639  7.501997  5.872307  7.715925  9.048979
 [22]  5.785542  5.198291  7.623578  7.294960  8.316653  5.752928  6.922642
 [29]  7.045586  6.179828  6.889172  5.814978  7.496318  8.034360  8.227438
 [36]  5.757648  9.063369  6.047005  7.320521  9.000681  8.226047  8.072830
 [43]  8.139285  6.827447  6.388023 10.079980  7.723930  5.507103 11.184897
 [50]  8.639038  6.923632  7.337270  8.285944  7.822500  8.862395  8.489130
 [57]  6.105287  6.380688  5.917173 10.140259  9.625092  6.256254  7.252036
 [64]  7.056020  6.202223  6.904643  8.388519  5.703024  7.682226  5.813033
 [71]  8.040395  8.663926  4.909996  7.663541  6.242449  6.839438  7.812049
 [78]  9.232892  6.820489  5.424089  8.345370  5.767555  9.086246  5.852367
 [85]  7.460680  6.229139  7.341408  7.058534  6.890126  7.858950  7.003017
 [92]  6.394533  7.862752  6.936595  7.752444  7.072077  7.818117  9.278933
 [99]  7.435700  8.906870  9.174540  7.682238  5.074636  6.143237  7.213078
[106]  6.541838  7.852276  7.347126  6.714479  5.839020  7.366956  9.328700
[113]  6.037812  6.165141  7.169451 10.002120  7.726789  8.891376  7.217861
[120]  5.682138  5.937334  8.236505  7.961495  5.351382  8.268537  5.661993
[127]  5.896576  6.327050  6.989255  8.119791  8.046670  5.722071  6.061361
[134]  6.707667  6.537776  7.655259  6.678061  8.564960  7.059742  6.850817
[141]  5.001006  8.295542  6.452181  6.702923  6.277211  6.495864  6.326296
[148]  9.289162  4.665056  6.783445  7.217313  8.681256  5.465026  8.381213
[155]  6.002042  7.387753  9.395961  3.701501  7.120402  8.229265  8.341535
[162]  7.154837  7.670015  7.016837  7.918561  7.222742  6.015647  7.156007
[169]  8.907859  6.444934  6.256156  8.052538  8.719487  8.477526  5.225195
[176]  7.676268  4.878017  7.795936  6.622555  7.206947  9.702525  4.737825
[183]  7.792984  7.938359  8.484146  6.430828  7.710208  9.434171  5.929685
[190]  8.317174  7.743341  6.491544  6.914508  7.290563  5.992249  6.256025
[197]  6.748726  6.218718  6.082232  8.619165
# Step 5: Calculate the sd of the 200 estimates of theta
sd(collect.est)
[1] 1.262752
# Does it match theoretical standard error?
10/sqrt(2*15)
[1] 1.825742

Notice that if we did not know how to derive the theoretical standard error \(\theta/\sqrt{2n}\), we would not know what standard error to report. We may also not be able to compute confidence intervals or test hypotheses. Having a tool like the bootstrap therefore helps in practice. Natural questions to ask include:

  1. Does it work all the time? When will it not work?
  2. What if I do not know the distribution that generated the data?
  3. Is there a bootstrap version of constructing confidence intervals and testing hypotheses directly? I am not sure if you have noticed it yet, but you have already encountered an easier form of parametric bootstrap hypothesis testing: recall the simulated \(p\)-value we calculated in LM Example 6.2.1 about test scores.
  4. You may also wonder why we did not look at mean(collect.est). Is there something to learn from the mean of the estimates obtained from the resampled data?
  5. Did we actually create new datasets as a result of resampling?

As an exercise, try repeating the code above on your own computer. What changed? What did not change?

We can also answer the fourth question: the value of mean(collect.est) is 7.2858756. Notice that it is not the true value 10! It is, however, much closer to the value of est, the estimate of \(\theta\) based on the data we actually have. We will return to this observation later.
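
One standard use of this quantity (not developed in LM Section 5.9, so treat this as a side remark) is as a bootstrap estimate of the bias of the estimator: the average of the resampled estimates minus the estimate computed from the original data.

# Bootstrap estimate of the bias of theta.est: average of the 200 resampled estimates minus est
boot.bias <- mean(collect.est) - est
boot.bias
# A bias-corrected estimate subtracts the estimated bias from est
est - boot.bias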

The bootstrap principle

The bootstrap may be used to estimate bias, estimate standard errors, construct confidence intervals, and conduct hypothesis tests. But in these notes, I will focus on estimating standard errors.

The idea is as follows:

  • You have data on hand.
  • There was a sampling mechanism which produced the data you have.
  • You may not know, or may not have full information about, the mechanism which produced the data you have. What we do is apply the Monte Carlo simulation ideas. If we have full information about the statistical model, except perhaps for an unknown parameter, we are back to the beginning of these notes. But if you do not even know the statistical model, then you cannot use a Monte Carlo simulation (where we essentially knew everything). For example, in LM Example 6.2.1, we calculated the simulated \(p\)-value with full knowledge that we were working with a normal distribution, and we knew the mean and variance under the null. That is why we were able to simulate data. If we do not know the variance, for example, then we have to plug in a value for it. In this sense, we are back to a setting similar to the example at the beginning of these notes.

But what if you really do not know the distribution which produced the data? Then you sample from the data you have on hand instead: you resample. We are NOT creating new data, unlike the artificial datasets generated by Monte Carlo. We have to resample in a way that reflects the sampling mechanism which generated the data. The bottom line is that we are not adding anything new; we are injecting randomness into the problem so that we can get a sense of the sampling distribution of some estimator.

The idea is to use the computer to “mimic” some features of the sampling distribution of some estimator. The realization we see came from the population, so the resampled data should reflect the original population as well. Of course, the match between the resampled data and the original population will not be perfect. Since the realization reflects the population, we can draw random samples with replacement from the realization we actually have.

What I have described is called the nonparametric bootstrap. The bottom line is that we have to find a way to generate bootstrap resamples which obey the setting which generated the data we actually have.
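
Before formalizing this, here is a minimal generic sketch of the recipe. The function name np.boot.se and its default for B are mine, not from LM; the commented-out line shows how it could be applied to the data from LM Section 5.9.

# Nonparametric bootstrap standard error of an arbitrary statistic:
# resample the data with replacement B times, recompute the statistic
# on every resample, and report the standard deviation of the B values.
np.boot.se <- function(data, statistic, B = 1000) {
  boot.stats <- replicate(B, statistic(sample(data, length(data), replace = TRUE)))
  sd(boot.stats)
}
# Example: np.boot.se(step1.data, theta.est, B = 200)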

The bootstrap principle for the sample mean

We differentiate among three worlds:

  • Imagined world (where we can make assumptions or talk about ideal things): Let \(X_1, \ldots,X_n\) be IID random variables from some distribution. Let \(\overline{X}\) be the sample mean.
  • Real world (where things are messy and where assumptions we make might not match): Let \(x_1,\ldots,x_n\) be the data you actually have. Let \(\overline{x}\) be the actual sample mean.
  • Bootstrap world (where we pretend that the real world is the imagined world): Let \(X_1^*, \ldots, X_n^*\) be a resample with replacement from \(x_1,\ldots,x_n\). Let \(\overline{X}^*\) be a bootstrapped sample mean.

We will be operating in the bootstrap world by collecting an arbitrarily large number \(B\) of bootstrapped sample means, denoted by \(\overline{X}^*_{(1)}, \ldots, \overline{X}^*_{(B)}\). Provided the original sample size \(n\) is large, the bootstrap distribution of \(\overline{X}^*-\overline{x}\) will be a good approximation to the sampling distribution of \(\overline{X}-\mu\).
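
Here is a minimal sketch of this principle in action. The Exp(1) population (so that \(\mu=1\)), the sample size of 50, and the 2000 replications are illustrative choices of mine.

# Compare the sampling distribution of Xbar - mu (imagined world, via Monte Carlo)
# with the bootstrap distribution of Xbar* - xbar (bootstrap world, from one dataset)
n.demo <- 50
x.demo <- rexp(n.demo, rate = 1)   # the data we actually have; mu = 1
samp.dist <- replicate(2000, mean(rexp(n.demo, rate = 1)) - 1)
boot.dist <- replicate(2000, mean(sample(x.demo, n.demo, replace = TRUE)) - mean(x.demo))
# The spreads (and shapes) should be similar when n is large
c(sd(samp.dist), sd(boot.dist))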

There are different bootstrap principles for approximating the sampling distributions of the objects you are interested in. These principles have to be proven in every case, or you have to make sure that your estimator belongs to a particular class of objects for which a bootstrap principle exists. Unfortunately, the bootstrap principle alone does not tell you which sample statistics may be bootstrapped. Therefore, you have to be careful.

Let us return to the example in LM Section 5.9. Observe that the estimator of \(\theta\) is actually a constant multiple of the sample mean. We are going to apply the nonparametric bootstrap here and compare with the parametric bootstrap.

# Step 1: Start from the data we actually have. 
round(step1.data, 3)
 [1] 13.242  5.647  9.786  8.128 26.727  0.835  9.156 19.762 30.411 22.882
[11] 14.800 18.911 28.066  7.075  4.667
# Step 2: Imagine 200 resamples of size 15 with replacement using the data in Step 1. 
step2.resample <- replicate(200, sample(step1.data, 15, replace = TRUE))
# step2.resample has 200 columns and 15 rows
# every column is a resample of size 15 from the data in Step 1
# Show what the 200 resamples of size 15 look like
# Just like Table 5.9.2, I show the first two rows and the last row
# t() is to take the transpose to match the orientation of Table 5.9.2
# What do you notice for every row?
round(t(step2.resample[, c(1, 2, 200)]), 3)
       [,1]   [,2]   [,3]  [,4]   [,5]   [,6]   [,7]   [,8]   [,9]  [,10]
[1,] 14.800 18.911 26.727 8.128 14.800  0.835  9.786  8.128 28.066 14.800
[2,]  8.128 22.882  9.156 8.128  9.156 28.066  9.156 13.242 13.242 13.242
[3,] 22.882 13.242 26.727 4.667  8.128 26.727 19.762  9.156 26.727 18.911
      [,11]  [,12]  [,13]  [,14] [,15]
[1,] 30.411  8.128 28.066  9.156 9.786
[2,] 19.762  7.075  9.156  9.786 4.667
[3,] 19.762 19.762 30.411 19.762 5.647
# Step 3: Compute the estimator of theta for every resample
collect.est <- apply(step2.resample, 2, theta.est)
# What it looks like
# 200 estimates of theta based on the resampled observations
collect.est 
  [1]  7.684253  6.161363  7.037579  9.669423  9.148413  6.026931  6.836527
  [8]  6.801916  7.310108  7.521028  8.574021  7.176071  7.615399 10.191879
 [15]  6.874518  7.956791  8.819622  6.767512  8.575242  7.345206  5.112871
 [22]  8.227488  5.798221  6.918075  6.456158  5.110770  5.484534  7.388958
 [29] 11.459361  7.429853  8.566485  9.265727  6.718847  6.596510  7.697771
 [36]  8.212046  7.777173  7.515970  5.676489  8.807772 10.196334  7.653311
 [43]  7.458482  6.628327  6.838233  8.028151  6.665307  8.962132  8.617013
 [50]  7.168018  8.980189  9.406556  7.530833  8.795145  7.635737  7.298326
 [57]  6.750143  7.587612  7.450064  7.217378  8.311681  7.044414  8.274760
 [64]  7.025825  6.487561  5.637454  8.956427  8.218149  7.007596  6.998455
 [71]  8.478921  6.953148  7.049831  8.349634  7.376806  6.560408  7.177007
 [78]  7.924209  7.856223  6.346999  7.121394  8.221564  7.791824  7.868988
 [85]  7.042970  7.679898  6.309248  7.867363  7.347806  5.830306  7.522622
 [92]  6.319380  5.892193  5.588167  6.914861  6.261256  9.003409  7.823542
 [99]  5.173421  5.454647  4.999593  8.268601  6.803888  6.102428  7.726160
[106]  7.251054  6.436652  6.296063  7.641029  5.105604  6.389059  7.712435
[113]  6.801093  6.787461  6.688082  7.282185  8.058732  7.335317  7.804559
[120]  6.547189  8.274419  6.023266  7.183858  7.750126  6.178995  6.966876
[127]  7.176214  7.431944  7.301574  7.040185  6.049775  8.868953  7.072561
[134]  6.836229  7.152691  8.681230 10.091839  7.738576  7.658003  8.179254
[141]  9.034682  6.941416  5.088126  8.688269  6.928838  6.401243  6.884369
[148]  6.440495  8.446831  6.718178  8.004726  8.697364  9.153004  6.550593
[155]  6.273551  6.370632  7.624656  7.458233  8.636006  5.066830  5.892888
[162]  6.538854  7.346747  4.783799  7.453867  7.122903  9.017879  8.983764
[169]  7.554740  6.050612  7.837533  7.106501  6.987533  7.227362  7.801334
[176]  7.723867  6.971595  9.666907  8.423888  8.356025  7.673624  8.250034
[183]  9.079291  8.491999  7.243386  6.215050  7.563033  8.259749  8.541724
[190]  5.338555  6.687263  7.946220  5.009412  8.215954  6.805270  6.402027
[197]  8.757710  5.258261  8.833361  9.075705
# Step 4: Calculate the sd of the 200 estimates of theta
sd(collect.est)
[1] 1.142264
# Does it match theoretical standard error?
10/sqrt(2*15)
[1] 1.825742

Observe that the standard error of \(\widehat{\theta}\) based on the nonparametric bootstrap is not as close to the theoretical value as the one based on the parametric bootstrap, but the two bootstrap standard errors are still remarkably close to each other. Of course, the example in LM Section 5.9 has information (the fact that we have a gamma pdf) which can be exploited in the bootstrapping process. But in more complicated applications, we typically do not have full information.

In these notes, I will discuss three key versions of the bootstrap. In fact, you have already encountered two of them. The remaining one is important because it shows you why the bootstrap works.

  • Theoretical or exhaustive: Enumerate all possible resamples and construct the bootstrapped distribution of the estimator.
  • Empirical or nonparametric bootstrap: Use the data you have, take them as fixed, and resample from them a large number of times.
  • Parametric bootstrap: Fit a distribution which matches the data you have and draw resamples from that distribution.

The theoretical bootstrap

Consider the following situation:

  • Suppose \(X_1,X_2,X_3 \overset{\mathsf{IID}}{\sim} N\left(0,1\right)\).
  • Focus on the sample mean \(\overline{X}\) as the sample statistic or estimator.
  • You know that the sample mean is unbiased for 0.
  • You also know that the theoretical standard error of \(\overline{X}\) is \(1/\sqrt{3}\approx 0.577\).
  • You also know that if \(\sigma=1\) is unknown, then you would use a standard error equal to \(s/\sqrt{n}\), where \(s\) is the observed sample standard deviation.
  • Finally, IID plus normality implies that we know the finite-sample distribution of \(\overline{X}\), i.e., \(\overline{X} \sim N\left(0,1/3\right)\).

In this specific example, we can actually enumerate all the possible resamples of size 3 and derive the so-called theoretical bootstrapped distribution.

# Draw 3 random numbers from standard normal
x <- rnorm(3)
# Show what they look like
x
[1] 0.6670298 0.2100579 1.6179838
# Enumerate all possible resamples of size 3 with replacement
all <- matrix(c(x[1], x[1], x[1], x[1], x[1], x[2], x[1], x[2], x[1], 
    x[2], x[1], x[1], x[1], x[2], x[2], x[2], x[2], x[1],
    x[2], x[1], x[2], x[2], x[2], x[2], x[2], x[2], x[3],
    x[2], x[3], x[2], x[3], x[2], x[2], x[3], x[3], x[2],
    x[3], x[2], x[3], x[2], x[3], x[3], x[3], x[3], x[3],
    x[3], x[3], x[1], x[3], x[1], x[3], x[1], x[3], x[3],
    x[1], x[1], x[3], x[1], x[3], x[1], x[3], x[1], x[1],
    x[1], x[2], x[3], x[3], x[2], x[1], x[1], x[3], x[2],
    x[2], x[1], x[3], x[3], x[1], x[2], x[2], x[3], x[1]), 
    nrow=27, byrow=TRUE)
# Show the first 5 rows
all[1:5,]
          [,1]      [,2]      [,3]
[1,] 0.6670298 0.6670298 0.6670298
[2,] 0.6670298 0.6670298 0.2100579
[3,] 0.6670298 0.2100579 0.6670298
[4,] 0.2100579 0.6670298 0.6670298
[5,] 0.6670298 0.2100579 0.2100579
# Apply the sample mean to every row
allmeans <- apply(all, 1, mean)
# summary statistics for the data we actually have
c(mean(x), var(x), sd(x))
[1] 0.8316905 0.5158987 0.7182609
# summary statistics of the bootstrapped sample means
c(mean(allmeans), var(allmeans), sd(allmeans))
[1] 0.8316905 0.1190536 0.3450414
# compare
1/sqrt(3) # Theoretical standard error as you know the formula
[1] 0.5773503
sd(x)/sqrt(3) # How you would estimate the standard error if you know the formula but do not know that true variance = 1
[1] 0.4146881
sqrt(2/3)*sd(x)/sqrt(3) # degrees of freedom adjustment
[1] 0.3385914

Observe that mean(allmeans) is the mean of the collection of bootstrapped sample means. It is not equal to the true mean 0. But it actually matches the observed sample mean! Therefore, the bootstrap does not actually help you learn the true mean at all!

Observe that sd(allmeans) is the standard deviation of the collection of bootstrapped sample means. It does not match the theoretical standard error exactly, but notice that the theoretical standard error is based on knowing the formula \(\sigma/\sqrt{n}\). Typically, we would use \(S/\sqrt{n}\). But this also does not match very well. There is an explanation for this below: if you adjust for the fact that \(s\) has \(n-1\) in the denominator and change it to \(n\), then sqrt(2/3)*sd(x)/sqrt(3) matches sd(allmeans) very well (the small remaining gap arises because sd(allmeans) divides by 26 rather than 27). What explains all these findings?

Let \(X_1^*,X_2^*, X_3^*\) be a bootstrap random sample. Take note that \(X_1^*,X_2^*, X_3^*\) are IID r.v.’s. Denote the possible values of \(X_i^*\) to be \(\{x_1,x_2,x_3\}\). Then given the way we construct the bootstrap random sample, \[\mathbb{P}\left(X_i^*=x_1\right)=\mathbb{P}\left(X_i^*=x_2\right)=\mathbb{P}\left(X_i^*=x_3\right)=1/3\] We are going to examine the distribution of the bootstrapped sample means.

Let \(\overline{x}_3\) be the observed sample mean. Every bootstrap r.v. has expected value equal to \[\mathbb{E}\left(X_i^*\right)=x_1 * \frac{1}{3}+x_2 * \frac{1}{3}+x_3 * \frac{1}{3}= \overline{x}_3\] for \(i=1,2,3\). As a result of applying the properties of the expected value, we have \(\mathbb{E}\left(\overline{X}^*_3\right)=\overline{x}_3.\) So the mean of the distribution of the bootstrapped sample means is nothing more than the observed sample mean.

Let \(s^2\) be the observed sample variance. Every bootstrap r.v. has variance equal to

\[\begin{aligned} \mathsf{Var}\left(X_i^*\right) &=\mathbb{E}\left[\left(X_i^* - \mathbb{E}\left(X_i^*\right)\right)^2\right] \\ &= \left(x_1-\overline{x}_3\right)^2 * \frac{1}{3}+\left(x_2-\overline{x}_3\right)^2* \frac{1}{3}+\left(x_3-\overline{x}_3\right)^2 * \frac{1}{3} \\ &= \frac{1}{3}\sum_{i=1}^3\left(x_i-\overline{x}_3\right)^2 \end{aligned}\]

As a result of applying the properties of the variance, we have \[\mathsf{Var}\left(\overline{X}^*\right)=\frac{1}{3}\left(\frac{1}{3}\sum_{i=1}^3\left(x_i-\overline{x}_3\right)^2\right)=\frac{1}{3}\times\frac{2}{3}\left(\frac{1}{2}\sum_{i=1}^3\left(x_i-\overline{x}_3\right)^2\right)=\frac{2}{3}\frac{s^2_x}{3}=\left(1-\frac{1}{3}\right)\frac{s^2_x}{3}.\] Thus, the variance of the distribution of the bootstrapped sample means is \(2/3\) times \(s_x^2/3\), so its standard deviation is \(\sqrt{2/3}\) times what we would have used had we plugged the sample standard deviation into the formula for the standard error of the sample mean!

You can extend what we have here for an arbitrary sample size \(n\). Let \(X_1^*,X_2^*, \ldots, X_n^*\) be a bootstrap random sample. The set of possible outcomes of \(X_i^*\) are \(\{x_1,x_2,x_3,\ldots,x_n\}\) and \(\mathbb{P}\left(X_i^*=a\right)=1/n\) for all \(a\in\{x_1,x_2,x_3,\ldots,x_n\}\). Let \(\overline{x}_n\) and \(s_x^2\) be the observed sample mean and observed sample variance. As a result, we have \[ \begin{aligned} \mathbb{E}\left(X_i^*\right) &= x_1 * \frac{1}{n}+x_2 * \frac{1}{n}+\ldots+x_n * \frac{1}{n} = \overline{x}_n \\ \mathsf{Var}\left(X_i^*\right) &=\mathbb{E}\left[\left(X_i^* - \mathbb{E}\left(X_i^*\right)\right)^2\right] = \frac{1}{n}\sum_{i=1}^n\left(x_i-\overline{x}_n\right)^2=\left(1-\frac{1}{n}\right) s_x^2 \end{aligned} \]

Thus, the expected value and variance of the distribution of bootstrapped sample means is given by \[ \mathbb{E}\left(\overline{X}^*\right) =\overline{x}_n,\ \ \mathsf{Var}\left(\overline{X}^*\right) =\frac{1}{n}\left(\frac{1}{n}\sum_{i=1}^n\left(x_i-\overline{x}_n\right)^2\right)=\left(1-\frac{1}{n}\right) \frac{s_x^2}{n} \]

When \(n\) is large enough, the standard deviation of the bootstrapped sample means is roughly the same as the standard error of the sample mean based on the formula where we plugged in \(s_x\) for an unknown \(\sigma\).
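
For small \(n\), you can check these formulas by enumerating all \(n^n\) resamples programmatically instead of typing them out. Here is a sketch using expand.grid applied to the same x as before; the variance over the enumerated means is computed by dividing by \(n^n\) (rather than \(n^n-1\)) so that the identity holds exactly.

# Enumerate all n^n possible resamples of x with replacement
n <- length(x)
all.resamples <- as.matrix(expand.grid(rep(list(x), n)))
# every row is one of the n^n equally likely resamples
allmeans.full <- rowMeans(all.resamples)
# population variance over the n^n equally likely resamples
pop.var <- mean((allmeans.full - mean(allmeans.full))^2)
# check E(Xbar*) = observed sample mean and Var(Xbar*) = (1 - 1/n) * s^2/n
c(mean(allmeans.full), mean(x))
c(pop.var, (1 - 1/n) * var(x)/n)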

The bootstrap in practice

So what was new about the theoretical bootstrap? Nothing, at first glance. But remember that we had to enumerate all the bootstrap samples! For \(n=3\), there were only 27 possible bootstrap samples. For arbitrary \(n\), there are \(n^n\) possible bootstrap samples, so enumerating all of them quickly becomes infeasible. How do we get around having to use all the possible bootstrap samples? Randomly sample from them instead. Thus, the nonparametric bootstrap is born.

Consider the following example where I have a sample of 100 standard normals and I apply the nonparametric bootstrap to obtain an estimate of the standard error of the sample mean.

# This is n
sample.size <- 100
# This is the data we actually have
sample.obs <- rnorm(sample.size)
# This is B
boot.size <- 10^4
# These are resamples and we apply the sample mean to each resample
boot <- replicate(boot.size, mean(sample(sample.obs, sample.size, replace=TRUE)))
# These are resamples and we apply the sample mean to each resample (parametric bootstrap)
boot.par <- replicate(boot.size, mean(rnorm(sample.size, mean = mean(sample.obs), sd = sd(sample.obs))))
# Compare the plug-in SE s/sqrt(n) against the SDs of the two sets of bootstrapped sample means
c(sd(sample.obs)/sqrt(sample.size), sd(boot), sd(boot.par))
[1] 0.1127622 0.1125305 0.1123147
# Some pictures
par(mfrow=c(1,3))
# The bootstrapped distribution of the sample mean 
temp <- hist(boot, freq=FALSE)
# The true mean of the sampling distribution
abline(v=0, lty = 2)
# The actual mean of the bootstrapped distribution
abline(v=mean(boot), lty=3)
# The bootstrapped distribution of the sample mean (parametric bootstrap)
temp.par <- hist(boot.par, freq=FALSE)
# The true mean of the sampling distribution
abline(v=0, lty = 2)
# The actual mean of the bootstrapped distribution
abline(v=mean(boot.par), lty=3)
# The sampling distribution of the sample mean is known to us
curve(dnorm(x, 0, 1/sqrt(sample.size)), temp$breaks[1], temp$breaks[length(temp$breaks)], xlab="", ylab="", main="N(0,1/sqrt(n))")
abline(v=0, lty = 2)

The pictures above already indicate that the bootstrapped distribution of the sample mean is NOT centered at the true mean of 0, but is instead centered at the observed sample mean. But the shape and the spread of the sampling distribution are captured very well by the bootstrapped distribution.

There are two approximations at work:

  • The imagined world and the real world are different: we simply do not have complete knowledge of the distribution which produced the data. This is the crucial approximation, and the approximation error need not be small!
  • The real world and the bootstrap world are different: we draw random samples with replacement rather than enumerating all \(n^n\) possible resamples. This is less crucial, because we can make the number of bootstrap resamples large.

When does the bootstrap not work

What we will learn from the next example is that there are cases where the nonparametric bootstrap does not work. We will look at the sample maximum. Observe that the bootstrap principle I gave earlier in these notes refers to the sample mean. A big question is whether there are similar bootstrap principles for other estimators or sample statistics.

Consider the sample maximum and its sampling distribution. Let \(X_1,\ldots, X_n\overset{\mathsf{IID}}{\sim} U\left(0,\theta\right)\). Our interest is to estimate \(\theta\). Compare to our Homework 02.

  • Recall that the MLE is given by \(\widehat{\theta}=\max\{X_1,\ldots, X_n\}\).
  • The exact distribution of \(\widehat{\theta}\) has pdf \[f\left(a\right)=n\frac{a^{n-1}}{\theta^n}, \qquad 0 \le a \le \theta.\]
  • The sample maximum is not unbiased for \(\theta\) because \[\mathbb{E}\left(\widehat{\theta}\right)=\frac{n}{n+1}\theta.\]
  • But \(\mathbb{E}\left(\widehat{\theta}\right) \to \theta\) as \(n\to\infty\).
  • In addition, \[\mathsf{Var}\left(\widehat{\theta}\right)=\left(\frac{n}{n+2}-\frac{n^2}{\left(n+1\right)^2}\right)\theta^2.\]
  • Thus, \(\mathsf{Var}\left(\widehat{\theta}\right)\to 0\), as \(n\to\infty\).
  • So \(\widehat{\theta}\) is consistent for \(\theta\).
  • It is also possible to obtain an asymptotic distribution for a normalized version of the sample maximum. Clearly, simply letting \(n\to\infty\) in the cdf of \(\widehat{\theta}\) is not very useful. It can be shown that \[n\left(1-\frac{\widehat{\theta}}{\theta}\right)\] is the appropriate normalization (compare to standardization in the CLT) and that \[n\left(1-\frac{\widehat{\theta}}{\theta}\right) \overset{d}{\to} Exp\left(1\right).\]
# true value
theta <-  2
# sample size
n <- 100
# simulated distribution of the sample maximum
max.dist <- replicate(10^4, max(runif(n, 0, theta)))
# first two moments of simulated distribution
c(mean(max.dist), sd(max.dist))
[1] 1.98029170 0.01940359
# histogram of the sample maximum
hist(max.dist, freq = FALSE)

# histogram of a "standardized" or normalized version of the sample maximum
hist((1-max.dist/theta)*n, freq = FALSE)
curve(dexp(x, 1), add = TRUE, from = 0, to = 8, col = "#CC79A7", lty = "dashed", lwd = 3)

Now, let us demonstrate in what sense the bootstrap does not work for this IID uniform case.

# same setup as before
theta <-  2
n <- 100
# draw one sample only: data at hand
x <- runif(n, 0, theta)
# apply the nonparametric bootstrap
boot <- replicate(10^4, max(sample(x, replace = TRUE)))
# Bootstrap estimate of the standard error
sd(boot)
[1] 0.02113971
# Theoretical standard error from exact calculation
sqrt((n/(n+2)-n^2/(n+1)^2)*2^2)
[1] 0.01960688

The nonparametric bootstrap may be a bit off. Try running the previous code on your own computer a number of times. If we use a parametric bootstrap to produce a bootstrap estimate of the standard error, things look a bit better.

boot <- replicate(10^4, max(runif(n, 0, max(x))))
# Parametric bootstrap estimate of the standard error
sd(boot)
[1] 0.01976858
# Theoretical standard error from exact calculation
sqrt((n/(n+2)-n^2/(n+1)^2)*2^2)
[1] 0.01960688

We can actually use the bootstrap to approximate probabilities too. The following is based on Dekking et al (2005) Exercises 18.12 and 18.13. Let us consider approximation of probabilities involving the sample maximum. Suppose you are interested in \[\mathrm{P}\left(1-\frac{\widehat{\theta}}{\theta} \leq 0\right).\] Based on our results, we should get 0 for this probability.

n <- 100
x <- runif(n, 0, theta)
# Sample maximum of dataset
max(x)
[1] 1.958694
boot <- replicate(10^4, 1-max(sample(x, replace = TRUE))/max(x))
mean(boot <= 0)
[1] 0.6434

The nonparametric bootstrap gets this badly wrong: a resample reproduces the observed maximum whenever it happens to include the largest observation, which occurs with probability \(1-\left(1-1/n\right)^{n}\approx 1-e^{-1}\approx 0.63\), consistent with the output above. One way to fix this issue is to use a parametric bootstrap.

max(x)
[1] 1.958694
boot <- replicate(10^4, 1-max(runif(n, 0, max(x)))/max(x))
mean(boot <= 0)
[1] 0

What this example illustrates is that when we apply the bootstrap, we have to establish whether the bootstrap actually applies. Thankfully, for a large number of problems the bootstrap does work, especially for the very large class of so-called regular estimators. But the simple example above serves as a warning against applying the bootstrap in unfamiliar situations. The remaining sections point to usual application settings.

Applying the bootstrap to compute standard errors for the MLE and MME

A bootstrap algorithm to compute standard errors for the MLE basically follows the ideas described earlier; this version is sometimes called the parametric bootstrap. Let \(\widehat{\theta}_n\) be the MLE of \(\theta\) and let \(\theta\) have dimension \(k=1\).

  1. Draw samples \(X_{1}^{*},X_{2}^{*},\ldots,X_{n}^{*}\) from \(f\left(x;\widehat{\theta}_n\right)\).

  2. Compute the MLE \(\widehat{\theta}^{*}_n\) based on \(X_{1}^{*},X_{2}^{*},\ldots,X_{n}^{*}\) .

  3. Repeat Steps 1 and 2 \(B\) times to get \(\widehat{\theta}_{n,1}^{*},\widehat{\theta}_{n,2}^{*},\ldots,\widehat{\theta}_{n,B}^{*}\).

  4. For \(k=1\), we have \(\displaystyle\mathsf{\widehat{se}}_{boot}=\sqrt{\dfrac{1}{B}\sum_{b=1}^{B}\left(\widehat{\theta}_{n,b}^{*}-\dfrac{1}{B}\sum_{r=1}^{B}\widehat{\theta}_{n,r}^{*}\right)^{2}}\). The case for \(k>1\) is a bit heavier in notation and we skip it (but it would have a matrix form).

Observe that we no longer had to invert a Fisher information matrix. For simple one-dimensional problems the gains are probably not great, but for high-dimensional problems the bootstrap can be very useful. A sketch of the algorithm appears below.
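
Here is a minimal sketch of this algorithm. I assume, purely for illustration, an exponential model with rate \(\lambda\), whose MLE is \(1/\overline{X}\) and whose asymptotic standard error (from the Fisher information) is \(\lambda/\sqrt{n}\); the values of \(\lambda\), \(n\), and \(B\) are likewise my own choices.

# Parametric bootstrap SE of the MLE, sketched for an Exp(rate = lambda) model
n <- 50
lambda <- 2
x.mle <- rexp(n, rate = lambda)   # the data we actually have
mle <- 1/mean(x.mle)              # MLE of the rate
# Steps 1 to 3: draw B resamples from the fitted model and recompute the MLE each time
B <- 1000
mle.boot <- replicate(B, 1/mean(rexp(n, rate = mle)))
# Step 4: bootstrap SE against the plug-in asymptotic SE, the estimated rate over sqrt(n)
c(sd(mle.boot), mle/sqrt(n))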

In a similar fashion, a bootstrap algorithm can also be designed to compute standard errors for the MME. Let \(\widehat{\theta}_n\) be the MME of \(\theta\) and let \(\theta\) have dimension \(k=1\). Observe that only the first step is different!

  1. Draw samples \(X_{1}^{*},X_{2}^{*},\ldots,X_{n}^{*}\) with replacement from \(X_{1},X_{2},\ldots,X_{n}\).

  2. Compute the MME \(\widehat{\theta}_{n}^{*}\) based on \(X_{1}^{*},X_{2}^{*},\ldots,X_{n}^{*}\).

  3. Repeat Steps 1 and 2 \(B\) times to get \(\widehat{\theta}_{n,1}^{*},\widehat{\theta}_{n,2}^{*},\ldots,\widehat{\theta}_{n,B}^{*}\).

  4. For \(k=1\), we have \(\displaystyle\mathsf{\widehat{se}}_{boot}=\sqrt{\dfrac{1}{B}\sum_{b=1}^{B}\left(\widehat{\theta}_{n,b}^{*}-\dfrac{1}{B}\sum_{r=1}^{B}\widehat{\theta}_{n,r}^{*}\right)^{2}}\). The case for \(k>1\) is a bit heavier in notation and we skip it (but it would have a matrix form).
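
Here is a corresponding minimal sketch of the MME version. I assume, purely for illustration, the \(\mathsf{U}\left(0,\theta\right)\) model from Homework 02, whose MME is \(2\overline{X}\) with standard error \(\theta/\sqrt{3n}\); the numerical values are again my own choices.

# Nonparametric bootstrap SE of the MME 2*Xbar for U(0, theta)
n <- 50
theta <- 5
x.mme <- runif(n, 0, theta)   # the data we actually have
mme <- 2 * mean(x.mme)        # MME of theta
# Steps 1 to 3: resample the data with replacement and recompute the MME each time
B <- 1000
mme.boot <- replicate(B, 2 * mean(sample(x.mme, n, replace = TRUE)))
# Step 4: bootstrap SE against the analytical SE theta/sqrt(3n) at the true theta
c(sd(mme.boot), theta/sqrt(3 * n))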

Functions of a parameter

If you are interested in functions of the parameters of interest, say, \(\tau=g\left(\theta\right)\), you can also apply what is called the delta method. Roughly, the delta method states that if \(\tau=g\left(\theta\right)\) with \(g\) differentiable and \(g^{\prime}\left(\theta\right)\neq0\), then \(\dfrac{\widehat{\tau}_{n}-\tau}{\widehat{\mathsf{se}}\left(\widehat{\tau}_{n}\right)}\overset{d}{\rightarrow}N\left(0,1\right),\) where \(\widehat{\tau}_{n}=g\left(\widehat{\theta}_{n}\right)\) and \[\widehat{\mathsf{se}}\left(\widehat{\tau}_{n}\right)=\left|g^{\prime}\left(\widehat{\theta}_{n}\right)\right|\widehat{\mathsf{se}}\left(\widehat{\theta}_{n}\right).\]

The bootstrap can be very useful in avoiding the calculation of the standard errors from the delta method. The idea is to retain the resampling part but

  1. Now compute \(\widehat{\tau}_{n}^{*}=g\left(\widehat{\theta}_{n}^{*}\right)\) after computing \(\widehat{\theta}_{n}^{*}\) (whether MME or MLE or some other method).

  2. Repeat all the previous steps \(B\) times to get \(\widehat{\tau}_{n,1}^{*},\widehat{\tau}_{n,2}^{*},\ldots,\widehat{\tau}_{n,B}^{*}\).

  3. We have \(\displaystyle\mathsf{\widehat{se}}_{boot}=\sqrt{\dfrac{1}{B}\sum_{b=1}^{B}\left(\widehat{\tau}_{n,b}^{*}-\dfrac{1}{B}\sum_{r=1}^{B}\widehat{\tau}_{n,r}^{*}\right)^{2}}\).
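
Here is a minimal sketch comparing the two routes. I assume, purely for illustration, a normal model with \(\theta=\mu\), \(\widehat{\theta}_{n}=\overline{X}\), and \(\tau=g\left(\mu\right)=e^{\mu}\), so that the delta-method standard error is \(e^{\overline{x}}\,s/\sqrt{n}\); the numerical values are my own choices.

# Delta method vs nonparametric bootstrap SE for tau = g(mu) = exp(mu)
n <- 100
x.tau <- rnorm(n, mean = 1, sd = 2)   # the data we actually have
tau.hat <- exp(mean(x.tau))           # estimate of tau
# Delta method: |g'(mu.hat)| * se(mu.hat) = exp(xbar) * s/sqrt(n)
se.delta <- exp(mean(x.tau)) * sd(x.tau)/sqrt(n)
# Bootstrap: resample, recompute tau.hat on each resample, take the SD
B <- 1000
tau.boot <- replicate(B, exp(mean(sample(x.tau, n, replace = TRUE))))
c(se.delta, sd(tau.boot))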

To learn more

I refer you to the relevant chapters of Dekking et al (2005), Wasserman (2004), and Devore et al (2022) for more details.