Which Bootstrap When? - ADAfaEPoV.pdf

experiment like this one.

6.7 Which Bootstrap When?

This chapter has introduced a bunch of different bootstraps, and before it closes it’s worth reviewing the general principles, and some of the considerations which go into choosing among them in a particular problem.

When we bootstrap, we try to approximate the sampling distribution of some statistic (mean, median, correlation coefficient, regression coefficients, smoothing curve, difference in MSEs. . . ) by running simulations, and calculating the statistic on the simulation. We’ve seen three major ways of doing this:

• The parametric bootstrap: we estimate the model, and then simulate from the estimated model;

• Resampling residuals: we estimate the model, and then simulate by resampling residuals to that estimate and adding them back to the fitted values;

• Resampling cases or whole data points: we ignore the estimated model completely in our simulation, and just re-sample whole rows from the data frame. Which kind of bootstrap is appropriate depends on how much trust we have in our model.

The parametric bootstrap trusts the model to be completely correct forsome parameter value. In, e.g., regression, it trusts that we have the right shape for the regression function and that we have the right distribution for the noise. When we trust our model this much, we could in principle work out sampling distributions analytically; the parametric bootstrap replaces hard math with simulation.

Resampling residuals doesn’t trust the model as much. In regression problems, it assumes that the model gets theshape of the regression function right, and that the noise around the regression function is independent of the predictor variables, but doesn’t make any assumption about how the fluctuations are distributed. It is therefore more secure than parametric bootstrap.20

Finally, resampling cases assumes nothing at all about either the shape of the regression function or the distribution of the noise, it just assumes that each data point (row in the data frame) is an independent observation. Because it assumes so little, and doesn’t depend on any particular model being correct, it is very safe.

The reason we do not always use the safest bootstrap, which is resampling cases, is that there is, as usual, a bias-variance trade-off. Generally speaking, if we compare three sets of bootstrap confidence intervals on the same data for the same statistic, the parametric bootstrap will give the narrowest intervals, followed by resampling residuals, and resampling cases will give the loosest bounds. If the model reallyis

20_{You could also imagine simulations where we presume that the noise takes a very particular form (e.g.,}

at -distribution with 10 degrees of freedom), but are agnostic about the shape of the regression function, and learn that non-parametrically. It’s harder to think of situations where this is really plausible, however, exceptmaybe Gaussian noise arising from central-limit-theorem considerations.

model is also correct about the distribution of noise, we can do even better with a parametric bootstrap.

To sum up: resampling cases is safer than resampling residuals, but gives wider, weaker bounds. If you have good reason to trust a model’s guess at the shape of the regression function, then resampling residuals is preferable. If you don’t, or it’s not a regression problem so there are no residuals, then you prefer to resample cases. The parametric bootstrap works best when the over-all model is correct, and we’re just uncertain about the exact parameter values we need.

6.8 Further Reading

Davison and Hinkley (1997) is both a good textbook, and the reference I consult most often. Efron and Tibshirani (1993), while also very good, is more theoretical. Canty et al. (2006) has useful advice for serious applications. For professional purposes, I strongly recommend using the R packageboot (Canty and Ripley, 2013), based on Davison and Hinkley (1997). I deliberately donot use it in this chapter, or later in the book, for pedagogical purposes; I have found that forcing students to write their own bootstrapping code helps build character, or at least understanding.

Historical notes The original paper on the bootstrap, Efron (1979), is extremely clear, and for the most part presented in the simplest possible terms; it’s worth reading. His later small book (Efron, 1982), while often cited, is not in my opinion so useful nowadays21.

As the title of that last reference suggests, the bootstrap is in some ways a suc- cessor to an older method, apparently dating back to the 1940s if not before, called the “jackknife”, in which each data point is successively held back and the estimate is re-calculated; the variance of these re-estimates, appropriately scaled, is then taken as the variance of estimation, and similarly for the bias22. The jackknife is appealing in its simplicity, and sometimes works, but is valid much less generally than is the bootstrap.

6.9 Exercises

1. Show thatx₀is the mode of the Pareto distribution.

2. Derive the maximum likelihood estimator for the Pareto distribution (Eq. 6.15) from the density (Eq. 6.14).

3. Find confidence bands for the linear regression model of §6.4.1 using

21_{It seems to have done a good job of explaining things to people who were already professional statis-}

ticians in 1982.

22_{A “jackknife” is a knife with a blade which folds into the handle; think of the held-back data point as}

149 6.9. EXERCISES (a) The usual Gaussian assumptions (hint: try theintervals="confidence"

option topredict); (b) Resampling of residuals; and

Chapter 7

Moving Beyond Conditional

Expectations: Weighted Least

Squares, Heteroskedasticity,

Local Polynomial Regression

So far, all our estimates have been based on the mean squared error, giving equal im- portance to all observations. This is appropriate for looking at conditional expectations. In this chapter, we’ll start to work with giving more or less weight to different observations. On the one hand, this will let us deal with other aspects of the distribution beyond the conditional expectation, especially the conditional variance. First we look at weighted least squares, and the effects that ignoring heteroskedasticity can have. This leads naturally to trying to estimate variance functions, on the one hand, and generalizing kernel regression to local polynomial regression, on the other.

7.1 Weighted Least Squares

When we use ordinary least squares to estimate linear regression, we (naturally) minimize the mean squared error:

M SE(β) = 1 n n X i=1 (yi− ~xi· β)2 (7.1)

The solution is of course

βOLS= (xTx)−1xTy (7.2)

We could instead minimize theweighted mean squared error, W M SE(β, ~w) = 1 n n X i=1 w_i(yi− ~xi· β)2 (7.3)

151 7.1. WEIGHTED LEAST SQUARES This includes ordinary least squares as the special case where all the weightsw_i= 1. We can solve it by the same kind of linear algebra we used to solve the ordinary linear least squares problem. If we write w for the matrix with thew_i on the diagonal and zeroes everywhere else, the solution is

βW LS= (xTwx)−1xTwy (7.4)

But why would we want to minimize Eq. 7.3?

1. Focusing accuracy. We may care very strongly about predicting the response for certain values of the input — ones we expect to see often again, ones where mistakes are especially costly or embarrassing or painful, etc. — than others. If we give the points~x_i near that region big weightsw_i, and points elsewhere smaller weights, the regression will be pulled towards matching the data in that region.

2. Discounting imprecision. Ordinary least squares is the maximum likelihood estimate when theε in Y = ~X · β + ε is IID Gaussian white noise. This means that the variance ofε has to be constant, and we measure the regression curve with the same precision elsewhere. This situation, of constant noise variance, is called homoskedasticity. Often however the magnitude of the noise is not constant, and the data are heteroskedastic.

When we have heteroskedasticity, even if each noise term is still Gaussian, ordinary least squares is no longer the maximum likelihood estimate, and so no longer efficient. If however we know the noise varianceσ2

i at each measure-

menti, and set w_i= 1/σ2

i, we get the heteroskedastic MLE, and recover effi-

ciency. (See below.)

To say the same thing slightly differently, there’s just no way that we can estimate the regression function as accurately where the noise is large as we can where the noise is small. Trying to give equal attention to all parts of the input space is a waste of time; we should be more concerned about fitting well where the noise is small, and expect to fit poorly where the noise is big.

3. Sampling bias. In many situations, our data comes from a survey, and some members of the population may be more likely to be included in the sample than others. When this happens, the sample is a biased representation of the population. If we want to draw inferences about the population, it can help to give more weight to the kinds of data points which we’ve under-sampled, and less to those which were over-sampled. In fact, typically the weight put on data pointi would be inversely proportional to the probability of i being included in the sample (exercise 1). Strictly speaking, if we are willing to believe that linear model is exactly correct, that there are no omitted variables, and that the inclusion probabilitiesp_ido not vary withy_i, then this sort of survey weighting is redundant (DuMouchel and Duncan, 1983). When those assumptions are not met — when there’re non-linearities, omitted variables, or “selection on the

-4 -2 0 2 4 -15 -10 -5 0 5 10 x y

FIGURE7.1:Black line: Linear response function (y= 3 − 2x). Grey curve: standard deviation

as a function of x (σ(x) = 1 + x2_{/2). [[TODO: Add code.]]}

dependent variable” — survey weighting is advisable, if we know the inclusion probabilities fairly well.

The same trick works under the same conditions when we deal with “covari- ate shift”, a change in the distribution of X . If the old probability density function was p(x) and the new one is q(x), the weight we’d want to use is w_i = q(x_i)/p(x_i) (Quiñonero-Candela et al., 2009). This can involve estimat- ing both densities, or their ratio (chapter 16).

4. Doing something else. There are a number of other optimization problems which can be transformed into, or approximated by, weighted least squares. The most important of these arises from generalized linear models, where the mean response is some nonlinear function of a linear predictor; we will look at them in Chapters 12 and 13.

In the first case, we decide on the weights to reflect our priorities. In the third case, the weights come from the optimization problem we’d really rather be solving. What about the second case, of heteroskedasticity?

In document ADAfaEPoV.pdf (Page 147-152)