• No results found

Smoothing Splines versus Kernel Regression

everything which kernel regression can do9. The advantages of splines are their com-

putational speed and (once we’ve calculated the basis functions) simplicity, as well as the clarity of controlling curvature directly. Kernels however are easier to program (if slower to run), easier to analyze mathematically10, and extend more straightfor-

wardly to multiple variables, and to combinations of discrete and continuous vari- ables.

8.6 Further Reading

There are good discussions of splines in Simonoff (1996, §5), Hastieet al. (2009, ch. 5) and Wasserman (2006, §5.5). Wood (2006, ch. 4) includes a thorough practical 8Generalizations to more than two input variables are conceptually straightforward — just keep adding up more partial derivatives — but the book-keeping gets annoying.

9In fact, asn → ∞, smoothing splines approach the kernel regression curve estimated with a specific, rather non-Gaussian kernel. See Simonoff (1996, §5.6.2).

10Most of the bias-variance analysis for kernel regression can be done with basic calculus, as we did in Chapter 4. The corresponding analysis for splines requires working in infinite-dimensional function spaces called “Hilbert spaces”. It’s a pretty theory, if you like that sort of thing.

191 8.7. EXERCISES treatment of splines as a preparation for additive models (see Chapter 9 in these notes) and generalized additive models (see Chapters 12–13).

The classic reference, by one of the people who developed splines as a useful statistical tool, is Wahba (1990), which is great if you already know what a Hilbert space is and how to navigate one.

The first introduction of spline smoothing in the statistical literature seems to be Whittaker (1922). (“Graduation” was the term often used then for what we call “smoothing”.) He begins with an “inverse probability” (we would now say “Bayesian”) argument for minimizing Eq. 8.1 to find the most probable curve, based on thea priori hypothesis of smooth Gaussian curves observed through Gaussian error, and gives tricks for fitting splines more easily with the mathematical technology available in 1922. He does not, however, use the word “spline”, and I am not sure when that analogy was made.

In economics and econometrics, the use of spline smoothing on time series is known as the “Hodrick-Prescott filter”, after two economists who re-discovered the technique in 1981, along with a fallacious argument thatλ should always take a partic- ular value (1600), regardless of the data11. See Paige and Trindade (2010) for a (polite)

discussion, and demonstration of the advantages of cross-validation.

8.7 Exercises

1. The smooth.spline function lets you set the effective degrees of freedom ex- plicitly. Write a function which chooses the number of degrees of freedom by five-fold cross-validation.

2. When we can’t measure our predictor variables perfectly, it seems like a good idea to try to include multiple measurements for each one of them. For in- stance, if we were trying to predict grades in college from grades in high school, we might include the student’s grade from each year separately, rather than sim- ply averaging them. Multiple measurements of the same variable will however tend to be strongly correlated, so this means that a linear regression will be nearly multi-collinear. This in turn means that it will tend to have multiple, mutually-canceling large coefficients. This makes it hard to interpret the re- gression and hard to treat the predictions seriously.

One strategy for coping with this situation is to carefully select the variables one uses in the regression. Another, however, is to add a penalty for large coefficient values. For historical reasons, this second strategy is called ridge regression, or Tikhonov regularization. Specifically, while the OLS estimate is b βOLS= argmin β 1 n n X i=1 (yi− xiβ)2, (8.34)

11As it were: Hodrick and Prescott re-invented the wheel, and decided that it should be an octagon.

8.7. EXERCISES 192 the regularized or penalized estimate is

b βRR= argmin β   1 n n X i=1 (yi− xiβ)2  + p X j=1 β2 j (8.35)

(a) Show that the matrix form of the ridge-regression objective function is

n−1(y − xβ)T(y − xβ) + λβTβ (8.36)

(b) Show that the optimum is b

βRR= (xTx+ nλI)−1xTy (8.37)

(c) What happens asλ → 0? As λ → ∞? (For the latter, it may help to think about the case of a one-dimensionalX first.)

(d) LetY = Z + ε, with Z ∼ U (−1,1) and ε ∼ N (0,0.05). Generate 2000 draws fromZ and Y . Now let Xi = 0.9Z + η, with η ∼ N (0,0.05), for i ∈ 1 : 50. Generate corresponding Xivalues. Using the first 1000 rows of

the data only, do ridge regression ofY on the Xi(not onZ), plotting the 50 coefficients as functions ofλ. Explain why ridge regression is called a shrinkage estimator.

(e) Use cross-validation with the first 1000 rows to pick the optimal value of

λ. Compare the out-of-sample performance you get with this penalty to

the out-of-sample performance of OLS. For more on ridge regression, see Appendix E.4.1.

14:41 Thursday 25thApril, 2013

Chapter 9

Additive Models

9.1 Partial Residuals and Back-fitting for Linear Mod-

els

The general form of a linear regression model is

Y | ~X= ~x— = β0+ ~β ·~x =

p

X

j=0

βjxj (9.1)

where for j ∈ 1 : p, the xj are the components of~x, and x0 is always the constant

1. (Adding this fictitious constant variable lets us handle the intercept just like any other regression coefficient.)

Suppose we don’t condition on all of ~X but just one component of it, say Xk. What is the conditional expectation ofY ?

EY |Xk= xk  = E” E”Y |X1,X2, . . .Xk, . . .Xp — |Xk= xk — (9.2) = E    p X j=0 βjXj|Xk= xk    (9.3) = βkxk+ E    X j 6=k βjXj|Xk= xk    (9.4)

where the first line uses the law of total expectation1, and the second line uses Eq.

1As you learned in baby prob., this is the fact that E[Y |X ] = E[E[Y |X ,Z]|X ] — that we can always condition on another variable or variables (Z), provided we then average over those extra variables when we’re done.

9.2. ADDITIVE MODELS 194