
4.2 How do we estimate conditional mean functions?

5.1.1 Additively separable models

An additively separable model restricts g(x) to be additively separable in the components of the vector X:

$$E(Y \mid X) = \alpha + g_1(X_1) + g_2(X_2) + g_3(X_3) + \cdots + g_d(X_d),$$

where the $g_i(x_i)$, $i = 1, \ldots, d$, are assumed to be unknown and are estimated nonparametrically. A key advantage of imposing additive separability is that the nonparametric estimators of the $g_i(x_i)$ functions, as well as of the conditional mean function $E(Y \mid X = x)$, can be made to converge at the univariate nonparametric rate. Another advantage is interpretive: the model allows for graphical depiction of the effect of $x_j$ on $y$ holding the other regressors constant. The separability assumption is also not as restrictive as it may seem, because some regressors could be interactions of other regressors (e.g. $x_3 = x_1 x_2$). However, for the $g_i(x_i)$ to be nonparametrically identified, it is necessary to rule out general forms of collinearity between the regressors. That is, we could not allow $x_1 = \psi(x_k)$ for some function $\psi$, for example, and still separately identify $g_1(x_1), \ldots, g_d(x_d)$.[42]
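
To see why this kind of collinearity breaks identification, a short worked example may help. The function $h$ below is an arbitrary illustrative choice and does not appear in the original text. If $x_1 = \psi(x_k)$ holds for every observation, then for any function $h$

$$g_1(x_1) + g_k(x_k) = \bigl[g_1(x_1) + h(x_1)\bigr] + \bigl[g_k(x_k) - h(\psi(x_k))\bigr],$$

because $h(x_1) = h(\psi(x_k))$. The pair $(g_1 + h,\; g_k - h \circ \psi)$ therefore produces the same conditional mean as $(g_1, g_k)$, and the data cannot distinguish between them.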

Estimation Methods

Back-fitting algorithms. As described in Hastie and Tibshirani (1990), additively separable models can be estimated through an algorithm called back-fitting.

The algorithm involves three steps:

(i) Choose starting values for $\alpha$ and for the $g_j$. A good starting value might set $\alpha^0 = \text{average}(Y)$ and $g_j^0$ equal to the values predicted by a least squares regression, linear in $x$, of $Y$ on a constant and all the regressors.

(ii) For each $j = 1, \ldots, d$, define $g_j = \hat{E}\bigl(y - \alpha - \sum_{k \neq j} g_k^0(x_k) \mid x_j\bigr)$, where $g_k^0$ is the most recent estimate of $g_k(x_k)$ (the starting value at the first iteration). The conditional expectation is estimated by a smoothing method, such as kernel or local linear regression, series expansion, or spline regression. At this stage, if a functional form restriction is to be imposed on the shape of one or more of the $g_j$ functions, it can be imposed by setting, for example, $\hat{E}\bigl(y - \alpha - \sum_{k \neq j} g_k^0(x_k) \mid x_j\bigr) = x_j \beta_j$.

(iii) Repeat step (ii) until convergence is reached, that is, until the estimated $g_j(x_j)$ functions no longer change.[43]
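
The three steps can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the implementation of Hastie and Tibshirani (1990): the function names `nw_smooth` and `backfit`, the Gaussian-kernel Nadaraya-Watson smoother, the fixed bandwidth `h`, the tolerance `tol`, and the recentering of each $g_j$ to mean zero (one common normalization) are all choices made here for illustration.

```python
import numpy as np


def nw_smooth(x, r, h):
    """Nadaraya-Watson estimate of E(r | x) at the sample points (Gaussian kernel)."""
    u = (x[:, None] - x[None, :]) / h
    w = np.exp(-0.5 * u ** 2)            # n x n matrix of kernel weights
    return (w @ r) / w.sum(axis=1)


def backfit(X, y, h=0.2, tol=1e-6, max_iter=200):
    """Back-fitting for E(Y|X) = alpha + sum_j g_j(X_j); returns alpha and an
    n x d array whose column j holds the estimate of g_j at the sample points."""
    n, d = X.shape
    # Step (i): alpha^0 = average(Y); g_j^0 from a linear least squares fit.
    alpha = y.mean()
    Xc = X - X.mean(axis=0)              # centering makes the intercept equal mean(y)
    beta, *_ = np.linalg.lstsq(Xc, y - alpha, rcond=None)
    g = Xc * beta
    for _ in range(max_iter):
        g_old = g.copy()
        # Step (ii): smooth the partial residual on x_j, always using the
        # most recent estimates of the other components.
        for j in range(d):
            partial = y - alpha - g.sum(axis=1) + g[:, j]
            g[:, j] = nw_smooth(X[:, j], partial, h)
            g[:, j] -= g[:, j].mean()    # normalization (an assumption of this sketch)
        # Step (iii): stop once the estimated functions no longer change.
        if np.max(np.abs(g - g_old)) < tol:
            break
    return alpha, g
```

Calling `alpha, g = backfit(X, y)` with an `n x d` regressor array `X` and a length-`n` response `y` returns the fitted constant and the fitted component functions evaluated at the data. Replacing `nw_smooth` with a linear fit for one particular `j` would correspond to the functional form restriction mentioned in step (ii).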

Back-fitting can require many iterations to reach convergence, but it is relatively easy to implement and is available in the software package Splus. Disadvantages of the method are that consistency has not been shown when nonparametric smoothing methods are used in step (ii), and that there is as yet no general distribution theory that can be used to evaluate the sampling variation of the estimators.

An estimator based on integration. An alternative approach to estimating the additively separable model is studied by Newey (1994), Härdle and Linton (1996), Linton, Chen, Wang and Härdle (1997) and others. Although it is more difficult to implement than the back-fitting procedure, because it requires a pilot estimator of the nonparametric model $g(x)$, the integration approach has the advantage of having a distribution theory available.

For notational simplicity, consider the additively separable model with two regressors, $Y = \alpha + g_1(X_1) + g_2(X_2) + \varepsilon$. Define the integrated parameter

$$g_1(x_1) = \int g(x_1, x_2)\, dF_{x_2}.$$

Note that this is generally not equal to $E(Y \mid X_1 = x_1)$, which would be

$$E(Y \mid X_1 = x_1) = \int g(x_1, x_2)\, dF_{x_2 \mid X_1 = x_1}.$$

If $X_1$ and $X_2$ are independent, then the two parameters coincide. The integration estimator is given by

$$\hat{g}_1(x_1) = n^{-1} \sum_{i=1}^{n} \hat{g}(x_1, x_{2i}).$$

If the model is additive, then $\hat{g}_1(x_1)$ estimates $g_1(x_1)$ up to an additive constant. Reversing the roles of $x_1$ and $x_2$ yields an estimator for $g_2(x_2)$, again up to an additive constant.
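
The averaging step can be sketched in Python as follows. This is a minimal illustration under stated assumptions: the names `nw_pilot` and `integration_estimator`, the product Gaussian kernel, and the single bandwidth `h` are choices made here. In particular, the Gaussian kernel is a simplification, and the sketch does not use the bias-reducing kernels that the existing distribution theory calls for.

```python
import numpy as np


def nw_pilot(x1, x2, X1, X2, y, h):
    """Bivariate Nadaraya-Watson pilot estimate of g(x1, x2) = E(Y | X1=x1, X2=x2).

    x1 and x2 are arrays of evaluation points with the same shape;
    X1, X2, y are the length-n sample arrays.
    """
    k1 = np.exp(-0.5 * ((x1[..., None] - X1) / h) ** 2)
    k2 = np.exp(-0.5 * ((x2[..., None] - X2) / h) ** 2)
    w = k1 * k2                          # product kernel weights, shape (..., n)
    return (w @ y) / w.sum(axis=-1)


def integration_estimator(grid, X1, X2, y, h=0.25):
    """For each x1 in grid, average the pilot estimate over the sample values
    of X2: g1_hat(x1) = n^{-1} sum_i g_hat(x1, X2_i)."""
    G1, G2 = np.meshgrid(grid, X2, indexing="ij")   # both of shape (m, n)
    pilot = nw_pilot(G1, G2, X1, X2, y, h)          # g_hat(x1, X2_i), shape (m, n)
    return pilot.mean(axis=1)
```

Averaging over the observed $X_{2i}$ is what integrates against the marginal distribution of $X_2$ rather than its conditional distribution given $X_1 = x_1$. Swapping the roles of `X1` and `X2` in the call gives the corresponding estimate for the second component.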

In general, we do not really believe that the underlying function $g(x_1, x_2)$ is additively separable; rather, we use the model as a convenient way to summarize the data. From this perspective, the integration estimator examines the effect of one variable, $X_1$, on the dependent variable after integrating out the remaining variables $X_2, \ldots, X_d$ using their marginal distribution, which would be exactly the correct procedure if the underlying function $g$ is indeed additively separable between $X_1$ and $X_2, \ldots, X_d$.

[43] Also see Hastie and Tibshirani (1990) for discussion of a modified back-fitting algorithm that, in some circumstances, converges in fewer iterations.

The back-fitting algorithm seems to be an attempt to obtain the solution to the least squares problem within the class of additively separable functions. If the underlying function $g$ is additively separable, the two sets of estimated functions should coincide up to additive constants; if it is not, the two estimates would in general converge to different functions.

Newey (1994) shows that the estimator $\hat{g}_1(x_1)$ converges at a one-dimensional nonparametric rate because of the averaging. As we have seen, convergence slows as the dimension grows, because fewer observations lie near any given point when we condition on a point in a higher-dimensional space. Since there is no need to condition on $X_2, \ldots, X_d$ when examining $g_1(x_1)$, the convergence rate corresponds to that of the one-dimensional case.
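
As a rough numerical illustration, suppose the regression functions are twice continuously differentiable. This smoothness assumption, and the sample size used below, are chosen here for illustration and are not stated in the text. The best attainable pointwise rate for a $d$-dimensional nonparametric regression is then $n^{-2/(4+d)}$, so with $n = 10{,}000$ observations

$$d = 1:\; n^{-2/5} = 10^{-1.6} \approx 0.025, \qquad d = 3:\; n^{-2/7} = 10^{-8/7} \approx 0.072.$$

An estimator of $g_1(x_1)$ that attains the one-dimensional rate therefore has an error of the smaller order regardless of how many other regressors enter the model.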

As noted above, an advantage of estimating additive models through integration is that the distribution theory for the estimators has been developed.[44] A disadvantage of the integration estimator is that it requires the higher-dimensional estimate of $g(x)$ to be calculated prior to averaging, and the existing distribution theory for the estimator requires that kernel functions taking negative values be used for bias reduction.

Generalized additive models. The additive modeling framework has been generalized to allow for known or unknown transformations of the dependent variable, $Y$. That is, estimators are available for models of the form

$$\theta(Y) = \alpha + g_1(X_1) + g_2(X_2) + \cdots + g_d(X_d) + \varepsilon,$$

where the link function θ may be a known transformation (such as the Box-Cox transformation) or may be assumed to be unknown and nonparametrically estimated along with the gj functions.

Hastie and Tibshirani (1990) describe how to modify back-fitting procedures to accommodate binary response data and survival data when the link function is known. For the case of an unknown $\theta$ function, Breiman and Friedman (1985) propose an estimation procedure called ACE (Alternating Conditional Expectations).[45] Linton, Chen, Wang and Härdle (1997) describe an instrumental variables procedure for estimating the $\theta$ function, which is based on the identifying assumption that the model is additively separable only for the correct transformation, so that misspecification of $\theta$ shows up as a correlation between the error terms and the instruments. We are not aware of empirical applications of these methods in economics, although generalized additive models (GAMs) and ACE seem potentially very useful ways for empirical researchers to gain some flexibility in modeling the conditional mean function while at the same time avoiding the curse of dimensionality.

[44] See, for example, Härdle and Linton (1996).

[45] ACE is also discussed in Hastie and Tibshirani (1990). The ACE algorithm is available in the software package Splus.

In document "Implementing Nonparametric and Semiparametric Estimators" (Page 43-46)