4.2 How do we estimate conditional mean functions?
5.1.3 Partially Linear Regression Model
The partially linear regression model extends the linear regression model to include a nonparametric component and specifies
$$Y = X'\beta_0 + \phi(Z) + \varepsilon,$$
where $X \in R^p$ and $Z \in R^q$ have no variables in common. If they did, the common variables would be regarded as part of $Z$ rather than $X$, because the coefficients on the common variables would not be identifiable. If there are no cross terms of $Z$ among the $X$'s, the model presumes additive separability of $\phi(Z)$ and $X$, which may be too restrictive in some applications. This framework is convenient for a model with many regressors, where fully nonparametric estimation is often impractical. It is also a good choice for a model that contains discrete regressors along with a few continuous ones. As discussed in section two, this model has been broadly applied in economics, mainly to the problem of estimating Engel curves and to the problem of controlling for sample selection bias.
Estimators for the partially linear model are studied in Heckman (1980, 1990), Shiller (1984), Stock (1991), Wahba (1984), Engle, Granger, Rice and Weiss (1986), Chamberlain (1986b), Powell (1987), Newey (1988), Robinson (1988), Ichimura and Lee (1991), Andrews (1991), Cosslett (1991), Choi (1992), Ahn and Powell (1993), Honoré and Powell (1994), Yatchew (1997), Heckman, Ichimura, Smith, and Todd (1998a), Heckman, Ichimura, and Todd (1998b) and others.
As we saw, the convergence rate of a fully nonparametric estimator depends on the number of continuous regressors in $(X, Z)$. In the partially linear regression framework, the convergence rate of the estimator of $\phi$ depends only on the number of continuous regressors in $Z$, and $n^{1/2}$-consistent estimation of $\beta_0$ can be carried out regardless of the number of continuous regressors in $(X, Z)$, provided the underlying functions are smooth enough, as shown by Robinson (1988).
To consider the estimator Robinson studied, observe that
$$E(Y \mid Z = z) = E(X \mid Z = z)'\beta_0 + \phi(z),$$
so that
$$Y - E(Y \mid Z = z) = [X - E(X \mid Z = z)]'\beta_0 + \varepsilon.$$
If we knew $E(Y \mid Z = z)$ and $E(X \mid Z = z)$, we could estimate $\beta_0$ by ordinary least squares regression of $Y - E(Y \mid Z = z)$ on $X - E(X \mid Z = z)$. Since we do not know them, we estimate them by some nonparametric method, call the estimates $\hat E(Y \mid Z = z)$ and $\hat E(X \mid Z = z)$, and estimate $\beta_0$ by
$$\left( \sum_{i=1}^{N} \left[ x_i - \hat E(x_i \mid z_i) \right] \left[ x_i - \hat E(x_i \mid z_i) \right]' \right)^{-1} \sum_{i=1}^{N} \left[ x_i - \hat E(x_i \mid z_i) \right] \left[ y_i - \hat E(y_i \mid z_i) \right].$$
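Since the conditional mean functions will not be estimated well where the density of $Z$ is low, Robinson makes use of a trimming function $\hat I_i = 1\{\hat f(z_i) > b_n\}$, where $\hat f$ is the kernel estimator of the density of $Z$, for a given sequence of numbers $\{b_n\}$.$^{47}$ The estimator is defined as
$$\hat\beta = \left( \sum_{i=1}^{N} \left[ x_i - \hat E(x_i \mid z_i) \right] \left[ x_i - \hat E(x_i \mid z_i) \right]' \hat I_i \right)^{-1} \sum_{i=1}^{N} \left[ x_i - \hat E(x_i \mid z_i) \right] \left[ y_i - \hat E(y_i \mid z_i) \right] \hat I_i.$$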
The estimation method is reminiscent of a familiar interpretation of the OLS estimator. Consider OLS estimation of
$$Y = X'\beta_0 + Z'\gamma + \varepsilon.$$
As is well known, the OLS estimator of $\beta_0$ equals the OLS estimator from regressing $u_y$ on $u_x$, where $u_y$ is the OLS residual from regressing $Y$ on $Z$ and $u_x$ is the OLS residual from regressing $X$ on $Z$.$^{48}$ Here, the first stage is replaced by nonparametric regressions.
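The double-residual construction with trimming can be sketched numerically as follows. This is a minimal illustration rather than Robinson's implementation: it assumes a Gaussian product kernel (a second-order kernel, whereas the theory below may call for higher-order kernels), a single bandwidth `a` for all regressions and the density, and a fixed trimming value `b`. The function names and the simulated design ($\beta_0 = (1, -2)'$, $\phi(z) = \cos 2z$) are hypothetical.

```python
import numpy as np

def gaussian_kernel(u):
    """Standard normal density, used here as a second-order kernel (an assumption)."""
    return np.exp(-0.5 * u**2) / np.sqrt(2.0 * np.pi)

def robinson_estimator(y, x, z, a, b):
    """Double-residual estimate of beta with trimming indicator 1{f_hat(z_i) > b}.

    y: (N,), x: (N, p), z: (N, q); a is the bandwidth, b the trimming value.
    """
    n, q = z.shape
    # N x N product-kernel weights K((z_i - z_j) / a)
    w = gaussian_kernel((z[:, None, :] - z[None, :, :]) / a).prod(axis=2)
    row_sum = w.sum(axis=1)
    f_hat = row_sum / (n * a**q)           # kernel density estimate at each z_i
    ey = (w @ y) / row_sum                 # Nadaraya-Watson estimate of E(y_i | z_i)
    ex = (w @ x) / row_sum[:, None]        # Nadaraya-Watson estimate of E(x_i | z_i)
    keep = f_hat > b                       # trimming indicator I_hat_i
    ry, rx = (y - ey)[keep], (x - ex)[keep]
    # OLS of the trimmed y-residuals on the trimmed x-residuals
    beta_hat = np.linalg.solve(rx.T @ rx, rx.T @ ry)
    return beta_hat, keep

# Simulated data (hypothetical design): beta_0 = (1, -2), phi(z) = cos(2z).
rng = np.random.default_rng(0)
n = 500
z = rng.uniform(-2.0, 2.0, size=(n, 1))
x = np.column_stack([z[:, 0] ** 2 + rng.normal(size=n),
                     np.sin(z[:, 0]) + rng.normal(size=n)])
beta0 = np.array([1.0, -2.0])
y = x @ beta0 + np.cos(2.0 * z[:, 0]) + 0.5 * rng.normal(size=n)

beta_hat, keep = robinson_estimator(y, x, z, a=0.3, b=0.15)
print(beta_hat, int(keep.sum()))
```

Because $X$ is correlated with $Z$ in this design, a linear regression of $Y$ on $X$ alone would generally be biased, while the residualized regression removes the part of $X$ and $Y$ explained by $Z$ before estimating $\beta_0$.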
Let $\alpha$ and $\mu$ be nonnegative real numbers and $m$ be the integer such that $m - 1 \le \mu \le m$. For such $\mu > 0$, $\mathcal{G}^{\alpha}_{\mu}$ is the class of functions $g : R^q \to R$ satisfying: $g$ is $(m-1)$-times partially differentiable for all $z$; for some $\rho > 0$,
$$\sup_{y \in \{y : |y - z| < \rho\}} \frac{|g(y) - g(z) - Q_{m-1}(y, z)|}{|y - z|^{\mu}} \le h(z),$$
where $Q_0 = 0$ and, for $m \ge 2$, $Q_{m-1}(y, z)$ is the $(m-1)$th-degree homogeneous polynomial in $y - z$ with coefficients the partial derivatives of $g$ at $z$ of order 1 through $m - 1$; and $g(z)$, its partial derivatives of order $(m-1)$ and less, and $h(z)$ all have finite $\alpha$th moments.
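Robinson uses the kernel regression estimator with independent kernel functions. He introduces the following notation: $K_l$, $l \ge 1$, is the class of even functions $k : R \to R$ satisfying
$$\int_{-\infty}^{\infty} u^i k(u)\,du = \begin{cases} 1 & \text{if } i = 0, \\ 0 & \text{if } i = 1, \ldots, l-1, \end{cases} \qquad k(u) = O\left( \left( 1 + |u|^{l+1+\delta} \right)^{-1} \right) \text{ for some } \delta > 0.$$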
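To make the moment conditions defining $K_l$ concrete, the sketch below checks them numerically for one candidate kernel, $k(u) = (3/2 - u^2/2)\varphi(u)$ with $\varphi$ the standard normal density. This kernel is an assumed example chosen for illustration and is not taken from the text; the helper names are hypothetical. Its tails decay faster than any polynomial, so the tail condition holds, and the integrals are approximated by a Riemann sum on a wide grid.

```python
import numpy as np

def gaussian_fourth_order(u):
    """Even kernel (3/2 - u^2/2) * phi(u); an assumed example of a higher-order kernel."""
    phi = np.exp(-0.5 * u**2) / np.sqrt(2.0 * np.pi)
    return (1.5 - 0.5 * u**2) * phi

def kernel_moments(k, order, lim=12.0, num=200001):
    """Riemann-sum approximation of int u^i k(u) du for i = 0, ..., order - 1."""
    u = np.linspace(-lim, lim, num)
    du = u[1] - u[0]
    return np.array([np.sum(u**i * k(u)) * du for i in range(order)])

moments = kernel_moments(gaussian_fourth_order, order=5)
print(np.round(moments, 6))
```

The zeroth moment is one and the first through third moments vanish, so this kernel satisfies the conditions for $l = 4$; the fourth moment is nonzero, so it does not satisfy them for $l = 5$.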
In the statement below, $k$ is the kernel function, $a$ is the bandwidth used for estimating the regression functions and the density, $b$ is the trimming value, and $q$ is the dimension of $z$. Both $a$ and $b$ depend on $N$, although the notation does not make this explicit.
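Theorem 3. (Robinson) Let the following conditions hold: (i) $(X_i, Y_i, Z_i)$, $i = 1, 2, \ldots$, are independent and distributed as $(X, Y, Z)$; (ii) the model specification is correct; (iii) $\varepsilon$ is independent of $(X, Z)$; (iv) $E(\varepsilon^2) = \sigma^2 < \infty$; (v) $E|X|^4 < \infty$; (vi) $Z$ admits a pdf $f$ such that $f \in \mathcal{G}^{\infty}_{\lambda}$ for some $\lambda > 0$; (vii) $E(X \mid Z = z) \in \mathcal{G}^{2}_{\mu}$ for some $\mu > 0$; (viii) $\phi(z) \in \mathcal{G}^{4}_{\nu}$ for some $\nu > 0$; (ix) as $N \to \infty$, $N a^{2q} b^{4} \to \infty$, $N a^{2\min(\lambda+1,\mu) + 2\min(\lambda+1,\nu)} b^{-4} \to 0$, $a^{\min(\lambda+1, 2\lambda, \mu, \nu)} b^{-2} \to 0$, $b \to 0$; (x) $k \in K_{\max(l+m-1,\, l+n-1)}$ for the integers $l, m, n$ such that $l - 1 < \lambda \le l$, $m - 1 < \mu \le m$, and $n - 1 < \nu \le n$. Then the condition that
$$\Phi \equiv E\left\{ [x - E(x \mid z)] [x - E(x \mid z)]' \right\}$$
is positive definite is necessary and sufficient for
$$\sqrt{N} (\hat\beta - \beta_0) \xrightarrow{d} N\left( 0, \sigma^2 \Phi^{-1} \right)$$
and
$$\hat\sigma^2 \left( N^{-1} \sum_{i=1}^{N} \left[ x_i - \hat E(x_i \mid z_i) \right] \left[ x_i - \hat E(x_i \mid z_i) \right]' \hat I_i \right)^{-1} \xrightarrow{p} \sigma^2 \Phi^{-1},$$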
47 This trimming is used by Bickel (1982). 48
where
$$\hat\sigma^2 = N^{-1} \sum_{i=1}^{N} \left\{ y_i - \hat E(y_i \mid z_i) - \left[ x_i - \hat E(x_i \mid z_i) \right]' \hat\beta \right\}^2.$$
As stated earlier, $\hat\beta$ converges at the rate $\sqrt{N}$, which does not depend on the dimension of $Z$, despite the presence of $\phi$. The theorem is stated for the kernel regression estimator, but the result should hold for other nonparametric estimators as discussed in section 7.
If $\hat E$ is linear in the dependent variable, then $\hat\sigma^2$ can be rewritten as
$$N^{-1} \sum_{i=1}^{N} \left[ y_i - x_i'\hat\beta - \hat E\left( y_i - x_i'\hat\beta \mid z_i \right) \right]^2,$$
which is a natural estimator of $\sigma^2$.
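Compared with OLS estimation of the model without $\phi$, the variance under homoskedasticity is higher, because
$$Var(x) = \Phi + Var(E(x \mid z)),$$
so that $Var(x) - \Phi$ is positive semidefinite and hence $\sigma^2 \Phi^{-1}$ is at least as large as $\sigma^2 [Var(x)]^{-1}$.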
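When there is heteroskedasticity so that (iii) does not hold, under analogous conditions
$$\sqrt{N} (\hat\beta - \beta_0) \xrightarrow{d} N\left( 0, \Phi^{-1} \Omega \Phi^{-1} \right),$$
where
$$\Omega = E\left\{ \varepsilon^2 [x - E(x \mid z)] [x - E(x \mid z)]' \right\}.$$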
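The two limiting variances can be estimated from the residualized data. The sketch below takes the residuals $x_i - \hat E(x_i \mid z_i)$ and $y_i - \hat E(y_i \mid z_i)$ as given inputs and forms $\hat\sigma^2 \hat\Phi^{-1}$ as in the theorem, together with a plug-in sandwich $\hat\Phi^{-1} \hat\Omega \hat\Phi^{-1}$. The sample analogue used for $\hat\Omega$ is an assumption made for illustration (the text defines only the population $\Omega$), and the function name and simulated inputs are hypothetical.

```python
import numpy as np

def robinson_variances(rx, ry, beta_hat, keep=None):
    """Estimates of sigma^2 Phi^{-1} and Phi^{-1} Omega Phi^{-1}.

    rx: (N, p) residuals x_i - E_hat(x_i | z_i); ry: (N,) residuals y_i - E_hat(y_i | z_i);
    keep: optional boolean trimming indicators I_hat_i (default: no trimming).
    """
    n = rx.shape[0]
    if keep is None:
        keep = np.ones(n, dtype=bool)
    e = ry - rx @ beta_hat                      # residuals entering sigma_hat^2
    sigma2_hat = np.mean(e**2)
    rxk = rx * keep[:, None]                    # apply the trimming indicator
    phi_hat = rxk.T @ rxk / n                   # N^{-1} sum rx rx' I_hat
    phi_inv = np.linalg.inv(phi_hat)
    # Assumed plug-in for Omega: N^{-1} sum e_i^2 rx rx' I_hat
    omega_hat = (rxk * (e**2)[:, None]).T @ rxk / n
    v_homo = sigma2_hat * phi_inv
    v_het = phi_inv @ omega_hat @ phi_inv
    return v_homo, v_het

# Hypothetical residualized data with error variance depending on the first regressor.
rng = np.random.default_rng(0)
n = 2000
rx = rng.normal(size=(n, 2))
ry = rx @ np.array([1.0, -2.0]) + (1.0 + np.abs(rx[:, 0])) * rng.normal(size=n)
beta_hat = np.linalg.solve(rx.T @ rx, rx.T @ ry)

v_homo, v_het = robinson_variances(rx, ry, beta_hat)
se_het = np.sqrt(np.diag(v_het) / n)            # standard errors of beta_hat
print(np.diag(v_homo), np.diag(v_het), se_het)
```

In this design the error variance increases with the magnitude of the first regressor, so the sandwich estimate for the first coefficient exceeds the homoskedastic one.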
The partially linear regression model also resembles the conditional mean function in sample selection models. If the outcome equation is specified as $Y = X'\beta_0 + u$ and selection is determined by the latent model $1\{Z'\theta + v > 0\}$, where $(u, v)$ and $(X, Z)$ are independent, then, without specifying the joint distribution of $(u, v)$, the following relationship holds in the selected sample:
$$Y = X'\beta_0 + \phi(Z'\theta) + \varepsilon, \qquad E(\varepsilon \mid X, Z) = 0,$$
where $\phi(Z'\theta) = E(u \mid v > -Z'\theta)$ depends on $(X, Z)$ only through $Z'\theta$.
Note that in this case there is more structure in the function $\phi$, and that $\theta$ (up to scale) can be estimated from the data on selection. Without this structure, as discussed above, the partially linear regression model identifies only the coefficients of $X$ variables that are not among the $Z$ variables.
Powell (1987) made use of this observation, modified Robinson's estimator so that there is no need for trimming, and discussed estimation of $\beta_0$. Ahn and Powell (1993) extended this approach further based on the observation that in the sample selection model one can write the conditional mean function as
$$Y = X'\beta_0 + \phi(P(Z)) + \varepsilon, \qquad E(\varepsilon \mid X, Z) = 0,$$
where $P(z)$ is the probability of being selected into the sample, which can be estimated from the data on selection.$^{49}$ Ichimura and Lee (1991) propose a way of simultaneously estimating $\beta$ and $\theta$ with truncated data. Yatchew (1997) proposes to examine the differencing idea of Powell (1987) to a finite number. Heckman, Ichimura, Smith, and Todd (1998a) and Heckman, Ichimura, and Todd (1998b) study estimation of $\beta$ and $\phi(P(z))$, allowing for parametrically estimated $P(z)$ and data-dependent bandwidths. The estimator they study is basically the same as the estimator studied by Robinson, with the following differences: they use the local polynomial estimator instead of the kernel regression estimator; instead of $Z$, they have a parametric form $P(z'\theta)$, where $\theta$ is estimated by $\hat\theta$ from the data on selection; they trim based on the estimated low percentile (usually 1 or 2%) of $P(z_i'\hat\theta)$, denoted $\hat q_n$, so that the trimming function is written as $\hat I_i = 1\{\hat f(\hat P_i) > \hat q_n\}$, where $\hat f(\cdot)$ is the kernel estimator of the density of $P(z'\theta)$; and the smoothing parameter can be data dependent. To estimate $\phi$, the estimated $\beta$ is used to purge $Y$ of its dependence on $X$: we estimate $\phi(p_0)$ by a local linear regression of $Y_i - X_i'\hat\beta$ on $\hat P_i$ evaluated at $p_0$, and denote the result by $\hat\phi(p_0)$.
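The last step, a local linear regression of $Y_i - X_i'\hat\beta$ on $\hat P_i$ evaluated at $p_0$, can be sketched as a kernel-weighted least squares fit whose intercept is the estimate of $\phi(p_0)$. The sketch treats $\hat\beta$ and $\hat P_i$ as already computed, uses a triweight kernel (an assumed choice, supported on $[-1, 1]$ and twice continuously differentiable), and a fixed bandwidth; the function names and the simulated inputs with $\phi(p) = p^2$ are hypothetical.

```python
import numpy as np

def triweight(s):
    """Triweight kernel: supported on [-1, 1], twice continuously differentiable."""
    return np.where(np.abs(s) <= 1.0, (35.0 / 32.0) * (1.0 - s**2) ** 3, 0.0)

def local_linear(p0, p, r, a):
    """Intercept of the kernel-weighted least squares fit of r on (1, p - p0)."""
    w = triweight((p - p0) / a)
    design = np.column_stack([np.ones_like(p), p - p0])
    wd = design * w[:, None]
    coef = np.linalg.solve(design.T @ wd, wd.T @ r)
    return coef[0]

# Hypothetical inputs: p_hat plays the role of P_hat_i and
# r the role of Y_i - X_i' beta_hat, generated here with phi(p) = p^2.
rng = np.random.default_rng(1)
n = 2000
p_hat = rng.uniform(0.05, 0.95, size=n)
r = p_hat**2 + 0.3 * rng.normal(size=n)

phi_hat_p0 = local_linear(0.5, p_hat, r, a=0.2)
print(phi_hat_p0)
```

In this simulated design the estimate at $p_0 = 0.5$ should be close to $\phi(0.5) = 0.25$, up to smoothing bias and sampling noise.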
The following theorem summarizes the results of Heckman, Ichimura, and Todd (1998b). Let $D_i$ denote the indicator of whether the $i$th observation is in the sample or not.
Theorem 4. Assume that
(i) the data $\{(X_i, Y_i, Z_i, D_i)\}$ are i.i.d., $E\{\|x_i\|^{2+\epsilon} + \|z_i\|^{2+\epsilon}\} < \infty$ for some $\epsilon > 0$, and $E\{|y_i|^3\} < \infty$,
(ii) $\sqrt{n}(\hat\theta - \theta_0) = n^{-1/2} \sum_{i=1}^{n} \psi(z_i, d_i) + o_p(1)$, where $n^{-1/2} \sum_{i=1}^{n} \psi(z_i, d_i)$ converges in distribution to a normal random vector,
(iii) the kernel function $K(\cdot)$ is supported on $[-1, 1]$ and is twice continuously differentiable,
(iv) $P(z_i'\theta)$ is twice continuously differentiable with respect to $\theta$ and both derivatives have second moments,
(v) $E(X \mid P)$ and $E\{\phi(P)\}$ are twice continuously differentiable with respect to $\theta$,
(vi) $H_1 = E\{[X - E(X \mid P)][X - E(X \mid P)]' I\}$ evaluated at the true $\theta = \theta_0$ is nonsingular,
(vii) the density of $P(Z'\theta)$, $f_\theta$, is uniformly bounded and uniformly continuous in the neighborhood of $\theta_0$, and for any $\epsilon > 0$ there exists $\delta > 0$ such that if $\|\theta - \theta_0\| < \delta$ then $\sup_{0 \le s \le 1} |f_\theta(s) - f_{\theta_0}(s)| < \epsilon$,
(viii) $n a_n^3 / \log n \to \infty$ and $n a_n^8 \to 0$.
Then
$$n^{1/2}(\hat\beta - \beta_0) = n^{-1/2} \sum_{i=1}^{n} H_1^{-1} \left\{ [X_i - E(X \mid P_i)] \varepsilon_i I_i + H_2 \psi(Z_i, D_i) \right\} + o_p(1)$$
49 Establishing asymptotic distribution theory for an estimator that involves trimming based on an estimated $\theta$ or an estimated $P(z)$ would be a non-trivial task. Powell (1987) and Ahn and Powell (1993) avoided the need for trimming by a clever re-weighting scheme. This approach has been extended to broader models by Honoré and Powell (1994), Honoré and Powell (2005), and Aradillas-Lopez, Honoré, and Powell (2005).
where $H_2 = E\{[X - E(X \mid P)]\, P(Z'\theta_0)\, [Z - E(Z \mid P)]' I\}$.
If, in addition to the assumptions above, the following assumptions hold:
(ix) $\phi$ is twice continuously differentiable,
(x) $f_{\theta_0}(p_0) > 0$,
(xi) the bandwidth sequence satisfies $\hat a_n = \hat\alpha_n n^{-1/5}$, $\text{plim}\, \hat\alpha_n = \alpha_0 > 0$,
(xii) $\sigma^2(p_0) = E[|Y - X'\beta|^2 \mid P = p_0]$ is finite and continuous at $p_0$,
then $n^{2/5}(\hat\phi(p_0) - \phi(p_0)) \sim N(B, V)$, where
$$B = \frac{1}{2} \phi''(p_0) \left[ \int s^2 K(s)\,ds \right] \alpha_0^2, \qquad V = \frac{Var(Y - X'\beta \mid P = p_0)}{f_{\theta_0}(p_0)\, \alpha_0} \int K^2(s)\,ds,$$
and $\phi''(p_0)$ is the second derivative of the regression function $\phi$ at $p_0$.
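The constants $B$ and $V$ depend on the kernel only through $\int s^2 K(s)\,ds$ and $\int K^2(s)\,ds$. The sketch below computes these two integrals numerically for a triweight kernel (an assumed choice satisfying condition (iii)) and then evaluates $B$ and $V$ for illustrative, hypothetical values of $\phi''(p_0)$, the conditional variance, $f_{\theta_0}(p_0)$ and $\alpha_0$; in practice these quantities are unknown and would have to be estimated.

```python
import numpy as np

def triweight(s):
    """Triweight kernel: supported on [-1, 1], twice continuously differentiable."""
    return np.where(np.abs(s) <= 1.0, (35.0 / 32.0) * (1.0 - s**2) ** 3, 0.0)

def kernel_constants(kernel, num=200001):
    """Riemann-sum approximations of int s^2 K(s) ds and int K(s)^2 ds on [-1, 1]."""
    s = np.linspace(-1.0, 1.0, num)
    ds = s[1] - s[0]
    k = kernel(s)
    return np.sum(s**2 * k) * ds, np.sum(k**2) * ds

def asymptotic_bias_variance(phi_dd, cond_var, f_p0, alpha0, mu2, rk):
    """B and V from the theorem, given the unknown population quantities."""
    bias = 0.5 * phi_dd * mu2 * alpha0**2
    var = cond_var * rk / (f_p0 * alpha0)
    return bias, var

mu2, rk = kernel_constants(triweight)
# Hypothetical values: phi''(p0) = 2, Var(Y - X'beta | P = p0) = 0.09,
# f(p0) = 1, alpha0 = 1.
B, V = asymptotic_bias_variance(2.0, 0.09, 1.0, 1.0, mu2, rk)
print(mu2, rk, B, V)
```

For the triweight kernel the two integrals equal $1/9$ and $350/429$. Given estimates of $B$ and $V$, the limiting distribution suggests the bias-corrected interval $\hat\phi(p_0) - n^{-2/5} B \pm 1.96\, n^{-2/5} \sqrt{V}$.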