Estimation and its Targets in Submodels - Valid Post-Selection Inference

Following Section 1.2.2, the meaning and numeric value of a regression coefficient depend on what the other predictors in the model are. This statement requires a qualification: it assumes that the predictors are non-orthogonal/partially collinear. If they are perfectly pairwise orthogonal, as in some designed experiments or in function fitting with orthogonal basis functions, a coefficient has the same identity across all submodels, both in meaning and in value, because adjustment of predictors for each other and the ceteris paribus clause become vacuous. This article is hence largely a story of (partial) collinearity.
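The qualification can be checked numerically. The following is a minimal NumPy sketch, assuming a made-up four-run design with pairwise orthogonal columns and made-up response values; the helper name `ls_coefficients` is hypothetical. It computes the coefficient of one predictor in three submodels that contain it, then repeats the exercise with a column that is not orthogonal to that predictor.

```python
import numpy as np

# Hypothetical 4-run design with pairwise orthogonal columns
# (intercept, two factors, their interaction); responses are made up.
X = np.array([
    [1.0, -1.0, -1.0,  1.0],
    [1.0,  1.0, -1.0, -1.0],
    [1.0, -1.0,  1.0, -1.0],
    [1.0,  1.0,  1.0,  1.0],
])
Y = np.array([3.0, 5.0, 4.0, 10.0])

def ls_coefficients(X, Y, M):
    """Least squares coefficients of Y on the columns of X listed in M."""
    XM = X[:, M]
    return np.linalg.solve(XM.T @ XM, XM.T @ Y)

# Coefficient of column 1 in three different submodels containing it.
b_alone = ls_coefficients(X, Y, [1])[0]
b_pair = ls_coefficients(X, Y, [0, 1])[1]
b_full = ls_coefficients(X, Y, [0, 1, 2, 3])[1]

# Contrast: pair the same predictor with a column that is not orthogonal to it.
c = np.array([-1.0, 1.0, 1.0, 1.0])
X_collinear = np.column_stack([X[:, 1], c])
b_collinear = ls_coefficients(X_collinear, Y, [0, 1])[0]
```

Under orthogonality the three values agree; once the companion column is partially collinear with the predictor, the coefficient of the same predictor changes.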

1.3.1 Multiplicity of Regression Coefficients

We will give meaning to LS estimators and their targets in the absence of any assumptions other than the existence of $\mu = E[Y]$, which in turn is permitted to be entirely unconstrained in $\mathbb{R}^n$. Besides resolving the issue of estimation in “first order wrong models”, the major point here is to follow up on the idea that the regression coefficient of a predictor generates different parameters in different submodels. As each predictor appears in $2^{p-1}$ submodels, the $p$ regression coefficients of the full model generally proliferate into a plethora of as many as $p\,2^{p-1}$ distinct regression coefficients. We start with notation.
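The counts above can be confirmed by brute-force enumeration. This is a small illustrative sketch using only the standard library; the function names `submodels` and `count_coefficients` are hypothetical, and the count is an upper bound that treats every non-empty subset as an admissible submodel.

```python
from itertools import combinations

def submodels(p):
    """All non-empty subsets of {1, ..., p}, as sorted tuples."""
    full = range(1, p + 1)
    return [M for m in range(1, p + 1) for M in combinations(full, m)]

def count_coefficients(p):
    """Number of (predictor, submodel) pairs (j, M) with j in M."""
    return sum(len(M) for M in submodels(p))

p = 4
models = submodels(p)
containing_1 = [M for M in models if 1 in M]   # submodels that include predictor 1
total = count_coefficients(p)
```

For `p = 4` this gives 15 non-empty submodels, 8 of which contain predictor 1, and 32 coefficient-submodel pairs in total.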

To denote a submodel we use the (non-empty) index set $M = \{j_1, j_2, \ldots, j_m\} \subset M_F = \{1, \ldots, p\}$ of the predictors $X_{j_i}$ in the submodel; the size of the submodel is $m = |M|$ and that of the full model is $p = |M_F|$. Let $X_M = (X_{j_1}, \ldots, X_{j_m})$ denote the $n \times m$ submatrix of $X$ with columns indexed by $M$. We will assume that only submodels $M$ are considered for which $X_M$ is of full rank:

$$\operatorname{rank}(X_M) = m \le d.$$

We let $\hat\beta_M$ be the unique least squares estimate in $M$:

$$\hat\beta_M = (X_M^T X_M)^{-1} X_M^T Y. \tag{1.3.1}$$

Now that $\hat\beta_M$ is an estimate, what is it estimating? A conclusion from Section 1.2.1 is that $\hat\beta_M$ does not estimate the coefficients in the full model. Because any larger model could have been the full model, we generalize by asserting that $\hat\beta_M$ does not estimate parameters in any other model than $M$ itself. In $M$, it is natural to ask that $\hat\beta_M$ be an unbiased estimate of its target:

$$\beta_M \triangleq E[\hat\beta_M] = (X_M^T X_M)^{-1} X_M^T E[Y] = \operatorname*{argmin}_{\beta' \in \mathbb{R}^m} \|\mu - X_M \beta'\|^2. \tag{1.3.2}$$

This definition requires only the existence of $\mu = E[Y]$ but no other assumptions. Nor does it matter to what degree $M$ provides a good approximation to $\mu$ in terms of approximation error $\|\mu - X_M \beta_M\|^2$. Asserting that the model $M$ is “correct” would mean $\mu \in \operatorname{span}(X_M)$ or, equivalently, that the approximation error vanishes; in this case $\beta_M$ would be the “true” parameter.
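As an illustration of definition (1.3.2), the sketch below computes the target of a submodel for an arbitrary mean vector. It assumes a randomly generated design and mean vector (both made up), 0-based column positions for the submodel, and a hypothetical helper `ls`. Unbiasedness is checked through linearity of the estimator rather than by simulation.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 8, 3
X = rng.normal(size=(n, p))          # hypothetical fixed design
mu = rng.normal(size=n)              # arbitrary mean vector (almost surely not in span(X))
M = [0, 2]                           # submodel, given as 0-based column positions
XM = X[:, M]

def ls(XM, y):
    """(XM^T XM)^{-1} XM^T y, the least squares coefficients of y on XM."""
    return np.linalg.solve(XM.T @ XM, XM.T @ y)

beta_M = ls(XM, mu)                  # target of the submodel, as in (1.3.2)

# The same vector obtained as the minimizer of ||mu - XM b||^2.
beta_M_argmin = np.linalg.lstsq(XM, mu, rcond=None)[0]

# Unbiasedness via linearity: averaging the estimates over a pair of
# mirrored error vectors +e and -e returns the target exactly.
e = rng.normal(size=n)
avg_estimate = 0.5 * (ls(XM, mu + e) + ls(XM, mu - e))

# Positive approximation error: the submodel is not "correct" for this mu.
approx_error = np.sum((mu - XM @ beta_M) ** 2)
```

The target is well defined even though the approximation error is strictly positive, which is the point of the definition.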

In the classical case $d = p \le n$, we can define the target of the full-model estimate $\hat\beta = (X^T X)^{-1} X^T Y$ as a special case of (1.3.2) with $M = M_F$:

$$\beta \triangleq E[\hat\beta] = (X^T X)^{-1} X^T E[Y]. \tag{1.3.3}$$

In the general (non-classical) case, let $\beta$ be any (possibly non-unique) minimizer of $\|\mu - X\beta'\|^2$; the link between $\beta$ and $\beta_M$ is as follows:

$$\beta_M = (X_M^T X_M)^{-1} X_M^T X \beta. \tag{1.3.4}$$

Thus the target $\beta_M$ is an estimable linear function of $\beta$, without any first-order assumptions. Equation (1.3.4) follows from $\operatorname{span}(X_M) \subset \operatorname{span}(X)$.
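The step from the inclusion of column spaces to (1.3.4) can be spelled out as follows. This is a sketch that uses only the normal equations, with $\beta$ denoting any minimizer of $\|\mu - X\beta'\|^2$:

```latex
\begin{align*}
X^T(\mu - X\beta) &= 0
  && \text{(normal equations for } \beta\text{)} \\
\Rightarrow\quad X_M^T(\mu - X\beta) &= 0
  && \text{(the columns of } X_M \text{ are columns of } X\text{)} \\
\Rightarrow\quad X_M^T \mu &= X_M^T X \beta \\
\Rightarrow\quad \beta_M = (X_M^T X_M)^{-1} X_M^T \mu
  &= (X_M^T X_M)^{-1} X_M^T X \beta .
\end{align*}
```

The right-hand side depends on $\beta$ only through $X\beta$, the projection of $\mu$ onto the column space of $X$, which is the same for every minimizer. Hence $\beta_M$ is well defined even when $\beta$ is not unique.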

Notation: To distinguish regression coefficients as a function of the model they appear in, we write $\beta_{j\cdot M} = E[\hat\beta_{j\cdot M}]$ for the components of $\beta_M = E[\hat\beta_M]$ with $j \in M$.

An important convention we adopt throughout this article is that the index $j$ of a coefficient refers to the coefficient’s index in the original full model $M_F$: $\beta_{j\cdot M}$ for $j \in M$ refers not to the $j$’th coordinate of $\beta_M$, but to the coordinate of $\beta_M$ corresponding to the $j$’th predictor $X_j$ in the full predictor matrix $X$. We refer to

1.3.2 “Omitted Variables Bias”

By allowing each $\hat\beta_{j\cdot M}$ to estimate its own target $\beta_{j\cdot M}$ and thereby relieving $\hat\beta_{j\cdot M}$ of the burden of estimating the parameter $\beta_j$ in the full model, we sidestep the problem of “omitted variables bias” and with it a major driver of the problems analyzed by Leeb and Pötscher (Section 1.2.1). In the present framework $\beta_j - \beta_{j\cdot M}$ is not a bias, as these are two different parameters that answer two different questions. Just the same, we consider briefly the difference between $\beta_j$ and $\beta_{j\cdot M}$ in the classical case $d = p \le n$. Compare the following two definitions:

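$$\beta_M \triangleq E[\hat\beta_M] \quad\text{and}\quad \beta_{[M]} \triangleq (\beta_j)_{j \in M}, \tag{1.3.5}$$

the latter being the coefficients $\beta_j$ from the full model $M_F$ subsetted to the submodel $M$. While $\hat\beta_M$ estimates $\beta_M$, it does not generally estimate $\beta_{[M]}$. The difference $\beta_M - \beta_{[M]}$ is the vectorized “omitted variables bias”.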
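The two vectors being compared can be computed side by side. The sketch below assumes a made-up full-rank design and mean vector in the classical case, with 0-based column positions; the variable names are illustrative. It also checks the difference against the closed form obtained by splitting the full-model fit into its submodel and complement parts in (1.3.4).

```python
import numpy as np

rng = np.random.default_rng(1)
n, p = 10, 3
X = rng.normal(size=(n, p))            # hypothetical full-rank design, d = p <= n
mu = rng.normal(size=n)                # arbitrary mean vector

M, Mc = [0, 1], [2]                    # submodel and its complement (0-based)
XM, XMc = X[:, M], X[:, Mc]

beta = np.linalg.solve(X.T @ X, X.T @ mu)        # full-model target (1.3.3)
beta_M = np.linalg.solve(XM.T @ XM, XM.T @ mu)   # submodel target (1.3.2)
beta_sub = beta[M]                               # full-model coefficients subsetted to M

# Vectorized "omitted variables bias".
bias = beta_M - beta_sub

# Closed form from (1.3.4), using X beta = XM beta[M] + XMc beta[Mc].
bias_formula = np.linalg.solve(XM.T @ XM, XM.T @ XMc @ beta[Mc])
```

With a generic non-orthogonal design the difference is non-zero, and it agrees with the closed form up to floating-point error.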

In general, the definition of $\beta_M$ involves $X$ and all of $\beta$, not just the subvector $\beta_{[M]} = (\beta_j)_{j \in M}$, through (1.3.4). A little algebra shows that $\beta_M = \beta_{[M]}$ if and only if

$$X_M^T X_{M^c}\, \beta_{[M^c]} = 0, \tag{1.3.6}$$

where $M^c$ denotes the complement of $M$ in the full model $M_F$. Special cases of (1.3.6) include: (1) the column space of $X_M$ is orthogonal to that of $X_{M^c}$, and (2) $\beta_{[M^c]} = 0$, meaning that the approximation to $\mu$ in $M_F$ is no better than in $M$, or if the full

1.3.3 Interpreting Regression Coefficients in First-Order Incorrect Models

The regression coefficient $\beta_{j\cdot M}$ is conventionally interpreted as the “average difference in the response for a unit difference in $X_j$, ceteris paribus in the model $M$”. This interpretation no longer holds when the assumption of first order correctness is given up. Instead, the phrase “average difference in the response” should be replaced with the unwieldy but more correct phrase “average difference in the response approximated in the submodel $M$”. The reason is that the fit in the submodel $M$ is $\hat Y_M = H_M Y$ (with $H_M = X_M (X_M^T X_M)^{-1} X_M^T$), whose target is $\mu_M = E[\hat Y_M] = H_M E[Y] = H_M \mu$. Thus in the submodel $M$ we estimate not the true $\mu$ but the LS approximation $\mu_M$ to $\mu$ using $X_M$: $\mu_M = X_M \beta_M$, where $\beta_M = \operatorname{argmin}_{\beta'} \|\mu - X_M \beta'\|^2$.
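The relation between the hat matrix, the approximated mean, and the submodel target can be verified directly. The following is a minimal sketch, assuming a made-up submodel design and mean vector; all variable names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 6
XM = rng.normal(size=(n, 2))           # hypothetical submodel design
mu = rng.normal(size=n)                # arbitrary mean vector

# Hat matrix of the submodel: XM (XM^T XM)^{-1} XM^T.
HM = XM @ np.linalg.solve(XM.T @ XM, XM.T)

beta_M = np.linalg.solve(XM.T @ XM, XM.T @ mu)   # submodel target
mu_M = HM @ mu                                   # target of the fitted values

# What the submodel cannot capture; orthogonal to the columns of XM.
resid = mu - mu_M
```

Here `mu_M` coincides with `XM @ beta_M`, and `resid` is the part of the mean vector that the coefficients of the submodel say nothing about.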

A second interpretation of regression coefficients is in terms of adjusted predictors: For $j \in M$ define the $M$-adjusted predictor $X_{j\cdot M}$ as the residual vector of the regression of $X_j$ on all other predictors in $M$. Multiple regression coefficients, both estimates $\hat\beta_{j\cdot M}$ and parameters $\beta_{j\cdot M}$, can be expressed as simple regression coefficients with regard to the $M$-adjusted predictor:

$$\hat\beta_{j\cdot M} = \frac{X_{j\cdot M}^T Y}{\|X_{j\cdot M}\|^2}, \qquad \beta_{j\cdot M} = \frac{X_{j\cdot M}^T \mu}{\|X_{j\cdot M}\|^2}. \tag{1.3.7}$$
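The left-hand identity in (1.3.7) can be checked numerically. The sketch below assumes a made-up, partially collinear submodel design and a made-up response; the helper `adjusted_predictor` is a hypothetical name, and `k` is the 0-based position of the predictor within the submodel matrix.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 12
XM = rng.normal(size=(n, 3))           # hypothetical submodel design
XM[:, 1] += 0.8 * XM[:, 0]             # induce partial collinearity
Y = rng.normal(size=n)                 # made-up response

def adjusted_predictor(XM, k):
    """Residual of column k of XM after LS regression on the other columns."""
    others = np.delete(XM, k, axis=1)
    coef = np.linalg.lstsq(others, XM[:, k], rcond=None)[0]
    return XM[:, k] - others @ coef

# Multiple regression coefficients in the submodel.
beta_hat = np.linalg.solve(XM.T @ XM, XM.T @ Y)

# Simple regression through the origin on each adjusted predictor.
beta_hat_simple = np.array([
    (adjusted_predictor(XM, k) @ Y) / np.sum(adjusted_predictor(XM, k) ** 2)
    for k in range(XM.shape[1])
])
```

Replacing `Y` by a mean vector gives the corresponding check for the parameter version on the right of (1.3.7).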

The left hand formula lends itself to an interpretation of $\hat\beta_{j\cdot M}$ in terms of the well-known leverage plot which shows $Y$ plotted against $X_{j\cdot M}$ and the line with slope

A third interpretation can be derived from the second: For notational reasons let $x = (x_i)_{i=1\ldots n}$ be any adjusted predictor $X_{j\cdot M}$, so that $\hat\beta = x^T Y/\|x\|^2$ and $\beta = x^T \mu/\|x\|^2$ are the corresponding $\hat\beta_{j\cdot M}$ and $\beta_{j\cdot M}$. Introduce case-wise slopes through the origin, both as estimates $\hat\beta_{(i)} = Y_i/x_i$ and as parameters $\beta_{(i)} = \mu_i/x_i$, as well as case-wise weights $w_{(i)} = x_i^2 / \sum_{i'=1\ldots n} x_{i'}^2$. Equations (1.3.7) are then equivalent to the following:

$$\hat\beta = \sum_i w_{(i)} \hat\beta_{(i)}, \qquad \beta = \sum_i w_{(i)} \beta_{(i)}.$$

Hence regression coefficients are weighted averages of case-wise slopes, and this interpretation holds without first-order assumptions.
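The weighted-average representation is easy to confirm on a toy example. The sketch below uses made-up numbers, with `x` standing in for an adjusted predictor that has no zero entries; the variable names are illustrative.

```python
import numpy as np

x = np.array([-1.5, -0.5, 0.5, 2.0])   # stand-in for an adjusted predictor, no zeros
Y = np.array([1.0, 0.0, 2.0, 5.0])     # made-up responses

slopes = Y / x                          # case-wise slopes through the origin
weights = x**2 / np.sum(x**2)           # case-wise weights, summing to 1

beta_hat_weighted = np.sum(weights * slopes)   # weighted average of case-wise slopes
beta_hat_direct = (x @ Y) / (x @ x)            # x^T Y / ||x||^2
```

Cases with large values of `x` squared dominate the average. A case with an entry of exactly zero would have an undefined slope but also zero weight, so it contributes nothing to the direct formula.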
