4.1.1 The linear regression model and best linear predictor
Linear regression describes the (linear) relationship between one variable of interest $Y$ and $d$ explaining variables $X$ by the following linear combination
$$Y = \beta_0 + \underbrace{\beta^T}_{1\times d}\,\underbrace{X}_{d\times 1} + e \qquad (4.1)$$
where
• $Y$ is the dependent variable, outcome, or response with
– expectation $\mu_Y$, and
– variance $\sigma_Y^2$,
• $X$ are the $d$ predictors or explaining variables ($d\times 1$) with
– expectations $\mu$, and
– covariance matrix $\Sigma$ that can be decomposed into a variance matrix $V$ and correlation matrix $P$ according to $\Sigma = V^{1/2} P V^{1/2}$,
• $\beta$ are the $d$ regression coefficients of size ($d\times 1$),
• $\beta_0$ is the intercept or offset, and
• $e$ is the irreducible error with $E(e) = 0$.
See e.g. Whittaker (1990) [Chapter 5] for more details. For interpretation the β-coefficients are most important; $\beta_i$, with $i \in \{1, \ldots, d\}$, gives the influence of $X_i$ on $Y$ conditional on all the other $d-1$ variables. In the following this thesis refers to
$X_i$ as a zero variable, if $\beta_i = 0$,
$X_i$ as a nonzero variable, if $\beta_i \neq 0$.
Intercept and β-coefficients are selected to minimize the squared divergence between the established model $\hat{Y} = \beta_0 + \beta^T X$ and the response, the so-called prediction error $E[(Y-\hat{Y})^2]$. As we are going to show in Section 4.1.2 the prediction error is a pivotal quantity in the linear model. The prediction error is minimized by regression coefficients equal to
$$\beta = \Sigma^{-1}\Sigma_{XY} \qquad (4.2)$$
where $\Sigma_{XY}$ is the $d$-dimensional vector of covariances between $X$ and $Y$, and an intercept equal to
$$\beta_0 = \mu_Y - \beta^T\mu . \qquad (4.3)$$
Hence, the best linear predictor equals
$$Y^\star = \beta_0 + \beta^T X . \qquad (4.4)$$
The coefficients $\beta_0$ and $\beta = (\beta_1, \ldots, \beta_d)^T$ are constants, and not random variables like $X$, $Y$ and $Y^\star$.
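As a minimal sketch of the best linear predictor, the following computes $\beta = \Sigma^{-1}\Sigma_{XY}$, the intercept of Equation 4.3 and the predictor of Equation 4.4 from population moments. All numerical values and the helper name `best_linear_predictor` are hypothetical and chosen only for illustration.

```python
import numpy as np

# Hypothetical population moments for d = 2 predictors (illustrative values only).
mu = np.array([1.0, -2.0])            # E(X)
mu_y = 0.5                            # E(Y)
Sigma = np.array([[2.0, 0.6],
                  [0.6, 1.0]])        # Cov(X)
Sigma_xy = np.array([1.0, 0.5])       # Cov(X, Y)

# beta = Sigma^{-1} Sigma_XY, obtained by solving the linear system
# instead of forming the inverse explicitly.
beta = np.linalg.solve(Sigma, Sigma_xy)

# Equation 4.3: beta_0 = mu_Y - beta^T mu.
beta0 = mu_y - beta @ mu


def best_linear_predictor(x):
    """Equation 4.4: Y* = beta_0 + beta^T x for one predictor vector x."""
    return beta0 + beta @ x
```

Evaluating `best_linear_predictor(mu)` returns `mu_y`, which reflects that the predictor passes through the mean of the data.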
Often, it is convenient to center and standardize the response and the predictor variables. With $Y_{\mathrm{std}} = (Y-\mu_Y)/\sigma_Y$ and $X_{\mathrm{std}} = V^{-1/2}(X-\mu)$ the predictor equation (Equation 4.4) can be written as
$$Y^\star_{\mathrm{std}} = (Y^\star-\mu_Y)/\sigma_Y = \beta_{\mathrm{std}}^T X_{\mathrm{std}} \qquad (4.5)$$
where $\beta_{\mathrm{std}}$ are the standardized regression coefficients
$$\beta_{\mathrm{std}} = V^{1/2}\beta\,\sigma_Y^{-1} = P^{-1}P_{XY} \qquad (4.6)$$
where $P_{XY}$ is the $d$-dimensional vector of correlations between $X$ and $Y$ and $P$ is the $d\times d$ matrix of correlations among $X$. The standardized intercept vanishes because of the centering.
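The two expressions for the standardized coefficients in Equation 4.6 can be checked numerically. The sketch below uses hypothetical population moments; the variable names are illustrative and not taken from the thesis.

```python
import numpy as np

# Hypothetical population moments (illustrative values only).
Sigma = np.array([[2.0, 0.6],
                  [0.6, 1.0]])        # Cov(X)
Sigma_xy = np.array([1.0, 0.5])       # Cov(X, Y)
sigma_y = 1.5                         # standard deviation of Y

v_sqrt = np.sqrt(np.diag(Sigma))      # diagonal entries of V^{1/2}
P = Sigma / np.outer(v_sqrt, v_sqrt)  # correlation matrix among X
P_xy = Sigma_xy / (v_sqrt * sigma_y)  # correlations between X and Y

beta = np.linalg.solve(Sigma, Sigma_xy)

# Left-hand form of Equation 4.6: V^{1/2} beta / sigma_Y.
beta_std_scaled = v_sqrt * beta / sigma_y
# Right-hand form of Equation 4.6: P^{-1} P_XY.
beta_std_corr = np.linalg.solve(P, P_xy)
```

Both arrays agree up to floating point error, so standardized coefficients can be obtained from correlations alone.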
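In practice, the variable of interest is represented by an $n$-dimensional vector of observations $y$ and the explaining variables by an $n\times d$ matrix $x$ whose rows are the observations. Empirical estimates $b$ of the regression coefficients are derived by minimizing the residual sum of squares (RSS)
$$\mathrm{RSS}(b) = \big(y - (b_0 1_n + x b)\big)^T \big(y - (b_0 1_n + x b)\big) , \qquad (4.7)$$
where $1_n$ denotes the $n$-vector of ones. Differentiating with respect to $b$ and setting the derivative to zero leads, for centered $x$ and $y$ (so that the intercept drops out), to the ordinary least squares solution
$$b = (x^T x)^{-1} x^T y .$$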
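The least squares estimate can be sketched as follows, assuming observations are stored as the rows of an $n\times d$ array and that the data are centered before solving the normal equations. The simulated data and the coefficient values are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 200, 3
x = rng.normal(size=(n, d))               # n x d matrix, rows are observations
beta_true = np.array([1.0, 0.0, -2.0])    # hypothetical coefficients
y = 0.5 + x @ beta_true + rng.normal(scale=0.1, size=n)

# Center x and y so that the intercept drops out of the normal equations.
xc = x - x.mean(axis=0)
yc = y - y.mean()

# Normal equations (x^T x) b = x^T y, solved without forming the inverse.
b = np.linalg.solve(xc.T @ xc, xc.T @ yc)

# Empirical analogue of Equation 4.3 for the intercept.
b0 = y.mean() - b @ x.mean(axis=0)

# The same fit via a least squares routine, which is numerically more robust.
b_lstsq, *_ = np.linalg.lstsq(xc, yc, rcond=None)
```

Solving the linear system rather than inverting $x^T x$ is a design choice made here for numerical stability; both routes give the same estimate when $x^T x$ is well conditioned.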
According to the Gauss-Markov theorem the least squares solution has the smallest variance among all linear unbiased estimates (Fahrmeir et al., 2003). Nonetheless, it is possible that there exist biased estimates, like regularized estimates, that have a lower prediction error than the least squares estimate. Additionally, the ordinary least squares estimate requires the matrix $(x^T x)$ to be positive definite, i.e. of full rank; otherwise, $(x^T x)$ is not invertible. Deviations from full rank arise from exact linear dependencies among the $d$ explaining variables, while strong correlation among them leaves $(x^T x)$ nearly singular and the estimate unstable. Moreover, especially in small $n$, large $d$ situations, estimates of the covariance matrix have a rank at most equal to the sample size $n \ll d$. Then, regularization is needed to derive an estimate of full rank. There exist several strategies for regularization or penalization in regression. Since they aim at minimizing the prediction error, the most important ones are discussed in Section 4.2.
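The rank deficiency in the small $n$, large $d$ setting can be illustrated with simulated data. The sketch below also adds a multiple of the identity to the matrix, which is one simple, ridge-type way to restore full rank; it is used here purely as an illustration and the penalty value `lam` is hypothetical.

```python
import numpy as np

rng = np.random.default_rng(1)
n, d = 10, 50                             # far fewer samples than variables
x = rng.normal(size=(n, d))
xc = x - x.mean(axis=0)                   # centered n x d data matrix

gram = xc.T @ xc                          # d x d matrix, singular when n < d
rank = np.linalg.matrix_rank(gram)        # at most n - 1 after centering

# Illustrative ridge-type regularization: shift the diagonal by lam > 0.
lam = 1.0
gram_ridge = gram + lam * np.eye(d)
rank_ridge = np.linalg.matrix_rank(gram_ridge)
```

After the diagonal shift all eigenvalues are at least `lam`, so the regularized matrix is positive definite and invertible.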
4.1.2 The decomposition of variance
The resulting minimal prediction error is
$$E\big[(Y-Y^\star)^2\big] = \sigma_Y^2 - \beta^T\Sigma\,\beta .$$
Alternatively, this irreducible error may be written
$$E\big[(Y-Y^\star)^2\big] = \sigma_Y^2(1-\Omega^2)$$
where $\Omega = \mathrm{Cor}(Y, Y^\star)$ and $\Omega^2 = P_{YX}P^{-1}P_{XY}$ is the squared multiple correlation coefficient. Furthermore, $\mathrm{Cov}(Y, Y^\star) = \sigma_Y^2\Omega^2$ and $E(Y^\star) = \mu_Y$. The expectation $E[(Y-Y^\star)^2] = \mathrm{Var}(Y-Y^\star)$ is also called the unexplained variance or noise variance. Together with the explained variance or signal variance $\mathrm{Var}(Y^\star) = \sigma_Y^2\Omega^2$ it adds up to the total variance $\mathrm{Var}(Y) = \sigma_Y^2$. Accordingly, the proportion of explained variance is
$$\frac{\mathrm{Var}(Y^\star)}{\mathrm{Var}(Y)} = \Omega^2 ,$$
which indicates that $\Omega^2$ is the central quantity for understanding both nominal prediction error and variance decomposition in the linear model. The ratio of signal variance to noise variance is
$$\frac{\mathrm{Var}(Y^\star)}{\mathrm{Var}(Y-Y^\star)} = \frac{\Omega^2}{1-\Omega^2} .$$
A summary of these relations is given in Table 4.1, along with the empirical error decomposition in terms of observed sum of squares.
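The population relations of this section can be verified numerically. The following is a minimal sketch with hypothetical moments; it computes the explained and unexplained variance from $\beta$ and, independently, $\Omega^2$ from the correlations.

```python
import numpy as np

# Hypothetical population moments (illustrative values only).
Sigma = np.array([[2.0, 0.6],
                  [0.6, 1.0]])        # Cov(X)
Sigma_xy = np.array([1.0, 0.5])       # Cov(X, Y)
var_y = 2.25                          # Var(Y)

beta = np.linalg.solve(Sigma, Sigma_xy)
explained = beta @ Sigma @ beta       # signal variance Var(Y*)
unexplained = var_y - explained       # noise variance E(Y - Y*)^2

# Squared multiple correlation Omega^2 = P_YX P^{-1} P_XY.
v_sqrt = np.sqrt(np.diag(Sigma))
P = Sigma / np.outer(v_sqrt, v_sqrt)
P_xy = Sigma_xy / (v_sqrt * np.sqrt(var_y))
omega2 = P_xy @ np.linalg.solve(P, P_xy)

# Ratio of signal variance to noise variance.
snr = omega2 / (1.0 - omega2)
```

The quantities satisfy `explained = var_y * omega2` and `unexplained = var_y * (1 - omega2)` up to rounding.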
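If instead of the optimal parameters $\beta_0$ and $\beta$ we employ $\beta_0' = \beta_0 + \Delta\beta_0$ and $\beta' = \beta + \Delta\beta$, the minimal prediction error $E[(Y-Y^\star)^2]$ increases by the model error
$$\mathrm{ME}(\Delta\beta_0, \Delta\beta) = (\Delta\beta)^T \Sigma\, \Delta\beta + (\Delta\beta_0)^2 , \qquad (4.8)$$
which holds in this form for centered predictors ($\mu = 0$). The relative model error is the ratio of the model error and the irreducible error $E[(Y-Y^\star)^2]$.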
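The origin of the model error can be seen from a short derivation. It is a sketch that assumes the optimal coefficients satisfy $\Sigma\beta = \Sigma_{XY}$, so that the residual $Y-Y^\star$ has mean zero and is uncorrelated with $X$; the primed symbols denote the perturbed parameters.

$$
\begin{aligned}
E\big[(Y-\beta_0'-\beta'^T X)^2\big]
&= E\big[\big((Y-Y^\star) - (\Delta\beta_0+\Delta\beta^T X)\big)^2\big] \\
&= E\big[(Y-Y^\star)^2\big] - 2\,E\big[(Y-Y^\star)(\Delta\beta_0+\Delta\beta^T X)\big] + E\big[(\Delta\beta_0+\Delta\beta^T X)^2\big] \\
&= E\big[(Y-Y^\star)^2\big] + (\Delta\beta)^T\Sigma\,\Delta\beta + (\Delta\beta_0+\Delta\beta^T\mu)^2 .
\end{aligned}
$$

The cross term vanishes because $E(Y-Y^\star)=0$ and $\mathrm{Cov}(Y-Y^\star, X) = \Sigma_{XY}-\Sigma\beta = 0$. For centered predictors, $\mu = 0$, the last term reduces to $(\Delta\beta_0)^2$.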
Table 4.1: Variance decomposition in terms of squared multiple correlation $\Omega^2$ and corresponding empirical sums of squares.

Level        Total variance                    =   Unexplained variance                 +   Explained variance
Population   $\mathrm{Var}(Y)$                 =   $\mathrm{Var}(Y-Y^\star)$            +   $\mathrm{Var}(Y^\star)$
             $\sigma_Y^2$                      =   $\sigma_Y^2(1-\Omega^2)$             +   $\sigma_Y^2\Omega^2$
Empirical    TSS                               =   RSS                                  +   ESS
             $\sum_{l=1}^n (y_l-\bar{y})^2$    =   $\sum_{l=1}^n (y_l-\hat{y}_l)^2$     +   $\sum_{l=1}^n (\hat{y}_l-\bar{y})^2$
             df $= n-1$                            df $= n-d-1$                             df $= d$

Abbreviations: $\bar{y} = \frac{1}{n}\sum_{i=1}^n y_i$; df: degrees of freedom; TSS: total sum of squares; RSS: residual sum of squares; ESS: explained sum of squares.
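The empirical row of Table 4.1 can be reproduced on simulated data. This is a minimal sketch, assuming a least squares fit with intercept; the data-generating coefficients are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(2)
n, d = 100, 3
x = rng.normal(size=(n, d))
y = 1.0 + x @ np.array([0.8, 0.0, -0.5]) + rng.normal(size=n)

# Least squares fit with intercept (column of ones prepended to x).
design = np.column_stack([np.ones(n), x])
coef, *_ = np.linalg.lstsq(design, y, rcond=None)
y_hat = design @ coef

tss = np.sum((y - y.mean()) ** 2)        # total sum of squares, df = n - 1
rss = np.sum((y - y_hat) ** 2)           # residual sum of squares, df = n - d - 1
ess = np.sum((y_hat - y.mean()) ** 2)    # explained sum of squares, df = d

r2 = ess / tss                           # empirical counterpart of Omega^2
```

The identity `tss = rss + ess` holds here because the fit includes an intercept, which makes the residuals orthogonal to the fitted values and to the constant vector.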
4.1.3 Classical strategies for variable selection
A rudimentary approach to variable selection in the linear model is based on a t-test that examines whether the regression coefficients differ from zero. The test utilizes the distribution of the estimated regression coefficients. Under the model in Equation 4.1 with an error $e$ that is normally distributed as $N(0, \mathrm{Var}(e))$, the estimated regression coefficients $b$ follow a multivariate Gaussian distribution
$$b \sim N\big(\beta, \Sigma^{-1}\mathrm{Var}(e)\big) .$$
From the decomposition of variance, as presented in Table 4.1, it follows that $\mathrm{Var}(e)$ equals the unexplained variance $\sigma_Y^2(1-\Omega^2)$. To test the null hypothesis that the coefficient of variable $i$ equals zero, i.e. $\beta_i = 0$, the following t-score vector is derived
$$\tau_{XY} = \mathrm{diag}\{\Sigma^{-1}\}^{-1/2}\,\beta\,\sigma_Y^{-1}\,(1-\Omega^2)^{-1/2}\sqrt{\mathrm{df}} \qquad (4.9)$$
where $\mathrm{df} = n-d-1$ represents the degrees of freedom. Under the null hypothesis the estimate $\hat{\tau}_{XY}(i)$ follows a t-distribution with $\mathrm{df} = n-d-1$ degrees of freedom. Using this result it is possible to assign p-values to each variable. In Section 4.3 the connection of $\tau_{XY}$ to partial correlation is discussed.
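As a sketch of how Equation 4.9 is used in practice, the code below plugs empirical moments into the t-score formula and compares the result with the textbook t-statistic $b_i / \sqrt{\mathrm{RSS}/\mathrm{df}\cdot[(x^Tx)^{-1}]_{ii}}$. The choice of normalizing the empirical moments by $n$ is an assumption made here; the simulated data are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(3)
n, d = 120, 3
x = rng.normal(size=(n, d))
y = 1.0 + x @ np.array([1.0, 0.0, -0.7]) + rng.normal(size=n)

xc = x - x.mean(axis=0)
yc = y - y.mean()
df = n - d - 1

b = np.linalg.solve(xc.T @ xc, xc.T @ yc)
rss = np.sum((yc - xc @ b) ** 2)

# Textbook t-statistic for each slope.
t_classic = b / np.sqrt(rss / df * np.diag(np.linalg.inv(xc.T @ xc)))

# Plug-in version of Equation 4.9 with empirical moments normalized by n.
S = xc.T @ xc / n                         # estimate of Sigma
var_y = yc @ yc / n                       # estimate of sigma_Y^2
omega2 = 1.0 - rss / (n * var_y)          # empirical squared multiple correlation
t_plugin = (b / np.sqrt(np.diag(np.linalg.inv(S)))
            / np.sqrt(var_y) / np.sqrt(1.0 - omega2) * np.sqrt(df))
```

Both vectors coincide, since the normalization constants cancel. Two-sided p-values would then follow from the t-distribution with `df` degrees of freedom, for example via `scipy.stats.t.sf` if SciPy is available.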
Stepwise selection comprises heuristic strategies to select variables based on statistics like $\hat{\tau}_{XY}$ or alternatively the F- or Wald-statistic, see e.g. Fahrmeir and Tutz (2001) or Hastie et al. (2009). In backward selection all variables are included and $\hat{\tau}_{XY}$ is computed for all variables; then the variable with the lowest absolute value of $\hat{\tau}_{XY}$ is excluded from the model and $\hat{\tau}_{XY}$ is recomputed for the remaining variables. This step is repeated until only variables with a p-value below a predefined threshold remain in the model. Such selection strategies suffer from unstable results due to dependencies among the predictor variables, since the final model highly depends on the order in which variables are excluded. For example, backward and forward selection usually do not agree on the same model (Burnham and Anderson, 2002).
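Backward selection as described above can be sketched in a few lines. The function names `t_scores` and `backward_selection` are hypothetical, and a cutoff `t_min` on the absolute t-score is used as a stand-in for the p-value threshold, which is a simplifying assumption.

```python
import numpy as np


def t_scores(x, y):
    """Textbook t-statistics of the slopes for a centered least squares fit."""
    n, d = x.shape
    xc = x - x.mean(axis=0)
    yc = y - y.mean()
    gram_inv = np.linalg.inv(xc.T @ xc)
    b = gram_inv @ xc.T @ yc
    rss = np.sum((yc - xc @ b) ** 2)
    return b / np.sqrt(rss / (n - d - 1) * np.diag(gram_inv))


def backward_selection(x, y, t_min=2.0):
    """Drop the variable with the smallest |t| until all |t| >= t_min.

    t_min is a hypothetical cutoff standing in for a p-value threshold.
    Returns the column indices of the retained variables.
    """
    active = list(range(x.shape[1]))
    while active:
        t = np.abs(t_scores(x[:, active], y))
        weakest = int(np.argmin(t))
        if t[weakest] >= t_min:
            break
        del active[weakest]
    return active


rng = np.random.default_rng(4)
x = rng.normal(size=(200, 5))
y = x @ np.array([1.5, 0.0, 0.0, -1.0, 0.0]) + rng.normal(size=200)
selected = backward_selection(x, y)
```

With independent predictors as simulated here, the procedure is fairly stable; the instability discussed in the text arises when the columns of `x` are strongly correlated.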
A different approach to variable selection is taken by penalized RSS. Penalized RSS criteria quantify the goodness of fit of a specific model with an additional penalty on the model size $q < d$. Using penalized RSS it is possible to compare models of different size and including different variables. Following George (2000), a general form of penalized RSS for a given model of size $q$ is
$$\mathrm{RSS}^{\mathrm{pen}}_q = \mathrm{RSS}_q + \lambda \cdot q\,\widehat{\mathrm{Var}}(e) \qquad (4.10)$$
where $\mathrm{RSS}_q$ is the RSS based on the model of dimension $q$ and $\widehat{\mathrm{Var}}(e) = \mathrm{RSS}/(n-d-1)$ is the estimated residual variance for the full model. The penalization parameter $\lambda$ is fixed in advance and differs in:
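• Akaike's Information Criterion (AIC)
$$\mathrm{RSS}^{\mathrm{AIC}}_q = \mathrm{RSS}_q + 2 \cdot q\,\widehat{\mathrm{Var}}(e),$$
• Bayesian Information Criterion (BIC)
$$\mathrm{RSS}^{\mathrm{BIC}}_q = \mathrm{RSS}_q + \log(n) \cdot q\,\widehat{\mathrm{Var}}(e),$$
• Risk Inflation Criterion (RIC), whose penalty depends on the total number of predictors $d$,
$$\mathrm{RSS}^{\mathrm{RIC}}_q = \mathrm{RSS}_q + 2 \cdot \log(d) \cdot q\,\widehat{\mathrm{Var}}(e),$$
• (minimum) Mallows' $C_p$
$$\mathrm{RSS}^{C_p}_q = \mathrm{RSS}_q + 2 \cdot q\,\widehat{\mathrm{Var}}(e).$$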
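The criteria above can be compared on simulated data by evaluating Equation 4.10 for every subset of variables. This is a sketch under stated assumptions: the helper names `rss_of` and `penalized_rss` are hypothetical, RIC is taken in its usual form with $\lambda = 2\log(d)$ where $d$ is the total number of predictors, and the exhaustive search over subsets (including the full model) is only feasible for small $d$.

```python
import numpy as np
from itertools import combinations


def rss_of(x, y, cols):
    """RSS of the least squares fit with intercept on the given columns."""
    design = np.column_stack([np.ones(len(y))] + [x[:, j] for j in cols])
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    return np.sum((y - design @ coef) ** 2)


rng = np.random.default_rng(5)
n, d = 150, 4
x = rng.normal(size=(n, d))
y = x @ np.array([1.0, 0.0, -1.0, 0.0]) + rng.normal(size=n)

# Estimated residual variance of the full model, RSS / (n - d - 1).
var_e_hat = rss_of(x, y, range(d)) / (n - d - 1)

# Penalization parameters lambda (AIC and Mallows' Cp share lambda = 2).
lambdas = {"AIC": 2.0, "BIC": np.log(n), "RIC": 2.0 * np.log(d)}


def penalized_rss(cols, lam):
    """Equation 4.10: RSS_q + lambda * q * Var_hat(e), with q = len(cols)."""
    return rss_of(x, y, cols) + lam * len(cols) * var_e_hat


# Enumerate all subsets and keep the minimizer for each criterion.
subsets = [c for q in range(d + 1) for c in combinations(range(d), q)]
best = {name: min(subsets, key=lambda c: penalized_rss(c, lam))
        for name, lam in lambdas.items()}
```

Each entry of `best` is a tuple of selected column indices; larger values of $\lambda$ tend to return smaller models.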
Variable selection using penalized RSS is widespread, but there are two drawbacks. First, the fixed choice of $\lambda$ has a strong impact on the size of the selected model. Large values of $\lambda$ favor a small model size and vice versa. In contrast to penalized regression, as discussed in Section 4.2, the parameter $\lambda$ is fixed and there is no intrinsic adaptation to the data under analysis. Second, penalized RSS is relatively sensitive to small changes in the data (George, 2000).