
4.1.1

The linear regression model and best linear predictor

Linear regression describes the (linear) relationship between one variable of interest and d explaining variables X by the following linear combination

$$Y = \beta_0 + \underbrace{\beta^T}_{1\times d}\,\underbrace{X}_{d\times 1} + e \qquad (4.1)$$

where

• Y is the dependent variable, outcome, or response with
  expectation $\mu_Y$, and
  variance $\sigma_Y^2$,

• X are the d predictors or explaining variables (d×1) with
  expectations µ, and
  covariance matrix Σ that can be decomposed into a variance matrix V and correlation matrix P according to $\Sigma = V^{1/2} P V^{1/2}$,

• β are the d regression coefficients of size (d×1),

• $\beta_0$ is the intercept or offset, and

• e is the irreducible error with E(e) = 0.

See e.g. Whittaker (1990) [Chapter 5] for more details. For interpretation the β-coefficients are most important; $\beta_i$, with $i \in \{1, \ldots, d\}$, gives the influence of $X_i$ on Y conditional on all the other d−1 variables. In the following this thesis refers to

$X_i$ as zero variable, if $\beta_i = 0$,

$X_i$ as nonzero variable, if $\beta_i \neq 0$.

Intercept and β-coefficients are selected to minimize the expected squared deviation between the model prediction $\hat{Y} = \beta_0 + \beta^T X$ and the response, the so-called prediction error $E[(Y - \hat{Y})^2]$. As shown in Section 4.1.2, the prediction error is a central quantity in the linear model. The prediction error is minimized by regression coefficients equal to

$$\beta = \Sigma^{-1} \Sigma_{XY} \qquad (4.2)$$

where $\Sigma_{XY}$ is the d-dimensional vector of covariances between X and Y, and an intercept equal to

$$\beta_0 = \mu_Y - \beta^T \mu . \qquad (4.3)$$

Hence, the best linear predictor equals

$$Y^\star = \beta_0 + \beta^T X . \qquad (4.4)$$

The coefficients $\beta_0$ and $\beta = (\beta_1, \ldots, \beta_d)^T$ are constants, and not random variables like X, Y and $Y^\star$.
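As a minimal sketch of how the best linear predictor follows from population moments, the snippet below solves $\Sigma \beta = \Sigma_{XY}$ for the coefficients and then applies Equation 4.3 for the intercept. The function name `best_linear_predictor` and all numerical values are hypothetical and chosen only for illustration.

```python
import numpy as np

def best_linear_predictor(mu_y, mu, Sigma, Sigma_xy):
    """Return (beta0, beta) minimizing E[(Y - beta0 - beta^T X)^2]."""
    # beta = Sigma^{-1} Sigma_XY; solve() avoids forming the inverse explicitly
    beta = np.linalg.solve(Sigma, Sigma_xy)
    # Intercept as in Equation 4.3
    beta0 = mu_y - beta @ mu
    return beta0, beta

# Hypothetical population moments for d = 2 predictors
mu_y = 1.0
mu = np.array([2.0, -1.0])
Sigma = np.array([[4.0, 1.0],
                  [1.0, 1.0]])
Sigma_xy = np.array([2.0, 0.0])

beta0, beta = best_linear_predictor(mu_y, mu, Sigma, Sigma_xy)
```

In this toy example the second predictor has zero covariance with Y but still receives a nonzero coefficient, because the coefficients are conditional on the other, correlated predictor.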

Often, it is convenient to center and standardize the response and the predictor variables. With $Y_{std} = (Y - \mu_Y)/\sigma_Y$ and $X_{std} = V^{-1/2}(X - \mu)$ the predictor equation (Equation 4.4) can be written as

$$Y^\star_{std} = (Y^\star - \mu_Y)/\sigma_Y = \beta_{std}^T X_{std} \qquad (4.5)$$

where $\beta_{std}$ are the standardized regression coefficients

$$\beta_{std} = V^{1/2} \beta \, \sigma_Y^{-1} = P^{-1} P_{XY} \qquad (4.6)$$

where $P_{XY}$ is the d-dimensional vector of correlations between X and Y and P is the d×d matrix of correlations among X. The standardized intercept vanishes because of the centering.
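The two expressions for the standardized coefficients in Equation 4.6 can be checked numerically. The following sketch uses hypothetical population moments (not taken from the thesis) and computes $\beta_{std}$ once by rescaling β and once from correlations only.

```python
import numpy as np

# Hypothetical population moments (d = 2), chosen only for illustration
sigma_y = 2.0
Sigma = np.array([[4.0, 1.0],
                  [1.0, 1.0]])
Sigma_xy = np.array([2.0, 0.0])

# Decompose Sigma = V^{1/2} P V^{1/2}
sd = np.sqrt(np.diag(Sigma))        # diagonal entries of V^{1/2}
P = Sigma / np.outer(sd, sd)        # correlation matrix among X
P_xy = Sigma_xy / (sd * sigma_y)    # correlations between X and Y

beta = np.linalg.solve(Sigma, Sigma_xy)
beta_std_from_beta = sd * beta / sigma_y        # V^{1/2} beta / sigma_Y
beta_std_from_corr = np.linalg.solve(P, P_xy)   # P^{-1} P_XY
```

Both routes give the same vector, which illustrates that the standardized coefficients depend on the correlations alone.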

In practice, the variable of interest y is represented by an n-dimensional vector of observations and the explaining variables x by an n×d matrix with one row per observation. Empirical estimates b of the regression coefficients are derived by minimizing the residual sum of squares (RSS)

$$\mathrm{RSS}(b) = \left(y - (b_0 1_n + x b)\right)^T \left(y - (b_0 1_n + x b)\right) , \qquad (4.7)$$

where $1_n$ is the n-dimensional vector of ones. Differentiating with respect to b and setting the derivative to zero leads, for centered y and x (so that the intercept drops out), to the ordinary least squares solution

$$b = (x^T x)^{-1} x^T y .$$
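A minimal sketch of the least squares computation, under the assumption that observations are stored in the rows of an n×d array and that the intercept is handled by centering. The simulated data and variable names are hypothetical. The normal equations $(x^T x)\, b = x^T y$ are solved on centered data and cross-checked against NumPy's generic least squares solver.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 50, 3
x = rng.normal(size=(n, d))   # n x d matrix of observations (assumed layout)
y = 1.5 + x @ np.array([2.0, 0.0, -1.0]) + rng.normal(scale=0.5, size=n)

# Centering removes the intercept from the minimization
xc = x - x.mean(axis=0)
yc = y - y.mean()

# Normal equations: (xc^T xc) b = xc^T yc
b = np.linalg.solve(xc.T @ xc, xc.T @ yc)
b0 = y.mean() - b @ x.mean(axis=0)

# Cross-check with an explicit intercept column and lstsq
design = np.column_stack([np.ones(n), x])
coef, *_ = np.linalg.lstsq(design, y, rcond=None)
```

Solving the linear system is generally preferable to forming the inverse of $x^T x$ explicitly, since it is numerically more stable.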

According to the Gauss-Markov theorem the least squares solution has the smallest variance of all linear unbiased estimators (Fahrmeir et al., 2003). Nonetheless, there may exist biased estimators, such as regularized estimators, that have a lower prediction error than the least squares estimate. Additionally, the ordinary least squares estimate requires the matrix $x^T x$ to be positive definite; otherwise $x^T x$ is not invertible, because a matrix is invertible only if it has full rank. Deviations from full rank arise in two ways. First, exact linear dependencies among the d explaining variables make $x^T x$ singular, and strong correlation among them makes it nearly singular, so that its inverse is numerically unstable. Second, especially in small n, large d situations, estimates of the covariance matrix have a rank at most equal to the sample size n ≪ d. Then, regularization is needed to derive an estimate of full rank. There exist several strategies for regularization or penalization in regression. Since they aim at minimizing the prediction error, the most important ones are discussed in Section 4.2.
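The rank deficiency in the small n, large d setting can be illustrated with a short sketch. The data are simulated, and the added diagonal term is only one possible, ridge-type way to restore full rank; it is an assumption of this example and not necessarily the regularization used later in the thesis.

```python
import numpy as np

rng = np.random.default_rng(1)
n, d = 5, 20                          # small n, large d
x = rng.normal(size=(n, d))
xc = x - x.mean(axis=0)

gram = xc.T @ xc                      # d x d, but built from only n observations
rank = np.linalg.matrix_rank(gram)    # at most n (n - 1 after centering)

lam = 0.1                             # hypothetical penalty strength
gram_reg = gram + lam * np.eye(d)     # ridge-type diagonal shift
rank_reg = np.linalg.matrix_rank(gram_reg)
```

Here `gram` cannot be inverted, whereas `gram_reg` has full rank d and therefore admits a unique solution of the corresponding linear system.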

4.1.2

The decomposition of variance

The resulting minimal prediction error is

$$E[(Y - Y^\star)^2] = \sigma_Y^2 - \beta^T \Sigma \beta .$$

Alternatively, this irreducible error may be written

$$E[(Y - Y^\star)^2] = \sigma_Y^2 (1 - \Omega^2)$$

where $\Omega = \mathrm{Cor}(Y, Y^\star)$ and

$$\Omega^2 = P_{YX} P^{-1} P_{XY}$$

is the squared multiple correlation coefficient. Furthermore, $\mathrm{Cov}(Y, Y^\star) = \sigma_Y^2 \Omega^2$ and $E(Y^\star) = \mu_Y$. The expectation $E[(Y - Y^\star)^2] = \mathrm{Var}(Y - Y^\star)$ is also called the unexplained variance or noise variance. Together with the explained variance or signal variance $\mathrm{Var}(Y^\star) = \sigma_Y^2 \Omega^2$ it adds up to the total variance $\mathrm{Var}(Y) = \sigma_Y^2$. Accordingly, the proportion of explained variance is

$$\frac{\mathrm{Var}(Y^\star)}{\mathrm{Var}(Y)} = \Omega^2$$

which indicates that $\Omega^2$ is the central quantity for understanding both nominal prediction error and variance decomposition in the linear model. The ratio of signal variance to noise variance is

$$\frac{\mathrm{Var}(Y^\star)}{\mathrm{Var}(Y - Y^\star)} = \frac{\Omega^2}{1 - \Omega^2} .$$

A summary of these relations is given in Table 4.1, along with the empirical error decomposition in terms of observed sum of squares.
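The population-level relations can be verified on a small example. The sketch below assumes hypothetical moments and computes $\Omega^2$ in two ways: as the ratio of explained to total variance, and from correlations via $P_{YX} P^{-1} P_{XY}$.

```python
import numpy as np

# Hypothetical population moments, for illustration only
sigma_y2 = 4.0
Sigma = np.array([[4.0, 1.0],
                  [1.0, 1.0]])
Sigma_xy = np.array([2.0, 0.0])

beta = np.linalg.solve(Sigma, Sigma_xy)

explained = beta @ Sigma @ beta       # Var(Y*) = beta^T Sigma beta
unexplained = sigma_y2 - explained    # E[(Y - Y*)^2]
omega2 = explained / sigma_y2         # squared multiple correlation
snr = omega2 / (1.0 - omega2)         # signal variance / noise variance

# Omega^2 from correlations only: P_YX P^{-1} P_XY
sd = np.sqrt(np.diag(Sigma))
P = Sigma / np.outer(sd, sd)
P_xy = Sigma_xy / (sd * np.sqrt(sigma_y2))
omega2_corr = P_xy @ np.linalg.solve(P, P_xy)
```

For these values one third of the variance of Y is explained, so the signal-to-noise ratio is one half.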

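If instead of the optimal parameters $\beta_0$ and β we employ $\beta_0' = \beta_0 + \Delta\beta_0$ and $\beta' = \beta + \Delta\beta$, the minimal prediction error $E[(Y - Y^\star)^2]$ increases by the model error

$$\mathrm{ME}(\Delta\beta_0, \Delta\beta) = (\Delta\beta)^T \Sigma \, \Delta\beta + (\Delta\beta_0)^2 . \qquad (4.8)$$

The relative model error is the ratio of the model error and the irreducible error $E[(Y - Y^\star)^2]$.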
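The model error can be checked by comparing the prediction error at perturbed parameters with the minimum. The sketch below assumes centered predictors (µ = 0), under which Equation 4.8 holds as written; the helper `mse` and all numbers are hypothetical.

```python
import numpy as np

def mse(b0, b, mu_y, sigma_y2, Sigma, Sigma_xy):
    """Population prediction error E[(Y - b0 - b^T X)^2], assuming E(X) = 0."""
    return sigma_y2 - 2 * b @ Sigma_xy + b @ Sigma @ b + (mu_y - b0) ** 2

mu_y, sigma_y2 = 1.0, 4.0
Sigma = np.array([[4.0, 1.0],
                  [1.0, 1.0]])
Sigma_xy = np.array([2.0, 0.0])

beta = np.linalg.solve(Sigma, Sigma_xy)
beta0 = mu_y                          # Equation 4.3 with mu = 0

d_beta0 = 0.3                         # hypothetical perturbations
d_beta = np.array([0.1, -0.2])

minimal = mse(beta0, beta, mu_y, sigma_y2, Sigma, Sigma_xy)
perturbed = mse(beta0 + d_beta0, beta + d_beta, mu_y, sigma_y2, Sigma, Sigma_xy)
increase = perturbed - minimal

model_error = d_beta @ Sigma @ d_beta + d_beta0 ** 2   # Equation 4.8
relative_model_error = model_error / minimal
```

The observed increase in prediction error coincides with the value given by Equation 4.8.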

Table 4.1: Variance decomposition in terms of squared multiple correlation $\Omega^2$ and corresponding empirical sums of squares.

Level        Total variance                        =  Unexplained variance                    +  Explained variance
Population   $\mathrm{Var}(Y)$                     =  $\mathrm{Var}(Y - Y^\star)$             +  $\mathrm{Var}(Y^\star)$
             $\sigma_Y^2$                          =  $\sigma_Y^2 (1 - \Omega^2)$             +  $\sigma_Y^2 \Omega^2$
Empirical    TSS                                   =  RSS                                     +  ESS
             $\sum_{l=1}^{n} (y_l - \bar{y})^2$    =  $\sum_{l=1}^{n} (y_l - \hat{y}_l)^2$    +  $\sum_{l=1}^{n} (\hat{y}_l - \bar{y})^2$
             df = n − 1                               df = n − d − 1                             df = d

Abbreviations: $\bar{y} = \frac{1}{n} \sum_{i=1}^{n} y_i$; df: degrees of freedom; TSS: total sum of squares; RSS: residual sum of squares; ESS: explained sum of squares.
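The empirical row of Table 4.1 can be reproduced on simulated data. This is a minimal sketch with hypothetical data; it fits a least squares model with intercept and checks that the total sum of squares splits into residual and explained parts.

```python
import numpy as np

rng = np.random.default_rng(2)
n, d = 40, 3
x = rng.normal(size=(n, d))
y = 0.5 + x @ np.array([1.0, 0.0, -2.0]) + rng.normal(size=n)

design = np.column_stack([np.ones(n), x])   # intercept plus d predictors
coef, *_ = np.linalg.lstsq(design, y, rcond=None)
y_hat = design @ coef

tss = np.sum((y - y.mean()) ** 2)       # df = n - 1
rss = np.sum((y - y_hat) ** 2)          # df = n - d - 1
ess = np.sum((y_hat - y.mean()) ** 2)   # df = d
r2 = ess / tss                          # empirical counterpart of Omega^2
```

The identity TSS = RSS + ESS relies on the intercept being part of the fitted model; without it the cross term does not vanish in general.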

4.1.3

Classical strategies for variable selection

A rudimentary approach to variable selection in the linear model is based on a t-test that examines whether the regression coefficients differ from zero. The test utilizes the distribution of the estimated regression coefficients. Under the model of Equation 4.1 with an error e that is normally distributed as N(0, Var(e)), the estimated regression coefficients b approximately follow, for a sample of size n, a multivariate Gaussian distribution

$$b \sim N(\beta, \Sigma^{-1} \mathrm{Var}(e)/n) .$$

From the decomposition of variance, as presented in Table 4.1, it follows that Var(e) equals the unexplained variance $\sigma_Y^2 (1 - \Omega^2)$. To test the null hypothesis that the coefficient of variable i equals zero, i.e. $\beta_i = 0$, the following t-score vector is derived

$$\tau_{XY} = \mathrm{diag}\{\Sigma^{-1}\}^{-1/2} \, \beta \, \sigma_Y^{-1} (1 - \Omega^2)^{-1/2} \sqrt{\mathrm{df}} \qquad (4.9)$$

where df = n−d−1 represents the degrees of freedom. Under the null hypothesis the estimate $\hat{\tau}_{XY}(i)$ follows a t-distribution with df = n−d−1 degrees of freedom. Using this result it is possible to assign p-values to each variable. In Section 4.3 the connection of $\tau_{XY}$ to partial correlation is discussed.
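A plug-in version of the t-score in Equation 4.9 can be compared with the textbook statistic $b_i / \mathrm{se}(b_i)$. The sketch below reads the trailing df in Equation 4.9 as a factor $\sqrt{\mathrm{df}}$, which is an assumption of this example, and uses simulated data with hypothetical coefficients.

```python
import numpy as np

rng = np.random.default_rng(3)
n, d = 60, 3
x = rng.normal(size=(n, d))
y = 1.0 + x @ np.array([1.0, 0.0, -0.5]) + rng.normal(size=n)
df = n - d - 1

xc, yc = x - x.mean(axis=0), y - y.mean()
b = np.linalg.solve(xc.T @ xc, xc.T @ yc)
rss = np.sum((yc - xc @ b) ** 2)

# Plug-in estimates for Equation 4.9 (same normalization for S and s_y)
S = np.cov(x, rowvar=False)          # estimate of Sigma
s_y = np.std(y, ddof=1)              # estimate of sigma_Y
r2 = 1.0 - rss / np.sum(yc ** 2)     # estimate of Omega^2
tau = (b / np.sqrt(np.diag(np.linalg.inv(S)))
       / s_y / np.sqrt(1.0 - r2) * np.sqrt(df))

# Classical t-statistic b_i / se(b_i) for comparison
se = np.sqrt(rss / df * np.diag(np.linalg.inv(xc.T @ xc)))
t_classic = b / se
```

The two vectors agree, so p-values could be obtained from a t-distribution with df degrees of freedom, for instance with `scipy.stats.t.sf`, which is not shown here.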

Stepwise selection comprises heuristic strategies to select variables based on statistics like $\hat{\tau}_{XY}$ or alternatively the F- or Wald-statistic, see e.g. Fahrmeir and Tutz (2001) or Hastie et al. (2009). In backward selection all variables are included and $\hat{\tau}_{XY}$ is computed for all variables; then the variable with the lowest absolute value of $\hat{\tau}_{XY}$ is excluded from the model and $\hat{\tau}_{XY}$ is recomputed for the remaining variables. This step is repeated until only variables with a p-value below a predefined threshold remain in the model. Such selection strategies suffer from unstable results due to dependencies among the predictor variables, since the final model depends strongly on the order in which variables are excluded. For example, backward and forward selection usually do not agree on the same model (Burnham and Anderson, 2002).
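The backward selection loop described above can be sketched as follows. For simplicity the stopping rule uses a threshold on the absolute t-score instead of a p-value; this substitution, the threshold value, the function names and the simulated data are all assumptions of the example.

```python
import numpy as np

def t_scores(x, y):
    """Classical t-statistics b_i / se(b_i) for a model with intercept."""
    n, d = x.shape
    xc, yc = x - x.mean(axis=0), y - y.mean()
    gram_inv = np.linalg.inv(xc.T @ xc)
    b = gram_inv @ xc.T @ yc
    rss = np.sum((yc - xc @ b) ** 2)
    return b / np.sqrt(rss / (n - d - 1) * np.diag(gram_inv))

def backward_selection(x, y, t_threshold=2.0):
    """Drop the variable with the smallest |t| until all |t| reach the threshold."""
    active = list(range(x.shape[1]))
    while active:
        t = np.abs(t_scores(x[:, active], y))
        weakest = int(np.argmin(t))
        if t[weakest] >= t_threshold:
            break                     # every remaining variable passes
        active.pop(weakest)           # exclude and refit on the rest
    return active

rng = np.random.default_rng(4)
n = 200
x = rng.normal(size=(n, 4))
y = 3.0 * x[:, 0] - 2.0 * x[:, 2] + rng.normal(size=n)
selected = backward_selection(x, y)
```

Because the t-scores are recomputed after every exclusion, the result depends on the order of removals, which is the source of the instability noted above when predictors are correlated.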

A different approach to variable selection is taken by penalized RSS. Penalized RSS criteria quantify the goodness of fit of a specific model with an additional penalty on the model size q ≤ d. Using penalized RSS it is possible to compare models of different size that include different variables. Following George (2000), a general form of the penalized RSS for a given model of size q is

$$\mathrm{RSS}_q^{pen} = \mathrm{RSS}_q + \lambda \cdot q \cdot \widehat{\mathrm{Var}}(e) \qquad (4.10)$$

where $\mathrm{RSS}_q$ is the RSS based on the model of dimension q and $\widehat{\mathrm{Var}}(e) = \mathrm{RSS}/(n-d-1)$ is the estimated residual variance for the full model. The penalization parameter λ is fixed in advance and differs in:

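• Akaike's Information Criterion (AIC)
  $\mathrm{RSS}_q^{AIC} = \mathrm{RSS}_q + 2 \cdot q \cdot \widehat{\mathrm{Var}}(e)$,

• Bayesian Information Criterion (BIC)
  $\mathrm{RSS}_q^{BIC} = \mathrm{RSS}_q + \log(n) \cdot q \cdot \widehat{\mathrm{Var}}(e)$,

• Risk Inflation Criterion (RIC)
  $\mathrm{RSS}_q^{RIC} = \mathrm{RSS}_q + 2 \cdot \log(d) \cdot q \cdot \widehat{\mathrm{Var}}(e)$,

• (minimum) Mallows' $C_p$
  $\mathrm{RSS}_q^{C_p} = \mathrm{RSS}_q + 2 \cdot q \cdot \widehat{\mathrm{Var}}(e)$.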
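The criteria above differ only in the value of λ, which the following sketch makes explicit by scoring all subsets of a small simulated problem. The RIC penalty is taken as 2·log(d), with d the number of candidate predictors; this reading, the helper `rss_of` and the data are assumptions of the example.

```python
import numpy as np
from itertools import combinations

def rss_of(x, y, subset):
    """RSS of the least squares fit with intercept using the given columns."""
    design = np.column_stack([np.ones(len(y))] + [x[:, j] for j in subset])
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    return np.sum((y - design @ coef) ** 2)

rng = np.random.default_rng(5)
n, d = 100, 4
x = rng.normal(size=(n, d))
y = 2.0 * x[:, 0] - 1.5 * x[:, 1] + rng.normal(size=n)

# Estimated residual variance of the full model
var_e_hat = rss_of(x, y, range(d)) / (n - d - 1)

# Penalty parameters lambda; the RIC value is an assumption of this sketch
lambdas = {"AIC/Cp": 2.0, "BIC": np.log(n), "RIC": 2.0 * np.log(d)}

# All 2^d subsets, from the empty model to the full model
subsets = [s for q in range(d + 1) for s in combinations(range(d), q)]
best = {}
for name, lam in lambdas.items():
    scores = [rss_of(x, y, s) + lam * len(s) * var_e_hat for s in subsets]
    best[name] = subsets[int(np.argmin(scores))]
```

An exhaustive search over all subsets is feasible only for small d; the point of the sketch is that a larger λ can never select a larger model than a smaller λ does.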

Variable selection using penalized RSS is widespread, but it has two drawbacks. First, the fixed choice of λ has a strong impact on the size of the selected model: large values of λ favor a small model size and vice versa. In contrast to penalized regression, as discussed in Section 4.2, the parameter λ is fixed and there is no intrinsic adaptation to the data under analysis. Second, penalized RSS is relatively sensitive to small changes in the data (George, 2000).