Linear regression revisited - A Multivariate Framework for Variable Selection and Identification of Biomarkers in High-Dimensional Omics Data

4.1.1 The linear regression model and best linear predictor

Linear regression describes the (linear) relationship between one variable of interest Y and d explanatory variables X by the linear combination

$$Y = \beta_0 + \underbrace{\beta^T}_{1\times d}\,\underbrace{X}_{d\times 1} + e \qquad (4.1)$$

where

• Y is the dependent variable, outcome, or response with

– expectation $\mu_Y$, and

– variance $\sigma_Y^2$,

• X are the d predictors or explanatory variables (d×1) with

– expectations $\mu$, and

– covariance matrix $\Sigma$ that can be decomposed into a diagonal variance matrix V and a correlation matrix P according to

$$\Sigma = V^{1/2} P V^{1/2},$$

• $\beta$ are the d regression coefficients (d×1),

• $\beta_0$ is the intercept or offset, and

• e is the irreducible error with E(e) = 0.

See e.g. Whittaker (1990, Chapter 5) for more details. For interpretation, the β-coefficients are most important: $\beta_i$, with $i \in \{1, \dots, d\}$, gives the influence of $X_i$ on Y conditional on all the other d−1 variables. In the following, this thesis refers to

• $X_i$ as a zero variable if $\beta_i = 0$, and

• $X_i$ as a nonzero variable if $\beta_i \neq 0$.

Intercept and β-coefficients are chosen to minimize the expected squared deviation between the model $\hat{Y} = \beta_0 + \beta^T X$ and the response, the so-called prediction error $E[(Y - \hat{Y})^2]$. As Section 4.1.2 shows, the prediction error is a pivotal quantity in the linear model. It is minimized by regression coefficients equal to

$$\beta = \Sigma^{-1} \Sigma_{XY} \qquad (4.2)$$

where $\Sigma_{XY}$ is the d-dimensional vector of covariances between X and Y, and an intercept equal to

$$\beta_0 = \mu_Y - \beta^T \mu. \qquad (4.3)$$

Hence, the best linear predictor equals

$$Y^\star = \beta_0 + \beta^T X. \qquad (4.4)$$

The coefficients $\beta_0$ and $\beta = (\beta_1, \dots, \beta_d)^T$ are constants, not random variables like X, Y and $Y^\star$.
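As a minimal sketch of how the best linear predictor follows from population moments, the snippet below solves $\Sigma \beta = \Sigma_{XY}$ and computes the intercept from Equation 4.3. All numerical values and the helper name `best_linear_predictor` are hypothetical and chosen only for illustration.

```python
import numpy as np

# Hypothetical population moments for d = 2 predictors (illustrative values only).
mu = np.array([1.0, -2.0])            # E(X)
mu_Y = 3.0                            # E(Y)
Sigma = np.array([[2.0, 0.6],
                  [0.6, 1.0]])        # Cov(X)
Sigma_XY = np.array([1.0, 0.5])       # Cov(X, Y)

# Coefficients of the best linear predictor: solve Sigma @ beta = Sigma_XY
# rather than forming the inverse explicitly.
beta = np.linalg.solve(Sigma, Sigma_XY)
beta0 = mu_Y - beta @ mu              # intercept, Equation 4.3


def best_linear_predictor(x):
    """Evaluate Y* = beta0 + beta^T x for one predictor vector x."""
    return beta0 + beta @ np.asarray(x, dtype=float)
```

Evaluating the predictor at the mean vector `mu` returns `mu_Y`, which reflects that the best linear predictor has the same expectation as the response.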

Often, it is convenient to center and standardize the response and the predictor variables. With $Y_{\text{std}} = (Y - \mu_Y)/\sigma_Y$ and $X_{\text{std}} = V^{-1/2}(X - \mu)$, the predictor equation (Equation 4.4) can be written as

$$Y^\star_{\text{std}} = (Y^\star - \mu_Y)/\sigma_Y = \beta_{\text{std}}^T X_{\text{std}} \qquad (4.5)$$

where $\beta_{\text{std}}$ are the standardized regression coefficients

$$\beta_{\text{std}} = V^{1/2} \beta\, \sigma_Y^{-1} = P^{-1} P_{XY} \qquad (4.6)$$

with $P_{XY}$ the d-dimensional vector of correlations between X and Y and P the d×d matrix of correlations among X. The standardized intercept vanishes because of the centering.
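The two expressions for the standardized coefficients in Equation 4.6 can be checked numerically. The following sketch assumes hypothetical values for $\Sigma$, $\Sigma_{XY}$ and $\sigma_Y$ and computes the standardized coefficients once by rescaling $\beta$ and once from the correlations alone.

```python
import numpy as np

# Hypothetical population quantities (illustrative values only).
Sigma = np.array([[2.0, 0.6],
                  [0.6, 1.0]])        # Cov(X)
Sigma_XY = np.array([1.0, 0.5])       # Cov(X, Y)
sigma_Y = 1.5                         # standard deviation of Y

# Decompose Sigma = V^{1/2} P V^{1/2}.
V_sqrt = np.diag(np.sqrt(np.diag(Sigma)))
V_inv_sqrt = np.diag(1.0 / np.sqrt(np.diag(Sigma)))
P = V_inv_sqrt @ Sigma @ V_inv_sqrt           # correlations among X
P_XY = V_inv_sqrt @ Sigma_XY / sigma_Y        # correlations between X and Y

beta = np.linalg.solve(Sigma, Sigma_XY)       # unstandardized coefficients

# Both routes of Equation 4.6.
beta_std_from_beta = V_sqrt @ beta / sigma_Y
beta_std_from_corr = np.linalg.solve(P, P_XY)
```

Both vectors agree up to floating point error, so the standardized coefficients depend on the data only through the correlations.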

In practice, the variable of interest y is represented by an n-dimensional vector of observations and the explanatory variables x by an n×d matrix with one row per observation. Empirical estimates $b_0$ and b of the intercept and the regression coefficients are derived by minimizing the residual sum of squares (RSS)

$$\text{RSS}(b) = \big(y - (b_0 \mathbf{1}_n + x b)\big)^T \big(y - (b_0 \mathbf{1}_n + x b)\big). \qquad (4.7)$$

Differentiating with respect to b and setting the derivative to zero leads, for centered y and x (so that the intercept drops out), to the ordinary least squares solution

$$b = (x^T x)^{-1} x^T y.$$
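A minimal sketch of least squares estimation on simulated data, assuming x is stored with one row per observation (n×d). The coefficients are obtained from the normal equations on centered data and cross-checked against `np.linalg.lstsq` with an explicit intercept column. The simulated model and its coefficients are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)        # fixed seed for reproducibility
n, d = 200, 3
x = rng.normal(size=(n, d))           # simulated predictors, one row per observation
beta_true = np.array([1.5, 0.0, -2.0])  # hypothetical coefficients
y = 0.5 + x @ beta_true + rng.normal(scale=0.1, size=n)

# Centering removes the intercept from the normal equations.
x_c = x - x.mean(axis=0)
y_c = y - y.mean()

# Solve (x_c^T x_c) b = x_c^T y_c instead of inverting explicitly.
b = np.linalg.solve(x_c.T @ x_c, x_c.T @ y_c)
b0 = y.mean() - b @ x.mean(axis=0)

# Cross-check with NumPy's least squares routine on a design matrix
# that carries an explicit column of ones for the intercept.
design = np.column_stack([np.ones(n), x])
coef, *_ = np.linalg.lstsq(design, y, rcond=None)
```

Solving the linear system is usually preferable to computing the inverse, since it is cheaper and numerically more stable.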

According to the Gauss-Markov theorem, the least squares solution has the smallest variance among all linear unbiased estimates (Fahrmeir et al., 2003). Nonetheless, biased estimates, such as regularized estimates, may have a lower prediction error than the least squares estimate. Additionally, the ordinary least squares estimate requires the matrix $x^T x$ to be positive definite; otherwise, $x^T x$ is not invertible, since a matrix is invertible only if it has full rank. On the one hand, rank deficiency arises from exact linear dependencies among the d explanatory variables, and strong correlation among them leaves $x^T x$ close to singular. On the other hand, especially in small n, large d situations, estimates of the covariance matrix have a rank at most equal to the sample size $n \ll d$. Then, regularization is needed to derive an estimate of full rank. There exist several strategies for regularization or penalization in regression. Since they aim at minimizing the prediction error, the most important ones are discussed in Section 4.2.
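The rank argument can be illustrated on simulated data. The sketch below draws n = 5 observations of d = 20 predictors and shows that the cross-product matrix is rank deficient. The ridge-type term `lam * I` is only one possible regularization, assumed here for illustration; the value of `lam` is arbitrary.

```python
import numpy as np

rng = np.random.default_rng(1)
n, d = 5, 20                          # small n, large d
x = rng.normal(size=(n, d))
x_c = x - x.mean(axis=0)              # centered predictors

gram = x_c.T @ x_c                    # d x d, proportional to the covariance estimate
rank = np.linalg.matrix_rank(gram)    # at most n - 1 after centering

# Ridge-type regularization (one possible choice, assumed here for illustration):
# adding lam * I makes the matrix positive definite and hence invertible.
lam = 0.1
gram_reg = gram + lam * np.eye(d)
rank_reg = np.linalg.matrix_rank(gram_reg)
```

After regularization all eigenvalues are at least `lam`, so the matrix has full rank d even though n is much smaller than d.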

4.1.2 The decomposition of variance

The resulting minimal prediction error is

$$E[(Y - Y^\star)^2] = \sigma_Y^2 - \beta^T \Sigma \beta.$$

Alternatively, this irreducible error may be written as

$$E[(Y - Y^\star)^2] = \sigma_Y^2 (1 - \Omega^2)$$

where $\Omega = \text{Cor}(Y, Y^\star)$ and $\Omega^2 = P_{YX} P^{-1} P_{XY}$ is the squared multiple correlation coefficient. Furthermore, $\text{Cov}(Y, Y^\star) = \sigma_Y^2 \Omega^2$ and $E(Y^\star) = \mu_Y$. The expectation $E[(Y - Y^\star)^2] = \text{Var}(Y - Y^\star)$ is also called the unexplained variance or noise variance. Together with the explained variance or signal variance $\text{Var}(Y^\star) = \sigma_Y^2 \Omega^2$ it adds up to the total variance $\text{Var}(Y) = \sigma_Y^2$. Accordingly, the proportion of explained variance is

$$\frac{\text{Var}(Y^\star)}{\text{Var}(Y)} = \Omega^2$$

which indicates that $\Omega^2$ is the central quantity for understanding both nominal prediction error and variance decomposition in the linear model. The ratio of signal variance to noise variance is

$$\frac{\text{Var}(Y^\star)}{\text{Var}(Y - Y^\star)} = \frac{\Omega^2}{1 - \Omega^2}.$$
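The link between the covariance form and the correlation form of the explained variance can be made explicit. The short derivation below is a sketch that uses only $\beta = \Sigma^{-1}\Sigma_{XY}$, the decomposition $\Sigma = V^{1/2} P V^{1/2}$, and the identity $\Sigma_{XY} = \sigma_Y V^{1/2} P_{XY}$, which follows from the definition of correlation.

```latex
\begin{aligned}
\operatorname{Var}(Y^\star) = \beta^T \Sigma \beta
  &= \Sigma_{YX}\, \Sigma^{-1}\, \Sigma_{XY} \\
  &= \sigma_Y P_{YX} V^{1/2} \; V^{-1/2} P^{-1} V^{-1/2} \; V^{1/2} P_{XY}\, \sigma_Y \\
  &= \sigma_Y^2\, P_{YX} P^{-1} P_{XY} \;=\; \sigma_Y^2\, \Omega^2 .
\end{aligned}
```

Subtracting this from $\sigma_Y^2$ gives the unexplained variance $\sigma_Y^2(1 - \Omega^2)$.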

A summary of these relations is given in Table 4.1, along with the empirical error decomposition in terms of observed sum of squares.

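If instead of the optimal parameters $\beta_0$ and $\beta$ we employ $\beta_0' = \beta_0 + \Delta\beta_0$ and $\beta' = \beta + \Delta\beta$, the minimal prediction error $E[(Y - Y^\star)^2]$ increases by the model error, which for centered predictors ($\mu = 0$) equals

$$\text{ME}(\Delta\beta_0, \Delta\beta) = (\Delta\beta)^T \Sigma\, \Delta\beta + (\Delta\beta_0)^2. \qquad (4.8)$$

The relative model error is the ratio of the model error and the irreducible error $E[(Y - Y^\star)^2]$.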
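The form of the model error can be traced in a few lines. The following derivation is a sketch that writes the perturbed predictor as $\beta_0' + (\beta')^T X$ and uses that the optimal residual $Y - Y^\star$ has mean zero and is uncorrelated with X, because $\text{Cov}(X, Y - Y^\star) = \Sigma_{XY} - \Sigma\beta = 0$.

```latex
\begin{aligned}
E\big[(Y - \beta_0' - (\beta')^T X)^2\big]
  &= E\big[\big((Y - Y^\star) - (\Delta\beta_0 + \Delta\beta^T X)\big)^2\big] \\
  &= E\big[(Y - Y^\star)^2\big]
     - 2\, E\big[(Y - Y^\star)(\Delta\beta_0 + \Delta\beta^T X)\big]
     + E\big[(\Delta\beta_0 + \Delta\beta^T X)^2\big] \\
  &= E\big[(Y - Y^\star)^2\big]
     + \Delta\beta^T \Sigma\, \Delta\beta
     + (\Delta\beta_0 + \Delta\beta^T \mu)^2 .
\end{aligned}
```

The cross term vanishes, and for centered predictors ($\mu = 0$) the excess over the minimal prediction error reduces to the expression in Equation 4.8.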

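Table 4.1: Variance decomposition in terms of the squared multiple correlation $\Omega^2$ and corresponding empirical sums of squares.

| Level | Total variance | = Unexplained variance | + Explained variance |
|---|---|---|---|
| Population | $\text{Var}(Y)$ | $= \text{Var}(Y - Y^\star)$ | $+\ \text{Var}(Y^\star)$ |
| | $\sigma_Y^2$ | $= \sigma_Y^2 (1 - \Omega^2)$ | $+\ \sigma_Y^2 \Omega^2$ |
| Empirical | TSS | = RSS | + ESS |
| | $\sum_{l=1}^n (y_l - \bar{y})^2$ | $= \sum_{l=1}^n (y_l - \hat{y}_l)^2$ | $+\ \sum_{l=1}^n (\hat{y}_l - \bar{y})^2$ |
| | df = n−1 | df = n−d−1 | df = d |

Abbreviations: $\bar{y} = \frac{1}{n} \sum_{i=1}^n y_i$; df: degrees of freedom; TSS: total sum of squares; RSS: residual sum of squares; ESS: explained sum of squares.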
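The empirical row of Table 4.1 can be reproduced on simulated data. The sketch below fits a least squares model with intercept and computes the three sums of squares; the simulated coefficients are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(2)
n, d = 100, 3
x = rng.normal(size=(n, d))
y = 1.0 + x @ np.array([2.0, 0.0, -1.0]) + rng.normal(size=n)  # hypothetical model

# Least squares fit with an intercept column.
design = np.column_stack([np.ones(n), x])
coef, *_ = np.linalg.lstsq(design, y, rcond=None)
y_hat = design @ coef

tss = np.sum((y - y.mean()) ** 2)       # total sum of squares,     df = n - 1
rss = np.sum((y - y_hat) ** 2)          # residual sum of squares,  df = n - d - 1
ess = np.sum((y_hat - y.mean()) ** 2)   # explained sum of squares, df = d

r_squared = ess / tss                   # empirical counterpart of Omega^2
signal_to_noise = ess / rss             # empirical counterpart of Omega^2 / (1 - Omega^2)
```

The identity TSS = RSS + ESS holds exactly here because the fitted model contains an intercept; without one, the cross term between residuals and fitted values need not vanish.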

4.1.3 Classical strategies for variable selection

A rudimentary approach to variable selection in the linear model is based on a t-test that examines whether the regression coefficients differ from zero. The test utilizes the distribution of the estimated regression coefficients. Under the model in Equation 4.1 with a normally distributed error $e \sim N(0, \text{Var}(e))$, the regression coefficients b estimated from n observations approximately follow a multivariate Gaussian distribution

$$b \sim N\big(\beta,\ \Sigma^{-1}\, \text{Var}(e) / n\big).$$

From the decomposition of variance, as presented in Table 4.1, it follows that Var(e) equals the unexplained variance $\sigma_Y^2 (1 - \Omega^2)$. To test the null hypothesis that the coefficient of variable i equals zero, i.e. $\beta_i = 0$, the following t-score vector is derived

$$\tau_{XY} = \text{diag}\{\Sigma^{-1}\}^{-1/2}\, \beta\, \sigma_Y^{-1} (1 - \Omega^2)^{-1/2} \sqrt{\text{df}} \qquad (4.9)$$

where df = n−d−1 represents the degrees of freedom. Under the null hypothesis, the estimate $\hat{\tau}_{XY}(i)$ follows a t-distribution with df degrees of freedom. Using this result, it is possible to assign p-values to each variable. In Section 4.3 the connection of $\tau_{XY}$ to partial correlation is discussed.
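As a sketch of how the t-scores can be computed in practice, the snippet below evaluates them in two ways on simulated data: as the classical ratio of each coefficient to its standard error, and by plugging empirical estimates of $\Sigma$, $\sigma_Y^2$ and $\Omega^2$ into Equation 4.9. The simulated coefficients and the choice of divisor n for the plug-in estimates are assumptions made for illustration; the divisor cancels in the final score.

```python
import numpy as np

rng = np.random.default_rng(3)
n, d = 150, 4
x = rng.normal(size=(n, d))
beta_true = np.array([1.0, 0.0, -0.5, 0.0])   # hypothetical: two zero variables
y = 2.0 + x @ beta_true + rng.normal(size=n)

x_c = x - x.mean(axis=0)
y_c = y - y.mean()
df = n - d - 1

# Classical route: t_i = b_i / se(b_i).
gram_inv = np.linalg.inv(x_c.T @ x_c)
b = gram_inv @ x_c.T @ y_c
rss = np.sum((y_c - x_c @ b) ** 2)
t_classical = b / np.sqrt(rss / df * np.diag(gram_inv))

# Route of Equation 4.9 with plug-in estimates of Sigma, sigma_Y^2 and Omega^2.
Sigma_hat = x_c.T @ x_c / n
var_Y_hat = y_c @ y_c / n
omega2_hat = 1.0 - rss / (y_c @ y_c)
t_eq49 = (b / np.sqrt(np.diag(np.linalg.inv(Sigma_hat)))
          / np.sqrt(var_Y_hat * (1.0 - omega2_hat)) * np.sqrt(df))
```

Both routes give the same scores. Two-sided p-values could then be obtained from a t-distribution with `df` degrees of freedom, for instance with `scipy.stats.t.sf` if SciPy is available.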

Stepwise selection comprises heuristic strategies that select variables based on statistics like $\hat{\tau}_{XY}$ or, alternatively, the F- or Wald-statistic, see e.g. Fahrmeir and Tutz (2001) or Hastie et al. (2009). In backward selection, all variables are included at first and $\hat{\tau}_{XY}$ is computed for each of them. Then the variable with the smallest absolute value of $\hat{\tau}_{XY}$ is excluded from the model and $\hat{\tau}_{XY}$ is recomputed for the remaining variables. This step is repeated until the model contains only variables with a p-value below a predefined threshold. Such selection strategies suffer from unstable results due to dependencies among the predictor variables, since the final model depends strongly on the order in which variables are excluded. For example, backward and forward selection usually do not agree on the same model (Burnham and Anderson, 2002).
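The backward selection loop described above can be sketched as follows. This is an illustrative implementation, not a reference one: the function names are hypothetical, and a fixed cut-off on the absolute t-score stands in for the p-value threshold to keep the example free of further dependencies.

```python
import numpy as np


def t_scores(x, y):
    """Classical t-scores of the least squares coefficients (intercept via centering)."""
    n, d = x.shape
    x_c = x - x.mean(axis=0)
    y_c = y - y.mean()
    gram_inv = np.linalg.inv(x_c.T @ x_c)
    b = gram_inv @ x_c.T @ y_c
    rss = np.sum((y_c - x_c @ b) ** 2)
    return b / np.sqrt(rss / (n - d - 1) * np.diag(gram_inv))


def backward_selection(x, y, t_threshold=2.0):
    """Drop the variable with the smallest |t| until all |t| reach t_threshold.

    The fixed |t| cut-off is an assumed stand-in for the p-value threshold.
    Returns the column indices of the retained variables.
    """
    active = list(range(x.shape[1]))
    while active:
        t = np.abs(t_scores(x[:, active], y))
        weakest = int(np.argmin(t))
        if t[weakest] >= t_threshold:
            break
        del active[weakest]
    return active


rng = np.random.default_rng(4)
n = 300
x = rng.normal(size=(n, 5))
# Hypothetical truth: only columns 0 and 3 are nonzero variables.
y = 1.0 + 3.0 * x[:, 0] - 2.0 * x[:, 3] + rng.normal(size=n)
selected = backward_selection(x, y)
```

With independent predictors, as simulated here, the procedure reliably retains the two nonzero variables. With correlated predictors the retained set can change with small perturbations of the data, which is the instability noted in the text.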

A different approach to variable selection is taken by penalized RSS criteria. They quantify the goodness of fit of a specific model with an additional penalty on the model size q < d. Using penalized RSS, it is possible to compare models of different size that include different variables. Following George (2000), a general form of the penalized RSS for a given model of size q is

$$\text{RSS}^{\text{pen}}_q = \text{RSS}_q + \lambda \cdot q \cdot \widehat{\text{Var}}(e) \qquad (4.10)$$

where $\text{RSS}_q$ is the RSS of the model of dimension q and $\widehat{\text{Var}}(e) = \text{RSS}/(n - d - 1)$ is the estimated residual variance of the full model. The penalization parameter λ is fixed in advance and differs in:

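• Akaike's Information Criterion (AIC)

$$\text{RSS}^{\text{AIC}}_q = \text{RSS}_q + 2 \cdot q \cdot \widehat{\text{Var}}(e),$$

• Bayesian Information Criterion (BIC)

$$\text{RSS}^{\text{BIC}}_q = \text{RSS}_q + \log(n) \cdot q \cdot \widehat{\text{Var}}(e),$$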

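• Risk Inflation Criterion (RIC), whose penalty depends on the total number d of candidate predictors

$$\text{RSS}^{\text{RIC}}_q = \text{RSS}_q + 2 \cdot \log(d) \cdot q \cdot \widehat{\text{Var}}(e),$$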

• (minimum) Mallows' $C_p$

$$\text{RSS}^{C_p}_q = \text{RSS}_q + 2 \cdot q \cdot \widehat{\text{Var}}(e).$$
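To illustrate how the choice of λ in Equation 4.10 affects the selected model size, the sketch below compares two of the penalties on made-up RSS values for nested models. The function name `penalized_rss` and all numbers are hypothetical; other penalties can be added to the `penalties` dictionary in the same way.

```python
import math


def penalized_rss(rss_q, q, var_hat, lam):
    """Equation 4.10: RSS_q + lam * q * Var_hat(e)."""
    return rss_q + lam * q * var_hat


# Hypothetical RSS values for nested candidate models of size q = 1..4,
# fitted to n = 50 observations with d = 4 available predictors.
n, d = 50, 4
rss_by_size = {1: 120.0, 2: 80.0, 3: 75.0, 4: 74.7}
var_hat = rss_by_size[d] / (n - d - 1)      # residual variance of the full model

penalties = {"AIC / Cp": 2.0, "BIC": math.log(n)}
best = {
    name: min(rss_by_size, key=lambda q: penalized_rss(rss_by_size[q], q, var_hat, lam))
    for name, lam in penalties.items()
}
```

On these numbers the penalty λ = 2 selects the model of size 3, whereas the larger penalty λ = log(n) selects the model of size 2, in line with the statement that larger values of λ favor smaller models.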

Variable selection using penalized RSS is widespread, but it has two drawbacks. First, the fixed choice of λ has a strong impact on the size of the selected model: large values of λ favor a small model and vice versa. In contrast to penalized regression, as discussed in Section 4.2, the parameter λ is fixed and there is no intrinsic adaptation to the data under analysis. Second, penalized RSS is relatively sensitive to small changes in the data (George, 2000).

In document A Multivariate Framework for Variable Selection and Identification of Biomarkers in High-Dimensional Omics Data (Page 54-59)