STAT 340: Applied Regression Methods


Lecture Notes 9:

Two views on least squares regression

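Recall:

• We can write the RSS in matrix notation as:

$$
\begin{aligned}
\mathrm{RSS} &= (y - X\hat{\beta})^T (y - X\hat{\beta}) \\
&= y^T y - \hat{\beta}^T X^T y - y^T X \hat{\beta} + \hat{\beta}^T X^T X \hat{\beta} \\
&= y^T y - 2\hat{\beta}^T X^T y + \hat{\beta}^T X^T X \hat{\beta}
\end{aligned}
$$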


• To find the minimum, we take the derivative:

$$
\frac{d\,\mathrm{RSS}}{d\hat{\beta}} = -2X^T y + 2X^T X \hat{\beta} = 0
$$

• If $X^T X$ is invertible, then:

$$
\hat{\beta} = (X^T X)^{-1} X^T y
$$
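As a rough numerical check of the closed form above, the following R sketch solves the normal equations on simulated data and compares the result with `lm()`. The simulated data, the seed, and all variable names (`X`, `y`, `beta_hat`, `gradient`, `beta_lm`) are illustrative assumptions and not part of the lecture.

```r
# Minimal sketch: solve the normal equations on simulated data.
set.seed(340)                          # arbitrary seed, for reproducibility only
n <- 50
X <- cbind(1, rnorm(n), rnorm(n))      # hypothetical design: intercept + 2 predictors
y <- drop(X %*% c(1, 2, -1) + rnorm(n))

# beta_hat = (X'X)^{-1} X'y, obtained by solving X'X b = X'y
beta_hat <- drop(solve(crossprod(X), crossprod(X, y)))

# The derivative -2 X'y + 2 X'X beta_hat should be (numerically) zero
gradient <- drop(-2 * crossprod(X, y) + 2 * crossprod(X) %*% beta_hat)

# Built-in least squares fit; "- 1" because X already holds the intercept column
beta_lm <- unname(coef(lm(y ~ X - 1)))
```

Solving the linear system with `solve(crossprod(X), crossprod(X, y))` avoids forming the inverse explicitly, which is usually preferable numerically.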


• The predicted outcome given $X$ is:

$$
\hat{y} = X\hat{\beta} = X (X^T X)^{-1} X^T y = My
$$

where $M = X (X^T X)^{-1} X^T$ is the orthogonal projection operator onto the column space $C(X)$.

• If the rank of $X$ is two, then $\hat{y}$ is the orthogonal projection of $y$ onto the plane spanned by two linearly independent columns of $X$.
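The projection properties can be checked numerically. The sketch below builds $M$ for a simulated rank-two $X$ and verifies that it is symmetric and idempotent, and that the residual is orthogonal to the columns of $X$. The data and the names `M`, `y_hat`, `res`, `sym_err`, `idem_err`, `orth` are assumptions made for illustration.

```r
# Minimal sketch: M = X (X'X)^{-1} X' as an orthogonal projection.
set.seed(1)                            # arbitrary seed
n <- 20
X <- cbind(rnorm(n), rnorm(n))         # hypothetical n x 2 matrix of rank 2
y <- rnorm(n)

M     <- X %*% solve(crossprod(X)) %*% t(X)
y_hat <- drop(M %*% y)                 # projection of y onto C(X)
res   <- y - y_hat                     # residual vector

sym_err  <- max(abs(M - t(M)))         # symmetry: M = M'
idem_err <- max(abs(M %*% M - M))      # idempotence: MM = M
orth     <- drop(crossprod(X, res))    # X'(y - y_hat), should be ~ 0
trace_M  <- sum(diag(M))               # trace equals the rank of X (here 2)
```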


[Figure: the vectors x1, x2, y and ŷ.]

FIGURE 3.2. The N-dimensional geometry of least squares regression with two predictors. The outcome vector $y$ is orthogonally projected onto the hyperplane spanned by the input vectors $x_1$ and $x_2$. The projection $\hat{y}$ represents the vector of the least squares predictions.

The predicted values at an input vector $x_0$ are given by $\hat{f}(x_0) = (1 : x_0)^T \hat{\beta}$; the fitted values at the training inputs are

$$
\hat{y} = X\hat{\beta} = X (X^T X)^{-1} X^T y, \qquad (3.7)
$$

where $\hat{y}_i = \hat{f}(x_i)$. The matrix $H = X (X^T X)^{-1} X^T$ appearing in equation (3.7) is sometimes called the “hat” matrix because it puts the hat on $y$.

Figure 3.2 shows a different geometrical representation of the least squares estimate, this time in $\mathbb{R}^N$. We denote the column vectors of $X$ by $x_0, x_1, \ldots, x_p$, with $x_0 \equiv 1$. For much of what follows, this first column is treated like any other. These vectors span a subspace of $\mathbb{R}^N$, also referred to as the column space of $X$. We minimize $\mathrm{RSS}(\beta) = \|y - X\beta\|^2$ by choosing $\hat{\beta}$ so that the residual vector $y - \hat{y}$ is orthogonal to this subspace. This orthogonality is expressed in (3.5), and the resulting estimate $\hat{y}$ is hence the orthogonal projection of $y$ onto this subspace. The hat matrix $H$ computes the orthogonal projection, and hence it is also known as a projection matrix.

It might happen that the columns of $X$ are not linearly independent, so that $X$ is not of full rank. This would occur, for example, if two of the inputs were perfectly correlated (e.g., $x_2 = 3x_1$). Then $X^T X$ is singular and the least squares coefficients $\hat{\beta}$ are not uniquely defined. However, the fitted values $\hat{y} = X\hat{\beta}$ are still the projection of $y$ onto the column space of $X$; there is just more than one way to express that projection in terms of the column vectors of $X$. The non-full-rank case occurs most often when one or more qualitative inputs are coded in a redundant fashion. There is usually a natural way to resolve the non-unique representation, by recoding and/or dropping redundant columns in $X$. Most regression software packages detect these redundancies and automatically implement


Ridge Regression

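• Recall, for Ridge Regression we aim to minimize

$$
\begin{aligned}
\mathrm{RSS}(\lambda) &= (y - X\hat{\beta})^T (y - X\hat{\beta}) + \lambda \hat{\beta}^T \hat{\beta} \\
&= y^T y - 2\hat{\beta}^T X^T y + \hat{\beta}^T X^T X \hat{\beta} + \lambda \hat{\beta}^T \hat{\beta}
\end{aligned}
$$

For simplicity, assume here that the columns of $X$ have all been centered and that we exclude the column corresponding to the intercept.

• To find the minimum, we take the derivative:

$$
\frac{d\,\mathrm{RSS}(\lambda)}{d\hat{\beta}} = -2X^T y + 2X^T X \hat{\beta} + 2\lambda\hat{\beta} = 0
$$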


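• This yields the Ridge Estimator:

$$
\hat{\beta}^{\mathrm{ridge}} = (X^T X + \lambda I)^{-1} X^T y
$$

• Note that the problem is now nonsingular: for $\lambda > 0$ the matrix $X^T X + \lambda I$ is invertible even when $X^T X$ is not, so we can always solve for $\hat{\beta}^{\mathrm{ridge}}$.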
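To illustrate the nonsingularity remark, the sketch below uses two perfectly collinear, centered predictors, so that $X^T X$ has rank one, and still obtains a ridge solution. The data, the choice $\lambda = 1$, and the variable names are assumptions made for this example.

```r
# Minimal sketch: ridge is solvable even when X'X is singular.
set.seed(2)                            # arbitrary seed
n  <- 30
x1 <- rnorm(n)
x1 <- x1 - mean(x1)                    # centered predictor
X  <- cbind(x1, 3 * x1)                # second column is exactly 3 * first
y  <- rnorm(n)
y  <- y - mean(y)                      # centered response, no intercept column

lambda   <- 1                          # hypothetical penalty value
XtX      <- crossprod(X)
rank_XtX <- qr(XtX)$rank               # numerical rank of X'X

# Ridge estimator: solve (X'X + lambda I) b = X'y
A_ridge    <- XtX + lambda * diag(ncol(X))
beta_ridge <- drop(solve(A_ridge, crossprod(X, y)))

# Residual of the ridge normal equations, should be ~ 0
eq_resid <- max(abs(A_ridge %*% beta_ridge - crossprod(X, y)))
```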


Lasso

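• Recall, in Lasso, we aim to minimize:

$$
\mathrm{RSS}(\lambda) = (y - X\hat{\beta})^T (y - X\hat{\beta}) + \lambda \|\hat{\beta}\|_1,
\qquad \|\hat{\beta}\|_1 = \sum_{j=1}^{p} |\hat{\beta}_j|
$$

• No closed-form solution in general → an alternative optimization strategy is needed.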
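The slides do not say which optimization strategy is meant. One common choice is cyclic coordinate descent with soft-thresholding, sketched below for the objective $\mathrm{RSS} + \lambda \sum_j |\beta_j|$. The function names `soft_threshold` and `lasso_cd`, the fixed iteration count, and the simulated data are all assumptions made for illustration, not the course's method.

```r
# Minimal sketch (an assumed strategy): cyclic coordinate descent for
# minimizing (y - Xb)'(y - Xb) + lambda * sum(abs(b)).
soft_threshold <- function(z, t) sign(z) * pmax(abs(z) - t, 0)

lasso_cd <- function(X, y, lambda, n_iter = 200) {
  p      <- ncol(X)
  beta   <- rep(0, p)
  col_ss <- colSums(X^2)               # x_j' x_j for each column
  for (iter in seq_len(n_iter)) {
    for (j in seq_len(p)) {
      # partial residual: remove the fit of all predictors except j
      r_j <- y - X[, -j, drop = FALSE] %*% beta[-j]
      # one-dimensional lasso solution for coordinate j
      beta[j] <- soft_threshold(sum(X[, j] * r_j), lambda / 2) / col_ss[j]
    }
  }
  beta
}

# Usage on simulated, centered data
set.seed(3)                            # arbitrary seed
n <- 100
X <- matrix(rnorm(n * 3), n, 3)
X <- sweep(X, 2, colMeans(X))
y <- drop(X %*% c(2, 0, -1) + rnorm(n))
y <- y - mean(y)

beta_ols   <- drop(solve(crossprod(X), crossprod(X, y)))
beta_l0    <- lasso_cd(X, y, lambda = 0)           # no penalty
lambda_max <- 2 * max(abs(crossprod(X, y)))
beta_big   <- lasso_cd(X, y, lambda = lambda_max)  # heavy penalty
```

With `lambda = 0` the iteration reduces to coordinate-wise least squares and approaches the OLS solution; once `lambda` reaches `2 * max(abs(X'y))`, every coordinate is thresholded to zero.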


New Concept - SVD

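Theorem 4.1: Singular Value Decomposition (SVD). For an $n \times p$ matrix $A$ of rank $r$, there exist matrices $U_{n \times p}$ and $V_{p \times p}$ with orthonormal columns such that

$$
U'AV = \begin{pmatrix} \Delta & 0 \\ 0 & 0 \end{pmatrix}_{p \times p}
$$

where $'$ denotes the transpose, $\Delta = \mathrm{diag}(\delta_1, \ldots, \delta_r)$, and $\delta_1 \geq \ldots \geq \delta_r > 0$ are called the singular values of $A$. Or equivalently,

$$
A = U \begin{pmatrix} \Delta & 0 \\ 0 & 0 \end{pmatrix}_{p \times p} V'
$$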


Singular Value Decomposition (SVD)

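Recall, for an orthogonal matrix $X$, we have $X^{-1} = X'$, or equivalently, $X'X = XX' = I$. When $U$ and $V$ are both square and orthogonal (as in the $2 \times 2$ example below), we know $UU' = I$, $VV' = I$ and therefore, writing $D$ for the diagonal matrix of singular values,

$$
U'AV = D \;\Leftrightarrow\; UU'AVV' = UDV' \;\Leftrightarrow\; A = UDV'.
$$

Example

Consider the matrix $A = \begin{pmatrix} 1 & 1 \\ 1 & 1 \end{pmatrix}$. Note that $A$ is a $2 \times 2$ symmetric matrix of rank $r(A) = 1$. The SVD is obtained using the svd() function in R.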


# Create the matrix A
> A <- matrix(c(1,1,1,1),2)
> A
     [,1] [,2]
[1,]    1    1
[2,]    1    1

# Calculate the singular value decomposition of A
> svd(A)
$d
[1] 2 0

$u
           [,1]       [,2]
[1,] -0.7071068 -0.7071068
[2,] -0.7071068  0.7071068

$v
           [,1]       [,2]
[1,] -0.7071068 -0.7071068
[2,] -0.7071068  0.7071068


• We know from the SVD theorem that we can write $D = U'AV$. Since $A$ is of dimension $2 \times 2$ and $r(A) = 1$, we expect only one non-zero singular value, $\delta_1$, among the diagonal elements of $D$.

• We can also reconstruct $A$ from the components of the SVD as shown below.


# Checking that U'AV = Delta
> V <- svd(A)$v
> U <- svd(A)$u
> round(t(U) %*% A %*% V, 2)
     [,1] [,2]
[1,]    2    0
[2,]    0    0

# Generate A from the components of the SVD of A
> Delta <- diag(svd(A)$d)
> U %*% Delta %*% t(V)
     [,1] [,2]
[1,]    1    1
[2,]    1    1


Recall, matrices are spatial operators that rotate, stretch and shrink vectors. The SVD tells us how to break up these operations of a matrix into three steps. Let $A = UDV'$ act upon the vector $x = \begin{pmatrix} 2 \\ 0 \end{pmatrix}$. It is easily seen that $Ax = \begin{pmatrix} 2 \\ 2 \end{pmatrix}$.


[Figure: the vectors x, V'x, DV'x and UDV'x = Ax, together with the span S(v1), plotted on axes running from −3 to 3.]


• First we have

$$
x^{*} = V'x =
\begin{pmatrix} -0.707 & -0.707 \\ -0.707 & 0.707 \end{pmatrix}
\begin{pmatrix} 2 \\ 0 \end{pmatrix}
=
\begin{pmatrix} -1.414 \\ -1.414 \end{pmatrix}
$$

• Note $V'x \in S(v_1)$ where $v_1$ is an eigenvector of $A'A$. That is, pre-multiplying by the matrix $V'$ rotates the vector $x$ to be in the span of the first eigenvector of $A'A$.


• Since $D$ is a diagonal matrix, pre-multiplying by $D$ has the effect of stretching or shrinking the corresponding axes. In this example the first diagonal element equals 2 and the second equals 0, and so the effect is to stretch the x-axis by 2 and reduce the y-axis to 0:

$$
x^{**} = Dx^{*} =
\begin{pmatrix} 2 & 0 \\ 0 & 0 \end{pmatrix}
\begin{pmatrix} -1.414 \\ -1.414 \end{pmatrix}
=
\begin{pmatrix} -2.828 \\ 0 \end{pmatrix}
$$


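• Note that 2 and 0 are the square roots of the eigenvalues of $A'A$.

• Finally, pre-multiplying by $U$ gives

$$
Ux^{**} =
\begin{pmatrix} -0.707 & -0.707 \\ -0.707 & 0.707 \end{pmatrix}
\begin{pmatrix} -2.828 \\ 0 \end{pmatrix}
=
\begin{pmatrix} 2 \\ 2 \end{pmatrix}
= Ax
$$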
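The three steps can be reproduced in R, continuing the example matrix used in these notes. This is only a sketch: the intermediate names `x_star`, `x_star2` and `Ax_svd` are introduced here for illustration, and the signs of the singular vectors returned by `svd()` may differ between platforms, although the final product does not.

```r
# Minimal sketch: apply A = U D V' to x in three separate steps.
A <- matrix(c(1, 1, 1, 1), 2)
x <- c(2, 0)

s <- svd(A)
U <- s$u
D <- diag(s$d)
V <- s$v

x_star  <- t(V) %*% x       # step 1: change of coordinates, V'x
x_star2 <- D %*% x_star     # step 2: stretch/shrink the axes, D V'x
Ax_svd  <- U %*% x_star2    # step 3: map back, U D V'x

Ax <- A %*% x               # direct multiplication, for comparison
```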


• Also of note, the first $r$ columns of $U$ and $V$ are the eigenvectors corresponding to the nonzero eigenvalues of $AA'$ and $A'A$ respectively.

• Also, for $r(A) = r$, the values $\delta_1^2, \ldots, \delta_r^2$ are the nonzero eigenvalues of $A'A$.

• Finally, it can be shown that

$$
A = UDV' = U_1 \Delta V_1'
$$

where $U_1$ is $n \times r$ and $V_1$ is $p \times r$, consisting of the first $r$ orthonormal columns of $U$ and $V$.
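These relations can be checked on a small rank-deficient matrix. In the sketch below, the simulated $5 \times 3$ matrix of rank 2, the rank tolerance `1e-10`, and the names `U1`, `V1`, `Delta`, `A_reduced` are assumptions chosen for illustration.

```r
# Minimal sketch: squared singular values vs. eigenvalues of A'A,
# and the reduced form A = U1 Delta V1'.
set.seed(4)                                   # arbitrary seed
A <- matrix(rnorm(5 * 2), 5, 2) %*% matrix(rnorm(2 * 3), 2, 3)  # 5 x 3, rank 2

s <- svd(A)
r <- sum(s$d > 1e-10)                         # numerical rank (assumed tolerance)

# Eigenvalues of A'A, in decreasing order
eig <- eigen(crossprod(A), symmetric = TRUE)$values

# Keep only the first r singular vectors and singular values
U1    <- s$u[, 1:r, drop = FALSE]             # n x r
V1    <- s$v[, 1:r, drop = FALSE]             # p x r
Delta <- diag(s$d[1:r], nrow = r)             # r x r
A_reduced <- U1 %*% Delta %*% t(V1)
```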


LS, Ridge and SVD

Using the SVD, we can write $X = UDV^T$. In turn, we have:

• The least squares estimator:

$$
X\hat{\beta} = X (X^T X)^{-1} X^T y = UU^T y = \sum_{j=1}^{p} u_j u_j^T y.
$$

• The Ridge estimator:

$$
X\hat{\beta}^{\mathrm{ridge}} = X (X^T X + \lambda I)^{-1} X^T y = UD (D^2 + \lambda I)^{-1} D U^T y = \sum_{j=1}^{p} u_j \frac{\delta_j^2}{\delta_j^2 + \lambda} u_j^T y.
$$
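Both identities can be verified numerically. The sketch below compares the fitted values computed directly with those computed from the SVD; the simulated full-rank data, the value $\lambda = 2$, and the variable names are assumptions made for this example.

```r
# Minimal sketch: least squares and ridge fitted values via the SVD of X.
set.seed(5)                                   # arbitrary seed
n <- 40
p <- 3
X <- matrix(rnorm(n * p), n, p)
X <- sweep(X, 2, colMeans(X))                 # center the columns
y <- rnorm(n)
y <- y - mean(y)
lambda <- 2                                   # hypothetical penalty value

s <- svd(X)
U <- s$u                                      # n x p, orthonormal columns
d <- s$d                                      # singular values delta_j

# Least squares: direct formula vs. U U'y
fit_ls_direct <- drop(X %*% solve(crossprod(X), crossprod(X, y)))
fit_ls_svd    <- drop(U %*% crossprod(U, y))

# Ridge: direct formula vs. sum_j u_j * delta_j^2 / (delta_j^2 + lambda) * u_j'y
fit_ridge_direct <- drop(X %*% solve(crossprod(X) + lambda * diag(p), crossprod(X, y)))
shrink           <- d^2 / (d^2 + lambda)
fit_ridge_svd    <- drop(U %*% (shrink * crossprod(U, y)))
```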


• Similar to LS, Ridge computes the coordinates of $y$ with respect to the orthonormal basis $U$.

• However, for Ridge, we additionally shrink these coordinates by the factor $\delta_j^2 / (\delta_j^2 + \lambda)$.

• The degree of shrinkage increases for smaller values of $\delta_j^2$.

• Recall, the $\delta_j^2$'s are the eigenvalues of $X^T X$.
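To make the shrinkage factor concrete, here is a small worked example with made-up numbers. The values $\delta_1^2 = 9$, $\delta_2^2 = 1$ and $\lambda = 1$ are assumptions chosen only for illustration.

```latex
\frac{\delta_1^2}{\delta_1^2 + \lambda} = \frac{9}{9 + 1} = 0.9,
\qquad
\frac{\delta_2^2}{\delta_2^2 + \lambda} = \frac{1}{1 + 1} = 0.5
```

The coordinate of $y$ along $u_1$ keeps 90% of its least squares value, while the coordinate along $u_2$ keeps only 50%, so the direction with the smaller $\delta_j^2$ is shrunk more.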


PCA

[Figure: scatter plot of input data points on axes X1 and X2 (both running from −4 to 4), with the directions labelled Largest Principal Component and Smallest Principal Component.]

FIGURE 3.9. Principal components of some input data points. The largest principal component is the direction that maximizes the variance of the projected data, and the smallest principal component minimizes that variance. Ridge regression projects $y$ onto these components, and then shrinks the coefficients of the low-variance components more than the high-variance components.

[...] component. Subsequent principal components $z_j$ have maximum variance $d_j^2/N$, subject to being orthogonal to the earlier ones. Conversely the last principal component has minimum variance. Hence the small singular values $d_j$ correspond to directions in the column space of $X$ having small variance, and ridge regression shrinks these directions the most.

Figure 3.9 illustrates the principal components of some data points in two dimensions. If we consider fitting a linear surface over this domain (the $Y$-axis is sticking out of the page), the configuration of the data allows us to determine its gradient more accurately in the long direction than the short. Ridge regression protects against the potentially high variance of gradients estimated in the short directions. The implicit assumption is that the response will tend to vary most in the directions of high variance of the inputs. This is often a reasonable assumption, since predictors are often chosen for study because they vary with the response variable, but need not hold in general.


PCA and SVD

• The first PC direction of $X$ is given by $v_1$, the first column of the $V$ matrix.

• Note that if we let $z_1 = Xv_1$, then $z_1$ has the largest sample variance among all (normalized) linear combinations of the columns of $X$. With the columns of $X$ centered:

$$
\begin{aligned}
\mathrm{Var}(z_1) &= \mathrm{Var}(Xv_1) \\
&= v_1^T \, \mathrm{Var}(X) \, v_1 \\
&= \frac{1}{N} v_1^T X^T X v_1 \\
&= \frac{1}{N} v_1^T V D^2 V^T v_1 \\
&= \frac{d_1^2}{N}.
\end{aligned}
$$
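The variance identity can be checked on simulated data. The sketch below uses the divisor $N$ (not R's default $N - 1$) to match the formula, and compares $v_1$ with the first loading vector from `prcomp()` up to sign. The data and the names `var_z1`, `var_a`, `pc1` are assumptions made for illustration.

```r
# Minimal sketch: variance of the first principal component equals d_1^2 / N.
set.seed(6)                                   # arbitrary seed
N  <- 200
x1 <- rnorm(N)
x2 <- 0.8 * x1 + rnorm(N, sd = 0.5)           # hypothetical correlated predictors
X  <- cbind(x1, x2)
X  <- sweep(X, 2, colMeans(X))                # center the columns

s  <- svd(X)
v1 <- s$v[, 1]                                # first PC direction
z1 <- drop(X %*% v1)                          # first principal component

var_z1 <- sum(z1^2) / N                       # sample variance with divisor N
var_d1 <- s$d[1]^2 / N                        # d_1^2 / N

# Any other unit-length direction gives a variance that is no larger
a     <- c(1, 0)
var_a <- sum((X %*% a)^2) / N

# First loading vector from prcomp(), which may differ from v1 by a sign
pc1 <- unname(prcomp(X)$rotation[, 1])
```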


PCA

Linear, Ridge Regression, and Principal Component Analysis


PCA and Ridge

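• It can be shown that $z_j = Xv_j = u_j \delta_j$.

• Ridge shrinks all components: the coordinate of $y$ along each $u_j$ is multiplied by $\delta_j^2 / (\delta_j^2 + \lambda)$, which is less than 1 whenever $\lambda > 0$.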
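The identity $z_j = Xv_j = u_j \delta_j$ follows from $X = UDV^T$ and $V^T V = I$, and can be confirmed numerically. The sketch below uses simulated centered data; the matrix names `Z` and `UD` are assumptions introduced for this example.

```r
# Minimal sketch: principal components X v_j equal the scaled columns u_j * delta_j.
set.seed(7)                                   # arbitrary seed
X <- matrix(rnorm(30 * 3), 30, 3)
X <- sweep(X, 2, colMeans(X))                 # center the columns

s  <- svd(X)
Z  <- X %*% s$v                               # columns are z_j = X v_j
UD <- s$u %*% diag(s$d)                       # columns are u_j * delta_j

# Squared length of each z_j equals delta_j^2
z_ss <- colSums(Z^2)
```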


Linear, Ridge Regression, and Principal Component Analysis

The geometric interpretation of principal components and shrinkage by ridge regression.