STAT 340: Applied Regression Methods

Academic year: 2021

Lecture Notes 9:

More on Least Squares, Ridge and PCA

Two views on least squares regression

FIGURE 3.1. Linear least squares fitting with $X \in \mathbb{R}^2$. We seek the linear function of $X$ that minimizes the sum of squared residuals from $Y$. (Axes in the figure: X1, X2, Y.)

space occupied by the pairs (X, Y). Note that (3.2) makes no assumptions about the validity of model (3.1); it simply finds the best linear fit to the data. Least squares fitting is intuitively satisfying no matter how the data arise; the criterion measures the average lack of fit.

How do we minimize (3.2)? Denote by $X$ the $N \times (p+1)$ matrix with each row an input vector (with a 1 in the first position), and similarly let $y$ be the $N$-vector of outputs in the training set. Then we can write the residual sum-of-squares as

$$\mathrm{RSS}(\beta) = (y - X\beta)^T (y - X\beta). \tag{3.3}$$

This is a quadratic function in the $p+1$ parameters. Differentiating with respect to $\beta$ we obtain

$$\frac{\partial \mathrm{RSS}}{\partial \beta} = -2X^T(y - X\beta), \qquad \frac{\partial^2 \mathrm{RSS}}{\partial \beta\, \partial \beta^T} = 2X^TX. \tag{3.4}$$

Assuming (for the moment) that $X$ has full column rank, and hence $X^TX$ is positive definite, we set the first derivative to zero

$$X^T(y - X\beta) = 0 \tag{3.5}$$

to obtain the unique solution $\hat{\beta}$.

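Recall: We can write the RSS in matrix notation as:

$$
\begin{aligned}
\mathrm{RSS} &= (y - X\hat{\beta})^T (y - X\hat{\beta}) \\
&= y^T y - \hat{\beta}^T X^T y - y^T X \hat{\beta} + \hat{\beta}^T X^T X \hat{\beta} \\
&= y^T y - 2\hat{\beta}^T X^T y + \hat{\beta}^T X^T X \hat{\beta}
\end{aligned}
$$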

To find the minimum, we take the derivative:

$$\frac{d\,\mathrm{RSS}}{d\hat{\beta}} = -2X^T y + 2X^T X \hat{\beta} = 0$$

If $X^TX$ is invertible, then:

$$\hat{\beta} = (X^T X)^{-1} X^T y$$
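As a rough numerical check of the closed form $\hat{\beta} = (X^TX)^{-1}X^Ty$, the following R sketch solves the normal equations directly and compares the result with `lm()`. The simulated data and all variable names are assumptions made for illustration; they do not come from the lecture.

```r
# Minimal sketch: least squares via the normal equations.
# The data are simulated (hypothetical), not taken from the lecture.
set.seed(340)
N  <- 50
x1 <- rnorm(N)
x2 <- rnorm(N)
y  <- 1 + 2 * x1 - 0.5 * x2 + rnorm(N, sd = 0.3)

# Design matrix with a leading column of ones for the intercept
X <- cbind(1, x1, x2)

# Solve X'X beta = X'y instead of forming the inverse explicitly
beta_hat <- solve(crossprod(X), crossprod(X, y))

# Compare with R's built-in least squares fit
fit      <- lm(y ~ x1 + x2)
max_diff <- max(abs(as.vector(beta_hat) - unname(coef(fit))))

# Gradient of the RSS at beta_hat: -2 X'y + 2 X'X beta_hat (should be ~0)
grad <- -2 * crossprod(X, y) + 2 * crossprod(X) %*% beta_hat
```

Using `solve(A, b)` rather than `solve(A) %*% b` is a common numerical habit; both implement the same formula when $X^TX$ is invertible.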

The predicted outcome $y$ given $X$ is:

$$\hat{y} = X\hat{\beta} = X(X^TX)^{-1}X^T y = My$$

where $M$ is the orthogonal projection operator onto $C(X)$, the column space of $X$.

If the rank of $X$ is two, then $\hat{y}$ is the orthogonal projection of $y$ onto the plane spanned by the two linearly independent columns of $X$.
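The projection properties of $M$ can be checked numerically. The sketch below, on simulated data chosen only for illustration, builds $M = X(X^TX)^{-1}X^T$ for a rank-two $X$ and checks that it is symmetric, idempotent, and leaves a residual orthogonal to the columns of $X$.

```r
# Minimal sketch: M = X (X'X)^{-1} X' as an orthogonal projection.
# Simulated (hypothetical) data with two linearly independent columns.
set.seed(1)
N <- 20
X <- cbind(rnorm(N), rnorm(N))
y <- rnorm(N)

M     <- X %*% solve(crossprod(X)) %*% t(X)
y_hat <- M %*% y

sym_err  <- max(abs(M - t(M)))                  # M is symmetric
idem_err <- max(abs(M %*% M - M))               # projecting twice changes nothing
orth_err <- max(abs(crossprod(X, y - y_hat)))   # residual is orthogonal to C(X)
trace_M  <- sum(diag(M))                        # trace equals the rank of X
```

The trace of a projection matrix equals the dimension of the space it projects onto, so `trace_M` should be 2 up to rounding error.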

FIGURE 3.2. The $N$-dimensional geometry of least squares regression with two predictors. The outcome vector $y$ is orthogonally projected onto the hyperplane spanned by the input vectors $x_1$ and $x_2$. The projection $\hat{y}$ represents the vector of the least squares predictions. (Labels in the figure: x1, x2, y, ŷ.)

The predicted values at an input vector $x_0$ are given by $\hat{f}(x_0) = (1 : x_0)^T\hat{\beta}$; the fitted values at the training inputs are

$$\hat{y} = X\hat{\beta} = X(X^TX)^{-1}X^T y, \tag{3.7}$$

where $\hat{y}_i = \hat{f}(x_i)$. The matrix $H = X(X^TX)^{-1}X^T$ appearing in equation (3.7) is sometimes called the “hat” matrix because it puts the hat on $y$.

Figure 3.2 shows a different geometrical representation of the least squares estimate, this time in $\mathbb{R}^N$. We denote the column vectors of $X$ by $x_0, x_1, \ldots, x_p$, with $x_0 \equiv 1$. For much of what follows, this first column is treated like any other. These vectors span a subspace of $\mathbb{R}^N$, also referred to as the column space of $X$. We minimize $\mathrm{RSS}(\beta) = \|y - X\beta\|^2$ by choosing $\hat{\beta}$ so that the residual vector $y - \hat{y}$ is orthogonal to this subspace. This orthogonality is expressed in (3.5), and the resulting estimate $\hat{y}$ is hence the orthogonal projection of $y$ onto this subspace. The hat matrix $H$ computes the orthogonal projection, and hence it is also known as a projection matrix.

It might happen that the columns of $X$ are not linearly independent, so that $X$ is not of full rank. This would occur, for example, if two of the inputs were perfectly correlated (e.g., $x_2 = 3x_1$). Then $X^TX$ is singular and the least squares coefficients $\hat{\beta}$ are not uniquely defined. However, the fitted values $\hat{y} = X\hat{\beta}$ are still the projection of $y$ onto the column space of $X$; there is just more than one way to express that projection in terms of the column vectors of $X$. The non-full-rank case occurs most often when one or more qualitative inputs are coded in a redundant fashion. There is usually a natural way to resolve the non-unique representation, by recoding and/or dropping redundant columns in $X$. Most regression software packages detect these redundancies and automatically implement
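The perfectly correlated case $x_2 = 3x_1$ mentioned in the excerpt can be reproduced in R. This is only a sketch on simulated data (all names are hypothetical): `lm()` reports `NA` for the redundant coefficient, while the fitted values agree with those from the model that drops the redundant column.

```r
# Minimal sketch: a rank-deficient design with x2 = 3 * x1.
# Simulated (hypothetical) data.
set.seed(2)
N  <- 30
x1 <- rnorm(N)
x2 <- 3 * x1                        # perfectly correlated with x1
y  <- 1 + x1 + rnorm(N, sd = 0.1)

X      <- cbind(1, x1, x2)
rank_X <- qr(X)$rank                # numerical rank of the 3-column design

fit_full    <- lm(y ~ x1 + x2)      # coefficient of x2 is reported as NA
fit_reduced <- lm(y ~ x1)           # redundant column dropped by hand

# The projection onto the column space is the same in both fits
fitted_diff <- max(abs(fitted(fit_full) - fitted(fit_reduced)))
```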

Ridge Regression

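Recall, for Ridge Regression we aim to minimize

$$
\begin{aligned}
\mathrm{RSS}(\lambda) &= (y - X\hat{\beta})^T (y - X\hat{\beta}) + \lambda \hat{\beta}^T \hat{\beta} \\
&= y^T y - 2\hat{\beta}^T X^T y + \hat{\beta}^T X^T X \hat{\beta} + \lambda \hat{\beta}^T \hat{\beta}
\end{aligned}
$$

For simplicity here assume the columns of $X$ have all been centered, and we exclude the column corresponding to the intercept.

To find the minimum, we take the derivative:

$$\frac{d\,\mathrm{RSS}(\lambda)}{d\hat{\beta}} = -2X^T y + 2X^T X \hat{\beta} + 2\lambda\hat{\beta} = 0$$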

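This yields the Ridge Estimator:

$$\hat{\beta}_{ridge} = (X^T X + \lambda I)^{-1} X^T y$$

Note that now the problem is nonsingular for any $\lambda > 0$, i.e. we can always solve for $\hat{\beta}_{ridge}$, even when $X^TX$ itself is not invertible.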
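To illustrate why the ridge problem is nonsingular, the sketch below applies the ridge formula to a design with two perfectly collinear columns, where $X^TX$ has rank one. The helper name `ridge_coef` and the simulated data are assumptions made for this example only.

```r
# Minimal sketch: the ridge estimator (X'X + lambda I)^{-1} X'y.
# ridge_coef is a hypothetical helper; the data are simulated.
ridge_coef <- function(X, y, lambda) {
  solve(crossprod(X) + lambda * diag(ncol(X)), crossprod(X, y))
}

set.seed(3)
N  <- 40
x1 <- rnorm(N)
x2 <- 3 * x1                                   # perfectly collinear with x1
X  <- scale(cbind(x1, x2), center = TRUE, scale = FALSE)
y  <- x1 + rnorm(N, sd = 0.2)
y  <- y - mean(y)                              # centered response, no intercept

rank_XtX <- qr(crossprod(X))$rank              # X'X is singular here

b1   <- ridge_coef(X, y, lambda = 1)           # still well defined
b100 <- ridge_coef(X, y, lambda = 100)         # heavier shrinkage

# First-order condition: -2 X'y + 2 X'X b + 2 lambda b = 0
grad <- -2 * crossprod(X, y) + 2 * crossprod(X) %*% b1 + 2 * 1 * b1
```

Increasing $\lambda$ pulls the coefficient vector toward zero, so `b100` has a smaller squared length than `b1`.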
Lasso

Recall, in Lasso, we aim to minimize:

$$\mathrm{RSS}(\lambda) = (y - X\hat{\beta})^T (y - X\hat{\beta}) + \lambda \sum_{j} |\hat{\beta}_j|$$

There is no closed-form solution in general, so an alternative optimization strategy is needed.
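One commonly used alternative strategy is cyclic coordinate descent with soft-thresholding. The lecture does not say which method it has in mind, so the R sketch below is only one possible choice; the function names `soft_threshold`, `lasso_cd` and `lasso_obj`, the fixed number of sweeps, and the simulated data are all assumptions made for illustration.

```r
# Minimal sketch: coordinate descent for
#   (y - X b)'(y - X b) + lambda * sum(|b_j|).
# Hypothetical helper names; simulated data; not the lecture's own algorithm.
soft_threshold <- function(z, t) sign(z) * pmax(abs(z) - t, 0)

lasso_cd <- function(X, y, lambda, n_iter = 500) {
  p      <- ncol(X)
  beta   <- rep(0, p)
  col_ss <- colSums(X^2)
  for (it in seq_len(n_iter)) {
    for (j in seq_len(p)) {
      # Partial residual with the j-th predictor left out
      r_j <- y - X[, -j, drop = FALSE] %*% beta[-j]
      # Exact minimizer over beta_j with the other coordinates held fixed
      beta[j] <- soft_threshold(sum(X[, j] * r_j), lambda / 2) / col_ss[j]
    }
  }
  beta
}

lasso_obj <- function(X, y, beta, lambda) {
  sum((y - X %*% beta)^2) + lambda * sum(abs(beta))
}

set.seed(4)
N <- 60
X <- scale(matrix(rnorm(N * 3), N, 3), center = TRUE, scale = FALSE)
y <- as.vector(X %*% c(2, 0, -1) + rnorm(N, sd = 0.5))
y <- y - mean(y)

b_ols  <- as.vector(solve(crossprod(X), crossprod(X, y)))
b_zero <- lasso_cd(X, y, lambda = 0)                     # no penalty
lam_max <- 2 * max(abs(crossprod(X, y)))
b_max  <- lasso_cd(X, y, lambda = lam_max)               # penalty large enough to zero everything
b_mid  <- lasso_cd(X, y, lambda = 20)                    # an intermediate penalty
```

With `lambda = 0` the updates reduce to ordinary least squares, and with `lambda` at least `2 * max(abs(X'y))` every coefficient stays at exactly zero, which gives two simple sanity checks on the sketch.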

New Concept - SVD

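Theorem 4.1: Singular Value Decomposition (SVD). For an $n \times p$ matrix $A$ of rank $r$, there exist matrices $U_{n \times p}$ and $V_{p \times p}$ with orthonormal columns (orthogonal matrices when square) such that

$$U'AV = \begin{pmatrix} \Delta & 0 \\ 0 & 0 \end{pmatrix}_{p \times p}$$

where $\Delta = \mathrm{diag}(\delta_1, \ldots, \delta_r)$ and $\delta_1 \geq \ldots \geq \delta_r > 0$ are called the singular values of $A$. Or equivalently,

$$A = U \begin{pmatrix} \Delta & 0 \\ 0 & 0 \end{pmatrix}_{p \times p} V'$$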
Singular Value Decomposition (SVD)

Recall, for an orthogonal matrix $X$, we have $X^{-1} = X'$ or equivalently, $X'X = XX' = I$. Since $U$ and $V$ are orthogonal, we know $UU' = I$, $VV' = I$ and therefore

$$U'AV = D \;\Rightarrow\; UU'AVV' = UDV' \;\Rightarrow\; A = UDV'.$$

Example

Consider the matrix $A = \begin{pmatrix} 1 & 1 \\ 1 & 1 \end{pmatrix}$. Note that $A$ is a $2 \times 2$ symmetric matrix of rank $r(A) = 1$. The SVD is obtained using the svd() function in R.

# Create the matrix A
> A <- matrix(c(1,1,1,1),2)
> A
     [,1] [,2]
[1,]    1    1
[2,]    1    1

# Calculate the singular value decomposition of A
> svd(A)
$d
[1] 2 0

$u
           [,1]       [,2]
[1,] -0.7071068 -0.7071068
[2,] -0.7071068  0.7071068

$v
           [,1]       [,2]
[1,] -0.7071068 -0.7071068
[2,] -0.7071068  0.7071068

We know from the SVD theorem that we can write $D = U'AV$.

Since $A$ is of dimension $2 \times 2$ and $r(A) = 1$, we expect only one non-zero singular value, $\delta_1$, in the diagonal elements of $D$.

We can also reconstruct $A$ from the components of the SVD as shown below.

# Checking that U'AV = Delta
> V <- svd(A)$v
> U <- svd(A)$u
> round(t(U) %*% A %*% V, 2)
     [,1] [,2]
[1,]    2    0
[2,]    0    0

# Generate A from the components of the SVD of A
> Delta <- diag(svd(A)$d)
> U %*% Delta %*% t(V)
     [,1] [,2]
[1,]    1    1
[2,]    1    1

Recall, matrices are spatial operators that rotate, stretch and shrink vectors. The SVD tells us how to break up these operations of a matrix into three steps. Let $A = UDV'$ act upon the vector $x = \begin{pmatrix} 2 \\ 0 \end{pmatrix}$. It is easily seen that

$$Ax = \begin{pmatrix} 2 \\ 2 \end{pmatrix}$$

[Figure: the vectors x, V'x, DV'x and UDV'x = Ax, together with the line S(v1), plotted on axes running from -3 to 3.]

First we have

$$x^{*} = V'x = \begin{pmatrix} -0.707 & -0.707 \\ -0.707 & 0.707 \end{pmatrix} \begin{pmatrix} 2 \\ 0 \end{pmatrix} = \begin{pmatrix} -1.414 \\ -1.414 \end{pmatrix}$$

Note $V'x \in S(v_1)$ where $v_1$ is an eigenvector of $A'A$. That is, pre-multiplying by the matrix $V'$ rotates the vector $x$ to be in the span of the first eigenvector of $A'A$.

Since $D$ is a diagonal matrix, pre-multiplying by $D$ has the effect of stretching or shrinking the corresponding axes. In this example the first diagonal element equals 2 and the second equals 0, and so the effect is to stretch the x-axis by 2 and reduce the y-axis to 0:

$$x^{**} = Dx^{*} = \begin{pmatrix} 2 & 0 \\ 0 & 0 \end{pmatrix} \begin{pmatrix} -1.414 \\ -1.414 \end{pmatrix} = \begin{pmatrix} -2.828 \\ 0 \end{pmatrix}$$

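Note that 2 and 0 are the square roots of the eigenvalues of $A'A$.

Finally, pre-multiplying by $U$ gives

$$Ux^{**} = \begin{pmatrix} -0.707 & -0.707 \\ -0.707 & 0.707 \end{pmatrix} \begin{pmatrix} -2.828 \\ 0 \end{pmatrix} = \begin{pmatrix} 2 \\ 2 \end{pmatrix}$$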
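The three steps (apply $V'$, then $D$, then $U$) can be traced in R for the same $A$ and $x$. This is a small sketch with variable names chosen here for illustration. The signs of singular vectors are not unique, so the intermediate vectors may differ in sign between systems, but the final result always equals $Ax$.

```r
# Minimal sketch: applying A = U D V' to x in three steps.
A <- matrix(c(1, 1, 1, 1), 2)
x <- c(2, 0)
s <- svd(A)

x_star  <- t(s$v) %*% x           # step 1: V'x
x_2star <- diag(s$d) %*% x_star   # step 2: D V'x (stretch first axis, zero the second)
Ax_svd  <- s$u %*% x_2star        # step 3: U D V'x

Ax <- A %*% x                     # direct multiplication for comparison
```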

Also of note, the first $r$ columns of $U$ and $V$ are the eigenvectors corresponding to the nonzero eigenvalues of $AA'$ and $A'A$ respectively.

Also, for $r(A) = r$, $\delta_1^2, \ldots, \delta_r^2$ are the nonzero eigenvalues of $A'A$.

Finally, it can be shown that

$$A = UDV' = U_1 \Delta V_1'$$

where $U_1$ is $n \times r$ and $V_1$ is $p \times r$, representing the first $r$ orthonormal columns of $U$ and $V$.
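These relationships can be checked on the example matrix. The sketch below compares the squared singular values with the eigenvalues of $A'A$ and rebuilds $A$ from the first $r$ singular vectors only; the tolerance used to count nonzero singular values is an assumption of this example.

```r
# Minimal sketch: eigenvalues of A'A versus singular values, and the compact SVD.
A <- matrix(c(1, 1, 1, 1), 2)
s <- svd(A)

# Eigenvalues of A'A, returned in decreasing order
ev <- eigen(crossprod(A), symmetric = TRUE)$values

# Numerical rank: number of singular values above a small (assumed) tolerance
r <- sum(s$d > 1e-10)

U1    <- s$u[, seq_len(r), drop = FALSE]   # n x r
V1    <- s$v[, seq_len(r), drop = FALSE]   # p x r
Delta <- diag(s$d[seq_len(r)], nrow = r)   # r x r diagonal matrix

A_compact <- U1 %*% Delta %*% t(V1)        # reconstruction from r components
```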

LS, Ridge and SVD

Using the SVD, we can write $X = UDV^T$. In turn, we have:

The least squares estimator:

$$X\hat{\beta} = X(X^TX)^{-1}X^T y = UU^T y = \sum_{j=1}^{p} u_j u_j^T y.$$

The Ridge estimator:

$$X\hat{\beta} = X(X^TX + \lambda I)^{-1}X^T y = UD(D^2 + \lambda I)^{-1}DU^T y = \sum_{j=1}^{p} u_j \frac{\delta_j^2}{\delta_j^2 + \lambda} u_j^T y.$$
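Both identities can be verified numerically. The sketch below, on simulated centered data (all names hypothetical), computes the least squares and ridge fitted values directly and again through the SVD of $X$.

```r
# Minimal sketch: least squares and ridge fitted values via the SVD of X.
# Simulated (hypothetical) data with centered columns.
set.seed(5)
N <- 30
p <- 3
X <- scale(matrix(rnorm(N * p), N, p), center = TRUE, scale = FALSE)
y <- as.vector(X %*% c(1, -2, 0.5) + rnorm(N))
lambda <- 2

s <- svd(X)      # thin SVD: U is N x p, d holds p singular values
U <- s$u
d <- s$d

# Least squares: X (X'X)^{-1} X'y  versus  U U'y
fit_ls_direct <- X %*% solve(crossprod(X), crossprod(X, y))
fit_ls_svd    <- U %*% crossprod(U, y)

# Ridge: X (X'X + lambda I)^{-1} X'y  versus  sum_j u_j d_j^2/(d_j^2 + lambda) u_j'y
fit_ridge_direct <- X %*% solve(crossprod(X) + lambda * diag(p), crossprod(X, y))
fit_ridge_svd    <- U %*% ((d^2 / (d^2 + lambda)) * crossprod(U, y))
```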

Similar to LS, Ridge is computing coordinates of $y$ with respect to the orthonormal basis $U$.

However, for Ridge, we additionally shrink these coordinates by the factor $\delta_j^2 / (\delta_j^2 + \lambda)$.

The degree of shrinkage increases for smaller values of $\delta_j^2$.

Recall, the $\delta_j^2$'s are the eigenvalues of $X^TX$.
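A quick way to see this is to evaluate the factor for a few singular values. The values of `d` and `lambda` below are made up for illustration.

```r
# Minimal sketch: ridge shrinkage factors d_j^2 / (d_j^2 + lambda).
# The singular values and lambda are hypothetical.
d      <- c(10, 3, 1, 0.1)
lambda <- 1

shrink <- d^2 / (d^2 + lambda)

# With lambda = 0 there is no shrinkage at all
no_shrink <- d^2 / (d^2 + 0)
```

Here the factors are roughly 0.99, 0.90, 0.50 and 0.01, so the direction with the smallest singular value is shrunk almost to zero while the leading direction is left nearly unchanged.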

PCA

FIGURE 3.9. Principal components of some input data points. The largest principal component is the direction that maximizes the variance of the projected data, and the smallest principal component minimizes that variance. Ridge regression projects $y$ onto these components, and then shrinks the coefficients of the low-variance components more than the high-variance components. (Scatter plot of X1 against X2, both axes from -4 to 4, with the largest and smallest principal component directions marked.)

component. Subsequent principal components $z_j$ have maximum variance $d_j^2/N$, subject to being orthogonal to the earlier ones. Conversely the last principal component has minimum variance. Hence the small singular values $d_j$ correspond to directions in the column space of $X$ having small variance, and ridge regression shrinks these directions the most.

Figure 3.9 illustrates the principal components of some data points in two dimensions. If we consider fitting a linear surface over this domain (the $Y$-axis is sticking out of the page), the configuration of the data allows us to determine its gradient more accurately in the long direction than the short. Ridge regression protects against the potentially high variance of gradients estimated in the short directions. The implicit assumption is that the response will tend to vary most in the directions of high variance of the inputs. This is often a reasonable assumption, since predictors are often chosen for study because they vary with the response variable, but need not hold in general.

PCA and SVD

The first PC direction of $X$ is given by $v_1$, the first column of the $V$ matrix.

Note that if we let $z_1 = Xv_1$, then $z_1$ has the largest sample variance among all (normalized) linear combinations of the columns of $X$. For centered $X$:

$$
\begin{aligned}
\mathrm{Var}(z_1) &= \mathrm{Var}(Xv_1) \\
&= v_1^T \mathrm{Var}(X)\, v_1 \\
&= \frac{1}{N} v_1^T X^T X v_1 \\
&= \frac{1}{N} v_1^T V D^2 V^T v_1 \\
&= \frac{d_1^2}{N}.
\end{aligned}
$$
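The variance identity can be checked on simulated data. In the sketch below (all names and data are hypothetical), the sample variance of $z_1$ with divisor $N$ equals $d_1^2/N$, and no other unit-length direction on a grid of angles gives a larger variance. R's `prcomp()` uses the divisor $N - 1$ instead, so its first variance equals $d_1^2/(N-1)$.

```r
# Minimal sketch: variance of the first principal component via the SVD.
# Simulated (hypothetical) two-dimensional data with centered columns.
set.seed(6)
N  <- 200
x1 <- rnorm(N)
x2 <- 0.8 * x1 + rnorm(N, sd = 0.4)
X  <- scale(cbind(x1, x2), center = TRUE, scale = FALSE)

s  <- svd(X)
v1 <- s$v[, 1]                    # first PC direction
z1 <- as.vector(X %*% v1)         # first principal component

var_z1       <- mean(z1^2)        # divisor N; z1 has mean zero
d1_sq_over_N <- s$d[1]^2 / N

# Variance of X projected onto other unit-length directions
angles  <- seq(0, pi, length.out = 181)
var_dir <- sapply(angles, function(a) mean((X %*% c(cos(a), sin(a)))^2))

# prcomp() reports variances with divisor N - 1
pc_var1 <- prcomp(X)$sdev[1]^2
```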

Linear, Ridge Regression, and Principal Component Analysis
PCA and Ridge

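It can be shown that

$$z_j = Xv_j = u_j \delta_j.$$

Ridge shrinks all variables: the fit along every direction $z_j$ is scaled by $\delta_j^2/(\delta_j^2 + \lambda)$, and none is dropped entirely.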
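The identity $z_j = Xv_j = u_j\delta_j$ follows from $X = UDV^T$ and $V^TV = I$, and it can be confirmed numerically. The data and names in this sketch are hypothetical.

```r
# Minimal sketch: principal components z_j = X v_j equal u_j * delta_j.
# Simulated (hypothetical) centered data.
set.seed(7)
N <- 25
p <- 3
X <- scale(matrix(rnorm(N * p), N, p), center = TRUE, scale = FALSE)

s  <- svd(X)
Z  <- X %*% s$v              # columns are z_j = X v_j
UD <- s$u %*% diag(s$d)      # columns are u_j * delta_j

# The z_j are mutually orthogonal with squared lengths delta_j^2
ZtZ <- crossprod(Z)
```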

The geometric interpretation of principal components and shrinkage by ridge regression.
