STAT 340: Applied Regression Methods
Lecture Notes 9:
More on Least Squares, Ridge and PCA
Two views on least squares regression
FIGURE 3.1. Linear least squares fitting with $X \in \mathbb{R}^2$ (axes $X_1$, $X_2$, $Y$). We seek the linear function of $X$ that minimizes the sum of squared residuals from $Y$.
… space occupied by the pairs $(X, Y)$. Note that (3.2) makes no assumptions about the validity of model (3.1); it simply finds the best linear fit to the data. Least squares fitting is intuitively satisfying no matter how the data arise; the criterion measures the average lack of fit.

How do we minimize (3.2)? Denote by $X$ the $N \times (p+1)$ matrix with each row an input vector (with a 1 in the first position), and similarly let $y$ be the $N$-vector of outputs in the training set. Then we can write the residual sum-of-squares as

$$\mathrm{RSS}(\beta) = (y - X\beta)^T (y - X\beta). \tag{3.3}$$

This is a quadratic function in the $p+1$ parameters. Differentiating with respect to $\beta$ we obtain

$$\frac{\partial\,\mathrm{RSS}}{\partial \beta} = -2X^T(y - X\beta), \qquad \frac{\partial^2\,\mathrm{RSS}}{\partial \beta\,\partial \beta^T} = 2X^TX. \tag{3.4}$$

Assuming (for the moment) that $X$ has full column rank, and hence $X^TX$ is positive definite, we set the first derivative to zero

$$X^T(y - X\beta) = 0 \tag{3.5}$$

to obtain the unique solution $\hat{\beta} = (X^TX)^{-1}X^Ty$.
Recall:
• We can write the RSS in matrix notation as:

$$\begin{aligned}
\mathrm{RSS} &= (y - X\hat{\beta})^T (y - X\hat{\beta}) \\
&= y^Ty - \hat{\beta}^TX^Ty - y^TX\hat{\beta} + \hat{\beta}^TX^TX\hat{\beta} \\
&= y^Ty - 2\hat{\beta}^TX^Ty + \hat{\beta}^TX^TX\hat{\beta}
\end{aligned}$$
• To find the minimum, we take the derivative and set it to zero:

$$\frac{d\,\mathrm{RSS}}{d\hat{\beta}} = -2X^Ty + 2X^TX\hat{\beta} = 0$$

• If $X^TX$ is invertible, then:

$$\hat{\beta} = (X^TX)^{-1}X^Ty$$
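The closed form above can be checked numerically. The following is a minimal sketch on simulated data (the sample size, coefficients and seed are arbitrary assumptions, not taken from the lecture); it solves the normal equations directly and compares the result with R's lm().

```r
# Simulated data: all values here are illustrative assumptions
set.seed(1)
n  <- 50
x1 <- rnorm(n)
x2 <- rnorm(n)
y  <- 1 + 2 * x1 - 3 * x2 + rnorm(n)

# Design matrix with a leading column of ones for the intercept
X <- cbind(1, x1, x2)

# Solve the normal equations X'X b = X'y (no explicit inverse needed)
beta_hat <- solve(crossprod(X), crossprod(X, y))

# Reference fit from lm() for comparison
beta_lm <- coef(lm(y ~ x1 + x2))

# The derivative -2X'y + 2X'X b should vanish at the solution
grad <- -2 * t(X) %*% y + 2 * crossprod(X) %*% beta_hat

print(cbind(normal_eq = as.numeric(beta_hat), lm = as.numeric(beta_lm)))
```

Using solve(A, b) rather than solve(A) %*% b is a design choice for numerical stability; lm() itself relies on a QR decomposition rather than the normal equations.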
• The predicted outcome $\hat{y}$ given $X$ is:

$$\hat{y} = X\hat{\beta} = X(X^TX)^{-1}X^Ty = My,$$

where $M$ is the orthogonal projection operator onto $C(X)$, the column space of $X$.
• If the rank of $X$ is two, then $\hat{y}$ is the orthogonal projection of $y$ onto the plane spanned by two linearly independent columns of $X$.
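The defining properties of an orthogonal projection (symmetry, idempotence, residuals orthogonal to the column space) can be verified directly. This is a small sketch on simulated data; the dimensions and seed are assumptions chosen only for illustration.

```r
# Simulated full-rank design with an intercept column (illustrative values)
set.seed(2)
n <- 20
X <- cbind(1, rnorm(n), rnorm(n))
y <- rnorm(n)

# Projection matrix M = X (X'X)^{-1} X'
M <- X %*% solve(crossprod(X)) %*% t(X)

y_hat <- M %*% y
res   <- y - y_hat

sym_err  <- max(abs(M - t(M)))        # M is symmetric
idem_err <- max(abs(M %*% M - M))     # M M = M (projecting twice changes nothing)
orth_err <- max(abs(t(X) %*% res))    # residual is orthogonal to every column of X
trace_M  <- sum(diag(M))              # trace equals the rank of X (3 here)

print(c(sym_err = sym_err, idem_err = idem_err, orth_err = orth_err, trace = trace_M))
```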
FIGURE 3.2. The $N$-dimensional geometry of least squares regression with two predictors. The outcome vector $y$ is orthogonally projected onto the hyperplane spanned by the input vectors $x_1$ and $x_2$. The projection $\hat{y}$ represents the vector of the least squares predictions.

The predicted values at an input vector $x_0$ are given by $\hat{f}(x_0) = (1 : x_0)^T\hat{\beta}$; the fitted values at the training inputs are

$$\hat{y} = X\hat{\beta} = X(X^TX)^{-1}X^Ty, \tag{3.7}$$

where $\hat{y}_i = \hat{f}(x_i)$. The matrix $H = X(X^TX)^{-1}X^T$ appearing in equation (3.7) is sometimes called the "hat" matrix because it puts the hat on $y$.

Figure 3.2 shows a different geometrical representation of the least squares estimate, this time in $\mathbb{R}^N$. We denote the column vectors of $X$ by $x_0, x_1, \ldots, x_p$, with $x_0 \equiv 1$. For much of what follows, this first column is treated like any other. These vectors span a subspace of $\mathbb{R}^N$, also referred to as the column space of $X$. We minimize $\mathrm{RSS}(\beta) = \lVert y - X\beta \rVert^2$ by choosing $\hat{\beta}$ so that the residual vector $y - \hat{y}$ is orthogonal to this subspace. This orthogonality is expressed in (3.5), and the resulting estimate $\hat{y}$ is hence the orthogonal projection of $y$ onto this subspace. The hat matrix $H$ computes the orthogonal projection, and hence it is also known as a projection matrix.

It might happen that the columns of $X$ are not linearly independent, so that $X$ is not of full rank. This would occur, for example, if two of the inputs were perfectly correlated (e.g., $x_2 = 3x_1$). Then $X^TX$ is singular and the least squares coefficients $\hat{\beta}$ are not uniquely defined. However, the fitted values $\hat{y} = X\hat{\beta}$ are still the projection of $y$ onto the column space of $X$; there is just more than one way to express that projection in terms of the column vectors of $X$. The non-full-rank case occurs most often when one or more qualitative inputs are coded in a redundant fashion. There is usually a natural way to resolve the non-unique representation, by recoding and/or dropping redundant columns in $X$. Most regression software packages detect these redundancies and automatically implement …
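The non-full-rank case described in the excerpt can be reproduced with its own example, $x_2 = 3x_1$. The sketch below uses simulated data (sample size, seed and true coefficients are assumptions) and shows how R's lm() behaves: the redundant coefficient is reported as NA, while the fitted values match those of the model with the redundant column dropped.

```r
# Simulated data with a perfectly collinear pair (illustrative values)
set.seed(3)
n  <- 30
x1 <- rnorm(n)
x2 <- 3 * x1                 # redundant column: x2 = 3 * x1
y  <- 1 + x1 + rnorm(n)

X <- cbind(1, x1, x2)
rank_X <- qr(X)$rank         # numerical rank is 2, not 3

fit_full <- lm(y ~ x1 + x2)  # lm() detects the redundancy; coefficient of x2 is NA
fit_drop <- lm(y ~ x1)       # same column space, redundant column removed

# The projection of y onto the column space is the same either way
same_fit <- isTRUE(all.equal(unname(fitted(fit_full)), unname(fitted(fit_drop))))

print(coef(fit_full))
```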
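Ridge Regression
• Recall, for Ridge Regression we aim to minimize

$$\begin{aligned}
\mathrm{RSS}(\lambda) &= (y - X\hat{\beta})^T (y - X\hat{\beta}) + \lambda\hat{\beta}^T\hat{\beta} \\
&= y^Ty - 2\hat{\beta}^TX^Ty + \hat{\beta}^TX^TX\hat{\beta} + \lambda\hat{\beta}^T\hat{\beta}
\end{aligned}$$

For simplicity, assume here that the columns of $X$ have all been centered, and that we exclude the column corresponding to the intercept.
• To find the minimum, we take the derivative and set it to zero:

$$\frac{d\,\mathrm{RSS}(\lambda)}{d\hat{\beta}} = -2X^Ty + 2X^TX\hat{\beta} + 2\lambda\hat{\beta} = 0$$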
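• This yields the Ridge estimator:

$$\hat{\beta}^{\text{ridge}} = (X^TX + \lambda I)^{-1}X^Ty$$

• Note that the problem is now nonsingular: for $\lambda > 0$ the matrix $X^TX + \lambda I$ is positive definite, so we can always solve for $\hat{\beta}^{\text{ridge}}$, even when $X^TX$ itself is singular.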
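To illustrate why the ridge problem is nonsingular, the sketch below builds a centered design with a perfectly collinear pair of columns, so that $X^TX$ is singular, and still obtains a ridge solution. The data, the seed and the value of lambda are assumptions chosen only for illustration.

```r
# Centered design with a redundant column (illustrative values)
set.seed(4)
n  <- 40
x1 <- rnorm(n)
X  <- cbind(x1, 3 * x1, rnorm(n))            # column 2 = 3 * column 1
X  <- scale(X, center = TRUE, scale = FALSE) # center columns, no intercept column
y  <- rnorm(n)
y  <- y - mean(y)

rank_X <- qr(X)$rank                         # 2 < 3, so X'X is singular

lambda <- 1                                  # assumed penalty value
XtX    <- crossprod(X)

# Ridge estimator: (X'X + lambda I)^{-1} X'y is well defined for lambda > 0
beta_ridge <- solve(XtX + lambda * diag(ncol(X)), crossprod(X, y))

# The derivative -2X'y + 2X'X b + 2 lambda b should vanish at the solution
grad <- -2 * crossprod(X, y) + 2 * XtX %*% beta_ridge + 2 * lambda * beta_ridge

print(as.numeric(beta_ridge))
```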
Lasso
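• Recall, in Lasso, we aim to minimize:

$$\mathrm{RSS}(\lambda) = (y - X\hat{\beta})^T (y - X\hat{\beta}) + \lambda\,\mathbf{1}^T|\hat{\beta}| = (y - X\hat{\beta})^T (y - X\hat{\beta}) + \lambda\sum_{j}|\hat{\beta}_j|$$

• No closed-form solution → an alternative optimization strategy is needed.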
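The slides do not say which alternative optimization strategy is meant. One common choice is cyclic coordinate descent with soft-thresholding, sketched below for the objective $(y - X\beta)^T(y - X\beta) + \lambda\sum_j|\beta_j|$. The function names soft_threshold and lasso_cd, the fixed iteration count and the simulated data are all assumptions made for this illustration, not the lecture's method.

```r
# Soft-thresholding operator S(z, t) = sign(z) * max(|z| - t, 0)
soft_threshold <- function(z, t) sign(z) * pmax(abs(z) - t, 0)

# Hypothetical helper: cyclic coordinate descent for
#   (y - X b)'(y - X b) + lambda * sum(|b_j|)
# Each coordinate update minimizes the objective in b_j with the others fixed.
lasso_cd <- function(X, y, lambda, n_iter = 200) {
  p    <- ncol(X)
  beta <- rep(0, p)
  for (it in seq_len(n_iter)) {
    for (j in seq_len(p)) {
      # Partial residual with predictor j removed
      r_j <- y - X[, -j, drop = FALSE] %*% beta[-j]
      beta[j] <- soft_threshold(sum(X[, j] * r_j), lambda / 2) / sum(X[, j]^2)
    }
  }
  beta
}

# Simulated, centered data (illustrative values)
set.seed(5)
n <- 100
X <- scale(matrix(rnorm(n * 3), n, 3), center = TRUE, scale = FALSE)
y <- 2 * X[, 1] + rnorm(n)
y <- y - mean(y)

beta_ls    <- lasso_cd(X, y, lambda = 0)    # lambda = 0 recovers least squares
beta_lasso <- lasso_cd(X, y, lambda = 50)   # moderate penalty shrinks coefficients
beta_zero  <- lasso_cd(X, y, lambda = 2 * max(abs(crossprod(X, y))))  # all zero

print(rbind(beta_ls, beta_lasso, beta_zero))
```

Unlike ridge, a large enough penalty sets coefficients exactly to zero, which is visible in the last row.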
New Concept - SVD
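Theorem 4.1: Singular Value Decomposition (SVD). For an $n \times p$ matrix $A$ of rank $r$, there exist orthogonal matrices $U_{n \times n}$ and $V_{p \times p}$ such that

$$U'AV = \begin{pmatrix} \Delta & 0 \\ 0 & 0 \end{pmatrix}_{n \times p}$$

where $\Delta = \mathrm{diag}(\delta_1, \ldots, \delta_r)$ and $\delta_1 \ge \ldots \ge \delta_r > 0$ are called the singular values of $A$. Or equivalently,

$$A = U \begin{pmatrix} \Delta & 0 \\ 0 & 0 \end{pmatrix}_{n \times p} V'$$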
Singular Value Decomposition (SVD)
Recall, for an orthogonal matrix $X$, we have $X^{-1} = X'$ or equivalently, $X'X = XX' = I$. Since $U$ and $V$ are orthogonal, we know $UU' = I$, $VV' = I$ and therefore, writing $D$ for the block matrix in the theorem,

$$U'AV = D \iff UU'AVV' = UDV' \iff A = UDV'.$$

Example
Consider the matrix $A = \begin{pmatrix} 1 & 1 \\ 1 & 1 \end{pmatrix}$. Note that $A$ is a $2 \times 2$ symmetric matrix of rank $r(A) = 1$. The SVD is obtained using the svd() function in R:
# Create the matrix A
> A <- matrix(c(1,1,1,1),2)
> A
     [,1] [,2]
[1,]    1    1
[2,]    1    1
# Calculate the singular value decomposition of A
> svd(A)
$d
[1] 2 0

$u
           [,1]       [,2]
[1,] -0.7071068 -0.7071068
[2,] -0.7071068  0.7071068

$v
           [,1]       [,2]
[1,] -0.7071068 -0.7071068
[2,] -0.7071068  0.7071068
• We know from the SVD theorem that we can write $D = U'AV$. Since $A$ is of dimension $2 \times 2$ and $r(A) = 1$, we expect only one non-zero singular value $\delta$ among the diagonal elements of $D$.
• We can also reconstruct $A$ from the components of the SVD as shown below.
# Checking that U'AV = Delta
> V <- svd(A)$v
> U <- svd(A)$u
> round(t(U) %*% A %*% V, 2)
     [,1] [,2]
[1,]    2    0
[2,]    0    0
# Generate A from the components of the SVD of A
> Delta <- diag(svd(A)$d)
> U %*% Delta %*% t(V)
     [,1] [,2]
[1,]    1    1
[2,]    1    1
Recall, matrices are spatial operators that rotate, stretch and shrink vectors. The SVD tells us how to break up these operations of a matrix into three steps. Let $A = UDV'$ act upon the vector $x = \begin{pmatrix} 2 \\ 0 \end{pmatrix}$. It is easily seen that $Ax = \begin{pmatrix} 2 \\ 2 \end{pmatrix}$.

[Figure: the vectors $x$, $V'x$, $DV'x$ and $UDV'x = Ax$, together with the line $S(v_1)$, plotted on axes running from −3 to 3.]
• First we have

$$x^* = V'x = \begin{pmatrix} -0.707 & -0.707 \\ -0.707 & 0.707 \end{pmatrix} \begin{pmatrix} 2 \\ 0 \end{pmatrix} = \begin{pmatrix} -1.414 \\ -1.414 \end{pmatrix}$$

• Note $V'x \in S(v_1)$ where $v_1$ is an eigenvector of $A'A$. That is, pre-multiplying by the matrix $V'$ rotates the vector $x$ to be in the span of the first eigenvector of $A'A$.
• Since $D$ is a diagonal matrix, pre-multiplying by $D$ has the effect of stretching or shrinking the corresponding axes. In this example the first diagonal element equals 2 and the second equals 0, and so the effect is to stretch the x-axis by 2 and reduce the y-axis to 0:

$$x^{**} = Dx^* = \begin{pmatrix} 2 & 0 \\ 0 & 0 \end{pmatrix} \begin{pmatrix} -1.414 \\ -1.414 \end{pmatrix} = \begin{pmatrix} -2.828 \\ 0 \end{pmatrix}$$
• Note that 2 and 0 are the square roots of the eigenvalues of $A'A$.
• Finally, pre-multiplying by $U$ gives

$$Ux^{**} = \begin{pmatrix} -0.707 & -0.707 \\ -0.707 & 0.707 \end{pmatrix} \begin{pmatrix} -2.828 \\ 0 \end{pmatrix} = \begin{pmatrix} 2 \\ 2 \end{pmatrix}$$
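The three steps (rotate with $V'$, stretch with $D$, rotate with $U$) can be replayed in R for the same $A$ and $x$. This is a minimal sketch; the signs of the columns of U and V returned by svd() are not unique and may differ between platforms, so the intermediate vectors may flip sign, but the final product does not.

```r
A <- matrix(c(1, 1, 1, 1), 2)
x <- c(2, 0)
s <- svd(A)

x_star  <- t(s$v) %*% x           # step 1: V'x
x_star2 <- diag(s$d) %*% x_star   # step 2: D V'x (second coordinate becomes 0)
Ax_svd  <- s$u %*% x_star2        # step 3: U D V'x

Ax <- A %*% x                     # direct multiplication for comparison

print(cbind(three_steps = as.numeric(Ax_svd), direct = as.numeric(Ax)))
```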
• Also of note, the first $r$ columns of $U$ and $V$ are eigenvectors of $AA'$ and $A'A$, respectively, corresponding to their nonzero eigenvalues.
• Also, for $r(A) = r$, $\delta_1^2, \ldots, \delta_r^2$ are the nonzero eigenvalues of $A'A$.
• Finally, it can be shown that $A = UDV' = U_1 \Delta V_1'$ where $U_1$ is $n \times r$ and $V_1$ is $p \times r$, consisting of the first $r$ orthonormal columns of $U$ and $V$.
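These relations can be checked numerically on a small rank-deficient matrix. In the sketch below the matrix, its dimensions and the tolerance used to count nonzero singular values are assumptions made for illustration. Note that R's svd() returns the reduced factors by default (u has min(n, p) columns).

```r
# A 4 x 3 matrix forced to have rank r = 2 (illustrative values)
set.seed(7)
A <- matrix(rnorm(12), 4, 3)
A[, 3] <- A[, 1] + A[, 2]

s <- svd(A)
r <- sum(s$d > 1e-10)                      # number of nonzero singular values

# Squared singular values equal the eigenvalues of A'A
ev <- eigen(crossprod(A), symmetric = TRUE)$values

# Reduced form A = U1 Delta V1' using only the first r columns
U1    <- s$u[, 1:r, drop = FALSE]
V1    <- s$v[, 1:r, drop = FALSE]
Delta <- diag(s$d[1:r], nrow = r)
A_reduced <- U1 %*% Delta %*% t(V1)

print(rbind(d_squared = s$d^2, eigenvalues = ev))
```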
LS, Ridge and SVD
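Using the SVD, we can write $X = UDV^T$ (here in its reduced form, where $U$ has $p$ orthonormal columns $u_1, \ldots, u_p$ and $D = \mathrm{diag}(\delta_1, \ldots, \delta_p)$). In turn, we have:
• The least squares estimator (assuming $X$ has full column rank):

$$X\hat{\beta} = X(X^TX)^{-1}X^Ty = UU^Ty = \sum_{j=1}^{p} u_j u_j^T y.$$

• The Ridge estimator:

$$X\hat{\beta}^{\text{ridge}} = X(X^TX + \lambda I)^{-1}X^Ty = UD(D^2 + \lambda I)^{-1}DU^Ty = \sum_{j=1}^{p} u_j \frac{\delta_j^2}{\delta_j^2 + \lambda} u_j^T y.$$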
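The middle equality in the Ridge expression follows by substituting the SVD. A short derivation, assuming the reduced SVD with $U^TU = I$ and a square orthogonal $V$ (so $V^TV = VV^T = I$):

```latex
\begin{align*}
X^TX &= V D U^T U D V^T = V D^2 V^T, \\
X^TX + \lambda I &= V (D^2 + \lambda I) V^T, \\
(X^TX + \lambda I)^{-1} &= V (D^2 + \lambda I)^{-1} V^T, \\
X (X^TX + \lambda I)^{-1} X^T y
  &= U D V^T V (D^2 + \lambda I)^{-1} V^T V D U^T y
   = U D (D^2 + \lambda I)^{-1} D U^T y.
\end{align*}
```

Since $D(D^2 + \lambda I)^{-1}D$ is diagonal with entries $\delta_j^2/(\delta_j^2 + \lambda)$, the sum over $j$ follows. Setting $\lambda = 0$ gives the least squares fit $UU^Ty$ when all $\delta_j > 0$.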
• Similar to LS, Ridge computes the coordinates of $y$ with respect to the orthonormal basis $U$.
• However, for Ridge, we additionally shrink these coordinates by the factor $\delta_j^2 / (\delta_j^2 + \lambda)$.
• The degree of shrinkage increases for smaller values of $\delta_j^2$.
• Recall, the $\delta_j^2$'s are the eigenvalues of $X^TX$.
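The shrinkage view of Ridge can be checked against the direct formula. The sketch below uses simulated, centered data with deliberately unequal column scales so that the singular values differ; the data, seed and lambda are assumptions for illustration.

```r
# Simulated centered design with unequal spread across directions
set.seed(9)
n <- 60
p <- 3
X <- matrix(rnorm(n * p), n, p) %*% diag(c(3, 1, 0.2))
X <- scale(X, center = TRUE, scale = FALSE)
y <- rnorm(n)
y <- y - mean(y)
lambda <- 2                              # assumed penalty value

s <- svd(X)

# Shrinkage factor for each direction u_j: delta_j^2 / (delta_j^2 + lambda)
shrink <- s$d^2 / (s$d^2 + lambda)

# Coordinates of y in the basis U, shrunk, then mapped back
coords  <- crossprod(s$u, y)             # u_j' y for each j
fit_svd <- s$u %*% (shrink * coords)

# Direct ridge fit X (X'X + lambda I)^{-1} X'y for comparison
fit_direct <- X %*% solve(crossprod(X) + lambda * diag(p), crossprod(X, y))

print(round(shrink, 4))                  # smaller delta_j => factor closer to 0
```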
PCA
FIGURE 3.9. Principal components of some input data points (axes $X_1$ and $X_2$; the two marked directions are the Largest Principal Component and the Smallest Principal Component). The largest principal component is the direction that maximizes the variance of the projected data, and the smallest principal component minimizes that variance. Ridge regression projects $y$ onto these components, and then shrinks the coefficients of the low-variance components more than the high-variance components.

… component. Subsequent principal components $z_j$ have maximum variance $d_j^2/N$, subject to being orthogonal to the earlier ones. Conversely the last principal component has minimum variance. Hence the small singular values $d_j$ correspond to directions in the column space of $X$ having small variance, and ridge regression shrinks these directions the most.

Figure 3.9 illustrates the principal components of some data points in two dimensions. If we consider fitting a linear surface over this domain (the $Y$-axis is sticking out of the page), the configuration of the data allow us to determine its gradient more accurately in the long direction than the short. Ridge regression protects against the potentially high variance of gradients estimated in the short directions. The implicit assumption is that the response will tend to vary most in the directions of high variance of the inputs. This is often a reasonable assumption, since predictors are often chosen for study because they vary with the response variable, but need not hold in general.
PCA and SVD
• The first PC direction of $X$ is given by $v_1$, the first column of the $V$ matrix.
• Note that if we let $z_1 = Xv_1$, then $z_1$ has the largest sample variance among all (normalized) linear combinations of the columns of $X$:

$$\mathrm{Var}(z_1) = \mathrm{Var}(Xv_1) = v_1^T\,\mathrm{Var}(X)\,v_1 = \frac{1}{N} v_1^T X^TX v_1 = \frac{1}{N} v_1^T V D^2 V^T v_1 = \frac{d_1^2}{N}.$$
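The variance identity can be verified on simulated data. In the sketch below the data, the seed and the comparison direction a are assumptions for illustration. The variance uses divisor $N$ to match the formula above, whereas R's prcomp() uses $N - 1$, so its result is rescaled before comparing.

```r
# Simulated, centered two-column data with correlated columns
set.seed(10)
N <- 200
X <- cbind(rnorm(N), rnorm(N))
X[, 2] <- 0.8 * X[, 1] + 0.3 * X[, 2]
X <- scale(X, center = TRUE, scale = FALSE)

s  <- svd(X)
z1 <- X %*% s$v[, 1]               # first principal component z1 = X v1

var_z1      <- mean(z1^2)          # divisor N; z1 has mean zero since X is centered
var_formula <- s$d[1]^2 / N        # d_1^2 / N

# Any other unit-length direction yields no more variance than v1
a     <- c(1, 0)
var_a <- mean((X %*% a)^2)

# prcomp() reports standard deviations with divisor N - 1
pc_var <- prcomp(X)$sdev[1]^2 * (N - 1) / N

print(c(var_z1 = var_z1, formula = var_formula, prcomp = pc_var, other_dir = var_a))
```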
PCA
Linear, Ridge Regression, and Principal Component Analysis
PCA and Ridge
• It can be shown that $z_j = Xv_j = u_j\delta_j$.
• Ridge shrinks all variables: every principal component direction $u_j$ is shrunk, by the factor $\delta_j^2 / (\delta_j^2 + \lambda)$.
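The identity $z_j = Xv_j = u_j\delta_j$ follows in one line from the SVD. A short derivation, assuming $X = UDV^T$ with orthonormal columns in $V$ and writing $e_j$ for the $j$-th standard basis vector (a symbol introduced here for convenience):

```latex
\begin{align*}
z_j = X v_j = U D V^T v_j = U D e_j = \delta_j\, u_j ,
\qquad \text{since } V^T v_j = e_j .
\end{align*}
```

So each principal component $z_j$ is the basis vector $u_j$ scaled by its singular value, which is why the coordinates of $y$ along $u_j$ are exactly the ones Ridge shrinks.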
Linear, Ridge Regression, and Principal Component Analysis
The geometric interpretation of principal components and shrinkage by ridge regression.