Introduction to Machine Learning
Linear Regression
Varun Chandola
Computer Science & Engineering, State University of New York at Buffalo
Buffalo, NY, USA [email protected]
Outline
• Linear Regression
• Problem Formulation
• Geometric Interpretation
• Learning Parameters
• Recap
• Issues with Linear Regression
• Bayesian Linear Regression
• Bayesian Regression
• Estimating Bayesian Regression Parameters
• Prediction with Bayesian Regression
• Handling Non-linear Relationships
• Handling Overfitting via Regularization
Taking the next step
Hypothesis Space, H
• Conjunctive
• Disjunctive
• Disjunctions of k attributes
• Linear hyperplanes
• $c^* \notin H$
• Non-linear network
Input Space, x
• $x \in \{0, 1\}^d$
• $x \in \mathbb{R}^d$
Output Space, y
• $y \in \{0, 1\}$
• $y \in \{-1, +1\}$
• $y \in \mathbb{R}$
Linear Regression
• There is one scalar target variable $y$ (instead of hidden)
• There is one vector input variable $x$
• Inductive bias: $y = w^\top x$
Linear Regression Learning Task
Learn $w$ given training examples $\langle X, y \rangle$.
Two Interpretations
1. Probabilistic Interpretation
• $y$ is assumed to be normally distributed: $y \sim \mathcal{N}(w^\top x, \sigma^2)$
• or, equivalently: $y = w^\top x + \epsilon$, where $\epsilon \sim \mathcal{N}(0, \sigma^2)$
• $y$ is a linear combination of the input variables
• Given $w$ and $\sigma^2$, one can find the probability distribution of $y$ for a given $x$
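The generative reading of this model can be sketched in a few lines. This is a minimal illustration assuming NumPy; the weights `w_true`, the noise level `sigma`, and the sample size are made-up values and do not come from the slides.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical ground-truth parameters, chosen only for illustration.
w_true = np.array([2.0, -1.0])
sigma = 0.5

# N inputs of dimension d = 2.
N = 1000
X = rng.normal(size=(N, 2))

# y = w^T x + eps, with eps ~ N(0, sigma^2).
eps = rng.normal(loc=0.0, scale=sigma, size=N)
y = X @ w_true + eps

# For a fixed x, y is Gaussian with mean w^T x and standard deviation sigma.
x_new = np.array([1.0, 3.0])
mean_y = w_true @ x_new
```

Here `mean_y` is the center of the distribution of $y$ at `x_new`; the spread around it is controlled entirely by `sigma`.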
Two Interpretations
2. Geometric Interpretation
• Fitting a straight line (a hyperplane when $d > 1$) to $d$-dimensional data: $y = w^\top x$
$$y = w^\top x = w_1 x_1 + w_2 x_2 + \ldots + w_d x_d$$
• Will pass through the origin
• Add intercept
$$y = w_0 + w_1 x_1 + w_2 x_2 + \ldots + w_d x_d$$
• Equivalent to adding another column of 1s to $X$.
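The effect of the extra column of ones can be checked on toy data. The sketch below assumes NumPy and uses made-up points that lie exactly on $y = 1 + 2x$; the variable names are illustrative only.

```python
import numpy as np

# Toy data lying exactly on y = 1 + 2x (illustrative values only).
x = np.array([0.0, 1.0, 2.0, 3.0])
y = 1.0 + 2.0 * x

# Without an intercept the fitted line is forced through the origin.
X_no_bias = x.reshape(-1, 1)
w_no_bias, *_ = np.linalg.lstsq(X_no_bias, y, rcond=None)

# Prepend a column of ones so that w_bias[0] plays the role of w_0.
X_bias = np.column_stack([np.ones_like(x), x])
w_bias, *_ = np.linalg.lstsq(X_bias, y, rcond=None)
```

With the column of ones, `w_bias` recovers the intercept and slope; without it, the single slope in `w_no_bias` has to compensate for the missing offset.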
Learning Parameters - MLE Approach
• Find $w$ and $\sigma^2$ that maximize the likelihood of the training data
$$\hat{w}_{MLE} = (X^\top X)^{-1} X^\top y$$
$$\hat{\sigma}^2_{MLE} = \frac{1}{N} (y - X\hat{w}_{MLE})^\top (y - X\hat{w}_{MLE})$$
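A minimal sketch of these two estimates, assuming NumPy. The function name `fit_mle` and the synthetic data are hypothetical; the linear system is solved directly rather than forming the inverse explicitly, which is a common numerical choice and not something the slides prescribe.

```python
import numpy as np

def fit_mle(X, y):
    """Return (w_hat, sigma2_hat) for the model y ~ N(Xw, sigma^2 I)."""
    # Solve the normal equations (X^T X) w = X^T y without forming an inverse.
    w_hat = np.linalg.solve(X.T @ X, X.T @ y)
    residual = y - X @ w_hat
    # MLE of the noise variance: mean squared residual (divides by N).
    sigma2_hat = residual @ residual / len(y)
    return w_hat, sigma2_hat

rng = np.random.default_rng(1)
X = rng.normal(size=(500, 3))
w_true = np.array([1.0, -2.0, 0.5])   # hypothetical weights
y = X @ w_true + rng.normal(scale=0.3, size=500)

w_hat, sigma2_hat = fit_mle(X, y)
```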
Learning Parameters - Least Squares Approach
• Minimize squared loss
$$J(w) = \frac{1}{2} \sum_{i=1}^{N} (y_i - w^\top x_i)^2$$
• Make the prediction ($w^\top x_i$) as close to the target ($y_i$) as possible
• Least squares estimate
$$\hat{w} = (X^\top X)^{-1} X^\top y$$
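The closed form follows from setting the gradient of the squared loss to zero. A short derivation in matrix form, assuming $X^\top X$ is invertible:

```latex
\begin{align*}
J(w) &= \tfrac{1}{2} (y - Xw)^\top (y - Xw) \\
\nabla_w J(w) &= -X^\top (y - Xw) = X^\top X w - X^\top y \\
\nabla_w J(w) = 0 \;&\Rightarrow\; X^\top X \hat{w} = X^\top y
\;\Rightarrow\; \hat{w} = (X^\top X)^{-1} X^\top y
\end{align*}
```

This is the same expression as the maximum likelihood estimate of $w$ under Gaussian noise, so the two approaches agree on the weights.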
Gradient Descent Based Method
• Minimize the squared loss using Gradient Descent
$$J(w) = \frac{1}{2} \sum_{i=1}^{N} (y_i - w^\top x_i)^2$$
• Why?
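One commonly cited answer, offered here as an assumption about what the slide intends, is that solving the normal equations requires building and factorizing the $d \times d$ matrix $X^\top X$, which becomes expensive when $d$ is large, whereas each gradient step only needs matrix-vector products. A minimal batch gradient descent sketch, assuming NumPy; the step size `lr` and iteration count are illustrative and would need tuning on other data.

```python
import numpy as np

def gradient_descent(X, y, lr=0.005, n_iters=2000):
    """Minimize J(w) = 0.5 * sum_i (y_i - w^T x_i)^2 by batch gradient descent."""
    w = np.zeros(X.shape[1])
    for _ in range(n_iters):
        # Gradient of J with respect to w: -X^T (y - Xw).
        grad = -X.T @ (y - X @ w)
        w = w - lr * grad
    return w

rng = np.random.default_rng(2)
X = rng.normal(size=(50, 2))
y = X @ np.array([3.0, -1.0]) + rng.normal(scale=0.1, size=50)

w_gd = gradient_descent(X, y)
w_closed = np.linalg.solve(X.T @ X, X.T @ y)   # least squares estimate for comparison
```

On this small problem both routes reach the same weights; the iterative version diverges if `lr` is too large relative to the largest eigenvalue of $X^\top X$.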
Recap - Linear Regression
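Geometric
$$y = w^\top x$$
1. Least Squares
$$\hat{w} = (X^\top X)^{-1} X^\top y$$
2. Gradient Descent
$$J(w) = \frac{1}{2} \sum_{i=1}^{N} (y_i - w^\top x_i)^2$$
Bayesian
$$p(y) = \mathcal{N}(w^\top x, \sigma^2)$$
1. Maximum Likelihood Estimation
$$\hat{w}_{MLE} = (X^\top X)^{-1} X^\top y$$
$$\hat{\sigma}^2_{MLE} = \frac{1}{N} (y - X\hat{w}_{MLE})^\top (y - X\hat{w}_{MLE})$$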
Issues with Linear Regression
1. Not truly Bayesian
2. Susceptible to outliers
3. Too simplistic - Underfitting
4. No way to control overfitting
5. Unstable in presence of correlated input attributes
6. Gets “confused” by unnecessary attributes
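The instability with correlated attributes (item 5) shows up numerically as an ill-conditioned $X^\top X$. The sketch below assumes NumPy and uses a made-up setup in which one attribute is almost a copy of another; the condition number is used here as a rough indicator, not as a method from the slides.

```python
import numpy as np

rng = np.random.default_rng(3)
N = 200
x1 = rng.normal(size=N)

# Second attribute is almost a copy of the first (hypothetical setup).
x2 = x1 + 1e-4 * rng.normal(size=N)
X_corr = np.column_stack([x1, x2])

# Independent attributes for comparison.
X_indep = rng.normal(size=(N, 2))

# A large condition number means small changes in y can move w a lot.
cond_corr = np.linalg.cond(X_corr.T @ X_corr)
cond_indep = np.linalg.cond(X_indep.T @ X_indep)
```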
Putting a Prior on w
• “Penalize” large values of $w$
• A zero-mean Gaussian prior
$$p(w) = \mathcal{N}(w \mid 0, \tau^2 I)$$
• What is the posterior of $w$?
$$p(w \mid D) \propto \prod_{i} \mathcal{N}(y_i \mid w^\top x_i, \sigma^2)\, p(w)$$
• Posterior is also Gaussian
Posterior Estimates of the Weight Vector
• MAP estimate of $w$
$$\arg\max_{w} \sum_{i=1}^{N} \log \mathcal{N}(y_i \mid w^\top x_i, \sigma^2) + \log \mathcal{N}(w \mid 0, \tau^2 I)$$
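Expanding the two Gaussian log-densities shows that the MAP problem is a penalized least squares problem. A short derivation, dropping terms that do not depend on $w$ and multiplying through by $\sigma^2$:

```latex
\begin{align*}
\log \mathcal{N}(y_i \mid w^\top x_i, \sigma^2)
  &= -\frac{1}{2\sigma^2} (y_i - w^\top x_i)^2 + \text{const} \\
\log \mathcal{N}(w \mid 0, \tau^2 I)
  &= -\frac{1}{2\tau^2} w^\top w + \text{const} \\
\hat{w}_{MAP}
  &= \arg\min_{w} \; \frac{1}{2} \sum_{i=1}^{N} (y_i - w^\top x_i)^2
     + \frac{\sigma^2}{2\tau^2} \|w\|^2
\end{align*}
```

With the squared loss $J(w)$ defined as above, this has the form $J(w) + \lambda \|w\|^2$ with $\lambda = \sigma^2 / (2\tau^2)$: a tighter prior (smaller $\tau^2$) means a stronger penalty.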
Parameter Estimation for Bayesian Regression
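• Prior for $w$
$$w \sim \mathcal{N}(w \mid 0, \tau^2 I_D)$$
• Posterior for $w$
$$p(w \mid y, X) = \frac{p(y \mid X, w)\, p(w)}{p(y \mid X)} = \mathcal{N}\left(w \,\Big|\, \bar{w},\; \sigma^2 \left(X^\top X + \frac{\sigma^2}{\tau^2} I_D\right)^{-1}\right)$$
$$\bar{w} = \left(X^\top X + \frac{\sigma^2}{\tau^2} I_D\right)^{-1} X^\top y$$
• Posterior distribution for $w$ is also Gaussian
• What will be the MAP estimate for $w$?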
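The posterior mean and covariance can be computed directly. This is a minimal sketch assuming NumPy, with $\sigma^2$ and $\tau^2$ treated as known; the function name `posterior` and the synthetic data are hypothetical.

```python
import numpy as np

def posterior(X, y, sigma2, tau2):
    """Gaussian posterior over w for prior N(0, tau2 * I) and noise variance sigma2."""
    D = X.shape[1]
    A = X.T @ X + (sigma2 / tau2) * np.eye(D)
    w_bar = np.linalg.solve(A, X.T @ y)   # posterior mean
    V = sigma2 * np.linalg.inv(A)         # posterior covariance
    return w_bar, V

rng = np.random.default_rng(4)
X = rng.normal(size=(30, 2))
y = X @ np.array([1.0, 2.0]) + rng.normal(scale=0.5, size=30)

w_bar, V = posterior(X, y, sigma2=0.25, tau2=1.0)
w_mle = np.linalg.solve(X.T @ X, X.T @ y)          # no prior, for comparison
w_flat, _ = posterior(X, y, sigma2=0.25, tau2=1e12)  # nearly flat prior
```

Since the mode of a Gaussian coincides with its mean, `w_bar` is also the MAP estimate. It is shrunk toward zero relative to `w_mle`, and the two agree as the prior variance grows.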
Prediction with Bayesian Regression
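• For a new $x^*$, predict $y^*$
• Point estimate of $y^*$
$$y^* = \hat{w}_{MLE}^\top x^*$$
• Treating $y$ as a Gaussian random variable
$$p(y^* \mid x^*) = \mathcal{N}(\hat{w}_{MLE}^\top x^*, \hat{\sigma}^2_{MLE})$$
$$p(y^* \mid x^*) = \mathcal{N}(\hat{w}_{MAP}^\top x^*, \hat{\sigma}^2_{MAP})$$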
Full Bayesian Treatment
• Treating $y$ and $w$ as random variables
$$p(y^* \mid x^*) = \int p(y^* \mid x^*, w)\, p(w \mid X, y)\, dw$$
• This is also Gaussian!
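For the Gaussian prior and likelihood used here, the integral has a closed form: the predictive mean is $\bar{w}^\top x^*$ and the predictive variance is $\sigma^2 + x^{*\top} V x^*$, where $\bar{w}$ and $V$ are the posterior mean and covariance of $w$. A minimal sketch assuming NumPy and known $\sigma^2$, $\tau^2$; the function name `predictive` and the data are hypothetical.

```python
import numpy as np

def predictive(x_star, X, y, sigma2, tau2):
    """Mean and variance of p(y* | x*, X, y) under the Gaussian prior and likelihood."""
    D = X.shape[1]
    A = X.T @ X + (sigma2 / tau2) * np.eye(D)
    w_bar = np.linalg.solve(A, X.T @ y)
    V = sigma2 * np.linalg.inv(A)
    mean = w_bar @ x_star
    # Noise variance plus the uncertainty about w projected onto x*.
    var = sigma2 + x_star @ V @ x_star
    return mean, var

rng = np.random.default_rng(5)
X = rng.normal(size=(40, 2))
y = X @ np.array([0.5, -1.5]) + rng.normal(scale=0.3, size=40)

_, var_near = predictive(np.array([0.1, 0.1]), X, y, sigma2=0.09, tau2=1.0)
_, var_far = predictive(np.array([10.0, 10.0]), X, y, sigma2=0.09, tau2=1.0)
```

Unlike the plug-in predictions, the variance here depends on $x^*$: in this zero-intercept model it grows as $x^*$ moves away from the origin, and it never drops below the noise variance.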
Handling Non-linear Relationships
• Replace $x$ with non-linear functions $\phi(x)$
$$p(y \mid x, \theta) = \mathcal{N}(w^\top \phi(x), \sigma^2)$$
• Model is still linear in $w$
• Also known as basis function expansion
Example
$$\phi(x) = [1, x, x^2, \ldots, x^p]$$
• Increasing $p$ results in more complex fits
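The polynomial expansion can be sketched with ordinary least squares on the expanded features. This example assumes NumPy; the target `sin(3x)` and the degrees tried are made up for illustration.

```python
import numpy as np

def poly_features(x, p):
    """Map scalar inputs x to rows [1, x, x^2, ..., x^p]."""
    return np.vander(x, N=p + 1, increasing=True)

rng = np.random.default_rng(6)
x = np.linspace(-1.0, 1.0, 30)
# Hypothetical non-linear target with a little noise.
y = np.sin(3.0 * x) + rng.normal(scale=0.1, size=x.size)

def train_error(p):
    """Sum of squared training residuals for a degree-p polynomial fit."""
    Phi = poly_features(x, p)
    w, *_ = np.linalg.lstsq(Phi, y, rcond=None)   # still linear in w
    return np.sum((y - Phi @ w) ** 2)

errors = {p: train_error(p) for p in (1, 3, 9)}
```

Training error can only go down as `p` increases, because each lower-degree model is a special case of the higher-degree one; this says nothing about error on unseen data.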
How to Control Overfitting?
• Use simpler models (linear instead of polynomial)
• Might have poor results (underfitting)
• Use regularized complex models
$$\hat{\Theta} = \arg\min_{\Theta} J(\Theta) + \lambda R(\Theta)$$
• $R()$ corresponds to the penalty paid for complexity of the model
Examples of Regularization
Ridge Regression
$$\hat{w} = \arg\min_{w} J(w) + \lambda \|w\|^2$$
• Also known as $\ell_2$ or Tikhonov regularization
• Helps in reducing impact of correlated inputs
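A minimal sketch of ridge regression on correlated inputs, assuming NumPy and assuming $J(w)$ is the squared loss $\frac{1}{2}\sum_i (y_i - w^\top x_i)^2$ used earlier. Under that assumption, setting the gradient to zero gives the linear system $(X^\top X + 2\lambda I) w = X^\top y$; other scalings of the penalty (for example $\frac{\lambda}{2}\|w\|^2$) change the factor in front of $I$. The function name `ridge` and the data are hypothetical.

```python
import numpy as np

def ridge(X, y, lam):
    """Minimize 0.5 * ||y - Xw||^2 + lam * ||w||^2 in closed form."""
    D = X.shape[1]
    # Setting the gradient -X^T (y - Xw) + 2 * lam * w to zero gives this system.
    return np.linalg.solve(X.T @ X + 2.0 * lam * np.eye(D), X.T @ y)

rng = np.random.default_rng(7)
x1 = rng.normal(size=100)
x2 = x1 + 0.01 * rng.normal(size=100)   # strongly correlated with x1
X = np.column_stack([x1, x2])
y = x1 + x2 + rng.normal(scale=0.1, size=100)

w_ols = ridge(X, y, lam=0.0)    # plain least squares
w_ridge = ridge(X, y, lam=1.0)  # penalized weights
```

The added diagonal term keeps the system well conditioned even when columns of $X$ are nearly identical, which is one way to read the claim about correlated inputs.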
Least Absolute Shrinkage and Selection Operator - LASSO
$$\hat{w} = \arg\min_{w} J(w) + \lambda \|w\|_1$$
• Also known as $\ell_1$ regularization
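Unlike ridge, the $\ell_1$ penalty has no closed-form solution in general. One standard way to solve it, shown here as a sketch and not as the method used in the slides, is cyclic coordinate descent with soft thresholding. The example assumes NumPy and the squared loss $J(w) = \frac{1}{2}\sum_i (y_i - w^\top x_i)^2$; function names, the value of `lam`, and the data are hypothetical.

```python
import numpy as np

def soft_threshold(rho, lam):
    """Shrink rho toward zero by lam, returning 0 when |rho| <= lam."""
    return np.sign(rho) * max(abs(rho) - lam, 0.0)

def lasso_cd(X, y, lam, n_iters=200):
    """Minimize 0.5 * ||y - Xw||^2 + lam * ||w||_1 by cyclic coordinate descent."""
    N, D = X.shape
    w = np.zeros(D)
    col_sq = np.sum(X ** 2, axis=0)   # x_j^T x_j for each column
    for _ in range(n_iters):
        for j in range(D):
            # Residual with feature j's current contribution removed.
            r_j = y - X @ w + X[:, j] * w[j]
            rho = X[:, j] @ r_j
            # Exact minimizer of the objective along coordinate j.
            w[j] = soft_threshold(rho, lam) / col_sq[j]
    return w

rng = np.random.default_rng(8)
X = rng.normal(size=(100, 5))
w_true = np.array([2.0, 0.0, 0.0, -3.0, 0.0])   # hypothetical sparse weights
y = X @ w_true + rng.normal(scale=0.1, size=100)

w_lasso = lasso_cd(X, y, lam=20.0)
```

The thresholding step is what sets some weights exactly to zero, which is the "selection" part of the name; the surviving weights are shrunk toward zero.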
Parameter Estimation for Ridge Regression
Exact Loss Function
J(w) = 1 2
N
X
i =1