Introduction to Machine Learning

Academic year: 2021

(1)

Introduction to Machine Learning

Linear Regression

Varun Chandola

Computer Science & Engineering, State University of New York at Buffalo

Buffalo, NY, USA

(2)

Outline

- Linear Regression
- Problem Formulation
- Geometric Interpretation
- Learning Parameters
- Recap
- Issues with Linear Regression
- Bayesian Linear Regression
- Bayesian Regression
- Estimating Bayesian Regression Parameters
- Prediction with Bayesian Regression
- Handling Non-linear Relationships
- Handling Overfitting via Regularization

(3)

Taking the next step

Hypothesis Space, H

- Conjunctive
- Disjunctive
- Disjunctions of k attributes
- Linear hyperplanes
- $c \notin H$
- Non-linear network

Input Space, x

- $x \in \{0, 1\}^d$
- $x \in \mathbb{R}^d$

Output Space, y

- $y \in \{0, 1\}$
- $y \in \{-1, +1\}$
- $y \in \mathbb{R}$

(4)

Linear Regression

- There is one scalar target variable y (instead of hidden)
- There is one vector input variable x
- Inductive bias:

$$y = w^\top x$$

Linear Regression Learning Task

Learn w given training examples, $\langle X, y \rangle$.

(5)

Two Interpretations

1. Probabilistic Interpretation

- y is assumed to be normally distributed:

$$y \sim \mathcal{N}(w^\top x, \sigma^2)$$

- or, equivalently:

$$y = w^\top x + \epsilon, \quad \text{where } \epsilon \sim \mathcal{N}(0, \sigma^2)$$

- y is a linear combination of the input variables
- Given w and $\sigma^2$, one can find the probability distribution of y for a given x
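
The probabilistic view can be made concrete by sampling from the model. The following is a minimal sketch, assuming made-up values for w and σ (the names `w_true`, `sigma`, and `log_density` are illustrative and do not come from the slides).

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical "true" parameters, chosen only for illustration.
w_true = np.array([2.0, -1.0])
sigma = 0.5

# N inputs of dimension d = 2.
N = 1000
X = rng.normal(size=(N, 2))

# y = w^T x + eps, with eps ~ N(0, sigma^2).
eps = rng.normal(loc=0.0, scale=sigma, size=N)
y = X @ w_true + eps


def log_density(y_value, x, w, sigma):
    """Log of N(y | w^T x, sigma^2) for a single input x."""
    mean = w @ x
    return -0.5 * np.log(2 * np.pi * sigma**2) - (y_value - mean) ** 2 / (2 * sigma**2)


# The spread of y around w^T x should be close to sigma.
residual_std = np.std(y - X @ w_true)
```

Given w and σ, `log_density` evaluates the distribution of y for any particular x, which is the quantity the last bullet refers to.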

(6)

Two Interpretations

2. Geometric Interpretation

- Fitting a straight line to d-dimensional data

$$y = w^\top x = w_1 x_1 + w_2 x_2 + \ldots + w_d x_d$$

- Will pass through origin
- Add intercept

$$y = w_0 + w_1 x_1 + w_2 x_2 + \ldots + w_d x_d$$

- Equivalent to adding another column in X of 1s.
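
As a small illustration of the column-of-1s trick, the sketch below augments a toy design matrix; the numbers and the name `X_aug` are assumptions made for the example.

```python
import numpy as np

# Toy design matrix with N = 3 examples and d = 2 attributes (made-up values).
X = np.array([[1.0, 2.0],
              [3.0, 4.0],
              [5.0, 6.0]])

# Prepend a column of 1s so that w[0] plays the role of the intercept w_0.
X_aug = np.hstack([np.ones((X.shape[0], 1)), X])

# With w = [w_0, w_1, w_2], each row of X_aug @ w equals w_0 + w_1*x_1 + w_2*x_2.
w = np.array([0.5, 1.0, -1.0])
y_hat = X_aug @ w
```

After this augmentation the model keeps the form of a plain inner product, so the estimators for w apply unchanged.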

(7)

Learning Parameters - MLE Approach

- Find w and $\sigma^2$ that maximize the likelihood of training data

$$\hat{w}_{MLE} = (X^\top X)^{-1} X^\top y$$

$$\hat{\sigma}^2_{MLE} = \frac{1}{N} (y - X\hat{w}_{MLE})^\top (y - X\hat{w}_{MLE})$$
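
The two MLE formulas can be computed directly with NumPy. This is a sketch on synthetic data, where the generating parameters (`w_true`, the noise scale 0.3) are assumptions; it solves the normal equations rather than forming the inverse explicitly, which is a common numerical choice and not something the slides prescribe.

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic data from y = X w + noise (parameters are illustrative assumptions).
N, d = 500, 3
X = rng.normal(size=(N, d))
w_true = np.array([1.0, -2.0, 0.5])
y = X @ w_true + rng.normal(scale=0.3, size=N)

# w_MLE solves the normal equations (X^T X) w = X^T y.
w_mle = np.linalg.solve(X.T @ X, X.T @ y)

# sigma^2_MLE is the mean squared residual evaluated at w_MLE.
residual = y - X @ w_mle
sigma2_mle = (residual @ residual) / N
```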

(8)

Learning Parameters - Least Squares Approach

- Minimize squared loss

$$J(w) = \frac{1}{2} \sum_{i=1}^{N} (y_i - w^\top x_i)^2$$

- Make prediction ($w^\top x_i$) as close to the target ($y_i$) as possible
- Least squares estimate

$$\hat{w} = (X^\top X)^{-1} X^\top y$$
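
The step from the squared loss to the least squares estimate is not spelled out on the slide. A short derivation, assuming $X^\top X$ is invertible, is sketched here in matrix form:

```latex
\begin{aligned}
J(w) &= \tfrac{1}{2} \sum_{i=1}^{N} (y_i - w^\top x_i)^2
      = \tfrac{1}{2} (y - Xw)^\top (y - Xw) \\
\nabla_w J(w) &= -X^\top (y - Xw) = X^\top X w - X^\top y \\
\nabla_w J(\hat{w}) = 0 \;&\Rightarrow\; X^\top X \hat{w} = X^\top y
      \;\Rightarrow\; \hat{w} = (X^\top X)^{-1} X^\top y
\end{aligned}
```

The result coincides with the maximum likelihood estimate of w under the Gaussian noise model.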

(9)

Gradient Descent Based Method

- Minimize the squared loss using Gradient Descent

$$J(w) = \frac{1}{2} \sum_{i=1}^{N} (y_i - w^\top x_i)^2$$

- Why?
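
The slide leaves the question open. One commonly cited reason is cost: the closed form requires solving a D x D linear system, while each gradient step only needs matrix-vector products with X. The following is a minimal batch gradient descent sketch; the step size `lr`, the iteration count, and the synthetic data are assumptions picked so the example converges, not values from the slides.

```python
import numpy as np


def gradient_descent(X, y, lr=0.1, n_iters=2000):
    """Batch gradient descent on J(w) / N, which has the same minimizer as J(w)."""
    N, d = X.shape
    w = np.zeros(d)
    for _ in range(n_iters):
        # Gradient of J(w) is -X^T (y - X w); dividing by N keeps the
        # step size roughly independent of the number of examples.
        grad = -X.T @ (y - X @ w) / N
        w = w - lr * grad
    return w


rng = np.random.default_rng(2)
X = rng.normal(size=(200, 3))
w_true = np.array([0.5, -1.0, 2.0])
y = X @ w_true + rng.normal(scale=0.1, size=200)

w_gd = gradient_descent(X, y)

# Closed-form least squares estimate, for comparison.
w_closed = np.linalg.solve(X.T @ X, X.T @ y)
```

On this well-conditioned problem the iterates approach the closed-form solution; with poorly scaled attributes a smaller step size or more iterations would likely be needed.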

(10)

Recap - Linear Regression

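Geometric

$$y = w^\top x$$

1. Least Squares

$$\hat{w} = (X^\top X)^{-1} X^\top y$$

2. Gradient Descent

$$J(w) = \frac{1}{2} \sum_{i=1}^{N} (y_i - w^\top x_i)^2$$

Bayesian

$$p(y) = \mathcal{N}(w^\top x, \sigma^2)$$

1. Maximum Likelihood Estimation

$$\hat{w}_{MLE} = (X^\top X)^{-1} X^\top y$$

$$\hat{\sigma}^2_{MLE} = \frac{1}{N} (y - X\hat{w}_{MLE})^\top (y - X\hat{w}_{MLE})$$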

(11)

Issues with Linear Regression

1. Not truly Bayesian
2. Susceptible to outliers
3. Too simplistic - Underfitting
4. No way to control overfitting
5. Unstable in presence of correlated input attributes
6. Gets “confused” by unnecessary attributes

(12)

Putting a Prior on w

- “Penalize” large values of w
- A zero-mean Gaussian prior

$$p(w) = \mathcal{N}(w \mid 0, \tau^2 I)$$

- What is the posterior of w?

$$p(w \mid D) \propto \prod_i \mathcal{N}(y_i \mid w^\top x_i, \sigma^2)\, p(w)$$

- Posterior is also Gaussian

(13)

Posterior Estimates of the Weight Vector

- MAP estimate of w

$$\arg\max_w \sum_{i=1}^{N} \log \mathcal{N}(y_i \mid w^\top x_i, \sigma^2) + \log \mathcal{N}(w \mid 0, \tau^2 I)$$
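
Expanding the two log-Gaussian terms shows why this objective behaves like a penalized least squares problem. The sketch below drops additive constants that do not depend on w; the symbol $\lambda = \sigma^2 / \tau^2$ is introduced here for convenience.

```latex
\begin{aligned}
\log \mathcal{N}(y_i \mid w^\top x_i, \sigma^2) &= -\frac{1}{2\sigma^2} (y_i - w^\top x_i)^2 + \text{const} \\
\log \mathcal{N}(w \mid 0, \tau^2 I) &= -\frac{1}{2\tau^2} w^\top w + \text{const} \\
\arg\max_w \Big[ \ldots \Big] &= \arg\min_w \; \frac{1}{2} \sum_{i=1}^{N} (y_i - w^\top x_i)^2
   + \frac{1}{2} \frac{\sigma^2}{\tau^2} \lVert w \rVert_2^2
\end{aligned}
```

A small prior variance $\tau^2$ therefore corresponds to a large penalty $\lambda$ on the squared norm of w.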

(14)

Parameter Estimation for Bayesian Regression

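- Prior for w

$$w \sim \mathcal{N}(w \mid 0, \tau^2 I_D)$$

- Posterior for w

$$p(w \mid y, X) = \frac{p(y \mid X, w)\, p(w)}{p(y \mid X)} = \mathcal{N}\!\left(\bar{w},\; \sigma^2 \left(X^\top X + \frac{\sigma^2}{\tau^2} I_D\right)^{-1}\right)$$

$$\bar{w} = \left(X^\top X + \frac{\sigma^2}{\tau^2} I_D\right)^{-1} X^\top y$$

- Posterior distribution for w is also Gaussian
- What will be the MAP estimate for w?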
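
Since the mode of a Gaussian equals its mean, the MAP estimate coincides with the posterior mean. The sketch below computes the posterior mean and covariance from the formula above; the function name `posterior`, the synthetic data, and the values of `sigma2` and `tau2` are assumptions for illustration.

```python
import numpy as np


def posterior(X, y, sigma2, tau2):
    """Gaussian posterior N(w_bar, S) for w under the prior N(0, tau2 * I_D)."""
    D = X.shape[1]
    A = X.T @ X + (sigma2 / tau2) * np.eye(D)
    w_bar = np.linalg.solve(A, X.T @ y)  # posterior mean (also the MAP estimate)
    S = sigma2 * np.linalg.inv(A)        # posterior covariance
    return w_bar, S


rng = np.random.default_rng(3)
X = rng.normal(size=(50, 2))
y = X @ np.array([1.0, -1.0]) + rng.normal(scale=0.5, size=50)

w_bar, S = posterior(X, y, sigma2=0.25, tau2=1.0)

# A very broad prior (large tau2) should recover the least squares estimate.
w_flat, _ = posterior(X, y, sigma2=0.25, tau2=1e12)
w_ls = np.linalg.solve(X.T @ X, X.T @ y)
```

With a finite `tau2` the posterior mean is pulled toward zero relative to the least squares estimate, which is the "penalize large values of w" effect of the prior.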

(15)

Prediction with Bayesian Regression

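- For a new $x^*$, predict $y^*$
- Point estimate of $y^*$

$$y^* = \hat{w}_{MLE}^\top x^*$$

- Treating $y^*$ as a Gaussian random variable

$$p(y^* \mid x^*) = \mathcal{N}(\hat{w}_{MLE}^\top x^*, \hat{\sigma}^2_{MLE})$$

$$p(y^* \mid x^*) = \mathcal{N}(\hat{w}_{MAP}^\top x^*, \hat{\sigma}^2_{MAP})$$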

(16)

Full Bayesian Treatment

- Treating y and w as random variables

$$p(y^* \mid x^*) = \int p(y^* \mid x^*, w)\, p(w \mid X, y)\, dw$$

- This is also Gaussian!
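
For a Gaussian posterior over w with mean $\bar{w}$ and covariance S, the integral has a standard closed form: the predictive mean is $\bar{w}^\top x^*$ and the predictive variance is $\sigma^2 + x^{*\top} S x^*$. The sketch below evaluates that closed form and compares it with a Monte Carlo estimate; the posterior parameters and the query point are hypothetical values chosen only to make the example runnable.

```python
import numpy as np


def predictive(x_star, w_bar, S, sigma2):
    """Mean and variance of p(y* | x*) after integrating out w ~ N(w_bar, S)."""
    mean = w_bar @ x_star
    var = sigma2 + x_star @ S @ x_star
    return mean, var


# Hypothetical posterior parameters and query point.
w_bar = np.array([1.0, -0.5])
S = np.array([[0.04, 0.01],
              [0.01, 0.09]])
sigma2 = 0.25
x_star = np.array([2.0, 1.0])

mean, var = predictive(x_star, w_bar, S, sigma2)

# Monte Carlo check: draw w from the posterior, then y* from N(w^T x*, sigma2).
rng = np.random.default_rng(4)
n_samples = 200_000
w_samples = rng.multivariate_normal(w_bar, S, size=n_samples)
y_samples = w_samples @ x_star + rng.normal(scale=np.sqrt(sigma2), size=n_samples)
```

Unlike the plug-in predictions, the variance here grows with $x^{*\top} S x^*$, so predictions are less certain where the weights are less well determined.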

(17)

Handling Non-linear Relationships

- Replace x with non-linear functions $\phi(x)$

$$p(y \mid x, \theta) = \mathcal{N}(w^\top \phi(x), \sigma^2)$$

- Model is still linear in w
- Also known as basis function expansion

Example

$$\phi(x) = [1, x, x^2, \ldots, x^p]$$

- Increasing p results in more complex fits
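
The polynomial expansion can be sketched with `np.vander`. The target function, noise level, and the degrees compared below are assumptions made for the example; the point is only that fitting stays an ordinary least squares problem in w.

```python
import numpy as np


def poly_features(x, p):
    """Map scalar inputs x to phi(x) = [1, x, x^2, ..., x^p]."""
    return np.vander(x, N=p + 1, increasing=True)


rng = np.random.default_rng(5)
x = np.linspace(-1.0, 1.0, 30)
# Illustrative non-linear target (an assumption, not from the slides).
y = np.sin(3 * x) + rng.normal(scale=0.1, size=x.shape)

train_error = {}
for p in (1, 3, 9):
    Phi = poly_features(x, p)
    # Least squares in the expanded feature space; still linear in w.
    w, *_ = np.linalg.lstsq(Phi, y, rcond=None)
    train_error[p] = np.mean((y - Phi @ w) ** 2)
```

Training error cannot increase as p grows because each lower-degree model is a special case of the higher-degree one, which is why training error alone does not reveal overfitting.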

(18)

How to Control Overfitting?

- Use simpler models (linear instead of polynomial)
- Might have poor results (underfitting)
- Use regularized complex models

$$\hat{\Theta} = \arg\min_{\Theta} J(\Theta) + \lambda R(\Theta)$$

- $R(\Theta)$ corresponds to the penalty paid for complexity of the model

(19)

Examples of Regularization

Ridge Regression

$$\hat{w} = \arg\min_w J(w) + \lambda \lVert w \rVert_2^2$$

- Also known as $\ell_2$ or Tikhonov regularization
- Helps in reducing impact of correlated inputs

Least Absolute Shrinkage and Selection Operator - LASSO

$$\hat{w} = \arg\min_w J(w) + \lambda \lVert w \rVert_1$$

- Also known as $\ell_1$ regularization

(20)

Parameter Estimation for Ridge Regression

Exact Loss Function

$$J(w) = \frac{1}{2} \sum_{i=1}^{N} (y_i - w^\top x_i)^2 + \frac{1}{2} \lambda \lVert w \rVert_2^2$$

MAP Estimate of w

$$\hat{w}_{MAP} = (\lambda I_D + X^\top X)^{-1} X^\top y$$
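
The closed-form ridge estimate is straightforward to compute. The sketch below applies it to two nearly identical attributes, a case where plain least squares tends to be unstable; the data, the value `lam=1.0`, and the function name `ridge` are illustrative assumptions.

```python
import numpy as np


def ridge(X, y, lam):
    """Closed-form ridge estimate (lam * I_D + X^T X)^{-1} X^T y."""
    D = X.shape[1]
    return np.linalg.solve(lam * np.eye(D) + X.T @ X, X.T @ y)


rng = np.random.default_rng(6)
N = 100
x1 = rng.normal(size=N)
# Second attribute is almost a copy of the first (strongly correlated inputs).
x2 = x1 + rng.normal(scale=1e-3, size=N)
X = np.column_stack([x1, x2])
y = x1 + rng.normal(scale=0.1, size=N)

lam = 1.0
w_ridge = ridge(X, y, lam)

# At the minimizer the gradient of the ridge loss vanishes:
# -X^T (y - X w) + lam * w = 0.
grad = -X.T @ (y - X @ w_ridge) + lam * w_ridge
```

Here the penalty spreads the weight roughly evenly across the two correlated attributes instead of letting them take large values of opposite sign.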
