Machine Learning and Data Mining. Regression Problem. (adapted from) Prof. Alexander Ihler

(1)

Machine Learning and Data Mining

Regression Problem

(adapted from) Prof. Alexander Ihler

(2)

Overview

• Regression Problem Definition and define

parameters

ϴ.

• Prediction using

ϴ as parameters

• Measure the error

• Finding good parameters

ϴ (direct

minimization problem)

(3)

Example of Regression

• Vehicle price estimation problem

– Features x: Fuel type, The number of doors, Engine size

– Targets y: Price of vehicle

– Training Data: (fit the model)

– Testing Data: (evaluate the model)

Fuel type The number of doors Engine size (61 – 326) Price 1 4 70 11000 1 2 120 17000 2 2 310 41000

Fuel type The number of doors

Engine size (61 – 326)

Price

(4)

Example of Regression

• Vehicle price estimation problem

– ϴ = [-300, -200 , 700 , 130]

– Ex #1 = -300 + -200 *1 + 4*700 + 70*130 = 11400

– Ex #2 = -300 + -200 *1 + 2*700 + 120*130 = 16500

– Ex #3 = -300 + -200 *2 + 2*700 + 310*130 = 41000

• Mean Absolute Training Error = 1/3 *(400 + 500 + 0) = 300

– Test = -300 + -200 *2 + 4*700 + 150*130 = 21600

• Mean Absolute Testing Error = 1/1 *(600) = 600

Fuel type #of doors Engine size Price

1 4 70 11000

1 2 120 17000

2 2 310 41000

Fuel type The number of

doors

Engine size (61 – 326)

Price

(5)

Supervised learning

• Notation

– Features x (input variables)

– Targets y (output variables)

– Predictions ŷ – Parameters θ Program (“Learner”) Characterized by some “parameters” θ Procedure (using θ) that outputs a prediction

Training data (examples) Features Learning algorithm Change θ Improve performance Feedback / Target values Evaluation of the model (measure error)

(6)

Overview

• Regression Problem Definition and

parameters.

• Prediction using

ϴ as parameters

• Measure the error

• Finding good parameters

ϴ (direct

minimization problem)

(7)

Linear regression

• Define form of function f(x) explicitly

• Find a good f(x) within that family

0 10 20 0 20 40 T ar ge t y Feature X₁ “Predictor”: Evaluate line: ӯ = ϴ₀+ ϴ₁ * X₁ return ӯ ӯ = Predicted target value (Black line)

(c) Alexander Ihler

New instance with X₁=8 Predicted value =17

(8)

More dimensions?

0 10 20 30 40 0 10 20 30 20 22 24 26 0 10 20 30 40 0 10 20 30 20 22 24 26

x

₁

x

₂

y

x

₁

x

₂

y

(9)

Notation

Define “feature” x

₀

= 1 (constant)

Then

(c) Alexander Ihler

n = the number of features in dataset

(10)

Overview

• Regression Problem Definition and

parameters.

• Prediction using

ϴ as parameters

• Measure the error

• Finding good parameters

ϴ (direct

minimization problem)

(11)

Supervised learning

• Notation

– Features x (input variables)

– Targets y (output variables)

– Predictions ŷ – Parameters θ Program (“Learner”) Characterized by some “parameters” θ Procedure (using θ) that outputs a prediction

(12)

Measuring error

0 20 0

Error or “residual”

Prediction

Observation

Red points = Real target values Black line = ӯ (predicted value) ӯ = ϴ₀+ ϴ₁ * X

Blue lines = Error (Difference between real value y and predicted value ӯ)

(13)

Mean Squared Error

• How can we quantify the error?

• Y= Real target value in dataset,

• ӯ = Predicted target value by ϴ*X

• Training Error: m= the number of training instances,

• Testing Error: Using a partition of Training error to check predicted

values. m= the number of testing instances,

(c) Alexander Ihler

(14)

MSE cost function

• Rewrite using matrix form

m=number of instance of data n = the number of features,

X = input variables in dataset y= output variable in dataset

(15)

Visualizing the error function

-1 -0.5 0 0.5 1 1.5 2 2.5 3 -40 -30 -20 -10 0 10 20 30 40 (c) Alexander Ihler

θ

₁

J(θ

)

_{Dimensions are}_{ϴ0 and ϴ1 instead of}

X1 and X2

Output is J instead of y as target value

Representation of J in 2D space. Inner red circles has less value of J Outer red circles has higher value of J

J is error function.

θ

₀

θ

1

The plane is the value of J, not the plane fitted to output values.

(16)

Overview

• Regression Problem Definition and

parameters.

• Prediction using

ϴ as parameters

• Measure the error

• Finding good parameters

ϴ (direct

minimization problem)

(17)

Supervised learning

• Notation

– Features x – Targets y – Predictions ŷ – Parameters θ Program (“Learner”) Characterized by some “parameters” θ Procedure (using θ) that outputs a prediction

(18)

Finding good parameters

• Want to find parameters which minimize our error…

(19)

MSE Minimum (m <= n+1)

• Consider a simple problem

– One feature, two data points

– Two unknowns and two equations:

• Can solve this system directly:

(c) Alexander Ihler

m=number of instance of data n = the number of features,

Theta gives a line or plane that exactly fit to all target values. n +1=1+1 = 2

(20)

SSE Minimum (m > n+1)

• Reordering, we have

•Most of the time, m > n

– There may be no linear function that hits all the data exactly

– Minimum of a function has gradient equal to zero (gradient is a

horizontal line.)

Just need to know how to compute

n +1=1+1 = 2 m=3

(21)

0 2 4 6 8 10 12 14 16 18 20 0 2 4 6 8 10 12 14 16 18

Effects of Mean Square Error choice

16 2 cost for this one datum

Heavy penalty for large errors

Distract line from other points.

(c) Alexander Ihler

(22)

Absolute error

2 4 6 8 10 12 14 16

18 MSE, original data

Abs error, original data

(23)

Error functions for regression

“Arbitrary” Error functions

can’t be solved in closed form… So as alternative way, use

gradient descent (Mean Square Error)

(Mean Absolute Error)

Something else entirely…

(???)

(24)

Overview

• Regression Problem Definition and

parameters.

• Prediction using

ϴ as parameters

• Measure the error

• Finding good parameters

ϴ (direct

minimization problem)

(25)

Nonlinear functions

• Single feature x, predict target y:

• Sometimes useful to think of “feature transform”

– Convert a non-linear function to linear function and then

solve it.

Add features:

Linear regression in new features

(26)

Higher-order polynomials

• Are more features better?

• “Nested” hypotheses

– 2nd_{order more general than 1}st_,

– 3rd_{order “ “ than 2}nd_{, …}

• Fits the observed data better

18nd_order

Y = ϴ₀

Y = ϴ₀+ ϴ₁ * X Y = ϴ₀+ ϴ₁ * X + ϴ₂*X2₊ϴ

(27)

Test data

• After training the model

• Go out and get more data from the world

– New observations (x,y)

• How well does our model perform?

(c) Alexander Ihler

Training data

(28)

0 1 2 3 4 5 6 7 8 9 10 0 5 10 15 20 25 30

Training data

Training versus test error

• Plot MSE as a function

of model complexity

– Polynomial order

• Decreases

– More complex function

fits training data better

• What about new data?

Mea n s q u a red err o r Polynomial order

New, “test” data

• 0th_{to 2}st_order – Error decreases – Underfitting • Higher order – Error increases Overfitting Under fitting