Machine Learning and Data Mining
Regression Problem
(adapted from) Prof. Alexander Ihler
Overview
•
Regression Problem Definition and define
parameters
ϴ.
•
Prediction using
ϴ as parameters
•
Measure the error
•
Finding good parameters
ϴ (direct
minimization problem)
Example of Regression
•
Vehicle price estimation problem
– Features x: Fuel type, The number of doors, Engine size
– Targets y: Price of vehicle
– Training Data: (fit the model)
– Testing Data: (evaluate the model)
Fuel type The number of doors Engine size (61 – 326) Price 1 4 70 11000 1 2 120 17000 2 2 310 41000
Fuel type The number of doors
Engine size (61 – 326)
Price
Example of Regression
•
Vehicle price estimation problem
– ϴ = [-300, -200 , 700 , 130]
– Ex #1 = -300 + -200 *1 + 4*700 + 70*130 = 11400
– Ex #2 = -300 + -200 *1 + 2*700 + 120*130 = 16500
– Ex #3 = -300 + -200 *2 + 2*700 + 310*130 = 41000
• Mean Absolute Training Error = 1/3 *(400 + 500 + 0) = 300
– Test = -300 + -200 *2 + 4*700 + 150*130 = 21600
• Mean Absolute Testing Error = 1/1 *(600) = 600
Fuel type #of doors Engine size Price
1 4 70 11000
1 2 120 17000
2 2 310 41000
Fuel type The number of
doors
Engine size (61 – 326)
Price
Supervised learning
•
Notation
– Features x (input variables)
– Targets y (output variables)
– Predictions ŷ – Parameters θ Program (“Learner”) Characterized by some “parameters” θ Procedure (using θ) that outputs a prediction
Training data (examples) Features Learning algorithm Change θ Improve performance Feedback / Target values Evaluation of the model (measure error)
Overview
•
Regression Problem Definition and
parameters.
•
Prediction using
ϴ as parameters
•
Measure the error
•
Finding good parameters
ϴ (direct
minimization problem)
Linear regression
•
Define form of function f(x) explicitly
•
Find a good f(x) within that family
0 10 20 0 20 40 T ar ge t y Feature X1 “Predictor”: Evaluate line: ӯ = ϴ0 + ϴ1 * X1 return ӯ ӯ = Predicted target value (Black line)
(c) Alexander Ihler
New instance with X1=8 Predicted value =17
More dimensions?
0 10 20 30 40 0 10 20 30 20 22 24 26 0 10 20 30 40 0 10 20 30 20 22 24 26x
1x
2y
x
1x
2y
Notation
Define “feature” x
0= 1 (constant)
Then
(c) Alexander Ihler
n = the number of features in dataset
Overview
•
Regression Problem Definition and
parameters.
•
Prediction using
ϴ as parameters
•
Measure the error
•
Finding good parameters
ϴ (direct
minimization problem)
Supervised learning
•
Notation
– Features x (input variables)
– Targets y (output variables)
– Predictions ŷ – Parameters θ Program (“Learner”) Characterized by some “parameters” θ Procedure (using θ) that outputs a prediction
Training data (examples) Features Learning algorithm Change θ Improve performance Feedback / Target values Evaluation of the model (measure error)
Measuring error
0 20 0Error or “residual”
Prediction
Observation
Red points = Real target values Black line = ӯ (predicted value) ӯ = ϴ0 + ϴ1 * X
Blue lines = Error (Difference between real value y and predicted value ӯ)
Mean Squared Error
•
How can we quantify the error?
• Y= Real target value in dataset,
• ӯ = Predicted target value by ϴ*X
• Training Error: m= the number of training instances,
• Testing Error: Using a partition of Training error to check predicted
values. m= the number of testing instances,
(c) Alexander Ihler
MSE cost function
•
Rewrite using matrix form
m=number of instance of data n = the number of features,
X = input variables in dataset y= output variable in dataset
Visualizing the error function
-1 -0.5 0 0.5 1 1.5 2 2.5 3 -40 -30 -20 -10 0 10 20 30 40 (c) Alexander Ihlerθ
1J(θ
)
Dimensions are ϴ0 and ϴ1 instead ofX1 and X2
Output is J instead of y as target value
Representation of J in 2D space. Inner red circles has less value of J Outer red circles has higher value of J
J is error function.
θ
0θ
1The plane is the value of J, not the plane fitted to output values.
Overview
•
Regression Problem Definition and
parameters.
•
Prediction using
ϴ as parameters
•
Measure the error
•
Finding good parameters
ϴ (direct
minimization problem)
Supervised learning
•
Notation
– Features x – Targets y – Predictions ŷ – Parameters θ Program (“Learner”) Characterized by some “parameters” θ Procedure (using θ) that outputs a predictionTraining data (examples) Features Learning algorithm Change θ Improve performance Feedback / Target values Evaluation of the model (measure error)
Finding good parameters
•
Want to find parameters which minimize our error…
MSE Minimum (m <= n+1)
•
Consider a simple problem
– One feature, two data points
– Two unknowns and two equations:
•
Can solve this system directly:
(c) Alexander Ihler
m=number of instance of data n = the number of features,
Theta gives a line or plane that exactly fit to all target values. n +1=1+1 = 2
SSE Minimum (m > n+1)
•
Reordering, we have
•Most of the time, m > n– There may be no linear function that hits all the data exactly
– Minimum of a function has gradient equal to zero (gradient is a
horizontal line.)
Just need to know how to compute
n +1=1+1 = 2 m=3
0 2 4 6 8 10 12 14 16 18 20 0 2 4 6 8 10 12 14 16 18
Effects of Mean Square Error choice
16 2 cost for this one datum
Heavy penalty for large errors
Distract line from other points.
(c) Alexander Ihler
Absolute error
2 4 6 8 10 12 14 1618 MSE, original data
Abs error, original data
Error functions for regression
“Arbitrary” Error functions
can’t be solved in closed form… So as alternative way, use
gradient descent (Mean Square Error)
(Mean Absolute Error)
Something else entirely…
(???)
Overview
•
Regression Problem Definition and
parameters.
•
Prediction using
ϴ as parameters
•
Measure the error
•
Finding good parameters
ϴ (direct
minimization problem)
Nonlinear functions
•
Single feature x, predict target y:
•
Sometimes useful to think of “feature transform”
– Convert a non-linear function to linear function and then
solve it.
Add features:
Linear regression in new features
Higher-order polynomials
•
Are more features better?
•
“Nested” hypotheses
– 2nd order more general than 1st,
– 3rd order “ “ than 2nd, …
•
Fits the observed data better
18nd order
Y = ϴ0
Y = ϴ0+ ϴ1 * X Y = ϴ0+ ϴ1 * X + ϴ2*X2 + ϴ
Test data
•
After training the model
•
Go out and get more data from the world
– New observations (x,y)
•
How well does our model perform?
(c) Alexander Ihler
Training data
0 1 2 3 4 5 6 7 8 9 10 0 5 10 15 20 25 30
Training data
Training versus test error
• Plot MSE as a function
of model complexity
– Polynomial order
• Decreases
– More complex function
fits training data better
• What about new data?
Mea n s q u a red err o r Polynomial order
New, “test” data
• 0th to 2st order – Error decreases – Underfitting • Higher order – Error increases Overfitting Under fitting