((even) more)
Decision Trees
(and regression)
CS540-002, Spring 2015 Lecture 34
● Upcoming AIRG
○ Causality -- An Introduction (Monday)
● WHW2 Released
○ Due Wednesday, April 22 (before class)
● Remember Project 4
○ Due April 17
○ Please turn in latedays.txt no matter what
Today:
● Decision Trees
○ Overfitting (and pruning)
● Regression
○ Linear
○ Polynomial
So Far:
Given data, we can build a consistent DT if:
● The setting is Classification.
● All features are categorical.
Roadmap:
● More on decision trees
○ Broadening Applicability (18.3.6)
○ Overfitting
■ Early Stopping
■ Pruning
● Regression (18.6)
○ Linear
○ Polynomial
● Practical Considerations (18.4)
Overfitting (Remedy 1):
Suppose we find ourselves in the following position.
[Figure: a decision tree over examples 1-13. The root splits on f139 (0: examples 1-7; 1: examples 8-13). The 8-13 branch splits on f10 (0: examples 8, 9; 1: examples 10-13). The 10-13 branch splits on f15 (0: examples 12, 13; 1: examples 10, 11).]
Overfitting (Remedy 1):
Early stopping.
Basic idea: if you’re close to 0 entropy, don’t split.
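A minimal sketch of this stopping rule, assuming entropy is measured in bits over the class labels at a node. The function names and the `threshold` value are illustrative assumptions, not values from the lecture.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    if total == 0:
        return 0.0
    counts = Counter(labels)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def should_stop(labels, threshold=0.1):
    """Return True if the node is already close enough to 0 entropy.

    `threshold` is a hypothetical tuning knob; a larger value stops earlier
    and produces a smaller tree.
    """
    return entropy(labels) <= threshold
```

A tree builder would call `should_stop` on a node's labels before searching for a split and, when it returns True, emit a leaf predicting the majority class.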
Alternative early stopping:
Overfitting (Remedy 2):
Basic idea:
Suppose we find that splitting on a feature has a low (but non-zero) information gain.
Maybe the feature in question is pure noise, and it just so happens that it helps separate the classes. Can we determine if it is actually an indicative feature?
Overfitting (Remedy 2):
Basic idea:
Suppose this feature is pure noise (the ‘null hypothesis’).
We got Δ better than that.
What’s the probability we ‘lucked out’ by so much? Less than 5%, you say? Then it’s probably a good split.
Overfitting (Remedy 2):
Suppose we split on a feature and get $p_k$ positive and $n_k$ negative examples in the kth branch of the split.
Overfitting (Remedy 2):
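Δ is distributed as $\chi^2$ (chi-squared) under the null hypothesis, with degrees of freedom equal to the number of branches minus one.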
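The test can be sketched as follows, assuming the standard chi-squared pruning statistic from the textbook treatment. If the feature were pure noise, each branch would be expected to contain positives and negatives in the same ratio as the parent node, and Δ sums the squared deviations from those expected counts. The function names are illustrative, and the p-value helper covers only a two-branch split (one degree of freedom), where the chi-squared tail has a closed form via `math.erfc`.

```python
import math


def chi_squared_delta(branches):
    """Total deviation between observed and expected counts.

    `branches` is a list of (p_k, n_k) pairs: the positive and negative
    example counts in each branch of the candidate split. Assumes the
    parent node is non-empty.
    """
    p = sum(pk for pk, _ in branches)
    n = sum(nk for _, nk in branches)
    delta = 0.0
    for pk, nk in branches:
        frac = (pk + nk) / (p + n)   # share of the examples in this branch
        p_hat = p * frac             # expected positives under the null
        n_hat = n * frac             # expected negatives under the null
        if p_hat > 0:
            delta += (pk - p_hat) ** 2 / p_hat
        if n_hat > 0:
            delta += (nk - n_hat) ** 2 / n_hat
    return delta


def p_value_binary_split(delta):
    """P(X >= delta) for X ~ chi-squared with 1 degree of freedom.

    Valid only for a split with exactly two branches.
    """
    return math.erfc(math.sqrt(delta / 2.0))
```

With this sketch, a split would be kept when `p_value_binary_split(chi_squared_delta(branches))` falls below 0.05, and pruned otherwise.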
Continuous or Ordinal Features:
Suppose we have a feature Height.
Example: [60, 64, 80, 70, 71, 68, 81, 55, 48, 70, 71]
Step 1: Sort.
[48, 55, 60, 64, 68, 70, 70, 71, 71, 80, 81]
Step 2: Consider each possible split.
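The two steps can be sketched as below. Placing each candidate threshold at the midpoint between consecutive distinct values is a common convention assumed here, not something the slide specifies; the function name is illustrative.

```python
def candidate_thresholds(values):
    """Midpoints between consecutive distinct values of a numeric feature."""
    distinct = sorted(set(values))
    return [(lo + hi) / 2 for lo, hi in zip(distinct, distinct[1:])]


heights = [60, 64, 80, 70, 71, 68, 81, 55, 48, 70, 71]
thresholds = candidate_thresholds(heights)
# thresholds == [51.5, 57.5, 62.0, 66.0, 69.0, 70.5, 75.5, 80.5]
```

Each threshold t turns Height into a binary test (Height <= t), which can then be scored by information gain like any categorical feature.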
Regression Trees:
Basic idea:
Replace leaves with simple regression models.
Two (sub) problems arise:
● How do we learn a regression model?
Regression:
Recall: In Classification, our target variable is one of a discrete set.
E.g., Author, WillWait, WillGetHeartDisease
In Regression, the target variable is real.
Given x ∈ ℝ, predict y where
$y = f(x) = w_1 x + w_0 + \text{noise}$
Univariate Linear Regression:
Basic idea:
Find the line which minimizes the sum of squared errors.
Univariate Linear Regression:
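Minimizing the loss:
$(w_0^*, w_1^*) = \arg\min_{(w_0, w_1)} \sum_j \left(y_j - (w_1 x_j + w_0)\right)^2$
Of all the pairs $(w_0, w_1)$, pick the one that minimizes the sum over all points $j$ of the loss incurred from the gap between the prediction for the jth point, $w_1 x_j + w_0$, and the true value, $y_j$.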
Let There be Calculus...
Key point:
Because the loss is quadratic, its derivative is linear. This yields a linear system.
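For the univariate case, setting the two partial derivatives of the squared-error loss to zero gives two linear equations in w_0 and w_1, which have a closed-form solution. A minimal sketch, assuming N data points with at least two distinct x values; the function name is illustrative.

```python
def fit_line(xs, ys):
    """Least-squares fit of y = w1 * x + w0; returns (w0, w1)."""
    n = len(xs)
    sx = sum(xs)
    sy = sum(ys)
    sxy = sum(x * y for x, y in zip(xs, ys))
    sxx = sum(x * x for x in xs)
    denom = n * sxx - sx * sx
    if denom == 0:
        raise ValueError("all x values are identical; the line is not unique")
    w1 = (n * sxy - sx * sy) / denom   # slope
    w0 = (sy - w1 * sx) / n            # intercept
    return w0, w1


w0, w1 = fit_line([0, 1, 2, 3], [1, 3, 5, 7])   # points on y = 2x + 1
```

Because the example points lie exactly on a line, the fit recovers that line; with noisy data the same formulas return the line with the smallest sum of squared errors.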
Multivariate Linear Regression:
Notational note:
The book denotes the ith feature of the jth datapoint as: $x_{j,i}$
Detour: This $w_0$ business is ugly
Note how $w_0$ is ‘different’ (it’s not multiplied by any feature).
We can get rid of it by augmenting x:
$[x_1, x_2, x_3] \to [1, x_1, x_2, x_3]$
$w_0 + w_1 x_1 + w_2 x_2 + w_3 x_3 = w \cdot [1, x_1, x_2, x_3]$
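A short sketch of the augmentation trick: once a constant 1 is prepended to the feature vector, the prediction is a plain dot product and the intercept needs no special handling. The function names are illustrative.

```python
def augment(x):
    """Prepend the constant feature 1 so the intercept acts like any other weight."""
    return [1.0] + list(x)


def predict(w, x):
    """Dot product of weights [w0, w1, ..., wn] with the augmented features."""
    features = augment(x)
    if len(w) != len(features):
        raise ValueError("expected one weight per feature plus the intercept")
    return sum(wi * xi for wi, xi in zip(w, features))


y_hat = predict([0.5, 2.0, 0.0, -1.0], [3.0, 7.0, 1.0])   # 0.5 + 2*3 + 0*7 - 1*1
```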
Multivariate Linear Regression:
(Univariate) ‘Polynomial’ Regression:
Given x ∈ ℝ, predict y where
$y = f(x) = w_2 x^2 + w_1 x + w_0 + \text{noise}$
Detour: This $w_0$ business is ugly
Note how $w_0$ is ‘different’ (it’s not multiplied by any feature).
We can get rid of it by augmenting x:
$[x_1] \to [1, x_1]$
$w_0 + w_1 x_1$
Remember this? $w_2$ is ‘different’ too: it is not multiplied by x (but rather by a new ‘feature’, $x^2$).
So add that feature: $[1, x_1]$ becomes $[1, x_1, x_1^2]$.
Now we’re back to the linear regression case!
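A sketch of this reduction: expand each scalar x into the features [1, x, x^2, ...] and fit ordinary linear regression on them. The lecture text does not say how the resulting linear system is solved; this example assumes the normal equations (XᵀX)w = Xᵀy, solved with a small Gaussian-elimination helper. All function names are illustrative.

```python
def poly_features(x, degree):
    """Map a scalar x to the feature vector [1, x, x^2, ..., x^degree]."""
    return [x ** d for d in range(degree + 1)]


def solve(a, b):
    """Solve the square system a w = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]   # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            raise ValueError("singular system; need more distinct x values")
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    w = [0.0] * n
    for i in reversed(range(n)):                     # back substitution
        tail = sum(m[i][c] * w[c] for c in range(i + 1, n))
        w[i] = (m[i][n] - tail) / m[i][i]
    return w


def fit_polynomial(xs, ys, degree):
    """Least-squares weights [w_0, ..., w_degree] via the normal equations."""
    rows = [poly_features(x, degree) for x in xs]
    k = degree + 1
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * y for r, y in zip(rows, ys)) for i in range(k)]
    return solve(xtx, xty)


xs = [-2, -1, 0, 1, 2]
ys = [3 - 2 * x + 0.5 * x * x for x in xs]           # noiseless quadratic
weights = fit_polynomial(xs, ys, degree=2)           # approximately [3, -2, 0.5]
```

The model is still linear in the weights even though it is quadratic in x, which is why the linear-regression machinery applies unchanged.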
(Univariate) ‘Polynomial’ Regression:
Given x ∈ ℝ, predict y where
$y = f(x) = w_3 x^3 + w_2 x^2 + w_1 x + w_0 + \text{noise}$