Lecture 34.pdf

(1)

((even) more) Decision Trees (and regression)

CS540-002, Spring 2015 Lecture 34

(2)

● Upcoming AIRG

○ Causality -- An Introduction (Monday)

● WHW2 Released

○ Due Wednesday, April 22 (before class)

● Remember Project 4

○ Due April 17

○ Please turn in latedays.txt no matter what

(3)

Today:

● Decision Trees

○ Overfitting (and pruning)

● Regression

○ Linear

○ Polynomial

(4)

So Far:

Given data, we can build a consistent DT if:

● The setting is Classification.

● All features are categorical.

(5)

Roadmap:

● More on decision trees

○ Broadening Applicability (18.3.6)

○ Overfitting

■ Early Stopping

■ Pruning

● Regression (18.6)

○ Linear

○ Polynomial

● Practical Considerations (18.4)

(6)

Overfitting (Remedy 1):

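Suppose we find ourselves in the following position.

[Figure: partially built decision tree. The root holds examples 1 to 13 and splits on f₁₃₉: branch 0 gets {1, 2, 3, 4, 5, 6, 7}, branch 1 gets {8, 9, 10, 11, 12, 13}. That node splits on f₁₀: branch 0 gets {8, 9}, branch 1 gets {10, 11, 12, 13}. That node splits on f₁₅: branch 0 gets {12, 13}, branch 1 gets {10, 11}.]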

(7)

Overfitting (Remedy 1):

Early stopping.

Basic idea: if you’re close to 0 entropy, don’t split.
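A minimal sketch of the entropy-based stopping rule. The function names and the threshold value 0.1 are illustrative assumptions; the slide only says "close to 0".

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    # Sum p * log2(1/p) over the classes that actually occur.
    return sum((c / n) * math.log2(n / c) for c in Counter(labels).values())


def should_stop(labels, threshold=0.1):
    """Early stopping: do not split a node whose entropy is already near 0.

    `threshold` is a hypothetical cutoff chosen for illustration.
    """
    return entropy(labels) <= threshold
```

A node with 99 examples of one class and 1 of the other has low entropy, so it would be left as a leaf under this cutoff, while an evenly mixed node would still be split.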

Alternative early stopping:

(8)

Overfitting (Remedy 2):

Basic idea:

Suppose we find that splitting on a feature has a low (but non-zero) information gain.

Maybe the feature in question is pure noise, and it just so happens that it helps separate the classes.

Can we determine whether it is actually an indicative feature?

(9)

Overfitting (Remedy 2):

Basic idea:

Suppose this feature is pure noise (the ‘null-hypothesis’)

We got Δ better than that.

What’s the probability we ‘lucked out’ by so much?

Less than 5%, you say? Then it’s probably a good split.

(10)

Overfitting (Remedy 2):

Suppose we split on a feature and get p positive, and n negative examples in the kth split.
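A sketch of how such a significance test could look, assuming the standard χ² pruning statistic for decision trees (the slide's own formula is not preserved here, so this is an assumption). Here `p_k` and `n_k` are the positive and negative counts in branch k, and `p`, `n` are the totals at the node. Under the null hypothesis each branch would receive positives and negatives in the node's overall proportion. Δ sums the squared deviations from those expected counts and is compared against a χ² critical value with (number of branches − 1) degrees of freedom. The 5% critical values are hardcoded for up to three degrees of freedom.

```python
# 5% critical values of the chi-squared distribution, keyed by degrees of freedom.
CHI2_CRITICAL_5PCT = {1: 3.841, 2: 5.991, 3: 7.815}


def deviation(splits):
    """Total deviation Delta for a split.

    `splits` is a list of (p_k, n_k) pairs, one per branch.
    """
    p = sum(pk for pk, _ in splits)
    n = sum(nk for _, nk in splits)
    total = p + n
    delta = 0.0
    for pk, nk in splits:
        size = pk + nk
        # Expected counts if the feature were pure noise.
        p_hat = p * size / total
        n_hat = n * size / total
        if p_hat > 0:
            delta += (pk - p_hat) ** 2 / p_hat
        if n_hat > 0:
            delta += (nk - n_hat) ** 2 / n_hat
    return delta


def split_is_significant(splits):
    """True if Delta exceeds the 5% critical value (assumes no empty branches)."""
    df = len(splits) - 1
    return deviation(splits) > CHI2_CRITICAL_5PCT[df]
```

For example, `split_is_significant([(10, 0), (0, 10)])` accepts a perfectly separating binary split, while `split_is_significant([(5, 5), (5, 5)])` rejects a split that leaves the class proportions unchanged.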

(11)

Overfitting (Remedy 2):

(12)

Δ is distributed as

(13)

Continuous or Ordinal Features:

Suppose we have a feature Height.

Example: [60, 64, 80, 70, 71, 68, 81, 55, 48, 70, 71]

Step 1: Sort.

[48, 55, 60, 64, 68, 70, 70, 71, 71, 80, 81]

Step 2: Consider each possible split.
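The two steps can be sketched as follows. Placing each candidate threshold at the midpoint between adjacent distinct values is an assumption made for illustration; the slide only says to consider each possible split. The function name is hypothetical.

```python
def candidate_thresholds(values):
    """Return candidate split points for a continuous or ordinal feature.

    Step 1: sort (duplicates are dropped, since they cannot be separated).
    Step 2: propose one threshold between each pair of adjacent values.
    """
    distinct = sorted(set(values))
    return [(lo + hi) / 2 for lo, hi in zip(distinct, distinct[1:])]


heights = [60, 64, 80, 70, 71, 68, 81, 55, 48, 70, 71]
thresholds = candidate_thresholds(heights)
```

Each threshold t defines a binary test such as Height ≤ t, which can then be scored like any categorical split (for instance by information gain).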

(14)

Regression Trees:

Basic idea:

Replace leaves with simple regression models.

Two (sub) problems arise:

● How do we learn a regression model?

(15)

Regression:

Recall: In Classification, our target variable is one of a discrete set.

E.g., Author, WillWait, WillGetHeartDisease

In Regression, the target variable is real-valued.

(16)

Univariate Linear Regression:

Given x ∈ ℝ, predict y where

y = f(x) = w₁x + w₀ + noise

(17)

(18)

(19)

Univariate Linear Regression:

Basic idea:

Find the line which minimizes the sum of squared errors.

(20)

Univariate Linear Regression:

Minimizing the loss:

Of all the pairs (w₀, w₁) ...

(21)

Univariate Linear Regression:

Sum over all points: the loss incurred from the prediction for the jth point and the true value.
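Written out, the objective described by these annotations would look roughly like the following. The squared-error form is an assumption based on the later remark that the loss is quadratic; N (the number of data points) and the starred symbols are notation introduced here.

```latex
(w_0^{*}, w_1^{*}) \;=\; \operatorname*{argmin}_{(w_0,\, w_1)} \; \sum_{j=1}^{N} \bigl( y_j - (w_1 x_j + w_0) \bigr)^{2}
```

Here w₁xⱼ + w₀ is the prediction for the jth point and yⱼ is its true value.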

(22)

(23)

Let There be Calculus...

Key point:

Because the loss is quadratic, its derivative is linear. This yields a linear system.
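As a sketch of where the calculus leads: setting the two partial derivatives of the squared-error loss to zero gives two linear equations in (w₀, w₁), whose solution has the closed form used below. The function name `fit_line` is hypothetical.

```python
def fit_line(xs, ys):
    """Least-squares fit of y = w1 * x + w0; returns (w0, w1).

    Solves the 2x2 linear system obtained by setting the derivatives of
    sum_j (y_j - (w1 * x_j + w0))**2 with respect to w0 and w1 to zero.
    """
    n = len(xs)
    sx = sum(xs)
    sy = sum(ys)
    sxy = sum(x * y for x, y in zip(xs, ys))
    sxx = sum(x * x for x in xs)
    denom = n * sxx - sx * sx
    if denom == 0:
        raise ValueError("all x values are identical; the slope is undefined")
    w1 = (n * sxy - sx * sy) / denom
    w0 = (sy - w1 * sx) / n
    return w0, w1
```

On noise-free points from y = 2x + 1, such as `fit_line([0, 1, 2, 3], [1, 3, 5, 7])`, the fit recovers the intercept 1 and slope 2.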

(24)

(25)

(26)

Multivariate Linear Regression:

Notational note:

The book denotes the ith feature of the jth datapoint as: x_{j, i}

(27)

Detour: This w₀ business is ugly

Note how w₀ is ‘different’ (it’s not multiplied by any feature).

We can get rid of it by augmenting x:

[x₁, x₂, x₃] → [1, x₁, x₂, x₃]

w₀ + w₁x₁ + w₂x₂ + w₃x₃
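A small sketch of the augmentation trick: once a constant 1 is prepended to x, the prediction is a plain dot product and w₀ needs no special handling. The function names are illustrative.

```python
def augment(x):
    """Prepend the constant feature 1 so that w0 multiplies something."""
    return [1.0] + list(x)


def predict(w, x):
    """Prediction as a dot product of w = [w0, w1, ..., wn] with augmented x."""
    xa = augment(x)
    if len(w) != len(xa):
        raise ValueError("w must have one more entry than x")
    return sum(wi * xi for wi, xi in zip(w, xa))
```

With w = [0.5, 1, 2, 3] and x = [1, 2, 3], `predict` computes 0.5 + 1·1 + 2·2 + 3·3.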

(28)

Multivariate Linear Regression:

(29)

(Univariate) ‘Polynomial’ Regression:

y = f(x) = w₂x² + w₁x + w₀ + noise

(30)

Detour: This w₀ business is ugly

Remember this?

Note how w₀ is ‘different’ (it’s not multiplied by any feature).

We can get rid of it by augmenting x:

[x₁] → [1, x₁]

w₀ + w₁x₁

w₂ is ‘different’ too: it is not multiplied by any existing feature (but rather, by a new ‘feature’, x²). So add that feature:

[1, x₁] → [1, x₁, x₁²]

w₀ + w₁x₁ becomes w₀ + w₁x₁ + w₂x₁²

(31)

Now we’re back to the linear regression case!
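A minimal sketch of this reduction: expand each x into the features [1, x, x², ...] and fit an ordinary linear least-squares model on them. Solving the normal equations with Gaussian elimination is one possible choice made here to keep the example dependency-free, not necessarily the method used in lecture. All function names are hypothetical.

```python
def poly_features(x, degree):
    """Map a scalar x to [1, x, x**2, ..., x**degree]."""
    return [x ** d for d in range(degree + 1)]


def solve(A, b):
    """Solve A w = b by Gaussian elimination with partial pivoting."""
    n = len(A)
    M = [row[:] + [bi] for row, bi in zip(A, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        if abs(M[pivot][col]) < 1e-12:
            raise ValueError("singular system")
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    # Back substitution.
    w = [0.0] * n
    for i in range(n - 1, -1, -1):
        s = sum(M[i][j] * w[j] for j in range(i + 1, n))
        w[i] = (M[i][n] - s) / M[i][i]
    return w


def fit_polynomial(xs, ys, degree):
    """Least-squares weights [w0, ..., w_degree] via the normal equations."""
    rows = [poly_features(x, degree) for x in xs]
    k = degree + 1
    AtA = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    Aty = [sum(r[i] * y for r, y in zip(rows, ys)) for i in range(k)]
    return solve(AtA, Aty)
```

The model stays linear in the weights even though it is quadratic (or cubic) in x, which is why the same machinery applies. Raising `degree` to 3 handles the cubic case with no other changes.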

(32)

(Univariate) ‘Polynomial’ Regression:

y = f(x) = w₃x³ + w₂x² + w₁x + w₀ + noise

(33)

No problem!

(34)

Let’s look at the [1, x] case.