Lecture 2: Supervised vs. unsupervised learning, bias-variance tradeoff

(1)


Reading: Chapter 2

Stats 202: Data Mining and Analysis

Lester Mackey

September 23, 2015

(Slide credits: Sergio Bacallado)

(2)

Announcements

- Homework 1 is online
  - Submit solutions to Gradescope by 1:30pm (PT) next Weds.
  - Remember our course policy: Late homework is not accepted, but you can submit partial solutions, and we will drop the lowest homework score
- We’ve created a pinned post on Piazza to help you find Kaggle competition teammates
- I’d like to get into the habit of repeating your questions (please remind me if I forget!)

(3)

Supervised vs. unsupervised learning

In unsupervised learning we seek to understand the relationships between observations or variables in a data matrix:

[Figure: data matrix with its two dimensions labeled "Variables or features" and "Data points, examples, or observations"]

- Quantitative variables: weight, height, number of children, ...
- Qualitative (categorical) variables: college major, profession, gender, ...

(4)

Supervised vs. unsupervised learning

In unsupervised learning we seek to understand the relationships between observations or variables in a data matrix.

Standard unsupervised learning goals include:

- Finding lower-dimensional representations of the data for improved visualization, interpretation, or compression. Dimensionality reduction: PCA, ICA, Isomap, locally linear embeddings, etc.
- Finding meaningful groupings of the data. Clustering.

Unsupervised learning is also known in Statistics as exploratory data analysis.

(5)

Supervised vs. unsupervised learning

In supervised learning, there are input variables and output variables:

[Figure: data matrix with its two dimensions labeled "Variables, features, or predictors" and "Data points, examples, or observations", split into input variables and an output variable]

- If the output variable is quantitative, we say this is a regression problem
- If the output variable is qualitative / categorical, we say this is a classification problem

(6)

Supervised vs. unsupervised learning

In supervised learning, there are input variables and output variables.

If X is the vector of inputs for a particular sample, a quantitative output variable Y is typically modeled as

$$Y = f(X) + \underbrace{\varepsilon}_{\text{random error}}$$

- f, fixed and unknown, captures the systematic information that X provides about Y
- ε represents the unpredictable “noise” in the problem

Supervised learning methods seek to learn the relationship between X and Y (e.g., by estimating the function f) using a set of training examples.
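To make the model Y = f(X) + ε concrete, here is a minimal simulation sketch. The choice f(x) = sin(x), the Gaussian noise, the input range, and the names `f`, `simulate`, and `sigma` are all assumptions made for illustration; in a real problem f and the noise distribution are unknown.

```python
import math
import random


def f(x):
    # Hypothetical "true" function, chosen only for illustration.
    return math.sin(x)


def simulate(n, sigma=0.3, seed=0):
    """Draw n training pairs (x_i, y_i) with y_i = f(x_i) + eps_i."""
    rng = random.Random(seed)
    xs = [rng.uniform(0.0, 6.0) for _ in range(n)]
    eps = [rng.gauss(0.0, sigma) for _ in range(n)]  # assumed Gaussian noise
    ys = [f(x) + e for x, e in zip(xs, eps)]
    return xs, ys, eps


xs, ys, eps = simulate(200)
mean_noise = sum(eps) / len(eps)
print(f"average noise over {len(eps)} samples: {mean_noise:.3f}")
```

A learning method only sees `xs` and `ys`; the split of each `y` into a systematic part and a noise part is hidden from it.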

(7)

Supervised vs. unsupervised learning

$$Y = f(X) + \underbrace{\varepsilon}_{\text{random error}}$$

Why learn the relationship between X and Y?

- Prediction: Useful when the input variable is readily available, but the output variable is not.
  Example: Predict stock prices next month using data from last year.
- Inference: A model for f can help us understand the structure of the data: which variables influence the output, and which don’t? What is the relationship between each variable and the output, e.g., linear, non-linear?
  Example: Infer which genes are associated with the incidence of heart disease.

(8)

Parametric and nonparametric methods:

Most supervised learning methods fall into one of two classes:

- Parametric methods assume that f has a fixed functional form with a fixed number of parameters, for example, a linear form:

  $$f(X) = X_1\beta_1 + \cdots + X_p\beta_p$$

  with parameters $\beta_1, \ldots, \beta_p$. Estimating f then reduces to estimating or fitting the parameters.
- Non-parametric methods do not assume a fixed functional form with a fixed number of parameters but often still limit how “wiggly” or “rough” the function f can be.
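The contrast between the two classes can be sketched in a few lines. The parametric side below fits a straight line by least squares (an intercept plus one slope, which matches the linear form above if one input is the constant 1); the non-parametric side uses K-nearest-neighbor averaging as one possible example of such a method. The data-generating function, the noise level, and the names `fit_line` and `knn_regress` are assumptions for illustration.

```python
import random


def fit_line(xs, ys):
    """Least-squares fit of the parametric form f(x) = b0 + b1 * x."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    b1 = sxy / sxx
    b0 = y_bar - b1 * x_bar
    return b0, b1


def knn_regress(xs, ys, x0, k):
    """Non-parametric estimate: average the outputs of the k nearest inputs."""
    nearest = sorted(range(len(xs)), key=lambda i: abs(xs[i] - x0))[:k]
    return sum(ys[i] for i in nearest) / k


rng = random.Random(1)
xs = [rng.uniform(0.0, 10.0) for _ in range(100)]
ys = [0.5 * x * x + rng.gauss(0.0, 1.0) for x in xs]  # hypothetical non-linear truth

b0, b1 = fit_line(xs, ys)
x0 = 5.0
print("linear fit at x0:", b0 + b1 * x0)
print("5-NN fit at x0:  ", knn_regress(xs, ys, x0, k=5))
```

The linear fit is summarized by two numbers regardless of the sample size, whereas the nearest-neighbor estimate has to keep the whole training set around.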

(9)

Parametric vs. nonparametric prediction

[Figures 2.4 and 2.5: two plots of Income against Years of Education and Seniority]

Parametric methods are fundamentally limited in their fit quality: if the true f does not have the assumed form, no amount of data removes the mismatch. Non-parametric methods can keep improving as we add more data to fit. Parametric methods are often simpler to interpret and are less prone to overfitting to noise in the data.

(10)

Evaluating supervised learning methods

Training data: $(x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n)$

Prediction function estimate: $\hat f$.

Question: How do we know if $\hat f$ is a good estimate?

Intuitive answer: $\hat f$ is good if it predicts well

- If $(x_0, y_0)$ is a new datapoint (not used in training), then $\hat f(x_0)$ and $y_0$ should be close
- A popular measure of closeness is the squared error $(y_0 - \hat f(x_0))^2$ (but other measures of prediction error are also used)

Given many test datapoints $\{(x_{0,i}, y_{0,i});\ i = 1, \ldots, m\}$ (not used in training), it is common and reasonable to use the average test prediction error, e.g., the test mean squared error (MSE)

$$\frac{1}{m} \sum_{i=1}^{m} (y_{0,i} - \hat f(x_{0,i}))^2$$

as a measure of the prediction quality of $\hat f$.

Our goal in supervised learning is to minimize the prediction error. For regression, this is typically the population Mean Squared Error:

$$\mathrm{MSE}(\hat f) = E(y_0 - \hat f(x_0))^2$$

where the expectation is over $\hat f$ and an independent test point $(x_0, y_0)$.

Unfortunately, this quantity cannot be computed, because we don’t know the distribution of $(x_0, y_0)$. We can compute a sample average using the training data; this is known as the training MSE:

$$\mathrm{MSE}_{\text{training}}(\hat f) = \frac{1}{n} \sum_{i=1}^{n} (y_i - \hat f(x_i))^2.$$

(11)

Evaluating supervised learning methods

Training data: $(x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n)$

Prediction function estimate: $\hat f$.

Question: What if you don’t have extra test data lying around?

Tempting solution: Use training error, e.g., training MSE

$$\frac{1}{n} \sum_{i=1}^{n} (y_i - \hat f(x_i))^2$$

as your measure of prediction quality instead.

Problem: Small training error does not imply small test error!
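The gap between training and test error is easy to reproduce in a small simulation. The sketch below uses K-nearest-neighbor regression as the estimate (a choice made here, not one prescribed by the slides), with smaller K meaning a more flexible fit. The function `true_f`, the noise level, and the sample sizes are hypothetical.

```python
import math
import random


def true_f(x):
    # Hypothetical true regression function.
    return math.sin(2 * math.pi * x)


def make_data(n, rng, sigma=0.3):
    xs = [rng.random() for _ in range(n)]
    ys = [true_f(x) + rng.gauss(0.0, sigma) for x in xs]
    return xs, ys


def knn_regress(train_x, train_y, x0, k):
    """Average the outputs of the k training inputs closest to x0."""
    nearest = sorted(range(len(train_x)), key=lambda i: abs(train_x[i] - x0))[:k]
    return sum(train_y[i] for i in nearest) / k


def mse(train_x, train_y, eval_x, eval_y, k):
    """Mean squared error of the k-NN fit on the evaluation points."""
    errors = [(y - knn_regress(train_x, train_y, x, k)) ** 2
              for x, y in zip(eval_x, eval_y)]
    return sum(errors) / len(errors)


rng = random.Random(0)
train_x, train_y = make_data(50, rng)
test_x, test_y = make_data(200, rng)

for k in (1, 5, 25):
    train_mse = mse(train_x, train_y, train_x, train_y, k)
    test_mse = mse(train_x, train_y, test_x, test_y, k)
    print(f"K={k:2d}  training MSE={train_mse:.3f}  test MSE={test_mse:.3f}")
```

With K = 1 every training point is its own nearest neighbor, so the training MSE is exactly zero even though the fit reproduces the noise; the test MSE for the same fit stays positive.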

(12)

[Figure 2.9. Panels: Y against X; Mean Squared Error against Flexibility.]

Three estimates $\hat f$ are shown:

1. Linear regression.

2. Splines (very smooth).

3. Splines (quite rough).

Red line: Test MSE.

Gray line: Training MSE.

(13)

[Figure 2.10. Panels: Y against X; Mean Squared Error against Flexibility.]

The function f is now almost linear.

(14)

[Figure 2.11. Panels: Y against X; Mean Squared Error against Flexibility.]

When the noise ε has small variance, the third method does well.

(15)

The bias variance decomposition

Let $x_0$ be a fixed test point, $y_0 = f(x_0) + \varepsilon_0$, and $\hat f$ be estimated from n training samples $(x_1, y_1), \ldots, (x_n, y_n)$.

Let E denote the expectation over $y_0$ and the training outputs $(y_1, \ldots, y_n)$. Then, the expected mean squared error (MSE) at $x_0$ decomposes into three interpretable terms:

$$E(y_0 - \hat f(x_0))^2 = \mathrm{Var}(\hat f(x_0)) + [\mathrm{Bias}(\hat f(x_0))]^2 + \mathrm{Var}(\varepsilon_0).$$

- Irreducible error: $\mathrm{Var}(\varepsilon_0)$.
- The variance of $\hat f(x_0)$: $E[\hat f(x_0) - E(\hat f(x_0))]^2$. This measures how much the estimate $\hat f$ at $x_0$ changes when we sample new training data.
- The squared bias of $\hat f(x_0)$: $[E(\hat f(x_0)) - f(x_0)]^2$. This measures the deviation of the average prediction $E(\hat f(x_0))$ from the truth $f(x_0)$.
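A short derivation sketch of the decomposition above, under two assumptions that the slide does not state explicitly but that are standard for this model: $\varepsilon_0$ has mean zero, and $\varepsilon_0$ is independent of the training outputs (hence of $\hat f(x_0)$). The shorthand $\bar f(x_0) = E(\hat f(x_0))$ is introduced here for readability.

```latex
\begin{align*}
E(y_0 - \hat f(x_0))^2
  &= E\big(\varepsilon_0 + f(x_0) - \hat f(x_0)\big)^2 \\
  &= E(\varepsilon_0^2)
     + 2\,E(\varepsilon_0)\,E\big(f(x_0) - \hat f(x_0)\big)
     + E\big(f(x_0) - \hat f(x_0)\big)^2 \\
  &= \mathrm{Var}(\varepsilon_0)
     + E\big(f(x_0) - \bar f(x_0) + \bar f(x_0) - \hat f(x_0)\big)^2 \\
  &= \mathrm{Var}(\varepsilon_0)
     + \big(f(x_0) - \bar f(x_0)\big)^2
     + E\big(\hat f(x_0) - \bar f(x_0)\big)^2 \\
  &= \mathrm{Var}(\varepsilon_0)
     + [\mathrm{Bias}(\hat f(x_0))]^2
     + \mathrm{Var}(\hat f(x_0)).
\end{align*}
```

The first cross term factors and vanishes because $\varepsilon_0$ is independent of $\hat f(x_0)$ and $E(\varepsilon_0) = 0$. The second cross term, $2\,(f(x_0) - \bar f(x_0))\,E(\bar f(x_0) - \hat f(x_0))$, vanishes because $f(x_0)$ and $\bar f(x_0)$ are not random and $E(\hat f(x_0)) = \bar f(x_0)$.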

(16)

(17)

Take-aways from bias variance decomposition

$$E(y_0 - \hat f(x_0))^2 = \mathrm{Var}(\hat f(x_0)) + [\mathrm{Bias}(\hat f(x_0))]^2 + \mathrm{Var}(\varepsilon_0).$$

- The expected MSE is never smaller than the irreducible error
- Both small bias and small variance are needed for small expected MSE
- Easy to find zero variance procedures with high bias (predict a constant value) and very low bias procedures with high variance (choose $\hat f$ passing through every training point)
- In practice, best expected MSE achieved by incurring some bias to decrease variance and vice-versa: this is the bias-variance trade-off

Rule of thumb

More flexible methods ⇒ Higher variance and Lower bias.

- Example: A linear fit may not change substantially with the training data, but it has high bias if the true f is very non-linear
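The two extreme procedures mentioned above (predict a constant; pass through every training point) can be compared by Monte Carlo. This is a rough sketch under stated assumptions: the true function f(x) = x², the noise standard deviation `SIGMA`, the fixed grid of inputs, and the test point x0 = 1 are all hypothetical, and the 1-nearest-neighbor predictor stands in for "a fit through every training point".

```python
import random
import statistics

SIGMA = 0.5  # hypothetical noise standard deviation


def f(x):
    # Hypothetical true function.
    return x * x


def constant_predictor(xs, ys, x0):
    """Ignores the data entirely: zero variance, possibly large bias."""
    return 0.0


def one_nn_predictor(xs, ys, x0):
    """Returns the output of the closest training input (interpolates the data)."""
    nearest = min(range(len(xs)), key=lambda i: abs(xs[i] - x0))
    return ys[nearest]


def bias_variance(predictor, x0, n=21, reps=2000, seed=0):
    """Estimate squared bias and variance of predictor(x0) over fresh training outputs."""
    rng = random.Random(seed)
    xs = [i / (n - 1) for i in range(n)]  # inputs stay fixed across repetitions
    preds = []
    for _ in range(reps):
        ys = [f(x) + rng.gauss(0.0, SIGMA) for x in xs]
        preds.append(predictor(xs, ys, x0))
    bias_sq = (statistics.fmean(preds) - f(x0)) ** 2
    variance = statistics.pvariance(preds)
    return bias_sq, variance


for name, predictor in [("constant", constant_predictor), ("1-NN", one_nn_predictor)]:
    bias_sq, variance = bias_variance(predictor, x0=1.0)
    expected_mse = variance + bias_sq + SIGMA ** 2  # the three-term decomposition
    print(f"{name:8s} bias^2={bias_sq:.3f}  var={variance:.3f}  expected MSE={expected_mse:.3f}")
```

The constant predictor pays entirely in bias and the 1-NN predictor pays almost entirely in variance; both also pay the irreducible `SIGMA ** 2`.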

(18)

[Figure 2.12. Three panels (Squiggly f, high noise; Linear f, high noise; Squiggly f, low noise), each plotting MSE, Bias, and Var against Flexibility.]

(19)

Classification problems

In a classification setting, the output takes values in a discrete set. For example, if we are predicting the brand of a car based on a number of car attributes, then the function f takes values in the set {Ford, Toyota, Mercedes-Benz, . . .}.

We will adopt some additional notation:

$P(X, Y)$: joint distribution of (X, Y),

$P(Y \mid X)$: conditional distribution of Y given X,

$\hat y_i$: prediction of $y_i$ given $x_i$.

(20)

Loss function for classification

There are many ways to measure the error of a classification prediction. One of the most common is the 0-1 loss:

$$\mathbf{1}(y_0 \neq \hat y_0).$$

As with squared error, we can compute the average test prediction error (called the test error rate under 0-1 loss) using previously unseen test data $\{(x_{0,i}, y_{0,i});\ i = 1, \ldots, m\}$:

$$\frac{1}{m} \sum_{i=1}^{m} \mathbf{1}(y_{0,i} \neq \hat y_{0,i}).$$

Similarly, we can compute the (often optimistic) training error rate

$$\frac{1}{n} \sum_{i=1}^{n} \mathbf{1}(y_i \neq \hat y_i).$$
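The error rate under 0-1 loss is just the fraction of mismatched labels. A minimal sketch follows; the function name `error_rate` and the example labels are made up for illustration.

```python
def error_rate(y_true, y_pred):
    """Average 0-1 loss: the fraction of predictions that differ from the truth."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    mistakes = sum(1 for y, y_hat in zip(y_true, y_pred) if y != y_hat)
    return mistakes / len(y_true)


# Hypothetical test labels and predictions.
y_test = ["Ford", "Toyota", "Ford", "Mercedes-Benz"]
y_hat = ["Ford", "Ford", "Ford", "Toyota"]
print(error_rate(y_test, y_hat))
```

The same function gives the test error rate or the training error rate, depending on which labels and predictions are passed in.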

(21)

Bayes classifier

[Figure 2.13. Scatter plot of observations in the (X1, X2) plane.]

In practice, we never know the joint probability P. However, we can assume that it exists. The Bayes classifier assigns:

$$\hat y_i = \arg\max_j P(Y = j \mid X = x_i)$$

It can be shown that this is the best classifier under the 0-1 loss.
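When the joint distribution is known, as in a simulation, the Bayes classifier can be written down directly. The sketch below assumes a made-up one-dimensional problem with two classes, equal priors, and Gaussian class-conditional densities; the class names, means, and standard deviation are hypothetical.

```python
import math

# Hypothetical, fully specified joint distribution of (X, Y).
PRIORS = {"blue": 0.5, "orange": 0.5}
MEANS = {"blue": -1.0, "orange": 1.0}
SD = 1.0


def gaussian_pdf(x, mean, sd):
    z = (x - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2.0 * math.pi))


def posterior(x):
    """P(Y = j | X = x) for every class j, via Bayes' rule."""
    joint = {j: PRIORS[j] * gaussian_pdf(x, MEANS[j], SD) for j in PRIORS}
    total = sum(joint.values())
    return {j: value / total for j, value in joint.items()}


def bayes_classify(x):
    """Assign the class with the largest posterior probability."""
    post = posterior(x)
    return max(post, key=post.get)


for x in (-2.0, 0.3, 2.0):
    print(x, bayes_classify(x), posterior(x))
```

With real data the posterior is unknown, which is why methods such as K-nearest neighbors estimate it from the training sample instead.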

(22)

K-nearest neighbors

To assign a color to the input ×, we look at its K = 3 nearest neighbors. We predict the color of the majority of the neighbors.

[Figure 2.14]
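The majority-vote rule for K-nearest neighbors can be sketched as follows. Euclidean distance is assumed here (the slide does not name a distance), and the six training points and their colors are invented for the example.

```python
import math
from collections import Counter


def knn_classify(train_points, train_labels, query, k=3):
    """Predict the majority label among the k training points closest to query."""
    order = sorted(range(len(train_points)),
                   key=lambda i: math.dist(train_points[i], query))
    votes = Counter(train_labels[i] for i in order[:k])
    return votes.most_common(1)[0][0]


# Hypothetical training set: three blue points and three orange points.
points = [(0, 0), (0, 1), (1, 0), (5, 5), (5, 6), (6, 5)]
colors = ["blue", "blue", "blue", "orange", "orange", "orange"]

print(knn_classify(points, colors, (1, 1), k=3))
print(knn_classify(points, colors, (4, 4), k=3))
```

With two classes, an odd K avoids tied votes; with more classes or an even K, a tie-breaking rule would have to be chosen.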

(23)

K-nearest neighbors also has a decision boundary

[Figure 2.15. KNN decision boundary with K = 10 in the (X1, X2) plane.]

(24)

Higher K ⇒ smoother decision boundary

[Figure 2.16. Two panels: KNN: K=1 and KNN: K=100, each in the (X1, X2) plane.]
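A one-dimensional stand-in can illustrate the effect of K on the decision boundary. In 1-D the boundary is the set of points where the predicted class switches, so counting switches gives a crude roughness measure. The toy data (a true boundary at 0.5 with every seventh label flipped) and the helper `count_switches` are assumptions made for this sketch.

```python
from collections import Counter


def knn_classify(xs, labels, x0, k):
    """Majority vote among the k training inputs closest to x0."""
    nearest = sorted(range(len(xs)), key=lambda i: abs(xs[i] - x0))[:k]
    return Counter(labels[i] for i in nearest).most_common(1)[0][0]


def count_switches(xs, labels, k):
    """Number of times the predicted class changes along the sorted inputs."""
    preds = [knn_classify(xs, labels, x, k) for x in xs]
    return sum(1 for a, b in zip(preds, preds[1:]) if a != b)


# Hypothetical toy data: 51 evenly spaced inputs, true boundary at 0.5,
# with every seventh label flipped to mimic noise.
xs = [i / 50 for i in range(51)]
labels = ["orange" if x > 0.5 else "blue" for x in xs]
for i in range(3, 51, 7):
    labels[i] = "blue" if labels[i] == "orange" else "orange"

for k in (1, 3, 51):
    print(f"K={k:2d}  class switches along the line: {count_switches(xs, labels, k)}")
```

K = 1 follows every flipped label, a small K already votes the isolated flips away, and K equal to the sample size predicts the overall majority class everywhere, so the boundary disappears altogether.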