Machine Learning Algorithms for Classification. Rob Schapire Princeton University

(1)

Machine Learning Algorithms for Classification Machine Learning Algorithms for Classification Machine Learning Algorithms for Classification Machine Learning Algorithms for Classification Machine Learning Algorithms for Classification

Rob Schapire Princeton University

(2)

Machine Learning Machine Learning Machine Learning Machine Learning Machine Learning

• studies how to automatically learnto make accurate predictions based on past observations

• classification problems:

• classify examples into given set of categories new example

machine learning algorithm

classification predicted rule

classification examples

training labeled

(3)

Examples of Classification Problems Examples of Classification Problems Examples of Classification Problems Examples of Classification Problems Examples of Classification Problems

• text categorization (e.g., spam filtering)

• fraud detection

• optical character recognition

• machine vision (e.g., face detection)

• natural-language processing

(e.g., spoken language understanding)

• market segmentation

(e.g.: predict if customer will respond to promotion)

• bioinformatics

(e.g., classify proteins according to their function) ...

(4)

Characteristics of Modern Machine Learning Characteristics of Modern Machine Learning Characteristics of Modern Machine Learning Characteristics of Modern Machine Learning Characteristics of Modern Machine Learning

• primary goal: highly accurate predictions on test data

• goal isnot to uncover underlying “truth”

• methods should be general purpose, fullyautomatic and

“off-the-shelf”

• however, in practice, incorporation ofprior, human knowledge is crucial

• rich interplay between theory andpractice

• emphasis on methods that can handle large datasets

(5)

Why Use Machine Learning? Why Use Machine Learning? Why Use Machine Learning? Why Use Machine Learning? Why Use Machine Learning?

• advantages:

• often much more accurate than human-crafted rules (since data driven)

• humans often incapable of expressing what they know (e.g., rules of English, or how to recognize letters), but can easily classify examples

• don’t need a human expert or programmer

• automatic method to search for hypotheses explaining data

• cheap and flexible — can apply to any learning task

• disadvantages

• need a lot oflabeleddata

• error prone— usually impossible to get perfect accuracy

(6)

This Talk This Talk This Talk This Talk This Talk

• machine learning algorithms:

• decision trees

• conditions for successful learning

• boosting

• support-vector machines

• othersnot covered:

• neural networks

• nearest neighbor algorithms

• Naive Bayes

• bagging

• random forests ...

• practicalities of using machine learning algorithms

(7)

Decision Trees Decision Trees Decision Trees Decision Trees Decision Trees

(8)

Example: Good versus Evil Example: Good versus Evil Example: Good versus Evil Example: Good versus Evil Example: Good versus Evil

• problem: identify people as good or bad from their appearance sex mask cape tie ears smokes class

training data

batman male yes yes no yes no Good

robin male yes yes no no no Good

alfred male no no yes no no Good

penguin male no no yes no yes Bad

catwoman female yes no no yes no Bad

joker male no no no no no Bad

test data

batgirl female yes yes no yes no ??

riddler male yes no no no no ??

(9)

A Decision Tree Classifier A Decision Tree Classifier A Decision Tree Classifier A Decision Tree Classifier A Decision Tree Classifier

tie

good no yes

yes no

bad bad

cape smokes

no yes

good

(10)

How to Build Decision Trees How to Build Decision Trees How to Build Decision Trees How to Build Decision Trees How to Build Decision Trees

• choose rule to split on

• divide data using splitting rule into disjoint subsets

tie

no yes

alfred catwoman

joker penguin

robin batman

batman alfred

robin joker catwoman

penguin

(11)

How to Build Decision Trees How to Build Decision Trees How to Build Decision Trees How to Build Decision Trees How to Build Decision Trees

• choose rule to split on

• divide data using splitting rule into disjoint subsets

• repeat recursively for each subset

• stop when leaves are (almost) “pure”

tie

no yes

alfred catwoman

joker penguin

robin batman

batman alfred

penguin

⇒

tie

no yes

(12)

How to Choose the Splitting Rule How to Choose the Splitting Rule How to Choose the Splitting Rule How to Choose the Splitting Rule How to Choose the Splitting Rule

• key problem: choosing best rule to split on:

tie

no yes

alfred catwoman

joker penguin

robin batman

batman alfred

penguin

no yes

alfred catwoman

joker penguin

robin batman

joker catwoman

cape

batman robin alfred

penguin

(13)

How to Choose the Splitting Rule How to Choose the Splitting Rule How to Choose the Splitting Rule How to Choose the Splitting Rule How to Choose the Splitting Rule

• key problem: choosing best rule to split on:

tie

no yes

alfred catwoman

joker penguin

robin batman

batman alfred

penguin

no yes

alfred catwoman

joker penguin

robin batman

joker catwoman

cape

batman robin alfred

penguin

• idea: choose rule that leads to greatest increase in “purity”

(14)

How to Measure Purity How to Measure Purity How to Measure Purity How to Measure Purity How to Measure Purity

• want (im)purity function to look like this:

(p = fraction of positive examples)

impurity

0 1/2 1

p

• commonly used impurity measures:

• entropy: −p ln p − (1 − p) ln(1 − p)

• Gini index: p(1 − p)

(15)

Kinds of Error Rates Kinds of Error Rates Kinds of Error Rates Kinds of Error Rates Kinds of Error Rates

• training error = fraction of training examples misclassified

• test error = fraction of test examples misclassified

• generalization error = probability of misclassifying new random example

(16)

A Possible Classifier A Possible Classifier A Possible Classifier A Possible Classifier A Possible Classifier

cape

ears

tie cape

ears

good

good bad bad

good bad

bad good bad good sex smokes

smokes

no yes

no

no no

no

no yes

yes yes yes

female yes male

mask

• perfectly classifies training data

• BUT: intuitively, overly complex

(17)

Another Possible Classifier Another Possible Classifier Another Possible Classifier Another Possible Classifier Another Possible Classifier

bad good

no yes

mask

• overly simple

• doesn’t even fit available data

(18)

Tree Size versus Accuracy Tree Size versus Accuracy Tree Size versus Accuracy Tree Size versus Accuracy Tree Size versus Accuracy

0.5 0.55 0.6 0.65 0.7 0.75 0.8 0.85 0.9

0 10 20 30 40 50 60 70 80 90 100

Accuracy

On training data On test data

40

20 30

error (%)

50

tree size 50

test

train

0 100

10

• trees must be big enough to fit training data (so that “true” patterns are fully captured)

• BUT: trees that are too big may overfit

(capture noise or spurious patterns in the data)

• significant problem: can’t tell best tree size from training error

(19)

Overfitting Example Overfitting Example Overfitting Example Overfitting Example Overfitting Example

• fitting points with a polynomial

underfit ideal fit overfit

(degree = 1) (degree = 3) (degree = 20)

(20)

Building an Accurate Classifier Building an Accurate Classifier Building an Accurate Classifier Building an Accurate Classifier Building an Accurate Classifier

• for goodtest peformance, need:

• enough training examples

• good performance on training set

• classifier that is not too “complex” (“Occam’s razor”)

• classifiers should be “as simple as possible, but no simpler”

• “simplicity” closely related to prior expectations

(21)

Building an Accurate Classifier Building an Accurate Classifier Building an Accurate Classifier Building an Accurate Classifier Building an Accurate Classifier

• for goodtest peformance, need:

• enough training examples

• good performance on training set

• classifier that is not too “complex” (“Occam’s razor”)

• classifiers should be “as simple as possible, but no simpler”

• “simplicity” closely related to prior expectations

• measure “complexity” by:

• number bits needed to write down

• number of parameters

• VC-dimension

(22)

Example Example Example Example Example

Training data: