Consistent Binary Classification with Generalized Performance Metrics

(1)

Consistent Binary Classification with Generalized

Performance Metrics

Nagarajan Natarajan

Joint work with

Oluwasanmi Koyejo, Pradeep Ravikumar and Inderjit Dhillon

(2)

Problem and Motivation (1/3)

• State-of-the-art understanding of optimal decision making and consistent algorithms for binary classification is limited.

• It is well-known that accuracy (0-1 loss) is maximized (minimized) by thresholdingP(Y = 1|x) at 0.5.

• Such a characterization is lacking for many utility measures used in practice.

(3)

Problem and Motivation (2/3)

• Most performance measures are based on the four fundamental population quantities:

• Examples includeFβ, Jaccard coefficient, and other cost-sensitive

(4)

Problem and Motivation (3/3)

• Goals:

1. Develop a general framework for analyzing performance measures

2. Characterize optimal decision functions for a large family of utility measures

3. Develop efficient, and provably consistent, algorithms for maximizing measures in practice

(5)

A Family of Generalized Performance Metrics (1/3)

• Letθ:X → {0,1}denote a classifier, and Pbe a fixed unknown

distribution over labeled dataX × {0,1}.

• We define the following ratio family of performance metrics:

L(θ,P) = a0+a11TP +a10FP +a01FN +a00TN

b0+b11TP +b10FP +b01FN +b00TN

wherea0,b0,aij,bij,i,j ∈ {0,1} are non-negative constants and:

TP := TP(θ,P) =P(Y = 1, θ= 1),

FP := FP(θ,P) =P(Y = 0, θ= 1),

FN := FN(θ,P) =P(Y = 1, θ= 0),

(6)

A Family of Generalized Performance Metrics (2/3)

• Example metrics in this family:

AM = (1−π)TP +πTN 2π(1−π) Fβ = (1 +β2)TP (1 +β2_{)TP +}_β2_{FN + FP} Jaccard Coefficient = TP TP + FN + FP Weighted Accuracy = w1TP +w2TN w1TP +w2TN +w3FP +w4FN

(7)

A Family of Generalized Performance Metrics (3/3)

• Letγ(θ) :=P(θ= 1) andπ :=P(Y = 1).

• Observing that:

FP =γ(θ)−TP, FN =π−TP(θ), TN = 1−γ(θ)−π+ TP

we get the following equivalent, simpler representation of the family:

L(θ,P) = c0+c1TP +c2γ(θ)

d0+d1TP +d2γ(θ)

,

(8)

Optimal Classifier (1/2)

• Optimal (Bayes) decision function for a given metricL is:

θ∗= arg max

θ∈Θ

L(θ,P).

Main Result 1. Given a performance metricL, or equivalently, the constants{c0,c1,c2}and{d0,d1,d2}, letL∗:=L(θ∗) and let:

δ∗ = d2L ∗₋_c

2

c1−d1L∗

The Bayes classifierθ∗ takes the form

(9)

Optimal Classifier (2/2)

• Implication: Optimal decision function for a metric in our family can be found among the thresholded classifiers:

θ∗∈arg max

δ∈(0,1)L(I(P(Y = 1|x)≥δ),P),

whereI(P(Y = 1|x)≥δ) is the classifier that thresholds the conditional atδ.

(10)

(11)

Recovered and New Results (2/2)

• Simulated results showingη(x) :=P(Y = 1|x), optimal thresholdδ∗

and Bayes classifierθ∗

F1 Weighted Accuracy 1 2 3 4 5 6 7 8 9 10 x 0.0 0.2 0.4 0.6 0.8 1.0 2TP 2TP+FP+FN

η

(

x

)

δ

∗

₌₀

_.

₃₄

θ

∗ 1 2 3 4 5 6 7 8 9 10 x 0.0 0.2 0.4 0.6 0.8 1.0 2TP+2TN 2TP+FP+FN+2TN

η

(

x

)

δ

∗

₌₀

_.

₅₀

θ

∗

(12)

Maximizing

L

in Practice (1/3)

• Given iid sample (Xi,Yi),i = 1,2, . . . ,n, we would want to maximize

the empirical measure:

L_n(θ) = c1TPn(θ) +c2γn(θ) +c0

d1TPn(θ) +d2γn(θ) +d0 ,

where TPn(θ) = _n1Pni=1θ(Xi)Yi andγn(θ) = 1_nPni=1θ(Xi).

• However, maximizingL_n(θ) is often NP-hard.

• Main Result 1 suggests two simple procedures for estimatingθfrom training data

(13)

Maximizing

L

in Practice (2/3)

Algorithm 1: Two-Step EUM

Input: Training examples S ={X_i,Yi}ni=1 and utility measureL.

1. Split the training dataS into two sets S1 andS2.

2. Estimateη(x) :=P(Y = 1|x) usingS1, define θb_δ = sign(ˆη(x)−δ)

3. Computebδ= arg max_δ∈(0,1) L_n(θδb) on S₂.

4. Return: θb

b

δ

• 1-d optimization in Step 3 can be done efficiently —L_n changes only onO(n) discrete thresholds

(14)

Maximizing

L

in Practice (3/3)

• The second method is based on minimizing a surrogate` of the weighted 0-1 loss:

`δ(t,y) = (1−δ)1{y=1}`(t,1) +δ1{y=0}`(t,0).

Algorithm 2: Weighted ERM

Input: Training examples S ={X_i,Yi}ni=1, prediction function class

Φ⊂ {φ:X →R}and utility measureL.

1. Split the training dataS into two sets S1 andS2. 2. Computebδ= arg max_δ∈(0,1) L_n(θb_δ) on S₂.

Sub-algorithm: θb_δ(x) := sign(φb_δ(x)) where b φδ(x) = arg minφ∈Φ _|S1 1| P|S1| i=1`δ(φ(Xi),Yi). 3. Return: θb b δ

(15)

Consistency of Empirical Estimation (1/2)

• For consistency w.r.t. L metric, we need estimated bθto satisfy

L∗− L(θb)

p

→0.

Theorem (Uniform convergence of Ln). Consider the function class of

all thresholded decisions Θ ={I(φ(x)≥δ) ∀δ ∈(0,1)} for a [0,1]-valued functionφ:X →[0,1]. For sufficiently largen that is a function of constants associated withL, andρ, with prob. at least 1−ρ,

sup

θ∈Θ

(16)

Consistency of Empirical Estimation (2/2)

Main Result 2. If the estimateη_b(x) satisfies _bη(x)→p η(x), Algorithm 1 is

L-consistent.

Main Result 3. Let`:R: [0,∞) be a classification-calibratedconvex

(margin) loss and let `δ be the corresponding weighted loss for a givenδ

(17)

Experimental Results

• Evaluate Algorithms 1 and 2 on two metrics,F1 and Weighted Accuracy ₂₍_TP2(₊TP_TN+₎₊TN_FP)₊_FN.

• Compare the two algorithms with standard ERM (regularized logistic regression).

• On datasets listed below:

1. _Letters: 26 classes (English alphabet), 20000 instances

2. Scene: 6 classes (scene types), 2230 images

3. Web Page: 2 classes (spam/non-spam), 34780 pages

4. Image: 2 classes, 2068 images

(18)

(19)

(20)

Open Problems & Future Directions

• There exist other utility metrics that are not in our family, but have similar thresholded optimal classifiers (Check out Poster ??!)

• Raises the question — Identify/characterize the entire family of utility metrics with simple optimal decision functions

• Develop surrogate theory forL

• Obtain convergence rates forL(ˆθ)→ Lp (θ∗) as ˆθ→p θ∗

• Multi-label classification setting:

• Can extend the definitionL in more than one way! • Do similar results hold in this setting?

Consistent Binary Classification with Generalized Performance Metrics