Consistent Binary Classification with Generalized Performance Metrics

Academic year: 2021

(1)

Consistent Binary Classification with Generalized Performance Metrics

Nagarajan Natarajan

Joint work with

Oluwasanmi Koyejo, Pradeep Ravikumar and Inderjit Dhillon

(2)

Problem and Motivation (1/3)

• State-of-the-art understanding of optimal decision making and consistent algorithms for binary classification is limited.

• It is well known that accuracy is maximized (equivalently, 0-1 loss is minimized) by thresholding P(Y = 1|x) at 0.5.

• Such a characterization is lacking for many utility measures used in practice.

(3)

Problem and Motivation (2/3)

• Most performance measures are based on the four fundamental population quantities: true positives (TP), false positives (FP), false negatives (FN), and true negatives (TN).

• Examples include Fβ, Jaccard coefficient, and other cost-sensitive

(4)

Problem and Motivation (3/3)

• Goals:

1. Develop a general framework for analyzing performance measures

2. Characterize optimal decision functions for a large family of utility measures

3. Develop efficient, and provably consistent, algorithms for maximizing measures in practice

(5)

A Family of Generalized Performance Metrics (1/3)

• Let θ: X → {0,1} denote a classifier, and let P be a fixed unknown distribution over labeled data X × {0,1}.

• We define the following ratio family of performance metrics:

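\[
\mathcal{L}(\theta, P) = \frac{a_0 + a_{11}\mathrm{TP} + a_{10}\mathrm{FP} + a_{01}\mathrm{FN} + a_{00}\mathrm{TN}}{b_0 + b_{11}\mathrm{TP} + b_{10}\mathrm{FP} + b_{01}\mathrm{FN} + b_{00}\mathrm{TN}}
\]

where $a_0, b_0, a_{ij}, b_{ij}$, $i, j \in \{0,1\}$, are non-negative constants and:

TP := TP(θ,P) = P(Y = 1, θ = 1),

FP := FP(θ,P) = P(Y = 0, θ = 1),

FN := FN(θ,P) = P(Y = 1, θ = 0),

TN := TN(θ,P) = P(Y = 0, θ = 0).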

(6)

A Family of Generalized Performance Metrics (2/3)

• Example metrics in this family:

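\[
\begin{aligned}
\mathrm{AM} &= \frac{(1-\pi)\,\mathrm{TP} + \pi\,\mathrm{TN}}{2\pi(1-\pi)}, &
F_\beta &= \frac{(1+\beta^2)\,\mathrm{TP}}{(1+\beta^2)\,\mathrm{TP} + \beta^2\,\mathrm{FN} + \mathrm{FP}}, \\
\text{Jaccard Coefficient} &= \frac{\mathrm{TP}}{\mathrm{TP} + \mathrm{FN} + \mathrm{FP}}, &
\text{Weighted Accuracy} &= \frac{w_1\mathrm{TP} + w_2\mathrm{TN}}{w_1\mathrm{TP} + w_2\mathrm{TN} + w_3\mathrm{FP} + w_4\mathrm{FN}}.
\end{aligned}
\]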
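As a rough illustration of the ratio family, the sketch below evaluates a metric from the four population quantities and two coefficient tuples. The function names, the tuple layout `(x0, x11, x10, x01, x00)`, and the numeric values are assumptions made for this example; they do not come from the slides.

```python
from typing import Tuple

# Coefficient layout (an assumption of this sketch): (x0, x11, x10, x01, x00),
# multiplying (1, TP, FP, FN, TN) in that order.
Coeffs = Tuple[float, float, float, float, float]


def ratio_metric(tp: float, fp: float, fn: float, tn: float,
                 a: Coeffs, b: Coeffs) -> float:
    """Evaluate (a0 + a11*TP + a10*FP + a01*FN + a00*TN) / (same form with b)."""
    quantities = (1.0, tp, fp, fn, tn)
    numerator = sum(coef * q for coef, q in zip(a, quantities))
    denominator = sum(coef * q for coef, q in zip(b, quantities))
    return numerator / denominator


def f_beta_coeffs(beta: float) -> Tuple[Coeffs, Coeffs]:
    """Coefficients of F_beta = (1+b^2)TP / ((1+b^2)TP + b^2 FN + FP)."""
    k = 1.0 + beta ** 2
    return (0.0, k, 0.0, 0.0, 0.0), (0.0, k, 1.0, beta ** 2, 0.0)


# Jaccard = TP / (TP + FN + FP)
JACCARD = ((0.0, 1.0, 0.0, 0.0, 0.0), (0.0, 1.0, 1.0, 1.0, 0.0))

# Hypothetical population quantities; they sum to 1.
tp, fp, fn, tn = 0.3, 0.1, 0.2, 0.4
f1 = ratio_metric(tp, fp, fn, tn, *f_beta_coeffs(1.0))
jaccard = ratio_metric(tp, fp, fn, tn, *JACCARD)
```

Other members of the family, such as weighted accuracy, fit the same call by filling in the corresponding coefficients.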

(7)

A Family of Generalized Performance Metrics (3/3)

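• Let γ(θ) := P(θ = 1) and π := P(Y = 1).

• Observing that:

\[
\mathrm{FP} = \gamma(\theta) - \mathrm{TP}, \quad \mathrm{FN} = \pi - \mathrm{TP}, \quad \mathrm{TN} = 1 - \gamma(\theta) - \pi + \mathrm{TP},
\]

we get the following equivalent, simpler representation of the family:

\[
\mathcal{L}(\theta, P) = \frac{c_0 + c_1\mathrm{TP} + c_2\gamma(\theta)}{d_0 + d_1\mathrm{TP} + d_2\gamma(\theta)}.
\]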
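The change of variables can be made concrete with a small sketch. Substituting the three identities for FP, FN and TN into a linear form in (TP, FP, FN, TN) gives the reduced constants computed below; the function name and argument order are assumptions of this example, and the same mapping is applied once to the numerator and once to the denominator.

```python
from typing import Tuple


def reduce_coefficients(x0: float, x11: float, x10: float, x01: float,
                        x00: float, pi: float) -> Tuple[float, float, float]:
    """Return (k0, k1, k2) such that
    x0 + x11*TP + x10*FP + x01*FN + x00*TN == k0 + k1*TP + k2*gamma,
    using FP = gamma - TP, FN = pi - TP, TN = 1 - gamma - pi + TP."""
    k0 = x0 + x01 * pi + x00 * (1.0 - pi)
    k1 = x11 - x10 - x01 + x00
    k2 = x10 - x00
    return k0, k1, k2


# F1 = 2TP / (2TP + FP + FN), with a hypothetical pi = P(Y = 1) = 0.5.
pi = 0.5
c = reduce_coefficients(0.0, 2.0, 0.0, 0.0, 0.0, pi)  # numerator constants
d = reduce_coefficients(0.0, 2.0, 1.0, 1.0, 0.0, pi)  # denominator constants

# Sanity check on hypothetical quantities: TP = 0.3, gamma = TP + FP = 0.4.
tp, gamma = 0.3, 0.4
f1_reduced = (c[0] + c[1] * tp + c[2] * gamma) / (d[0] + d[1] * tp + d[2] * gamma)
```

For F1 the reduction yields c = (0, 2, 0) and d = (π, 0, 1), that is, F1 = 2TP / (π + γ(θ)).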

(8)

Optimal Classifier (1/2)

• The optimal (Bayes) decision function for a given metric $\mathcal{L}$ is:

\[
\theta^* = \arg\max_{\theta \in \Theta} \mathcal{L}(\theta, P).
\]

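Main Result 1. Given a performance metric $\mathcal{L}$, or equivalently, the constants $\{c_0, c_1, c_2\}$ and $\{d_0, d_1, d_2\}$, let $\mathcal{L}^* := \mathcal{L}(\theta^*)$ and let:

\[
\delta^* = \frac{d_2 \mathcal{L}^* - c_2}{c_1 - d_1 \mathcal{L}^*}.
\]

The Bayes classifier θ∗ takes the form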

(9)

Optimal Classifier (2/2)

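• Implication: The optimal decision function for a metric in our family can be found among the thresholded classifiers:

\[
\theta^* \in \arg\max_{\delta \in (0,1)} \mathcal{L}\bigl(\mathbb{I}(P(Y = 1 \mid x) \ge \delta), P\bigr),
\]

where $\mathbb{I}(P(Y = 1 \mid x) \ge \delta)$ is the classifier that thresholds the conditional probability at $\delta$.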
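One way to see where a threshold of this kind comes from is the short argument below. It is a sketch rather than the authors' proof: it assumes the denominator $d_0 + d_1\mathrm{TP}(\theta) + d_2\gamma(\theta)$ is positive for every classifier, writes $\eta(x) := P(Y = 1 \mid x)$, and uses $\mathrm{TP}(\theta) = \mathbb{E}[\theta(X)\eta(X)]$ and $\gamma(\theta) = \mathbb{E}[\theta(X)]$.

```latex
% Step 1: L(theta) <= L* for every theta, with equality at theta*.
\[
c_0 + c_1\mathrm{TP}(\theta) + c_2\gamma(\theta)
  - \mathcal{L}^*\bigl(d_0 + d_1\mathrm{TP}(\theta) + d_2\gamma(\theta)\bigr) \le 0 .
\]
% Step 2: dropping the constants, theta* therefore maximizes a linear functional.
\[
(c_1 - d_1\mathcal{L}^*)\,\mathrm{TP}(\theta) + (c_2 - d_2\mathcal{L}^*)\,\gamma(\theta)
  = \mathbb{E}\Bigl[\theta(X)\bigl\{(c_1 - d_1\mathcal{L}^*)\,\eta(X)
      - (d_2\mathcal{L}^* - c_2)\bigr\}\Bigr].
\]
% Step 3: maximize pointwise; assuming c_1 > d_1 L*, predict 1 exactly when
\[
\eta(x) \ge \delta^* = \frac{d_2\mathcal{L}^* - c_2}{c_1 - d_1\mathcal{L}^*}.
\]
```

For example, F1 has $c = (0, 2, 0)$ and $d = (\pi, 0, 1)$, so the formula gives $\delta^* = \mathcal{L}^*/2$.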

(10)
(11)

Recovered and New Results (2/2)

• Simulated results showing η(x) := P(Y = 1|x), the optimal threshold δ∗, and the Bayes classifier θ∗:

[Figure: two panels plotting η(x), the threshold δ∗, and θ∗ against x = 1, …, 10, with the vertical axis running from 0.0 to 1.0. Left panel: F1 = 2TP / (2TP + FP + FN), with δ∗ = 0.34. Right panel: Weighted Accuracy = (2TP + 2TN) / (2TP + FP + FN + 2TN), with δ∗ = 0.50.]
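A simulation of this kind can be imitated on a toy problem. The sketch below is not the simulation from the slides: the ten values of η are made up, x is assumed uniform over ten points, and the search is a brute-force enumeration of all deterministic classifiers. It checks that the F1-optimal classifier coincides with thresholding η at δ∗ = L∗/2, which is the Main Result 1 threshold specialized to F1.

```python
from itertools import product

# Hypothetical conditional probabilities eta(x) on ten equally likely points.
ETA = [0.05, 0.10, 0.20, 0.30, 0.45, 0.60, 0.70, 0.80, 0.90, 0.95]
N = len(ETA)
PI = sum(ETA) / N  # P(Y = 1)


def f1(theta) -> float:
    """Population F1 = 2*TP / (gamma + pi) for a 0/1 decision vector theta."""
    tp = sum(e * t for e, t in zip(ETA, theta)) / N
    gamma = sum(theta) / N
    return 2.0 * tp / (gamma + PI)


# Exhaustive search over all 2^10 deterministic classifiers.
best_theta = max(product((0, 1), repeat=N), key=f1)
best_value = f1(best_theta)

# For F1 the reduced constants are c = (0, 2, 0), d = (pi, 0, 1),
# so delta* = (d2*L* - c2) / (c1 - d1*L*) = L*/2.
delta_star = best_value / 2.0
thresholded = tuple(int(e >= delta_star) for e in ETA)
```

On this toy distribution the exhaustive optimum and the thresholded classifier select the same six points.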

(12)

Maximizing L in Practice (1/3)

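• Given an i.i.d. sample (Xi, Yi), i = 1, 2, …, n, we would want to maximize the empirical measure:

\[
\mathcal{L}_n(\theta) = \frac{c_1\mathrm{TP}_n(\theta) + c_2\gamma_n(\theta) + c_0}{d_1\mathrm{TP}_n(\theta) + d_2\gamma_n(\theta) + d_0},
\]

where $\mathrm{TP}_n(\theta) = \frac{1}{n}\sum_{i=1}^{n} \theta(X_i) Y_i$ and $\gamma_n(\theta) = \frac{1}{n}\sum_{i=1}^{n} \theta(X_i)$.

• However, maximizing $\mathcal{L}_n(\theta)$ is often NP-hard.

• Main Result 1 suggests two simple procedures for estimating θ from training data.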
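For a fixed classifier, the empirical measure is cheap to evaluate. The sketch below computes TPn, γn and Ln from 0/1 predictions and labels; the function names and the toy data are assumptions of this example, and the F1 constants c = (0, 2, 0), d = (πn, 0, 1) use the empirical positive rate πn in place of π.

```python
from typing import Sequence, Tuple


def empirical_quantities(preds: Sequence[int],
                         labels: Sequence[int]) -> Tuple[float, float]:
    """Return (TP_n, gamma_n) for 0/1 predictions and 0/1 labels."""
    n = len(labels)
    tp_n = sum(p * y for p, y in zip(preds, labels)) / n
    gamma_n = sum(preds) / n
    return tp_n, gamma_n


def empirical_metric(preds: Sequence[int], labels: Sequence[int],
                     c: Tuple[float, float, float],
                     d: Tuple[float, float, float]) -> float:
    """L_n = (c0 + c1*TP_n + c2*gamma_n) / (d0 + d1*TP_n + d2*gamma_n)."""
    tp_n, gamma_n = empirical_quantities(preds, labels)
    return (c[0] + c[1] * tp_n + c[2] * gamma_n) / (d[0] + d[1] * tp_n + d[2] * gamma_n)


# Toy data: 2 true positives, 1 false positive, 1 false negative.
preds = [1, 1, 0, 1, 0, 0]
labels = [1, 0, 0, 1, 1, 0]
pi_n = sum(labels) / len(labels)
f1_n = empirical_metric(preds, labels, c=(0.0, 2.0, 0.0), d=(pi_n, 0.0, 1.0))
```

The value agrees with the count-based formula 2TP / (2TP + FP + FN) = 4/6 on the same data.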

(13)

Maximizing L in Practice (2/3)

Algorithm 1: Two-Step EUM

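Input: Training examples $S = \{(X_i, Y_i)\}_{i=1}^{n}$ and utility measure $\mathcal{L}$.

1. Split the training data $S$ into two sets $S_1$ and $S_2$.

2. Compute an estimate $\hat\eta(x)$ of $\eta(x) := P(Y = 1 \mid x)$ using $S_1$, and define $\hat\theta_\delta(x) = \mathrm{sign}(\hat\eta(x) - \delta)$.

3. Compute $\hat\delta = \arg\max_{\delta \in (0,1)} \mathcal{L}_n(\hat\theta_\delta)$ on $S_2$.

4. Return: $\hat\theta_{\hat\delta}$.

• The 1-d optimization in Step 3 can be done efficiently: $\mathcal{L}_n$ changes only at $O(n)$ discrete thresholds.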
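Step 3 of Algorithm 1 can be sketched as a single sorted sweep. The code below assumes that Step 2 has already produced scores (estimates of η on the held-out set S2) from some probabilistic model; the function name `best_threshold`, the convention "predict 1 when the score is at least δ", and the toy data are assumptions of this example rather than details given in the slides.

```python
from typing import Sequence, Tuple

Triple = Tuple[float, float, float]


def best_threshold(scores: Sequence[float], labels: Sequence[int],
                   c: Triple, d: Triple) -> Tuple[float, float]:
    """Scan the observed scores as candidate thresholds and return
    (delta_hat, L_n at delta_hat). Costs O(n log n) because of the sort."""
    n = len(scores)
    order = sorted(range(n), key=lambda i: scores[i], reverse=True)
    best_delta, best_value = None, float("-inf")
    tp_count = 0
    for k, i in enumerate(order, start=1):
        tp_count += labels[i]
        # With tied scores, only the last element of a tie is a valid cut point.
        if k < n and scores[order[k]] == scores[i]:
            continue
        tp_n, gamma_n = tp_count / n, k / n  # top-k points are predicted positive
        value = (c[0] + c[1] * tp_n + c[2] * gamma_n) / (d[0] + d[1] * tp_n + d[2] * gamma_n)
        if value > best_value:
            best_delta, best_value = scores[i], value
    return best_delta, best_value


# Toy held-out set with hypothetical scores; metric is F1.
scores = [0.9, 0.8, 0.7, 0.4, 0.3, 0.1]
labels = [1, 1, 0, 1, 0, 0]
pi_n = sum(labels) / len(labels)
delta_hat, f1_hat = best_threshold(scores, labels, c=(0.0, 2.0, 0.0), d=(pi_n, 0.0, 1.0))
```

Because Ln only changes when the threshold crosses an observed score, scanning the sorted scores visits every distinct value the empirical metric can take.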

(14)

Maximizing L in Practice (3/3)

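• The second method is based on minimizing a surrogate $\ell$ of the weighted 0-1 loss:

\[
\ell_\delta(t, y) = (1 - \delta)\,\mathbb{1}\{y = 1\}\,\ell(t, 1) + \delta\,\mathbb{1}\{y = 0\}\,\ell(t, 0).
\]

Algorithm 2: Weighted ERM

Input: Training examples $S = \{(X_i, Y_i)\}_{i=1}^{n}$, prediction function class $\Phi \subset \{\phi : \mathcal{X} \to \mathbb{R}\}$ and utility measure $\mathcal{L}$.

1. Split the training data $S$ into two sets $S_1$ and $S_2$.

2. Compute $\hat\delta = \arg\max_{\delta \in (0,1)} \mathcal{L}_n(\hat\theta_\delta)$ on $S_2$.

   Sub-algorithm: $\hat\theta_\delta(x) := \mathrm{sign}(\hat\phi_\delta(x))$, where $\hat\phi_\delta = \arg\min_{\phi \in \Phi} \frac{1}{|S_1|} \sum_{i=1}^{|S_1|} \ell_\delta(\phi(X_i), Y_i)$.

3. Return: $\hat\theta_{\hat\delta}$.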
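The sub-algorithm of Algorithm 2 can be illustrated with a deliberately small model. The sketch below takes the logistic loss as the surrogate and a one-feature linear function φ(x) = w·x + b as the class Φ, fitted by plain gradient descent; these choices, the step size, the iteration count, and all names are assumptions of this example, and the outer search over δ on S2 is left out.

```python
import math
from typing import Sequence, Tuple


def sigmoid(t: float) -> float:
    return 1.0 / (1.0 + math.exp(-t))


def weighted_logistic_loss(t: float, y: int, delta: float) -> float:
    """(1 - delta)*1{y=1}*l(t, 1) + delta*1{y=0}*l(t, 0) with logistic l."""
    if y == 1:
        return (1.0 - delta) * math.log1p(math.exp(-t))
    return delta * math.log1p(math.exp(t))


def fit_weighted_logistic(xs: Sequence[float], ys: Sequence[int], delta: float,
                          lr: float = 0.5, steps: int = 2000) -> Tuple[float, float]:
    """Gradient descent on the mean weighted logistic loss of phi(x) = w*x + b.
    Assumes features of moderate scale so that the fixed step size is stable."""
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(steps):
        grad_w = grad_b = 0.0
        for x, y in zip(xs, ys):
            t = w * x + b
            # Derivative of the weighted loss with respect to t.
            g = -(1.0 - delta) * sigmoid(-t) if y == 1 else delta * sigmoid(t)
            grad_w += g * x
            grad_b += g
        w -= lr * grad_w / n
        b -= lr * grad_b / n
    return w, b


def classify(x: float, w: float, b: float) -> int:
    """theta_delta(x): predict 1 when phi(x) is non-negative."""
    return 1 if w * x + b >= 0 else 0


# Toy, non-separable 1-d data (hypothetical).
xs = [-2.0, -1.0, -0.5, 0.5, 1.0, 2.0]
ys = [0, 0, 1, 0, 1, 1]
w_low, b_low = fit_weighted_logistic(xs, ys, delta=0.3)
w_high, b_high = fit_weighted_logistic(xs, ys, delta=0.7)
```

With the logistic surrogate, the population minimizer t of the weighted loss satisfies exp(t) = (1 − δ)η / (δ(1 − η)), so its sign is non-negative exactly when η ≥ δ; this is the sense in which the weighted loss encodes the threshold.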

(15)

Consistency of Empirical Estimation (1/2)

• For consistency with respect to the metric $\mathcal{L}$, we need the estimate $\hat\theta$ to satisfy $\mathcal{L}^* - \mathcal{L}(\hat\theta) \xrightarrow{p} 0$.

Theorem (Uniform convergence of $\mathcal{L}_n$). Consider the function class of all thresholded decisions $\Theta = \{\mathbb{I}(\phi(x) \ge \delta) : \delta \in (0,1)\}$ for a $[0,1]$-valued function $\phi : \mathcal{X} \to [0,1]$. For sufficiently large $n$ (depending on $\rho$ and on constants associated with $\mathcal{L}$), with probability at least $1 - \rho$,

$\sup_{\theta \in \Theta}$

(16)

Consistency of Empirical Estimation (2/2)

Main Result 2. If the estimate $\hat\eta(x)$ satisfies $\hat\eta(x) \xrightarrow{p} \eta(x)$, Algorithm 1 is $\mathcal{L}$-consistent.

Main Result 3. Let $\ell : \mathbb{R} \to [0, \infty)$ be a classification-calibrated convex (margin) loss and let $\ell_\delta$ be the corresponding weighted loss for a given $\delta$

(17)

Experimental Results

• Evaluate Algorithms 1 and 2 on two metrics, F1 and Weighted Accuracy = 2(TP + TN) / (2(TP + TN) + FP + FN).

• Compare the two algorithms with standard ERM (regularized logistic regression).

• On datasets listed below:

1. Letters: 26 classes (English alphabet), 20000 instances

2. Scene: 6 classes (scene types), 2230 images

3. Web Page: 2 classes (spam/non-spam), 34780 pages

4. Image: 2 classes, 2068 images

(18)
(19)
(20)

Open Problems & Future Directions

• There exist other utility metrics that are not in our family, but have similar thresholded optimal classifiers (Check out Poster ??!)

• Raises the question — Identify/characterize the entire family of utility metrics with simple optimal decision functions

• Develop surrogate theory for $\mathcal{L}$

• Obtain convergence rates for $\mathcal{L}(\hat\theta) \xrightarrow{p} \mathcal{L}(\theta^*)$ as $\hat\theta \xrightarrow{p} \theta^*$

• Multi-label classification setting:

  • Can extend the definition of $\mathcal{L}$ in more than one way!

  • Do similar results hold in this setting?
