Consistent Binary Classification with Generalized Performance Metrics
Nagarajan Natarajan
Joint work with
Oluwasanmi Koyejo, Pradeep Ravikumar and Inderjit Dhillon
Problem and Motivation (1/3)
• State-of-the-art understanding of optimal decision making and consistent algorithms for binary classification is limited.
• It is well known that accuracy is maximized (equivalently, the 0-1 loss is minimized) by thresholding $P(Y = 1 \mid x)$ at 0.5.
• Such a characterization is lacking for many utility measures used in practice.
Problem and Motivation (2/3)
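• Most performance measures are based on the four fundamental population quantities: true positives (TP), false positives (FP), false negatives (FN), and true negatives (TN).
• Examples include $F_\beta$, the Jaccard coefficient, and other cost-sensitive measures.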
Problem and Motivation (3/3)
• Goals:
1. Develop a general framework for analyzing performance measures
2. Characterize optimal decision functions for a large family of utility measures
3. Develop efficient, and provably consistent, algorithms for maximizing measures in practice
A Family of Generalized Performance Metrics (1/3)
• Let $\theta : \mathcal{X} \to \{0,1\}$ denote a classifier, and let $P$ be a fixed unknown distribution over labeled data $\mathcal{X} \times \{0,1\}$.
• We define the following ratio family of performance metrics:
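$$\mathcal{L}(\theta, P) = \frac{a_0 + a_{11}\,\mathrm{TP} + a_{10}\,\mathrm{FP} + a_{01}\,\mathrm{FN} + a_{00}\,\mathrm{TN}}{b_0 + b_{11}\,\mathrm{TP} + b_{10}\,\mathrm{FP} + b_{01}\,\mathrm{FN} + b_{00}\,\mathrm{TN}}$$
where $a_0, b_0, a_{ij}, b_{ij}$, $i, j \in \{0,1\}$, are non-negative constants and:
$\mathrm{TP} := \mathrm{TP}(\theta, P) = P(Y = 1, \theta = 1)$,
$\mathrm{FP} := \mathrm{FP}(\theta, P) = P(Y = 0, \theta = 1)$,
$\mathrm{FN} := \mathrm{FN}(\theta, P) = P(Y = 1, \theta = 0)$,
$\mathrm{TN} := \mathrm{TN}(\theta, P) = P(Y = 0, \theta = 0)$.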
A Family of Generalized Performance Metrics (2/3)
• Example metrics in this family:
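$$\mathrm{AM} = \frac{(1-\pi)\,\mathrm{TP} + \pi\,\mathrm{TN}}{2\pi(1-\pi)}$$
$$F_\beta = \frac{(1+\beta^2)\,\mathrm{TP}}{(1+\beta^2)\,\mathrm{TP} + \beta^2\,\mathrm{FN} + \mathrm{FP}}$$
$$\text{Jaccard Coefficient} = \frac{\mathrm{TP}}{\mathrm{TP} + \mathrm{FN} + \mathrm{FP}}$$
$$\text{Weighted Accuracy} = \frac{w_1\,\mathrm{TP} + w_2\,\mathrm{TN}}{w_1\,\mathrm{TP} + w_2\,\mathrm{TN} + w_3\,\mathrm{FP} + w_4\,\mathrm{FN}}$$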
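The ratio family can be evaluated with one generic function, sketched below. The helper names (`ratio_metric`, `f_beta_coeffs`, `JACCARD`) and the toy confusion quantities are illustrative assumptions, not part of the original slides; coefficient tuples are ordered as (constant, TP, FP, FN, TN).

```python
def ratio_metric(tp, fp, fn, tn, a, b):
    """Evaluate (a0 + a11*TP + a10*FP + a01*FN + a00*TN) / (same form with b)."""
    q = (1.0, tp, fp, fn, tn)
    num = sum(ai * qi for ai, qi in zip(a, q))
    den = sum(bi * qi for bi, qi in zip(b, q))
    return num / den


def f_beta_coeffs(beta):
    """Numerator and denominator coefficients of F_beta in the ratio family."""
    b2 = beta ** 2
    return (0, 1 + b2, 0, 0, 0), (0, 1 + b2, 1, b2, 0)


# Jaccard = TP / (TP + FP + FN)
JACCARD = ((0, 1, 0, 0, 0), (0, 1, 1, 1, 0))

# Hypothetical population quantities (they sum to 1).
tp, fp, fn, tn = 0.3, 0.1, 0.2, 0.4
f1 = ratio_metric(tp, fp, fn, tn, *f_beta_coeffs(1.0))
jaccard = ratio_metric(tp, fp, fn, tn, *JACCARD)
print(f1, jaccard)
```

Any other member of the family is obtained by supplying a different pair of coefficient tuples.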
A Family of Generalized Performance Metrics (3/3)
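• Let $\gamma(\theta) := P(\theta = 1)$ and $\pi := P(Y = 1)$.
• Observing that:
$$\mathrm{FP} = \gamma(\theta) - \mathrm{TP}, \quad \mathrm{FN} = \pi - \mathrm{TP}, \quad \mathrm{TN} = 1 - \gamma(\theta) - \pi + \mathrm{TP}$$
we get the following equivalent, simpler representation of the family:
$$\mathcal{L}(\theta, P) = \frac{c_0 + c_1\,\mathrm{TP} + c_2\,\gamma(\theta)}{d_0 + d_1\,\mathrm{TP} + d_2\,\gamma(\theta)}.$$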
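As a worked instance of this reduction (the constants below are derived here from the identities for FP and FN above, and are not quoted from the slides), substituting into $F_\beta$ gives:

$$F_\beta = \frac{(1+\beta^2)\,\mathrm{TP}}{(1+\beta^2)\,\mathrm{TP} + \beta^2(\pi - \mathrm{TP}) + (\gamma(\theta) - \mathrm{TP})} = \frac{(1+\beta^2)\,\mathrm{TP}}{\beta^2\pi + \gamma(\theta)}$$

so that, for $F_\beta$, $c_1 = 1+\beta^2$, $c_0 = c_2 = 0$, $d_0 = \beta^2\pi$, $d_1 = 0$ and $d_2 = 1$.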
Optimal Classifier (1/2)
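• The optimal (Bayes) decision function for a given metric $\mathcal{L}$ is:
$$\theta^* = \arg\max_{\theta \in \Theta} \mathcal{L}(\theta, P).$$
Main Result 1. Given a performance metric $\mathcal{L}$, or equivalently the constants $\{c_0, c_1, c_2\}$ and $\{d_0, d_1, d_2\}$, let $\mathcal{L}^* := \mathcal{L}(\theta^*)$ and let:
$$\delta^* = \frac{d_2\,\mathcal{L}^* - c_2}{c_1 - d_1\,\mathcal{L}^*}$$
The Bayes classifier $\theta^*$ takes the form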
Optimal Classifier (2/2)
• Implication: The optimal decision function for a metric in our family can be found among the thresholded classifiers:
$$\theta^* \in \arg\max_{\delta \in (0,1)} \mathcal{L}\big(I(P(Y = 1 \mid x) \ge \delta),\, P\big),$$
where $I(P(Y = 1 \mid x) \ge \delta)$ is the classifier that thresholds the conditional probability at $\delta$.
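The implication can be checked numerically on a small example. The sketch below assumes a hypothetical toy distribution with six points and uses F1, written as $2\,\mathrm{TP}/(\pi + \gamma)$, which corresponds to $c_1 = 2$, $d_2 = 1$, $c_2 = d_1 = 0$ and hence $\delta^* = \mathcal{L}^*/2$. It compares a brute-force search over all deterministic classifiers with a search over thresholded classifiers only.

```python
from itertools import product

# Hypothetical toy distribution: six points with uniform marginal P(x)
# and conditional probabilities eta[x] = P(Y = 1 | x).
eta = [0.05, 0.2, 0.35, 0.5, 0.7, 0.9]
p_x = [1.0 / len(eta)] * len(eta)
pi = sum(p * e for p, e in zip(p_x, eta))  # P(Y = 1)


def f1(theta):
    """Population F1 = 2TP / (2TP + FP + FN) = 2TP / (pi + gamma)."""
    tp = sum(p * e * t for p, e, t in zip(p_x, eta, theta))
    gamma = sum(p * t for p, t in zip(p_x, theta))
    return 2 * tp / (pi + gamma)


# Brute force over all 2^6 deterministic classifiers.
best_any = max(product([0, 1], repeat=len(eta)), key=f1)

# Search restricted to thresholded classifiers I(eta(x) >= delta).
candidates = [tuple(int(e >= d) for e in eta) for d in eta]
best_thr = max(candidates, key=f1)

l_star = f1(best_any)
delta_star = l_star / 2  # Main Result 1 with c1 = 2, d2 = 1, c2 = d1 = 0
print(best_any, best_thr, l_star, delta_star)
```

On this toy example both searches select the same classifier, and it coincides with thresholding `eta` at `delta_star`.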
Recovered and New Results (2/2)
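• Simulated results showing $\eta(x) := P(Y = 1 \mid x)$, the optimal threshold $\delta^*$, and the Bayes classifier $\theta^*$:
[Figure: two panels plotting $\eta(x)$, $\delta^*$ and $\theta^*$ against $x = 1, \dots, 10$, with the vertical axis running from 0.0 to 1.0. Left panel: F1 $= \frac{2\mathrm{TP}}{2\mathrm{TP} + \mathrm{FP} + \mathrm{FN}}$, with $\delta^* = 0.34$. Right panel: Weighted Accuracy $= \frac{2\mathrm{TP} + 2\mathrm{TN}}{2\mathrm{TP} + \mathrm{FP} + \mathrm{FN} + 2\mathrm{TN}}$, with $\delta^* = 0.50$.]
Maximizing $\mathcal{L}$ in Practice (1/3)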
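• Given an i.i.d. sample $(X_i, Y_i)$, $i = 1, 2, \dots, n$, we would want to maximize the empirical measure:
$$\mathcal{L}_n(\theta) = \frac{c_1\,\mathrm{TP}_n(\theta) + c_2\,\gamma_n(\theta) + c_0}{d_1\,\mathrm{TP}_n(\theta) + d_2\,\gamma_n(\theta) + d_0},$$
where $\mathrm{TP}_n(\theta) = \frac{1}{n}\sum_{i=1}^{n} \theta(X_i)\,Y_i$ and $\gamma_n(\theta) = \frac{1}{n}\sum_{i=1}^{n} \theta(X_i)$.
• However, maximizing $\mathcal{L}_n(\theta)$ is often NP-hard.
• Main Result 1 suggests two simple procedures for estimating $\theta$ from training data.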
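The empirical measure is straightforward to compute from 0/1 predictions and labels. In the sketch below, the function name `empirical_metric` and the toy data are assumptions for illustration. The F1 constants $c = (0, 2, 0)$, $d = (\pi, 0, 1)$ follow from $\mathrm{FP} = \gamma - \mathrm{TP}$ and $\mathrm{FN} = \pi - \mathrm{TP}$, and $\pi$ is replaced by the empirical label frequency, which is also an assumption of this sketch.

```python
def empirical_metric(preds, labels, c, d):
    """L_n = (c0 + c1*TP_n + c2*gamma_n) / (d0 + d1*TP_n + d2*gamma_n)."""
    n = len(labels)
    tp_n = sum(t * y for t, y in zip(preds, labels)) / n  # (1/n) sum theta(X_i) Y_i
    gamma_n = sum(preds) / n                              # (1/n) sum theta(X_i)
    num = c[0] + c[1] * tp_n + c[2] * gamma_n
    den = d[0] + d[1] * tp_n + d[2] * gamma_n
    return num / den


# Toy sample: 2 true positives, 1 false positive, 1 false negative.
labels = [1, 1, 0, 0, 1, 0, 0, 0]
preds = [1, 0, 1, 0, 1, 0, 0, 0]
pi_n = sum(labels) / len(labels)  # empirical estimate of pi (assumption)
f1_n = empirical_metric(preds, labels, (0, 2, 0), (pi_n, 0, 1))
print(f1_n)
```

The result agrees with the usual count-based formula $2\mathrm{TP}/(2\mathrm{TP} + \mathrm{FP} + \mathrm{FN})$ on the same sample.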
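Maximizing $\mathcal{L}$ in Practice (2/3)
Algorithm 1: Two-Step EUM
Input: Training examples $S = \{X_i, Y_i\}_{i=1}^{n}$ and utility measure $\mathcal{L}$.
1. Split the training data $S$ into two sets $S_1$ and $S_2$.
2. Compute an estimate $\hat\eta(x)$ of $\eta(x) := P(Y = 1 \mid x)$ using $S_1$, and define $\hat\theta_\delta(x) = \mathrm{sign}(\hat\eta(x) - \delta)$.
3. Compute $\hat\delta = \arg\max_{\delta \in (0,1)} \mathcal{L}_n(\hat\theta_\delta)$ on $S_2$.
4. Return: $\hat\theta_{\hat\delta}$
• The 1-d optimization in Step 3 can be done efficiently: $\mathcal{L}_n$ changes only at $O(n)$ discrete thresholds.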
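The two-step procedure can be sketched in plain Python as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the feature is discrete, $\hat\eta$ is a per-value label frequency (the slides do not prescribe an estimator), the metric is F1, predictions are coded as 0/1 rather than via sign, and the names `fit_eta`, `two_step` and `f1_n` are hypothetical.

```python
import random


def f1_n(preds, labels):
    """Empirical F1 = 2TP / (2TP + FP + FN)."""
    tp = sum(t * y for t, y in zip(preds, labels))
    denom = sum(preds) + sum(labels)  # equals 2TP + FP + FN
    return 2 * tp / denom if denom else 0.0


def fit_eta(xs, ys):
    """Plug-in estimate of P(Y=1|x) by label frequency per x value (assumes discrete x)."""
    counts, pos = {}, {}
    for x, y in zip(xs, ys):
        counts[x] = counts.get(x, 0) + 1
        pos[x] = pos.get(x, 0) + y
    prior = sum(ys) / len(ys)  # fallback for x values unseen in S1
    return lambda x: pos[x] / counts[x] if x in counts else prior


def two_step(xs, ys, metric, seed=0):
    # Step 1: split S into S1 and S2.
    idx = list(range(len(xs)))
    random.Random(seed).shuffle(idx)
    half = len(idx) // 2
    s1, s2 = idx[:half], idx[half:]
    # Step 2: estimate eta on S1.
    eta_hat = fit_eta([xs[i] for i in s1], [ys[i] for i in s1])
    # Step 3: L_n only changes at observed scores, so at most |S2| candidates.
    scores = [eta_hat(xs[i]) for i in s2]
    y2 = [ys[i] for i in s2]
    delta_hat = max(sorted(set(scores)),
                    key=lambda d: metric([int(s >= d) for s in scores], y2))
    # Step 4: return the thresholded classifier.
    return delta_hat, (lambda x: int(eta_hat(x) >= delta_hat))


# Synthetic data (assumption): x in 1..10 with P(Y=1|x) = x / 11.
rng = random.Random(1)
xs = [rng.randint(1, 10) for _ in range(2000)]
ys = [int(rng.random() < x / 11) for x in xs]
delta_hat, theta_hat = two_step(xs, ys, f1_n)
print(delta_hat, [theta_hat(x) for x in range(1, 11)])
```

For simplicity each candidate threshold is scored with a full pass over $S_2$; sorting the scores once and updating running counts would give the same answer with less work.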
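Maximizing $\mathcal{L}$ in Practice (3/3)
• The second method is based on minimizing a surrogate $\ell$ of the weighted 0-1 loss:
$$\ell_\delta(t, y) = (1 - \delta)\,\mathbf{1}\{y = 1\}\,\ell(t, 1) + \delta\,\mathbf{1}\{y = 0\}\,\ell(t, 0).$$
Algorithm 2: Weighted ERM
Input: Training examples $S = \{X_i, Y_i\}_{i=1}^{n}$, prediction function class $\Phi \subset \{\phi : \mathcal{X} \to \mathbb{R}\}$ and utility measure $\mathcal{L}$.
1. Split the training data $S$ into two sets $S_1$ and $S_2$.
2. Compute $\hat\delta = \arg\max_{\delta \in (0,1)} \mathcal{L}_n(\hat\theta_\delta)$ on $S_2$.
   Sub-algorithm: $\hat\theta_\delta(x) := \mathrm{sign}(\hat\phi_\delta(x))$, where $\hat\phi_\delta = \arg\min_{\phi \in \Phi} \frac{1}{|S_1|} \sum_{i=1}^{|S_1|} \ell_\delta(\phi(X_i), Y_i)$.
3. Return: $\hat\theta_{\hat\delta}$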
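A minimal sketch of the weighted ERM procedure is given below. Everything specific in it is an assumption made for illustration: the function class is 1-d linear scores $\phi(x) = wx + b$, the surrogate is the logistic loss minimized by plain gradient descent, $\delta$ is searched over a coarse grid, the metric is F1, and predictions are coded as 0/1 rather than via sign.

```python
import math
import random


def f1_n(preds, labels):
    """Empirical F1 = 2TP / (2TP + FP + FN)."""
    tp = sum(t * y for t, y in zip(preds, labels))
    denom = sum(preds) + sum(labels)
    return 2 * tp / denom if denom else 0.0


def sigmoid(t):
    t = max(-30.0, min(30.0, t))  # clamp to avoid overflow in exp
    return 1.0 / (1.0 + math.exp(-t))


def fit_weighted_logistic(xs, ys, delta, steps=200, lr=0.5):
    """Gradient descent on the mean of
    (1-delta)*1{y=1}*log(1+e^-t) + delta*1{y=0}*log(1+e^t), with t = w*x + b."""
    w = b = 0.0
    n = len(xs)
    for _ in range(steps):
        gw = gb = 0.0
        for x, y in zip(xs, ys):
            p = sigmoid(w * x + b)
            # Derivative of the weighted loss with respect to t.
            g = -(1 - delta) * (1 - p) if y == 1 else delta * p
            gw += g * x
            gb += g
        w -= lr * gw / n
        b -= lr * gb / n
    return w, b


def weighted_erm(xs, ys, metric, grid, seed=0):
    # Step 1: split S into S1 and S2.
    idx = list(range(len(xs)))
    random.Random(seed).shuffle(idx)
    half = len(idx) // 2
    x1, y1 = [xs[i] for i in idx[:half]], [ys[i] for i in idx[:half]]
    x2, y2 = [xs[i] for i in idx[half:]], [ys[i] for i in idx[half:]]
    # Step 2: fit one weighted model per delta on S1, score each on S2.
    best = None
    for delta in grid:
        w, b = fit_weighted_logistic(x1, y1, delta)
        score = metric([int(w * x + b >= 0) for x in x2], y2)
        if best is None or score > best[0]:
            best = (score, delta, w, b)
    # Step 3: return the classifier for the best delta.
    _, delta_hat, w, b = best
    return delta_hat, (lambda x: int(w * x + b >= 0))


# Synthetic data (assumption): x uniform on [-1, 1], P(Y=1|x) = sigmoid(3x).
rng = random.Random(2)
xs = [rng.uniform(-1, 1) for _ in range(400)]
ys = [int(rng.random() < sigmoid(3 * x)) for x in xs]
grid = [k / 10 for k in range(1, 10)]
delta_hat, theta_hat = weighted_erm(xs, ys, f1_n, grid)
print(delta_hat, theta_hat(1.0))
```

A grid over $\delta$ is used here only to keep the example short; a finer grid or a 1-d search would trade more model fits for a better approximation of the arg max in Step 2.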
Consistency of Empirical Estimation (1/2)
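• For consistency with respect to the metric $\mathcal{L}$, we need the estimate $\hat\theta$ to satisfy
$$\mathcal{L}^* - \mathcal{L}(\hat\theta) \xrightarrow{p} 0.$$
Theorem (Uniform convergence of $\mathcal{L}_n$). Consider the function class of all thresholded decisions $\Theta = \{I(\phi(x) \ge \delta) : \delta \in (0,1)\}$ for a function $\phi : \mathcal{X} \to [0,1]$. For sufficiently large $n$ (depending on $\rho$ and on constants associated with $\mathcal{L}$), with probability at least $1 - \rho$,
$\sup_{\theta \in \Theta}$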
Consistency of Empirical Estimation (2/2)
Main Result 2. If the estimate $\hat\eta(x)$ satisfies $\hat\eta(x) \xrightarrow{p} \eta(x)$, Algorithm 1 is $\mathcal{L}$-consistent.
Main Result 3. Let $\ell : \mathbb{R} \to [0, \infty)$ be a classification-calibrated convex (margin) loss and let $\ell_\delta$ be the corresponding weighted loss for a given $\delta$
Experimental Results
• Evaluate Algorithms 1 and 2 on two metrics, $F_1$ and Weighted Accuracy $\frac{2(\mathrm{TP} + \mathrm{TN})}{2(\mathrm{TP} + \mathrm{TN}) + \mathrm{FP} + \mathrm{FN}}$.
• Compare the two algorithms with standard ERM (regularized logistic regression).
• On datasets listed below:
1. Letters: 26 classes (English alphabet), 20000 instances
2. Scene: 6 classes (scene types), 2230 images
3. Web Page: 2 classes (spam/non-spam), 34780 pages
4. Image: 2 classes, 2068 images
Open Problems & Future Directions
• There exist other utility metrics that are not in our family, but have similar thresholded optimal classifiers (Check out Poster ??!)
• This raises the question of how to identify and characterize the entire family of utility metrics with simple optimal decision functions.
• Develop surrogate theory for $\mathcal{L}$.
• Obtain convergence rates for $\mathcal{L}(\hat\theta) \xrightarrow{p} \mathcal{L}(\theta^*)$ as $\hat\theta \xrightarrow{p} \theta^*$.
• Multi-label classification setting:
  • The definition of $\mathcal{L}$ can be extended in more than one way.
  • Do similar results hold in this setting?