
7.3 Analysis and generalization of DCF and ROC

7.3.1 DCF vs Bayes error-rate

The DCF, or detection cost function,⁶ has been used by NIST as their primary speaker detection evaluation criterion for more than a decade, from 1997 to the present.⁷ In this section, we define and analyse DCF and compare it with our proposed evaluation recipe.

DCF can be specified concisely (as NIST does):

\[ \mathrm{DCF} = \pi\, C_{\text{miss}} P_{\text{miss}} + (1 - \pi)\, C_{\text{fa}} P_{\text{fa}} \tag{7.11} \]

or laboriously as we do below to relate to our analysis. (For the uninitiated, the symbols in (7.11) will be defined in due course.)

In our terminology, DCF is an evaluation criterion parametrized by a specified application. Evaluation by DCF requires the detector to make hard binary decisions. The decision set is $A = \{\text{accept}, \text{reject}\}$, where accept means the target hypothesis has been recognized and reject means the non-target hypothesis has been recognized. A miss is defined as the erroneous outcome (reject, target) and a false-accept as the erroneous outcome (accept, non-target). As explained in section 3.4.1, these respective errors are weighted with the specified parameters $C_{\text{miss}}$ and $C_{\text{fa}}$. The remaining two correct outcomes have zero cost. Let us denote this cost function as $C_{\text{dcf}}$. Additionally, a prior parameter is specified; in our notation, $\boldsymbol{\pi} = (\pi, 1 - \pi)$, where $\pi = P(\text{target} \mid \boldsymbol{\pi})$.

The triplet $(C_{\text{miss}}, C_{\text{fa}}, \pi)$ parametrizes a family of applications of the form $(C_{\text{dcf}}, \pi)$. As explained in section 3.4 on the equivalence of cost functions and section 6.3 on the equivalence of applications, a three-parameter specification is redundant: for evaluation we need only a single parameter. We give two such single-parameter application families that are equivalent to the three-parameter family, $(C_{\text{dcf}}, \pi)$:

\[ (C_{\text{dcf}}, \pi) \equiv (C_{\text{err}}, \tilde{\pi}) \equiv (C_{\eta}, 0.5) \tag{7.12} \]

where

\[ \tilde{\pi} = \operatorname{logit}^{-1}\!\left( \operatorname{logit}(\pi) + \log\frac{C_{\text{miss}}}{C_{\text{fa}}} \right) \tag{7.13} \]

\[ \eta = 1 - \tilde{\pi} \tag{7.14} \]

and where⁸ $C_{\text{err}}$ is zero-one cost as defined in section 2.1.3 and $C_{\eta}$ is the normalized cost defined in section 3.4.1. A good name for $\tilde{\pi}$ is the effective prior. The Bayes decision threshold for all three families is $\eta$.

⁶ This terminology may be confusing. Our cost function, $C_{\text{dcf}}$, evaluates a single detection trial. DCF is the expected cost over all the trials in an evaluation.

⁷ See www.itl.nist.gov/iad/mig//tests/sre/, where the “Speaker Recognition Eval-

As previously explained, these three families are equivalent for evaluation purposes because they would rank recognizers in the same order. All three share the same Bayes decision threshold, all three would give the same Bayes decision for a given posterior and all three would give the same miss and false-alarm rates for a given recognizer. This equivalence can be expressed in terms of our evaluation criterion as $E_{\text{dcf}}(W \mid \pi) = k\,E_{\text{err}}(W \mid \tilde{\pi}) = k'\,E_{\eta}(W \mid 0.5)$, where $k$ and $k'$ are unimportant positive scale factors.
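As a worked check of the first equality, the scale factor $k$ can be written out explicitly. The following is a sketch that uses only (7.11) and (7.13), and relies on the statement above that the miss and false-alarm rates are the same on both sides:

```latex
\begin{align*}
\tilde{\pi}
  &= \operatorname{logit}^{-1}\!\Bigl(\operatorname{logit}(\pi)
     + \log\tfrac{C_{\text{miss}}}{C_{\text{fa}}}\Bigr)
   = \frac{\pi C_{\text{miss}}}{\pi C_{\text{miss}} + (1-\pi) C_{\text{fa}}}, \\
k &= \pi C_{\text{miss}} + (1-\pi) C_{\text{fa}}, \\
k\,\bigl[\tilde{\pi} P_{\text{miss}} + (1-\tilde{\pi}) P_{\text{fa}}\bigr]
  &= \pi C_{\text{miss}} P_{\text{miss}} + (1-\pi) C_{\text{fa}} P_{\text{fa}}.
\end{align*}
```

Since $k > 0$ does not depend on the recognizer, dividing by it cannot change the ranking of recognizers.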

Our evaluation criterion, $E_{\text{dcf}}$, is a generalization of DCF in the following sense. For evaluation by DCF, the detector is required to submit a hard accept/reject decision for every trial, while for $E_{\text{dcf}}$, the detector is required to submit a log-likelihood-ratio for every trial. DCF evaluates the actual decisions made by a detector, while $E_{\text{dcf}}$ measures the ability of the detector to make Bayes decisions. In the DCF recipe, the evaluee (the recognizer) makes the decisions. In our recipe, the evaluator makes the decisions. DCF evaluates by cost function, while $E_{\text{dcf}}$ evaluates by proper scoring rule. In the DCF recipe, for a given submitted detector, there are fixed miss and false-alarm rates:

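\[ \mathrm{DCF} = \sum_{i=1}^{2} \frac{\pi_i}{|T_i|} \sum_{t \in T_i} C_{\text{dcf}}(a_t \mid \theta_i) = \pi\, C_{\text{miss}} P_{\text{miss}} + (1 - \pi)\, C_{\text{fa}} P_{\text{fa}} \tag{7.15} \]

where

\[ P_{\text{miss}} = \frac{1}{|T_1|} \sum_{t \in T_1} C_{\text{err}}(a_t \mid \text{target}) \tag{7.16} \]

\[ P_{\text{fa}} = \frac{1}{|T_2|} \sum_{t \in T_2} C_{\text{err}}(a_t \mid \text{non-target}) \tag{7.17} \]

and where $a_t \in \{\text{accept}, \text{reject}\}$ is the detector output for trial $t$.

⁸ Recall $\operatorname{logit}(p) = \log\frac{p}{1-p}$ and $\operatorname{logit}^{-1}(x) = \frac{1}{1+e^{-x}}$.

In our recipe, for a given detector, the error-rates vary as a function of the application parameter, as shown by the equivalent parametrization $(C_{\text{err}}, \tilde{\pi})$:

\[ E_{\text{err}}(W \mid \tilde{\pi}) = \sum_{i=1}^{2} \frac{\tilde{\pi}_i}{|T_i|} \sum_{t \in T_i} C^{*}_{\text{err}}\bigl(B(w_t, \tilde{\pi}) \bigm| \theta_i\bigr) = \tilde{\pi}\, P_{\text{miss}}(\tilde{\pi}) + (1 - \tilde{\pi})\, P_{\text{fa}}(\tilde{\pi}) \tag{7.18} \]

where

\[ P_{\text{miss}}(\tilde{\pi}) = \frac{1}{|T_1|} \sum_{t \in T_1} C^{*}_{\text{err}}\bigl(B(w_t, \tilde{\pi}) \bigm| \text{target}\bigr), \qquad P_{\text{fa}}(\tilde{\pi}) = \frac{1}{|T_2|} \sum_{t \in T_2} C^{*}_{\text{err}}\bigl(B(w_t, \tilde{\pi}) \bigm| \text{non-target}\bigr) \tag{7.19} \]

and where $w_t$ is the detector’s submitted log-likelihood-ratio for trial $t$.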
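The contrast between the two recipes can be sketched in a few lines of Python. This is an illustration only, not the evaluation code used in this work: the function names and the toy scores are made up, and the Bayes decision is implemented as a threshold of $-\operatorname{logit}(\tilde{\pi})$ on the submitted log-likelihood-ratio, under zero-one cost.

```python
import math

def dcf(decisions, labels, c_miss, c_fa, prior):
    """DCF of (7.15), computed from the detector's own hard decisions.

    decisions: list of "accept"/"reject"; labels: list of "target"/"non-target".
    """
    tar = [d for d, y in zip(decisions, labels) if y == "target"]
    non = [d for d, y in zip(decisions, labels) if y == "non-target"]
    p_miss = sum(d == "reject" for d in tar) / len(tar)   # (7.16)
    p_fa = sum(d == "accept" for d in non) / len(non)     # (7.17)
    return prior * c_miss * p_miss + (1.0 - prior) * c_fa * p_fa

def bayes_decisions(llrs, eff_prior):
    """Evaluator-side Bayes decisions: accept when llr >= -logit(eff_prior)."""
    threshold = -math.log(eff_prior / (1.0 - eff_prior))
    return ["accept" if w >= threshold else "reject" for w in llrs]

def bayes_error_rate(llrs, labels, eff_prior):
    """Bayes error-rate of (7.18): zero-one cost at the effective prior."""
    return dcf(bayes_decisions(llrs, eff_prior), labels, 1.0, 1.0, eff_prior)

# Made-up scores, for illustration only.
labels = ["target", "target", "target", "non-target", "non-target", "non-target"]
llrs = [2.5, 0.4, -1.0, -3.0, -0.2, 1.1]

# The same submitted scores are evaluated at several operating points.
for eff_prior in (0.1, 0.5, 0.9):
    print(eff_prior, bayes_error_rate(llrs, labels, eff_prior))
```

The point of the sketch is that `dcf` needs decisions fixed in advance by the detector, whereas `bayes_error_rate` lets the evaluator re-make the decisions for any effective prior.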

DCF does not allow the application to be varied. The prior and cost function are effectively hard-coded into the detector. Up to 2008, NIST had always parametrized the DCF at $(C_{\text{miss}}, C_{\text{fa}}, \pi) = (10, 1, 0.01)$, or equivalently at an effective prior of $\tilde{\pi} \approx 0.091$. In 2010 they specified $C_{\text{miss}} = C_{\text{fa}} = 1$ and $\tilde{\pi} = 0.001$. Now $\tilde{\pi} = 0.091$ is referred to as the old operating point and $\tilde{\pi} = 0.001$ is referred to as the new operating point. The limitation of DCF evaluation is that only a single operating point is evaluated in a given evaluation.
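The two operating points can be checked numerically from (7.13). The following is a minimal sketch; the function names `logit`, `logit_inv` and `effective_prior` are illustrative choices of ours and are not part of any NIST tooling.

```python
import math

def logit(p):
    """Log odds of a probability p in (0, 1)."""
    return math.log(p / (1.0 - p))

def logit_inv(x):
    """Inverse logit (logistic sigmoid)."""
    return 1.0 / (1.0 + math.exp(-x))

def effective_prior(c_miss, c_fa, prior):
    """Effective prior of (7.13) for the triplet (Cmiss, Cfa, pi)."""
    return logit_inv(logit(prior) + math.log(c_miss / c_fa))

old_point = effective_prior(10.0, 1.0, 0.01)   # triplet used up to 2008
new_point = effective_prior(1.0, 1.0, 0.001)   # triplet specified in 2010
print(round(old_point, 4), round(new_point, 4))
```

With equal costs the effective prior coincides with the prior itself, which is why the 2010 specification can be stated directly in terms of $\tilde{\pi}$.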

Our innovation with $E_{\text{dcf}}$ is that the application can be varied for a given detector, in order to evaluate it over a range of different operating points. By using the equivalence $E_{\text{dcf}}(W \mid \pi) = k\,E_{\text{err}}(W \mid \tilde{\pi}) = k'\,E_{\eta}(W \mid 0.5)$, we can conveniently parametrize the range of applications in terms of $\tilde{\pi}$, or in terms of $\eta$. We demonstrate with an example below, where we parametrize with $\tilde{\pi}$. (The $\eta$-parametrization will come in handy later.)

In what follows, we shall refer to $E_{\text{err}}$ as Bayes error-rate, since it is the error-rate obtained when making Bayes decisions.

Example: Normalized Bayes error-rate plot

In this example we do an evaluation of a single speaker detector, $W$. We use a fixed evaluation database of about a hundred thousand trials, but the Bayes error-rate evaluation criterion, $E_{\text{err}}(W \mid \tilde{\pi})$, is varied by adjusting $\tilde{\pi}$. That is, we are evaluating the recognizer over a range of different applications, where the effective prior is varied, but the cost function, $C_{\text{err}}$, remains fixed.

To illustrate our point, we use a typical uncalibrated speaker detector. Its output scores are evaluated as is, as if they were calibrated log-likelihood ratios.

The result of the evaluation is given in figure 7.1, where the horizontal axis represents the prior on a log odds scale and the vertical axis is a normalized version of $E_{\text{err}}(W \mid \tilde{\pi})$.

The horizontal axis of the plot represents the prior and at the same time the Bayes decision threshold. The horizontal axis is the prior log odds: $h = \operatorname{logit}(\tilde{\pi}) = \log\frac{\tilde{\pi}}{1-\tilde{\pi}}$. The logit is a monotonic rising invertible transformation that sends the interval $[0, 1]$ to the extended real line, $[-\infty, \infty]$. The $h$-axis is infinite, so we have to be content with plotting a limited interval, which we centre at $\operatorname{logit}(0.5) = 0$.

The Bayes decision threshold is just $\lambda = -h$, so that scores $w_t \ge -h$ give accept decisions, while those below give reject decisions.
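The threshold follows from Bayes' rule in log-odds form. As a brief sketch, assuming that $w_t$ is treated as a calibrated log-likelihood-ratio for the input $x_t$ and that the cost is the zero-one cost $C_{\text{err}}$:

```latex
\begin{align*}
\operatorname{logit} P(\text{target} \mid x_t, \tilde{\pi})
  &= w_t + \operatorname{logit}(\tilde{\pi}) = w_t + h, \\
\text{accept}
  \iff P(\text{target} \mid x_t, \tilde{\pi}) \ge \tfrac{1}{2}
  &\iff w_t + h \ge 0
  \iff w_t \ge -h = \lambda .
\end{align*}
```

Moving along the $h$-axis therefore changes the prior and the decision threshold together.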

The vertical axis, $v$, of the plot is our evaluation criterion normalized thus:

\[ v = \frac{E_{\text{err}}(W \mid \tilde{\pi})}{E_{\text{err}}(W_0 \mid \tilde{\pi})} = \frac{\tilde{\pi}\, P_{\text{miss}}(\tilde{\pi}) + (1 - \tilde{\pi})\, P_{\text{fa}}(\tilde{\pi})}{\min(\tilde{\pi}, 1 - \tilde{\pi})} \tag{7.20} \]

where $W_0$ is the default recognizer that always outputs $W_0(x_t) = 0$. The denominator changes abruptly at $\tilde{\pi} = 0.5$, hence the cusp at $h = 0$. The numerator is the Bayes error-rate, where everything varies as functions of $h$: $\tilde{\pi} = \operatorname{logit}^{-1}(h)$, $P_{\text{miss}}(\tilde{\pi})$ is the proportion of target trials with scores below the threshold and $P_{\text{fa}}(\tilde{\pi})$ is the proportion of non-target trials with scores above the threshold, as given by (7.19).
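The curve of (7.20) can be computed by sweeping $h$ over the plotted interval. The sketch below is only an illustration under stated assumptions: the function name and the three-trial score lists are made up, and the real evaluation uses about a hundred thousand trials.

```python
import math

def normalized_bayes_error(tar_llrs, non_llrs, h):
    """Normalized Bayes error-rate v of (7.20) at prior log odds h."""
    eff_prior = 1.0 / (1.0 + math.exp(-h))    # logit^{-1}(h)
    threshold = -h                            # Bayes threshold on the score
    p_miss = sum(w < threshold for w in tar_llrs) / len(tar_llrs)
    p_fa = sum(w >= threshold for w in non_llrs) / len(non_llrs)
    numerator = eff_prior * p_miss + (1.0 - eff_prior) * p_fa
    return numerator / min(eff_prior, 1.0 - eff_prior)

# Made-up scores, for illustration only.
tar_llrs = [2.5, 0.4, -1.0]
non_llrs = [-3.0, -0.2, 1.1]

# Sweep the prior log odds over the interval [-5, 5].
curve = [(h, normalized_bayes_error(tar_llrs, non_llrs, h)) for h in range(-5, 6)]
for h, v in curve:
    print(f"{h:+d}  {v:.3f}")

# The default recognizer (all scores 0) sits at v = 1 for every h.
default = [normalized_bayes_error([0.0] * 3, [0.0] * 3, h) for h in range(-5, 6)]
```

Values of `v` below 1 indicate operating points where the detector does better than the default recognizer; values above 1 indicate the opposite.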

The normalization places the reference value at 1 for default performance. This is indicated by the dashed red line. This plot shows that the uncalibrated detector, $W$, is useful for applications with prior log odds greater than about $-2$, while for applications with smaller prior, this recognizer is badly calibrated and would not be useful, since performance is worse than the default detector. The important message here is that if we had evaluated the detector at just one specific application $(C_{\text{err}}, \tilde{\pi})$, as represented by just one point on the horizontal axis, then the measurement at that point would not necessarily give a good indication of usefulness at some other point (application) along the axis.

The above graphical evaluation procedure forms the essence of the author’s current favourite tool for judging the calibration of speaker recognition scores. We made extensive use of this tool during the 2010 NIST Speaker Recognition Evaluation. Examples will be given in section 7.6 below.

This graphical solution is applicable only to two-class problems, because for multi-class problems, applications cannot be parametrized by a single parameter. This makes graphical representation difficult, especially when there are several classes. We defer discussion of multi-class evaluation solutions to chapter 8.

Moreover, the graphical solution does not provide an application-spanning, scalar summary criterion. In practice, such scalar summary criteria are invaluable tools for discriminative training of fusion and calibration parameters of pattern recognizers. We discuss summary criteria in section 7.4.

[Plot: horizontal axis “logit prior”, from −5 to 5; vertical axis “normalized cost”, from 0 to 2.5.]

Figure 7.1: Normalized Bayes error-rate for a speaker detector, as a function of the prior.