Normalized Bayes error-rate plots - Measuring, refining and calibrating speaker and language in

Here we generalize the traditional DCF/minDCF calibration decomposition, which evaluates decisions and scores at a fixed application, to the $E_{\mathrm{err}}$/$E_{\mathrm{err}}^{\min}$ decomposition, which evaluates log-likelihood-ratios over a range of applications. Given our analysis in this chapter, the recipe is straightforward:

The recognizer, $W$, provides output, $w_t$, in calibrated log-likelihood-ratio format for every trial $t$ of the supervised evaluation database. This is evaluated by (7.18) as $E_{\mathrm{err}}(W\mid\tilde{\pi})$. As already demonstrated in figure 7.1, this criterion can be normalized and plotted as a function of the application parameter (the effective prior), $\tilde{\pi}$.

Now treat the same submitted log-likelihood-ratios, $w_t$, as uncalibrated scores and let the evaluator do the calibration. This takes the form of computing the traditional minDCF (which evaluates scores), but for multiple closely spaced values of $\tilde{\pi}$ over the plotting range of interest. We define:

$$E_{\mathrm{err}}^{\min}(W\mid\tilde{\pi}) = \min_i \left[ \tilde{\pi}\, P_{\mathrm{miss}}(\gamma_i) + (1-\tilde{\pi})\, P_{\mathrm{fa}}(\gamma_i) \right]$$

where, as explained above, the $\gamma_i$ are all of the score threshold values that give different $(P_{\mathrm{fa}}, P_{\mathrm{miss}})$ points in the empirical ROC. Now $E_{\mathrm{err}}^{\min}$ can be normalized and plotted alongside $E_{\mathrm{err}}$.

Figure 7.8: Normalized Bayes error-rate plot for an SRE 2010 speaker detector with good calibration (panel title: BUT PLDA i-vector, condition 2; horizontal axis: logit Ptar; vertical axis: normalized DCF; legend: new DCF point, dev misses, dev false-alarms, dev act DCF, eval misses, eval false-alarms, eval min DCF, eval act DCF, eval DR30). Here eval denotes the evaluation database and dev the development database. DCF and minDCF refer to $E_{\mathrm{err}}$ and $E_{\mathrm{err}}^{\min}$. $P_{\mathrm{miss}}$ and normalized $P_{\mathrm{fa}}$ are also shown separately. DR30 refers to the point to the left of which there are fewer than 30 false-alarms. The vertical magenta dashed line represents the new operating point at $\tilde{\pi} = 0.001$.

Figure 7.9: Normalized Bayes error-rate plot for an SRE 2010 speaker detector with bad calibration (panel title: BUT i-vector full-cov, condition 2; axes and legend as in figure 7.8). See caption of figure 7.8 for details.

The minimization ensures: $E_{\mathrm{err}}^{\min}(W\mid\tilde{\pi}) \le E_{\mathrm{err}}(W\mid\tilde{\pi})$. If these two are close, the calibration is good; if they are very different, the calibration is bad. For a good detector, it is desirable to have good calibration everywhere.
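The comparison of the two criteria can be sketched in a few lines of Python. This is a minimal illustration, not the evaluation code used for the figures: the function names (`eerr`, `eerr_min`, `normalized`) and the toy scores are assumptions made for the example, ties are resolved by accepting a trial whose score equals the threshold, and the brute-force minimization is quadratic in the number of trials, so it is only suitable for small score sets.

```python
import math

def logit(p):
    return math.log(p) - math.log1p(-p)

def error_rates(tar, non, threshold):
    """(P_miss, P_fa) when trials scoring at or above the threshold are accepted."""
    p_miss = sum(1 for s in tar if s < threshold) / len(tar)
    p_fa = sum(1 for s in non if s >= threshold) / len(non)
    return p_miss, p_fa

def eerr(tar, non, prior):
    """Actual error-rate: scores are trusted as LLRs, threshold at -logit(prior)."""
    p_miss, p_fa = error_rates(tar, non, -logit(prior))
    return prior * p_miss + (1 - prior) * p_fa

def eerr_min(tar, non, prior):
    """Minimum over every threshold that gives a distinct empirical ROC point."""
    # Each distinct score value, plus +inf (reject everything), is a candidate.
    candidates = sorted(set(tar) | set(non)) + [math.inf]
    return min(
        prior * p_miss + (1 - prior) * p_fa
        for p_miss, p_fa in (error_rates(tar, non, g) for g in candidates)
    )

def normalized(value, prior):
    """Divide by the default error-rate min(prior, 1 - prior)."""
    return value / min(prior, 1 - prior)

# Hypothetical toy scores (targets and non-targets), for illustration only.
tar = [2.0, 0.5, 3.0, -1.0]
non = [-2.0, 0.2, 1.0, -3.0]

# Shifting every score by a constant ruins calibration but not discrimination.
shifted_tar = [s + 5.0 for s in tar]
shifted_non = [s + 5.0 for s in non]

for name, t, n in [("original", tar, non), ("shifted", shifted_tar, shifted_non)]:
    act = normalized(eerr(t, n, 0.5), 0.5)
    best = normalized(eerr_min(t, n, 0.5), 0.5)
    print(f"{name}: normalized Eerr = {act:.3f}, normalized Eerrmin = {best:.3f}")
```

In this toy case the shift leaves the minimum unchanged while the actual error-rate rises to the default value, which is the kind of gap between the two curves that signals bad calibration.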

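As before, we assume all the scores are finite, so that there is always a threshold that makes either one of the error-rates zero, so that $E_{\mathrm{err}}^{\min}(W\mid\tilde{\pi}) \le E_{\mathrm{err}}(W_0\mid\tilde{\pi}) = \min(\tilde{\pi}, 1-\tilde{\pi})$. If we normalize by $E_{\mathrm{err}}(W_0\mid\tilde{\pi})$, then:

$$\frac{E_{\mathrm{err}}(W\mid\tilde{\pi})}{E_{\mathrm{err}}(W_0\mid\tilde{\pi})} \;\ge\; \frac{E_{\mathrm{err}}^{\min}(W\mid\tilde{\pi})}{E_{\mathrm{err}}(W_0\mid\tilde{\pi})} \;\le\; 1. \tag{7.43}$$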

This agrees with our previous conclusion: any performance worse than the default is due to bad calibration.

This plot of normalized $E_{\mathrm{err}}$ and $E_{\mathrm{err}}^{\min}$ vs $h = \operatorname{logit}\tilde{\pi}$ is the author's current favourite tool for judging the calibration of a given speaker detector. It is the tool we used both in preparation for the NIST 2010 Speaker Recognition Evaluation (SRE2010) and for our analysis of the results afterwards. In figures 7.8 and 7.9, we show examples of two subsystems submitted to SRE2010, with respectively good and bad calibration.

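Notice that when the horizontal axis is $h = \operatorname{logit}\tilde{\pi}$, then for the region $h < 0$, which we plot in these figures, the vertical axis (normalized error-rate) is:

$$\begin{aligned}
v &= \frac{E_{\mathrm{err}}(W\mid\tilde{\pi})}{\min(\tilde{\pi}, 1-\tilde{\pi})} = \frac{\tilde{\pi}\, P_{\mathrm{miss}}(\tilde{\pi}) + (1-\tilde{\pi})\, P_{\mathrm{fa}}(\tilde{\pi})}{\tilde{\pi}} \\
&= P_{\mathrm{miss}}(\tilde{\pi}) + \exp(-\operatorname{logit}\tilde{\pi})\, P_{\mathrm{fa}}(\tilde{\pi}) \\
&= P_{\mathrm{miss}}(\operatorname{logit}^{-1} h) + \exp(-h)\, P_{\mathrm{fa}}(\operatorname{logit}^{-1} h)
\end{aligned} \tag{7.44}$$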

where we used (7.18). The exponential amplification of false-alarms induced by this normalization explains the shape of the $E_{\mathrm{err}}$ curves for regions of bad calibration. Some form of amplifying normalization is needed to make the effects of calibration visible in regions of low error-rate. This normalization is the main difference between these curves and APE-curves (see section 7.7.4 below). The normalized Bayes error-rate plot is able to display a wider range of operating points than the APE-curve.
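As a worked instance of (7.44), consider the new operating point marked in figures 7.8 and 7.9, $\tilde{\pi} = 0.001$. The amplification factor follows directly from the definition of the logit; the false-alarm rate of $10^{-3}$ used afterwards is only an illustrative value, not a measured one.

$$h = \operatorname{logit}(0.001) = \ln\frac{0.001}{0.999} \approx -6.91, \qquad \exp(-h) = \frac{1-\tilde{\pi}}{\tilde{\pi}} = 999, \qquad v = P_{\mathrm{miss}} + 999\, P_{\mathrm{fa}}.$$

At this operating point a false-alarm rate of just $10^{-3}$ already contributes $0.999$ to $v$, that is, almost the whole of the default value $v = 1$, before any misses are counted.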

Finally, recall the discussion of section 7.3.2, which effectively means one cannot extend the horizontal axis indefinitely in either direction, because the errors will run out somewhere along the way. To make this effect explicit, we plot what we call the DR30 point, to the left of which the absolute number of false-alarms drops below 30. This point is on the $E_{\mathrm{err}}^{\min}$ curve, because we use the false-alarm count which results from the evaluator's optimized threshold. DR30 refers to Doddington's Rule of 30, see appendix B.
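One way the DR30 point could be located on a grid of $h$ values is sketched below. This is an illustrative reading of the description above rather than the procedure used for the figures: the names `roc_counts`, `dr30_point`, `h_grid` and `min_false_alarms` are hypothetical, and when several thresholds are equally good at some $h$, the sketch arbitrarily keeps the one with the most false-alarms.

```python
import math

def roc_counts(tar, non):
    """Absolute (misses, false_alarms) counts for every distinct threshold."""
    pooled = sorted([(s, 1) for s in tar] + [(s, 0) for s in non])
    n_miss, n_fa = 0, len(non)
    points = [(n_miss, n_fa)]  # threshold below all scores: accept everything
    i = 0
    while i < len(pooled):
        j = i
        # Move the threshold just above all trials that share this score value.
        while j < len(pooled) and pooled[j][0] == pooled[i][0]:
            n_miss += pooled[j][1]      # a target now falls below the threshold
            n_fa -= 1 - pooled[j][1]    # a non-target is no longer accepted
            j += 1
        points.append((n_miss, n_fa))
        i = j
    return points

def dr30_point(tar, non, h_grid, min_false_alarms=30):
    """Smallest h on the grid whose optimal threshold still yields enough false-alarms.

    Returns None if no grid point reaches min_false_alarms.
    """
    points = roc_counts(tar, non)
    n_tar, n_non = len(tar), len(non)
    for h in sorted(h_grid):
        prior = 1.0 / (1.0 + math.exp(-h))  # inverse logit
        # The evaluator's optimized threshold at this prior; min() keeps the
        # first of any tied points, i.e. the one with the most false-alarms.
        _, n_fa = min(
            points,
            key=lambda p: prior * p[0] / n_tar + (1 - prior) * p[1] / n_non,
        )
        if n_fa >= min_false_alarms:
            return h
    return None
```

With real evaluation data one would call, for example, `dr30_point(tar, non, h_grid)` with the same grid of $h$ values used for plotting; the default of 30 follows the rule named in the text.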

7.6.1 Computation

In SRE2010, large score sets (up to a few million trials) were needed to support the new operating point. With inefficient algorithms, evaluation of a single detector by $E_{\mathrm{err}}$ and $E_{\mathrm{err}}^{\min}$ may take several minutes. We propose the following efficient algorithms, which in our implementation take a few seconds to execute:

To efficiently compute $E_{\mathrm{err}}$, pool all the scores, $w_t$, with all the different thresholds, $-\operatorname{logit}\tilde{\pi}_i$, at which $E_{\mathrm{err}}(W\mid\tilde{\pi}_i)$ is to be evaluated. Sort them all together, keeping track of where the thresholds end up. A simple calculation involving the index of each threshold gives the desired miss and false-alarm rate at each threshold.
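The pooled-sort idea can be sketched as follows. This is a minimal interpretation under stated assumptions, not the implementation referred to above: the function names `count_below` and `bayes_error_rates` are hypothetical, targets and non-targets are pooled with the thresholds in two separate sorts for simplicity, and a score exactly equal to a threshold is counted as accepted.

```python
import math

def count_below(scores, thresholds):
    """For each threshold, count the scores strictly below it using one pooled sort."""
    # Tag 0 marks a threshold and tag 1 a score, so that a threshold sorts
    # ahead of any score with the same value.
    pooled = [(th, 0, i) for i, th in enumerate(thresholds)]
    pooled += [(s, 1, -1) for s in scores]
    pooled.sort()
    below = [0] * len(thresholds)
    thresholds_seen = 0
    for position, (_, tag, index) in enumerate(pooled):
        if tag == 0:
            # Everything sorted before this threshold is either a score below
            # it or one of the thresholds already seen.
            below[index] = position - thresholds_seen
            thresholds_seen += 1
    return below

def bayes_error_rates(tar_llrs, non_llrs, priors):
    """Eerr at every prior, with the Bayes threshold -logit(prior) for each."""
    thresholds = [-(math.log(p) - math.log1p(-p)) for p in priors]
    tar_below = count_below(tar_llrs, thresholds)   # misses
    non_below = count_below(non_llrs, thresholds)   # correct rejections
    n_tar, n_non = len(tar_llrs), len(non_llrs)
    return [
        p * m / n_tar + (1 - p) * (n_non - r) / n_non
        for p, m, r in zip(priors, tar_below, non_below)
    ]
```

The cost is dominated by the two sorts, so evaluating a dense grid of priors adds little to the cost of evaluating a single one.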

To efficiently compute $E_{\mathrm{err}}^{\min}$, compute the vertices of the ROCCH, using the PAV algorithm (to be discussed below). There are typically very few of these vertices and, as shown in section 7.3.6, the original large ROC can be replaced with these vertices without changing the value of minDCF.
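A compact sketch of this second algorithm is given below, assuming the usual correspondence between PAV blocks and ROCCH vertices: the trials are sorted by score, adjacent groups are pooled until the fraction of targets per group increases strictly with the score, and each surviving group boundary is a vertex. The function names `rocch_vertices` and `eerr_min_from_vertices` are hypothetical, and the returned list also contains the two trivial corners of the ROC.

```python
def rocch_vertices(tar, non):
    """ROCCH vertices as (P_fa, P_miss) pairs, ordered from P_fa = 1 down to P_fa = 0."""
    # One initial block [n_targets, n_non_targets] per distinct score value.
    counts = {}
    for s in tar:
        counts.setdefault(s, [0, 0])[0] += 1
    for s in non:
        counts.setdefault(s, [0, 0])[1] += 1

    blocks = []
    for score in sorted(counts):
        n_t, n_n = counts[score]
        # PAV step: pool with the previous block while its target fraction is
        # not strictly smaller. The comparison is cross-multiplied:
        # t_prev/(t_prev+n_prev) >= n_t/(n_t+n_n)  <=>  t_prev*n_n >= n_t*n_prev.
        while blocks and blocks[-1][0] * n_n >= n_t * blocks[-1][1]:
            t_prev, n_prev = blocks.pop()
            n_t += t_prev
            n_n += n_prev
        blocks.append((n_t, n_n))

    n_tar, n_non = len(tar), len(non)
    misses, false_alarms = 0, n_non
    vertices = [(false_alarms / n_non, misses / n_tar)]  # accept everything
    for n_t, n_n in blocks:
        # Raising the threshold past a whole block turns its targets into
        # misses and removes its non-targets from the false-alarms.
        misses += n_t
        false_alarms -= n_n
        vertices.append((false_alarms / n_non, misses / n_tar))
    return vertices

def eerr_min_from_vertices(vertices, prior):
    """Minimum of prior*P_miss + (1-prior)*P_fa over the ROCCH vertices."""
    return min(prior * p_miss + (1 - prior) * p_fa for p_fa, p_miss in vertices)
```

Once the vertices are available, evaluating `eerr_min_from_vertices` over a dense grid of priors costs only a pass over a short list per prior, independent of the number of trials.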
