Idealized scores - Measuring, refining and calibrating speaker and language information extract

Here we analyse idealized detection scores, the properties of which will inspire some of the choices we have to make below when choosing practical evaluation strategies.

7.2.1 Assumptions

Let f(s) and g(s) be probability densities with support R, so that they are non-zero for every s ∈ R, but f(∞) = f(−∞) = g(∞) = g(−∞) = 0. We also assume both are differentiable for every s ∈ R. We interpret these functions as the score likelihoods, f(s) = P(s|target, O) and g(s) = P(s|non-target, O). As before, O plays the role of conditioning these probability distributions. We further assume that f and g are related such that the likelihood-ratio,

$$\ell(s) = \frac{f(s)}{g(s)}$$

is a strictly monotonic rising bijection² from [−∞, ∞] to [0, ∞], with ℓ(−∞) = 0 and ℓ(∞) = ∞, so that the inverse function ℓ⁻¹(y) is defined for every y ∈ [0, ∞] and the derivative satisfies ℓ′(s) > 0 for every s ∈ R.

In the notation of section 4.3, w = log ℓ(s) is a calibration transformation between score and log-likelihood-ratio. The above assumptions are equivalent to requiring that there exists a strictly monotonic rising, continuous, differentiable bijection between scores and log-likelihood-ratios.

We give three examples that satisfy these criteria:

Example 1: Gaussian scores

For normal distributions, f(s) = N(s|µ1, σ²) and g(s) = N(s|µ2, σ²), where µ1 > µ2, we find:

$$\ell(s) = \exp\left(\frac{\mu_1 - \mu_2}{\sigma^2}\, s + \frac{\mu_2^2 - \mu_1^2}{2\sigma^2}\right)$$

Note this doesn't work when the variances are different, because then ℓ(s) is not monotonic.
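The Gaussian example can be checked numerically. The following is a minimal sketch, assuming illustrative parameter values (µ1 = 1, µ2 = −1, σ = 1.5) that are not taken from the text, and using hypothetical helper names. It compares the closed-form log-likelihood-ratio with the directly computed log f(s) − log g(s), and then shows that a non-target density with a different variance gives a log-likelihood-ratio that is not monotonic on the chosen grid.

```python
import math
from statistics import NormalDist

def llr_equal_var(s, mu1, mu2, sigma):
    """Closed-form log l(s) for equal-variance Gaussians: linear in s."""
    return (mu1 - mu2) / sigma**2 * s + (mu2**2 - mu1**2) / (2 * sigma**2)

def llr_from_pdfs(s, f, g):
    """log l(s) computed directly as log f(s) - log g(s)."""
    return math.log(f.pdf(s)) - math.log(g.pdf(s))

# Illustrative parameters (assumptions, not from the text).
mu1, mu2, sigma = 1.0, -1.0, 1.5
f, g = NormalDist(mu1, sigma), NormalDist(mu2, sigma)
grid = [x / 10 for x in range(-50, 51)]

# Largest disagreement between the closed form and the direct ratio.
max_gap = max(
    abs(llr_equal_var(s, mu1, mu2, sigma) - llr_from_pdfs(s, f, g)) for s in grid
)

# Unequal variances: log l(s) becomes quadratic in s, so it cannot be monotonic.
g_wide = NormalDist(mu2, 2 * sigma)
llr_unequal = [llr_from_pdfs(s, f, g_wide) for s in grid]
is_monotonic = all(b > a for a, b in zip(llr_unequal, llr_unequal[1:]))

print("max gap:", max_gap)
print("unequal-variance LLR monotonic on grid:", is_monotonic)
```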

Example 2: Calibrated scores

If the score s is a perfectly calibrated log-likelihood-ratio, so that s = log(f(s)/g(s)) = log ℓ(s), then ℓ(s) = exp(s).

Example 3: Calibratable scores

If s = α + β log ℓ(s), for β > 0, then ℓ(s) = exp((s − α)/β). This also works for more general bijections between s and log ℓ(s).
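As a small illustration of Example 3, the sketch below inverts an affine miscalibration. The function names and the values of α and β are hypothetical, chosen only for the example; the point is that for β > 0 the map is rising, so the score ordering and the log-likelihood-ratio ordering agree.

```python
import math

def log_likelihood_ratio(s, alpha, beta):
    """Invert s = alpha + beta * log l(s); requires beta > 0."""
    if beta <= 0:
        raise ValueError("beta must be positive for a rising calibration map")
    return (s - alpha) / beta

def likelihood_ratio(s, alpha, beta):
    """l(s) = exp((s - alpha) / beta)."""
    return math.exp(log_likelihood_ratio(s, alpha, beta))

# Illustrative values (assumptions, not from the text).
alpha, beta = 0.7, 2.5
true_llrs = [-3.0, -0.5, 0.0, 1.2, 4.0]

# Miscalibrate the log-likelihood-ratios, then recover them from the scores.
scores = [alpha + beta * w for w in true_llrs]
recovered = [log_likelihood_ratio(s, alpha, beta) for s in scores]

print(recovered)
```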

7.2.2 Properties

Under the above assumptions, the following properties hold:

Scores can make optimal decisions

We already know thresholding the likelihood-ratio, ℓ(s), can be used to make optimal detection decisions. However, thresholding the score can do so too, because for any likelihood-ratio threshold, y ∈ [0, ∞], the score threshold, γ = ℓ⁻¹(y), makes equivalent decisions, since: s ≥ γ if and only if ℓ(s) ≥ y.

For some score threshold, γ ∈ [−∞, ∞], define the miss rate, F (γ), and false-alarm rate, G(γ), as:

$$F(\gamma) = P(s < \gamma \mid \text{target}) = \int_{-\infty}^{\gamma} f(s)\, ds \tag{7.2}$$

$$G(\gamma) = P(s \ge \gamma \mid \text{non-target}) = \int_{\gamma}^{\infty} g(s)\, ds \tag{7.3}$$

with derivatives F′(γ) = f(γ) and G′(γ) = −g(γ). This shows F and G are continuous, strictly monotonically increasing and decreasing respectively, and both invertible.

For a given prior, (π, 1 − π), let the expected error-rate be:

$$E(\pi, \gamma) = \pi F(\gamma) + (1 - \pi)\, G(\gamma) \tag{7.4}$$

For any 0 < π < 1, we can use first and second derivatives w.r.t. γ to verify that E has a unique minimum at

$$\gamma^* = \ell^{-1}\left(\frac{1 - \pi}{\pi}\right) \tag{7.5}$$

where the first derivative is zero and the second derivative is πℓ′(γ∗)g(γ∗) > 0. At the boundaries, π = 0 or π = 1, we have the minima respectively at E(0, ∞) = E(1, −∞) = 0. Here (1 − π)/π is the likelihood-ratio threshold and γ∗ is the score threshold, and both give the same minimum-expected-error-rate Bayes decisions.
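The optimal threshold can be checked numerically for the equal-variance Gaussian case of Example 1. This is a minimal sketch under assumed parameters (µ1 = 1, µ2 = −1, σ = 1.5, π = 0.2); the helper names are hypothetical. It compares the threshold from (7.5), obtained by inverting the Gaussian log-likelihood-ratio in closed form, with a brute-force grid search over E(π, γ).

```python
import math
from statistics import NormalDist

# Illustrative parameters (assumptions, not from the text).
mu1, mu2, sigma = 1.0, -1.0, 1.5
f, g = NormalDist(mu1, sigma), NormalDist(mu2, sigma)

def expected_error(pi, gamma):
    """E(pi, gamma) = pi * F(gamma) + (1 - pi) * G(gamma), as in (7.4)."""
    miss = f.cdf(gamma)                # F(gamma) = P(s < gamma | target)
    false_alarm = 1.0 - g.cdf(gamma)   # G(gamma) = P(s >= gamma | non-target)
    return pi * miss + (1 - pi) * false_alarm

def bayes_threshold(pi):
    """gamma* = l^{-1}((1 - pi) / pi) for equal-variance Gaussian scores."""
    log_y = math.log((1 - pi) / pi)
    # Invert log l(s) = (mu1 - mu2) / sigma^2 * s + (mu2^2 - mu1^2) / (2 sigma^2).
    return sigma**2 * log_y / (mu1 - mu2) + (mu1 + mu2) / 2

pi = 0.2
gamma_star = bayes_threshold(pi)

# Brute-force search over a threshold grid with spacing 0.01.
grid = [x / 100 for x in range(-600, 601)]
gamma_grid = min(grid, key=lambda gam: expected_error(pi, gam))

print("closed form:", gamma_star, "grid search:", gamma_grid)
```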

We use the notation E∗ for the minimum:

$$E^*(\pi) = E(\pi, \gamma^*) = \min_{\gamma} E(\pi, \gamma) \tag{7.6}$$

As mentioned above, E∗(0) = E∗(1) = 0. It is easy to show³ that E∗(π) is concave and therefore has a unique maximum. E∗(π) can be interpreted as the error-rate of the detector at the prior, π, provided the optimal score threshold is used. For well-calibrated detectors, the error-rate vanishes when there is no prior uncertainty, but is non-zero in between. (Our experiments on real speaker detection scores show that the maximum invariably occurs somewhere near π = 0.5.)
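The shape of E∗(π) can be illustrated for idealized Gaussian scores. The sketch below assumes the same illustrative parameters as Example 1 (µ1 = 1, µ2 = −1, σ = 1.5, none of them from the text) and a hypothetical helper `min_expected_error`. It evaluates E∗ on a grid of priors, checks concavity via second differences, and locates the maximum, which for this symmetric configuration falls at π = 0.5.

```python
import math
from statistics import NormalDist

# Illustrative parameters (assumptions, not from the text).
mu1, mu2, sigma = 1.0, -1.0, 1.5
f, g = NormalDist(mu1, sigma), NormalDist(mu2, sigma)

def min_expected_error(pi):
    """E*(pi): expected error-rate at the Bayes threshold of (7.5)."""
    if pi <= 0.0 or pi >= 1.0:
        return 0.0  # boundary cases: E(0, inf) = E(1, -inf) = 0
    gamma = sigma**2 * math.log((1 - pi) / pi) / (mu1 - mu2) + (mu1 + mu2) / 2
    return pi * f.cdf(gamma) + (1 - pi) * (1.0 - g.cdf(gamma))

priors = [i / 200 for i in range(201)]
curve = [min_expected_error(p) for p in priors]

# Concavity on the grid: every second difference should be non-positive.
second_diffs = [
    curve[i - 1] - 2 * curve[i] + curve[i + 1] for i in range(1, len(curve) - 1)
]
is_concave = all(d <= 1e-12 for d in second_diffs)

peak = max(range(len(curve)), key=curve.__getitem__)
print("concave:", is_concave, "argmax prior:", priors[peak], "max:", curve[peak])
```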

ROC properties

If we sweep the score threshold, γ, from −∞ to ∞ (with F increasing and G decreasing) and plot G against F, this gives⁴ what is known as the receiver operating characteristic, or ROC [62]. It gives a trade-off between false-alarm-rate and miss-rate as the threshold is varied. This curve is a strictly decreasing and strictly convex function, which maps false-alarm-rate, in [0, 1], to miss-rate, in [0, 1]:

$$P_{\text{miss}} = F\left(G^{-1}(P_{\text{fa}})\right) \tag{7.7}$$

The curve end-points are at (Pfa, Pmiss) = (0, 1) and (1, 0). The slope of this function, at a given threshold γ, is −ℓ(γ). The convexity follows from the strictly increasing property of ℓ.
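The slope and convexity claims can be checked numerically. The following sketch assumes equal-variance Gaussian scores with illustrative parameters (not from the text) and hypothetical function names. It evaluates (7.7) directly, compares a central-difference slope at one operating point with −ℓ(γ), and tests monotonicity and convexity on a grid of false-alarm rates.

```python
from statistics import NormalDist

# Illustrative target and non-target score distributions (assumptions).
f, g = NormalDist(1.0, 1.5), NormalDist(-1.0, 1.5)

def roc_miss_rate(p_fa):
    """P_miss = F(G^{-1}(P_fa)), as in (7.7)."""
    gamma = g.inv_cdf(1.0 - p_fa)   # threshold with G(gamma) = p_fa
    return f.cdf(gamma)

def likelihood_ratio(gamma):
    """l(gamma) = f(gamma) / g(gamma)."""
    return f.pdf(gamma) / g.pdf(gamma)

# Slope at one operating point, by central difference.
p_fa, h = 0.1, 1e-5
gamma = g.inv_cdf(1.0 - p_fa)
numeric_slope = (roc_miss_rate(p_fa + h) - roc_miss_rate(p_fa - h)) / (2 * h)
analytic_slope = -likelihood_ratio(gamma)

# Shape of the curve on an interior grid of false-alarm rates.
grid = [i / 100 for i in range(1, 100)]
curve = [roc_miss_rate(p) for p in grid]
is_decreasing = all(b < a for a, b in zip(curve, curve[1:]))
is_convex = all(
    curve[i - 1] - 2 * curve[i] + curve[i + 1] > 0 for i in range(1, len(curve) - 1)
)

print(numeric_slope, analytic_slope, is_decreasing, is_convex)
```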

³ This can be done as in section 2.1.6, or by noting that together G and F satisfy the contract of a cost function, with A ∈ [−∞, ∞], so that E∗ has the form of a generalized entropy function.

⁴ ROC is often plotted as G vs 1 − F, but for convenience we use G vs F here to agree

EER interpretation

The equal-error-rate, or EER, is a special point on the ROC, which acts as a scalar summary of the whole curve. It can be defined as the point on the ROC for which (Pfa, Pmiss) = (EER, EER). Below we show how it acts as a summary for the whole curve [6].

Under the assumptions above, the expected error-rate, E(π, γ), satisfies the conditions for Sion's minimax theorem [63], which allows interchanging of nested minimum and maximum over the two variables. In particular, E is quasi-convex in γ, since it has a unique minimum. E is linear in π and therefore quasi-concave in π. We can now interpret the EER as the maximum optimal error-rate:

$$\begin{aligned}
\max_{\pi} E^*(\pi) &= \max_{\pi} \min_{\gamma} E(\pi, \gamma) \\
&= \min_{\gamma} \max_{\pi} E(\pi, \gamma) \\
&= \min_{\gamma} \max\{E(0, \gamma),\, E(1, \gamma)\}, \quad \text{by linearity in } \pi \\
&= \min_{\gamma} \max\{G(\gamma),\, F(\gamma)\} \\
&= G(\gamma_{\text{eer}}) = F(\gamma_{\text{eer}}) = \text{EER}
\end{aligned} \tag{7.8}$$

where the last step can be understood by noting that F increases from 0 to 1, while G decreases from 1 to 0, forming a rough X, of which the top v is the inner discrete maximum, which in turn has its minimum w.r.t. γ at the cusp—where the two functions are equal.

The max min form of the first line of (7.8) shows that the EER summarizes the curve as a tight upper bound on error-rate: Provided you use the optimal threshold (found by the inner minimization), the error-rate at any prior (scanned by the outer maximization) will not exceed the EER.⁵

If a detector is optimized by using EER as an objective to be minimized, error-rates at all operating points will be driven down. E∗ is concave, so that by ‘pushing down’ on its maximum at EER, you cannot make a local dent that violates concavity, instead (if anything budges) the whole curve has to go down.
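The two readings of the EER in (7.8) can be compared numerically. This is a sketch for idealized equal-variance Gaussian scores with assumed parameters (µ1 = 2, µ2 = −1, σ = 1.5) and hypothetical helper names. It finds the threshold where F and G cross by bisection, and separately maximizes E∗(π) over a grid of priors; the two values should agree, and no E∗(π) should exceed the EER.

```python
import math
from statistics import NormalDist

# Illustrative parameters (assumptions, not from the text).
mu1, mu2, sigma = 2.0, -1.0, 1.5
f, g = NormalDist(mu1, sigma), NormalDist(mu2, sigma)

def miss(gamma):
    """F(gamma) = P(s < gamma | target)."""
    return f.cdf(gamma)

def false_alarm(gamma):
    """G(gamma) = P(s >= gamma | non-target)."""
    return 1.0 - g.cdf(gamma)

def eer_by_crossing(lo=-20.0, hi=20.0, iters=100):
    """Bisect for the threshold where the rising F meets the falling G."""
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if miss(mid) < false_alarm(mid):
            lo = mid
        else:
            hi = mid
    gamma = 0.5 * (lo + hi)
    return gamma, 0.5 * (miss(gamma) + false_alarm(gamma))

def min_expected_error(pi):
    """E*(pi), using the closed-form Gaussian Bayes threshold."""
    if pi in (0.0, 1.0):
        return 0.0
    gamma = sigma**2 * math.log((1 - pi) / pi) / (mu1 - mu2) + (mu1 + mu2) / 2
    return pi * miss(gamma) + (1 - pi) * false_alarm(gamma)

gamma_eer, eer = eer_by_crossing()
priors = [i / 1000 for i in range(1001)]
optimal_errors = [min_expected_error(p) for p in priors]
eer_minimax = max(optimal_errors)

print("gamma_eer:", gamma_eer, "EER (crossing):", eer, "EER (max-min):", eer_minimax)
```

For equal-variance Gaussians the crossing sits midway between the two means, which makes this configuration convenient for a sanity check; other score distributions satisfying the assumptions would need a numerical inner minimization instead of the closed-form threshold.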

⁵ Although the min max form seems to show that maximum is always found at π = 0,

Generalization of EER

In fact, we can interpret any point on the ROC via a generalization of (7.8). Letting α, β > 0, a similar derivation shows:

$$C_{\alpha,\beta} = \max_{\pi} \min_{\gamma} \left[\pi \alpha F(\gamma) + (1 - \pi) \beta G(\gamma)\right] = \alpha F(\gamma_{\alpha,\beta}) = \beta G(\gamma_{\alpha,\beta}) \tag{7.9}$$

so that (C_{α,β}/β, C_{α,β}/α) is a point on the ROC. By varying the error-rate ratio, α/β, from 0 to ∞, we can plot out the whole ROC. In this sense, any point on the ROC can be seen as an upper bound on the weighted cost of errors, provided the optimal threshold is used.
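A numerical check of (7.9) for one choice of weights is sketched below. The Gaussian parameters, the weights α = 1 and β = 10, and all helper names are assumptions made for illustration. The sketch solves αF(γ) = βG(γ) by bisection, confirms that (C/β, C/α) lies on the ROC of (7.7), and compares C with a grid approximation of the max-min expression.

```python
import math
from statistics import NormalDist

# Illustrative parameters (assumptions, not from the text).
mu1, mu2, sigma = 1.0, -1.0, 1.5
f, g = NormalDist(mu1, sigma), NormalDist(mu2, sigma)

def miss(gamma):
    return f.cdf(gamma)

def false_alarm(gamma):
    return 1.0 - g.cdf(gamma)

def weighted_crossing(alpha, beta, lo=-20.0, hi=20.0, iters=100):
    """Bisect for the threshold where alpha * F(gamma) = beta * G(gamma)."""
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if alpha * miss(mid) < beta * false_alarm(mid):
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

def min_weighted_cost(pi, alpha, beta):
    """Inner minimum over gamma, via the closed-form Gaussian threshold
    at likelihood-ratio (1 - pi) * beta / (pi * alpha)."""
    log_y = math.log((1 - pi) * beta / (pi * alpha))
    gamma = sigma**2 * log_y / (mu1 - mu2) + (mu1 + mu2) / 2
    return pi * alpha * miss(gamma) + (1 - pi) * beta * false_alarm(gamma)

alpha, beta = 1.0, 10.0
gamma_ab = weighted_crossing(alpha, beta)
cost = alpha * miss(gamma_ab)
roc_point = (cost / beta, cost / alpha)   # (P_fa, P_miss)

# Grid approximation of the outer maximum over the prior.
priors = [i / 2000 for i in range(1, 2000)]
cost_maximin = max(min_weighted_cost(p, alpha, beta) for p in priors)

# The point should satisfy P_miss = F(G^{-1}(P_fa)).
miss_on_roc = f.cdf(g.inv_cdf(1.0 - roc_point[0]))

print(roc_point, cost, cost_maximin, miss_on_roc)
```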

AUC

The area under the ROC, often called AUC, is the probability that a randomly picked non-target score will exceed a randomly picked target score [62]. This can be seen by integrating the joint probability, f(s̃)g(s), of a target score s̃ and a non-target score s over the region where s > s̃:

$$P(s > \tilde{s}) = \int_{-\infty}^{\infty} g(s) \int_{-\infty}^{s} f(\tilde{s})\, d\tilde{s}\, ds = \int_{-\infty}^{\infty} g(s) F(s)\, ds = \int_{0}^{1} F\left(G^{-1}(p)\right) dp = \text{AUC} \tag{7.10}$$

where we used the change of variables p = G(s), dp = −g(s) ds. Although this is a useful figure of merit for applications where scores are sorted to prioritize targets, rather than comparing each score to a fixed threshold, AUC has apparently not had much exposure in the speaker recognition literature.

For practical computation of AUC, we propose using the ROCCH variant of the empirical ROC, which we shall discuss in section 7.3.6 below. But so far, we have not used this metric in practice.
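Separately from the ROCCH-based empirical computation mentioned above, the identity (7.10) can be checked for idealized scores. The sketch below assumes equal-variance Gaussian scores with illustrative parameters that are not from the text. It integrates F(G⁻¹(p)) over [0, 1] with a midpoint rule and compares the result with the probability that a non-target score exceeds an independent target score, which for Gaussians follows from the difference of the two scores being normal with mean µ2 − µ1 and variance 2σ².

```python
import math
from statistics import NormalDist

# Illustrative parameters (assumptions, not from the text).
mu1, mu2, sigma = 1.0, -1.0, 1.5
f, g = NormalDist(mu1, sigma), NormalDist(mu2, sigma)

def roc_miss_rate(p_fa):
    """P_miss = F(G^{-1}(P_fa)), as in (7.7)."""
    return f.cdf(g.inv_cdf(1.0 - p_fa))

def auc_midpoint(n=20000):
    """Midpoint-rule approximation of the integral of F(G^{-1}(p)) over [0, 1]."""
    width = 1.0 / n
    return width * sum(roc_miss_rate((k + 0.5) * width) for k in range(n))

# P(non-target score > target score): the difference of the two independent
# scores is normal with mean mu2 - mu1 and standard deviation sigma * sqrt(2).
auc_closed_form = NormalDist().cdf((mu2 - mu1) / (sigma * math.sqrt(2)))
auc_numeric = auc_midpoint()

print("numeric:", auc_numeric, "closed form:", auc_closed_form)
```

With this convention (miss-rate plotted against false-alarm-rate), a smaller AUC indicates a better detector, in contrast to the more common hit-rate convention.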

In document Measuring, refining and calibrating speaker and language information extracted from speech (Page 86-90)