
7.3 Experiment Setup

7.4.1 Results for the I4U EVAL set

In this subsection we pool the scores from all duration and SNR conditions and present the recognition results for all trials in the EVAL set in Table 7.2. Results are given for mismatched, matched and pooled-scores calibration, and for the four proposed QMFs: Qd, Qn, Q1 and Q2. We use measured SNRs in the QMFs, even though for this data we happen to know the applied SNR. Comparing the first three approaches, matched calibration gives the best performance. This is expected because matched calibration uses 30 pairs of w0 and w1 calibration parameters.

On the other hand, the mismatched approach results in Cllr > 1, which shows the effect of duration and noise on the calibration performance of the speaker recognition system: even though the system can partially discriminate target from non-target scores (E= < 50 %), Bayes decisions based on the log-likelihood ratio would on average lead to worse error rates than decisions based on the prior alone (Cllr > 1).
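The relation between Cllr > 1 and intact discrimination can be illustrated with a small sketch. It assumes the standard definition of Cllr (in bits, for natural-log likelihood ratios); the Gaussian scores and the affine distortion below are synthetic and are not taken from the EVAL set.

```python
import math
import random

def softplus(x):
    """Numerically stable log(1 + exp(x))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))

def cllr(tar_llrs, non_llrs):
    """C_llr in bits, for natural-log likelihood ratios."""
    c_tar = sum(softplus(-x) for x in tar_llrs) / len(tar_llrs)
    c_non = sum(softplus(x) for x in non_llrs) / len(non_llrs)
    return 0.5 * (c_tar + c_non) / math.log(2.0)

# Synthetic LLRs: target ~ N(+m, 2m), non-target ~ N(-m, 2m).
# Gaussian LLRs with this mean/variance relation are well calibrated.
random.seed(0)
m = 2.0
sd = math.sqrt(2.0 * m)
tar = [random.gauss(+m, sd) for _ in range(5000)]
non = [random.gauss(-m, sd) for _ in range(5000)]

# A system that always outputs LLR = 0 relies on the prior alone.
uninformative = cllr([0.0], [0.0])
calibrated = cllr(tar, non)
# Arbitrary affine distortion with positive slope: the ranking of trials,
# and hence the discrimination, is unchanged, but the LLRs are miscalibrated.
miscalibrated = cllr([4.0 * x + 6.0 for x in tar],
                     [4.0 * x + 6.0 for x in non])

print(f"uninformative: {uninformative:.3f}")
print(f"calibrated:    {calibrated:.3f}")
print(f"miscalibrated: {miscalibrated:.3f}")
```

The distorted scores keep the same equal error rate as the calibrated ones, yet their Cllr exceeds that of the uninformative system, which is the situation described for the mismatched approach.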

All four proposed QMFs in Table 7.2 improve E=, Cllrmin, Cmc and Cllr compared to the pooled-scores and mismatched approaches. The biggest improvement is seen in Cmc, where Q1 and Q2 give over 25% relative improvement compared to pooled-scores calibration. The close performance of Q1 and Q2 suggests that the simple linear model Qd + Qn is sufficient for modeling the duration and SNR dependencies in our experiments. Even though only 1–3 extra calibration parameters need to be trained, the proposed QMFs give better discrimination and calibration performance. Moreover, unlike the matched calibration approach, which requires far more parameters to be trained, the QMF approaches can be applied within conventional linear calibration, with the possibility of interpolation and extrapolation. This is shown in our previous study of QMF calibration [105], where this calibration approach offers superior performance to other approaches when dealing with unseen data conditions, e.g., interpolated and extrapolated conditions in calibration.
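The structural difference between matched calibration and QMF calibration can be sketched as follows. This is only an illustration: the functional forms used for Qd and Qn (a weighted log-duration and a weighted SNR), the parameter values, the table entries and all function names are hypothetical placeholders and are not the definitions used in this thesis.

```python
import math

# Matched calibration: one (w0, w1) pair per (duration, SNR) condition.
# Hypothetical two-entry excerpt; keys are (duration in s, SNR in dB).
MATCHED_TABLE = {
    (10.0, 5.0): (-1.2, 0.08),
    (40.0, 15.0): (-0.4, 0.11),
}

def matched_calibrate(score, duration, snr):
    """Look up the condition-specific parameters; fails for unseen conditions."""
    w0, w1 = MATCHED_TABLE[(duration, snr)]  # KeyError if condition is unseen
    return w0 + w1 * score

def qmf_calibrate(score, duration, snr, w):
    """Conventional linear calibration plus additive quality terms."""
    w0, w1, w2, w3 = w
    q_d = w2 * math.log(duration)  # assumed form for Qd (placeholder)
    q_n = w3 * snr                 # assumed form for Qn (placeholder)
    return w0 + w1 * score + q_d + q_n

# Hypothetical parameter values: two conventional ones plus two extra ones.
W = (-2.0, 0.1, 0.3, 0.02)

# A condition that is absent from the matched table can still be calibrated,
# because the quality terms are continuous functions of duration and SNR.
llr_unseen = qmf_calibrate(20.0, duration=25.0, snr=10.0, w=W)
print(f"QMF-calibrated LLR at an unseen condition: {llr_unseen:.3f}")
```

With the extra weights set to zero the model reduces to conventional linear calibration, which is why only a few additional parameters are needed on top of w0 and w1.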

As mentioned in Section 7.2, calibration parameters were optimized at two different operating points for the effective prior, πdet and πprimary. When the calibration is optimized at πdet, the system performance in terms of Cprimary gets worse. As

[Figure 7.3 plot, titled "Matched calibrated EVAL scores at πdet and πprimary". Horizontal axis: logit π, from −12 to 10. Vertical axis: Cnorm, from 0.2 to 1.2. Curves: Cnorm and minimum Cnorm, each for calibration optimized at πdet and at πprimary. Vertical lines mark the πdet, πprimary, π1 and π2 operating points.]

Figure 7.3: Normalized Bayes error rate plot for the EVAL set when matched calibration is applied and optimized at both πdet and πprimary.

values when the system is optimized at πprimary compared to when it is optimized at the πdet operating point. This happens because Cprimary is associated with a much lower effective prior than Cdet, so the recognition system needs to operate in a region of very low false alarm rates. It is therefore reasonable to train the calibration parameters at πprimary when we want to evaluate system performance using the Cprimary measure.

This effect can be studied in a normalized Bayes error rate (NBE) plot in Figure 7.3 [16]. The normalized Bayes error rate plot is a convenient way of visualizing the calibration performance of a speaker verification system over a representative range of operating points [14]. It generalizes the familiar NIST decision cost function to cover all values of Cmiss, CFA and Ptar in a single plot. In Figure 7.3, the vertical axis shows Cnorm from (7.12) as a function of the logit of the effective prior π, where logit p = log(p / (1 − p)). Vertical lines denote the operating points at which Cnorm is evaluated in Cdet or Cprimary. One can observe how, for different configurations of the calibration, the total normalized detection cost varies with the prior. In Figure 7.3, the curves differ only because of the difference in calibration. At the π1 and π2 operating points, Cnorm is lower when calibration is optimized at πprimary than when it is optimized at πdet. Because Cprimary is just the average of Cnorm at these two operating points, training the calibration parameters using πprimary as the prior results in a lower Cprimary.
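The quantities in this plot can be sketched in a few lines. The sketch assumes that (7.12) has the standard form of the normalized Bayes error rate, i.e. the Bayes error at the threshold −logit π divided by the error of the best prior-only decision, min(π, 1 − π). The toy LLRs and the two prior values used for Cprimary are illustrative assumptions, not values from our experiments.

```python
import math

def logit(p):
    """logit p = log(p / (1 - p))."""
    return math.log(p / (1.0 - p))

def effective_prior(p_tar, c_miss=1.0, c_fa=1.0):
    """Fold Cmiss, CFA and Ptar into a single effective prior."""
    return p_tar * c_miss / (p_tar * c_miss + (1.0 - p_tar) * c_fa)

def c_norm(tar_llrs, non_llrs, prior):
    """Normalized Bayes error rate at one effective prior (assumed form)."""
    threshold = -logit(prior)  # Bayes decision threshold on the LLR
    p_miss = sum(x < threshold for x in tar_llrs) / len(tar_llrs)
    p_fa = sum(x >= threshold for x in non_llrs) / len(non_llrs)
    bayes_error = prior * p_miss + (1.0 - prior) * p_fa
    return bayes_error / min(prior, 1.0 - prior)

def c_primary(tar_llrs, non_llrs, prior_1, prior_2):
    """Average of Cnorm at the two operating points of Cprimary."""
    return 0.5 * (c_norm(tar_llrs, non_llrs, prior_1)
                  + c_norm(tar_llrs, non_llrs, prior_2))

# Toy LLRs (illustrative only).
tar = [3.0, 5.0, 8.0, -1.0]
non = [-6.0, -4.0, -9.0, 2.0]

# One point of the curve per value of the effective prior.
for prior in (0.5, 0.01, 0.001):
    print(f"logit pi = {logit(prior):7.3f}  "
          f"Cnorm = {c_norm(tar, non, prior):.3f}")

# Illustrative operating points for pi_1 and pi_2.
print(f"Cprimary = {c_primary(tar, non, 0.01, 0.001):.3f}")
```

A system that rejects every trial obtains Cnorm = 1 for any π < 0.5, so values above 1 in the plot indicate decisions that are worse than those based on the prior alone.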

It is interesting to compare the results at specific operating points, Cdet and Cprimary, with results that integrate over all points (Cllr and derivatives). The QMFs perform better than pooled scores for the overall metrics, but show a small degradation for specific operating points at low false alarm rates.

In document Speaker Recognition System in Forensic Conditions: The Calibration and Evaluation of the Likelihood Ratio (Page 121-123)