7.3 Experiment Setup
7.4.1 Results for the I4U EVAL set
In this subsection we pool all of the scores from the different duration and SNR conditions and present the recognition results for all trials in the EVAL set in Table 7.2. The results are provided for the mismatched, matched and pooled-score calibrations and for the four proposed QMFs: Qd, Qn, Q1 and Q2. We use measured SNRs in the QMFs, even though for this data we happen to know the applied SNR. Comparing the first three approaches, matched calibration gives the best performance. This is expected because the matched calibration uses 30 pairs of w0 and w1 calibration parameters. On the other hand, the mismatched approach results in Cllr > 1, showing the impact of duration and noise on the calibration performance of speaker recognition: even though the system can partially discriminate target from non-target scores (E= < 50 %), Bayes decisions based on the log-likelihood ratio would on average lead to worse error rates than decisions based on the prior alone (Cllr > 1).
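The meaning of Cllr > 1 can be illustrated with a minimal sketch. It assumes the commonly used definition of Cllr (in bits, for natural-log likelihood ratios); the function name `cllr` and the toy scores are invented for illustration and are not taken from the experiments. A constant offset leaves the ranking of the scores, and therefore the discrimination, unchanged, yet it can push Cllr above the value of 1 bit obtained by a system that always outputs a log-likelihood ratio of 0.

```python
import math

def cllr(target_llrs, nontarget_llrs):
    """Log-likelihood-ratio cost in bits, for natural-log LLR scores."""
    def softplus(x):
        # Numerically stable log(1 + exp(x)).
        return max(x, 0.0) + math.log1p(math.exp(-abs(x)))
    c_tar = sum(softplus(-s) for s in target_llrs) / len(target_llrs)
    c_non = sum(softplus(s) for s in nontarget_llrs) / len(nontarget_llrs)
    return 0.5 * (c_tar + c_non) / math.log(2)

# Toy scores, invented for illustration only.
targets = [3.0, 2.0, 1.0, -0.5]
nontargets = [-3.0, -2.0, -1.0, 0.5]

cllr_original = cllr(targets, nontargets)

# The same scores with a constant offset: ranking is unchanged,
# but the scores no longer behave like calibrated LLRs.
offset = 6.0
cllr_shifted = cllr([s + offset for s in targets],
                    [s + offset for s in nontargets])

# Reference system that ignores the speech and always outputs LLR = 0.
cllr_reference = cllr([0.0] * 4, [0.0] * 4)

print(cllr_original, cllr_shifted, cllr_reference)
```

In this toy case the shifted scores cost more than the uninformative reference, which is the situation described for the mismatched approach.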
All four proposed QMFs in Table 7.2 improve E=, Cllrmin, Cmc and Cllr compared to the pooled scores and the mismatched approach. The biggest improvement is seen in Cmc, where the Q1 and Q2 approaches give over 25% relative improvement compared to the pooled-score calibration. The close performance of Q1 and Q2 suggests that the simple linear model Qd + Qn is sufficient for modeling the duration and SNR dependencies in our experiments. Even though only 1–3 extra calibration parameters need to be trained, the proposed QMFs give better discrimination and calibration performance. Moreover, unlike the matched calibration approach, which has far more parameters to train, QMF approaches can be applied within conventional linear calibration, with the possibility of interpolation and extrapolation. This is shown in our previous study of QMF calibration [105], where this calibration approach offers superior performance to other approaches when dealing with unseen data conditions, e.g., interpolated and extrapolated conditions in calibration.
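The difference between a per-condition table of (w0, w1) pairs and a QMF can be sketched as follows. This is only an illustration under stated assumptions: the functional form (a log-duration term plus a linear SNR term), the weights, and the names `qmf_calibrate` and `matched_table` are hypothetical and are not the definitions of Qd, Qn, Q1 or Q2 used in this chapter.

```python
import math

def qmf_calibrate(score, duration_s, snr_db, weights):
    """Linear calibration with a hypothetical additive quality term.

    weights = (w0, w1, w2, w3); w2 and w3 are the extra QMF parameters.
    The log-duration and linear-SNR terms are illustrative assumptions.
    """
    w0, w1, w2, w3 = weights
    return w0 + w1 * score + w2 * math.log(duration_s) + w3 * snr_db

# Hypothetical matched calibration: one (w0, w1) pair per trained condition.
matched_table = {
    (10, 15): (-1.0, 0.9),
    (40, 15): (-0.3, 1.0),
}

# Hypothetical QMF weights shared by all conditions.
weights = (-2.0, 1.0, 0.5, 0.05)

# A condition that was never seen in training (20 s, 12 dB).
unseen = (20, 12)
has_matched_parameters = unseen in matched_table

llr_10s = qmf_calibrate(4.0, 10, 12, weights)
llr_20s = qmf_calibrate(4.0, 20, 12, weights)
llr_40s = qmf_calibrate(4.0, 40, 12, weights)
print(has_matched_parameters, llr_10s, llr_20s, llr_40s)
```

The table has no entry for the unseen condition, whereas the parametric form still returns a value that lies between those of the neighbouring durations. This is the interpolation property referred to above.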
As mentioned in Section 7.2, calibration parameters were optimized at two different operating points for the effective prior, πdet and πprimary. When the calibration is optimized at πdet, the system performance in terms of Cprimary gets worse: Cprimary values are lower when the system is optimized at the πprimary operating point than when it is optimized at the πdet operating point. This happens because Cprimary is associated with a much lower effective prior than Cdet, and therefore the recognizer needs to operate in a very low false alarm region. It is therefore reasonable to train the calibration parameters at πprimary when we want to evaluate the system performance using the Cprimary measure.

[Plot: "Matched calibrated EVAL scores at πdet and πprimary". Horizontal axis: logit π, from −12 to 10. Vertical axis: Cnorm, from 0.2 to 1.2. Legend: πdet operating point; minimum Cnorm optimized at πdet; Cnorm optimized at πdet; πprimary operating point; minimum Cnorm optimized at πprimary; Cnorm optimized at πprimary; π1 operating point; π2 operating point.]
Figure 7.3: Normalized Bayes error rate plot for the EVAL set when matched calibration is applied and optimized at both πdet and πprimary.
This effect can be studied in the normalized Bayes error rate (NBE) plot in Figure 7.3 [16]. The normalized Bayes error-rate plot is a convenient way of visualizing the calibration performance of a speaker verification system over a representative range of operating points [14]. It generalizes the familiar NIST decision cost function to cover all values of Cmiss, CFA and Ptar in a single plot. In Figure 7.3, Cnorm from (7.12) is plotted along the vertical axis as a function of the logit of the effective prior π, where logit p = log(p/(1 − p)). Vertical lines denote the operating points at which Cnorm is evaluated in Cdet or Cprimary. One can observe how, for different configurations of the calibration, the total normalized detection cost varies with the prior. In Figure 7.3, the curves differ only because of the difference in calibration. At the π1 and π2 operating points, Cnorm is lower when calibration is optimized at πprimary than when it is optimized at πdet. Because Cprimary is just the average of Cnorm at its two associated operating points, training the calibration parameters using πprimary as the prior results in a lower Cprimary.
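The quantities discussed above can be sketched in a few lines. The sketch assumes the common definition of the normalized Bayes error rate, with Bayes decisions made by thresholding log-likelihood ratios at −logit π; equation (7.12) is not reproduced here and may differ in detail. The two priors standing in for π1 and π2, the toy scores and the function names are hypothetical.

```python
import math

def logit(p):
    return math.log(p / (1.0 - p))

def effective_prior(p_tar, c_miss, c_fa):
    """Fold P_tar, C_miss and C_FA into a single effective prior."""
    return p_tar * c_miss / (p_tar * c_miss + (1.0 - p_tar) * c_fa)

def c_norm(target_llrs, nontarget_llrs, prior):
    """Normalized Bayes error rate of LLR scores at one effective prior."""
    threshold = -logit(prior)  # Bayes threshold for log-likelihood ratios
    p_miss = sum(s < threshold for s in target_llrs) / len(target_llrs)
    p_fa = sum(s >= threshold for s in nontarget_llrs) / len(nontarget_llrs)
    return (prior * p_miss + (1.0 - prior) * p_fa) / min(prior, 1.0 - prior)

def mean_c_norm(target_llrs, nontarget_llrs, priors):
    """Average of Cnorm over several operating points."""
    costs = [c_norm(target_llrs, nontarget_llrs, p) for p in priors]
    return sum(costs) / len(costs)

# Toy scores and priors, invented for illustration only.
targets = [6.0, 5.0, 3.0, 1.0]
nontargets = [-6.0, -4.0, -2.0, 0.5]
priors = [0.01, 0.001]  # hypothetical stand-ins for pi_1 and pi_2

cost_a = mean_c_norm(targets, nontargets, priors)

# Same scores with a different calibration offset: the ranking is
# identical, only the position relative to the thresholds changes.
offset = 2.0
cost_b = mean_c_norm([s + offset for s in targets],
                     [s + offset for s in nontargets], priors)
print(cost_a, cost_b)
```

The two costs differ although the discrimination of the scores is the same, which mirrors the observation that the curves in Figure 7.3 differ only in calibration.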
It is interesting to compare the results at specific operating points, Cdet and Cprimary, with results that integrate over all operating points (Cllr and its derivatives). The QMFs perform better than pooled scores on the overall metrics, but show a small degradation at specific operating points with low false alarm rates.