Speaker Recognition Configuration

7.3 Experiment Setup

7.3.1 Speaker Recognition Configuration

For the experiments in this study we used the Radboud University Nijmegen speaker recognition system developed for NIST SRE 2012 [155]. In the front-end of our system, the speech signals are first enhanced using the maximum likelihood estimate of the short-time spectral amplitude (ML-STSA) [111], in which the noise is estimated on-line using the improved minima controlled recursive averaging (IMCRA) proposed in [30]. Then, 19 base MFCCs are extracted from 20 ms windowed speech frames every 10 ms, appended with log energy, and augmented with delta and delta-delta coefficients, resulting in 60-dimensional feature vectors [105]. We used stabilized weighted linear prediction as a noise-robust method for spectrum estimation [156]. Speech activity detection (SAD) [114] is employed to discard non-speech frames, and feature warping [136] is applied to the final features. In the experiments including duration truncation, feature warping is applied after truncation.
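The dimensionality arithmetic above (19 MFCCs plus log energy gives 20 static coefficients, tripled by delta and delta-delta to 60) can be illustrated with a minimal sketch. The function name `add_deltas` is hypothetical, and the two-frame central difference with edge replication is an assumption, since the delta window is not specified here.

```python
def add_deltas(frames):
    """Append delta and delta-delta coefficients to every frame.

    frames: list of equal-length lists, one static feature vector per frame.
    Assumption: a simple central difference with edge replication; the
    actual delta window used by the system is not given in the text.
    """
    def central_diff(seq):
        n = len(seq)
        out = []
        for t in range(n):
            prev = seq[max(t - 1, 0)]        # replicate the first frame at the start
            nxt = seq[min(t + 1, n - 1)]     # replicate the last frame at the end
            out.append([(b - a) / 2.0 for a, b in zip(prev, nxt)])
        return out

    delta = central_diff(frames)
    delta2 = central_diff(delta)
    # Concatenate static, delta and delta-delta coefficients per frame.
    return [s + d + dd for s, d, dd in zip(frames, delta, delta2)]


# 19 MFCCs + log energy = 20 static coefficients per frame (dummy values).
static = [[float(t + k) for k in range(20)] for t in range(5)]
features = add_deltas(static)
print(len(features), len(features[0]))  # 5 frames, 60 coefficients each
```

Speech enhancement, SAD and feature warping are left out of this sketch; it only shows how the 60-dimensional vector is assembled.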

A gender-dependent UBM with 2048 components is trained using a subset of NIST SRE 2004–2006, Switchboard cellular phase 1 and 2, and Fisher English corpora. We train our 400-dimensional i-vector extractor [38] with the same data as for the UBM. In post-processing of utterance-level i-vectors, we used linear discriminant analysis (LDA) projection to enhance the separability of classes (speakers) and reduce the i-vector dimension to 200. Prior to probabilistic linear discriminant analysis (PLDA) modeling, we remove the mean, perform whitening using within-class covariance normalization (WCCN) and normalize the length of the i-vectors. The enrollment speech is neither truncated nor contaminated with noise, and the enrollment i-vectors are simply averaged over the multiple i-vectors per speaker. Truncated and noisy versions of all i-vectors available for enrollment of speakers in our I4U-DEV set are used for training LDA and PLDA.
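As a rough illustration of the last post-processing steps, the sketch below removes a mean, normalizes the length of each i-vector and averages the enrollment i-vectors of one speaker. The LDA projection and WCCN whitening are omitted, the function names and toy vectors are hypothetical, and averaging after length normalization is an assumption about the order of operations.

```python
import math


def center_and_length_normalize(ivectors, mean):
    """Subtract `mean` from each i-vector and scale it to unit Euclidean length."""
    out = []
    for v in ivectors:
        centered = [x - m for x, m in zip(v, mean)]
        norm = math.sqrt(sum(x * x for x in centered))
        out.append([x / norm for x in centered])
    return out


def average(ivectors):
    """Average several i-vectors into one speaker model, dimension by dimension."""
    n = len(ivectors)
    return [sum(column) / n for column in zip(*ivectors)]


# Toy 3-dimensional example (real i-vectors here have 200 dimensions after LDA).
dev_mean = [0.5, 0.5, 0.0]                       # hypothetical development-set mean
enrollment = [[1.5, 0.5, 0.0], [0.5, 2.5, 0.0]]  # two sessions of one speaker
normalized = center_and_length_normalize(enrollment, dev_mean)
speaker_model = average(normalized)
print(speaker_model)
```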

7.3.2 Performance Measures

Four performance measures are used to evaluate the speaker recognition system in this paper. The first measure is the equal error rate ($E_{=}$), the point on the Receiver Operating Characteristic (ROC) where the probabilities of miss and false alarm, $P_{\mathrm{miss}}$ and $P_{\mathrm{FA}}$, are equal. It is a measure of the discrimination performance of the recognition system. The second measure is the cost of LLR ($C_{\mathrm{llr}}$) [17], which can be formulated as:

$$C_{\mathrm{llr}} = \frac{1}{2N_{\mathrm{tar}}} \sum_{i=1}^{N_{\mathrm{tar}}} \log_2\bigl(1 + \exp(-x_i)\bigr) + \frac{1}{2N_{\mathrm{non}}} \sum_{j=1}^{N_{\mathrm{non}}} \log_2\bigl(1 + \exp(x_j)\bigr), \tag{7.9}$$

with $N_{\mathrm{tar}}$ and $N_{\mathrm{non}}$ as the numbers of target and non-target trials, and $x_i$ and $x_j$ as target and non-target scores, respectively. $C_{\mathrm{llr}}$ measures both calibration and discrimination at all operating points along the ROC. It is a proper scoring rule [34], and is similar to a cross-entropy measure. The normalization makes the quantity interpretable as the average amount of information, expressed in bits, that the system was unable to extract from a speaker comparison trial. The metric $C_{\mathrm{llr}}$ can further be separated into two terms, the minimum $C_{\mathrm{llr}}$ ($C_{\mathrm{llr}}^{\min}$) and the mis-calibration cost ($C_{\mathrm{mc}}$):

$$C_{\mathrm{llr}} = C_{\mathrm{llr}}^{\min} + C_{\mathrm{mc}}. \tag{7.10}$$
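Equation (7.9) can be evaluated directly from two lists of log-likelihood-ratio scores. The following is a minimal sketch with hypothetical function names; it uses a numerically stable form of log(1 + exp(x)) so that large scores do not overflow.

```python
import math


def softplus(x):
    """Numerically stable log(1 + exp(x)), in nats."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def cllr(tar_llrs, non_llrs):
    """Cost of LLR as in (7.9), in bits.

    tar_llrs: scores x_i of target trials; non_llrs: scores x_j of non-target trials.
    """
    tar_term = sum(softplus(-x) for x in tar_llrs) / len(tar_llrs)
    non_term = sum(softplus(x) for x in non_llrs) / len(non_llrs)
    # Divide by ln 2 to convert nats to bits; the factor 1/2 averages both classes.
    return (tar_term + non_term) / (2.0 * math.log(2.0))


# A system that always outputs LLR = 0 extracts no information: about 1 bit of cost.
print(cllr([0.0, 0.0], [0.0, 0.0]))
# Well-separated, well-calibrated scores give a cost close to 0.
print(cllr([6.0, 8.0], [-7.0, -5.0]))
```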

The value $C_{\mathrm{llr}}^{\min}$ is obtained after re-calibration of the test data by shifting each score to minimize $C_{\mathrm{llr}}$ while maintaining the original order of the scores. It is equivalent to determining the minimal detection cost for all possible cost functions [17, 179]. Hence $C_{\mathrm{llr}}^{\min}$ is another measure of discrimination capability, while $C_{\mathrm{mc}} = C_{\mathrm{llr}} - C_{\mathrm{llr}}^{\min}$ shows the calibration ability of a speaker recognition system integrated over all operating points.
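One common way to realize an order-preserving re-calibration of this kind is the pool-adjacent-violators (PAV) algorithm. The text does not name the algorithm used, so the sketch below is an assumption, with hypothetical function names. It fits monotone target posteriors to the sorted trials, converts them to LLRs by removing the empirical prior log odds, and evaluates (7.9) on the result.

```python
import math


def softplus(x):
    """Numerically stable log(1 + exp(x))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def cllr(tar_llrs, non_llrs):
    """Cost of LLR in bits, as in (7.9)."""
    tar_term = sum(softplus(-x) for x in tar_llrs) / len(tar_llrs)
    non_term = sum(softplus(x) for x in non_llrs) / len(non_llrs)
    return (tar_term + non_term) / (2.0 * math.log(2.0))


def pav(labels):
    """Pool adjacent violators: non-decreasing least-squares fit to `labels`."""
    blocks = []  # each block is [sum of labels, number of labels]
    for y in labels:
        blocks.append([float(y), 1])
        # Merge while the previous block mean is not below the last block mean
        # (means compared by cross-multiplication to avoid divisions).
        while (len(blocks) > 1 and
               blocks[-2][0] * blocks[-1][1] >= blocks[-1][0] * blocks[-2][1]):
            s, c = blocks.pop()
            blocks[-1][0] += s
            blocks[-1][1] += c
    fitted = []
    for s, c in blocks:
        fitted.extend([s / c] * c)
    return fitted


def cllr_min(tar_scores, non_scores, eps=1e-12):
    """Cllr after an optimal order-preserving re-calibration (PAV sketch).

    Tied scores are ordered non-target first, which is a simplification.
    """
    trials = sorted([(s, 1) for s in tar_scores] + [(s, 0) for s in non_scores])
    labels = [lab for _, lab in trials]
    posteriors = pav(labels)
    prior_log_odds = math.log(len(tar_scores) / len(non_scores))
    tar_llrs, non_llrs = [], []
    for p, lab in zip(posteriors, labels):
        p = min(max(p, eps), 1.0 - eps)  # keep the log odds finite
        llr = math.log(p / (1.0 - p)) - prior_log_odds
        (tar_llrs if lab else non_llrs).append(llr)
    return cllr(tar_llrs, non_llrs)


tar = [1.0, -0.5, 2.0]
non = [-1.0, 0.5, -2.0]
c_min = cllr_min(tar, non)
c_mc = cllr(tar, non) - c_min  # mis-calibration cost, as in (7.10)
print(c_min, c_mc)
```

In this toy example one target and one non-target trial are swapped in rank, so no monotone mapping can separate them and the minimum cost stays above zero.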


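Another performance measure used in this paper is the primary cost ($C_{\mathrm{primary}}$), proposed in the NIST SRE'12 evaluation plan [131] and defined as

$$C_{\mathrm{primary}} = \frac{C_{\mathrm{norm}}(\pi_1) + C_{\mathrm{norm}}(\pi_2)}{2}, \tag{7.11}$$

parameterized by $\pi_1$ and $\pi_2$, where the normalized detection cost $C_{\mathrm{norm}}$ is

$$C_{\mathrm{norm}}(\pi) = C_{\mathrm{det}}(\pi)/C_{\mathrm{default}}. \tag{7.12}$$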

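The traditional detection cost [40] $C_{\mathrm{det}}$ has the cost parameters $P_{\mathrm{tar}}$, the a-priori probability of a target speaker, and $C_{\mathrm{miss}}$ and $C_{\mathrm{FA}}$, the costs for miss and false alarm, respectively. It is a weighted sum of $P_{\mathrm{miss}}$ and $P_{\mathrm{FA}}$, the miss and false alarm probabilities obtained after making a Bayes' minimum expected cost decision,³

$$C_{\mathrm{det}} = P_{\mathrm{tar}} C_{\mathrm{miss}} P_{\mathrm{miss}} + (1 - P_{\mathrm{tar}}) C_{\mathrm{FA}} P_{\mathrm{FA}}. \tag{7.13}$$

The normalizing default cost $C_{\mathrm{default}}$ is associated with a recognizer that bases the decision on the prior only,

$$C_{\mathrm{default}} = \min\bigl\{ C_{\mathrm{miss}} P_{\mathrm{tar}},\; C_{\mathrm{FA}} (1 - P_{\mathrm{tar}}) \bigr\}. \tag{7.14}$$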

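The cost parameters can conveniently be combined into the effective prior odds $\beta^{-1}$,

$$\beta^{-1} = \frac{C_{\mathrm{miss}}}{C_{\mathrm{FA}}} \, \frac{P_{\mathrm{tar}}}{1 - P_{\mathrm{tar}}}, \tag{7.15}$$

which relate to the effective prior $\pi$ introduced in (7.12) through

$$\beta^{-1} = \frac{\pi}{1 - \pi}, \tag{7.16}$$

which reduces (7.12) for $\beta > 1$ to [182]

$$C_{\mathrm{norm}}(\pi) = P_{\mathrm{miss}} + \beta P_{\mathrm{FA}}. \tag{7.17}$$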
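The relations (7.12) to (7.17) can be checked numerically. The sketch below uses hypothetical function names and arbitrary illustrative error rates. It computes the effective prior from the cost parameters and confirms that dividing (7.13) by (7.14) gives the reduced form (7.17) when β > 1.

```python
def effective_prior(p_tar, c_miss=1.0, c_fa=1.0):
    """Effective prior pi obtained from (7.15) and (7.16)."""
    odds = (c_miss / c_fa) * (p_tar / (1.0 - p_tar))  # beta^{-1}, eq. (7.15)
    return odds / (1.0 + odds)                        # solve (7.16) for pi


def c_det(p_miss, p_fa, p_tar, c_miss=1.0, c_fa=1.0):
    """Detection cost, eq. (7.13)."""
    return p_tar * c_miss * p_miss + (1.0 - p_tar) * c_fa * p_fa


def c_default(p_tar, c_miss=1.0, c_fa=1.0):
    """Cost of deciding on the prior alone, eq. (7.14)."""
    return min(c_miss * p_tar, c_fa * (1.0 - p_tar))


def c_norm(p_miss, p_fa, pi):
    """Normalized detection cost in the reduced form (7.17); assumes beta > 1."""
    beta = (1.0 - pi) / pi
    return p_miss + beta * p_fa


# Illustrative error rates (not results from the experiments).
p_miss, p_fa = 0.2, 0.001

# Unit costs with P_tar = 0.01: the effective prior equals P_tar.
pi = effective_prior(0.01)
ratio = c_det(p_miss, p_fa, 0.01) / c_default(0.01)
print(pi, ratio, c_norm(p_miss, p_fa, pi))

# C_miss = 10, C_FA = 1, P_tar = 0.01 gives an effective prior of about 0.0917.
pi_det = effective_prior(0.01, c_miss=10.0, c_fa=1.0)
print(round(pi_det, 4))
```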

In the SRE'12 evaluation, two operating points $\pi_1 = .01$ and $\pi_2 = .001$ are used, resulting from $P_{\mathrm{tar}} = .01$ and $.001$, respectively, and constant $C_{\mathrm{miss}} = C_{\mathrm{FA}} = 1$. We define the average $\pi_{\mathrm{primary}} = \tfrac{1}{2}(\pi_1 + \pi_2) = .0055$, which is the average effective prior of the two NIST SRE 2012 primary operating points used in $C_{\mathrm{primary}}$.
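To make the two operating points concrete, the sketch below computes a primary cost from calibrated LLR scores. It relies on the standard result that, with the costs folded into the effective prior π, the Bayes decision threshold on the LLR is log((1 − π)/π). The function names and toy scores are hypothetical.

```python
import math


def error_rates(tar_llrs, non_llrs, threshold):
    """Miss and false-alarm rates when accepting trials with LLR >= threshold."""
    p_miss = sum(1 for x in tar_llrs if x < threshold) / len(tar_llrs)
    p_fa = sum(1 for x in non_llrs if x >= threshold) / len(non_llrs)
    return p_miss, p_fa


def c_norm_at(tar_llrs, non_llrs, pi):
    """Normalized cost P_miss + beta * P_FA at the Bayes threshold for prior pi."""
    beta = (1.0 - pi) / pi
    threshold = math.log(beta)  # Bayes threshold for effective prior pi
    p_miss, p_fa = error_rates(tar_llrs, non_llrs, threshold)
    return p_miss + beta * p_fa


def c_primary(tar_llrs, non_llrs, pi1=0.01, pi2=0.001):
    """Average of the normalized costs at the two SRE'12 operating points."""
    return 0.5 * (c_norm_at(tar_llrs, non_llrs, pi1) +
                  c_norm_at(tar_llrs, non_llrs, pi2))


# Toy scores: thresholds are log(99), about 4.6, and log(999), about 6.9.
tar = [8.0, 5.0, 3.0, 10.0]
non = [-6.0, -3.0, 2.0, -9.0]
cost = c_primary(tar, non)
pi_primary = 0.5 * (0.01 + 0.001)
print(cost, pi_primary)
```

With these scores one target falls below the first threshold and two below the second, while no non-target is accepted, so the costs at the two points are 0.25 and 0.5.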

Finally, we use the traditional $C_{\mathrm{det}}$ with $\pi_{\mathrm{det}} = .0917$, using $C_{\mathrm{miss}} = 10$, $C_{\mathrm{FA}} = 1$ and $P_{\mathrm{tar}} = 0.01$ from pre-2010 NIST SREs. The calibration parameters in all approaches, $w_0$, $w_1$, $w_d$, $w_n$ and $w_{dn}$, are trained on the DEV set using the Bosaris toolkit [16], which uses a cross-entropy optimization criterion. In training the calibration parameters, we use the effective prior of $\pi_{\mathrm{det}} = .0917$ corresponding to the NIST SRE 2010 cost function. In evaluating the performance using $C_{\mathrm{primary}}$, the respective $\pi_{\mathrm{primary}} = .0055$ is used in training the calibration parameters.

³ The threshold of the log likelihood ratio in a Bayes' minimum expected cost decision lies at
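The role of the effective prior in a cross-entropy calibration criterion can be sketched for the simplest case, a linear mapping s' = w0 + w1·s. This is not the Bosaris toolkit API: the function names, the toy scores and the plain gradient descent with step halving are all assumptions made for illustration, and the quality-dependent parameters wd, wn and wdn are left out.

```python
import math


def softplus(x):
    """Numerically stable log(1 + exp(x))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def sigmoid(x):
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def objective(w0, w1, tar, non, pi):
    """Prior-weighted cross-entropy of the calibrated scores w0 + w1 * s."""
    offset = math.log(pi / (1.0 - pi))  # prior log odds added to the LLR
    t = sum(softplus(-(w0 + w1 * s + offset)) for s in tar) / len(tar)
    n = sum(softplus(w0 + w1 * s + offset) for s in non) / len(non)
    return pi * t + (1.0 - pi) * n


def gradient(w0, w1, tar, non, pi):
    """Gradient of `objective` with respect to (w0, w1)."""
    offset = math.log(pi / (1.0 - pi))
    g0 = g1 = 0.0
    for s in tar:
        d = -pi / len(tar) * sigmoid(-(w0 + w1 * s + offset))
        g0 += d
        g1 += d * s
    for s in non:
        d = (1.0 - pi) / len(non) * sigmoid(w0 + w1 * s + offset)
        g0 += d
        g1 += d * s
    return g0, g1


def fit_linear_calibration(tar, non, pi, iters=500):
    """Gradient descent with step halving, started from the identity mapping."""
    w0, w1 = 0.0, 1.0
    f = objective(w0, w1, tar, non, pi)
    for _ in range(iters):
        g0, g1 = gradient(w0, w1, tar, non, pi)
        step = 1.0
        while step > 1e-10:
            c0, c1 = w0 - step * g0, w1 - step * g1
            f_new = objective(c0, c1, tar, non, pi)
            if f_new < f:  # accept only steps that lower the objective
                w0, w1, f = c0, c1, f_new
                break
            step *= 0.5
        else:
            break  # no improving step found: treat as converged
    return w0, w1


# Toy development scores (overlapping classes), effective prior from the text.
tar = [3.0, 1.5, 4.0, 0.5, 2.5]
non = [-1.0, 0.0, 1.0, -2.0, 2.0]
pi_det = 0.0917
w0, w1 = fit_linear_calibration(tar, non, pi_det)
print(w0, w1, objective(w0, w1, tar, non, pi_det))
```

Changing `pi_det` to .0055 would train the same mapping for the operating region used with the primary cost.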

7.3.3 Variation in duration and signal-to-noise-ratio for NIST

In document Speaker Recognition System in Forensic Conditions: The Calibration and Evaluation of the Likelihood Ratio (Page 117-120)