Amount of development, test and reference data

A final issue for the application of the LR to FVC is the amount of development, test

and reference data needed for robust system testing. The limited amount of previous research in this area has focused on the effects of the number of reference speakers on

samples of spontaneous speech produced by 241 male speakers of Japanese. The samples were extracted from the larger non-forensic Corpus of Spontaneous Japanese

(CSJ) (Maekawa et al. 2000). LTf0 was parameterised using the long-term mean and standard deviation (SD) as well as skew, kurtosis, mode and modal density. The

speakers were divided into two groups and within each group 12 differently sized population samples were created containing between ten and 120 speakers. This allowed

for the computation of two LRs for each comparison for the same population size. Cross-validated (§3.2.2.3) MVKD (§3.2.2.1) LRs were computed using all 241 speakers as

test data, with typicality assessed against the differently sized reference sets.

Ishihara and Kinoshita (2008) found the median SS log10LR (LLR; §3.2.2.4) to be up

to three orders of magnitude greater when using ten reference speakers compared with using all 120. The overall range of LR scores also decreased as the amount of reference

data increased. DS pairs were found to be more sensitive to the size of the reference sample. With ten speakers the median DS LLR value was around -30, although for

certain pairs values extended far beyond -30. In the 120-speaker condition, the DS median was located between -2 and -3. As with the SS results, the overall range of

scores decreased as the size of the sample increased. Ishihara and Kinoshita (2008) also found that equal error rate (EER; §3.2.3.1) generally improved as the number of

speakers increased, although “improvement seems more rapid up to the population size 30” (p. 1943). As a crude form of calibration, the study also included an analysis

of the EER threshold relative to the LLR zero threshold (i.e. neutral evidence; see further §3.2.2.4). When using ten reference speakers the EER threshold was found to

be furthest away from zero, with increasing convergence as the number of speakers increased.
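The relationship between the EER threshold and the LLR zero threshold can be illustrated with a minimal sketch. It assumes SS and DS scores are available as log10 LRs and approximates the EER by searching the observed scores for the threshold at which the miss and false-alarm rates are closest; the function name and the toy scores are hypothetical and are not taken from Ishihara and Kinoshita (2008).

```python
import numpy as np

def equal_error_rate(ss_llrs, ds_llrs):
    """Approximate the EER and its threshold from SS and DS log10 LRs."""
    ss = np.asarray(ss_llrs, dtype=float)
    ds = np.asarray(ds_llrs, dtype=float)
    # Candidate thresholds: every observed score, in ascending order.
    thresholds = np.sort(np.concatenate([ss, ds]))
    # Miss rate: proportion of SS pairs scoring below the threshold.
    miss = np.array([(ss < t).mean() for t in thresholds])
    # False-alarm rate: proportion of DS pairs scoring at or above it.
    false_alarm = np.array([(ds >= t).mean() for t in thresholds])
    # Take the threshold at which the two error rates are closest.
    i = int(np.argmin(np.abs(miss - false_alarm)))
    return (miss[i] + false_alarm[i]) / 2.0, thresholds[i]

# Toy scores (hypothetical): one SS pair and one DS pair are misleading.
eer, threshold = equal_error_rate([1.0, 2.0, 3.0, -0.5],
                                  [-3.0, -2.0, -1.0, 0.5])
print(eer, threshold)
```

The distance of the returned threshold from zero is the quantity treated in the study as a crude indicator of calibration: the closer it lies to zero, the closer the point of equal error is to neutral evidence.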

Ishihara and Kinoshita (2008) conclude that “we do need a large population data in order to produce reliable (LRs)” and that “(LRs) produced using anything smaller

than 30 (reference speakers) (are) highly unreliable” (p. 1944). Although their results provide evidence against the use of small amounts of reference data, there is no explicit

discussion as to why small samples should produce such imprecise LRs. Further, given the intrinsic properties of how MVKD LRs are computed (particularly for variables with

should affect different variables from different regional and social groups in different ways. Further, although the issue of calibration was considered with regard to accept-

reject thresholds, Ishihara and Kinoshita (2008) did not assess the effect of sample size on calibrated LRs (§3.2.4) or the log LR cost function (Cllr; a validity metric which is

logically consistent with the Bayesian approach, see §3.2.3.1).
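For orientation, Cllr is conventionally computed as the average of two penalties: one that grows as SS comparisons yield LRs below one, and one that grows as DS comparisons yield LRs above one. The sketch below follows that standard definition; the function name is illustrative and the inputs are assumed to be log10 LRs.

```python
import numpy as np

def cllr(ss_llrs, ds_llrs):
    """Log LR cost for same-speaker and different-speaker log10 LRs."""
    # Convert log10 LRs back to LRs.
    ss_lr = 10.0 ** np.asarray(ss_llrs, dtype=float)
    ds_lr = 10.0 ** np.asarray(ds_llrs, dtype=float)
    # SS pairs are penalised for small LRs (support for the wrong hypothesis).
    ss_cost = np.mean(np.log2(1.0 + 1.0 / ss_lr))
    # DS pairs are penalised for large LRs.
    ds_cost = np.mean(np.log2(1.0 + ds_lr))
    return 0.5 * (ss_cost + ds_cost)

# A system that always reports neutral evidence (LLR = 0) scores exactly 1.
print(cllr([0.0], [0.0]))
```

Values below one indicate that the system conveys useful information, while values above one indicate that its LRs are, on average, misleading. Because the penalty scales with the magnitude of each LR, very strong contrary-to-fact scores (such as those reported for small reference samples) are penalised heavily, which EER does not capture.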

Whilst Ishihara and Kinoshita (2008) focus on the effects of small numbers of reference

speakers, Rose (2012) investigated an upper limit for reference sample size at which point LR performance becomes asymptotic. Rose (2012) used Monte Carlo simulations

(MCS; see Chapters 9 and 10 in this thesis) to synthesise F1, F2 and F3 midpoint values for AusEng /a:/ for up to 10,000 speakers based on values in Bernard (1970). Using both

the multivariate normal and KD approaches, LRs were computed for real suspect and offender data which were known to have been produced by the same speaker. Typicality

was assessed as a function of the number of reference speakers between five and 60. Output was compared against the true LR, which was defined as the LR computed

using the maximum amount of reference data (in this case 10,000 speakers).
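The logic of this design can be sketched in simplified form. The code below is not Rose's (2012) procedure: it assumes a single formant, a univariate normal LR, and invented population parameters (the values in Hz are hypothetical and are not those of Bernard 1970). It synthesises 10,000 speaker means, treats the LR obtained with all of them as the true LR, and then recomputes the LR for one fixed comparison using repeated random reference samples of different sizes.

```python
import numpy as np

def normal_pdf(x, mean, sd):
    """Density of a normal distribution at x."""
    return np.exp(-0.5 * ((x - mean) / sd) ** 2) / (sd * np.sqrt(2.0 * np.pi))

def log10_lr(offender, suspect_mean, within_sd, ref_means):
    """Univariate normal log10 LR for one suspect-offender comparison."""
    # Similarity: offender value evaluated under the suspect model.
    numerator = normal_pdf(offender, suspect_mean, within_sd)
    # Typicality: offender value evaluated under the reference population,
    # whose spread combines between- and within-speaker variance.
    pop_sd = np.sqrt(ref_means.var(ddof=1) + within_sd ** 2)
    denominator = normal_pdf(offender, ref_means.mean(), pop_sd)
    return np.log10(numerator / denominator)

def simulate(sizes, n_reps=200, seed=0):
    """Return the 'true' LLR and the range of LLRs per reference sample size."""
    rng = np.random.default_rng(seed)
    # Hypothetical parameters in Hz (assumptions for illustration only).
    pop_mean, between_sd, within_sd = 750.0, 60.0, 25.0
    suspect_mean, offender = 700.0, 705.0
    # Synthesise speaker means for a large population.
    population = rng.normal(pop_mean, between_sd, 10_000)
    true_llr = log10_lr(offender, suspect_mean, within_sd, population)
    ranges = {}
    for n in sizes:
        llrs = [
            log10_lr(offender, suspect_mean, within_sd,
                     rng.choice(population, size=n, replace=False))
            for _ in range(n_reps)
        ]
        ranges[n] = max(llrs) - min(llrs)
    return true_llr, ranges

true_llr, ranges = simulate([5, 10, 30, 60])
print(true_llr, ranges)
```

Under these assumptions the numerator is fixed, so all variation in the LR comes from the estimate of typicality, which is the component that depends on the reference sample.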

The results of Rose (2012) are comparable with those of Ishihara and Kinoshita (2008). Based on univariate LR analyses of F1, F2 and F3, SS scores were generally higher in

magnitude than the true value when using small amounts of reference data (fewer than ten speakers). The overall range of LRs was also considerably greater when using small

numbers of reference speakers. However, relatively stable scores were achieved (within

two SDs of the true scores) by the inclusion of 30+ reference speakers. This was the case even for F2, which displayed the greatest sensitivity to sample size. A similar

pattern was found in the multivariate analysis, with the distributions of values skewed towards stronger scores when using small samples. Compared with the univariate

analysis, however, the range of scores was far more sensitive to sample size using MVKD.

However, Rose’s (2012) preliminary study has a number of limitations. The test data

are based on a single suspect and offender comparison. It would be preferable to assess the performance of a large set of test data, where it is known a priori whether samples

came from SS or DS pairs, as a function of the number of reference speakers. In the absence of such data, Rose (2012) was unable to assess how system validity metrics

such as EER and Cllr are affected by sample size. Further, the scores were not calibrated based on coefficients generated from an appropriate set of development data. Therefore,

it was not possible to assess the role of calibration in determining the overall sensitivity of LR output to reference sample size.

A limited amount of work has also considered the issue of sample size for ASR. Van der Vloed et al. (2011) investigated the inbuilt reference population optimisation algorithm

in Batvox[6], which identifies the N closest speakers, based on Kullback-Leibler distances calculated from the MFCC vectors, to the suspect (note that Batvox bases population

selection on the suspect rather than the offender) from a larger database of speakers. LRs were computed for a test set (i.e. mock suspects and offenders) of 16 male

speakers of Swiss-French in Batvox using three population data conditions. The first contained 35 speakers extracted from a 45-speaker subset of the 1995 speakers in

the Swiss-French PolyPhone database (Chollet et al. 1996). The second condition contained 35 reference speakers extracted from the whole database of 1995 speakers,

and the third condition contained 1400 speakers extracted from the 1995 speakers. Tests were conducted using samples of speech transmitted via the Global System for

Mobile Communications (GSM) and the Public Switched Telephone Network (PSTN) (see Bigelow 1997; Kondoz 2004).

For both transmission types, condition two (35 speakers out of 1995) produced the

weakest LRs but, for the GSM condition, the lowest Cllr. Conditions one (35/45

speakers) and three (1400/1995 speakers) performed equally well in terms of Cllr. Van

der Vloed et al. (2011) explain this result in terms of the ratio of the size of the subset

to the total size of the database rather than the absolute size of the reference data (i.e. the two systems generate similar Cllr values because the ratio of speakers used

as population data extracted from the larger database is roughly the same). This is because in condition two the 35 reference speakers will be more like the suspect and

more homogeneous, since they were identified from a much larger sample. However, the choice of absolute sample sizes appears arbitrary and the results do not provide

useful information in addressing how LR output is affected by monotonic increases in sample size. Further, given that these results were computed using Batvox based on CC input and inbuilt algorithms, their transferability to other variables and LR formulae is not clear.

[6] http://www.agnitio-corp.com/products/government/batvox (accessed: 9th

The only study to have investigated sample size beyond the number of reference speakers is Ishihara and Kinoshita (2012). They assessed the effect of the number of

tokens per test speaker on the two components of Cllr (Cllr_min and Cllr_cal). Cllr_min is

the lowest Cllr value achievable when the system is optimally calibrated, while Cllr_cal

is system calibration loss (i.e. the difference between the Cllr and the Cllr_min). Input

data consisted of ten tokens of the Japanese filler expression e- (/e:/) produced by

118 male speakers of Japanese from the CSJ. Sixteen MFCCs were extracted from a 20 ms Hamming window at the temporal midpoint of each token. MVKD LRs based on

non-contemporaneous samples were computed using two, four, six, eight and ten tokens per test speaker. To assess how the inclusion of different tokens affected LR output

the experiment was conducted using consecutive tokens from each sample, and by reversing the order of the tokens.

Ishihara and Kinoshita (2012) found different patterns for the two elements of Cllr.

Cllr_cal increased considerably as the number of tokens per speaker increased. This had

the overall effect of worsening Cllr as sample size increased (from around 0.5 with two

tokens to 2.5 with ten tokens). This pattern was found in both forms of the experiment. However, Cllr_min decreased as the number of tokens increased. The magnitude of this

decrease was around 0.2 (from 0.4 to 0.2). The different patterns found for Cllr_min and

Cllr_cal lead Ishihara and Kinoshita to conclude that “additional data can improve the

quality of LRs, as long as we calibrate the obtained LRs” and that “uncalibrated LRs

can be extremely misleading” (2012: 3). Unfortunately, however, the distributions of calibrated LLRs as a function of the amount of data per test speaker were not provided.
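The decomposition behind these figures can be made explicit. Taking the approximate values reported above at face value (they are rounded summary figures, not recomputed here) and writing $C_{llr}^{cal}$ for calibration loss, the arithmetic is roughly as follows:

```latex
\begin{align*}
C_{llr}^{cal} &= C_{llr} - C_{llr}^{min} \\
\text{two tokens:}\quad C_{llr}^{cal} &\approx 0.5 - 0.4 = 0.1 \\
\text{ten tokens:}\quad C_{llr}^{cal} &\approx 2.5 - 0.2 = 2.3
\end{align*}
```

On these figures, almost all of the deterioration in Cllr with additional tokens is attributable to calibration loss rather than to any loss of discrimination, which is consistent with the authors' emphasis on calibration.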

It seems that no empirical work has yet analysed how much data per reference speaker

is required to generate stable estimations of within-speaker variation for LR testing. Furthermore, no empirical work has considered the effects of the size of the development

and test sets in LR-based testing. Therefore, the experiments in this thesis address these issues.

In document The definition of the relevant population and the collection of data for likelihood ratio-based forensic voice comparison (Page 74-79)