
5.3 Results

5.3.3 Experiment (3): Speaker-specific patterns

Figure 5.9 displays LLRs based on input from the first three formants of /aI/ analysed individually and in combination using DyViS speakers only (20 development/20 test/57 reference). The SS median LLRs based on F1-only and F2-only were within the same order of magnitude (limited support), although numerically the strength of evidence was generally better using F2-only. The ranges of SS LLRs for F1-only and F2-only were also broadly equivalent, with values spread from marginally less than zero to around +1. Although the median SS LLR for F3-only was also located within the zero to +1 range, the absolute numerical value was much closer to +1. Further, the maximum strength of SS evidence for F3-only was +2.72 (moderately strong support), indicating that F3 in some cases outperformed F1 and F2 by up to two orders of magnitude. The strength of SS evidence was, however, greatest when using a combination of all three formants, with LLRs generally one order of magnitude higher compared with any formant individually (moderate support). The proportion of misses also decreased from maximally 15% using F1-only to 5% using all three formants.
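The verbal labels attached to the LLRs in this section can be read as bands on the log10 LR scale. The sketch below is illustrative only: the band edges are an assumption inferred from the values and labels reported here (for example +2.72 as moderately strong and -4.11 as very strong), the name `verbal_support` is hypothetical, and labelling positive LLRs as support for the prosecution is assumed by symmetry with the "support for the defence" wording used for DS comparisons.

```python
# Assumed upper band edges on the absolute log10 LR scale. These are inferred
# from the labels used in this section, not quoted from the source.
BANDS = [
    (1.0, "limited"),
    (2.0, "moderate"),
    (3.0, "moderately strong"),
    (4.0, "strong"),
]


def verbal_support(llr: float) -> str:
    """Return a verbal label for a log10 LR (hypothetical helper)."""
    if llr == 0:
        return "no support for either proposition"
    # Positive values favour the same-speaker side, negative the defence.
    side = "prosecution" if llr > 0 else "defence"
    magnitude = abs(llr)
    for upper, label in BANDS:
        if magnitude < upper:
            return f"{label} support for the {side}"
    return f"very strong support for the {side}"


# Values reported in this section.
print(verbal_support(2.72))
print(verbal_support(-4.11))
print(verbal_support(-3.66))
```

Under these assumed bands, a shift of one order of magnitude in the LLR moves the label by one category, which is how the text relates "limited" to "moderate" support.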

Figure 5.9: Tippett plot of SS and DS LLRs using F1-only (blue), F2-only (red), F3-only (green) and a combination of the three formants (orange) of /aI/ from DyViS
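For readers unfamiliar with Tippett plots, the curves can be sketched as cumulative proportions of SS and DS LLRs over a grid of log10 LR values. The sketch below assumes one common convention (SS curve: proportion of SS LLRs at or below each grid value; DS curve: proportion of DS LLRs at or above it). The source does not state which convention Figure 5.9 uses, and the LLR values and the name `cumulative_proportions` are made up for illustration.

```python
def cumulative_proportions(llrs, grid, at_or_below=True):
    """Proportion of LLRs at or below (or at or above) each grid value."""
    n = len(llrs)
    if at_or_below:
        return [sum(v <= x for v in llrs) / n for x in grid]
    return [sum(v >= x for v in llrs) / n for x in grid]


# Made-up log10 LRs, not the DyViS results.
ss = [-0.2, 0.5, 1.1, 2.7]
ds = [-4.1, -3.7, -0.4, 0.6]
grid = [-5, -3, -1, 0, 1, 3]

ss_curve = cumulative_proportions(ss, grid, at_or_below=True)
ds_curve = cumulative_proportions(ds, grid, at_or_below=False)

for x, s, d in zip(grid, ss_curve, ds_curve):
    print(f"log10 LR = {x:>3}: SS <= x: {s:.2f}   DS >= x: {d:.2f}")
```

Under this convention, the values of the two curves at a log10 LR of zero correspond to the proportions of contrary-to-fact SS and DS comparisons.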

Similar results are revealed in the distributions of DS LLRs. Numerically, the weakest DS LLRs were achieved using F1-only, followed by F2-only. The difference in median values was equivalent to one order of magnitude, from limited (F1-only) to moderate (F2-only) support for the defence. However, unlike the SS comparisons, F3-only input generated generally stronger LLRs than the combination of the three formants. The median DS LLR based on F3-only was -4.11 (very strong support), compared with -3.66 (strong support) using F1∼F3. Further, the range of DS LLRs for F3-only extended to -35.4, compared with -19.5 for F1∼F3. However, F3-only input also generated a higher false hit rate, as well as higher magnitude contrary-to-fact DS LLRs compared with the combination of formants.
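The proportions of misses and false hits referred to above can be sketched as simple counts of contrary-to-fact LLRs. This is a minimal illustration, assuming that a miss is an SS comparison with a log10 LR below zero and a false hit is a DS comparison with a log10 LR above zero. The function names and all values are hypothetical and are not the DyViS results.

```python
def miss_rate(ss_llrs):
    """Proportion of same-speaker comparisons with log10 LR below zero."""
    return sum(llr < 0 for llr in ss_llrs) / len(ss_llrs)


def false_hit_rate(ds_llrs):
    """Proportion of different-speaker comparisons with log10 LR above zero."""
    return sum(llr > 0 for llr in ds_llrs) / len(ds_llrs)


# Made-up log10 LRs for illustration only.
ss = [0.8, 1.4, -0.2, 2.1, 0.5, 1.9, 0.3, 1.1, 0.9, 1.6,
      0.7, 1.2, 2.4, 0.4, 1.0, 1.3, 0.6, 1.8, 2.0, 1.5]
ds = [-4.1, -3.7, 0.6, -12.0, -0.4, -7.5, -1.2, -19.5]

print(f"miss rate: {miss_rate(ss):.1%}")
print(f"false hit rate: {false_hit_rate(ds):.1%}")
```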

Figure 5.10 displays EER and C_llr values for each of the four sets of formant data. Despite achieving somewhat weaker DS LLRs compared with F3-only, the combination of formants produced the best performing system in terms of both EER and C_llr. F1∼F3 outperformed F3-only by 5% in terms of EER and 0.2 in terms of C_llr. The [...] achieving EER values of around 20% and C_llr values of around 0.6. Consistent with patterns in Experiments (1) and (2), the improved performance of the combination of formants over F3-only in terms of the strength of SS LLRs and system validity provides evidence that F1 and F2 do carry speaker-specific information. However, given that F1 and F2 encode so much speech information (i.e. they are carriers of contrast), their value as individual discriminants is relatively minimal. Clearly, in terms of individual formants, F3 dominates with regard to speaker discrimination.
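To make the two validity metrics concrete, the sketch below computes C_llr and an approximate EER from sets of SS and DS log10 LRs. It assumes the standard definition of the log LR cost and a simple threshold sweep over the observed scores for EER. The function names and input values are hypothetical, and the source does not describe how its own figures were computed.

```python
import math


def cllr(ss_llrs, ds_llrs):
    """Log LR cost from log10 LRs, using the standard definition."""
    # SS comparisons are penalised for low LRs, DS comparisons for high LRs.
    ss_pen = sum(math.log2(1 + 10 ** (-v)) for v in ss_llrs) / len(ss_llrs)
    ds_pen = sum(math.log2(1 + 10 ** v) for v in ds_llrs) / len(ds_llrs)
    return 0.5 * (ss_pen + ds_pen)


def eer(ss_llrs, ds_llrs):
    """Approximate equal error rate by sweeping thresholds over the scores."""
    best = None
    for t in sorted(set(ss_llrs) | set(ds_llrs)):
        miss = sum(v < t for v in ss_llrs) / len(ss_llrs)
        false_hit = sum(v >= t for v in ds_llrs) / len(ds_llrs)
        gap = abs(miss - false_hit)
        # Keep the threshold where the two error rates are closest.
        if best is None or gap < best[0]:
            best = (gap, (miss + false_hit) / 2)
    return best[1]


# Made-up log10 LRs for illustration only.
ss = [-1.0, 1.0, 2.0, 3.0]
ds = [-3.0, -2.0, -1.5, 1.5]

print(f"C_llr = {cllr(ss, ds):.3f}")
print(f"EER   = {eer(ss, ds):.1%}")
```

A system that always outputs a log10 LR of zero has a C_llr of exactly 1, which is why values below 1 (such as those of around 0.6 reported above) indicate that the system is providing some useful information.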

[Plot: series F1, F2 and F3; F1-only; F2-only; F3-only. Axes: Log LR Cost (C_llr) and EER (%)]

Figure 5.10: Log LR Cost (C_llr) plotted against EER (%) for different DyViS formant input for /aI/

5.4 Discussion

The results of Experiment (1) revealed a number of effects of using regionally Matched and Mixed BrEng data at both the feature-to-score and score-to-LR stages of system testing using /aI/. Consistent with predictions in §5.1, the effects of using regionally Mixed system data were considerably more severe for /aI/ than for /u:/, owing primarily to the regional variation encoded in /aI/ in BrEng. The distributions of SS LLRs were [...] Matched and Mixed system. However, DS LLRs were weaker by up to four orders of magnitude in the Mixed condition (using F1∼F3). Further, consistent with the results in §4.3.2, validity was consistently worse (by up to 7% EER and 0.15 C_llr) when using the Mixed system compared with the Matched system.

The removal of F1 and then F2 in Experiment (1) generated lower magnitude LLRs and generally worse system validity across both systems. This, along with the results of Experiment (3), suggests that F1 and F2, which are primarily thought to encode phonetic contrast and systematic regional and social variation, are capable of carrying considerable speaker discriminatory information. Further, the removal of F1 and F2 in Experiment (1) reduced the divergence between the Matched and Mixed systems in terms of the distributions of LLRs, such that LLRs were most similar across systems when using F3-only input. These results suggest that there may be a trade-off between the speaker discriminatory potential that lower formants (F1 and F2) provide and the regional sensitivity they introduce into the LR-based analysis. That is, with the removal of F1 and F2, the strength of evidence and overall system performance may be lower, but the effects of regional variation, at least in terms of the magnitudes of the LLRs themselves, may be minimised.

Somewhat different patterns were revealed in terms of the Matched and Mixed validity across the three sets of /aI/ input. The EER for the Mixed system was only marginally higher than that of the Matched system when using all three formants and with the removal of F1. However, the largest difference between the systems in terms of EER was found when using F3-only (c. 7%). Similarly, the smallest difference between the systems in terms of C_llr was found using F1∼F3, followed by F2 and F3. As with EER, the largest C_llr difference between systems was found using F3-only (c. 0.15). This finding runs contrary to the earlier prediction that LR output based on F3 may be most robust to different definitions of the relevant population based on the hypothesis that it encodes more information relating to the individual rather than regional and social information relating to the group (Garvin and Ladefoged 1963).

In Experiment (2), the cubic coefficients of F1 and F2 were both able to correctly assign around 64% of the 320 tokens to the regional group (four regional groups) of the speaker, and both outperformed F3. This suggests, predictably, that F1 and F2 (and in particular the intercept (absolute frequency) and slope elements of the trajectory) are primarily responsible for the differences between the four sets (as shown in Figure 5.2). F3 generated a classification rate of 40.6% which, although worse than F1 and F2, was better than chance (25%). Further, when analysing the individual elements of the trajectory using DA, the intercept generated the highest classification rate compared with coefficients relating to the dynamics of the trajectory. This suggests that F3 does encode some region-specific information primarily in the absolute frequency element of the trajectory. This may be due to intrinsic factors (i.e. an inherent property of F3 itself) such as VQ and vocal setting (see Stevens and French 2012), as well as extrinsic factors (i.e. extraneous) such as correlation with F2 (although no consistent correlations between elements of F2 and F3 were found when this was tested using these data). Formal analysis of these factors was not possible, however, due to the small number of speakers and regional sets available.
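The comparison with chance can be made explicit with a little arithmetic. The sketch below assumes equal-sized regional groups (so chance is 1 in 4) and infers the number of correctly classified F3 tokens from the reported rate. The binomial tail is offered only as a rough guide, since it treats tokens as independent, which tokens from the same speaker are not. The source does not report such a test.

```python
from math import comb


def binomial_tail(successes, trials, p):
    """P(X >= successes) for X ~ Binomial(trials, p)."""
    return sum(
        comb(trials, k) * p**k * (1 - p) ** (trials - k)
        for k in range(successes, trials + 1)
    )


n_tokens = 320   # tokens classified in Experiment (2)
n_groups = 4     # regional groups
chance = 1 / n_groups

# Inferred from the reported 40.6% rate; not stated directly in the source.
f3_correct = round(0.406 * n_tokens)

print(f"chance level: {chance:.0%}")
print(f"F3 correct tokens (inferred): {f3_correct} of {n_tokens}")
print(f"P(at least that many by chance): {binomial_tail(f3_correct, n_tokens, chance):.2e}")
```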

Despite evidence of region-specific patterns of F3 variation, consistent with previous studies, in Experiment (3) F3 outperformed F1 and F2 in terms of the magnitude of LLRs and system validity. There was also evidence of speaker-specificity in the lower formants, with F1∼F3 generating higher magnitude SS LLRs and better overall system performance than any individual formants. However, the addition of F1 and F2 to F3 did generate lower magnitude DS LLRs. The combined results of Experiments (2) and (3) suggest that for F3, Garvin and Ladefoged’s (1963) group-individual distinction is a continuum rather than a dichotomy, since F3 was found to encode at least some regional information along with considerable speaker discriminatory power. More importantly, when considered in terms of the results of Experiment (1), it is clear that the inevitable regional and social information to which linguistic-phonetic variables respond may affect different elements of LR output (e.g. magnitude of LLRs, validity) in potentially unpredictable ways and to unpredictable extents. Potential explanations for the results [...]

5.5 Chapter summary

In document The definition of the relevant population and the collection of data for likelihood ratio-based forensic voice comparison (Page 156-161)