Methods used for spoken data analysis: authentic and simulated data

The phonetic data analysis in this thesis combines auditory and acoustic approaches to examine the differences and similarities between either:

• what are assumed to be threatening and non-threatening speech samples taken from the same speaker (authentic data), or

• the same speaker in experimental conditions producing threatening and non-threatening speech (simulated data)

This auditory-acoustic approach reflects the manner of forensic phonetic analysis currently practised in the United Kingdom (R v Flynn and St John [2008] Crim LR 799). By comparing the vocal productions of the same speaker across multiple experimental conditions, this thesis in effect presents the findings of 51 speaker comparisons. However, a key difference is that in a typical forensic speaker comparison, the intention is to ascertain the likelihood that the speaker in the criminal recording is also the speaker in a suspect recording. In this study, the analysis of the authentic threat data most closely resembles this practice. Here, though, the intention is to investigate whether these speakers adopt what could be described as a ‘threatening tone of voice’ while making what is likely to be a threat.

The numerous difficulties presented when analysing authentic data have been previously described in this thesis. By collecting data in experimental conditions, it is hoped that there will be minimal interference from confounding factors (such as those noted for authentic materials). For the simulated materials, the identity of the speakers is already known, as are the conditions that these recordings were created in. The speaker comparison for these data intends to provide a more robust comparison of the same speakers producing threatening and non-threatening speech.

The following sections detail how each phonetic and linguistic feature of interest to this study was analysed. These parameters were chosen because previous research, or theories centring on threatening language, connect them to threats. In addition, this study seeks to describe the broad phonetic and linguistic properties of threats with a view to identifying possible areas of interest for future research.

3.4.1 Fundamental frequency: authentic data

In this study, measurements of fundamental frequency (f0) were made using Praat software (Boersma and Weenink, 2014). Each sound file was edited before the extraction of these measurements to remove instances of background noise or speech that was irrelevant to the current study. As the authentic data consisted of male voices, Praat pitch settings were initially set at a minimum level of 60 Hz and a maximum level of 150 Hz. These settings were adjusted for voices whose f0 values fell outside this range.

Each recording was then converted to a pitch object in Praat. This process allows for the manipulation of the available component harmonics (or ‘candidates’). These pitch objects produce synthetic tones which correspond to whatever candidates are selected at the time. By manipulating these candidates, the pitch object can be modified to produce tones which more closely resemble the perceived pitch of the corresponding speech. From these ‘corrected’ pitch objects, the following measures were extracted from Praat manually:

• minimum pitch (Hz)

• mean pitch (Hz)

• median pitch (Hz)

• upper (75) and lower (25) quartiles (Hz)

• maximum pitch (Hz)

• standard deviation (Hz)
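
The measures listed above can be illustrated with a minimal Python sketch. It is not the procedure used in this thesis, where the values were read from Praat itself; it only shows what each measure represents for one recording. The function name `summarise_f0`, the input format (one f0 value per voiced frame, unvoiced frames already excluded), the use of the sample standard deviation and the quartile interpolation method are all assumptions, and Praat's own quantile and standard deviation calculations may differ slightly.

```python
import statistics


def summarise_f0(f0_hz):
    """Summarise per-frame f0 values (Hz) for one recording.

    Assumes unvoiced frames have already been excluded (hypothetical input format).
    """
    values = sorted(f0_hz)
    if len(values) < 2:
        raise ValueError("need at least two voiced frames")
    # quantiles(n=4) returns three cut points: lower quartile, median, upper quartile
    q25, _, q75 = statistics.quantiles(values, n=4, method="inclusive")
    return {
        "min": values[0],
        "mean": statistics.fmean(values),
        "median": statistics.median(values),
        "q25": q25,
        "q75": q75,
        "max": values[-1],
        # sample standard deviation (n - 1 denominator); an assumption
        "sd": statistics.stdev(values),
    }


# Hypothetical frame values, for illustration only (not data from this study)
example = summarise_f0([98.0, 102.0, 110.0, 105.0, 120.0])
```

Working from a sorted list of frame values keeps the sketch independent of any particular pitch tracker; the same function could be applied to values exported from a corrected pitch object.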

3.4.2 Articulation rate: authentic data

Articulation rate (AR) provides a measure of the tempo or pace of fluent speech. It is expressed as the number of syllables produced per second. Typically, for speakers of English, AR is between 4 and 6 syllables per second (Laver, 1994). Figures below this range would indicate very slow speech, while figures above it would indicate very fast speech.

In this research, articulation rate was calculated using the methods discussed by Künzel (1997). In keeping with this method, before AR was calculated, each speech recording was edited to remove all pauses longer than 0.1 seconds, with the exception of pauses which occurred within words. After each speaker’s task recording was edited, the articulation rate was calculated. This involved manually counting the number of syllables which the speaker actually produced, rather than the number expected from the citation forms of the words. During fluent speech, speakers regularly omit or reduce syllables (e.g. pronouncing the word ‘library’ with two syllables as in /laI.bri/, as opposed to three as in /laI.br@.ri/).

After the number of syllables for each task recording was counted, this figure was divided by the duration of the edited recording. These calculations produced an articulation rate output for each task recording for each of the speakers involved in this research.
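
The calculation described above can be sketched as follows. This is a minimal illustration rather than the tool used in this research, where the editing and syllable counting were done by hand. The function name `articulation_rate`, the segment labels and the tuple format are hypothetical; only the 0.1 second threshold and the exemption for within-word pauses come from the method described above.

```python
def articulation_rate(syllable_count, segments, pause_threshold=0.1):
    """Return syllables per second after removing long between-word pauses.

    segments: list of (kind, duration_in_seconds) tuples, where kind is
    "speech", "pause" (between words) or "within_word_pause".
    These labels are hypothetical and used for illustration only.
    """
    kept_duration = 0.0
    for kind, duration in segments:
        if kind == "pause" and duration > pause_threshold:
            continue  # this pause would have been cut from the recording
        kept_duration += duration  # speech, short pauses, within-word pauses
    if kept_duration <= 0:
        raise ValueError("no speech left after pause removal")
    return syllable_count / kept_duration


# Hypothetical recording: 20 realised syllables, one 0.5 s pause removed
ar = articulation_rate(
    20,
    [("speech", 2.0), ("pause", 0.5), ("speech", 1.5),
     ("within_word_pause", 0.25), ("speech", 0.25)],
)
```

In this invented example the edited duration is 4.0 seconds, so the result is 5.0 syllables per second, which falls within the range cited from Laver (1994).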

3.4.3 Fundamental frequency: simulated data

The simulated data were edited using the same procedure as described previously in relation to the authentic data. For female data, Praat pitch settings were set at a minimum level of 170 Hz and a maximum level of 250 Hz. Because there were more simulated recordings than authentic recordings, the following measurements were extracted from Praat using a script, rather than manually using Praat’s interface:

• minimum pitch (Hz)

• mean pitch (Hz)

• median pitch (Hz)

• upper (75) and lower (25) quartiles (Hz)

• maximum pitch (Hz)

• standard deviation (Hz)

This pitch measurement extraction script is shown in the Appendix.

3.4.4 Intensity: simulated data

Intensity measurements were extracted from the edited task recordings using a Praat script, rather than selecting each measurement separately using Praat’s interface. The previously described Praat pitch extraction script was modified by the researcher to extract the following intensity measurements automatically:

• minimum intensity (dB)

• mean intensity (dB)

• median intensity (dB)

• upper (75) and lower (25) quartiles (dB)

• maximum intensity (dB)

• standard deviation (dB)

The extracted data were exported as a .csv file and later as a Microsoft Excel file. The intensity script is shown in the Appendix. As reliable intensity measurements could not be guaranteed from the authentic speech materials (see Chapter 3.2), this thesis will only present the intensity analysis of simulated speech materials.
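
As an illustration of the export step, the sketch below writes one row of summary measures per recording to CSV. It is not the Appendix script, which runs inside Praat; the column names, the function name `write_summary_csv`, the recording label and the numeric values are all hypothetical placeholders, not results from this study.

```python
import csv
import io

# Hypothetical column names for the seven intensity measures
FIELDS = ["recording", "min_db", "mean_db", "median_db",
          "q25_db", "q75_db", "max_db", "sd_db"]


def write_summary_csv(rows, handle):
    """Write one row of summary measures per recording to an open text handle."""
    writer = csv.DictWriter(handle, fieldnames=FIELDS)
    writer.writeheader()
    for row in rows:
        writer.writerow(row)


# Placeholder values for illustration only
rows = [{"recording": "speaker01_task1", "min_db": 41.0, "mean_db": 63.5,
         "median_db": 64.0, "q25_db": 58.0, "q75_db": 69.0,
         "max_db": 78.5, "sd_db": 7.25}]

buffer = io.StringIO()  # an open file would be used in practice
write_summary_csv(rows, buffer)
csv_text = buffer.getvalue()
```

A file with one row per recording and one column per measure opens directly in spreadsheet software, which matches the later conversion to Microsoft Excel described above.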

3.4.5 Articulation rate: simulated data

For the simulated data, the method of calculating articulation rate was identical to that of the authentic data. As explained earlier in this subsection, only syllables which were actually produced by the speaker were counted for this analysis. Even though much of the simulated data is read speech, there are a number of false starts, repetitions and improvisations present in these spoken data. For this analysis, these disfluencies or additional linguistic content were not removed from the data. This was in order to maximise the quantity of data available for each AR analysis. Therefore, even though multiple speakers read aloud the same text (in similar experimental conditions), there are differences in the number of syllables produced and the duration of each speaker’s reading.

3.4.6 Vocal profile analysis: simulated data

In forensic phonetic casework performed in the United Kingdom, a modified version of the Edinburgh Vocal Profile Analysis Scheme, the VPA protocol (Beck, 2007), is used to catalogue the presence of various vocal settings and vocal tract features. In this study, vocal setting refers to the configuration of the vocal folds during speech. For example, holding the vocal folds tightly (but not completely) together results in creaky voice. ‘Vocal tract features’ refers here to changes made to the dimensions of the vocal tract during speech production. Examples include describing the lips as spread apart as opposed to rounded, or describing the perceived height of the larynx during speech.

The author of this research has been trained to perform vocal profile analysis of both forensic and non-forensic speech data as part of a Master’s-level programme in Forensic Speech Science. This programme included extensive ear-training on recordings of speakers who have, or are adopting, different vocal tract or voice quality settings. For example, inferences of vocal properties (such as a raised larynx or tongue-fronting) present in the speech samples collected for this research could be likened to recordings which were created or selected to exemplify these properties. The author was solely responsible for the analysis presented in §5.2.8 and §5.3.4.

Using the VPA protocol involves carefully listening to each recording and performing an impressionistic analysis. As such, compared to the other forms of phonetic analysis discussed so far, this analysis is inherently more subjective. A copy of this modified VPA protocol was completed by the author for all Task 1-4 recordings collected for this research under experimental conditions. This scheme allowed for non-neutral (or non-modal) vocal tract features or vocal settings to be recorded. In addition, the extent or degree of these non-neutral vocal tract features or vocal settings was noted on a scale of 1-3. For example, describing a voice sample as ‘Creaky voice (3)’ would indicate that there is an extreme level of creaky voice. A copy of a blank modified VPA protocol can be found in the Appendix.
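
One way such entries could be stored for later tabulation is sketched below. This is purely illustrative: the analysis in this research was completed on the modified VPA protocol form, and the class name `VpaEntry` and its fields are hypothetical. Only the 1-3 scale and the ‘Creaky voice (3)’ notation are taken from the description above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class VpaEntry:
    """One non-neutral feature noted for a recording (hypothetical structure)."""
    feature: str  # e.g. "Creaky voice" or "Raised larynx"
    degree: int   # scalar degree on the 1-3 scale, 3 being the most extreme

    def __post_init__(self):
        # Reject values outside the 1-3 scale used on the protocol
        if self.degree not in (1, 2, 3):
            raise ValueError("degree must be 1, 2 or 3")

    def label(self):
        """Format the entry in the notation used in the text."""
        return f"{self.feature} ({self.degree})"


entry_label = VpaEntry("Creaky voice", 3).label()
```

Validating the degree when the entry is created means an out-of-range score is caught at data entry rather than during analysis.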

It is common in forensic phonetic casework for multiple practitioners to calibrate their individual analyses. Owing to the large quantity of speech data collected for this research, it was not feasible for this calibration to take place across the data. As such, this analysis should be taken as an initial impression of the vocal tract features and vocal settings used during threatening speech. If any apparent voice quality or vocal setting features of interest are noted in this research, these can be scrutinised in further detail in any subsequent acoustic phonetic research.

Voice quality was not analysed for the authentic recordings sourced for this research, as it was thought that this property of speech would be particularly susceptible to changes that could be accounted for by contextual or environmental differences between the non-threatening and threatening recordings. As previously described in §3.2, the authentic recordings varied in terms of the situation the speech was produced in and the recording device(s) used. For example, some of the non-threatening speech samples (A8-A10) were made in a police interview setting, where speakers might be expected to speak differently from their more typical, modal speech, as well as (possibly) from their threatening speech. These issues were encountered initially during the examination of fundamental frequency and speech tempo presented in §5.1. The inherent variability of the authentic recordings led to the conclusion that the results of any further phonetic analyses would also be influenced by contextual or environmental changes, as opposed to changes from non-threatening speech when the speakers make a threat. As such, the collection of simulated data allows for a more robust comparison to be made between threatening and non-threatening speech, and minimises the interference of other factors relating to voice quality features.
