Datasets - Functional data analysis in phonetics

The current work employs two functional datasets. Both of them were made available to the author by his respective collaborators.

2.5.1 Sinica Mandarin Chinese Continuous Speech Prosody Corpora (COSPRO)

The Sinica Continuous Speech Prosody Corpora (COSPRO) [311] were collected at the Phonetics Lab of the Institute of Linguistics in Academia Sinica and consist of 9 sets of speech corpora. We focus our attention on the COSPRO-1 corpus, a phonetically balanced speech database consisting of recordings of Taiwanese Mandarin read speech. The COSPRO-1 recordings themselves were collected in 1994. COSPRO-1 was designed specifically to include all possible syllable combinations in Mandarin based on the most frequently used 2- to 4-syllable lexical words. Additionally, it incorporates all possible tonal combinations and concatenations. It therefore offers a high-quality speech corpus that, in theory at least, encapsulates all the prosodic effects that might be of acoustic interest.

[Figure 2.5: five panels (Tone 1 to Tone 5), each plotting F0 in Hz (50 to 350) against normalized time t (0 to 1) for speakers M01, M02, F01, F02 and F03.]

Figure 2.5: Tone realization in 5 speakers from the COSPRO-1 dataset.

After pre-processing and annotation, the recorded utterances, which have a median length of 20 syllables, yielded a total of 54707 fundamental frequency curves. Each F0 curve corresponds to the rhyme portion of one syllable. The three female and two male participants were native Taiwanese Mandarin speakers. Using the in-house speech processing software package, the COSPRO toolkit [311; 312], the fundamental frequency (F0) of each rhyme utterance was extracted at 10 ms intervals, a duration within which the speech waveform can be regarded as a stationary signal [131]. Associated with the recordings were characterizations of tone, rhyme and adjacent consonants, as well as speech breaks or pauses. Importantly, the presented corpus is a real language corpus and not just a series of nonsensical phonation patterns; thus, while designed to include all tonal combinations, it still carries semantic meaning.
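Because rhymes differ in duration, F0 curves sampled every 10 ms contain different numbers of points. The sketch below shows one simple way to bring such curves onto a common time axis on [0, 1], as in the time-normalized display of Fig. 2.5. It is only an illustration: the function name `time_normalize`, the grid size and the use of linear interpolation are assumptions made here, not a description of the COSPRO toolkit or of the processing actually used in this work.

```python
from typing import List, Sequence


def time_normalize(f0: Sequence[float], n_points: int = 11) -> List[float]:
    """Linearly interpolate an F0 contour onto n_points equally spaced on [0, 1].

    f0 holds the F0 values (Hz) of one rhyme, one value per 10 ms frame.
    The first and last samples are mapped to t = 0 and t = 1 respectively.
    """
    if len(f0) < 2 or n_points < 2:
        raise ValueError("need at least two samples and two grid points")
    last = len(f0) - 1
    out = []
    for k in range(n_points):
        pos = k * last / (n_points - 1)  # fractional index into f0
        i = min(int(pos), last - 1)      # left neighbour, clamped at the end
        w = pos - i                      # weight of the right neighbour
        out.append((1.0 - w) * f0[i] + w * f0[i + 1])
    return out


# Two hypothetical contours of different length end up on the same grid.
short_curve = time_normalize([100.0, 110.0, 120.0], n_points=5)
long_curve = time_normalize([200.0, 190.0, 180.0, 170.0, 160.0, 150.0, 140.0], n_points=5)
```

After this step every curve has the same number of samples, so curves from rhymes of different duration can be overlaid or averaged pointwise; the original duration then has to be carried separately as a covariate.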

More specifically, each syllable is labeled with one of the four lexically specified tones or marked as phonologically toneless (tone 5). In addition, contextual information is associated with each curve (see Table 2.2 for a list of the covariates included). Fig. 2.5 shows time-normalized example realizations of all 5 tones for all 5 speakers.

Effects              Values             Meaning                                                          Notation mark

Fixed effects
previous tone        0:5                Tone of previous syllable, 0 no previous tone present            tnprevious
current tone         1:5                Tone of syllable                                                 tncurrent
following tone       0:5                Tone of following syllable, 0 no following tone present          tnnext
previous consonant   0:3                0 is voiceless, 1 is voiced, 2 not present, 3 sil/short pause    cnprevious
next consonant       0:3                0 is voiceless, 1 is voiced, 2 not present, 3 sil/short pause    cnnext
B2                   linear             Position of the B2 index break in sentence                       B2
B3                   linear             Position of the B3 index break in sentence                       B3
B4                   linear             Position of the B4 index break in sentence                       B4
B5                   linear             Position of the B5 index break in sentence                       B5
Sex                  0:1                1 for male, 0 for female                                         Sex
Duration             linear             10s of ms                                                        Duration
rhyme type           1:37               Rhyme of syllable                                                rhymet

Random effects
Speaker              N(0, σ²_speaker)   Speaker effect                                                   SpkrID
Sentence             N(0, σ²_sentence)  Sentence effect                                                  Sentence

Table 2.2: Covariates examined in relation to F0 production in Taiwanese Mandarin. Tone variables are on a 5-point scale representing tonal characterization, with 5 indicating a toneless syllable and 0 indicating that no rhyme precedes the current one (such as at the sentence start). Reference tone trajectories are shown in Fig. 2.4.
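The covariates of Table 2.2 can be read as one record per F0 curve. The following sketch shows how such a record might be represented and range-checked. The class name `RhymeCovariates`, the Python field names and the choice of integer types are hypothetical; only the value ranges are taken from Table 2.2.

```python
from dataclasses import dataclass

# Allowed (min, max) values for the coded covariates, as listed in Table 2.2.
RANGES = {
    "tn_previous": (0, 5),   # 0: no previous tone present
    "tn_current": (1, 5),    # 5: toneless syllable
    "tn_next": (0, 5),       # 0: no following tone present
    "cn_previous": (0, 3),   # 0 voiceless, 1 voiced, 2 not present, 3 sil/short pause
    "cn_next": (0, 3),
    "sex": (0, 1),           # 1 male, 0 female
    "rhyme_type": (1, 37),
}


@dataclass(frozen=True)
class RhymeCovariates:
    """Contextual information attached to one F0 curve (hypothetical layout)."""
    tn_previous: int
    tn_current: int
    tn_next: int
    cn_previous: int
    cn_next: int
    b2: int            # position relative to the B2 break
    b3: int
    b4: int
    b5: int
    sex: int
    duration: int      # rhyme duration in units of 10 ms
    rhyme_type: int
    speaker_id: str    # grouping factor for the speaker random effect
    sentence_id: str   # grouping factor for the sentence random effect

    def __post_init__(self) -> None:
        # Reject codes outside the ranges given in Table 2.2.
        for name, (lo, hi) in RANGES.items():
            value = getattr(self, name)
            if not lo <= value <= hi:
                raise ValueError(f"{name}={value} outside [{lo}, {hi}]")


# A made-up record: a tone-4 rhyme at the start of a sentence.
example = RhymeCovariates(0, 4, 2, 3, 1, 1, 1, 1, 1, 0, 14, 12, "F01", "S0001")
```

Checking the codes at construction time is a design choice of this sketch; it catches, for example, a current tone coded as 0, which Table 2.2 only allows for the neighbouring-tone covariates.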

2.5.2 Oxford Romance Language Dataset

The Oxford Romance Language Dataset was collected by Prof. John Coleman in the Phonetics Laboratory of the University of Oxford between 2012 and 2013. It consists of natural speech recordings of four languages: French, Italian, Portuguese and Spanish. Spanish recordings were classified as American or Iberian Spanish. For the purpose of this study, American and Iberian Spanish are treated as distinct languages. The speakers utter the numbers one to ten in their native language and dialect. The dataset is inherently unbalanced; we have seven (7) French speakers, five (5) Italian speakers, five (5) American Spanish speakers, five (5) Iberian Spanish speakers and three (3) Portuguese speakers. We were unable to obtain recordings of all 10 digits from all speakers, which resulted in a final sample of 219 recordings. The recordings were either collected from freely available material on language training websites or were standardized recordings made by university students.

Language            Number of Speakers (F/M)
French              7 (4/3)
Italian             5 (3/2)
American Spanish    5 (3/2)
Iberian Spanish     5 (4/1)
Portuguese          3 (2/1)

Table 2.3: Speaker-related information in the Romance languages sample. Numbers in parentheses show how many female and male speakers are available.
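The degree of imbalance can be made concrete with a little arithmetic on Table 2.3. The sketch below assumes that a complete design would contain exactly one recording per speaker and digit; that assumption, and the variable names, are introduced here for illustration only.

```python
# (female, male) speaker counts per language, copied from Table 2.3.
speakers = {
    "French": (4, 3),
    "Italian": (3, 2),
    "American Spanish": (3, 2),
    "Iberian Spanish": (4, 1),
    "Portuguese": (2, 1),
}

N_DIGITS = 10      # the numbers one to ten
N_AVAILABLE = 219  # recordings actually in the sample

total_female = sum(f for f, _ in speakers.values())
total_male = sum(m for _, m in speakers.values())
total_speakers = total_female + total_male

# Size of a hypothetical complete design: one recording per speaker and digit.
max_recordings = total_speakers * N_DIGITS
missing = max_recordings - N_AVAILABLE

print(f"{total_speakers} speakers ({total_female} F / {total_male} M)")
print(f"{N_AVAILABLE} of {max_recordings} possible recordings, {missing} missing")
```

Under that assumption the sample covers 219 of 250 possible speaker-digit combinations, so the imbalance comes both from unequal speaker counts per language and from missing digits.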

An important caveat regarding this dataset is that it is “real world”. This contrasts with the COSPRO dataset, which was recorded under phonetic laboratory conditions. The Romance language dataset consists of recordings people made in non-laboratory settings (e.g. classes, offices). It is also heterogeneous in terms of bit-rate sampling, duration and even format. As such, before any phonetic or statistical analysis took place, all data were converted to *.wav files sampled at 16 kHz. This clearly undermines the quality of the recordings compared to the ones acquired by COSPRO, but these conversions were deemed essential to ensure sample homogeneity. Fig. 2.1 shows a typical waveform reading.
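The text does not state which tool performed the conversion. As one possible illustration, the sketch below builds an `ffmpeg` command that rewrites an arbitrary input file as a 16 kHz *.wav file. The use of ffmpeg, the helper names `build_ffmpeg_cmd` and `convert`, and the file paths are all assumptions; `-y` (overwrite), `-i` (input) and `-ar` (audio sampling rate) are standard ffmpeg options.

```python
import subprocess
from pathlib import Path
from typing import List

TARGET_RATE_HZ = 16_000  # sampling rate stated in the text


def build_ffmpeg_cmd(src: Path, dst_dir: Path) -> List[str]:
    """Return the ffmpeg argument list converting src to a 16 kHz .wav in dst_dir."""
    dst = dst_dir / (src.stem + ".wav")
    return ["ffmpeg", "-y", "-i", str(src), "-ar", str(TARGET_RATE_HZ), str(dst)]


def convert(src: Path, dst_dir: Path) -> None:
    """Run the conversion; requires ffmpeg to be installed and on the PATH."""
    dst_dir.mkdir(parents=True, exist_ok=True)
    subprocess.run(build_ffmpeg_cmd(src, dst_dir), check=True)


# Hypothetical input file; only the command is built here, nothing is executed.
cmd = build_ffmpeg_cmd(Path("raw") / "uno_speaker01.mp3", Path("wav16k"))
```

Resampling everything to a single rate makes the recordings comparable, but it cannot restore information lost in low-quality originals, which is the quality trade-off noted above.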

The Romance language dataset is used exclusively for the phylogenetic applications showcased in Chapter 6, as it provides an obvious “well-examined” [105; 230; 106] sub-sample of the greater Romance linguistic family, although some “standard members” of the Romance family, such as Catalan and Romanian, were not included. Fig. 2.6 shows an unrooted linguistic phylogenetic tree T of nominal phylogenetic distances for the languages at hand, based on Grey et al. [106].

[Figure 2.6: unrooted tree titled “Romance Language Unrooted Phylogeny” with leaves Italian, American Spanish, Iberian Spanish, Portuguese and French.]

Figure 2.6: Unrooted Romance Language Phylogeny based on [106]. Branch lengths do not correspond to lexical clock time.
