The current work employs two functional datasets. Both were made available to the author by his respective collaborators.
2.5.1 Sinica Mandarin Chinese Continuous Speech Prosody Corpora (COSPRO)
The Sinica Continuous Speech Prosody Corpora (COSPRO) [311] was collected at the Phonetics Lab of the Institute of Linguistics in Academia Sinica and consists of 9 sets of speech corpora. We focus our attention on the COSPRO-1 corpus;
[Figure 2.5 plot: five panels (Tone 1 to Tone 5), each showing F0 in Hz (50 to 350) against normalized time t (0 to 1) for speakers M01, M02, F01, F02 and F03.]
Figure 2.5: Tone realization in 5 speakers from the COSPRO-1 dataset.
the phonetically balanced speech database consists of recordings of Taiwanese Mandarin read speech. The COSPRO-1 recordings themselves were collected in 1994. COSPRO-1 was designed specifically to include all possible syllable combinations in Mandarin based on the most frequently used 2- to 4-syllable lexical words. Additionally, it incorporates all possible tonal combinations and concatenations. It therefore offers a high-quality speech corpus that, in theory at least, encapsulates all the prosodic effects that might be of acoustic interest.
After pre-processing and annotation, the recorded utterances, having a median length of 20 syllables, resulted in a total of 54707 fundamental frequency curves. Each F0 curve corresponds to the rhyme portion of one syllable. The three female and two male participants were native Taiwanese Mandarin speakers. Using the in-house developed speech processing software package COSPRO toolkit [311; 312], the fundamental frequency (F0) of each rhyme utterance was extracted at 10 ms intervals, a duration under which the speech waveform can be regarded as a stationary signal [131]. Associated with the recordings were characterizations of tone, rhyme, adjacent consonants as well as speech break or pause. Importantly, the presented corpus is a real language corpus and not just a series of nonsensical phonation patterns; thus, while designed to include all tonal combinations, it still has semantic meaning.
More specifically, the syllables are labeled with one of the four lexically specified tones or marked as phonologically toneless (tone 5). In addition, contextual information is associated with each curve (see Table 2.2 for a list of the covariates included). Fig. 2.5 shows time-normalized example realizations of all 5 tones for all 5 speakers.

| Effects | Values | Meaning | Notation mark |
|---|---|---|---|
| Fixed effects | | | |
| previous tone | 0:5 | Tone of previous syllable, 0 no previous tone present | tnprevious |
| current tone | 1:5 | Tone of syllable | tncurrent |
| following tone | 0:5 | Tone of following syllable, 0 no following tone present | tnnext |
| previous consonant | 0:3 | 0 is voiceless, 1 is voiced, 2 not present, 3 sil/short pause | cnprevious |
| next consonant | 0:3 | 0 is voiceless, 1 is voiced, 2 not present, 3 sil/short pause | cnnext |
| B2 | linear | Position of the B2 index break in sentence | B2 |
| B3 | linear | Position of the B3 index break in sentence | B3 |
| B4 | linear | Position of the B4 index break in sentence | B4 |
| B5 | linear | Position of the B5 index break in sentence | B5 |
| Sex | 0:1 | 1 for male, 0 for female | Sex |
| Duration | linear | 10s of ms | Duration |
| rhyme type | 1:37 | Rhyme of syllable | rhymet |
| Random effects | | | |
| Speaker | N(0, σ²_speaker) | Speaker effect | SpkrID |
| Sentence | N(0, σ²_sentence) | Sentence effect | Sentence |

Table 2.2: Covariates examined in relation to F0 production in Taiwanese Mandarin. Tone variables are on a 5-point scale representing tonal characterization, with 5 indicating a toneless syllable and 0 representing the fact that no rhyme precedes the current one (such as at the sentence start). Reference tone trajectories are shown in Fig. 2.4.
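Because F0 is sampled every 10 ms, rhymes of different duration yield curves with different numbers of readings, so the curves must be mapped to a common time axis before they can be overlaid as in Fig. 2.5. The text does not state how this time normalization was carried out; the following is a minimal sketch that assumes simple linear interpolation onto an equally spaced grid on [0, 1]. The function name `time_normalize`, the grid size and the example curve are hypothetical and chosen purely for illustration.

```python
from typing import List, Sequence

FRAME_STEP_MS = 10  # F0 sampling interval stated in the text


def time_normalize(f0_hz: Sequence[float], n_points: int = 11) -> List[float]:
    """Linearly interpolate an F0 curve onto n_points equally spaced
    values of normalized time t in [0, 1] (assumed scheme, see lead-in)."""
    if len(f0_hz) < 2 or n_points < 2:
        raise ValueError("need at least two samples and two grid points")
    last = len(f0_hz) - 1
    out = []
    for k in range(n_points):
        pos = k * last / (n_points - 1)  # fractional index into the original curve
        i = min(int(pos), last - 1)      # left neighbour, clamped so i + 1 is valid
        w = pos - i                      # weight given to the right neighbour
        out.append((1 - w) * f0_hz[i] + w * f0_hz[i + 1])
    return out


# Hypothetical rhyme spanning 50 ms: six F0 readings taken every 10 ms.
curve = [200.0, 210.0, 220.0, 230.0, 240.0, 250.0]
normalized = time_normalize(curve, n_points=11)
```

After normalization every curve has the same length regardless of the original rhyme duration, which is why Duration is retained as a separate covariate in Table 2.2.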
2.5.2 Oxford Romance Language Dataset
The Oxford Romance Language Dataset was collected by Prof. John Coleman in the Phonetics Laboratory of the University of Oxford between 2012 and 2013. It consists of natural speech recordings of four languages: French, Italian, Portuguese and Spanish. Spanish recordings were classified as American or Iberian Spanish; for the purpose of this study, American and Iberian Spanish are treated as distinct languages. The speakers utter the numbers one to ten in their native language and dialect. The dataset is inherently unbalanced: we have seven (7) French speakers, five (5) Italian speakers, five (5) American Spanish speakers, five (5) Iberian Spanish speakers and three (3) Portuguese speakers. Recordings of all 10 digits were not available for every speaker, resulting in a final sample of 219 recordings. The recordings were either collected from freely available material on language training websites or were standardized recordings made by university students.
| Language | Number of speakers (F/M) |
|---|---|
| French | 7 (4/3) |
| Italian | 5 (3/2) |
| American Spanish | 5 (3/2) |
| Iberian Spanish | 5 (4/1) |
| Portuguese | 3 (2/1) |
Table 2.3: Speaker-related information in the Romance languages sample. Numbers in parentheses show how many female and male speakers are available.
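The degree of imbalance and incompleteness can be checked with a short tally. The sketch below takes the speaker counts from Table 2.3 and the figure of 219 recordings from the text; it assumes that each speaker contributes at most one recording per digit, which the text implies but does not state explicitly. All variable names are illustrative.

```python
# Speaker counts (female, male) copied from Table 2.3.
SPEAKERS = {
    "French": (4, 3),
    "Italian": (3, 2),
    "American Spanish": (3, 2),
    "Iberian Spanish": (4, 1),
    "Portuguese": (2, 1),
}
DIGITS_PER_SPEAKER = 10     # the numbers one to ten
RECORDINGS_AVAILABLE = 219  # sample size stated in the text

n_speakers = sum(f + m for f, m in SPEAKERS.values())
n_female = sum(f for f, _ in SPEAKERS.values())
n_male = sum(m for _, m in SPEAKERS.values())

# Assumption: at most one recording per (speaker, digit) pair.
n_possible = n_speakers * DIGITS_PER_SPEAKER
n_missing = n_possible - RECORDINGS_AVAILABLE
coverage = RECORDINGS_AVAILABLE / n_possible

print(f"{n_speakers} speakers ({n_female} F / {n_male} M)")
print(f"{RECORDINGS_AVAILABLE} of {n_possible} possible recordings "
      f"({coverage:.1%}), {n_missing} missing")
```

Under that assumption the sample covers 219 of 250 possible speaker-digit pairs (87.6%), with 16 female and 9 male speakers in total.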
An important caveat regarding this dataset is that it is “real world” data. This contrasts with the COSPRO dataset, which was recorded under phonetic laboratory conditions. The Romance language dataset consists of recordings made under non-laboratory settings (e.g. classes, offices). It is also heterogeneous in terms of bit-rate sampling, duration and even format. As such, before any phonetic or statistical analysis took place, all data were converted to *.wav files at 16 kHz. This clearly undermines the quality of the recordings compared to those in COSPRO, but these conversions were deemed essential to ensure sample homogeneity. Fig. 2.1 shows a typical waveform reading.
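The text does not state which tool performed the conversion to 16 kHz. Purely to illustrate what bringing recordings to a common sampling rate involves, the sketch below resamples a mono signal by linear interpolation. This is an assumed, deliberately simple scheme: it omits the anti-aliasing low-pass filter that a dedicated audio resampler would apply before downsampling, so it should not be read as the procedure used here. The function name `resample_linear` and the test tone are hypothetical.

```python
import math
from typing import List, Sequence

TARGET_RATE_HZ = 16_000  # common rate stated in the text


def resample_linear(samples: Sequence[float], src_rate: int,
                    dst_rate: int = TARGET_RATE_HZ) -> List[float]:
    """Resample a mono signal by linear interpolation.

    Illustrative only: no anti-aliasing filter is applied.
    """
    if src_rate <= 0 or dst_rate <= 0:
        raise ValueError("sampling rates must be positive")
    if len(samples) < 2 or src_rate == dst_rate:
        return list(samples)
    n_out = int(len(samples) * dst_rate / src_rate)
    step = src_rate / dst_rate      # source samples advanced per output sample
    last = len(samples) - 1
    out = []
    for k in range(n_out):
        pos = min(k * step, last)   # fractional source index, clamped at the end
        i = min(int(pos), last - 1) # left neighbour, clamped so i + 1 is valid
        w = pos - i                 # weight given to the right neighbour
        out.append((1 - w) * samples[i] + w * samples[i + 1])
    return out


# Hypothetical input: 10 ms of a 100 Hz sine sampled at 44.1 kHz.
src_rate = 44_100
tone = [math.sin(2 * math.pi * 100 * n / src_rate) for n in range(441)]
converted = resample_linear(tone, src_rate)
```

Recordings that start at a lower rate than the target gain no information from such a conversion, and those that start higher lose bandwidth, which is one sense in which homogenization trades recording quality for comparability.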
The Romance language dataset is used exclusively for the phylogenetic applications showcased in Chapter 6, as it provides an obvious “well-examined” [105; 230; 106] sub-sample of the greater Romance linguistic family; some “standard members” of the Romance family, such as Catalan and Romanian, were not included. Fig. 2.6 shows an unrooted linguistic phylogenetic tree T of nominal phylogenetic distances for the languages at hand, based on Grey et al. [106].
[Figure 2.6 plot: unrooted tree titled “Romance Language Unrooted Phylogeny” with leaves Italian, American Spanish, Iberian Spanish, Portuguese and French.]
Figure 2.6: Unrooted Romance Language Phylogeny based on [106]. Branch lengths do not correspond to lexical clock time.