
Pitch & Prosody Models

The rhythmic and intonational patterns of a language [159] are known as prosody. Prosody and F0 modelling are intertwined, as F0 is a key component of prosody⁶, and this connection is even stronger in tonal languages like Mandarin Chinese. Accurate modelling of the voicing structures enables the accurate modelling of voiced speech segments, thus assisting all aspects of speech-related studies: synthesis, recognition, and coding [69].

5 Pinyin is the official Latin-alphabet transliteration of Mandarin Chinese used by the People’s Republic of China [328].

6

Figure 2.4: Reference tone shapes for Tones 1-4 as presented in the work of Yuen Ren Chao; Tone 5 is not represented as it lacks a general estimate, always being significantly affected by non-standardized down-drift effects. Vertical axis represents impressionistic pitch height.

Somewhat generally, the “reference framework” for the analysis and synthesis of intonation patterns is ToBi [158]. As defined by its creators: “ToBi is a framework for developing community-wide conventions for transcribing the intonation and prosodic structure of spoken utterances in a language variety”. ToBi defines five pitch accents and four boundary tones, based on which it categorizes each respective utterance. ToBi is effectively a complete prosodic system, but two points about it are important. First, ToBi is not a universal system. There are language-specific ToBi systems that are not interoperable with one another; this is a major shortcoming in its generality. In that sense, its rigidity has rendered it too restrictive to model even English varieties (ToBi was originally developed specifically for the English language), leading to the development of IVie [103]. Second, ToBi also defines a series of different break types, acting as phrasing boundaries. Break counts are very significant because, physiologically, a break has a resetting effect on the vocal folds’ vibrations; a qualitative description of break counts is provided in Table 2.1. This recognition of the importance of breaks highlights an important physiological characteristic of F0: while a continuous trajectory is meaningful for temporal modelling, an F0 trajectory is not a continuously varying parameter along an utterance but rather a series of correlated discrete events that are realized as continuous curves.

Break Type    Meaning
Break 1       Normal syllable boundary. In languages like written Chinese where there is “no alphabet” but the written system corresponds directly to morphemes, this corresponds to a single character. (As syllable segments will often act as our experimental data units, B1 is equivalent to the mean value of the statistical estimates and thus not examined separately as a “dependent variable”).
Break 2       Prosodic word boundary. Syllables group together into a word, which may or may not correspond to a lexical word.
Break 3       Prosodic phrase boundary. This break is marked by an audible pause.
Break 4       Breath group boundary. The speaker inhales.
Break 5       Prosodic group boundary. A complete speech paragraph.

Table 2.1: ToBi Break Annotation

Complementary to ToBi is the work of Taylor with the TILT model [305]. “TILT is a phonetic model of intonation that represents intonation as a sequence of continuously parametrized events.” The interesting thing about TILT is that it treats the intonational event not only as its modelling target but also as its fundamental unit. As such, it does not use predetermined labels as ToBi does. Instead, each event is characterized by its amplitude, duration and tilt. Tilt (not TILT) is effectively a continuous description of the F0 curve that is a function of the duration D and the amplitude A of the intonation pattern examined. In particular:

\[
\mathrm{Tilt} = \frac{|A_{rise}| - |A_{fall}|}{2\,(|A_{rise}| + |A_{fall}|)} + \frac{D_{rise} - D_{fall}}{2\,(D_{rise} + D_{fall})}. \tag{2.14}
\]

The TILT model was influential because it provided an empirical and continuous representation of F0. Nevertheless, in the end TILT uses three numbers, undoubtedly important ones, to characterize a single curve. This is not “wrong” (the popularity of TILT hints that these three numbers are highly effective), but it ultimately fails to provide a framework that can be directly expanded to account for increasing sample complexity. Additionally, it does not account for speaker-related information affecting an utterance, nor for explicit interaction between successive F0 curves.
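As a minimal sketch of how the tilt value in Eq. (2.14) can be computed for a single intonational event, the following Python function evaluates the two terms separately. The function name, the argument names and the choice to raise an error for a degenerate event (zero total amplitude or zero total duration) are illustrative assumptions, not part of the TILT model as described here.

```python
import math


def tilt(a_rise, a_fall, d_rise, d_fall):
    """Tilt of one intonational event, following Eq. (2.14).

    a_rise, a_fall: amplitudes of the rise and fall parts of the event.
    d_rise, d_fall: durations of the rise and fall parts (non-negative).
    Names and error handling are illustrative assumptions.
    """
    amp_total = abs(a_rise) + abs(a_fall)
    dur_total = d_rise + d_fall
    if amp_total == 0 or dur_total == 0:
        # Eq. (2.14) is undefined for an event with no excursion or no extent.
        raise ValueError("event must have non-zero amplitude and duration")
    amp_term = (abs(a_rise) - abs(a_fall)) / (2.0 * amp_total)
    dur_term = (d_rise - d_fall) / (2.0 * dur_total)
    return amp_term + dur_term


if __name__ == "__main__":
    # Hypothetical event: a larger, longer rise followed by a smaller, shorter fall.
    print(tilt(a_rise=30.0, a_fall=10.0, d_rise=0.15, d_fall=0.05))
```

Under this definition a pure rise gives a tilt of +1, a pure fall gives -1, and an event whose rise and fall are equal in amplitude and duration gives 0, so the single number summarizes the overall shape of the event.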

This is one of the main intuitions behind the third and final “reference model” for intonation patterns: the Fujisaki model [90]. The Fujisaki model was introduced by Fujisaki and Ohno in 1997 and was extended mostly through Fujisaki’s collaboration with Mixdorff⁷. Similarly to the TILT model, the Fujisaki model is a quantitative model that does not use explicit labels. The basic modelling assumption behind the Fujisaki model is that the F0 contour along a sentence is the superposition of a slowly-varying and a rapidly-varying component [90]. The slowly-varying component governs the overall curvature of the F0 contour along the duration of the sentence, while the rapidly-varying component relates to the lexical tone. This major idea came from the way the F0 production mechanism is treated: Fujisaki’s earlier work approximates the laryngeal structure as, effectively, the step response function of a second-order linear system [133]. Another important theoretical breakthrough of the Fujisaki model was that it explicitly incorporated speaker-related information or, better yet, uncertainty; for example, it assumes that the lowest attainable F0 is a speaker-related rather than a universal characteristic and that it should be treated as an unobserved random variable. The major shortcoming of the Fujisaki model actually comes from within its design: the idea of a slowly-varying, deterministic down-drift component is rather restrictive, despite being a reasonable norm. Especially in its original format, this assumption fails to account for intonation patterns in Western languages [305]. Also, in its original form the Fujisaki model advocated the use of a rigid gradient for each of the rapidly-varying components, a position on which the TILT model was definitely more flexible.

7 The author feels that given the amount of work that Hansjörg Mixdorff has published in relation to the Fujisaki model, the Fujisaki-Mixdorff naming scheme would probably be more accurate; e.g. see [210; 211; 215; 212; 216; 213; 214] among others.
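The superposition idea behind the Fujisaki model can be illustrated with a short sketch. The text above only states that a slowly-varying and a rapidly-varying component are superimposed and that the mechanism is approximated by a second-order linear system; the explicit response functions below follow the formulation commonly given in the literature (a critically damped second-order system, driven by impulses for the slow component and by steps for the rapid one, added in the log-F0 domain), and are an assumption rather than something taken from this text. All function names, the command tuples and the default values of `alpha`, `beta` and `gamma` are likewise illustrative.

```python
import math


def phrase_response(t, alpha=2.0):
    """Slow component: impulse response of a critically damped
    second-order system (assumed form), zero before the command."""
    if t < 0:
        return 0.0
    return alpha * alpha * t * math.exp(-alpha * t)


def accent_response(t, beta=20.0, gamma=0.9):
    """Rapid component: step response of a critically damped
    second-order system (assumed form), capped at the ceiling gamma."""
    if t < 0:
        return 0.0
    return min(1.0 - (1.0 + beta * t) * math.exp(-beta * t), gamma)


def fujisaki_log_f0(t, base_f0, phrase_cmds, accent_cmds,
                    alpha=2.0, beta=20.0, gamma=0.9):
    """Log-F0 at time t as a speaker-specific baseline plus both components.

    phrase_cmds: list of (onset_time, amplitude) impulses.
    accent_cmds: list of (onset_time, offset_time, amplitude) steps.
    """
    log_f0 = math.log(base_f0)  # speaker-specific lowest F0
    for onset, amp in phrase_cmds:
        log_f0 += amp * phrase_response(t - onset, alpha)
    for onset, offset, amp in accent_cmds:
        log_f0 += amp * (accent_response(t - onset, beta, gamma)
                         - accent_response(t - offset, beta, gamma))
    return log_f0


if __name__ == "__main__":
    # Hypothetical utterance: one phrase command and one accent/tone command.
    phrase_cmds = [(0.0, 0.5)]
    accent_cmds = [(0.2, 0.5, 0.4)]
    for t in (0.0, 0.2, 0.4, 0.8, 1.6):
        f0 = math.exp(fujisaki_log_f0(t, 100.0, phrase_cmds, accent_cmds))
        print(f"t={t:.1f}s  F0={f0:.1f} Hz")
```

In this sketch the baseline `base_f0` plays the role of the speaker-related lowest attainable F0 mentioned above; it is passed in as a fixed number here, whereas the model treats it as an unobserved quantity to be estimated.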

A number of other prosodic frameworks have been based on these three basic ones (e.g. MOMEL [135], INTSINT [196], qTA [244], etc.), but few have presented a prosodic framework that offers a universal “language-agnostic” approach. The work presented in later chapters of this thesis strives to deliver exactly that.