
Chapter 2: Methodologies

2.4. Acoustic Data Processing

2.4.2. Acoustic normalisation

Acoustic signals are highly variable because they carry not only the linguistic structure of an utterance but also a wealth of extralinguistic information, including the speaker’s sociocultural background, pragmatic intent, attitude and emotional state, and vocal tract anatomy and physiology (Anderson, 1978; Ladefoged, 1999; Harrington, 2010; Rose, 1986, 2000, 2016; Foulkes et al., 2013; Huang et al., 2016). For example, anatomical and physiological differences between individual vocal tracts can generate different acoustic outputs for phonologically identical utterances, and female speakers commonly have higher F0 values than males because of their shorter and less massive vocal cords.

The acoustic variability resulting from the entwined linguistic and extralinguistic information makes it necessary to separate the variable indexical content from the invariant linguistic content in speech signals (Rose, 1986, 1996; Huang et al., 2016). The process of normalisation is designed to accomplish this goal and to derive a linguistic-phonetic representation of the variety in question, making this variety comparable to other languages/varieties with respect to language-internal properties (Anderson, 1978; Rose, 1986, 1996, 2000, 2016; Ladefoged, 1999; Huang et al., 2016).

2.4.2.1. F0 normalisation

Several normalisation strategies have been proposed and compared for achieving an effective reduction in the between-speaker variation in tonal F0 (e.g., Earle, 1975; Shi, 1986; Shi et al., 2010; Rose, 1986, 1987, 1993, 2016; Zhu, 2004). For example, Earle (1975) proposed a z-score normalisation of the tones of standard Vietnamese. Shi (1986) proposed a t-value transform approach for a corpus with a single speaker of Chinese languages, and Shi et al. (2010) suggested a revised t-value normalisation for a large corpus with multiple speakers. Zhu (2004) compared six different normalisation transforms, while Rose (2016) conducted a comparative-quantitative study to judge the efficiency of seven normalisation strategies using the citation F0 data from four Chinese dialects.

Among the different approaches, z-score normalisation has been demonstrated to be superior to other strategies (Rose, 1987, 1993, 2016), and it has been widely adopted in tonal studies (e.g., Steed, 2011; Shen & Rose, 2016; Huang et al., 2016). This study also used the z-score normalisation approach to reduce the speaker-dependent variance across 21 speakers. The z-score normalised F0 value Zi is calculated using the following formula (Huang et al., 2016):

Zi = (Xi-m)/s.

In this formula, the normalisation parameters m and s stand, respectively, for the raw mean F0 value and the standard deviation estimated from all sampled F0 values over all tokens of all tones from a given speaker. Xi is an observed F0 value at a given sampling point, while Zi is its corresponding normalised value, derived as the distance from the mean F0 value, which corresponds to the speaker’s neutral pitch. The normalised F0 contour is therefore expressed in units of standard deviation. Assuming the values are normally distributed, approximately 95% of the normalised F0 values are expected to fall between -2 and +2 standard deviations from the mean. In this study, each raw F0 value extracted from Praat for an individual speaker was transformed into its corresponding normalised value, forming a new dataset to be further plotted and statistically tested with respect to specific research questions, including:

● How are individual citation tones realised in terms of z-score normalised F0 from 21 speakers? (Chapter 5)

● How are individual phrase-initial tones realised across their following tones in terms of z-score normalised F0 from 21 speakers? (Chapter 7)

● How are individual phrase-final tones realised across their preceding tones in terms of z-score normalised F0 from 21 speakers? (Chapter 8)
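The z-score transform defined in this section, Zi = (Xi-m)/s, can be illustrated with a minimal Python sketch. It is not the script used in the study (which relied on R). The function name, the dictionary layout, the toy F0 values and the choice of the sample standard deviation (n - 1 denominator) are all assumptions made for illustration; the text does not state which estimator was used.

```python
from statistics import mean, stdev


def zscore_normalise_f0(f0_by_speaker):
    """Z-score each speaker's F0 values with that speaker's own m and s.

    f0_by_speaker maps a (hypothetical) speaker label to a list of raw F0
    values in Hz, pooled over all sampling points, tokens and tones.
    """
    normalised = {}
    for speaker, values in f0_by_speaker.items():
        m = mean(values)   # speaker's mean F0
        s = stdev(values)  # assumption: sample standard deviation (n - 1)
        normalised[speaker] = [(x - m) / s for x in values]
    return normalised


# Toy data (invented): speaker_B has a higher and wider F0 range than speaker_A.
raw_f0 = {
    "speaker_A": [110.0, 120.0, 130.0, 140.0, 150.0],
    "speaker_B": [190.0, 210.0, 230.0, 250.0, 270.0],
}
z = zscore_normalise_f0(raw_f0)

for speaker, values in z.items():
    print(speaker, [round(v, 2) for v in values])
```

In this toy case the two speakers differ in both pitch level and pitch range, yet their normalised contours coincide, which is the kind of between-speaker reduction the transform is meant to achieve.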

2.4.2.2. Duration normalisation

To retain the linguistic information of the absolute tonal duration, raw duration values are normalised using the following formula (Huang et al., 2016):

Dnorm = (D / Dmean) * 100.

In this formula, the normalisation parameter Dmean represents the mean raw duration, estimated as the average duration of all tokens over all tones from a given speaker. D is the duration observed for a given tone, while its corresponding normalised value Dnorm is expressed as a percentage of the speaker’s average duration over all their tonal productions. For example, a normalised duration above 100% means that the tone under consideration is longer than the speaker’s average. As with F0, each raw duration value extracted for an individual speaker was transformed into its corresponding normalised value, forming a new dataset to be further plotted and statistically tested with respect to specific research questions, including:

● How are individual citation tones realised in terms of normalised duration from 21 speakers? (Chapter 5)

● How are individual phrase-initial tones realised across their following tones in terms of normalised duration from 21 speakers? (Chapter 7)

● How are individual phrase-final tones realised across their preceding tones in terms of normalised duration from 21 speakers? (Chapter 8)
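The duration normalisation described in this section, in which each duration is expressed as a percentage of the speaker's mean duration, can be sketched in Python as follows. This is an illustrative sketch only, not the study's R code; the function name and the millisecond values are invented for the example.

```python
from statistics import mean


def normalise_duration(durations_ms):
    """Express each duration as a percentage of the speaker's mean duration.

    durations_ms holds one speaker's raw durations (assumed here to be in
    milliseconds), pooled over all tokens of all tones.
    """
    d_mean = mean(durations_ms)  # Dmean in the formula
    return [d * 100 / d_mean for d in durations_ms]


# Toy data (invented): three tone tokens from a single speaker.
durations = [180.0, 200.0, 220.0]
d_norm = normalise_duration(durations)

for raw, pct in zip(durations, d_norm):
    label = "longer than" if pct > 100 else "not longer than"
    print(f"{raw:.0f} ms -> {pct:.1f}% ({label} the speaker's average)")
```

Because every value is divided by the same speaker-specific mean, the relative ordering of a speaker's tone durations is preserved while differences in overall speaking tempo between speakers are factored out.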

2.4.2.3. Normalisation evaluation

The efficiency of a normalisation process, as proposed by Earle (1975), can be quantified by the normalisation index (NI), which represents the ratio of the dispersion coefficients of the normalised and unnormalised acoustic data (Rose, 1986, 1987). This index reflects what proportion of the variance in the unnormalised acoustic data that is caused by physical differences in the mass and length of individual vocal folds has been removed. The higher the NI value, the greater the degree of speaker-dependent variation that is abstracted away, and the clearer the linguistic-phonetic content of the signals that is obtained (Huang et al., 2016).
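The idea of the normalisation index can be illustrated with a rough Python sketch. The text above defines NI only as a ratio of dispersion coefficients and does not spell out how a dispersion coefficient is computed, so the sketch makes a loud assumption: the dispersion coefficient is taken to be the between-tone variance divided by the mean between-speaker variance within each tone. The function names, data layout and F0 values are likewise hypothetical, and the cited sources should be consulted for the exact definition.

```python
from statistics import mean, pvariance, stdev


def dispersion_coefficient(data):
    """ASSUMED definition: between-tone variance / mean between-speaker variance.

    data maps speaker -> {tone: value}, with one value per tone per speaker.
    """
    speakers = list(data)
    tones = list(data[speakers[0]])
    tone_means = [mean([data[sp][t] for sp in speakers]) for t in tones]
    between_tone = pvariance(tone_means)
    between_speaker = mean(
        [pvariance([data[sp][t] for sp in speakers]) for t in tones]
    )
    if between_speaker == 0:
        return float("inf")  # no speaker-dependent variation left
    return between_tone / between_speaker


def zscore_by_speaker(data):
    """Z-score each speaker's values with that speaker's own mean and SD."""
    out = {}
    for sp, tone_values in data.items():
        m = mean(tone_values.values())
        s = stdev(tone_values.values())  # assumption: sample SD
        out[sp] = {t: (x - m) / s for t, x in tone_values.items()}
    return out


# Toy mean F0 values in Hz (invented) for three tones and two speakers.
raw = {
    "speaker_A": {"T1": 150.0, "T2": 120.0, "T3": 100.0},
    "speaker_B": {"T1": 260.0, "T2": 210.0, "T3": 190.0},
}
normalised = zscore_by_speaker(raw)

# NI: dispersion coefficient of normalised data over that of raw data.
ni = dispersion_coefficient(normalised) / dispersion_coefficient(raw)
print(f"NI = {ni:.1f}")
```

Under this assumed definition, an NI above 1 indicates that normalisation has shrunk the between-speaker spread relative to the between-tone spread, in line with the statement that higher NI values reflect more speaker-dependent variation being removed.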

The derived normalised data were then plotted to present the patterns in a visible, precise, and generalisable way. The data were also used for statistical testing of the assumptions based on both auditory and acoustic observation. The data plotting and statistical testing were conducted using a variety of R scripts, originally provided by Prof. Phil Rose and adapted by my colleague Siva Kalyan for use with the Zhangzhou data.

In document Tones in Zhangzhou: Pitch and Beyond (Page 69-72)