Corpus annotation and feature extraction - A Study of Accomodation of Prosodic and Temporal Fea

This section describes the annotation and feature extraction procedure followed in the analysis. There are three distinct steps in this procedure: (a) segmentation of the continuous recording into speech/silence, (b) annotation of non-silent segments with suitable labels and (c) feature extraction from the annotated segments. These three separate procedures are described in sections 6.5.1, 6.5.2, and 6.5.3 respectively.

6.5.1 Silence/non-silence segmentation

The process of segmentation of a continuous audio stream into speech/silence segments is termed chronography (Lennes and Anttila 2002). The result of this process is typically a representation of the form shown in Figure 6.5, in which black and white areas denote speech activity and silence respectively (Lennes and Anttila 2002; Campbell 2009).

Segmentation can be performed either manually or automatically. In manual segmentation, a human annotator listens to the audio stream and demarcates the speech/silence areas one by one. This method produces adequately precise segmentation (±10 ms) and, in addition, can be combined with annotation of non-silent, non-speech areas. The latter step is explained in section 6.5.2. The disadvantage of manual segmentation is that it is a repetitive and tedious process, which makes it costly and inefficient, especially for large corpora.

Alternatively, automatic segmentation can be achieved by means of a voice activity detection (VAD) algorithm. A simple implementation of such an algorithm is segmentation based on intensity and duration thresholds. As a first step, the audio stream is divided into frames, the length of which defines the “resolution” of the algorithm (e.g. ~10 ms). The intensity is calculated for each frame based on Equation 2.1. Depending on whether its intensity falls below or above the intensity threshold, a frame is characterized as silent or non-silent. Adjacent silent (or non-silent) frames are then joined together into silent (or non-silent) segments. This yields numerous segments that are shorter than the minimum duration thresholds (which can differ for silent and non-silent intervals). As a last step, these short segments are “erased” and the neighbouring segments are joined. This has to be performed both for silent and for non-silent intervals (in either order).
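The thresholding steps described above can be illustrated with a minimal Python sketch. It is not the Praat script or built-in command used in this work. The function names, the default threshold values and the mean-square intensity in dB (standing in for Equation 2.1, which is not reproduced in this section) are assumptions made purely for illustration.

```python
import math


def frame_intensity_db(frame, eps=1e-12):
    """Mean-square energy of a frame in dB (assumed stand-in for Equation 2.1)."""
    mean_square = sum(x * x for x in frame) / len(frame)
    return 10.0 * math.log10(mean_square + eps)


def label_frames(signal, frame_len, threshold_db):
    """Mark each full frame as non-silent (True) or silent (False)."""
    return [
        frame_intensity_db(signal[i:i + frame_len]) >= threshold_db
        for i in range(0, len(signal) - frame_len + 1, frame_len)
    ]


def join_frames(flags):
    """Join adjacent frames with the same label into [start, end, flag] runs."""
    runs = []
    for i, flag in enumerate(flags):
        if runs and runs[-1][2] == flag:
            runs[-1][1] = i + 1
        else:
            runs.append([i, i + 1, flag])
    return runs


def erase_short(runs, flag, min_frames):
    """Flip runs of the given type that are too short, then re-join neighbours."""
    out = []
    for start, end, f in runs:
        if f == flag and end - start < min_frames:
            f = not f
        if out and out[-1][2] == f:
            out[-1][1] = end
        else:
            out.append([start, end, f])
    return out


def segment(signal, rate, frame_s=0.01, threshold_db=-35.0,
            min_silence_s=0.1, min_sound_s=0.05):
    """Return (start_s, end_s, label) segments; all defaults are hypothetical."""
    frame_len = max(1, round(rate * frame_s))
    runs = join_frames(label_frames(signal, frame_len, threshold_db))
    runs = erase_short(runs, False, round(min_silence_s / frame_s))  # short silences
    runs = erase_short(runs, True, round(min_sound_s / frame_s))     # short sounds
    step = frame_len / rate
    return [(s * step, e * step, "sound" if f else "silence") for s, e, f in runs]
```

In this sketch a short gap inside an utterance is absorbed into the surrounding sound, and a brief click surrounded by silence is absorbed into the silence. A single “flat” threshold is used for the whole signal, as in the method described above.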

The above algorithm was implemented in the speech analysis software Praat (Boersma and Weenink 2009), originally using a Praat script [17] which is available on-line [18], and subsequently using a built-in command that was included in later versions of Praat (see appendix C).

The resulting segmentation using the automatic method typically contains errors. Areas that are silent may be annotated as speech and vice versa. This occurs because a “flat” intensity threshold cannot capture the possible variations in voice intensity throughout an entire dialogue. A high threshold “misses” utterances spoken much less loudly than average, while a low threshold captures too much extraneous noise, such as the air stream from the mouth and nostrils when a subject is not speaking. A reasonable trade-off can be found by manually adjusting the threshold value, but this cannot overcome all the problems. For example, stop-consonant (/p/ /k/ /t/) closures are typically cut off from the speech segment and annotated as silence. Thus, manual corrections are again required for an adequately precise segmentation to be obtained. The resulting method, which was used for segmentation of all the dialogues in the corpora used in this thesis, is a semi-automatic process: automatic segmentation using the built-in Praat command, followed by manual correction of the output segments. An example segmentation using Praat and Mietta Lennes's script is shown in Figure 6.6 (silences marked by “xxx”).

[Figure 6.5: Chronographic representation of dialogue (two speakers A, B). Legend: speech, silence. Horizontal axis: time (sec).]

[17] Praat software operates as a shell where objects such as sounds can be queried or modified by means of commands. A series of commands can be executed as a shell script, also known as a Praat script.

[18] http://www.helsinki.fi/~lennes/praat-scripts/public/mark_pauses.praat (01/04/2010)

6.5.2 Annotation

This section describes the corpus annotation procedure followed in the work described in this dissertation. The output of the automatic segmentation process is a “textgrid” Praat object. This type of object is a time-line with marked boundaries, which define “intervals” (or segments). The time-line is shared between the sound object and the textgrid object, so that the boundaries mark the silent and non-silent intervals of the sound, as shown in Figure 6.6. During the manual correction step that was described in section 6.5.1, the intervals are labeled for content according to the simple annotation schema shown in Table 6.3 below.

Label   Description
s       Speech interval
p       Silent interval
l       Laughter
b       Breathing noise
n       Other non-speech noise

Table 6.3: Labels for annotation of textgrid intervals

The speech intervals, marked “s”, denote any type of vocal activity by the speaker. This means that nonsense words, such as “uhm” and “err”, and filled pauses are considered as speech. This is justified from the point of view of further analysis. These utterances were observed to be prosodically similar to actual words (in the linguistic sense) and are thus further analyzed for prosodic features. Nonsense words, for example, frequently appear as back-channeling expressions in the corpus (both task-based and unconstrained). By comparison to “proper” lexical elements used as backchannels, such as “yes”, it was found that these nonsense words serve the same purpose (acknowledgment of understanding/continuing attention) and exhibit similar prosodic structure. Although nonsense words are not dictionary words, they are equivalent to dictionary words in all other respects: function, vocalization and prosodic structure. Filled pauses, which typically take the form of elongated vowels, are also classified as speech, on the same premise as before: they represent vocal activity by the speaker and are prosodically similar to well-formed utterances, in terms of average pitch, intensity and pitch range. Therefore, it was decided that these should be treated as speech for the purpose of prosodic analysis. By extension, any prosodically “speech-like” interval uttered by the speakers was classified as speech, regardless of timing or function in the dialogue.

In contrast to the above rule, occurrences of laughter, marked “l”, were not classified as speech and were not prosodically analyzed. Laughter was common in all recorded dialogues. From a prosodic point of view, laughter is characterized by short repetitive bursts of high pitch and intensity, a pattern largely different from that of speech, which exhibits smoother pitch and intensity contours. In addition, pitch and intensity peaks fall outside their normal range during laughter. As these values introduce bias to the acoustic/prosodic analysis, it was decided to exclude them. Importantly, this exclusion applied only to instances of pure laughter, and not to instances of “laughing” speech, which is audible speech uttered by a speaker who is laughing at the same time. The purpose of the distinct label is that laughs are still considered as “contributions” of the speaker, for the purposes of temporal analysis.

Similarly, the “b” and “n” labels denote breathing and other non-speech noises respectively. Breaths are quite common at the beginning of utterances and are often loud enough to be captured by the intensity-threshold algorithm. Due to their high intensity and non-voiced nature, breaths introduce bias to prosodic analysis and thus had to be located and labeled appropriately. As in the case of laughter, breaths were considered important for the purpose of temporal analysis. A long inhaling sound before an utterance may signal the intention to speak, and is therefore considered as a contribution by the speaker. The “n” label groups together all other unvoiced, non-speech sounds (coughing, nasal inhalation, lip-smacking etc.).

Silent intervals were annotated as pauses, marked “p”, and contain silence but also certain types of extraneous noise. This noise includes accidental knocks on the microphone stand or other surfaces that are picked up by the intensity-threshold algorithm. Such noises are not considered part of the interaction, and are thus not given a label of their own. Instead, any interval that was automatically marked as non-silent because of extraneous noise was manually annotated as silent. This is significant mainly for the purposes of temporal analysis, as these noises are relatively infrequent and thus do not introduce bias in the prosodic analysis.
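The way the labels of Table 6.3 separate prosodic from temporal analysis can be summarised in a short Python sketch. The class and function names are hypothetical and do not correspond to the Praat scripts of appendix C. The text states that laughter and breaths count as contributions for temporal analysis but is silent on the “n” label, so leaving “n” out of the contribution set below is an assumption.

```python
from dataclasses import dataclass


@dataclass
class Interval:
    start: float  # seconds
    end: float    # seconds
    label: str    # one of the Table 6.3 labels: s, p, l, b, n

    @property
    def duration(self):
        return self.end - self.start


# Only speech intervals are analyzed for prosodic features.
PROSODIC_LABELS = {"s"}
# Speech, laughter and breaths count as speaker contributions for temporal
# analysis. Assumption: "n" is excluded, since the text does not say.
CONTRIBUTION_LABELS = {"s", "l", "b"}


def prosodic_intervals(tier):
    """Intervals passed on to prosodic feature extraction."""
    return [iv for iv in tier if iv.label in PROSODIC_LABELS]


def contribution_time(tier):
    """Total duration (s) of intervals treated as speaker contributions."""
    return sum(iv.duration for iv in tier if iv.label in CONTRIBUTION_LABELS)
```

For a tier consisting of a breath, a speech interval, a pause and a laugh, only the speech interval would be analyzed prosodically, while the breath, the speech interval and the laugh would all add to the contribution time.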

6.5.3 Feature extraction

Following segmentation and annotation of the audio files, feature extraction was carried out using the Praat software. The various steps described in this section were implemented as a collection of Praat scripts which can be found in appendix C.

As described in the previous section, the audio files were semi-automatically segmented and annotated with the labels shown in Table 6.3. Prosodic features were extracted using built-in Praat algorithms (Boersma and Weenink 2009) from intervals marked with the “s” label, henceforth termed speech intervals. The features measured on each speech interval were as follows:

(a) Fundamental frequency (F0), or pitch [19], was measured (in Hz) using the built-in Praat function that is based on the autocorrelation method (Boersma 1993). For each speech segment, the built-in function computes a pitch contour. Querying the pitch contour in the Praat environment yields a minimum, a maximum and an average value (arithmetic mean). The minimum and maximum were initially used to calculate pitch range. However, this method of pitch range calculation proved too error-prone, because the algorithm occasionally produces erroneous pitch values, such as octave jumps or pitch values calculated for non-voiced regions. Pitch range was therefore calculated as 2*std, i.e. twice the standard deviation of pitch, which can also be obtained by querying the pitch contour.

(b) Intensity was measured (in dB) using the built-in Praat function that is based on Equation 2.1. For each speech segment, the built-in function computes an intensity contour. Querying the intensity contour yields a minimum, a maximum and an average value (arithmetic mean). However, the minimum and maximum were not used in further analysis. The built-in Praat function was used with the option “subtract mean” enabled. The purpose of this option is to subtract the “DC offset” introduced by audio recording equipment. Since the audio equipment used was of very high quality, with a signal-to-noise ratio greater than 90 dB, disabling the option yields a negligible difference in the computed intensity values.

(c) Speech rate was measured (in vowels/minute) by counting the number of detected vowels and dividing by the length of the speech segment. This method yields only an approximation of speech rate (Pellegrino et al. 2004). However, since the purpose was to compare the speech rate of two speakers, the approximation was deemed sufficient in order to assess inter-speaker accommodation of speech rate. The vowel detection method used is based on calculating the derivative of the intensity contour (Press et al. 1992) and detecting vowel onsets and offsets based on steep rises, falls and peaks (Cummins and Port 1998) in the intensity contour (located as maxima, minima and zero crossings on the derivative contour). This method was chosen for its computational robustness and low computational cost over other automatic vowel/syllable detection methods (see appendix B).
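The pitch range estimate in (a) and the speech rate measure in (c) can be illustrated with a simplified Python sketch. It is not the Praat implementation from appendix C. The function names and the slope threshold are hypothetical, the use of the sample standard deviation is an assumption, and the vowel counter is a reduced version of the derivative-based idea: it counts an intensity peak (a positive-to-negative zero crossing of the derivative) only if a steep rise preceded it, and does not locate onsets and offsets separately.

```python
import statistics


def pitch_range_minmax(f0_hz):
    """Pitch range as maximum minus minimum (sensitive to octave jumps)."""
    return max(f0_hz) - min(f0_hz)


def pitch_range_2std(f0_hz):
    """Pitch range as twice the standard deviation (needs >= 2 values)."""
    return 2.0 * statistics.stdev(f0_hz)


def count_vowels(intensity_db, dt, min_slope=300.0):
    """Count intensity peaks that follow a steep rise.

    intensity_db: intensity contour sampled every dt seconds.
    min_slope: hypothetical threshold (dB/s) for a rise to count as steep.
    """
    # Forward-difference approximation of the derivative of the contour.
    slope = [(b - a) / dt for a, b in zip(intensity_db, intensity_db[1:])]
    vowels = 0
    steep_rise_seen = False
    for prev, cur in zip(slope, slope[1:]):
        if prev >= min_slope:
            steep_rise_seen = True
        if prev > 0 and cur <= 0:  # derivative crosses zero: a peak
            if steep_rise_seen:
                vowels += 1
            steep_rise_seen = False
    return vowels


def speech_rate(n_vowels, duration_s):
    """Speech rate in vowels per minute for a segment lasting duration_s."""
    return 60.0 * n_vowels / duration_s
```

With a single octave-jump value among otherwise stable pitch values, the 2*std estimate is still inflated but is smaller than the max-min range, which is determined entirely by the erroneous value.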

Other features measured (using built-in Praat functions) were jitter, shimmer, harmonics-to-noise ratio and degree of voice breaks. These four features are measures of voice quality (see section 2.4). All of the aforementioned features were also measured on each vowel, in addition to the entire speech segment. The entire process was implemented as a collection of Praat scripts, which can be found in appendix C. Parts of these scripts were included in the development of LinguaTag [20] (Cullen 2008b), a multipurpose speech corpus annotation tool that allows for linguistic transcription, prosodic and emotional annotation of speech and stores the annotation data in XML format for portability.

The extracted feature data was stored in tab-delimited text files that replicate the table-like memory structure used in the scripts. These “table files” can be imported into other programs such as Microsoft Excel®, OpenOffice Calc and MATLAB®. The first two were used for visualization of the data (plots), and the last was used for the subsequent analysis, which is described in the next two chapters.
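As an illustration of the tab-delimited “table file” format, the following Python sketch writes and reads such a file with the standard csv module. The column names are hypothetical and are not those produced by the Praat scripts in appendix C.

```python
import csv

# Hypothetical column layout, one row per speech interval.
COLUMNS = ["speaker", "start_s", "end_s", "mean_f0_hz", "mean_intensity_db"]


def write_table(path, rows):
    """Write a list of dicts as a tab-delimited table with a header row."""
    with open(path, "w", newline="", encoding="utf-8") as handle:
        writer = csv.DictWriter(handle, fieldnames=COLUMNS, delimiter="\t")
        writer.writeheader()
        writer.writerows(rows)


def read_table(path):
    """Read a tab-delimited table back as a list of dicts (values are strings)."""
    with open(path, newline="", encoding="utf-8") as handle:
        return list(csv.DictReader(handle, delimiter="\t"))
```

A file written this way has one header line followed by one line per interval, and can be opened directly in a spreadsheet program or imported as tab-separated text.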

In document A Study of Accomodation of Prosodic and Temporal Features in Spoken Dialogues in View of Speech Technology Applications (Page 116-121)