Time-aligned moving average - A Study of Accomodation of Prosodic and Temporal Features in Spok

The TAMA method utilizes a sequence of contemporaneous fixed-duration frames in which an average value of each a/p feature is calculated. The frames may overlap, making the process similar to a moving average filter, hence the name of the method. The sequence is initiated at the start of the dialogue (time instant zero), and there are two main variables: the frame length and the time step. The frame length is the duration of each frame, while the time step defines the degree of overlap and the total number of frames. The degree of overlap, as a percentage of the frame length, is given by Equation 7.1.

The overlap expresses the proportion of a frame that is overlapped by an adjacent frame. Thus a frame length of 20 seconds combined with a time step of 10 seconds yields 50% overlap: the second half of each frame is the first half of the next frame. The total number of frames is given by Equation 7.2 (“\” denotes an integer division):

7.3.1 Frame average calculation

The average a/p feature value of a frame is calculated over the speech intervals found in that frame as shown in Figure 7.1 below. The speech intervals have previously been annotated and a/p features for each interval have been extracted using Praat software (see sections 6.5.2 and 6.5.3).

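Equation 7.1: Proportion of frame overlap

$$\mathrm{Overlap} = 100 \times \frac{\mathrm{FrameLength} - \mathrm{TimeStep}}{\mathrm{FrameLength}}$$

Equation 7.2: Calculation of total number of frames

$$\mathrm{NumberOfFrames} = (\mathrm{DialogueDuration} \mathbin{\backslash} \mathrm{TimeStep}) + 1$$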
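As a quick check of Equations 7.1 and 7.2, the two quantities can be computed as in the minimal sketch below. The function names are illustrative and do not come from the original work; durations are assumed to be in seconds.

```python
def overlap_percent(frame_length: float, time_step: float) -> float:
    """Equation 7.1: overlap as a percentage of the frame length."""
    return 100.0 * (frame_length - time_step) / frame_length


def number_of_frames(dialogue_duration: float, time_step: float) -> int:
    """Equation 7.2: integer division of the duration by the time step, plus one."""
    return int(dialogue_duration // time_step) + 1


# A 20-second frame with a 10-second time step overlaps its neighbour by half.
print(overlap_percent(20.0, 10.0))
# A 10-minute dialogue with a 10-second time step.
print(number_of_frames(600.0, 10.0))
```

Note that the number of frames depends only on the time step, so the last frames may extend past the end of the dialogue and contain little or no speech.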

Figure 7.1: Schematic of calculation of TAMA frame average of an a/p feature (legend: speech interval, clipped-off part, pause, frame boundary; one track each for Speaker A and Speaker B, with the duration d_i and feature value f_i of an interval marked)

Let fi denote the feature value for speech interval i. The overall mean value of the feature for the entire frame, μframe, is given as a weighted mean, where the interval durations, di, are the weights and N is the total number of speech intervals in the frame:

Equation 7.3: Frame average calculation

The weights di can be normalized by dividing each by their total, i.e. wi = di / Σdi, so that Σwi = 1, in which case the standard error is given by:

Equation 7.4: standard deviation for weighted mean with normalized weights

where σi is the standard deviation of feature fi in interval i.
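The standard error with normalized weights can be sketched as follows. This assumes the intervals are statistically independent, so that the weighted variances add and the standard error is the square root of their sum; the function and parameter names are illustrative only.

```python
import math
from typing import Sequence


def weighted_standard_error(durations: Sequence[float],
                            sigmas: Sequence[float]) -> float:
    """Standard error of a duration-weighted mean.

    durations: interval durations d_i inside the frame (the raw weights)
    sigmas:    standard deviation of the feature within each interval
    Assumes independent intervals (an assumption of this sketch).
    """
    total = sum(durations)
    weights = [d / total for d in durations]  # normalized so they sum to 1
    return math.sqrt(sum((w * s) ** 2 for w, s in zip(weights, sigmas)))


# Two equally long intervals, each with a standard deviation of 2.
print(weighted_standard_error([1.0, 1.0], [2.0, 2.0]))
```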

The weighting ensures that longer speech intervals have a proportionally higher contribution to the frame average than shorter intervals. The latter are characterized by large variations in their prosodic characteristics: back-channeling expressions often have very low pitch/intensity, while short exclamations have very high pitch/intensity. Since these short intervals are very frequent in spontaneous speech, the averaging would be biased in frames with such intervals. Alternatively, one could concatenate all speech intervals in a given frame and calculate the average feature for the concatenated sound, which leads to the same result: the grand mean of two populations is equal to the mean of the individual means weighted by the population sizes. In this case, the “populations” are the speech intervals, and the “sizes” are the interval durations.
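The equivalence with concatenation can be written out explicitly. The sketch below assumes a feature that is itself a mean over samples (for example mean pitch) taken at a constant sampling rate r, so that interval i contributes n_i = r·d_i samples x_{ij}; the symbols x_{ij}, n_i and r are introduced here for illustration only. The argument does not carry over to features that are not sample means, such as pitch range.

$$\mu_{concat} = \frac{\sum_{i=1}^{N} \sum_{j=1}^{n_i} x_{ij}}{\sum_{i=1}^{N} n_i} = \frac{\sum_{i=1}^{N} n_i f_i}{\sum_{i=1}^{N} n_i} = \frac{\sum_{i=1}^{N} r\, d_i f_i}{\sum_{i=1}^{N} r\, d_i} = \frac{\sum_{i=1}^{N} f_i \cdot d_i}{\sum_{i=1}^{N} d_i}$$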

As shown in Figure 7.1, speech intervals may cross frame boundaries. In this case, the duration of the part of the speech interval that lies inside the frame is used as the weight in the calculation. This can be thought of as trimming the intervals: the “clipped-off” parts of the speech intervals do not contribute to the frame average. This does not involve a re-calculation of the a/p feature value for the remaining part: the a/p value for the whole interval is used in the calculation and only the duration is affected.
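The weighted averaging with clipped durations described above can be sketched as follows. The `(start, end, value)` tuple layout and the function name are assumptions made for illustration; the original work obtains the per-interval values from Praat annotations.

```python
from typing import Iterable, Optional, Tuple

# (start time, end time, feature value) of one annotated speech interval.
# This layout is hypothetical and chosen only for the example.
Interval = Tuple[float, float, float]


def frame_average(intervals: Iterable[Interval],
                  frame_start: float,
                  frame_length: float) -> Optional[float]:
    """Duration-weighted mean of a feature over one frame.

    Intervals crossing a frame boundary are trimmed: only the part inside
    the frame counts towards the weight, while the feature value of the
    whole interval is used unchanged.
    """
    frame_end = frame_start + frame_length
    weighted_sum = 0.0
    total_duration = 0.0
    for start, end, value in intervals:
        clipped = min(end, frame_end) - max(start, frame_start)
        if clipped > 0:  # the interval lies at least partly inside the frame
            weighted_sum += value * clipped
            total_duration += clipped
    if total_duration == 0:
        return None  # the speaker is silent throughout this frame
    return weighted_sum / total_duration


speech = [(0.0, 4.0, 100.0), (8.0, 12.0, 200.0)]
# The second interval is clipped to 2 seconds by the frame boundary at 10 s.
print(frame_average(speech, frame_start=0.0, frame_length=10.0))
```

Returning `None` for an empty frame is a design choice of this sketch; it keeps silent frames distinguishable from frames with a genuine average.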



$$\mu_{\mathrm{frame}} = \frac{\sum_{i=1}^{N} f_i \cdot d_i}{\sum_{i=1}^{N} d_i} \tag{7.3}$$

S.E.=∑

i=1 N

w

_i2



_i2

7.3.2 TAMA plots

The result of the process described in the previous section is two series (one for each speaker) of contemporaneous frame averages of a/p features, which can directly undergo bi-variate time series analysis. In order to fully satisfy the specification of section 7.2, the frame averages are normalized by dividing them by the overall dialogue mean value, μ, of each speaker. This is again calculated using Equation 7.3, considering the entire dialogue as a single frame. An example TAMA plot is shown in Figure 7.2 below.
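Producing one speaker's normalized series can be sketched as below, under the same assumptions as before: a hypothetical `(start, end, value)` tuple per speech interval and illustrative function names. Running it once per speaker gives the two contemporaneous series.

```python
from typing import List, Optional, Sequence, Tuple

Interval = Tuple[float, float, float]  # (start, end, feature value); hypothetical


def weighted_mean(intervals: Sequence[Interval],
                  lo: float, hi: float) -> Optional[float]:
    """Duration-weighted mean over the window [lo, hi], with clipping."""
    num = den = 0.0
    for start, end, value in intervals:
        clipped = min(end, hi) - max(start, lo)
        if clipped > 0:
            num += value * clipped
            den += clipped
    return num / den if den > 0 else None


def tama_series(intervals: Sequence[Interval], dialogue_duration: float,
                frame_length: float, time_step: float) -> List[Optional[float]]:
    """Normalized frame averages for one speaker (None for silent frames)."""
    # Overall dialogue mean: the whole dialogue treated as a single frame.
    overall = weighted_mean(intervals, 0.0, dialogue_duration)
    n_frames = int(dialogue_duration // time_step) + 1
    series: List[Optional[float]] = []
    for k in range(n_frames):
        start = k * time_step  # frames are aligned to time instant zero
        mean = weighted_mean(intervals, start, start + frame_length)
        series.append(None if mean is None or overall is None else mean / overall)
    return series


speaker_a = [(0.0, 10.0, 100.0), (10.0, 20.0, 300.0)]
print(tama_series(speaker_a, dialogue_duration=20.0, frame_length=10.0, time_step=10.0))
```

After normalization, a value of 1 means that the frame average equals the speaker's overall dialogue mean, which makes speakers with different baseline values comparable on one plot.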

The TAMA method can be thought of as an expansion of the “half-split” idea (see sections 4.6 and 7.2). Instead of being split in two, the dialogue is divided into several shorter frames. The disadvantage in this case, as was mentioned in section 4.6, is that, because each frame contains fewer utterances, the frame averages tend to be biased by local phenomena, as different utterance types have different prosodic properties. Interrogative statements, for example, have rising intonation, as opposed to declarative statements, which have falling intonation. Thus, there is a trade-off between robustness (longer frames) and resolution (shorter frames). The introduction of overlap, similarly to a moving average filter, has a smoothing effect, highlighting slower-moving (low-frequency) patterns of prosodic variation over the abrupt (high-frequency) changes in prosody that often occur in spontaneous speech.

Figure 7.2: Normalized average pitch of two male speakers measured over 30 second frames with 33% overlap (part of dialogue shown)

In addition, the use of frames, rather than utterances or turns, as units resolves the issue of synchronous analysis without the need for assumptions about turn allocation to a speaker or for marking turn-exchange instants, which is difficult to do in spontaneous speech (Campbell 2009). Instead, a/p feature values are collected by accumulation over an arbitrarily defined frame, regardless of the specific linguistic detail during that time. Some information is lost, such as the time instants at which vocalization is initiated or terminated by either speaker. Thus, it is possible that each speaker dominates a different portion of the frame, so that the frame average similarity shown in Figure 7.2 is not indicative of a strictly synchronous similarity in a/p features.

However, speakers in general do not speak contemporaneously most of the time (despite significant occurrences of overlapping speech). In addition, the temporal order of vocalization among speakers is significant when accommodation is considered as a result of dialogue structure, rather than an underlying behaviour. In naturally occurring human speech, vocalizations can be anticipated before they actually occur, thus accommodation does not necessarily depend on the immediately preceding utterance or turn. A TAMA frame captures a local portion of the dialogue, and both speakers' contributions during that time are considered as equal in terms of causality. This alleviates the need to define “speaker turns”.

Information on each speaker's contribution during a frame is given by Σdi which, if divided by the frame length, yields a relative duration:

Equation 7.5: Calculation of relative duration

$$\mathrm{RelativeDuration} = \frac{\sum_{i=1}^{N} d_i}{\mathrm{FrameLength}}$$

The relative duration has a value between 0 (no contribution) and 1 (entire frame covered by one speech interval of that speaker), and can be used as a confidence score for the a/p value obtained for that frame and speaker: if a speaker's relative duration is low, as a result of minimal contribution, such as a single one-syllable back-channeling utterance, it is possible to obtain extremely high or low values for some features. The threshold below which a value is considered unreliable depends on the frame length, as longer frames reduce the variance more than shorter ones. In such cases, points can be removed and replaced by either the overall mean or a linearly interpolated value. Interpolation is justified in this case as each point represents an entire frame rather than a single utterance and thus a linear model can be fitted locally for frame averages (if the a/p feature can be assumed to have a normal distribution, see section 7.4.1).
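Using the relative duration as a confidence score might look like the sketch below. The threshold value, the function names and the fallback to the nearest reliable neighbour at the edges of the series are assumptions of this example, not details given in the text. The speech intervals of a single speaker are assumed not to overlap one another.

```python
from typing import List, Optional, Sequence, Tuple


def relative_duration(intervals: Sequence[Tuple[float, float]],
                      frame_start: float, frame_length: float) -> float:
    """Fraction of the frame covered by one speaker's speech intervals."""
    frame_end = frame_start + frame_length
    covered = sum(max(0.0, min(end, frame_end) - max(start, frame_start))
                  for start, end in intervals)
    return covered / frame_length


def replace_low_confidence(values: Sequence[Optional[float]],
                           rel_durations: Sequence[float],
                           threshold: float) -> List[Optional[float]]:
    """Replace frame averages whose relative duration is below `threshold`
    by linear interpolation between the nearest reliable neighbours."""
    reliable = [i for i, r in enumerate(rel_durations) if r >= threshold]
    result = list(values)
    if not reliable:
        return result  # nothing to interpolate from
    for i, r in enumerate(rel_durations):
        if r >= threshold:
            continue
        left = max((k for k in reliable if k < i), default=None)
        right = min((k for k in reliable if k > i), default=None)
        if left is None:
            result[i] = values[right]   # edge of the series: copy the neighbour
        elif right is None:
            result[i] = values[left]
        else:
            t = (i - left) / (right - left)
            result[i] = values[left] + t * (values[right] - values[left])
    return result


print(relative_duration([(0.0, 4.0), (8.0, 12.0)], 0.0, 10.0))
# The middle frame has too little speech, so its value is interpolated.
print(replace_low_confidence([1.0, 5.0, 2.0], [0.5, 0.05, 0.5], threshold=0.1))
```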

In a preliminary study based on three 30-minute long unconstrained dialogues (Kousidis et al. 2008), accommodation was evaluated by visual inspection of the plots for all four a/p features studied (pitch, pitch range, intensity, speech rate). The overall picture was that the two speakers were consistently following each other's prosodic variations over progressively longer time frames (20, 30 and 60 seconds), in all three dialogues. Some dialogue portions, such as the approximately 8 minute-long extract shown in Figure 7.2, showed accurate “tracking” between the two speakers. Several instances of deviation from this behaviour were also found. A careful inspection of these frames showed that the deviation could be attributed to specific causes such as (a) non-standard speech style, such as laughing speech or extreme expressions of enthusiasm (e.g. “wow”), or (b) inaccurate measurements due to low relative duration. While (b) can be dealt with by increasing the frame length, with the consequences discussed in the previous paragraph, (a) is a natural occurrence in human dialogues and should not be considered an error. This means that speakers are not obliged to converge (accommodate) in their a/p features; rather, they do so spontaneously most of the time.

The results in (Kousidis et al. 2008) showed that the TAMA method can capture accommodation of a/p features in spoken dialogues, in a continuous representation. In order to formally evaluate this, a statistical validation was sought, as described in the next section.

In document A Study of Accomodation of Prosodic and Temporal Features in Spoken Dialogues in View of Speech Technology Applications (Page 124-129)