

3.1.2 AMI vocalisation unit representation

The AMI corpus contains word-level annotations with timings. In order to resolve overlaps between two speakers and generate proper vocalisation events (VEs), there are two possible approaches. The first is to terminate a VE when an overlapping vocalisation begins, regardless of whether the current speaker has stopped; a vocalisation generated through this approach is represented as VE_t. The second approach is to continue a vocalisation until the current speaker finishes his/her turn (VE_c). In the AMI corpus, the average duration of a VE_t is 1.8s, while the average duration of a VE_c is 4.0s. VE_c is used in this study as the basic unit for classification.
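The merging step that produces VE_c units from word-level timings can be sketched as follows. The tuple layout and the 0.5s merge threshold are illustrative assumptions for this sketch, not the AMI annotation schema.

```python
# A minimal sketch of building VE_c units from word-level timings.
from dataclasses import dataclass

@dataclass
class VE:
    speaker: str
    start: float
    end: float

    @property
    def duration(self) -> float:
        return self.end - self.start

def build_ve_c(words, max_gap=0.5):
    """Merge consecutive same-speaker words into VE_c units.

    `words` is a list of (speaker, start, end) tuples; a new VE_c begins
    when the silence between two words of the same speaker exceeds
    `max_gap` seconds (an assumed threshold).
    """
    events = []
    current = None
    for spk, start, end in sorted(words, key=lambda w: (w[0], w[1])):
        if current is not None and current.speaker == spk \
                and start - current.end <= max_gap:
            current.end = end          # extend the ongoing vocalisation
        else:
            if current is not None:
                events.append(current)
            current = VE(spk, start, end)
    if current is not None:
        events.append(current)
    return events

# VE_t could then be derived by truncating each VE_c at the start time
# of the earliest overlapping vocalisation by another speaker.
```

Overlaps between speakers are left intact here, since a VE_c runs until its own speaker stops.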

Based on VE_c, I generate eight types of feature sets, and aim to test the effect of feature combination, especially of filled pauses (Section 3.1.2.2). If a vocal sound (e.g., “Um”, “Uh”) is regarded as a filled pause and is longer than 0.5s, the vocalisation containing this vocal sound is split into two new vocalisations, and the vocal sound becomes a filled-pause feature p_f for its preceding vocalisation. In this setting, there are three possible observations after a VE_c: an empty pause, a filled pause or an overlap. For simplicity, the empty pause, filled pause and overlap durations are referred to collectively as GAP features.

FP = (s, t, d, p_f, p_e, o)    (3.1)

FP_{VOC} = (s, t, d, d_{-n}, ..., d_{-1}, d_1, ..., d_n)    (3.2)

FP_{VOCP} = (s, t, d, p_f, p_e, o, d_{-n}, ..., d_{-1}, d_1, ..., d_n)    (3.3)

FP_{GAP} = (s, t, d, p_f, p_e, o, g_{-n}, ..., g_{-1}, g_1, ..., g_n)    (3.4)

If a filled pause is recognised as a proper separator of vocalisation events, four filled-pause (FP) based feature sets are generated, as shown in Equations (3.1) to (3.4). Equation (3.1) is a simple feature set containing the VE_c speaker s, start time t, duration d, following filled pause duration p_f, empty pause duration p_e and overlap duration o. For one VE_c, at most one of the filled pause, empty pause and overlap durations following it should be non-zero. If the next VE_c is immediately connected, all GAP features are zero.

In Equation (3.3), FP_{VOCP} contains the same features as FP plus the Vocalisation Horizon of VE_c durations, while in Equation (3.2), FP_{VOC} does not contain the GAP features p_f, p_e and o. FP_{VOC} is used to signify features of the vocalisation itself. In Equation (3.4), FP_{GAP} is analogous to FP_{VOCP}, with the Vocalisation Horizon replaced by the GAP Horizon. Using these four feature sets with the same classifier, I can easily compare the effects of GAP, GAP Horizon and Vocalisation Horizon.
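The composition of the four feature sets can be sketched as a simple assembly function. The function name and the list representation of the horizons are illustrative assumptions; `d_horizon` and `g_horizon` hold the 2n neighbour durations, zero-padded where neighbours are missing.

```python
# A sketch of assembling the four FP-based feature sets of
# Equations (3.1)-(3.4) for one VE_c.

def fp_feature_sets(s, t, d, p_f, p_e, o, d_horizon, g_horizon):
    """Return the four feature vectors for a single VE_c.

    d_horizon: [d_-n, ..., d_-1, d_1, ..., d_n] neighbour durations
    g_horizon: [g_-n, ..., g_-1, g_1, ..., g_n] neighbour gap durations
    """
    fp      = [s, t, d, p_f, p_e, o]              # Eq. (3.1)
    fp_voc  = [s, t, d] + d_horizon               # Eq. (3.2)
    fp_vocp = [s, t, d, p_f, p_e, o] + d_horizon  # Eq. (3.3)
    fp_gap  = [s, t, d, p_f, p_e, o] + g_horizon  # Eq. (3.4)
    return fp, fp_voc, fp_vocp, fp_gap
```

With n = 3 each horizon contributes six values, so FP has 6 components, FP_{VOC} has 9, and FP_{VOCP} and FP_{GAP} have 12 each.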

EP = (s, t, d_e, p_e, o)    (3.5)

EP_{VOC} = (s, t, d_e, d_{-n}, ..., d_{-1}, d_1, ..., d_n)    (3.6)

EP_{VOCP} = (s, t, d_e, p_e, o, d_{-n}, ..., d_{-1}, d_1, ..., d_n)    (3.7)

EP_{GAP} = (s, t, d_e, p_e, o, u_{-n}, ..., u_{-1}, u_1, ..., u_n)    (3.8)

Conversely, if a vocal sound is treated as part of a continuous vocalisation instead of a filled pause, a vocalisation will stop only at an empty pause and will not be split by vocal sounds.

Equations (3.5) to (3.8) show vocalisation features without filled pauses. In these equations, s is a unique identifier for a speaker, t is the start time of the current VE_c, d is its duration (d_e refers to the VE_c duration without filled pauses), p_f and p_e are the durations of the filled pause and empty pause, o is the negative value of the overlap duration between adjacent VE_c (negated in order to distinguish it from pause durations), d_i is the duration of the ith VE_c preceding (i < 0) or following (i > 0) the current VE_c, g_i is the duration of the filled pause, empty pause or overlap preceding or following the VE_c (for VE_c with filled pauses), u_i refers only to the empty pause or overlap preceding or following the VE_c (for VE_c without filled pauses), and n = 3 is the length of the context (or “horizon”) spanned by the VE, as explained in the following section.
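The sign convention for gaps can be made concrete with a small sketch. The function name and argument layout are illustrative assumptions; only the conventions themselves (mutually exclusive gap types, negative overlaps) come from the text above.

```python
# A sketch of deriving the single non-zero GAP value following a VE_c:
# positive durations for pauses, negative for overlaps.

def gap_after(cur_end, next_start, filled_pause_dur=0.0):
    """Return (p_f, p_e, o) for the gap between two consecutive VE_c."""
    if filled_pause_dur > 0.0:
        return (filled_pause_dur, 0.0, 0.0)   # filled pause p_f
    gap = next_start - cur_end
    if gap > 0.0:
        return (0.0, gap, 0.0)                # empty pause p_e
    if gap < 0.0:
        return (0.0, 0.0, gap)                # overlap o, kept negative
    return (0.0, 0.0, 0.0)                    # seamlessly connected
```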

3.1.2.1 Empty pauses and overlaps

Pauses can be characterised as empty pauses or filled pauses. An empty pause corresponds to a period of silence in the conversation. It signals the end of a vocalisation or a period of thinking. Beyond these, research shows that pauses have communicative functions, such as drawing attention from listeners. Esposito et al. [2007] indicated that pauses are used as a linguistic means for discourse segmentation: pauses are used by children and adults to mark clause and paragraph boundaries, and empty and filled pauses are more likely to coincide with boundaries at sentence and paragraph level, realised as silent intervals of varying length.

3.1.2.2 Filled pauses

Traditionally, filled pauses are treated as a sign of hesitation and delay. We would like to know how much such hesitations and delays relate to discourse structure. Swerts et al. [1996] analysed acoustic features and showed that filled pauses are more typical in the vicinity of major discourse boundaries. Furthermore, filled pauses at major discourse boundaries are both segmentally and prosodically distinct. Smith and Clark [1993] indicated that dialogue participants have many choices to signal their low confidence in answering questions, and a filled pause is a major option. Speakers also use filled pauses to signal that they want to ‘hold the floor’ [Stenstrom, 1990]. Filled pauses therefore deserve attention, and should be evaluated for topic boundary detection.

There are two types of filled pauses: “Um” and “Mm”, which have a nasal component, and “Uh”, which does not. Clark [1994] showed that in the London-Lund corpus, “Um” and “Mm” are mostly used to signal short interruptions, while “Uh” is used for more serious ones. The two types of filled pauses are analysed separately in this study.

In the AMI corpus, filled pauses are extracted from the annotations. “Um”, “Mm-hmm”, “Uh” and “Uh-huh” are treated exclusively as filled pauses. To be consistent with empty pause extraction, a filled pause is identified and extracted only when it is longer than 0.5 seconds. If a filled pause occurs in the middle of a vocalisation event, the event is recorded as two vocalisations with a non-switching filled pause in between.
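The splitting rule just described can be sketched as follows. The token-tuple layout is an illustrative assumption; the filled-pause vocabulary and the 0.5s threshold come from the text above.

```python
# A minimal sketch of the splitting rule: a filled-pause token longer
# than 0.5s splits the vocalisation event in two, with a non-switching
# filled pause in between.

FILLED = {"um", "mm-hmm", "uh", "uh-huh"}
MIN_FP = 0.5  # seconds

def split_on_filled_pauses(tokens):
    """tokens: [(word, start, end), ...] for one vocalisation event.

    Returns (vocalisations, filled_pauses), where each vocalisation is
    a list of tokens and each filled pause a (start, end) interval.
    """
    vocs, fps, current = [], [], []
    for word, start, end in tokens:
        if word.lower() in FILLED and end - start > MIN_FP:
            if current:
                vocs.append(current)     # close the current vocalisation
            fps.append((start, end))     # record the filled pause
            current = []
        else:
            current.append((word, start, end))
    if current:
        vocs.append(current)
    return vocs, fps
```

A short filled-pause token (0.5s or less) simply stays inside its vocalisation.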

3.1.2.3 Acoustic features

Acoustic features, including vocalisation speed, intensity, pitch, formants and MFCCs (Mel-frequency cepstral coefficients), are widely used in dialogue and speech analysis. Gaussian mixture models (GMMs) achieve reliable speaker segmentation results with MFCCs [Reynolds and Rose, 1995], and even with LSP (Line Spectrum Pairs) and pitch [Lu and Zhang, 2005]. I would like to incorporate acoustic features in content-free topic segmentation. Levow [2004] identified pitch and intensity features that signal segment boundaries in human-computer dialogue, finding that maximum pitch concentrates in segment-initial utterances. I use pitch as an example of an acoustic feature in topic segmentation research. Praat [Boersma and Weenink, 2009] extracts pitch in the range 75Hz to 600Hz at a 10ms sampling rate. I further adapt the pitch readings to the VE_c duration, that is, I use the mean pitch value during one VE_c as its pitch feature.
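The reduction of a 10ms-sampled pitch track to one mean-pitch feature per VE_c can be sketched as follows. Excluding unvoiced frames (pitch 0) from the mean is an assumption of this sketch, as is the list representation of the track.

```python
# A sketch of computing the mean pitch over one VE_c from a
# 10ms-sampled pitch track.

FRAME = 0.01  # 10 ms sampling rate

def mean_pitch(pitch_track, start, end):
    """pitch_track: pitch values in Hz, one per 10ms frame, with 0.0
    for unvoiced frames. Returns the mean pitch over [start, end)."""
    i0, i1 = round(start / FRAME), round(end / FRAME)
    voiced = [p for p in pitch_track[i0:i1] if p > 0.0]
    return sum(voiced) / len(voiced) if voiced else 0.0
```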

3.1.2.4 Vocalisation horizon

In most classification methods, a vocalisation event is treated as an independent sample. Since the instances are sequentially observed, it is desirable to include time-series information in the feature set. I postulate that features from the previous and following vocalisation events can influence the current vocalisation event. I attempt to capture this influence in two ways. The first is to use the durations of previous and following vocalisation events as features of the present event. We call these features the vocalisation horizon. The level of vocalisation horizon is the number of vocalisations represented as features on either side of the current vocalisation. For example, level 1 means that only the nearest vocalisation before and after the current vocalisation is used as a vocalisation horizon feature. The second strategy is to use the durations of adjacent pauses and overlaps as features. I call these features the pause horizon and overlap horizon (Figure 3.1). I assume that between any two consecutive vocalisation events there is either a pause or an overlap; when there is neither, the corresponding duration is labelled zero.

Figure 3.1: Schematic diagram of the Vocalisation Horizon, Pause Horizon and Overlap Horizon (Horizon = 3). Voc is the current vocalisation; V_y1 to V_y3 are the vocalisations after Voc, and V_z1 to V_z3 are the vocalisations before Voc. All 6 instances of vocalisations form the Vocalisation Horizon. In the Pause Layer and Overlap Layer, each instance labels the position of a possible pause or overlap. Between two consecutive vocalisations there is either a pause or an overlap, or neither. All 6 instances of pauses form the Pause Horizon, and all 6 instances of overlaps form the Overlap Horizon.
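The horizon construction can be sketched as a simple windowing function. The zero-padding at sequence boundaries follows the zero-labelling convention just described; the function name and list representation are illustrative.

```python
# A sketch of extracting a level-n vocalisation horizon: the durations
# of the n events before and after position i, zero-padded at the
# boundaries of the sequence.

def vocalisation_horizon(durations, i, n=3):
    """durations: chronological VE_c durations; returns
    [d_-n, ..., d_-1, d_1, ..., d_n] around index i."""
    before = [durations[j] if j >= 0 else 0.0
              for j in range(i - n, i)]
    after = [durations[j] if j < len(durations) else 0.0
             for j in range(i + 1, i + n + 1)]
    return before + after
```

The same windowing applies to the pause and overlap layers, yielding the pause horizon and overlap horizon.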