
8.3 Flexible dialogue representations

8.3.2 A practical example

To test the usefulness of the proposed representation, Kousidis et al. (2009b) carried out a preliminary analysis based on 5 dialogues recorded using the “shipwrecked” scenario experimental setup. In this study, the speakers' average pause length (APL) and overlap rate (OR) were investigated in relation to the turn share (TS) and joint active time (JAT) distributions. Since the proposed representation does not define turns for the speakers, pauses and overlaps were attributed to the speakers in an unambiguous manner (see Figure 8.2). A pause belongs to the speaker who initiates a vocalization immediately after the pause interval, regardless of who was speaking before the pause. For an overlap, the interval immediately before the overlap segment is considered, and the overlap is attributed to the speaker who is not speaking in that interval (and who thus initiates the overlap segment). No distinction is made between switch and non-switch pauses, or between interrupting and non-interrupting overlaps.
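The attribution rule described above can be sketched in a few lines of Python. This is only an illustrative sketch, not the implementation used in the study: the chronograph is assumed to be a dictionary mapping each of two speakers to a list of (onset, offset) vocalization intervals in seconds, and the names `segment_states` and `attribute` are hypothetical. Cases the text does not cover (e.g. both speakers starting at exactly the same instant) are left unattributed (`None`).

```python
from typing import Dict, FrozenSet, List, Optional, Tuple

Interval = Tuple[float, float]  # (onset, offset) of one vocalization, in seconds
Segment = Tuple[float, float, FrozenSet[str]]  # (start, end, active speakers)


def segment_states(chrono: Dict[str, List[Interval]]) -> List[Segment]:
    """Cut the timeline at every onset/offset and note who vocalizes in each piece."""
    bounds = sorted({t for ivs in chrono.values() for iv in ivs for t in iv})
    segments: List[Segment] = []
    for start, end in zip(bounds, bounds[1:]):
        mid = (start + end) / 2.0
        active = frozenset(
            spk for spk, ivs in chrono.items() if any(s <= mid < e for s, e in ivs)
        )
        if segments and segments[-1][2] == active:
            # Same state as the previous piece: merge the two.
            segments[-1] = (segments[-1][0], end, active)
        else:
            segments.append((start, end, active))
    return segments


def attribute(
    chrono: Dict[str, List[Interval]]
) -> List[Tuple[str, float, float, Optional[str]]]:
    """Return (kind, start, end, owner) for every pause and overlap segment."""
    segs = segment_states(chrono)
    events = []
    for i, (start, end, active) in enumerate(segs):
        before = segs[i - 1][2] if i > 0 else frozenset()
        after = segs[i + 1][2] if i + 1 < len(segs) else frozenset()
        if not active:
            # Pause: owned by the speaker who vocalizes right after it,
            # regardless of who was speaking before it.
            owner = next(iter(after)) if len(after) == 1 else None
            events.append(("pause", start, end, owner))
        elif len(active) == 2:
            # Overlap: owned by the speaker who was silent just before it.
            initiators = active - before
            owner = next(iter(initiators)) if len(initiators) == 1 else None
            events.append(("overlap", start, end, owner))
    return events


# Made-up chronograph, for illustration only.
example = {"A": [(0.0, 2.0), (3.0, 5.0)], "B": [(1.5, 2.5), (6.0, 7.0)]}
for event in attribute(example):
    print(event)
```

In this made-up example the overlap at 1.5 to 2.0 s is attributed to B (A was already speaking), the pause at 2.5 to 3.0 s to A, and the pause at 5.0 to 6.0 s to B.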

The results of the study in Kousidis et al. (2009b) are shown in Table 8.2. A frame length of 5 seconds with no overlap between frames was used. APL was found to be strongly and negatively correlated with JAT, which indicates that pauses are shorter when JAT is high. This is intuitive to an extent, as JAT is defined as the proportion of vocalization (total length minus the total duration of silence). However, a high JAT could also arise from fewer, and possibly longer, pauses; the correlation supports the hypothesis that the pauses are in fact shorter. OR is positively correlated with JAT, which indicates that overlaps are more frequent when JAT is high. Again, OR is positively correlated with TO but expresses the frequency of overlapping segments rather than their relative length (TO), so this finding confirms that overlaps are indeed more frequent when JAT is high. OR is also negatively correlated with TS, which indicates that speakers overlap their interlocutors more often when they have a smaller turn share (e.g. due to back-channeling).
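As an illustration of the framing step, the following Python sketch computes JAT and TS over non-overlapping 5-second frames. It rests on two assumptions that are not spelled out in this section: JAT is taken as the fraction of the frame in which at least one speaker vocalizes, and TS as each speaker's share of the summed vocalization time in the frame (so that the two shares add up to 1). The function names and the input layout are hypothetical.

```python
from typing import Dict, List, Tuple

Interval = Tuple[float, float]  # (onset, offset) in seconds


def speaking_time(intervals: List[Interval], lo: float, hi: float) -> float:
    """Vocalization time of one speaker inside the frame [lo, hi)."""
    return sum(max(0.0, min(e, hi) - max(s, lo)) for s, e in intervals)


def union_time(intervals: List[Interval], lo: float, hi: float) -> float:
    """Time inside [lo, hi) during which at least one interval is active."""
    total, covered_until = 0.0, lo
    for s, e in sorted(intervals):
        s, e = max(s, covered_until), min(e, hi)
        if e > s:
            total += e - s
            covered_until = e
    return total


def frame_features(chrono: Dict[str, List[Interval]],
                   total_duration: float,
                   frame_len: float = 5.0) -> List[dict]:
    """JAT and TS for consecutive, non-overlapping frames."""
    everyone = [iv for ivs in chrono.values() for iv in ivs]
    frames = []
    for k in range(int(total_duration // frame_len)):
        lo, hi = k * frame_len, (k + 1) * frame_len
        times = {spk: speaking_time(ivs, lo, hi) for spk, ivs in chrono.items()}
        spoken = sum(times.values())
        # Assumed definition: TS is the share of the summed vocalization time.
        ts = {spk: (t / spoken if spoken > 0 else 0.0) for spk, t in times.items()}
        # Assumed definition: JAT is the non-silent proportion of the frame.
        frames.append({"start": lo, "jat": union_time(everyone, lo, hi) / frame_len,
                       "ts": ts})
    return frames


# Made-up chronograph, for illustration only.
example = {"A": [(0.0, 2.0), (3.0, 5.0)], "B": [(1.5, 2.5), (6.0, 7.0)]}
for frame in frame_features(example, total_duration=10.0):
    print(frame)
```

Each pause or overlap can then be paired with the JAT and TS values of the frame it falls in before computing correlations.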

Dialogue  Speaker  TDD (sec)  APL-JAT  APL-ER  JAT-OR  OR-TS
1         F        428        -0.6     -       0.3     -
          M                   -0.5     -0.3    0.5     -0.3
2         M        490        -0.6     -0.3    0.4     -0.4
          M                   -0.7     -       0.5     -0.2
3         M        409        -0.6     -0.3    0.5     -
          F                   -0.4     -       0.6     -
4         F        516        -0.5     -0.5    0.4     -0.4
          F                   -0.7     -0.4    0.4     -0.3
5         M        363        -0.7     -       0.4     -0.4
          M                   -0.4     -0.3    -       -

Table 8.2: Correlation coefficients between APL, OR and JAT, TS and ER (significant at 95%, t-test with n-2 degrees of freedom, where n equals the length of the data). TDD: total dialogue duration.
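For reference, the significance test named in the caption is commonly computed from the correlation coefficient r as t = r·sqrt(n-2)/sqrt(1-r²), with n-2 degrees of freedom. The Python sketch below shows this standard formula with made-up numbers; it is not the analysis code of the study, and the critical value for the chosen significance level still has to be looked up in a t-table (or obtained from a statistics library) for the actual n.

```python
import math
from typing import Sequence


def pearson_r(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def t_statistic(r: float, n: int) -> float:
    """t value for testing r against zero, with n - 2 degrees of freedom."""
    return r * math.sqrt(n - 2) / math.sqrt(1.0 - r * r)


# Hypothetical values, for illustration only: r = -0.6 observed over n = 85 points.
t = t_statistic(-0.6, 85)
print(round(t, 2))  # compare |t| with the critical value for 83 degrees of freedom
```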

No correlation was found between APL and turn share (TS). However, APL is correlated with a related measure, hereafter called exchange rate (ER), defined as ER = 2·MIN(TS1, TS2). ER takes values between 0 and 1 and expresses the degree to which a frame is dominated by either speaker (0) or shared equally (1). The correlation between APL and ER is negative, which suggests that speakers shorten their pauses when the exchange rate is high, i.e. when the floor is shared more equally.
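The definition of ER translates directly into code. The short Python sketch below assumes that the two turn shares of a frame are given as fractions that sum to 1, which is what makes ER range from 0 to 1; the function name is hypothetical.

```python
def exchange_rate(ts1: float, ts2: float) -> float:
    """ER = 2 * min(TS1, TS2): 0 if one speaker dominates, 1 if the frame is shared equally."""
    if not (0.0 <= ts1 <= 1.0 and 0.0 <= ts2 <= 1.0):
        raise ValueError("turn shares are expected to lie between 0 and 1")
    return 2.0 * min(ts1, ts2)


print(exchange_rate(1.0, 0.0))  # frame dominated by one speaker
print(exchange_rate(0.8, 0.2))  # mostly one speaker
print(exchange_rate(0.5, 0.5))  # equally shared frame
```

The three calls give 0.0, 0.4 and 1.0 respectively.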

The results of Kousidis et al. (2009b) support the argument that the proposed representation of spontaneous dialogues can be useful in verifying the effect of factors such as JAT and ER on temporal features. One advantage of this representation is that it moves away from turn attribution and, consequently, from the shortcomings of defining turns solely from the chronograph of the dialogue. Clearly, meaningful turn segmentation can only be achieved by discourse analysis which, in the context of SDS, requires automatic speech recognition (ASR) and spoken language understanding (SLU) output. However, it is desirable for the interaction management component (which decides when the system can speak to the user and when the user's turn has ended) to operate independently of these components, due to their higher computational load and significant error rates in practice. For this reason, spoken dialogue systems have to rely on low-level information from the signal to manage turn-taking behaviour, namely the duration of turn-switch pauses and prosodic features such as final vowel lengthening. The approach presented here provides an alternative solution: the interaction management component can adapt to the ongoing session and adjust its thresholds and latencies according to JAT and ER. It would be naive to suggest that the methodology outlined here could replace the current methods of SDS design; rather, the proposed representation should at best be seen as a starting point towards more flexible representations of the dynamics of human (and human-computer) interaction, which in turn may improve the naturalness of SDS.

One argument against the representation presented here is that there is loss of information due to the averaging “sliding window” process. Indeed, the length of the applied frame determines the time resolution of the representation. But, as indicated by the example analysis presented in (Kousidis et al. 2009b), there is nothing preventing the use of the original chronograph in order to extract features and analyze them in combination with the proposed representation. The purpose of the averaging is only to extract information about the turn-share distribution properties in the neighborhood of a segment (in this case a pause or an overlap). This information can also be combined with other inputs, such as low-level acoustic and prosodic features.

The size of that neighborhood, or frame length, is another parameter that needs to be considered. As discussed earlier, there is a trade-off between time resolution and frame length. It is desirable to keep the frames short (and consequently the time resolution fine), because ER (or TS) is very sensitive to frame length: in the worst case, a frame that appears equally shared (TS = 0.5 for each speaker, i.e. ER = 1) actually consists of two adjacent “half-frames”, each dominated by a different speaker (TS = 1 for that speaker, i.e. ER = 0 in each half-frame). This is acceptable for short frames because, even in this case, responses are often anticipated before they occur, so the speakers know that an exchange is about to take place. Indeed, the correlations in Table 8.2 remain significant for frame lengths of up to approximately 8 seconds. JAT and TO, on the other hand, are less sensitive to frame length, and can be used to monitor lower-frequency variations in activation, or engagement in the dialogue. Another important point is that APL is correlated with JAT and ER, which apply to both speakers equally at any time in the dialogue, although each speaker's APL is influenced differently by JAT. Therefore, the proposed representation did not reveal a source of variation in APL that would imply non-contemporaneous inter-speaker accommodation (see section 8.2.3). This was the case for OR, however, as it was found to be correlated with TS; a lag-zero correlation of the two speakers' OR should therefore not be expected unless the dialogue (or part of it) is characterized by high ER, in which case turn shares tend to be equal most of the time.
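The sensitivity of ER to frame length can be checked with a toy example. The Python sketch below uses the definition ER = 2·MIN(TS1, TS2) given earlier in this section and assumes, as an illustration only, that TS is each speaker's share of the summed vocalization time in a frame; the function name and the chronograph are made up. One speaker talks for the first 4 seconds and the other for the next 4 seconds.

```python
from typing import List, Optional, Tuple

Interval = Tuple[float, float]  # (onset, offset) in seconds


def er_per_frame(a: List[Interval], b: List[Interval],
                 total: float, frame_len: float) -> List[Optional[float]]:
    """ER of each non-overlapping frame; None for frames without any vocalization."""
    def clipped(intervals: List[Interval], lo: float, hi: float) -> float:
        return sum(max(0.0, min(e, hi) - max(s, lo)) for s, e in intervals)

    ers: List[Optional[float]] = []
    for k in range(int(total // frame_len)):
        lo, hi = k * frame_len, (k + 1) * frame_len
        ta, tb = clipped(a, lo, hi), clipped(b, lo, hi)
        spoken = ta + tb
        # ER = 2 * min(TS1, TS2), with TS assumed to be the share of spoken time.
        ers.append(2.0 * min(ta, tb) / spoken if spoken > 0 else None)
    return ers


speaker_a = [(0.0, 4.0)]  # A holds the floor for the first half
speaker_b = [(4.0, 8.0)]  # B holds the floor for the second half

print(er_per_frame(speaker_a, speaker_b, total=8.0, frame_len=8.0))  # one long frame
print(er_per_frame(speaker_a, speaker_b, total=8.0, frame_len=4.0))  # two half-frames
```

The single 8-second frame looks equally shared (ER = 1), whereas the two 4-second half-frames are each fully dominated by one speaker (ER = 0).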

Finally, considering turn shares rather than turns is more consistent with dialogue representations in which both speakers are considered active at any time during the dialogue (Campbell 2009; Heylen 2009). Thus, the dialogue schema of Figure 6.1 can be updated to represent this view. In a full-duplex model, properties of speech are not necessarily causally related to the immediately preceding time interval in the interaction, but are subject to the ongoing interaction in which both speakers participate equally. The process of instantaneous feedback discussed in section 7.5 is one aspect of this: a/p and temporal (and possibly other) features of speech are subject to variations at the instant the feedback is perceived, i.e. during vocalization and not after it. The simplest possible way to depict this process is to superimpose Figure 8.9 on Figure 6.1, resulting in the following schema, which is equivalent to the schema proposed in Heylen (2009).

Figure 8.11: Representation schema of dialogue including instantaneous feedback
