A review of theoretical perspectives on inter-speaker accommodation phenomena was given in chapter 3. As highlighted in section 3.6, the majority of these theoretical models are based on positive empirical evidence acquired in laboratory conditions. The presence or absence of accommodating behaviour has in some cases been assessed by perceptual “expert” judgements, while little emphasis has been put on measuring the magnitude of these phenomena. Instead, theoretical models have focused on the cause and function of accommodation. These functions include cognitive alignment, communication efficiency, satisfaction of emotional needs, social approval and balancing a dyadic relationship. Several of these functions are relevant in the context of SDS, but without a quantification of the observations, it is impossible to develop systems that can replicate the behaviour observed in human dialogues. This problem was identified by Oviatt et al. (2004):
“One weakness of past research on interpersonal linguistic adaptation has been its lack of follow-through on quantitative research and user modeling. Instead, this literature has focused on qualitative descriptions of the social dynamics and context involved in linguistic accommodation. It has also relied on global correlational measures to demonstrate linguistic accommodation between two interlocutors. In future research, more quantitative predictive modeling will be needed on the process of linguistic convergence, including the magnitude and rate of adaptation of different linguistic features, the factors that drive dynamic adaptation and re-adaptation during human-computer conversation, and other key issues. Such models will be valuable in guiding the design of future conversational interfaces and their adaptive processing capabilities.”¹² (Oviatt et al. 2004)
As noted in chapter 4, the mechanisms currently available for monitoring and quantifying accommodation are unsuitable for SDS that aim to mimic human-like interaction. Existing approaches to measuring accommodation are, with few exceptions, statistical. The typical process comprises (a) acquiring speech recordings, (b) extracting features and (c) performing statistical analysis or, in some cases, applying signal processing techniques in order to validate the hypothesis of accommodation, or to compare the results of two or more experimental conditions. The limitations of existing studies that measure accommodation are assessed in the remainder of this section by discussing each of these three stages in turn.
12 Oviatt et al. (2004) use the term “linguistic” to signify any property of spoken language. The features studied in the same text are amplitude, speech rate and response latency.
As highlighted in section 2.5, proponents of human-like SDS (Carlson et al. 2006; Edlund et al. 2008) have emphasized the need for investigating human behaviour in dialogues of spontaneous speech. The reason for this requirement is that spontaneous speech is human speech in its most natural form. Therefore, knowledge derived from investigating such corpora is more likely to be perceived as “natural” when applied to SDS. Wizard-of-Oz SDS environments simulating application tasks can also be used, but care has to be taken that properties of natural human speech are not masked by the experimental constraints. Accommodation, in particular, has been found to be affected by task complexity (Pardo 2006) and talker role (Fais 1996), among other factors.
However, few of the studies reviewed in chapter 4 have used spontaneous speech in their investigation of accommodation phenomena (see Table 4.1). Some of the studies have used scripted dialogues, which were designed so that features could be extracted from identical lexical elements (Kakita 1996) or utterance types (Suzuki and Katagiri 2005). Despite the advantages of this approach in relation to robust feature extraction, the “dialogue” is artificial and the results of these studies cannot be generalized. A second group of studies used simulated human-machine interaction scenarios, in which subjects had the role of the “user” (Bell et al. 2003; Suzuki and Katagiri 2004; Oviatt et al. 2004). While these studies provided evidence of user accommodation towards the “system”, it is doubtful whether they can be helpful in comparing human-human and human-machine interaction in this regard, or in informing improvements to the human-likeness of SDS. A third group of studies reported using spontaneous speech recordings (Brennan 1996; Bosch et al. 2004b; Reitter et al. 2006; Nishimura et al. 2008; Campbell 2009; Edlund et al. 2009; Benus 2009). However, as discussed in section 2.5, acquiring recordings of genuine spontaneous speech is not trivial, and careful consideration is required in order to record such dialogues.
The stage of feature extraction is also typically accompanied by a number of assumptions. Turns, in particular, are typically defined using an arbitrary turn attribution scheme (see section 2.3.2) which assumes speakers are holding and releasing the floor at specific points. However, such schemes are not adequate in describing spontaneous speech and thus introduce bias in the subsequent analysis. Another common convention is to extract features from entire utterances and “tie” them to a specific time point, such as the beginning (Kakita 1996) or the middle (Nishimura et al. 2008) of the utterance. While such conventions are convenient, they are not necessarily consistent with the process of speech production and perception in human speech: the prosodic realization of an utterance is not pre-determined before vocalization, but comes as a result of articulation effort (Xu 2005) and simultaneous feedback from the interlocutor (Heylen 2009).
Finally, statistical validation of inter-speaker accommodation has been accomplished in a variety of ways, but most of these methods are not helpful for quantifying or modeling this behaviour for SDS. A characteristic example is across-dialogue comparisons (Coulston et al. 2002; Bosch et al. 2004a; Suzuki and Katagiri 2004; Ward and Nakagawa 2004), in which subjects' speech features are compared across two or more different conditions. In some cases, the dialogue is arbitrarily split into two halves (Darves and Oviatt 2002; Suzuki and Katagiri 2005), resulting in a comparison between the first and second half. Yet it is clear from any of the theoretical descriptions that accommodation phenomena are dynamic: they are thought to evolve through the interaction and characterize it in terms of “coordination” or “synchrony”. Such dynamics can only be captured by a continuous measurement methodology, sampling at regular intervals or at identified instances (depending on the features studied), in order to arrive at a model which describes the variations of these features that occur as a result of inter-speaker accommodation. Such a model can then be used in SDS in order to continuously monitor the user's speech (or other modalities) and adapt the system voice accordingly.
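The difference between a split-half comparison and a continuous measurement can be illustrated with a minimal Python sketch. It is not taken from any of the studies cited above: the function names, the window length and the feature values (imagined here as mean pitch in Hz sampled at regular intervals) are all assumptions chosen purely for illustration.

```python
from statistics import mean

def absolute_difference(speaker_a, speaker_b):
    """Per-sample absolute difference between two speakers' feature values."""
    return [abs(a - b) for a, b in zip(speaker_a, speaker_b)]

def split_half_comparison(speaker_a, speaker_b):
    """Mean inter-speaker difference in the first and in the second half."""
    diff = absolute_difference(speaker_a, speaker_b)
    half = len(diff) // 2
    return mean(diff[:half]), mean(diff[half:])

def continuous_measurement(speaker_a, speaker_b, window=3):
    """Mean inter-speaker difference over a sliding window of samples."""
    diff = absolute_difference(speaker_a, speaker_b)
    return [mean(diff[i:i + window]) for i in range(len(diff) - window + 1)]

# Hypothetical data: the speakers converge mid-dialogue and then diverge again.
speaker_a = [200, 190, 180, 170, 170, 180, 190, 200]
speaker_b = [160, 165, 170, 170, 170, 170, 165, 160]

first_half, second_half = split_half_comparison(speaker_a, speaker_b)
trajectory = continuous_measurement(speaker_a, speaker_b)

print("split-half:", first_half, second_half)
print("windowed:  ", [round(value, 2) for value in trajectory])
```

With these invented values the two halves have the same mean difference, so the split-half comparison reports no change, whereas the windowed trajectory falls and then rises again. The sliding mean is only one of many possible continuous measures and is used here for simplicity.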
A promising approach in this direction is time-series analysis, which has been used in a number of studies reviewed in chapter 4. However, time-series analysis is characterized by complexity, which discourages wide adoption of this technique (Edlund et al. 2009). Thus, several studies are limited to inferring conclusions by simply inspecting the time series plots (McRoberts and Best 1997), while a few take the next step and employ an analytical approach (Buder and Eriksson 1999; Jaffe et al. 2001; Nishimura et al. 2008; Richardson et al. 2008). However, only one of these proposed a model for monitoring user accommodation and adapting the system voice to accommodate to that of the user (Nishimura et al. 2008).
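As a hedged illustration of what an analytical treatment of two feature time series might involve, the following Python sketch computes a lagged Pearson correlation between the series of two interlocutors. This is a generic technique and not the specific method of any study cited above. The names `pearson` and `lagged_correlation`, the maximum lag and the data are assumptions made for the example.

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    covariance = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    if spread_x == 0 or spread_y == 0:
        return 0.0  # a constant series carries no correlation information
    return covariance / (spread_x * spread_y)

def lagged_correlation(leader, follower, max_lag):
    """Correlate follower[t + lag] with leader[t] for lag = 0..max_lag."""
    return {
        lag: pearson(leader[:len(leader) - lag], follower[lag:])
        for lag in range(max_lag + 1)
    }

# Hypothetical series: the second speaker repeats the first speaker's
# feature values with a delay of one sample (first value is arbitrary).
user = [1, 3, 2, 5, 4, 6, 5, 7, 6, 8]
system = [0, 1, 3, 2, 5, 4, 6, 5, 7, 6]

correlations = lagged_correlation(user, system, max_lag=3)
best_lag = max(correlations, key=correlations.get)
print("best lag:", best_lag)
```

A peak in correlation at a non-zero lag would suggest that one speaker's feature values follow the other's after a delay. In real dialogue data the series are short, noisy and unevenly sampled, so such a measure would need considerably more care than this sketch shows.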
The problem of quantification is perhaps most evident in studying accommodation of temporal features, such as the duration of silences before or after utterances. The phenomenon is studied from two distinct viewpoints: Communication Accommodation Theory (Giles et al. 1992) proposes that this is another form of socially-driven behaviour, while studies on rhythmic entrainment (Jaffe and Feldstein 1970; Wilson and Wilson 2005) suggest that interlocutors are rhythmically “coupled” when engaged in dialogue. Evidence is weak for both. Across-dialogue comparison of silence duration convergence among speakers (Bosch et al. 2004b) does not constitute solid evidence, as it can be attributed to other causes, such as dialogue or topic liveliness (Bosch et al. 2005; Benus 2009). Turn-based time series approaches show only partial evidence, since only a portion of the dialogues exhibit simultaneous variation of silence duration among speakers (Edlund et al. 2009). Finally, there is little empirical support for “coupling” theories (Benus 2009). Temporal accommodation is therefore a subjectively observed phenomenon for which empirical evidence remains weak, especially in the case of spontaneous speech.
It is evident from the review that inter-speaker accommodation phenomena have not been described adequately with respect to their manifestation, and this is a significant obstacle to their implementation in SDS. Further investigation of the form of accommodation is therefore required, in order to extract information that can be useful for SDS.