
7.4 Statistical evaluation

7.4.3 Bi-variate analysis

Bi-variate analysis considers two series together and is the simplest case of multivariate time series analysis. Each observation, rather than being a real number, is a vector, the elements of which are the values from each individual series. If x1, x2 are two time series, then the vector X = [x1 x2] is the bi-variate time series. In general, the observations of an n-variate time series are n × 1 vectors. The individual univariate time series are called component series (Chatfield 1996).

The relationship between the component series can be explored by means of the sample cross-correlation function (cc.f), an estimate of which is the cross-correlogram. In order to obtain a cross-correlogram, one needs to designate one component series as the input, x (independent variable), and the other as the output, y (dependent variable). As mentioned in section 7.4, this is done arbitrarily in this case, as the two subjects have an equal role in the dialogue experiment of the shipwrecked scenario (see section 6.4.3). In this manner, the series for speaker A is considered as the “input”, and the series of speaker B is considered as the “output”.

Careful consideration needs to be given to cross-correlation, as spuriously large coefficients may appear in the cross-correlogram if the component series are themselves autocorrelated (Chatfield 1996). A technique commonly used in such cases is that of pre-whitening the component series. This means transforming the series so that their correlograms resemble that of white noise, which is a random process in which subsequent values are uncorrelated (see footnote 21). Therefore, each component series has to be transformed so that its respective correlogram shows no significant coefficient. In the case of the two pitch series (Figure 7.3a), this can be achieved by fitting an autoregressive (AR) model of order 1 to each series. This is indicated by the respective correlograms of the series (Figure 7.4), which show a significant coefficient at lag 1 for both series. According to Chatfield (1996), the value of that coefficient is the best estimate for the alpha (α) value in an AR(1) model of the form (x_i − μ) = α(x_{i−1} − μ) + ε_i, where ε_i denotes random noise. Using the value α = 0.4 read from the correlogram, the above equation is solved for ε_i, which yields a residual series for each speaker. The success of the pre-whitening method can be validated by plotting the correlograms of the residual series, in order to determine whether any coefficients remain significant (not shown).
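The pre-whitening step described above can be sketched in Python as follows. This is only an illustrative sketch: the synthetic series, the function names `lag1_autocorrelation` and `prewhiten_ar1`, and the use of NumPy are assumptions made for the example and are not taken from the analysis itself. The approximate significance bound of ±2/√N is the usual rule of thumb for a correlogram of uncorrelated data and is likewise assumed here.

```python
import numpy as np

def lag1_autocorrelation(x):
    """Sample autocorrelation coefficient at lag 1 (used as the alpha estimate)."""
    x = np.asarray(x, dtype=float)
    d = x - x.mean()
    return float(np.sum(d[:-1] * d[1:]) / np.sum(d * d))

def prewhiten_ar1(x, alpha):
    """Solve the AR(1) equation for the noise term:
    e_i = (x_i - mu) - alpha * (x_{i-1} - mu), for i = 2..N."""
    x = np.asarray(x, dtype=float)
    d = x - x.mean()
    return d[1:] - alpha * d[:-1]

# Synthetic AR(1) series standing in for one speaker's pitch series
# (hypothetical data, generated with alpha = 0.4 for illustration only).
rng = np.random.default_rng(0)
n, true_alpha = 2000, 0.4
noise = rng.normal(0.0, 0.05, n)
x = np.empty(n)
x[0] = noise[0]
for i in range(1, n):
    x[i] = true_alpha * x[i - 1] + noise[i]

alpha_hat = lag1_autocorrelation(x)        # estimate alpha from the correlogram
residuals = prewhiten_ar1(x, alpha_hat)    # residual (pre-whitened) series
bound = 2.0 / np.sqrt(len(residuals))      # assumed approximate 95% bound
print(alpha_hat, lag1_autocorrelation(residuals), bound)
```

If the AR(1) model is adequate, the lag-1 autocorrelation of the residuals should fall well inside the bound, which corresponds to the validation by correlogram mentioned above.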

Cross-correlation coefficients are then calculated for this pair of residual series. The sample cross-correlation coefficient r_xy(k) at lag k is given by:

Equation 7.7: Sample cross-correlation coefficient

\[
r_{xy}(k) =
\begin{cases}
\dfrac{\sum_{t=1}^{N-k} (x_t - \mu_x)(y_{t+k} - \mu_y)}{\sqrt{\sum_{t=1}^{N} (x_t - \mu_x)^2 \, \sum_{t=1}^{N} (y_t - \mu_y)^2}}, & k \ge 0 \\[2ex]
\dfrac{\sum_{t=1-k}^{N} (x_t - \mu_x)(y_{t+k} - \mu_y)}{\sqrt{\sum_{t=1}^{N} (x_t - \mu_x)^2 \, \sum_{t=1}^{N} (y_t - \mu_y)^2}}, & k < 0
\end{cases}
\]

where μ_x, μ_y are the means of the component (residual) series x, y respectively, x_t, y_t are the values of the residual series at time t, and N is the total number of points in the residual series. The cross-correlogram for the two pitch series (Figure 7.3a) is shown in Figure 7.6 below.

One major difference between the cross-correlogram and correlogram plots is that the former contains both positive and negative lags. According to Chatfield (1996), a linear system with input x and output y demonstrates feedback if significant coefficients are found at zero or positive lags.

However, if the roles of the two speakers' series as input and output are reversed, then the coefficient at lag 1 which can be seen in Figure 7.6 will appear at lag -1. Therefore, a coefficient at lag 1 or -1 is an indication of uni-directional convergence, in this case A→B: as the roles are reversed, B is now the input and A the output, and a significant coefficient at lag -1 means that A converges to B. This can be seen on several occasions in Figure 7.7, where A (blue) is lagging behind B (orange) by one point, particularly in the right part of the plot.

21 For a formal definition of white noise, see Chatfield (1996).

Figure 7.6: Sample cross-correlogram of the two series of Figure 7.3a, pre-whitened by fitting an AR model with α = 0.4
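The computation behind a cross-correlogram such as Figure 7.6 can be sketched as follows. This is a minimal illustration of Equation 7.7 and not the code used for the analysis: the function name `cross_correlation`, the synthetic residual series, the range of lags and the ±2/√N significance bound are all assumptions made for the example. The denominator is taken to be the square root of the product of the two sums of squared deviations, which is the usual normalisation.

```python
import numpy as np

def cross_correlation(x, y, k):
    """Sample cross-correlation r_xy(k) between x_t and y_{t+k}."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    n = len(x)
    dx, dy = x - x.mean(), y - y.mean()
    if k >= 0:
        # sum over t = 1 .. N-k of (x_t - mu_x)(y_{t+k} - mu_y)
        num = np.sum(dx[: n - k] * dy[k:])
    else:
        # sum over t = 1-k .. N of (x_t - mu_x)(y_{t+k} - mu_y)
        num = np.sum(dx[-k:] * dy[: n + k])
    return float(num / np.sqrt(np.sum(dx**2) * np.sum(dy**2)))

# Hypothetical residual series: res_b follows res_a with a delay of one point.
rng = np.random.default_rng(1)
n = 400
res_a = rng.normal(size=n)
res_b = np.empty(n)
res_b[0] = rng.normal()
res_b[1:] = 0.5 * res_a[:-1] + rng.normal(size=n - 1)

lags = range(-5, 6)
ccf = {k: cross_correlation(res_a, res_b, k) for k in lags}
bound = 2.0 / np.sqrt(n)  # assumed approximate 95% significance bound
significant = [k for k in lags if abs(ccf[k]) > bound]
print(significant)
```

Swapping the arguments mirrors the cross-correlogram about lag zero, that is, `cross_correlation(res_b, res_a, -k)` equals `cross_correlation(res_a, res_b, k)`, which is the role-reversal property discussed in the text.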

It is noted that this interpretation of the cross-correlogram is not very reliable, due to the presence of a (borderline) significant coefficient at lag zero. This indicates the presence of feedback in the system, unless a common underlying process is affecting both series. This point is emphasized because correlation by itself does not imply causality: unless the possibility of a common external factor can be safely excluded, there is no basis to assume a causal relationship. Since the only input in the dialogue is provided by the speakers themselves, the coefficient at lag zero has to be attributed to feedback (see section 7.5). When feedback is present, the interpretation of the correlogram can be misleading (Chatfield 1996), especially in terms of using the cross-correlogram in order to estimate model parameters, e.g. as in the univariate case, where it was possible to estimate the alpha value for an AR(1) model directly from the correlogram.

In Figure 7.7, the residual (pre-whitened) series are plotted. These residuals represent the amount of variation in the a/p features not accounted for by autocorrelation (a deterministic component). The existence of one or more significant cross-correlation coefficients implies the existence of an additional deterministic component, whether an external factor that affects both series, or a causal relationship between the two series (the latter in this case). However, estimation of the power of this component is not possible using the correlogram because of feedback: as shown in Figure 7.7, there are points at which the two series are “in-phase”, as well as points at which blue is lagging behind orange. The coefficients at lag 0 and lag 1 are therefore competitive: the instances of zero lag reduce the value of the coefficient at lag 1 and vice versa. Positive and negative lag coefficients are also competitive. In fact, in an extreme case where two pure open-loop processes with opposite lags (at -1 and 1) are combined (concatenated), there is only one significant coefficient at lag 0. Therefore, the values of the cross-correlation coefficients can be used for model parameter estimation only if it is certain that there is no feedback.

In addition, each point in the time series represents an entire frame, rather than a single time instant; therefore, the coefficients at lag 0 and lag 1 are competitive with respect to the frame length. In other words, some of the autoregressive structure is “masked” due to the averaging process. Intuitively, accommodation in human dialogues is always deterministic, as speakers accommodate to each other's speech based on past utterances. However, it has been suggested (Heylen 2009) that feedback in human interaction can be instantaneous, due to visual or other cues. In the absence of visual feedback in the recordings analyzed here, it can be argued that instantaneous feedback occurs by means of overlapping speech segments. As pointed out in section 7.4, feedback implies bi-directional accommodation (A↔B). However, due to the issues discussed here, i.e. the competitiveness between coefficients and the loss of some temporal information due to the frame length, the cross-correlogram cannot show the degree of convergence separately for each speaker. Despite the fact that the cross-correlogram is not useful for model estimation, it can be used for model identification (see section 7.4.4).
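The masking effect of frame averaging can be illustrated with a small simulation. This is a deliberately simplified sketch under stated assumptions, not a model of the recordings: the underlying series are synthetic, the delay between them is a single sample, the frame length of 10 samples is arbitrary, and the helper `cross_correlation` is a hypothetical implementation of the sample cross-correlation coefficient.

```python
import numpy as np

def cross_correlation(x, y, k):
    """Sample cross-correlation r_xy(k) between x_t and y_{t+k}."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    n = len(x)
    dx, dy = x - x.mean(), y - y.mean()
    if k >= 0:
        num = np.sum(dx[: n - k] * dy[k:])
    else:
        num = np.sum(dx[-k:] * dy[: n + k])
    return float(num / np.sqrt(np.sum(dx**2) * np.sum(dy**2)))

# Assumed setup: y follows x with a delay of one sample (purely lagged influence).
rng = np.random.default_rng(2)
frame_len, n_frames = 10, 2000
n = frame_len * n_frames
x = rng.normal(size=n)
y = np.empty(n)
y[0] = rng.normal()
y[1:] = x[:-1] + 0.5 * rng.normal(size=n - 1)

# Average each series over non-overlapping frames, one value per frame.
x_frames = x.reshape(n_frames, frame_len).mean(axis=1)
y_frames = y.reshape(n_frames, frame_len).mean(axis=1)

sample_r0 = cross_correlation(x, y, 0)
sample_r1 = cross_correlation(x, y, 1)
frame_r0 = cross_correlation(x_frames, y_frames, 0)
frame_r1 = cross_correlation(x_frames, y_frames, 1)
print(sample_r0, sample_r1, frame_r0, frame_r1)
```

At the sample level the dependence appears at lag 1, whereas after averaging most of it appears at lag 0, because a delay shorter than the frame length falls largely within the same frame. Under these assumptions, a lag-zero coefficient on frame-averaged series does not by itself distinguish instantaneous from short-delay influence.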

Figure 7.7: Residual series plot for the two series in Figure 7.3a after fitting an AR(1) model with α = 0.4 to both series (legend: Residual (sp. A), Residual (sp. B))

In a paper presenting this statistical evaluation method (Kousidis et al. 2009a), five dialogues from the “shipwrecked” scenario corpus were analyzed for accommodation of four a/p features: pitch, intensity, pitch range and speech rate (see Table 7.1). Significant positive correlations were found for all four features, albeit mostly for pitch and intensity. Most of these coefficients were found at lag zero, which implies bi-directional accommodation. Whether uni-directional or bi-directional, the presence of a significant positive correlation coefficient constitutes a statistical validation of accommodation, as there is a deterministic component for at least one of the speakers that is caused by inter-speaker influence.

Table 7.1: Lags at which significant positive cross-correlation coefficients are found among two speakers in 5 “shipwrecked” dialogue recordings

Importantly, the positive sign of the cross-correlation signifies convergence, in other words adaptation of one's a/p features to the respective features of the other. This occurs simultaneously along different dimensions (or modalities), if each a/p feature is thought of as a distinct channel of accommodation. A negative cross-correlation coefficient would signify divergence, or non-accommodation (see section 3.4.3), but no negative coefficients were found in Kousidis et al. (2009a). As positive and negative coefficients are also competitive at the same lag, non-accommodation will not be statistically significant unless it occurs in a relatively large portion of the dialogue. The results of Kousidis et al. (2009a) were confirmed by the analysis of the rest of the corpus (see appendix A).

In document A Study of Accomodation of Prosodic and Temporal Features in Spoken Dialogues in View of Speech Technology Applications (Page 134-138)