Modelling perceptual compensation for the effects of reverberation 1

4.7 General discussion

4.7.3 Further implications of task-based modelling decisions

As described above, it is largely possible to base modelling decisions for the computational model on pre-existing data from a range of physiological or psychophysical experiments. However, as soon as the model is applied to a particular set of data, further decisions must be made, since much is left unknown about how the model should be integrated with the specific listener task that it is attempting to simulate. Here, since physiological data is not available, psychoacoustic data has instead been used to select values for model parameters: attempts were made to bring the attenuation values into reasonable ranges by tuning the parameters governing the linear mapping against the actual human response data in naturalistic conditions (cf. § 4.4.5 and § 4.4.6). The resulting values in unknown data conditions are then computed automatically, and fall in a similar range to those expected from the literature (where a maximum attenuation of around 20 dB is suggested by Cooper and Guinan, 2003; Murugasu and Russell, 1996). Moreover, allowing the efferent attenuation to vary in proportion to the reverberation level of the signal is similar in principle to the suggestions by other researchers that the efferent signal should vary in proportion to the noise level (Brown et al., 2010; Clark et al., 2012; Lee et al., 2011; Messing et al., 2009).
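As a rough illustration of the linear mapping described above, the following Python sketch fits a straight line through two tuning points and then maps a new reverberation estimate to an attenuation in dB. The function names, the numeric tuning points and the clipping to a 0 to 20 dB range are assumptions made for this example; they are not taken from the model itself.

```python
def fit_linear_mapping(point_a, point_b):
    """Return (slope, intercept) of the line through two
    (reverberation_estimate, attenuation_dB) tuning points."""
    (x0, y0), (x1, y1) = point_a, point_b
    if x1 == x0:
        raise ValueError("tuning points need distinct reverberation estimates")
    slope = (y1 - y0) / (x1 - x0)
    return slope, y0 - slope * x0


def attenuation_db(reverb, slope, intercept, max_db=20.0):
    """Map a reverberation estimate to an attenuation, clipped to [0, max_db].
    The clipping is an assumption of this sketch."""
    return min(max(slope * reverb + intercept, 0.0), max_db)


# Illustrative (invented) tuning points: (reverberation estimate, attenuation in dB).
slope, intercept = fit_linear_mapping((0.25, 4.0), (0.75, 12.0))
mid_att = attenuation_db(0.5, slope, intercept)   # between the tuning points
high_att = attenuation_db(2.0, slope, intercept)  # clipped at max_db
```

Once the two parameters are fixed, attenuation values for unseen conditions follow automatically from the reverberation estimate alone.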

The time course of contextual awareness presents two difficulties in the current experiments. Firstly, there is not yet a consensus in the data about the timescales on which efferent effects are manifest. Secondly, at the time of modelling there was no data available on the specific time course of perceptual compensation for reverberation¹. Nonetheless, the timescale of the binaural effect could be seen to be at the lower end of the analysis window timescales (cf. Brandewie and Zahorik, 2010; Longworth-Reed et al., 2009), and was known to occur monaurally within the span of a single utterance (e.g. in Watkins, 2005a, b; Watkins and Makin, 2007b, c). The window duration was therefore set, somewhat arbitrarily, at 1 second. The model was then run in different configurations as shown in Figure 4.1 at the start of this chapter. First, an ‘open-loop’ tuning stage was undertaken so that the efferent attenuation value providing the best match to human data could be found by an exhaustive search. In subsequent simulations the model is then able to assess stimuli independently to derive ATT automatically in a ‘closed-loop’ setting. However, the model presented in this chapter is simplified in that it only assesses the context area prior to the test word in order to determine the subsequent attenuation value to apply (more like a ‘semi’ closed loop, perhaps).
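The open-loop tuning stage can be pictured as a plain grid search. The sketch below is a minimal Python illustration: `toy_model` is a hypothetical stand-in for a full run of the auditory model (it simply returns a category boundary that rises linearly with attenuation), and the 0.5 dB grid and the human target value are invented for the example.

```python
def tune_attenuation(simulate, human_value, candidates):
    """Exhaustive search: return the candidate attenuation (dB) whose
    simulated result has the smallest squared error against the human value."""
    return min(candidates, key=lambda att: (simulate(att) - human_value) ** 2)


def toy_model(att_db):
    """Hypothetical stand-in for the auditory model: the simulated category
    boundary (in continuum steps) rises linearly with attenuation."""
    return 2.0 + 0.25 * att_db


candidates = [0.5 * k for k in range(41)]  # 0 to 20 dB in 0.5 dB steps
best_att = tune_attenuation(toy_model, human_value=4.0, candidates=candidates)
```

In the closed-loop setting no such search is needed, because the attenuation is derived from the stimulus itself.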

The assessment of the preceding speech would be better updated online in a continual fashion by means of a gradually shifting time window, following methods presented elsewhere (for example, in Beeston and Brown, 2010; Clark et al., 2012; Messing et al., 2009). Additionally, such a model might include a ‘forgetting’ function (e.g., an exponential decay with a time-constant that varies inversely with the centre frequency of the channel so that low frequency channels contain longer histories) so that the immediate prior context of the signal contributes more strongly to the quantification of reverberation. Recent work by Watkins et al. (2011) also suggests that a frequency weighting to rate context areas signalling the [t] more strongly would benefit the high-frequency consonant distinction around which the ‘sir-stir’ identification task revolves, as is discussed below².
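One possible form of such a ‘forgetting’ function is sketched below in Python. It is an illustration only: the inverse relationship between time constant and centre frequency follows the description above, but the scale factor of 100, the frame duration and the function names are assumptions, not values used in the model.

```python
import math


def time_constant(cf_hz, scale=100.0):
    """Time constant in seconds, inversely proportional to centre frequency.
    The scale factor is an arbitrary choice for this sketch."""
    return scale / cf_hz


def forgetting_weights(n_frames, frame_s, tau_s):
    """Normalised exponential-decay weights for frames ordered oldest to
    newest, so that the most recent frame receives the largest weight."""
    raw = [math.exp(-(n_frames - 1 - i) * frame_s / tau_s) for i in range(n_frames)]
    total = sum(raw)
    return [w / total for w in raw]


def weighted_context(values, frame_s, cf_hz):
    """Weighted average of per-frame values in one frequency channel."""
    weights = forgetting_weights(len(values), frame_s, time_constant(cf_hz))
    return sum(w * v for w, v in zip(weights, values))


# The older of two frames counts for more in a 100 Hz channel than at 4 kHz.
low_cf = weighted_context([1.0, 0.0], frame_s=0.01, cf_hz=100.0)
high_cf = weighted_context([1.0, 0.0], frame_s=0.01, cf_hz=4000.0)
```

With these choices a 100 Hz channel has a 1 s time constant and a 4 kHz channel a 25 ms one, so the low-frequency channel retains a longer history.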

¹ Two studies in particular have recently provided data related to this question: Brandewie and Zahorik (2013), which investigates binaural compensation effects, and Experiment H4 (below), which investigates monaural effects.

² Additional support for this is found and discussed in § 5.3.4 in the following chapter as well.

Additionally, the time window of test-word awareness is not straightforward either. A keyword spotting technique was considered for the current study, and might have been closer to the manner in which a human listener approaches the ‘sir-stir’ task. However, this method was eventually abandoned in favour of the simpler template-based MSE approach (also favoured by Messing et al., 2009). Influenced by the technical considerations behind ASR segmentation techniques, the test area was at first assumed to be defined by the time frames corresponding to the unreverberated test-word position, and later, to be the frames corresponding to just the initial consonant cluster (cf. § 4.4.4). With hindsight, this appears particularly restrictive: the reverberation tails corresponding to the test item itself protrude outside of this region, yet are not assessed by the template mechanism. Moreover, recent work by Watkins and Raimond (2013)¹ suggests that these areas beyond the bounds of the test-word do indeed influence human listeners in this task. Future modellers would thus be advised to increase the duration of the word identification portion considerably to include areas outwith the test-word location itself. Moreover, alternative templates themselves would also be worthy of investigation. Here, unreverberated speech items were used in order that the exact stimuli presented in the near or far distance cases did not recur in testing. Alternatives to this could include, for instance, using acoustically averaged templates (combining near and far distances together), or using perceptually averaged templates (combining all steps of the continuum that were reported by listeners to be ‘sir’ or vice versa).

A further aspect of the template-matching process might yet be improved in addition, that of the relative contribution of the different spectral regions to the distance metric. Since it is not specialised to the high frequency consonant distinction inherent in the ‘sir-stir’ listening task, the current model retains an element of generalisability as it stands. The MSE metric described above weights all channels equally; however, the human listeners’ perception of whether a word is ‘sir’ or ‘stir’ is dominated by energy in the region around 4 kHz (Allen and Li, 2009; Watkins et al., 2011). Moreover, Watkins et al. (2011) suggest that the frequency weighting of both context and test portions is such that the high frequency channels should count for more when a ‘sir-stir’ consonant identification task is under way (i.e., the metric should be (or becomes) specialised for the task at hand). Earlier work therefore trialled a frequency-weighted version of the MSE which favours high frequencies over low. However, the low-frequency channels in the templates (corresponding to ‘s’ and ‘st’ in the current model configuration, as seen in Figures 4.15a and 4.15b) essentially contain only the spontaneous firing rate response, and therefore do not contribute much to the wideband distance calculation in any case. In practice then, this weighted metric made little difference to the overall results: numerical values for sir-scores and stir-scores differ slightly, but not appreciably; the simulated category boundary positions do not move as a result.
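The effect described above can be reproduced with a small Python sketch of a channel-weighted MSE. The two-channel patterns, the firing-rate values and the weight vector are invented for illustration, and the normalisation by the sum of the weights is an assumption of this sketch rather than the form used in the earlier work.

```python
def weighted_mse(test, template, channel_weights=None):
    """Mean squared error between two channel-by-frame patterns, with an
    optional weight per channel (equal weights if none are given)."""
    if channel_weights is None:
        channel_weights = [1.0] * len(test)
    err = 0.0
    for w, test_row, tmpl_row in zip(channel_weights, test, template):
        row_mse = sum((a - b) ** 2 for a, b in zip(test_row, tmpl_row)) / len(test_row)
        err += w * row_mse
    return err / sum(channel_weights)


def classify(test, templates, channel_weights=None):
    """Return the name of the template closest to the test pattern."""
    return min(templates, key=lambda name: weighted_mse(test, templates[name], channel_weights))


# Row 0: low-frequency channel holding only a constant spontaneous rate.
# Row 1: high-frequency channel, with a dip standing in for the [t] closure.
templates = {
    "sir": [[5.0, 5.0, 5.0], [40.0, 40.0, 40.0]],
    "stir": [[5.0, 5.0, 5.0], [10.0, 40.0, 40.0]],
}
test_pattern = [[5.0, 5.0, 5.0], [15.0, 40.0, 40.0]]

equal_label = classify(test_pattern, templates)
weighted_label = classify(test_pattern, templates, channel_weights=[1.0, 3.0])
```

Because the low-frequency rows are identical across templates, they add nothing to either distance, so emphasising the high-frequency channel rescales the scores without changing which template wins.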

¹ Findings of Watkins and Raimond (2013) are indeed supported in a connected-speech task in Experiment H3 below.

This exposes a further possibility in regard to the manner in which stimuli are presented to the model. The present work follows that of countless other modelling studies: the stimulus is presented; the response is recorded; the process repeats for each item in the test suite. Whereas the human listeners in Watkins’ trials received these items one after another (in a randomised order), the current model shared the simulations out between many different cores in a high performance computing grid. This was deemed sensible given that the computer model is completely deterministic and outputs a single response for a given sound file, irrespective of what other jobs it has recently been working on. That is, since each trial is a completely separate event, there is no need to randomise the stimulus presentation order, and no knock-on effects can persist from one trial to the next. In the online closed-loop configuration of the model (where the attenuation estimate is continually updated), however, the time course is such that the previous trial may indeed have an effect on the current word identification likelihood if the inter-stimulus interval is short¹. Therefore, it would be possible to present a randomised sequence of trials to a single instance of the auditory model (much as is done for a human listener) and record category boundaries in the same manner as that used in Watkins (2005a). Further randomisations could be investigated much as multiple people are tested; and since the model is no longer fully deterministic, category boundaries might now vary slightly in the different runs of the experiment. If so, this would allow the current limit on the quantisation to be overcome (Equation 4.12 would now be averaged across several model listeners), perhaps allowing a better match to human data since positions in between the labelled continuum steps could now be identified as the point at which responses flip from ‘sir’ to ‘stir’. Since one of the pair of data points used in tuning was itself unattainable in the current model configuration (discussed in § 4.4.6), it can be inferred that this smoothing process might improve results additionally through re-tuning of the attenuation mapping equations themselves.
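The way in which averaging across several model listeners relaxes the quantisation limit can be shown with a short Python sketch. The boundary measure used here (the index of the first continuum step labelled ‘stir’) is a simplified stand-in chosen for the example and is not Equation 4.12 itself; the response lists are invented.

```python
def boundary_step(responses):
    """Index of the first continuum step labelled 'stir'; returns
    len(responses) if every step was labelled 'sir'."""
    for step, label in enumerate(responses):
        if label == "stir":
            return step
    return len(responses)


def mean_boundary(runs):
    """Average the per-run boundaries across several model listeners."""
    return sum(boundary_step(run) for run in runs) / len(runs)


# Two hypothetical runs whose responses flip at different continuum steps.
runs = [
    ["sir", "sir", "stir", "stir"],
    ["sir", "sir", "sir", "stir"],
]
averaged = mean_boundary(runs)
```

Each single run can only place the boundary on a whole continuum step, whereas the average across runs can fall between steps.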

In document Perceptual compensation for reverberation in human listeners and machines (Page 152-155)