Stimulus processing - Acoustic models of consonant recognition in cochlear implant users


3.3 Methodology

3.3.2 Stimulus processing

All stimuli used as input to the processing were recorded nonsense syllables spoken by a female speaker, stored as digitised Microsoft sound (.wav) files with a sampling rate of 22,050 Hz and a resolution of 16 bits. An additional stimulus was “speech-shaped noise”: white noise filtered to have the same long-term average spectrum as the BKB sentences (Bench et al., 1979) spoken by an adult female speaker. A randomly extracted sample (of the appropriate length) was mixed with the speech stimuli at the appropriate SNR for noise-contaminated listening conditions. Two possibilities exist for achieving a defined SNR for VCV nonsense syllables: either the signal RMS level could be averaged across the entire signal duration or, alternatively, the signal RMS could be computed across the duration of the nominal consonant portion of the stimulus. Each method has disadvantages: with the first option, the effective SNR with respect to the consonant itself will vary according to the consonant-to-vowel amplitude ratio, while with the second option the overall level of the signal will vary, and the lack of a clear definition of the start and end times of the consonant portion makes the task more subjective than is ideal. For this study the first approach was used (across all experiments). In order to do this, the software package Adobe Audition was used to determine the RMS level of each stimulus. For each stimulus condition, the average RMS level of all 20 stimuli was first determined. A randomly chosen portion was then copied from the sound file containing the speech-shaped noise and adjusted so that its mean RMS was at the appropriate level for whichever SNR was to be used. This was then mixed with the target stimuli. It should also be noted that all sound files containing the target stimuli had 1 second of silence before and after the stimulus, and that for noise-contaminated stimuli the noise also began 1 second before stimulus onset and ended 1 second after stimulus end. A final processing stage prior to acoustic modelling was down-sampling of the sound files to 16,000 samples per second, as the recordings had been made at a 22,050 Hz sampling rate whereas the input to the NIC-STREAM/AMO processing needed to be sampled at 16,000 Hz to mimic the Nucleus 24 audio sampling rate. Stimuli were also reduced to 8-bit resolution, as this is the quantization used by the Nucleus 24 processor.
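The level adjustment described above can be illustrated with a minimal Python sketch. It is not the procedure carried out in Adobe Audition; it only shows the arithmetic of scaling a noise excerpt so that the ratio of speech RMS to noise RMS equals a target SNR. The function names, the toy sine-wave "speech" and the uniform random "noise" are all assumptions made for illustration. In the study the speech RMS was the average across the 20 stimuli of a condition, and the noise was speech-shaped.

```python
import math
import random


def rms(samples):
    """Root-mean-square level of a sequence of samples."""
    return math.sqrt(sum(s * s for s in samples) / len(samples))


def scale_noise_for_snr(noise, speech_rms, snr_db):
    """Scale the noise so that speech_rms / noise_rms corresponds to snr_db."""
    target_noise_rms = speech_rms / (10 ** (snr_db / 20))
    gain = target_noise_rms / rms(noise)
    return [gain * n for n in noise]


def mix(speech, noise):
    """Add speech and noise sample by sample (equal lengths assumed)."""
    return [s + n for s, n in zip(speech, noise)]


# Toy data standing in for a recorded syllable and a noise excerpt (hypothetical).
rng = random.Random(0)
speech = [0.5 * math.sin(2 * math.pi * 200 * t / 16000) for t in range(1600)]
noise = [rng.uniform(-1.0, 1.0) for _ in range(1600)]

scaled_noise = scale_noise_for_snr(noise, rms(speech), snr_db=10.0)
mixture = mix(speech, scaled_noise)
achieved_snr_db = 20 * math.log10(rms(speech) / rms(scaled_noise))
```

Because the RMS here is taken over the whole signal, the SNR within the consonant portion alone will generally differ from the nominal value, which is the disadvantage of the first approach noted above.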

The remainder of this section describes the signal processing principles used to produce the AMs (i.e. simulated stimuli for presentation to normal hearing listeners), although some further details specific to each experiment are given where relevant. Stimuli were processed using NIC-STREAM, a MATLAB software toolbox for processing cochlear implant signals with the Nucleus 24 cochlear implant system, designed by Brett Swanson of Cochlear Corporation to mimic the processing of the Nucleus 24 device. The platform is much more flexible than the standard clinical programming software and is designed for research use. Its advantage for this work was that it implements the same filterbank, envelope extraction and channel mapping processes as the Nucleus 24 device and therefore allowed a valid comparison between AM and CI user data. NIC-STREAM comprises a MATLAB toolbox for generation of pulse sequences in addition to a set of functions for direct stimulation of a CI (the latter were not used in this study).

Figure 3.1 shows the conceptual stages of processing, both for the Nucleus 24 device and for the NIC-STREAM stimulus processing. For the purposes of this work, only those MATLAB functions necessary to generate a channel magnitude sequence were used. At the time of initial experimental work, the MATLAB toolbox did not implement front end processing. Consequently, this aspect of processing was dealt with separately (see below) and the input to NIC-STREAM was at the filterbank stage. The Nucleus MATLAB toolbox was therefore used for the filterbank and the sampling and selection stages of stimulus processing. Audio input to the filterbank stage generates a 2-dimensional matrix known as a “frequency-time matrix”, which represents variations in output over time for each filter (in the case of experiment 2, the filterbank was configured as having 8 filter outputs). The subsequent stage of sampling and selection was used to generate a channel magnitude sequence for ACE processing as used in experiment 3, but for CIS the frequency-time matrix and channel-magnitude sequence were effectively identical, as with the CIS strategy all filter outputs are chosen. The channel magnitude sequence was used to generate acoustic stimuli for the AM experiments and also to generate visual representations of nominal electrode output (“electrodograms”) used in chapters 2 and 6.
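The difference between the two strategies at the sampling and selection stage can be sketched as follows. This is an illustrative Python fragment, not NIC-STREAM code: the function name `select_maxima`, the example frame values and the choice to set unselected channels to zero are assumptions made for the sketch.

```python
def select_maxima(frame, n_maxima):
    """Keep the n_maxima largest channel magnitudes in one frame; zero the rest."""
    # Rank channel indices by magnitude, largest first.
    ranked = sorted(range(len(frame)), key=lambda ch: frame[ch], reverse=True)
    kept = set(ranked[:n_maxima])
    return [mag if ch in kept else 0.0 for ch, mag in enumerate(frame)]


# One hypothetical column (time frame) of an 8-channel frequency-time matrix.
frame = [0.10, 0.80, 0.35, 0.05, 0.60, 0.20, 0.90, 0.15]

cis_frame = select_maxima(frame, n_maxima=len(frame))  # CIS: every channel is kept
ace_frame = select_maxima(frame, n_maxima=3)           # ACE-like: 3 of 8 channels kept
```

With all channels selected the frame passes through unchanged, which is why the frequency-time matrix and the channel magnitude sequence coincide for CIS.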

[Figure 3.1: block diagram. Microphone → (audio) → Front End → (audio) → Filterbank → (frequency-time matrix) → Sampling & Selection → (channel-magnitude sequence) → Channel Mapping → (pulse sequence)]

Figure 3.1. Signal flow in the ACE and CIS speech processing strategies. Reproduced with permission of Brett Swanson, Cochlear Corporation.

Additional MATLAB M-files were developed by Johan Laneau and colleagues (Laneau et al., 2006) for generation of AMs and were used for experiments 2 and 4. These additional functions allowed the inclusion of a channel interaction model that was implemented by altering the filter characteristics used to generate the noise bands used as carrier stimuli. The AM was developed and validated in a study of pitch perception (Laneau et al., 2006) and was based on the mathematical model of current spread of Black and Clark (1980), described in 2.4.1. As the unique aspects of this model were only used for generation of stimuli in experiment 4, further details are given in section 5.2. The remaining details of processing given here apply across all three AM experiments.

At the beginning of the experimental work, front end processing was not included in NIC-STREAM. Therefore, the first stage of stimulus processing was the implementation of a pre-emphasis filter to mimic the normal high frequency boost used by the Sprint and Esprit speech processors. The frequency response of the Sprint microphone was determined empirically, and the measurements used to determine this are described in Appendix A. The response was defined as having the following characteristics: up to 1800 Hz, 6 dB per octave was added; between 1800 and 5000 Hz there was a flat frequency response; from 5000 to 10,000 Hz a 24 dB per octave decrease was implemented. The pre-emphasis was implemented in Adobe Audition using an FFT filter with a Hamming window and an FFT size of 8192. In some cases implementation of the pre-emphasis led to clipping, and the filter was therefore implemented with an overall gain reduction as necessary to avoid this. However, prior to subsequent processing, all stimuli were re-scaled to the same levels relative to one another as before the addition of pre-emphasis. With the Nucleus 24 device, the pre-emphasis is built into the microphone and the subsequent stage of processing would therefore be the ADC. However, the stimuli here had already been down-sampled to 16,000 Hz with 8-bit resolution (i.e. the characteristics of the ADC stage within the Nucleus device), so no further processing was necessary to mimic the Nucleus device in this respect.
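The target shape of the pre-emphasis can be written as a piecewise function of frequency. The Python sketch below is one reading of the description above and is not the Adobe Audition filter itself: it assumes the flat region between 1800 and 5000 Hz is the 0 dB reference, that the 6 dB per octave slope rises towards 1800 Hz, and the function name `pre_emphasis_gain_db` is hypothetical.

```python
import math


def pre_emphasis_gain_db(freq_hz):
    """Target gain in dB relative to the flat 1800-5000 Hz region (assumed 0 dB)."""
    if freq_hz <= 0:
        raise ValueError("frequency must be positive")
    if freq_hz < 1800.0:
        # Rising at 6 dB per octave up to the 1800 Hz corner.
        return 6.0 * math.log2(freq_hz / 1800.0)
    if freq_hz <= 5000.0:
        # Flat region.
        return 0.0
    # Falling at 24 dB per octave above 5000 Hz (described up to 10,000 Hz).
    return -24.0 * math.log2(freq_hz / 5000.0)


gains = {f: pre_emphasis_gain_db(f) for f in (450, 900, 1800, 3000, 5000, 10000)}
```

Under these assumptions a component at 450 Hz, two octaves below the corner, sits 12 dB below the flat region, and a component at 10,000 Hz sits 24 dB below it.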

The next stage of processing was to band-pass filter the signal using the NIC-STREAM/Nucleus FFT filter bank. It should be noted that the same filterbank is used for both ACE and CIS processing strategies; this stage is therefore identical across AM experiments. The input waveform was analysed at the same rate as the nominal “stimulation rate”, i.e. 500 Hz for experiments 1 and 2 and 900 or 250 Hz for experiment 3. As with the Nucleus device itself, a 128-point FFT was performed. This yielded bin centre frequencies that were linearly spaced at multiples of 125 Hz and which had a 6 dB bandwidth of 250 Hz. These bins were combined by summing powers to provide eight frequency bands as per figure 3.2. For experiments 1 and 2, an 8-channel CIS implementation was used: the upper and lower frequency boundaries of the 8 analysis filters are shown in figure 3.2. For experiment 4, an ACE implementation was used (in order to match the clinical parameters actually used by the CI users) and details of the corresponding analysis filters are given in chapter 5.
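The 125 Hz bin spacing follows directly from the 16,000 Hz sampling rate and the 128-point FFT. The short Python sketch below shows that arithmetic and the idea of combining bins into a band by summing powers. The bin powers and the band spanning bins 10 to 12 are hypothetical values chosen for illustration (the actual band edges are those of figure 3.2), and equal weighting of the bins is an assumption.

```python
SAMPLE_RATE_HZ = 16000
FFT_SIZE = 128

bin_spacing_hz = SAMPLE_RATE_HZ / FFT_SIZE
# Centre frequencies of the non-negative-frequency bins (0 Hz up to Nyquist).
bin_centres_hz = [k * bin_spacing_hz for k in range(FFT_SIZE // 2 + 1)]


def band_power(bin_powers, first_bin, last_bin):
    """Sum the powers of FFT bins first_bin..last_bin (inclusive) into one band."""
    return sum(bin_powers[first_bin:last_bin + 1])


# Hypothetical bin powers and a hypothetical band covering bins 10 to 12
# (centre frequencies 1250, 1375 and 1500 Hz).
bin_powers = [0.0] * len(bin_centres_hz)
bin_powers[10], bin_powers[11], bin_powers[12] = 0.2, 0.5, 0.3
example_band_power = band_power(bin_powers, 10, 12)
```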

[Figure 3.2: chart of the lower and higher frequency boundary of each analysis band (frequency axis, 0 to 8000 Hz) against electrode number (22, 19, 16, 13, 10, 7, 4, 1)]

Figure 3.2. Frequency allocation for the 8-channel CIS implementation used in experiments 1 and 2.

The envelope of each filter was calculated as a weighted sum of the corresponding FFT bin powers, where the weights determined the frequency boundaries of the bands. Carrier stimuli were modulated according to the fluctuations in the envelopes of the corresponding band-pass filters. The nature of the carrier stimuli varied across experiments in terms of carrier stimulus type, choice of (centre) frequency and (in the case of noise bands for experiment 4 only) overlap between carriers. For experiment 1, sine waves were used, whose frequencies corresponded to the centre frequencies of the 8 FFT filter outputs shown in figure 3.2. For experiment 2, noise bands and sine waves were used in different models for comparison purposes. For half of the models used in experiment 2, the centre frequencies of the carriers corresponded to the centre frequencies of the FFT filter outputs shown in figure 3.2, as in experiment 1. However, for the other half of the acoustic models in experiment 2, and for all of the models in experiment 4, the centre frequencies of the carrier stimuli were shifted upwards in frequency so that they corresponded to the assumed place of excitation along the basilar membrane (F in equation 3.1) for the corresponding intracochlear electrode. This assumed the standard Nucleus 24 electrode array, with 22 electrodes, inserted 25 mm into a cochlea with a length of 33 mm. The frequency transformation was determined according to Greenwood’s formula (Greenwood, 1990), given here as equation 3.1.

$$F = A\left(10^{ax} - k\right)$$

where
F = centre frequency in Hz
A = 165.4
a = 0.06
x = distance along the basilar membrane in mm
k = 1

Equation 3.1. Determination of centre frequency corresponding to place along the basilar membrane according to Greenwood, 1990.

To take an example, the filter output for (virtual) electrode 13 in the 8-channel CIS model shown in figure 3.2 had a centre frequency of 1313 Hz. The corresponding electrode along a 22-electrode array of 25 mm length, inserted along a 33 mm basilar membrane, was assumed to be 17.8 mm from the apex. This resulted in an assumed characteristic frequency of 1768 Hz according to Greenwood’s formula. Consequently, the frequency of the carrier (sine wave frequency, or noise band centre frequency) was shifted upwards by 455 Hz. The formula, combined with information about electrode array characteristics and typical insertion depth, yielded upwards shifts in frequency which, expressed as the ratio of carrier frequency to analysis centre frequency, ranged from 1.2 for apical/low-frequency channels to 1.45 for basal/high-frequency channels. The same shift was used to determine the frequency of the sine wave carriers (for experiment 2) and the centre frequency of the noise band carriers (in experiments 2 and 4). Because of the finding from experiment 2 that this degree of “pitch shift” had only a very modest effect on performance, the transform was applied to all models used in experiment 4. It should be noted that the filter bank frequency bands reported in figures 3.2, 5.1 and 5.2 reflect analysis filter bank characteristics (common to AM and CI processing), not necessarily AM output carrier frequencies, given that these were transformed systematically as described above.
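The worked example for electrode 13 can be checked with a few lines of Python, using the constants quoted for equation 3.1 (A = 165.4, a = 0.06, k = 1) and the 17.8 mm place and 1313 Hz analysis centre frequency quoted above. The function and variable names are illustrative only.

```python
def greenwood_frequency_hz(x_mm, A=165.4, a=0.06, k=1.0):
    """Characteristic frequency at x_mm from the apex: F = A * (10**(a*x) - k)."""
    return A * (10 ** (a * x_mm) - k)


analysis_centre_hz = 1313.0  # filter centre frequency quoted in the text
electrode_place_mm = 17.8    # assumed distance from the apex quoted in the text

carrier_hz = greenwood_frequency_hz(electrode_place_mm)
shift_hz = carrier_hz - analysis_centre_hz
shift_ratio = carrier_hz / analysis_centre_hz
```

The computed carrier frequency agrees with the quoted 1768 Hz to within rounding of the electrode position, and the resulting ratio of carrier to analysis frequency falls between the quoted extremes of 1.2 and 1.45.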

For experiment 4, noise band carriers with centre frequencies chosen to reflect corresponding cochlear locations according to Greenwood (1990) were used, as in experiment 2. Additionally, in order to model spectral channel interaction, the frequency response of the filters used to generate the noise-band carriers was altered according to the Laneau et al. (2006) model. The frequency response of the filter was designed to simulate the exponential decay of current density along the basilar membrane (Black and Clark, 1980) and is defined by:

$$F(x(f)) = \exp\left(-\frac{\left|x(f) - x_{\mathrm{electrode}}\right|}{\lambda}\right)$$

where

λ = length constant of the exponential decay, expressed as a distance along the cochlea in mm (the conversion of distance along the cochlea into the frequency domain assumed the Greenwood formula)

$x_{\mathrm{electrode}}$ = the position of the simulated electrode

x(f) = the conversion from frequency to distance along the cochlea according to Greenwood (1990)

Equation 3.2. Filter transfer function used to model spectral channel interaction, from Laneau et al. (2006).
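As a minimal sketch of how equation 3.2 can be evaluated, the Python fragment below computes the weight exp(−|x(f) − x_electrode| / λ) at a given frequency, obtaining x(f) by inverting equation 3.1 with the constants quoted there. It is not the MATLAB implementation of Laneau et al. (2006); the function names and the 17.8 mm example electrode position are assumptions made for illustration.

```python
import math

A, a, k = 165.4, 0.06, 1.0  # Greenwood constants as quoted for equation 3.1


def place_mm(freq_hz):
    """Inverse of equation 3.1: distance from the apex (mm) for a frequency in Hz."""
    return math.log10(freq_hz / A + k) / a


def frequency_hz(x_mm):
    """Equation 3.1: frequency in Hz at a distance x_mm from the apex."""
    return A * (10 ** (a * x_mm) - k)


def spread_weight(freq_hz, electrode_mm, lam_mm):
    """Filter weight exp(-|x(f) - x_electrode| / lambda) from equation 3.2."""
    return math.exp(-abs(place_mm(freq_hz) - electrode_mm) / lam_mm)


electrode_mm = 17.8  # example electrode place, in mm from the apex
f_at_electrode = frequency_hz(electrode_mm)
f_one_mm_away = frequency_hz(electrode_mm + 1.0)

weight_at_electrode = spread_weight(f_at_electrode, electrode_mm, lam_mm=1.0)
weight_1mm_lambda_1 = spread_weight(f_one_mm_away, electrode_mm, lam_mm=1.0)
weight_1mm_lambda_3_3 = spread_weight(f_one_mm_away, electrode_mm, lam_mm=3.3)
```

At a place 1 mm from the electrode the weight falls to exp(−1), about 0.37, when λ = 1 mm, but only to exp(−1/3.3), about 0.74, when λ = 3.3 mm, so a larger λ gives a broader carrier band.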

The desired frequency response was obtained by implementing a linear phase FIR filter in MATLAB. The model assumed a 35 mm cochlear length and 25 mm electrode array insertion. Laneau et al. (2006), applying the same model, found equivalent performance between Nucleus 24 users and AM listeners when a channel overlap term equivalent to 1 mm spectral spread of excitation was used. However, that study evaluated pitch perception rather than segmental perception (e.g. consonant identification). Three channel overlap conditions were therefore chosen: first, no overlap between noise band carriers; second, overlap equivalent to 1 mm spectral spread; and, finally, overlap equivalent to 3.3 mm spectral spread, similar to the value suggested by Black and Clark (1980). The three models were thus identical except for the value of λ. Figure 3.3 shows the effect of varying λ. The figure shows wide-band spectrograms of a 2000 Hz pure tone which was sinusoidally amplitude modulated at 50 Hz with a modulation depth of 100% and processed through an AM of the ACE speech processing strategy (12 maxima out of 20 channels) with a 900 pps/ch stimulation rate. It can be seen that the spectral spread associated with the 3.3 mm channel interaction condition is very marked. It should also be noted that the effect of any given degree of channel interaction in a peak-picking strategy will be stimulus dependent: with a wider band stimulus the selected peaks may be further apart, whereas for a narrow band stimulus the peaks will be closer together. Therefore, for a given degree of spectral spread, the consequences will differ according to the location and spacing of the peaks chosen in a particular frame. For a stimulus where peaks are selected in the same frequency region, a small amount of channel interaction (e.g. 1 mm, which represents a filter bandwidth just over 1 electrode wide either side of the stimulation electrode) will cause a larger amount of channel overlap than for a stimulus which produces widely spaced peaks.

Figure 3.3. Wide-band spectrograms of AMs of a 2000 Hz pure tone modulated at 50 Hz with no channel interaction (top), with λ = 1 mm (middle) and with λ = 3.3 mm (bottom).

In all three AM experiments, carrier stimuli, either sine waves or noise bands, were added together and the RMS level of the resulting signal was adjusted to be equal to that of the original signal. Presentation level for the AM experiments was at a nominal level of 65 dB(A) as measured in a 2 cc acoustic coupler, equivalent to approximately 60 dB(A) at the tympanic membrane. For the CI user experiment, stimuli were presented in the sound field at a level of 70 dB(A) as measured at the location of the subjects’ speech processor microphone.
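The final synthesis step, in which envelope-modulated carriers are summed and the result is matched in RMS to the original signal, can be sketched in Python as follows. This is a simplified illustration with sine carriers only. The two envelopes, the carrier frequencies and the 500 Hz tone standing in for the original signal are hypothetical, and the function names are not taken from NIC-STREAM or the Laneau et al. (2006) code.

```python
import math


def rms(samples):
    """Root-mean-square level of a sequence of samples."""
    return math.sqrt(sum(s * s for s in samples) / len(samples))


def synthesise(envelopes, carrier_freqs_hz, fs_hz):
    """Sum sine carriers, each multiplied sample by sample by its envelope."""
    n = len(envelopes[0])
    return [
        sum(env[t] * math.sin(2 * math.pi * f * t / fs_hz)
            for env, f in zip(envelopes, carrier_freqs_hz))
        for t in range(n)
    ]


def match_rms(signal, reference):
    """Rescale signal so that its RMS equals that of reference."""
    gain = rms(reference) / rms(signal)
    return [gain * s for s in signal]


FS_HZ = 16000
N = 1600
original = [0.3 * math.sin(2 * math.pi * 500 * t / FS_HZ) for t in range(N)]
envelopes = [
    [0.2 * t / N for t in range(N)],  # hypothetical rising envelope
    [0.1] * N,                        # hypothetical constant envelope
]
carrier_freqs_hz = [400.0, 1800.0]

summed = synthesise(envelopes, carrier_freqs_hz, FS_HZ)
output = match_rms(summed, original)
```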
