
Chapter 2. Background

2.3. Effect of CI signal processing characteristics on speech perception

2.3.3 Filter bank temporal characteristics

A number of CI processing factors come under the broad heading of “temporal”, but what they have in common is the notion of information being carried within a single channel and the associated ability of the CI to represent these changes accurately within the signals carried by individual electrodes. Until very recent innovations in CI processing, the majority of CIs have used envelope extraction. Envelope extraction strategies use a fixed rate of pulsatile stimulation in which within-channel energy changes are coded not as changes in pulse timing, but as variations in pulse level (corresponding to envelope fluctuations from the filter outputs, as described above). These strategies do not code the fine temporal structure of the band-specific signals. Recent work has attempted to utilise variations in pulse timing to represent fine temporal information (Nie et al., 2005), although one of the problems intrinsic to using a variable pulse stimulation rate is the need to avoid simultaneous pulse presentation across channels, which is known to be associated with greater channel interaction (Boex et al., 2003). In this study only envelope extraction strategies (specifically ACE and SPEAK as implemented in the Nucleus 24) are considered.

It is important to determine whether the temporal information that is available via CI processing is adequate for speech perception and also whether temporal parameter changes, particularly stimulation rate, have an impact on speech perception in CI users. The first question is therefore: how much temporal detail is required in the signal to lead to good speech perception? Steeneken and Houtgast (1980) suggested that low modulation frequencies carry the highest information load in speech. However, Rosen (1992) argued that higher-frequency temporal information is important for various critical aspects of speech perception. According to Rosen, temporal information in speech can be divided into three separate information sources varying by modulation frequency. First, low-rate temporal information (below about 50 Hz), termed envelope information, conveys basic amplitude variation in speech, and is important in signalling manner of articulation, voicing, vowel identity and suprasegmental information. Second, temporal information between 50 and 500 Hz conveys periodicity information, i.e. information within this modulation range conveys whether the signal is aperiodic (normally unvoiced) or periodic (voiced), contributing to voicing, manner and suprasegmental information. Third, higher-frequency information (600-10,000 Hz) is termed fine structure by Rosen, and its main contribution to speech intelligibility is to perception of place of articulation and also vowel quality. A proviso to this account is that, in practice, NH listeners cannot code temporal information beyond about 5 kHz and therefore it is likely that information higher than this frequency must be coded as spectral rather than temporal information (i.e. must be coded via the place rather than the volley mechanism).

The question of how much temporal information CI users have access to has been addressed in some studies of temporal modulation transfer functions (TMTF) in CI users. Steeneken and Houtgast (1980) introduced the concept of the TMTF as a way of determining the temporal response of an acoustic system. The concept can be applied in both the physical and psychophysical domains. The original work by Steeneken and Houtgast (1980) defined the TMTF as a physical measure of modulation depth as a function of modulation rate, but the term is also applied to the measurement of modulation detection thresholds as a function of modulation rate, as in Galvin and Fu (2005). Shannon (1992) measured TMTFs in CI users in three ways: detection of amplitude modulation, detection of low-frequency sine waves and detection of beats in two-tone complexes. For each of the three tasks the TMTF was derived. The response pattern of the TMTF was similar irrespective of which of the three tasks was used. The CI users showed TMTFs with a mean cut-off frequency of 140 Hz, with a very sharp fall-off above the cut-off frequency. The TMTF varied as a function of stimulus level. With NH listeners, modulation detection is independent of stimulus level across the majority of the dynamic range (Moore and Glasberg, 2001). By contrast, the subjects in Shannon's study had worse temporal modulation detection thresholds the lower the stimulus level.

The problem that should be noted in the context of the present study is that it cannot be inferred from a psychophysically measured TMTF (as with any other perceptual measure) whether restrictions in temporal information are due to CI processing information loss or electrical/neural interface information loss. The fact that there was such variability in TMTFs across CI users suggests that the electrical/neural interface may play a part in accounting for temporal information loss. A crucial question for this study is the amount of temporal information available to the CI user as a consequence of CI processing (as opposed to the subsequent information loss possibly associated with the electrical/neural interface; see 2.4.3). A particular focus of the literature has been the perceptual effect of changing stimulation rate, and it is therefore important to determine the extent to which temporal information changes with stimulation rate, i.e. the total number of pulses provided by the CI per second. In the present study the question is addressed with specific reference to the Nucleus 24 device. Therefore, a more detailed consideration of the temporal processing of the Nucleus 24 device is needed.

Because the Nucleus 24 implements an audio sampling rate of 16 kHz and a fixed FFT length of 128 points, it can undertake 125 (= 16,000/128) non-overlapping FFT analyses per second. The temporal response of the filter can therefore be approximated by a low-pass smoothing filter with a cut-off at 125 Hz, with little information in the envelope available above this frequency (David Simpson, personal communication). However, the Nucleus 24M is able to implement channel stimulation rates ranging from 250 pulses per second per channel (pps/ch) to 1200 pps/ch (although note that the more recent device, the Nucleus Freedom, can implement channel stimulation rates up to 3,500 pps/ch). The extent to which increases in stimulation rate within the available range genuinely increase the temporal envelope information available is unclear, as temporal information can only be increased by increasing the degree of overlap between subsequent FFT analyses (of the same sampled signal). Stimulation rate increases are achieved by increasing the overlap between subsequent FFT analyses such that the number of (overlapping) analyses per second is equal to the stimulation rate (Cochlear, 2002). Let us consider the example of changing from 250 pps/ch to 500 pps/ch. For 250 pps/ch, the first stimulation frame analyses the first 128 samples, the second frame analyses points 65 to 192, and so on (i.e. there is an overlap of half the data points with each analysis). For 500 pps/ch, the second analysis uses points 33 to 160, and so on (an overlap of 3/4 of the data points from each analysis). Increases in analysis rate above 125 Hz without an increase in audio sampling rate (i.e. shorter analysis windows) or a decrease in FFT length mean that there is little benefit in temporal detail for the envelope. This suggests that the envelope bandwidth is effectively limited to 125 Hz, irrespective of analysis/stimulation rate, although a small amount of increased temporal information may result from higher degrees of overlap between FFT analyses.

In order to determine this empirically, a series of objective temporal modulation transfer functions (TMTFs) were measured. Sinusoidally amplitude modulated (SAM) sinusoids of 250 Hz and 2000 Hz were used as input stimuli for signal processing using the NIC-STREAM Nucleus MATLAB toolbox simulation of Nucleus 24 processing. The choice of these two frequencies was motivated by the importance of the two frequency regions for different aspects of consonant recognition. Information for voicing, nasality and fundamental frequency for higher-pitch female or children’s voices occurs at around 250 Hz or lower, while the important second formant for most vowels (and the associated second formant transitions for adjacent consonants) occurs near 2000 Hz.
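The frame-overlap arithmetic for different stimulation rates can be checked with a short sketch. It assumes, purely for illustration, that the hop between successive analyses is the sampling rate divided by the stimulation rate, rounded to the nearest sample; the function names and the rounding rule are hypothetical and are not taken from the Nucleus documentation.

```python
FS = 16000      # audio sampling rate quoted in the text (Hz)
FFT_LEN = 128   # fixed FFT length quoted in the text (samples)

# Number of non-overlapping analyses per second (16,000 / 128).
BASE_RATE = FS / FFT_LEN


def hop_size(rate_pps, fs=FS):
    """Samples between the starts of successive analyses.

    Assumption: one FFT analysis per stimulation pulse, with the hop
    rounded to the nearest whole sample.
    """
    return round(fs / rate_pps)


def frame_windows(rate_pps, n_frames=3, fs=FS, fft_len=FFT_LEN):
    """1-based (first, last) sample indices of the first n_frames analyses."""
    hop = hop_size(rate_pps, fs)
    return [(k * hop + 1, k * hop + fft_len) for k in range(n_frames)]


def overlap_fraction(rate_pps, fs=FS, fft_len=FFT_LEN):
    """Fraction of each analysis window shared with the previous one."""
    return max(0.0, 1.0 - hop_size(rate_pps, fs) / fft_len)


if __name__ == "__main__":
    print("analyses per second without overlap:", BASE_RATE)
    for rate in (250, 500):
        print(rate, "pps/ch:", frame_windows(rate), overlap_fraction(rate))
```

Under these assumptions the window length stays at 128 samples (8 ms) whatever the rate, so a higher stimulation rate re-analyses mostly the same samples rather than shortening the analysis window.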

The two sine waves were sinusoidally modulated at 100% modulation depth at modulation rates from 25 to 250 Hz, in 25 Hz steps. Modulation depth was measured for processed stimuli for three different stimulation rates (250 pps/ch, 900 pps/ch and 2000 pps/ch). Stimuli were processed through a single-channel CIS strategy as implemented in the Nucleus 24 CI (described in detail in 3.3.2). Figures 2.6 and 2.7 show two examples of visual representations of electrode output. The difference between the two figures is the modulation rate: in both cases, the output of a single electrode channel is given for a SAM 250 Hz tone with a modulation depth of 100%. It can be clearly seen that, while for the SAM tone modulated at a rate of 25 Hz the modulation depth approaches 100%, for the same stimulus modulated at a rate of 250 Hz the modulation depth is markedly reduced, at only 9% (modulation depth for a SAM pure tone can be simply defined as the ratio of maximum to minimum signal values, expressed as a percentage).
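The stimulus set and the depth measurement described above can be sketched as follows. This is not the NIC-STREAM toolbox code; it is a minimal stand-alone illustration, and it assumes the conventional definition of modulation depth, (max - min) / (max + min), which may not be the exact measure used for the figures. The function names and the stimulus duration are hypothetical.

```python
import math

FS = 16000  # audio sampling rate quoted in the text (Hz)


def sam_tone(carrier_hz, mod_hz, depth=1.0, dur_s=0.2, fs=FS):
    """Sinusoidally amplitude-modulated sinusoid as a list of samples."""
    return [
        (1.0 + depth * math.sin(2 * math.pi * mod_hz * n / fs))
        * math.sin(2 * math.pi * carrier_hz * n / fs)
        for n in range(int(dur_s * fs))
    ]


def modulation_depth(envelope):
    """Modulation depth in percent, assuming (max - min) / (max + min)."""
    hi, lo = max(envelope), min(envelope)
    return 100.0 * (hi - lo) / (hi + lo)


# Stimulus grid described in the text.
carriers_hz = [250, 2000]
mod_rates_hz = list(range(25, 251, 25))   # 25 to 250 Hz in 25 Hz steps
stim_rates_pps = [250, 900, 2000]
conditions = [
    (rate, carrier, mod)
    for rate in stim_rates_pps
    for carrier in carriers_hz
    for mod in mod_rates_hz
]

if __name__ == "__main__":
    print(len(conditions), "conditions")
    # An idealised output envelope with 9% residual modulation at 25 Hz.
    env = [1.0 + 0.09 * math.sin(2 * math.pi * 25 * n / FS) for n in range(3200)]
    print(round(modulation_depth(env), 2))
```

In an actual measurement the envelope passed to `modulation_depth` would be the sequence of pulse levels on the electrode after processing, not an idealised sinusoid as in the demonstration.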

[Plot not reproduced. x-axis: Time (ms), 0 to 1000; y-axis: Electrode (11).]

Figure 2.6. Electrode output for a pure tone modulated at 25 Hz through single channel CIS processing with a stimulation rate of 2000 pps. The input stimulus was a SAM tone with a carrier frequency of 2000 Hz, a modulation rate of 25 Hz and a modulation depth of 100%.

[Plot not reproduced. x-axis: Time (ms), 0 to 1000; y-axis: Electrode (11).]

Figure 2.7. Electrode output for a pure tone modulated at 250 Hz through single-channel CIS processing with a stimulation rate of 2000 pps. The input stimulus was a SAM tone with a carrier frequency of 2000 Hz, a modulation rate of 250 Hz and a modulation depth of 100%.

Figures 2.8 to 2.10 show the full range of TMTFs measured for the three stimulation rates.

[Plot not reproduced. x-axis: Modulation rate, 25 to 250; y-axis: Modulation depth, 0 to 100; series: 250 Hz carrier, 2000 Hz carrier.]

Figure 2.8. Temporal modulation transfer functions for two different carriers with single-channel CIS processing at a stimulation rate of 250 pps/ch with the Nucleus 24 processor. The original unprocessed signal was modulated at 100% modulation depth.

[Plot not reproduced. x-axis: Modulation rate, 25 to 250; y-axis: Modulation depth, 0 to 100; series: 250 Hz carrier, 2000 Hz carrier.]

Figure 2.9. Temporal modulation transfer functions for two different carriers with single-channel CIS processing at a stimulation rate of 900 pps/ch with the Nucleus 24 processor. The original unprocessed signal was modulated at 100% modulation depth.

[Plot not reproduced. x-axis: Modulation rate, 25 to 250; y-axis: Modulation depth, 0 to 100; series: 250 Hz carrier, 2000 Hz carrier.]

Figure 2.10. Temporal modulation transfer functions for two different carriers with single-channel CIS processing at a stimulation rate of 2000 pps/ch with the Nucleus 24 processor. The original unprocessed signal was modulated at 100% modulation depth.

It can be seen that modulation depth drops off markedly as a function of modulation rate, and that the pattern is very similar across stimulation rates and carrier frequencies. The pattern of TMTF data, showing a gradual decrease in modulation depth and a modulation depth of around 70% at 125 Hz, is consistent with the hypothesis that, for a processor with a fixed FFT length and number of samples, the envelope bandwidth does not vary significantly with increased FFT overlap. For modulation rates below 200 Hz, there appears to be a modest advantage for 900 pps/ch and 2000 pps/ch over 250 pps/ch. However, for higher modulation rates even this small advantage disappears, at least up until the modulation rate is equal to the stimulation rate, as in Figure 2.10.

The data provided in Figures 2.8 to 2.10 suggest that benefits of changing from lower to higher stimulation rates should be modest, if present at all, for the Nucleus 24 processing system. It is therefore of interest to relate this finding to empirical evidence regarding the effect of stimulation rate, particularly in users of the Nucleus 24 device. Vandali et al. (2000) evaluated sentence recognition in a group of Nucleus 24 CI users. In this study, six users of the Nucleus 24M CI were tested in three different stimulation rate conditions: 250, 807 and 1615 pps/ch. Users had take-home experience with the different rate conditions within a cross-over design, with order of presentation of the three rate conditions randomised across subjects. Other parameters used by the subjects were those normally used, and outcome measures were tests of word and sentence recognition. The study failed to show a significant effect of stimulation rate and for some listeners even found deterioration in sentence recognition at higher rates. However, Holden et al. (2002) found that some Nucleus 24 users obtained better performance with 1800 pps/ch compared to 720 pps/ch, albeit only at 50 dB SPL and not at 60 or 70 dB SPL, and only for two of the six subjects. Interestingly, Galvin and Fu (2005) found an improvement in modulation detection at low stimulus levels when using a lower stimulation rate (250 pps/ch compared to 2000 pps/ch) in Nucleus 24 and Nucleus 22 users, although it should be noted that these differences were obtained via direct stimulation using a modulated pulse train, rather than for stimuli processed via the CI processor itself. Taken together, these findings suggest that there is very little evidence of performance benefit with higher rates in the Nucleus 22 and 24 devices and even some evidence of performance reductions. The measurements reported above suggest that the reason for this is the absence of appreciable changes to temporal envelope sampling with increases in stimulation rate in the Nucleus device, due to the inherent limitations of combining a fixed FFT length with a fixed sampling rate.

Systems other than the Nucleus CI implement IIR filterbanks followed by rectification and smoothing, as with the CIS strategy in the MED-EL COMBI 40+ and CIS-PRO body-worn processor. In this case, it is possible to alter stimulation rate and envelope cut-off frequency (i.e. the low-pass cut-off of the smoothing filter) independently. It may be that the ability to increase the cut-off of the smoothing filter could lead to comparatively greater changes in temporal information transmission than is the case with devices such as the Nucleus 24, which use a fixed-size FFT approach. Recent literature suggests that both rate of pulsatile stimulation and envelope cut-off frequency may have an impact on consonant recognition, although these effects are highly variable between studies. Verschuur (2005) showed that there was little benefit to changing stimulation rate without changing envelope cut-off frequency. In that study three different stimulation rates were used (400, 800 and >1500 pps/ch) but envelope cut-off was maintained at 400 Hz. There were no differences in performance with consonant recognition measures, although there were improvements at the higher rates for sentence recognition, albeit only for 2 out of 6 subjects.
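The independence of envelope cut-off and stimulation rate in a rectify-and-smooth scheme can be illustrated with a minimal sketch. It is not the MED-EL implementation: full-wave rectification, a one-pole low-pass smoother and the parameter values below are all assumptions chosen for brevity, and the function names are hypothetical.

```python
import math

FS = 16000  # assumed audio sampling rate (Hz)


def smoothed_envelope(band_signal, cutoff_hz, fs=FS):
    """Full-wave rectify, then smooth with a one-pole low-pass filter.

    The one-pole smoother is a stand-in for whatever smoothing filter
    a real processor uses; cutoff_hz sets its -3 dB point approximately.
    """
    alpha = 1.0 - math.exp(-2.0 * math.pi * cutoff_hz / fs)
    env, y = [], 0.0
    for x in band_signal:
        y += alpha * (abs(x) - y)
        env.append(y)
    return env


def pulse_levels(envelope, rate_pps, fs=FS):
    """Sample the envelope once per stimulation pulse.

    The rate only decides how often the envelope is read out; it does
    not alter the cut-off applied by smoothed_envelope.
    """
    return envelope[:: round(fs / rate_pps)]


def depth(envelope):
    """(max - min) / (max + min) over the second half, skipping onset."""
    tail = envelope[len(envelope) // 2:]
    return (max(tail) - min(tail)) / (max(tail) + min(tail))


if __name__ == "__main__":
    # 2000 Hz carrier, 100 Hz modulation, 100% depth, 0.2 s.
    sig = [
        (1 + math.sin(2 * math.pi * 100 * n / FS))
        * math.sin(2 * math.pi * 2000 * n / FS)
        for n in range(3200)
    ]
    for cutoff in (50, 400):
        env = smoothed_envelope(sig, cutoff)
        print(cutoff, "Hz cut-off:", round(depth(env), 2),
              len(pulse_levels(env, 800)), "pulses at 800 pps/ch")
```

In this sketch, raising the cut-off preserves more of the 100 Hz modulation at a fixed pulse rate, which is the kind of independent manipulation that a fixed-length FFT front end does not offer.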

Fu and Shannon (2000) evaluated the effect of both stimulation rate and envelope cut-off frequency on consonant and vowel recognition in users of a 4-channel CIS strategy with the Nucleus 22 device. The authors used an experimental processor which implemented an IIR filterbank approach and was therefore able to separately manipulate envelope cut-off frequency and stimulation rate. The authors found improvements in performance as stimulation rate was increased from 50 to 150 pps/ch. However, they found no further significant improvement with increases in rate from 150 to 500 pps/ch, the highest rate used. They also found no improvement in consonant recognition with envelope cut-off frequencies above 20 Hz, although performance deteriorated below this frequency down to the lowest cut-off frequency used (2 Hz). This is an interesting finding, because it suggests that only very low frequency modulation rates contributed to speech perception, or at least that increasing the envelope cut-off filter above this rate did not provide more temporal information.

A final point to note is the concept of “trade-off” between stimulation rate and channel number. Brill et al. (1997) showed that some individuals performed better at higher rates and lower channel numbers, while for others performance was optimal for relatively lower rates and higher channel numbers. Nie et al. (2006) found that changes in stimulation rate and channel number could be “traded off” against one another to produce similar outcomes in consonant recognition in quiet, again in a group of users of the MED-EL device. Clearly, the degree to which these two parameters can be traded off against each other must depend on the relative change in information. For the Nucleus 24 device, as indicated in 2.4.3, a doubling of stimulation rate means considerably less than a doubling of temporal information. Theoretically, an increase in channel number (or number of peaks coded in a peak-picking strategy) should mean a corresponding increase in spectral detail, although this of course depends on electrical/neural interface limitations. Moreover, the trade-off would presumably be different for different consonant features, depending on the relative importance of spectral and temporal resolution for coding of the feature. The possibility of “trading off” channel number and stimulation rate was included in the design of the experimental work reported in Chapter 5, although it was not anticipated that this phenomenon would be observed for users of the Nucleus 24 device, given the absence of changes in temporal sampling with increased stimulation rates.

In document Acoustic models of consonant recognition in cochlear implant users (Page 48-58)