THE BENEFIT OF LISTENING WITH TWO EARS

The Perception of Multiple Sounds

Up to now we have considered only cases of monaural perception. For the previous chapter, this omission can be defended by pointing to the fact that the contribution of binaural hearing of single sounds consists mainly in adding directional information about the sound source. Of course this is a remarkable achievement of the auditory system, but it seems to be of secondary importance with respect to the scope of this book. However, binaural perception is rather important for considering how simultaneous sounds undergo perceptual separation.

FIG. 3.16. Demonstration that a single anomalous dash is much more easily distinguished in the context of a repeating regular dash pattern.

Binaural hearing has been studied primarily in terms of interaural time differences. We know that the auditory system can distinguish minimal horizontal differences in the direction of two frontal sound sources on the order of a few degrees, corresponding with a temporal difference of only about 20 μsec (Blauert, 1983). This sensitivity to interaural time differences contributes to the detectability of a sound. For example, the detection threshold for a 1,000-Hz tone presented at the same time to both ears (frontal direction) can be lowered about 10 dB if the interaural time difference of a masking noise (lateral direction) is increased from zero to 0.6 msec.
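The relation between source direction and interaural time difference can be made concrete with a small calculation. The sketch below uses Woodworth's spherical-head approximation, which is not taken from this chapter; the head radius of 8.75 cm, the speed of sound of 343 m/sec, and the function name `itd_seconds` are likewise assumptions chosen for illustration. The approximation is meant for frontal azimuths between 0 and 90 degrees.

```python
import math

# Assumed values, not taken from the text.
HEAD_RADIUS_M = 0.0875        # typical adult head radius in meters
SPEED_OF_SOUND_M_S = 343.0    # speed of sound in air at about 20 degrees C


def itd_seconds(azimuth_deg):
    """Interaural time difference for a distant source at the given azimuth.

    Woodworth's spherical-head approximation: ITD = (a / c) * (theta + sin(theta)),
    with theta the azimuth in radians (0 = straight ahead, 90 = fully lateral).
    """
    theta = math.radians(azimuth_deg)
    return HEAD_RADIUS_M / SPEED_OF_SOUND_M_S * (theta + math.sin(theta))


for azimuth in (2, 90):
    print(f"{azimuth:>2} deg -> {itd_seconds(azimuth) * 1e6:.0f} microseconds")
```

Under these assumptions a shift of 2 degrees from straight ahead gives a time difference just below 20 μsec, and a fully lateral source gives somewhat more than 0.6 msec, in line with the orders of magnitude quoted above.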

“Head shadow” quantified in terms of interaural intensity differences can also lead to improved audibility of sounds arriving from different directions.

The relative contributions of interaural time and level differences to the intelligibility of speech in the presence of a masking noise were studied by Bronkhorst and Plomp (1988). The speech material, consisting of sentences read by a female speaker, was presented without interaural differences, and the noise, spectrally equal to the long-term average spectrum of the sentences, was varied as a function of azimuth. Thus the testing condition mimicked the common situation of a speaker facing the listener, and an interfering noise coming from a different direction. In Fig. 3.17, the speech-reception threshold, defined as the speech-to-noise ratio at which 50% of the sentences were correctly repeated by the listeners, is plotted as a function of the azimuth. In these results, we see that time differences alone can lower thresholds up to about 5 dB, level differences alone up to about 8 dB, and the combined effects can lower thresholds by as much as 10 dB.
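The definition of the speech-reception threshold can be illustrated with a short sketch that reads the 50% point off a set of scores by linear interpolation. This is not the measurement procedure of Bronkhorst and Plomp; the function name and the score values are hypothetical and serve only to show what "the speech-to-noise ratio at which 50% of the sentences were correctly repeated" means numerically.

```python
def speech_reception_threshold(snr_db, percent_correct, target=50.0):
    """Return the speech-to-noise ratio (dB) at which the score reaches `target`.

    Assumes the ratios are in ascending order and finds the first pair of
    neighboring points that brackets the target, then interpolates linearly.
    """
    points = list(zip(snr_db, percent_correct))
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        if y0 <= target <= y1 and y1 > y0:
            return x0 + (target - y0) * (x1 - x0) / (y1 - y0)
    raise ValueError("target score is not bracketed by the data")


# Hypothetical scores for one listening condition (illustration only).
snr_db = [-12.0, -10.0, -8.0, -6.0, -4.0]
percent_correct = [5.0, 20.0, 40.0, 70.0, 90.0]

srt = speech_reception_threshold(snr_db, percent_correct)
print(f"Speech-reception threshold: {srt:.1f} dB")
```

A lower threshold means that the sentences remain intelligible at a less favorable speech-to-noise ratio, which is how the benefits of 5, 8, and 10 dB reported above should be read.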

FIG. 3.17. Mean speech-reception threshold for sentences presented in front of the listener as a function of the direction of the noise source. The three curves represent the conditions where only interaural time differences, only level differences due to head shadow, or both factors are taken into account (redrawn from Bronkhorst & Plomp, 1988).

Of course, these testing conditions represent a very favorable listening situation, where there is only a single disturbing noise source, without the complicating factor of sound reflections, which is the more typical case in everyday listening environments. The results demonstrate that, in addition to the effects discussed earlier in this chapter, binaural hearing can make a substantial contribution to the audibility and separation of simultaneous sounds. In chapter 5, some further comments on the relevance of this result are made within the framework of a general exposition regarding the intelligibility of fluent speech in the presence of disturbing noise.

DISCUSSION

The message of this chapter is that the auditory system is continuously testing whether simultaneous sounds originate from the same source or from different sources, and our surprising conclusion has been that this testing as well as the perceptual process of segregation into two or more sound streams is so successful that we are not consciously aware of this tremendous achievement. When an auditory signal is interrupted by louder sound bursts, the obliterated signal segments are heard as continuing straight through the interfering bursts. Successive tone pulses having a small frequency difference are perceived as parts of the same sound stream, whereas tone pulses with larger frequency differences are heard as multiple streams. The system seems to use all information available, that is, differences in timbre, pitch, and loudness, to decide which fragments belong together and should be perceived as such. Even absent speech fragments can be restored on the basis of contextual information.

The experimental evidence indicates that this process is controlled by a number of clear as well as flexible principles. The most significant one is the time scale of 75 to 200 msec for which the continuity and segregation effects are most prominent, and which corresponds to the duration of acoustic speech elements such as phonemes and syllables. Apparently, the time scale of the auditory–perceptual process on the one hand and the time scale of the acoustic units of our communication system on the other are very well matched.

Directly related to the previous point, it is remarkable that sequences of equal tones as visualized in Fig. 3.9 are not perceived as crossing each other. The auditory system’s preference for grouping sounds of similar timbre on the basis of overall differences in pitch contributes to the perceptual separation of voices.

However, this preference can be overruled by the criterion of similarity in timbre. Even a modest difference in timbre is sufficient for us to hear two sound streams as crossing each other, equally effective for separating simultaneous tone sequences or simultaneous voices.

This example illustrates that both pitch and timbre differences are “weighed” in the decision as to whether sound elements belong to the same stream or not, shown explicitly in Fig. 3.10. Timbre should be taken here in its widest sense, including spectrotemporal variations. For example, two complex tones where one is modulated in amplitude or frequency or begins slightly earlier than the other are much more easily separated than two tones that are more similar. As the comodulation effect illustrated in Figs. 3.14 and 3.15 showed, large timbre differences between signal and background can even improve detectability.

It is also interesting to note that longer sound sequences are more easily segregated than short ones. It is as if the auditory process allows the “benefit of the doubt” for a single pitch deviation but considers that repeated alternations reflect the presence of more than one sound stream. The fact that the temporal relations of (nonsimultaneous) segregated sound streams are difficult to perceive demonstrates that the auditory system directs all its efforts to separating the individual streams. Even the binaural hearing system is provided with sophisticated processes for improving the separation of sounds from different sources.

Without doubt, the most striking aspect of this unraveling process is the auditory system’s capacity to restore inaudible sound fragments as if they were never masked. As illustrated in Fig. 3.5, even rhythmic properties are reconstructed with striking accuracy. These restorations in temporal structure are, however, much less stable than those observed for pitch and timbre. As we saw, training listeners on a particular time pattern can result in their hearing a correspondingly restored sound stream.

This restoration capability becomes particularly manifest in cases in which the system has only a small fragment of a sound to work with. Take, for example, the case of a single sinusoidal tone presented against a background of wide-band noise. If the tone is relatively weak, the auditory system does not exclude the possibility that the tone is the only audible harmonic of a low-pitched complex tone (illustrated in Fig. 3.18). In this case, context is used to decide upon the most likely reconstruction. This can explain why Houtgast’s (1976) listeners were able to perceive a single weak tone as the harmonic of a lower fundamental and thus decide on its corresponding pitch. In his experiment, as well as in the earlier ones utilizing more harmonics, a necessary condition for hearing the tones as harmonics was a low sensation level as well as the presence of comparison tones having a similar timbre, which served to direct the listener’s attention to the pitch range where the (inaudible) fundamental could be expected.
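The ambiguity described here can be spelled out numerically. The sketch below lists the lower fundamentals of which a single tone could be a harmonic, and then picks the candidate closest to a pitch suggested by the context. It is an illustration only, not a model of Houtgast's experiment: the function names, the limit of six harmonics, the 50-Hz lower bound, the example frequencies, and the nearest-candidate rule on a logarithmic frequency scale are all assumptions.

```python
import math


def candidate_fundamentals(tone_hz, max_harmonic=6, lowest_f0_hz=50.0):
    """List (harmonic number, fundamental in Hz) pairs consistent with one tone.

    A tone at tone_hz could be the n-th harmonic of a complex tone whose
    fundamental is tone_hz / n; harmonic numbers start at 2 because n = 1
    is the tone itself.
    """
    candidates = []
    for n in range(2, max_harmonic + 1):
        f0 = tone_hz / n
        if f0 >= lowest_f0_hz:
            candidates.append((n, f0))
    return candidates


def most_likely_fundamental(tone_hz, context_pitch_hz):
    """Pick the candidate nearest to the context pitch on a log-frequency scale.

    The nearest-candidate rule is an assumption made for this sketch.
    """
    return min(
        candidate_fundamentals(tone_hz),
        key=lambda pair: abs(math.log(pair[1] / context_pitch_hz)),
    )


for n, f0 in candidate_fundamentals(1000.0):
    print(f"harmonic {n}: fundamental {f0:.1f} Hz")

print(most_likely_fundamental(1000.0, context_pitch_hz=210.0))
```

With comparison tones near 210 Hz, the sketch selects 200 Hz, treating the 1,000-Hz tone as a fifth harmonic; a different context would select a different candidate, which is the point of the passage.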

A related experiment was reported by Shriberg (1992). She presented her subjects with isolated vowels excised from natural speech. Low-pass filtering of these vowels resulted in frequent identification errors, which could be reduced significantly by adding high-pass filtered noise. Apparently, the noise improved the auditory system’s ability to restore the high-frequency part of the vowel spectra.

Perceptual uncertainty regarding the original signal is maximal for fluent speech that is partially masked by other sounds. In this case the contribution of the context for retrieving any mutilated speech fragment is essential. As discussed in chapter 5, listeners report hearing quite different phonemes depending on context. This restoration can be so perfectly misleading that quite sophisticated processes must be involved.

Such results demonstrate even more convincingly than in the previous chapter that hearing represents an active process in which the incoming information is reorganized so as to derive the most probable reconstruction of the undisturbed sounds radiated from the various (predicted) sources. This reconstruction of auditory “objects” is so complete that moving our head does not destroy the stability of the acoustic world, no more than our eye movements perturb the world we see.

FIG. 3.18. Illustration of the condition in which the ear cannot decide whether a single weak tone audible against a background of noise represents a single sinusoidal tone or the strongest harmonic of a complex tone with a lower pitch.

Our ability to segregate sounds may seem so obvious that we take it for granted. However, a simple demonstration of the effect can show how striking it can be. For example, if a fluent speech or music signal is alternated about three times per second with more intense wide-band noise, we hear the speech or music as disturbed by the noise bursts but still as a continuous signal. If the noise bursts are replaced with silent intervals, a completely different impression results: Now the speech or music is heard as a mutilated signal, an impression that immediately disappears if the noise is reintroduced. Apparently, silent intervals are perceived as belonging to the stimulus itself, whereas noise bursts are perceived as foreign sounds (i.e., produced by a different source). As we have seen, even intelligibility improves when the silent intervals are filled with noise.
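Readers who wish to try this demonstration can construct the two stimuli with a few lines of code. The following is a minimal sketch under stated assumptions: a 440-Hz tone stands in for the speech or music signal, each cycle is split evenly between signal and filler, and the noise is uniform white noise with twice the peak amplitude of the tone. The function name `gate_signal` and all parameter values are hypothetical choices for illustration.

```python
import math
import random


def gate_signal(signal, sample_rate, cycles_per_second=3.0, filler="noise",
                noise_gain=2.0, seed=0):
    """Replace the second half of every cycle with noise or with silence.

    signal: list of samples; filler: "noise" or "silence".
    The first half of each cycle keeps the original samples.
    """
    rng = random.Random(seed)                 # fixed seed for repeatable noise
    period = sample_rate / cycles_per_second  # samples per full cycle
    out = []
    for n, x in enumerate(signal):
        in_gap = (n % period) >= period / 2
        if not in_gap:
            out.append(x)                     # signal is left untouched
        elif filler == "noise":
            out.append(noise_gain * rng.uniform(-1.0, 1.0))
        else:
            out.append(0.0)                   # silent interval
    return out


sample_rate = 8000
# One second of a 440-Hz tone as a stand-in for speech or music.
tone = [math.sin(2 * math.pi * 440 * n / sample_rate) for n in range(sample_rate)]

with_noise = gate_signal(tone, sample_rate, filler="noise")
with_silence = gate_signal(tone, sample_rate, filler="silence")
```

Writing the two lists to sound files and listening to them in turn should give a first impression of the contrast described above, although a recording of fluent speech in place of the tone comes closer to the demonstration in the text.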

With the phenomena discussed in the previous chapter, but still more with the discoveries reviewed in this chapter, we are far removed from the primitive picture of listening that was current half a century ago. The tacit supposition that a complete formulation of auditory psychophysics could be obtained by studying the perception of single sinusoidal tones has been replaced by the view that hearing is primarily typified by organizational characteristics. Single sounds “make sense” insofar as they are parts of a meaningful structure, and the system focuses all its efforts on finding this structure. This means that sounds are not accepted on their acoustic face value, as a mechanical sound analyzer would do, but are assumed to be a probable mixture of different messages from different sources, to be unraveled as effectively as possible. Audition is controlled by highly sophisticated principles, where the context of a sound element appears to be at least as important as the sound itself.

We have described this conclusion as the result of experimental evidence obtained with tones and noise bursts, which are still rather abstract stimuli. These insights are useful as a basis for exploring how speech is perceived, the topic of the next two chapters.

CONCLUSIONS

The previous chapter focused on spectral factors that contribute to the unique way in which tone complexes are analyzed, and this chapter considered temporal factors involved in sound analysis. We saw that the auditory system can process a mixture of multiple sound streams that are partially masking each other, as in a concert or a cocktail party, as if they were never superimposed. Not only are the mutually interfering sound fragments sorted according to their sources, but the inaudible parts are restored as convincingly as if they had never been masked. This remarkable achievement reveals that auditory processing is effectively designed for its everyday task of segregating, and identifying, the multiple sounds in our environment. The listener requires a reliable picture of the acoustic surround in order to react appropriately, and active perceptual processes are optimally adapted to deliver this information.
