
4.1 Human Audition

4.1.2 Perception

Once a sound has been detected and encoded as a neural signal, the auditory pathways in the brain analyze the sensory data to produce the experience of sound. The following sections describe the psychoacoustic parameters of that experience as well as how the brain tends to create discrete sound objects from the neural spectrum encoded by the ears.

Parameters of psychoacoustics

Loudness is the perceptual equivalent of intensity, but is not linearly related to it. In fact, loudness is both frequency and duration dependent. Sounds in the 1 kHz to 5 kHz range are perceived as louder than those outside this frequency band when played in isolation in a free field. Moreover, a tenfold increase in sound intensity within an auditory filter is perceived as only a doubling in loudness. In other words, ten sources producing sound at approximately the same frequency or with similar spectra will only sound twice as loud as one source alone. Loudness is also duration dependent in that sounds lasting less than 1 second are experienced as being louder than sounds lasting more than 1 second (Gaver, 1997).
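The tenfold-to-doubling rule above can be written as a small calculation. The sketch below assumes the rule holds exactly across the whole range (equivalent to a power law with an exponent of about 0.3 on intensity) and that the intensities of similar sources simply add; the function name `loudness_ratio` is illustrative and not taken from the cited sources.

```python
import math


def loudness_ratio(intensity_ratio: float) -> float:
    """Perceived loudness ratio, assuming each tenfold increase
    in intensity doubles the loudness."""
    if intensity_ratio <= 0:
        raise ValueError("intensity ratio must be positive")
    # log10 counts the number of tenfold steps; each step doubles loudness.
    return 2.0 ** math.log10(intensity_ratio)


# Assumption: n similar sources produce n times the intensity of one source.
for n_sources in (1, 10, 100):
    print(n_sources, "sources ->", round(loudness_ratio(n_sources), 2), "x loudness")
```

Under these assumptions, one hundred similar sources would be heard as only about four times as loud as a single source.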

The perception of pitch varies logarithmically with frequency. The relationship becomes linear when pitch intervals are expressed in units of cents. One hundred cents is the equivalent of a semitone, the interval between two musical notes such as C and C# or E and F. Twelve hundred cents is the equivalent of a musical octave, the interval between a fundamental and its second harmonic (a doubling of frequency). Overall, the human auditory system can distinguish approximately 1500 separate pitches thanks to the sharpening mechanism of the cochlea (Nave, 2006). Pitch is also dependent on sound loudness. Sounds above 2 kHz appear to rise in pitch when their intensity is increased while those below 2 kHz seem to drop with an increase in intensity (Nave, 2006).
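The cents scale can be made concrete with a short calculation. The sketch below uses the standard definition of 1200 cents per doubling of frequency; the function name `interval_in_cents` and the 440 Hz reference tone are illustrative choices, not values from the cited sources.

```python
import math


def interval_in_cents(f_low_hz: float, f_high_hz: float) -> float:
    """Size of the pitch interval between two frequencies, in cents."""
    if f_low_hz <= 0 or f_high_hz <= 0:
        raise ValueError("frequencies must be positive")
    # 1200 cents per octave, and an octave is a doubling of frequency.
    return 1200.0 * math.log2(f_high_hz / f_low_hz)


reference_hz = 440.0                                   # illustrative reference tone
semitone_up_hz = reference_hz * 2.0 ** (1.0 / 12.0)    # equal-tempered semitone

print(round(interval_in_cents(reference_hz, semitone_up_hz), 3))
print(round(interval_in_cents(reference_hz, 2.0 * reference_hz), 3))
```

Equal steps in cents correspond to equal frequency ratios rather than equal frequency differences, which is why the scale restores a linear description of perceived pitch.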

The position of a sound relates it to the location of its source in space. As mentioned earlier, the structures of the outer ear play a fundamental role in shaping the sound spectrum such that later processing by the brain can determine source azimuth on the medial plane, its elevation above that plane, and its distance from the head (see Figure 4.5). The resolution for localization in azimuth is about 1-2° of arc in the direction of the nose (0° azimuth) and 5-6° on each side of the head (±90° azimuth) for pure tones (Walker and Kramer, 2004). Discrimination of sounds in front of the listener from those behind is poor without head movement as the phase and intensity differences are symmetric about the dorsal plane (Gorny, 2000). An estimate of elevation resolution is not as readily available as that for azimuth, but the results from at least one recent study suggest 3° of arc over the range -40° to 90° (slightly below to directly above the head) for sounds having some frequency content between 6 kHz and 10 kHz (Susnik et al., 2005). The distance to a sound source is determined by its intensity, the attenuation of its high frequencies due to damping caused by air, and reverberations in the listening environment. In all three dimensions, localization error is reduced when using broadband sounds instead of single frequency tones (Neuhoff, 2004).

Figure 4.5: Resolution and axis for sound localization in azimuth (left) and elevation (right).

Timbre is the perception of the spectrum of a complex sound that allows a listener to distinguish two sounds having the same pitch and loudness (Gaver, 1997). The timbre of a sound is largely determined by its harmonic content and its dynamic properties over time such as periodic fluctuations in pitch (vibrato), fluctuations in loudness (tremolo), and the suddenness of its onset and offset (attack and decay). A sound must persist for at least 60 ms for its timbre to be fully determined, in contrast to the shorter 10-30 ms requirement for discrimination of its pitch. This extended duration requirement is consistent with the theory that timbre is processed at higher levels of the brain than other properties. These areas need more samples over time in order to identify long-term spectral and dynamic properties (Nave, 2006).

The rhythm of a sound describes its pattern of onsets and terminations over time while its tempo describes the rate at which the pattern repeats. Like timbre, tempo and rhythm are concepts requiring integration at higher levels of the brain since they are inherently measures of changes over long periods of time. Based on the perceptual processor cycle time of 70 ms given by the Model Human Processor (Card, 1983), detecting individual beats in tempos up to 840 beats per minute should be possible.
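The arithmetic behind this tempo bound is a unit conversion, sketched below under the assumption that one beat can be detected per perceptual processor cycle. The function name `max_tempo_bpm` is illustrative. A 70 ms cycle gives roughly 857 beats per minute; the figure of 840 quoted above is consistent with rounding down to 14 whole cycles per second (14 × 60), although the source does not state how the value was derived.

```python
def max_tempo_bpm(cycle_time_ms: float) -> float:
    """Upper bound on detectable tempo, assuming one beat can be
    resolved per perceptual processor cycle."""
    if cycle_time_ms <= 0:
        raise ValueError("cycle time must be positive")
    ms_per_minute = 60_000.0
    return ms_per_minute / cycle_time_ms


print(round(max_tempo_bpm(70.0), 1))   # cycle time from the Model Human Processor
```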

Harmony is the vertical dimension of music characterized by the use of simultaneous pitches called chords which are perceived as single entities. Melody is the horizontal dimension of music defined by series of sounds varying in pitch, loudness, timbre, and duration grouped as a succession of conceptual entities. Consonance is produced by a chord that is stable, harmonized, and often described as pleasing to hear while dissonance is formed by a chord that is unstable, anharmonic, and usually said to be unpleasing to the ear. The feelings produced by consonant and dissonant sounds depend on the culture as well as the style of the music. In most music, however, harmonic dissonance is introduced to create tension in the piece and later resolved to a consonance as a means of release. A dissonance left unresolved has a tendency to leave a listener expecting the consonant release while a piece consisting only of consonant harmonies is often heard as boring (Burns, 1999).

Auditory Scene Analysis

The neural signal produced by the cochlea represents the entire frequency spectrum of sound detected by the ear. This spectrum might, for instance, include components from many sound sources such as a cell phone ringing, a radio playing, a car horn honking, and the wind rushing by as a listener drives his car. Yet, quite amazingly, a listener is still able to distinguish the voice of his passenger from all of these other sounds. How the auditory system accomplishes this feat is described by the theory of auditory scene analysis developed by Alfred Bregman (Bregman, 1990). Understanding the predictions made by this theory is a key aspect in designing an auditory display that can present more than one channel of information simultaneously.

Auditory scene analysis is the process by which perceptual streams are formed. A stream is a perceptual grouping of the parts of the encoded spectrum that “go together” (Bregman, 1990). In streaming, portions of the spectrum are integrated into discrete perceptual objects based on their likeness in terms of psychoacoustic parameters. Conversely, portions of the spectrum dissimilar in terms of the psychoacoustic parameters are segregated into separate streams. The streams that result from this analysis are an approximate representation of the discrete sound sources and events that produced the spectrum encoded by the cochlea.
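A toy illustration of integration by similarity may help fix the idea. The sketch below groups a sequence of pure tones into streams using pitch proximity alone. It is not Bregman's model or any published algorithm: the greedy rule, the function name `assign_streams`, and the 5-semitone threshold are all assumptions made for illustration, and real streaming depends on many more cues, including timing.

```python
import math


def assign_streams(tone_freqs_hz, max_jump_semitones=5.0):
    """Greedy toy grouping: each tone joins the stream whose most recent
    tone is nearest in pitch, provided the jump is within the threshold;
    otherwise the tone starts a new stream."""
    streams = []
    for freq in tone_freqs_hz:
        best_stream, best_distance = None, None
        for stream in streams:
            # Pitch distance to the stream's latest tone, in semitones.
            distance = abs(12.0 * math.log2(freq / stream[-1]))
            if distance <= max_jump_semitones and (
                best_distance is None or distance < best_distance
            ):
                best_stream, best_distance = stream, distance
        if best_stream is None:
            streams.append([freq])        # dissimilar: segregate
        else:
            best_stream.append(freq)      # similar: integrate
    return streams


# Alternating low and high tones (hypothetical values, in Hz).
sequence = [400.0, 1600.0, 420.0, 1650.0, 410.0, 1580.0]
for stream in assign_streams(sequence):
    print(stream)
```

With these hypothetical inputs, the alternating tones separate into one low stream and one high stream, whereas a sequence of tones that are all close in pitch would remain a single stream.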

At least three basic tenets underlie this theory. First, Gestalt principles govern the parsing process. Similarity, proximity, continuity, and common trajectory determine what constitutes a stream (Ueda et al., 2005; Bregman, 1990). Second, portions of the available spectrum tend to be allocated exclusively to streams. Information from one sound source is rarely a contributor to more than one perceptual object (Bregman, 1990). Third, multiple streams may result from the segregation process, but only one stream, part of a stream, or grouping of streams can be attended to consciously at a time. Attention may select only one object as figure while the rest become background (Valkenburg and Kubovy, 2004; Barber et al., 2003; Duncan et al., 1997).

Primitive processing

Conceptually, the scene analysis process takes place in two stages. The first stage, primitive analysis, is an innate, bottom-up process whereby streams are produced according to correlations of spectral and temporal cues (Huron, 1991). Primitive processing occurs pre-attentively and its resulting streams are held in a temporary auditory image store (Nager et al., 2003). The streams in this store are those that can be later selected for attentive processing (Valkenburg and Kubovy, 2004). The diagram shown in Figure 4.6 depicts auditory scene analysis up through this phase.

Figure 4.6: Sounds from the environment are detected by the ears and encoded as a neural spectrum. Primitive auditory scene analysis forms perceptual streams by integrating similar spectral cues in frequency and over time. Based on a diagram from Valkenburg and Kubovy (2004).

Primitive analysis occurs along two dimensions: frequency and time. At any given point in time, the primitive process segregates the spectrum encoded by the cochlea into streams representing simultaneous, yet distinct sound sources. It is this operation that allows a listener to identify the sound of the ringing cell phone and the car radio as two distinct sources. Over time, portions of the spectrum are integrated into streams representing coherent, persistent sources. This second action is what allows a listener to perceive, for instance, speech in terms of words and phrases instead of a collection of disjoint guttural, throaty, and clicking sounds.

The factors known to influence the segregation of sound into streams are listed in Table 4.1 (Arons, 1992; Bregman, 1990). Varying the parameters in the first column over time increases the chances of sound from a single source splitting into more than one perceptual stream. Making concurrent sound sources dissimilar in terms of the parameters in the second column decreases the likelihood that the independent sources will be perceived as one. The occurrence of sudden changes in these parameters or of extended periods of silence (> 1 second) causes the immediate recalculation of streams (Cusack et al., 2004).

Over time                    In frequency
---------------------------  -------------------------------
Frequency                    Frequency
Pitch                        Onset/offset synchrony
Timbre                       Regularity of spectral spacing
Center frequency (noise)     Binaural frequency matches
Amplitude                    Harmonic relations
Location                     Parallel amplitude modulation
Shortening of gaps           Parallel gliding of components

Table 4.1: Factors affecting primitive auditory scene analysis (Bregman, 1990). The ordering in the table is arbitrary, as the relative importance and interactions of the parameters are not well established at present.

Presenting multiple channels of information simultaneously in an auditory display requires a design that accounts for these innate rules. A multi-channel display should separate channels in space, pitch, timbre, loudness, starting time, and trajectory to assist their segregation into disparate, non-conflicting streams. Moreover, the same

high tempo to ensure the channel stream remains fused. To force a reset of stream calculation, a display need only make a sudden change to its parameters.