
requirement is to provide a good match to average listener recognition rates or matching listener responses at the level of individual tokens (Cooke, 2009).

The remainder of this chapter is organised as follows. Section 2.2 provides an overview of the perceptual strategies that human listeners use to compensate for additive noise, as well as the effects of additive noise on speech intelligibility. We then survey the literature on models of speech intelligibility: Sections 2.3 and 2.4 describe macroscopic and microscopic intelligibility models, respectively. The chapter ends with a summary in Section 2.5.

2.2 The Perception of Speech in Noise

This section has two aims: first, to identify the perceptual strategies that human listeners adopt in the presence of additive noise, so that the effects of masking on speech intelligibility can be better understood; and second, to describe these effects. Section 2.2.1 describes the human listening strategies, and Section 2.2.2 gives an overview of the differential effects of energetic and informational masking on speech intelligibility, where both terms are defined.

2.2.1 Human Listening Strategies in Noise

In typical listening conditions, sounds may reach the listener’s ear as a mixture of different acoustic sources. To identify individual sound patterns, the incoming auditory information needs to be organised first, and the right subset then assigned to each individual sound, in order to form an accurate representation of each. Bregman (1990) named this mechanism ‘auditory scene analysis’ (ASA), in which the acoustic input is grouped and segregated into separate mental representations called auditory streams.

Bregman (1990) distinguishes between two types of auditory grouping: simultaneous grouping and sequential grouping. Simultaneous grouping refers to the process of grouping units that occur at the same time but in different spectral bands. In contrast, sequential grouping is the process of grouping sound units that occur sequentially in time but possibly in the same spectral band.

Table 2.1: Summary of potential masking effects for listeners, including native and non-native listeners, as reported in Cooke et al. (2008).

Energetic masking (EM)        Informational masking (IM)
i) partial information        i) misallocation of audible masker components to target
                              ii) competing attention of masker
                              iii) higher cognitive load
                              iv) interference from ‘known language’ masker

In addition to identifying the target source, human listeners adopt a strategy named glimpsing that plays a role in source separation. Cooke (2003) defines glimpsing as the ability to extract spectro-temporal elements in which the degraded speech signal is less masked and, as a result, less distorted. In fact, human listeners utilise the local high signal-to-noise ratio (SNR) elements of the noisy signal, i.e., glimpses, to obtain useful information. For this reason, stationary noises are stronger maskers than competing speakers, since the latter present more glimpses to the listener (Festen and Plomp, 1990).
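The glimpsing account in Section 2.2.1 can be illustrated with a minimal sketch. Assuming that speech and masker energies are available separately as small band-by-frame grids, the hypothetical function `glimpse_proportion` counts the cells whose local SNR exceeds a threshold. The function name, the toy energy values, and the 3 dB default threshold are assumptions made for illustration only and are not taken from the studies cited in this section.

```python
import math


def glimpse_proportion(speech_energy, noise_energy, threshold_db=3.0):
    """Fraction of spectro-temporal cells whose local SNR exceeds threshold_db.

    Both arguments are equally shaped lists of lists (bands x frames) holding
    non-negative energies. The 3 dB default is an illustrative assumption.
    """
    glimpsed = 0
    total = 0
    for speech_band, noise_band in zip(speech_energy, noise_energy):
        for s, n in zip(speech_band, noise_band):
            total += 1
            if s <= 0.0:
                continue  # no speech energy in this cell, nothing to glimpse
            if n <= 0.0:
                glimpsed += 1  # unmasked speech always counts as a glimpse
                continue
            local_snr_db = 10.0 * math.log10(s / n)
            if local_snr_db > threshold_db:
                glimpsed += 1
    return glimpsed / total if total else 0.0


# Toy example (assumed values): 2 bands x 4 frames of constant speech energy.
speech = [[1.0, 1.0, 1.0, 1.0],
          [1.0, 1.0, 1.0, 1.0]]

# Stationary masker: the same energy in every cell.
stationary = [[1.0, 1.0, 1.0, 1.0],
              [1.0, 1.0, 1.0, 1.0]]

# Fluctuating masker with the same mean energy, but with dips in time,
# loosely mimicking the pauses of a competing talker.
fluctuating = [[1.9, 0.1, 1.9, 0.1],
               [1.9, 0.1, 1.9, 0.1]]

stationary_glimpses = glimpse_proportion(speech, stationary)
fluctuating_glimpses = glimpse_proportion(speech, fluctuating)
print(stationary_glimpses, fluctuating_glimpses)
```

In this toy setting both maskers have the same average energy, yet only the fluctuating one leaves cells above the threshold, which is in line with the argument that competing speakers offer more glimpses than stationary noise.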

2.2.2 Energetic Masking and Informational Masking

When noise interferes with a speech signal (i.e., the target), it can produce two types of masking, ‘energetic’ and ‘informational’, either of which can reduce intelligibility. Energetic masking (EM) occurs in the periphery of the auditory system when the speech energy in some spectro-temporal region is rendered inaudible by high noise energy. Informational masking (IM) refers to target and masker competition that occurs in more central portions of the auditory system (Durlach et al., 2003). In fact, it is a ‘catch-all’ term that covers any reduction in intelligibility once energetic masking in the auditory periphery has been accounted for (Cooke et al., 2008; Durlach, 2006).

As energetic masking increases, speech intelligibility decreases. Energetic masking results in the loss of speech evidence in spectro-temporal regions, which may or may not include important speech features that help to discriminate between speech classes.

Informational masking has multiple potential aspects, which are summarised in Table 2.1 as reported in Cooke et al. (2008). The first is misallocation (i.e., uncertainty about which audible components belong to the target source), which covers two situations: (i) the listener uses audible elements from the masker, leading to misidentification of the target, or (ii) the listener assigns target elements to the masker, which likewise leads to erroneous identification. The effect of informational masking has often been studied using speech-like maskers (i.e., maskers containing speech material) (Brungart, 2001; Freyman et al., 2004). For example, Brungart (2001) stated that, in the presence of a single competing talker, words belonging to the masker are frequently reported as part of the target. Simpson and Cooke (2005) also found a substantial effect of informational masking on speech intelligibility when speech was presented in N-talker babble over a wide range of values of N. Speech-like maskers are therefore more likely to yield this type of masking through misallocation. Cooke et al. (2008) indicated that misallocation could apply to speech sub-units of any size, including sub-units smaller than words or phonemes. Misallocation may also lead listeners to report a sound or word that is part of neither the target nor the speech-like masker (e.g., the aspiration that follows a plosive could be perceived as the voiceless glottal fricative /h/) (Cooke et al., 2008).

A further aspect associated with informational masking is the higher cognitive load that often occurs when processing a signal containing multiple components (Cooke et al., 2008). Since both target and masker might contain important components, it is plausible that processing resources are shared between the two. This often occurs in the presence of a competing speech masker and can result in a failure to attend to the target. Darwin and Hukin (2000) investigated how differences in properties such as fundamental frequency (f0), vocal tract length, and spatial cues affect which of two competing sentences is more likely to be attended to. Cooke et al. (2008) also stated that a higher cognitive load is more likely to cause difficulties in tracking the target source, especially if attention draws on limited resources (e.g., Kahneman, 1973).

An additional effect of informational masking, according to Cooke et al. (2008), may arise from the language of the masking talker and whether it is known to listeners. Several recent studies have investigated the effect of the language of the masker on the intelligibility of the target sentence (e.g., García Lecumberri and Cooke, 2006; Rhebergen et al., 2005). Rhebergen et al. (2005) reported significantly poorer (i.e., higher) speech reception thresholds for Dutch sentences presented in competing Dutch speech than for the same sentences presented in competing Swedish speech. García Lecumberri and Cooke (2006) used a consonant identification task in a vowel context and demonstrated that monolingual English listeners were better at identifying the consonant when the language of the competing talker was Spanish rather than English. However, Spanish listeners with English as their second language performed equally well in the presence of maskers in either language (i.e., English or Spanish).