Seeing articulatory movements influences perception of auditory speech. This is often reflected in a shortened latency of auditory event-related potentials (ERPs) generated in the auditory cortex. The present study addressed whether this early neural correlate of audiovisual interaction is modulated by attention. We recorded ERPs in 15 subjects while they were presented with auditory, visual, and audiovisual spoken syllables. Audiovisual stimuli consisted of incongruent auditory and visual components known to elicit a McGurk effect, i.e., a visually driven alteration in the auditory speech percept. In a Dual task condition, participants were asked to identify spoken syllables whilst monitoring a rapid visual stream of pictures for targets, i.e., they had to divide their attention. In a Single task condition, participants identified the syllables without any other tasks, i.e., they were asked to ignore the pictures and focus their attention fully on the spoken syllables. The McGurk effect was weaker in the Dual task than in the Single task condition, indicating an effect of attentional load on audiovisual speech perception. Early auditory ERP components, N1 and P2, peaked earlier to audiovisual stimuli than to auditory stimuli when attention was fully focused on syllables, indicating neurophysiological audiovisual interaction. This latency decrement was reduced when attention was loaded, suggesting that attention influences early neural processing of audiovisual speech. We conclude that reduced attention weakens the interaction between vision and audition in speech.
Background: During speech perception, the ability to integrate auditory and visual information causes speech to sound louder and be more intelligible, and leads to quicker processing. This integration is important in early language development, and also continues to affect speech comprehension throughout the lifespan. Previous research shows that individuals with autism have difficulty integrating information, especially across multiple sensory domains. Methods: In the present study, audiovisual speech integration was investigated in 18 adolescents with high-functioning autism and 19 well-matched adolescents with typical development using a speech-in-noise paradigm. Speech reception thresholds were calculated for auditory-only and audiovisual matched speech, and lipreading ability was measured. Results: Compared to individuals with typical development, individuals with autism showed less benefit from the addition of visual information in audiovisual speech perception. We also found that individuals with autism were significantly worse than those in the comparison group at lipreading. Hierarchical regression demonstrated that group differences in the audiovisual condition, while influenced by auditory perception and especially by lipreading, were also attributable to a unique factor, which may reflect a specific deficit in audiovisual integration. Conclusions: Combined deficits in audiovisual speech integration and lipreading in individuals with autism are likely to contribute to ongoing difficulties in speech comprehension, and may also be related to delays in early language development. Keywords: Speech reception threshold, speech in noise, audiovisual speech integration, autism. Abbreviations: SNR: speech-to-noise ratio; SRT: speech reception threshold.
To keep the experimental design simple, we only examined two auditory dimensions – pitch changes conveyed through F0, where we suspected our groups would show a difference, and duration, where we believed they would not. Outside the laboratory there are other cues that individuals could take advantage of, such as vowel quality, which is also associated with phrase boundaries, and pitch accents (Sluijter & van Heuven, 1996; Streeter, 1978). Accents also carry visual correlates, such as head movements, beat gestures, and eyebrow raises (e.g. Beskow & Granström, 2006; Flecha-García, 2010; Krahmer & Swerts, 2007), which individuals may also be able to use to compensate for their pitch impairment in audiovisual speech perception. Moreover, top-down processes such as the use of lexical knowledge can also help disambiguate unclear speech (Connine & Clifton, 1987; Ganong, 1980), and talker identity cues from the visual modality help listeners to disambiguate acoustic-phonetic cues (Zhang & Holt, 2018). Individuals may be able to modify the extent to which they make use of any or all of these different sources of information in response to their idiosyncratic set of strengths and weaknesses. For example, individuals with widespread auditory processing problems may rely more heavily on top-down lexical information, or visual cues.
variability, especially among young infants. In the present study we tested the hypothesis that this variability results from individual differences in the maturation of audiovisual speech processing during infancy. A developmental shift in selective attention to audiovisual speech has been demonstrated between 6 and 9 months with an increase in the time spent looking to articulating mouths as compared to eyes (Lewkowicz & Hansen-Tift, 2012; Tomalski et al., 2012). In the present study we tested whether these changes in behavioural maturational level are associated with differences in brain responses to audiovisual speech across this age range. We measured high-density event-related potentials (ERPs) in response to videos of audio-visually matching and
sharing some common features will pop up. Maddieson claimed that languages are constructed from only a handful of phonetic categories (on average around 30), although vocabularies contain tens of thousands of words (cited in Cutler & Broersma, 2005). Cutler and Broersma further gave an example of how a given word is finally selected and recognized by the listener. As pointed out above, words with certain common features may be activated at the same time, and shorter words will be embedded within longer words. When listeners begin to determine the intended word, they can therefore be misled. For example, the intended word star may be heard as start, stark, starve, or startling. The input star activates all the words with formal similarity, and all these candidates then enter a process of competition. The incoming speech information plays a vital role in settling this competition. Ellis's spreading-activation network is similar to the activation/competition model. When listeners hear a word, at first they have no cues about it in their mind. Ellis (1995) argued that the more information listeners can associate with the missing term, the more knowledge they can activate to determine the given word; this is how the network of associations spreads, until the target word is finally retrieved. However, this is only one aspect of spoken word recognition. Other researchers hold that listeners do not process speech sounds linearly. Liberman (cited in Miller & Eimas, 1994) reported that a single segment of the acoustic signal does not carry information for a single phonetic segment alone; instead, it gives useful hints about more than one phonetic segment, and conversely, the information for a given phonetic segment is often distributed across more than one acoustic segment. In this way, speech perception is largely context-dependent.
If the incoming sound is /k/, the word stark is preferred over all the other candidates, but listeners are prone to mishear or misinterpret phonetic segments in continuous speech. Like a circle, later speech information is responsible for revising the previous decision. The two views of speech perception above reflect the discrepancy between the two chief models of speech recognition: the Cohort Model of word recognition and the TRACE Model. Experimental data from the Cohort Model suggest that context plays only a minor and limited role in recognition and that the speech signal itself carries enough information to select a single lexical entry, while TRACE lays much stress on the claim that word processing is directly influenced by top-down processes.
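The activation/competition process described above can be sketched as a toy program. This is only a minimal illustration of the pruning of a candidate set by incoming phonetic input, not the actual equations of the Cohort Model or TRACE; the word list and matching rule are illustrative assumptions:

```python
# Toy sketch of activation/competition in spoken word recognition:
# candidate words sharing the input's onset are activated together,
# then pruned as further phonetic input arrives. The candidate set
# and the prefix-matching rule are illustrative simplifications.

CANDIDATES = ["star", "start", "stark", "starve", "startling"]

def active_candidates(heard: str) -> list[str]:
    """Return the candidate words still consistent with the input so far."""
    return [w for w in CANDIDATES if w.startswith(heard)]

# After hearing "star", all five candidates compete.
print(active_candidates("star"))
# The next sound (here spelled "k") settles the competition.
print(active_candidates("stark"))  # only 'stark' survives
```

The same pruning logic runs in reverse when a segment is misheard: later input can leave the candidate set empty, forcing the listener to revise an earlier decision, as the passage above notes.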
Visual attention, specifically during AVSI, has recently been investigated in detail in 6- to 9-month-old infants using the paradigm developed by Kushnerenko, Tomalski, and colleagues (Tomalski et al., 2012; Kushnerenko et al., 2013). In this eye-tracking (ET) paradigm, faces articulating either /ba/ or /ga/ syllables were displayed along with the original auditory syllable (congruent VbaAba and VgaAga), or a mismatched one (incongruent VbaAga and VgaAba). By measuring the amount of looking to the eyes and mouth of articulating faces, it was found that younger infants (6–7 months) may not perceive mismatching auditory /ga/ and visual /ba/ (VbaAga) cues in the same way as adults, that is, as the combination /bga/ (McGurk and MacDonald, 1976), but process these stimuli as a mismatch between separate cues and “reject” them as a source of unreliable information, and therefore allocate less attention to them. Using the same stimuli, Kushnerenko et al. (2013) also found that the AVMMR brain response to these stimuli showed large individual differences between 6 and 9 months of age, and that these differences were strongly associated with differences in patterns of looking to the speaker’s mouth. Interestingly, the amplitude of the AVMMR was inversely correlated with looking time to the mouth, which is consistent with the results found by Wagner et al. (2013). These results suggest that at this age sufficient looking toward the eyes may play a pivotal role for later communicative and language development. Given these results, and the fact that infants as young as 2–5 months of age are able to match auditory and visual speech cues (Kuhl and Meltzoff, 1982; Patterson and Werker, 2003; Kushnerenko et al., 2008; Bristow et al., 2009), we hypothesized that individual differences in visual attention and brain processing of AV speech sounds should predict language development at a later age.
GUBT was developed in 14 different languages; in each language, sentence lists balanced for phonetics and difficulty were created, the performance-intensity function was estimated, and scoring rules and reliability were established. Considering that most currently available tests for evaluating speech perception in individuals with hearing loss were standardized in a language other than Brazilian Portuguese, the development of the HINT in Brazilian Portuguese is an advance in the evaluation of speech perception, providing parameters for both clinical and research analysis (9).
Tomalski, P., Ribeiro, H., Ballieux, H., Axelsson, E.L., Murphy, E., Moore, D.G., & Kushnerenko, E. (2012) Exploring early developmental changes in face scanning patterns during the perception of audiovisual mismatch of speech cues. Eur J Dev Psychol, 1–14.
Individual differences in the ability to establish and direct attention to auditory objects could reflect variability in both bottom-up and top-down processes. On the one hand, listeners whose encoding of a particular auditory dimension is imprecise or blurred may struggle to separate perceptual objects which differ along this dimension (Shinn-Cunningham and Best, 2008). The theory that impaired perceptual precision can worsen informational masking is supported by studies showing that listeners with hearing loss are less able to separate streams based on their perceptual characteristics (Grose and Hall, 1996; Mackersie et al., 2001) and show less attentional modulation of cortical responses to sound (Dai et al., 2018). This notion is also supported by links between the ability to perceive speech in competing speech and the robustness of subcortical encoding of sound (Ruggles et al., 2012), as well as between temporal coding fidelity and spatial selective listening (Bharadwaj et al., 2015). On the other hand, the ability to direct and maintain attention to a particular auditory dimension—and, specifically, to a particular range of values along an auditory dimension—may also be a foundational skill for speech perception in complex environments. This theory is supported by findings that reported listening difficulties in complex environments are linked to impaired attention switching (Dhamani et al., 2013; Sharma et al., 2014) and that the ability to understand speech in multi-talker babble or noise correlates with performance on tests of attentional control (Fullgrabe et al., 2015; Heinrich et al., 2015; Neher et al., 2009; Neher et al., 2011; Oberfeld and Klockner-Nowotny, 2016; Yeend et al., 2017). However, other studies have reported a lack of relationship between attentional control and speech-in-speech perception (Gatehouse and Akeroyd, 2008; Heinrich et al., 2016; Schoof and Rosen, 2014).
Catalan and Spanish are both Romance languages and have many similarities, but their pronunciations differ significantly. In particular, Catalan has two mid vowels with different heights, one high [e] and one low [ε], while Spanish has only one [e] phoneme (which is more open than the Catalan [e]). The [e]–[ε] contrast is used to distinguish between common words in Catalan, e.g., [te] (take) and [tε] (tea), [pera] (Peter) and [pεra] (pear). Our study focused on this [e]–[ε] contrast and assessed its perception by forty bilinguals with different backgrounds: half of our subjects had Spanish-speaking parents, and the other half Catalan-speaking parents. Thus, the latter were exposed to Catalan since birth, whereas the former were exposed to Spanish first, with exposure to Catalan starting in kindergarten or in primary school (at the latest at 6 years of age). To take part in the experiment, the subjects, who were all students at
Methods in (1) establish the direct correspondence between audio descriptors, e.g. Mel Frequency Cepstral Coefficients (MFCC) or Linear Predictive Coding (LPC), and parameters controlling the shape of the mouth (aperture, protrusion, etc.). This correspondence can be learned by neural networks, Gaussian mixture models or vector quantization (e.g. [9, 10]). Methods in (2) are based on mouth shape recognition from the speech signal. The mouth movements are then monitored with a discrete set of positions: the visemes. A viseme is a shape of the mouth associated with particular sounds. Observation vectors provided by a spectral and/or a temporal analysis are used to estimate the parameters of the mouth shape recognition system. Well-known techniques in automatic speech recognition, such as Hidden Markov models (HMM),
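As a toy illustration of the vector-quantization variant mentioned above, the sketch below maps an acoustic feature vector to the viseme whose codebook centroid is nearest. The two-dimensional "features", the centroid values, and the viseme labels are invented placeholders; a real system would use MFCC or LPC vectors and centroids trained from data:

```python
from math import dist

# Toy viseme lookup by vector quantization: each viseme (mouth shape)
# is represented by one codebook centroid in acoustic feature space.
# The 2-D feature vectors and centroids below are invented placeholders,
# not real MFCC values.
CODEBOOK = {
    "open":    (1.0, 0.0),   # e.g. /a/-like frames
    "rounded": (0.0, 1.0),   # e.g. /u/-like frames
    "closed":  (-1.0, 0.0),  # e.g. /m/, /b/ closures
}

def frame_to_viseme(feature):
    """Assign an acoustic frame to the viseme with the nearest centroid."""
    return min(CODEBOOK, key=lambda v: dist(feature, CODEBOOK[v]))

# Classify a short sequence of frames into a viseme track.
sequence = [(0.9, 0.1), (0.1, 0.8), (-0.9, -0.1)]
visemes = [frame_to_viseme(f) for f in sequence]
```

The HMM-based methods the text goes on to describe replace this frame-by-frame nearest-centroid decision with a statistical model of how viseme states evolve over time.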
In other cases, the semantics of embedding entities is truly ambiguous between two categories, irrespective of the complement construction. Perception verbs like see, for instance, can easily mean understand or know. For verbs like these, we have made default classification rules such as the following: 'a perception entity is annotated as such by default; only if an interpretation of direct physical perception is not possible in the given discourse context is it annotated as an attitude knowledge entity.' Moreover, a list was made of all embedding entities and their (default) classification. personal passive report constructions If the embedding entity is a passive verb of speech or thought, as in Xerxes is said to build a bridge, its subject is coreferential with the subject of the complement clause. (This is the so-called Nominative plus Infinitive construction (Rijksbaron, 2002)). What is reported here, of course, is the fact that Xerxes builds a bridge. However, we have decided not to include the subject constituent within the annotated complement in these cases, mainly to warrant consistency with other constructions with coreferential subjects for which it is more natural to exclude the subject from the complement (as in Xerxes promised to build a bridge/that he would build a bridge). There is a similar rule for constructions like δοκεῖ μοι 'X seems to me' and φαίνομαι 'to appear'.
Two types of hybrid models are differentiated: a) models with combined representations, where abstraction occurs over detailed memories of speech episodes; versus b) models with separate representations, where different processing paths exist from the speech signal to word and speaker recognition. To investigate these models, this thesis reports multiple experiments investigating the time-course of the decay patterns of voice effects in repetition priming. Results from auditory lexical decision indicate that voice information only affects the speed of future perceptual processes within a short time window: until around three items intervene between prime and target. This finding clarifies previous results, which found no long-lasting effects, by providing an exact time-course of voice information’s impact. Nevertheless, the results reported here differ from the predictions of studies investigating recognition accuracy, where long-lasting effects are commonly found. To address these differences, additional experiments using continuous and blocked word recognition paradigms were conducted. Again, talker-specific effects only persist within the same short time window, while abstract repetition priming effects persist much longer. By de-emphasizing the contribution of voice information, these findings assert the importance of abstract linguistic representations in hybrid models with separate representations.
Resonant frequencies (F1, F2…) are not only relevant for the identification of critical phoneme information important for speech comprehension, but also for the storage of the characteristic voice quality or timbre of a speaker (Jacobsen et al., 2004). The rate of vocal fold vibration determines the perceived pitch of a voice (Lattner et al., 2005). Listeners often associate lower frequencies with stereotypically male traits (Pisanski et al., 2016). Interestingly, lower pitch encodes not only male gender; its perception is also important for emotion recognition, as described in the previous section on vocal affect information. For example, sad emotions tend to be communicated with a lower tone of voice (Lattner et al., 2005). Neuroimaging studies show that male formant configurations activate the pars triangularis of the left inferior frontal gyrus, whereas female voices seem to recruit more the supra-temporal plane in the right hemisphere, as well as the insula (Lattner et al., 2005). The supra-temporal plane, localized within the auditory cortex, plays a key role in the processing of complex acoustic properties (Tremblay, Baroni, & Hasson, 2013), e.g., speech, and its functioning is crucial for the perception of formant frequencies (Formisano et al., 2008). The infant's preference for female voices could be explained by the higher perceptual salience of the female timbre configuration for the infant's auditory system, which is aroused by high-pitched voices (Lattner et al., 2005).
Michel Chion’s concept of synchresis, a portmanteau combining the words synchronism and synthesis, is useful here. In his book Audio-Vision: Sound on Screen, Chion (1994: 63) defines synchresis as “the spontaneous and irresistible weld produced between a particular auditory phenomenon and the visual phenomenon when they occur at the same time.” He stresses that the possibility of recombination of sound and image is essential to the making of film sound. “For a single face on the screen there are dozens of allowable voices – just as, for a shot of a hammer, any one of hundreds of sounds will do. The sound of an axe chopping wood, played exactly in sync with a bat hitting a baseball, will ‘read’ as a particularly forceful hit rather than a mistake by the filmmakers.” (1994: XVIII) Synchresis fuses sound and image into an audiovisual unit whose logic is accepted as truth and experienced on a visceral level by the viewer. While Chion talks about sound for live action film, this also applies to other kinds of moving image such as animation or live visuals. Especially with visual abstraction, synchresis helps to make it believable and bring it to life. Tight synchronisation of sound to abstract image, even more so when using non- or semi-abstract sounds, also gives hints of meaning and guides the reading of the work. An example of this is my short film RE:AX (2011), where a visually pared-down language of shapes is juxtaposed with a comparatively complex soundtrack. Precise synchronisation between the recognizable sounds of rockets and explosions and the abstract images links the two media and anchors the reading of specific events, while the overall orchestral score guides the emotional response of the viewer. In my film Shift (2012), the synchresis between small objects and ‘gigantic’ sounds exponentially increases the perceived size of the objects on screen.
In Chapters 4 and 6, it was demonstrated that an estimate of the auditory speech envelope could be reconstructed from EEG recorded during silent lipreading with accuracy above chance level. These findings could have implications for the design of future BCI technologies that aim to decode internal speech from the user’s neural recordings. As discussed in the previous chapter, decoding extended passages of covert speech has successfully been demonstrated using intracranial recordings such as ECoG (Martin et al., 2014). Other intracranial BCI approaches have sought to decode imagined speech at the level of phonemes, vowels and words (Guenther et al., 2009, Leuthardt et al., 2011, Kellis et al., 2010, Martin et al., 2016). The findings presented here indicate that EEG could provide a non-invasive, cost-effective solution to decoding imagined thoughts and could be further optimised by utilizing the natural statistics of visual speech input. Indeed, such technology would also have major implications for clinical research in populations that are unable to communicate effectively because of locked-in syndrome, e.g., resulting from amyotrophic lateral sclerosis (Lou Gehrig's disease), traumatic brain injuries and spinal cord injuries. Moreover, this approach to speech decoding would provide a more naturalistic and user-friendly way for patients to communicate their thoughts compared to traditional EEG-based BCI methods that have primarily relied on discrete brain components that can only be elicited by specific target stimuli (Oken et al., 2014, Lesenfants et al., 2014, Combaz et al., 2013) – such methods can be tedious, time-consuming and ineffective (for a review, see Machado et al., 2010).
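Envelope reconstruction of the kind described above is commonly implemented as a backward (stimulus-reconstruction) model: a regularized linear mapping from multichannel EEG to the speech envelope. The sketch below illustrates the idea on synthetic data; the data shapes, noise level, and regularization value are assumptions, and a real pipeline would also include time lags and cross-validation, omitted here for brevity:

```python
import numpy as np

# Backward (stimulus-reconstruction) model sketch: ridge regression
# from synthetic multichannel "EEG" to a target "speech envelope".
rng = np.random.default_rng(0)
n_samples, n_channels = 1000, 8

envelope = rng.standard_normal(n_samples)       # stand-in for the envelope
mixing = rng.standard_normal((1, n_channels))   # per-channel leak weights
eeg = (envelope[:, None] @ mixing
       + 0.5 * rng.standard_normal((n_samples, n_channels)))  # noisy "EEG"

def fit_decoder(X, y, lam=1.0):
    """Ridge solution w = (X'X + lam*I)^(-1) X'y."""
    return np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ y)

w = fit_decoder(eeg, envelope)
reconstruction = eeg @ w
# Decoding accuracy is typically reported as the correlation between
# the reconstructed and actual envelopes.
r = np.corrcoef(reconstruction, envelope)[0, 1]
```

The "above chance level" criterion in the text corresponds to comparing this correlation against a null distribution obtained from mismatched or shuffled envelope/EEG pairings.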
The key contribution of this study is to provide novel evidence that, although PMC makes a necessary contribution to speech perception in some circumstances, these effects do not extend to situations where spoken words must be perceived to allow comprehension; rather, PMC appears to play a critical role only in tasks requiring explicit access to phoneme categories, such as deciding if a /k/ or /p/ was presented. In contrast, some theories advocate a necessary and automatic role for motor speech representations in speech perception more generally, an idea which has received support from the discovery of mirror neurons (Rizzolatti & Craighero, 2004; but see Gallese, Gernsbacher, Heyes, Hickok, & Iacoboni, 2011) and neuroimaging studies showing PMC activation during speech perception (Pulvermuller et al., 2006; Uppenkamp et al., 2006; Wilson et al., 2004). As functional neuroimaging methods cannot confirm that this activity plays a necessary role in speech perception, TMS has been used in several studies to show that stimulation of PMC does disrupt speech perception tasks (D'Ausilio et al., 2009; Mottonen & Watkins, 2009; Sato et al., 2009; Meister et al., 2007; Watkins & Paus, 2004; Watkins et al., 2003; Fadiga et al., 2002). However, all of these TMS studies, as well as the majority of fMRI studies, have used tasks that require explicit access to and/or manipulation of phonemes (e.g., Pulvermuller et al., 2006; Uppenkamp et al., 2006; Wilson et al., 2004). This research cannot demonstrate, therefore, that PMC plays a vital role in speech perception for comprehension. Additionally, evidence from patient studies suggests that motor areas may only be crucial for tasks that require overt segmentation or explicit phoneme awareness and not for speech comprehension (e.g., Rogalsky et al., 2011; Bishop et al., 1990; Basso et al., 1977). However, patients typically have large and variable lesions, and consequently, these studies lack spatial resolution.
Neither functional neuroimaging nor neuropsychological methods are ideally placed to confirm an essential role for a specific region such as PMC in aspects of speech recognition. In the current study, we overcame these limitations through the use of TMS to produce relatively focal disruption of processing within PMC in healthy participants.
advanced services for communication disorders among the Arab countries. With a population of 25 million, there are 14 registered audiologists at the Saudi Speech Pathology and Audiology Association and five facilities that provide audiological services (SSPAA, 2004). The second possible explanation for the limited distribution of speech materials is the difference in dialects between Arab countries. Although Arab countries share the standard written Arabic language, there is a wide range of dialects (Fatihi, 2001). Published speech recognition tests are in Moroccan, Baghdadi, Egyptian or Saudi standard dialect. The possibility of using one test across the Arab countries has not been investigated. Alusi et al. (1974) suggested the possibility of using the word lists they developed in all Arab dialects, since the words were taken from standard Arabic. However, Alusi’s speech materials were recorded in a Baghdad standard dialect. In developing the speech test, Alusi had a limited number of participants (17) representing “several” Arab countries (the authors did not specify which countries), who were young educated adults; the sample did not necessarily represent the large Arabic-speaking population. However, Alusi et al. did attempt to meet the criterion of word familiarity by choosing words from children’s books and newspapers in order to include educated and uneducated populations. They did not describe a specific comparison between participants from different countries to support this argument.
This study showed that individuals with DD benefited substantially less than TD subjects from lip-read information that disambiguates noise-masked speech, regardless of age and SNR. The current results are in line with previous findings published by de Gelder and Vroomen (1998) and Ramirez and Mann (2005), who reported that the processing of synthetic and natural audio-visual consonant-vowel stimuli is atypical in children and adults with DD. The current data extend these findings by showing that the processing of natural audio-visual speech may be atypical in DD individuals as well. Another crucial aspect of the present study is the inclusion of two age cohorts of DD and TD individuals. This enabled us to investigate whether or not the impact of the deficits in audio-visual speech processing is ameliorated during adulthood. We found that the relative impairment of the ability to gain benefit from lip-reading in DD is present in children and adults, despite the fact that the current cohort of DD adults consisted of students with a highly educated background. Taken together, these findings indicate that in DD, despite adequate education and unlike in other neurodevelopmental disorders such as autism spectrum disorder (Brandwein et al., 2013; Foxe et al., 2015; Stevenson, Segers, Ferber, Barense, & Wallace, 2014) and developmental language disorders (Meronen, Tiippana, Westerholm, & Ahonen, 2013), deficits in the processing of audio-visual speech-in-noise do not resolve during adolescence, but persist into adulthood. The current findings are therefore consistent with previous longitudinal studies indicating that dyslexia is a persistent disorder and not merely a condition of transient ‘developmental lag’ (Francis, Shaywitz, Stuebing, Shaywitz, & Fletcher, 1996; Scarborough, 1984).