Audio-visual speech in noise perception in dyslexia

The current findings replicate previous work from Ross et al. (2011) and show that adults experienced more multisensory enhancement from lip-reading than children. Although vocabulary size was slightly lower in children than in adults, there were no significant differences in unisensory recognition performance between the two age groups. It is therefore unlikely that the observed difference in audio-visual gain between children and adults can be accounted for by developmental differences in lexical knowledge. Given the evidence from previous research (Foxe et al., 2015; Ross et al., 2011), it seems likely that the ability to benefit from lip-reading continues to develop into late childhood, both as a function of exposure to audio-visual speech and as a function of the self-production of speech. In addition, there is ample evidence that substantial developmental changes in cognitive functioning occur throughout childhood due to maturational changes in the brain (Fair et al., 2008; Liston, Matalon, Hare, Davidson, & Casey, 2006; Shaw et al., 2008; Somerville & Casey, 2010). These developmental changes are, to some degree, reflected in children's lower scores relative to adults on both the regular-word and pseudoword reading tests. The current findings therefore provide further evidence that a combination of developmental, behavioral, and environmental influences leads to an increasing ability to integrate audio-visual speech signals into unified percepts.

Audio Visual Speech Recognition for People with Speech Disorders

Dysarthria is a motor speech disorder in which normal speech is disrupted by loss of control over the articulators that produce it [3]. Relatively little research has addressed automatic recognition of disordered speech. In [4], the authors described experiments in building HMM-based recognizers for talkers with spastic dysarthria. The authors of [5] presented an automatic recognition system for disordered continuous speech based on an Artificial Neural Network (ANN) approach. Researchers in [6] reported that speaker-independent (SI) systems have low recognition accuracy for dysarthric speakers; speaker-dependent (SD) systems have therefore been investigated, with several papers reporting that this system type is more suitable than SI systems for speakers with disorders, especially in severe cases of dysarthria [6]. An SD system for disordered speakers based on HMMs, Dynamic Bayesian Networks (DBN), and neural networks was presented in [7]. As described, research on automatic recognition of disordered speech has focused on the acoustic signal alone, whose recognition performance degrades in the presence of ambient noise. Visual features extracted from the speaker's lip region, however, have recently been proposed as an additional modality to enhance recognition of normal speech [1, 2]. The advantage of visual-based speech recognition is that it is immune to background acoustic noise [8]. Humans also make use of visual signals to recognize speech [9, 10], and the visual modality carries information that is related to the audio modality [11].

The Effect of Combined Sensory and Semantic Components on Audio Visual Speech Perception in Older Adults

Our results have implications for speech comprehension by older adults in the real world. On the one hand, our results show that older adults perceive both meaningful and non-meaningful AV sentences efficiently, irrespective of the reliability of the information in the visual component. However, older adults are relatively inefficient at accurately recollecting unpredictable (i.e., non-meaningful) sentences when the visual component of the AV input is unreliable. For older adults, unpredictable speech may include novel sentences, sentences with unfamiliar content (such as medical instructions), complex sentences, or sentences with ambiguous meaning. Thus, when such information is presented to an older person, our findings suggest that it will be better remembered if presented in an audiovisual format in which the information from both sensory components is reliable than when the visual component is blurred or otherwise altered (such as when glasses are removed). For example, although speculative, asynchronous AV inputs (as often occur in AV communications technology) may be specifically detrimental to speech recall in older adults. Moreover, unreliable AV speech components may lead to relatively good speech detection in older adults but may impair subsequent recall, possibly leading older adults to fail to act on verbal instructions that were previously presented. Further research is required to elucidate the types of sentences that benefit from reliable AV inputs during speech perception and recall in older adults.

Automatic speech recognition (ASR) technology is now available in virtually all handsets, where keyword spotting plays a vital role. Keyword spotting performance degrades significantly in real-world environments because of background noise. Since visual features are largely unaffected by acoustic noise, they offer a complementary source of information. In this paper, an audio-visual integration scheme is proposed that combines audio features with visual features, using decision fusion to adapt to varying noise conditions. Visual features are extracted as a set of both geometry-based and appearance-based features after facial landmark localization. To avoid similarities among the textons, a spatiotemporal lip feature (SPTLF) is used, which maps the features into an intra-class subspace; the dimensionality of the lip features is then reduced with WPCA. A hybrid HMM-ANN method is proposed for integrating the audio and visual features, with a neural network generating adaptive weights for the integration. A parallel two-step keyword spotting strategy is provided to avoid overlap between audio and visual keywords. Experimental results on the dataset demonstrate that the proposed HMM-ANN method improves performance compared with state-of-the-art networks.
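The core of such a scheme is a decision-fusion rule whose stream weight adapts to the acoustic conditions. The sketch below illustrates the idea under simplifying assumptions: a single sigmoid unit stands in for the weight-generating neural network, and the SNR feature, network parameters, and candidate scores are all hypothetical placeholders rather than values from the paper.

```python
# Minimal sketch of decision fusion with an adaptive stream weight, assuming
# per-word log-likelihoods from separate audio and visual models are already
# available. The tiny "network" weights and the SNR feature are illustrative.
import numpy as np

def adaptive_weight(snr_db, w=0.15, b=-1.5):
    """Map an estimated frame SNR (dB) to an audio stream weight in (0, 1).
    A single sigmoid unit stands in for the weight-generating neural network."""
    return 1.0 / (1.0 + np.exp(-(w * snr_db + b)))

def fuse_scores(audio_loglik, visual_loglik, snr_db):
    """Weighted log-linear combination of audio and visual keyword scores."""
    lam = adaptive_weight(snr_db)
    return lam * audio_loglik + (1.0 - lam) * visual_loglik

# Example: three candidate keywords, clean vs. noisy conditions.
audio_ll = np.array([-42.0, -45.5, -48.0])    # audio HMM log-likelihoods
visual_ll = np.array([-60.0, -55.0, -57.5])   # visual HMM log-likelihoods

for snr in (25.0, 0.0):                        # clean vs. heavy noise
    fused = fuse_scores(audio_ll, visual_ll, snr)
    print(f"SNR {snr:5.1f} dB -> weight {adaptive_weight(snr):.2f}, "
          f"best keyword index {int(np.argmax(fused))}")
```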

An Analysis of Visual Speech Features for Recognition of Non articulatory Sounds using Machine Learning

orientation, and background create a third layer of complexity. For the classification step, audio and video information can be integrated by feature fusion or by decision fusion [14]. The feature fusion technique combines information at the feature level and submits a single combined feature vector to a single classifier. This is generally simple to implement and allows correlation between audio and video to be modeled; the simplest feature fusion method is concatenation of the audio and video feature vectors. Unfortunately, this technique cannot explicitly model the relative reliability of each feature stream, and that reliability may vary significantly even within a single utterance due to constant or instantaneous background noise or channel degradation. In contrast, decision fusion systems assume independence between the two streams and combine the results of separate classifiers for audio and video, offering a mechanism that can model the reliability of each feature stream. These systems usually employ a parallel classifier architecture. The reliability of the audio and video feature streams can be captured by applying weights during the fusion process; the weights may be set globally to fixed values calculated by testing the system to find which weights produce optimal speech recognition [15, 16].
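To make the contrast concrete, here is a small sketch of the two strategies under stated assumptions: the feature dimensions, frame count, classifier posteriors, and the fixed audio weight are illustrative placeholders, not values from the paper.

```python
# Illustrative contrast between feature fusion and decision fusion, assuming
# frame-synchronous audio and video feature matrices of the same length.
import numpy as np

rng = np.random.default_rng(0)
audio_feats = rng.normal(size=(100, 13))   # e.g. 13 MFCCs per frame
video_feats = rng.normal(size=(100, 6))    # e.g. 6 lip-shape parameters per frame

# Feature fusion: one concatenated vector per frame, fed to a single classifier.
fused_feats = np.concatenate([audio_feats, video_feats], axis=1)   # (100, 19)

# Decision fusion: each stream is classified separately and the class
# posteriors are combined with stream weights reflecting their reliability.
def combine_posteriors(p_audio, p_video, audio_weight=0.7):
    log_p = audio_weight * np.log(p_audio) + (1 - audio_weight) * np.log(p_video)
    p = np.exp(log_p - log_p.max())
    return p / p.sum()

p_a = np.array([0.6, 0.3, 0.1])    # posteriors from the audio classifier
p_v = np.array([0.2, 0.5, 0.3])    # posteriors from the video classifier
print(fused_feats.shape, combine_posteriors(p_a, p_v))
```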

Audio Visual Speech Recognition Using MPEG 4 Compliant Visual Features

In this paper, we first describe an automatic and robust method for extracting FAPs by combining active contour and template algorithms. We use the Gradient Vector Field (GVF) snake, since it has a large capture area, and parabolas as templates. We next describe the audio-visual ASR systems we developed, utilizing FAPs for the visual representation of speech. Single-stream and multistream HMMs were used to model the integration of audio-visual information, in order to improve speech recognition performance over a wide range of audio noise levels. In our experiments, we use a relatively large-vocabulary (approximately 1000 words) audio-visual database, the Bernstein Lipreading Corpus [28]. Performance improvements in noisy and noise-free speech conditions are reported.
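For readers unfamiliar with snake-based contour extraction, the following sketch shows the general mechanics on a cropped grayscale region. It uses the classic Kass-style snake shipped with scikit-image rather than the GVF snake used in the paper, and the image crop, initial ellipse, and smoothing parameters are hypothetical stand-ins for a detected mouth region.

```python
# Rough sketch of snake-based lip contour extraction, assuming a cropped
# grayscale mouth image is available. scikit-image provides a classic snake,
# not the GVF variant, so this only illustrates the general idea.
import numpy as np
from skimage import data, color, filters
from skimage.segmentation import active_contour

mouth = color.rgb2gray(data.astronaut())[170:220, 210:270]   # stand-in ROI
smoothed = filters.gaussian(mouth, sigma=2)

# Initialise the snake as an ellipse roughly centred on the cropped region.
t = np.linspace(0, 2 * np.pi, 200)
init = np.column_stack([25 + 15 * np.sin(t), 30 + 25 * np.cos(t)])   # (row, col)

snake = active_contour(smoothed, init, alpha=0.015, beta=10, gamma=0.001)
print(snake.shape)   # (200, 2) contour points attracted to image edges
```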

Sparseness and speech perception in noise

There is also physiological evidence suggesting that sparseness is a key principle by which neurons encode environmental images and sounds. Every day we receive large quantities of information, and our sensory system must have evolved efficient coding strategies that maximize the information conveyed to the brain without consuming too many neural resources. Field [4] has shown that the receptive field properties of simple cells in primary visual cortex produce a sparse representation. When this sparse representation is used as a constraint to encode images, a set of localized and oriented filters can be derived [5]; applied to sound signals, a set of time- and frequency-localized filters can be derived [6, 7, 8]. These studies confirm sparse coding principles [9, 10] and the importance of statistics in neuroscience.
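As an illustration of the sparse-coding constraint mentioned above, the sketch below learns a small dictionary from natural image patches so that each patch is reconstructed from only a few active coefficients. The patch size, dictionary size, and sparsity level are illustrative choices, not values from the cited studies.

```python
# Small sketch of the sparse-coding idea: learn a dictionary of localized
# filters from image patches and encode each patch with few active atoms.
import numpy as np
from sklearn.datasets import load_sample_image
from sklearn.decomposition import MiniBatchDictionaryLearning
from sklearn.feature_extraction.image import extract_patches_2d

image = load_sample_image("china.jpg").mean(axis=2) / 255.0     # grayscale
patches = extract_patches_2d(image, (8, 8), max_patches=2000, random_state=0)
X = patches.reshape(len(patches), -1)
X -= X.mean(axis=1, keepdims=True)                              # remove DC offset

dico = MiniBatchDictionaryLearning(n_components=64, alpha=1.0,
                                   transform_algorithm="omp",
                                   transform_n_nonzero_coefs=5,
                                   random_state=0)
codes = dico.fit(X).transform(X)
active = (codes != 0).mean()
print(f"average fraction of active coefficients: {active:.3f}")   # sparse code
```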

Multi-pose lipreading and audio-visual speech recognition

deployment, the systems still lack robustness against non-ideal working conditions. Research has particularly neglected the variability of the visual modality in real scenarios, i.e., non-uniform lighting and non-frontal poses caused by natural movements of the speaker. The first studies on AV-ASR under realistic conditions [4,5] directly applied systems developed for ideal visual conditions, obtained poor lipreading performance, and failed to exploit the visual modality in the multi-modal systems. These studies pointed out the need for new visual feature extraction methods robust to illumination and pose changes. In particular, pose-invariant AV-ASR is central to the future deployment of this technology in genuine AV-ASR applications, e.g., smart rooms or in-car systems. In these scenarios the audio modality is degraded by noise and the inclusion of visual cues can improve recognition. In natural situations, however, the speaker moves freely and a frontal view to the camera is rarely maintained, so pose-invariant AV-ASR is necessary. It can be considered, then, the first step in adapting laboratory AV-ASR systems to the conditions expected in real applications.

Visual perception in dyslexia is limited by sub optimal scale selection

To explore whether increased levels of internal noise could provide an alternative explanation for our results, Gaussian noise was added to the model's encoded direction of each dot (Fig. 6d). Integration field size was set to match the segment size (100%), while the standard deviation (SD) of the Gaussian noise distribution was varied between 0 and 90°. Adding a relatively small amount of internal noise (SD 30–60°) did not change the overall pattern of results, but the acuity limit and the coherence threshold at asymptote increased dramatically when the standard deviation of the Gaussian noise was 90°. Crucially, there was no change in the slope of the descending limb or the knee-point of the curve. These findings are the opposite of those observed in the poorest readers and demonstrate that elevated levels of internal noise cannot readily explain their performance on the motion-based segmentation task.
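To show what an internal-noise manipulation of this kind amounts to computationally, here is a toy simulation: Gaussian noise of a given SD is added to each dot's encoded direction before the directions are pooled. The pooling rule (a circular mean over all dots) and the left/right discrimination task are simplifications chosen only to illustrate the mechanics, not the model actually used in the study.

```python
# Toy internal-noise simulation: perturb each dot's encoded direction with
# Gaussian noise of a chosen SD, pool with a circular mean, and score a
# simple left/right decision. All task details here are illustrative.
import numpy as np

rng = np.random.default_rng(1)

def percent_correct(coherence, noise_sd_deg, n_dots=100, n_trials=2000):
    correct = 0
    for _ in range(n_trials):
        signal_dir = rng.choice([0.0, 180.0])                 # rightward/leftward
        n_signal = int(round(coherence * n_dots))
        dirs = np.concatenate([np.full(n_signal, signal_dir),
                               rng.uniform(0, 360, n_dots - n_signal)])
        encoded = dirs + rng.normal(0, noise_sd_deg, n_dots)  # internal noise
        pooled = np.angle(np.mean(np.exp(1j * np.deg2rad(encoded))), deg=True)
        decided = 0.0 if abs(pooled) < 90 else 180.0
        correct += decided == signal_dir
    return correct / n_trials

for sd in (0, 30, 60, 90):
    print(sd, round(percent_correct(0.2, sd), 3))
```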

Dynamic Bayesian Networks for Audio Visual Speech Recognition

We tested the speaker-dependent isolated-word audio-visual recognition system on the CMU database [18]. Each word in the database is repeated ten times by each of the ten speakers in the database. For each speaker, nine examples of each word were used for training and the remaining example was used for testing. In our experiments we compared the accuracy of the audio-only, video-only and audio-visual speech recognition systems using the AV MSHMM, AV CHMM, AV FHMM, AV PHMM, and AV IHMM described in Section 4. For each of the audio-only and video-only recognition tasks, we model the observation sequences using a left-to-right HMM with five states, three Gaussian mixtures per state and diagonal covariance matrices. In the audio-only and all audio-visual speech recognition experiments, the audio sequences used in training are captured in clean acoustic conditions and the audio track of the testing sequences was altered by white noise at various SNR levels from 30 dB (clean) to 12 dB. The audio observation vectors consist of 13 MFC coefficients [6], extracted from overlapping frames of 20 ms. The visual observations are obtained using the cascade algorithm described in Section 3.
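The audio front end described above can be sketched with standard tools: 13 cepstral coefficients from 20 ms frames, with the test audio corrupted by additive white noise at a chosen SNR. The file name, hop size, and mel filterbank size below are placeholders, not details from the paper.

```python
# Sketch of the audio front end: add white noise at a target SNR, then
# extract 13 MFCCs from 20 ms analysis frames.
import numpy as np
import librosa

def add_white_noise(signal, snr_db):
    """Scale white noise so the resulting signal-to-noise ratio is snr_db."""
    noise = np.random.default_rng(0).normal(size=signal.shape)
    sig_power = np.mean(signal ** 2)
    noise_power = sig_power / (10 ** (snr_db / 10))
    return signal + noise * np.sqrt(noise_power / np.mean(noise ** 2))

y, sr = librosa.load(librosa.example("trumpet"), sr=16000)   # stand-in audio
noisy = add_white_noise(y, snr_db=12)

frame = int(0.020 * sr)                          # 20 ms analysis window
mfcc = librosa.feature.mfcc(y=noisy, sr=sr, n_mfcc=13, n_mels=26,
                            n_fft=frame, hop_length=frame // 2)
print(mfcc.shape)                                # (13, n_frames)
```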

Audio visual speech perception in infants and toddlers with Down syndrome, fragile X syndrome, and Williams syndrome

Typically-developing (TD) infants can construct unified cross-modal percepts, such as a speaking face, by integrating auditory-visual (AV) information. This skill is a key building block upon which higher-level skills, such as word learning, are built. Because word learning is seriously delayed in most children with neurodevelopmental disorders, we assessed the hypothesis that this delay partly results from a deficit in integrating AV speech cues. AV speech integration has rarely been investigated in neurodevelopmental disorders, and never previously in infants. We probed for the McGurk effect, which occurs when the auditory component of one sound (/ba/) is paired with the visual component of another sound (/ga/), leading to the perception of an illusory third sound (/da/ or /tha/). We measured AV integration in 95 infants/toddlers with Down, fragile X, or Williams syndrome.

The Effect of Reliability Measure on Integration Weight Estimation in Audio-Visual Speech Recognition R. RAJAVEL

Potamianos et al. demonstrated that mouth videos captured from cameras attached to wearable headsets produce better results than full-face videos [29]. In light of this, and to make the system more practical for real mobile applications, around 70 commonly used mobile functions (isolated words) were each recorded 30 times by a microphone and web camera located approximately 5-10 cm from the mouth region of the speaker's right cheek. Samples of the recorded side-face videos are shown in figure 1. The advantage of this arrangement is that face detection, mouth location estimation, identification of the region of interest, etc. are no longer required, which reduces the computational complexity [9]. Most available audio-visual speech databases are recorded in an ideal studio environment with controlled lighting, or hold factors such as background, illumination, distance between camera and speaker's mouth, and camera view angle constant. In this work, however, the recording was done in an office environment on different days with different values for these factors, to make the database suitable for real-life applications. The database also includes natural environmental noises such as fan noise, bird sounds, and occasionally other people speaking or shouting.

Exploring early developmental changes in face scanning patterns during the perception of audiovisual mismatch of speech cues

To summarise, using eye-tracking measures we have found evidence for infants' ability to discriminate between possible (fusible) and impossible audio-visual speech combinations in terms of total looking duration (6- to 9-month-olds). We have also found evidence for age-related changes in 6- to 9-month-old infants' attention to the mouth during the perception of incongruent, impossible, and non-redundant audio-visual speech cues. The age-related shift in attention to non-fusible, mismatched speech cues found here suggests that an important transition in perceptual learning of speech may occur between 6 and 9 months of age. Importantly, our data add to the research on the intersensory redundancy hypothesis, demonstrating that it is applicable to the early stages of language acquisition, but not to later development (from

A Survey on Techniques for Enhancing Speech

Almajai [42] made use of visual speech information within a Wiener filter to improve noisy speech. The approach is reported to improve the noisy signal, in particular by reducing noise intrusiveness at the cost of some signal distortion. The basic idea is that correlation exists between the audio and visual speech signals, which makes it possible to estimate filterbank features from the visual features. The initial investigation of this approach reported several findings. First, the correlation is higher when estimated within phonemes than globally across all speech. Second, measurements of filterbank estimation errors, together with subjective and objective tests, reveal that the proposed method is relatively insensitive to phoneme decoding errors: only a very small difference was observed in the filterbank estimation errors when decoding accuracy decreased from 100% to 30%. The results also show that the estimation of spectral features from visual features is limited both by the audio-visual correlation and by the amount of speech information conveyed by lip movement. For example, spectral details such as harmonic structure cannot be determined from the visual features, which places a limit on the level of spectral detail that can be extracted from them. Analysis has nevertheless shown that coarse filterbank estimates are sufficient to enable speech enhancement to an extent. In 2013, a two-stage multimodal speech enhancement framework utilizing audio and visual information was proposed [84]. The noise-contaminated speech signals obtained from a microphone array are first pre-processed by a visually derived, Gaussian Mixture Regression based Wiener filter, using visual speech information obtained by a Semi Adaptive Appearance Model (SAAM) based lip-tracking approach. Subsequently, the pre-processed speech signals are further improved by a Transfer Function Generalized Sidelobe Canceller (TFGSC). The two-stage system is a promising solution in challenging noisy scenarios, and the results provide a favorable outlook for its use in difficult noisy environments. The system was later extended with fuzzy logic to demonstrate a proof of concept for an envisaged autonomous, adaptive, and context-aware multimodal system [85]. The drawbacks of the system are that the Wiener filters used are very basic and that it relies on a relatively simple GMM for speech estimation.
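The Wiener-filter stage at the heart of these approaches can be sketched compactly. The example below assumes that an estimate of the clean speech power spectrum has already been obtained from the visual features (the GMM-regression step is not reproduced), estimates the noise power crudely from the first few frames, and uses a stand-in noisy signal; all values are illustrative.

```python
# Minimal sketch of a visually assisted Wiener filter applied in the STFT
# domain: gain = est_clean_power / (est_clean_power + noise_power).
import numpy as np
from scipy.signal import stft, istft

def visually_assisted_wiener(noisy, fs, est_clean_power):
    f, t, X = stft(noisy, fs=fs, nperseg=512)
    # Crude noise estimate from the first frames (assumed speech-free).
    noise_power = np.mean(np.abs(X[:, :10]) ** 2, axis=1, keepdims=True)
    gain = est_clean_power / (est_clean_power + noise_power)     # Wiener gain
    _, enhanced = istft(gain * X, fs=fs, nperseg=512)
    return enhanced

fs = 16000
noisy = np.random.default_rng(0).normal(size=fs)      # stand-in noisy signal
# A flat spectrum stands in for the visually estimated clean speech power.
est_clean = np.full((257, 1), 0.01)
print(visually_assisted_wiener(noisy, fs, est_clean).shape)
```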

Noise Adaptive Stream Weighting in Audio Visual Speech Recognition

So far, word error rates were calculated with the fusion parameter c held constant within one noise condition: the average value of the reliability measure was computed for that condition and a global value of c was selected accordingly. This assumes that the whole test set is known at recognition time, which of course is unrealistic in a real-life recognition system. Rather, it is necessary to calculate the correct setting of the fusion parameter instantaneously for each frame. This also opens the possibility of coping with non-stationary noise and variations in the SNR of the speech signal. We therefore repeated the tests of the previous section with audio stream weights adapted on a frame-by-frame basis. To reduce the influence of estimation errors, the values of the fusion parameter were smoothed over time with a first-order recursive filter with a cut-off frequency of 0.6 Hz. Table 3 compares the results of the optimization for the different criteria when the value of the fusion parameter is fixed over the whole test set (Global) and when it is varied (Frame Dependent). As for the previous recognition results, the average RWER is based on the results obtained with SWP λ and hence evaluated according to (15).
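The frame-wise smoothing step can be written in a few lines. The sketch below applies a first-order recursive low-pass filter to a sequence of raw per-frame weights; the 0.6 Hz cut-off follows the text, while the 100 Hz frame rate and the random raw weights are assumptions made only for illustration.

```python
# First-order recursive smoothing of a frame-wise fusion parameter.
import numpy as np

def smooth_stream_weights(raw_weights, frame_rate_hz=100.0, cutoff_hz=0.6):
    """y[n] = a*y[n-1] + (1-a)*x[n], a discrete first-order low-pass."""
    a = np.exp(-2.0 * np.pi * cutoff_hz / frame_rate_hz)
    smoothed = np.empty_like(raw_weights)
    smoothed[0] = raw_weights[0]
    for n in range(1, len(raw_weights)):
        smoothed[n] = a * smoothed[n - 1] + (1.0 - a) * raw_weights[n]
    return smoothed

# Raw weights would normally come from a reliability measure; here random.
raw = np.clip(0.7 + 0.3 * np.random.default_rng(0).normal(size=500), 0, 1)
lam = smooth_stream_weights(raw)
print(raw.std().round(3), lam.std().round(3))   # smoothing reduces variance
```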

Audio-visual speech perception: a developmental ERP investigation

Work with infants indicates a very early sensitivity to multisensory speech cues. By two months of age infants can match auditory and visual vowels behaviourally (Kuhl & Meltzoff, 1982; Patterson & Werker, 1999). Bristow and colleagues (Bristow, Dehaene-Lambertz, Mattout, Soares, Gliga, Baillet & Mangin, 2008) used an electrophysiological mismatch negativity paradigm to show that visual speech cues habituated 10-week-old infants to auditory tokens of the same phoneme, but not auditory tokens of a different phoneme. Such evidence suggests that infants have a multisensory representation of the phonemes tested, or at least are able to match across senses in the speech domain. By 5 months of age, infants are sensitive to the McGurk illusion, as shown both behaviourally (Burnham & Dodd, 2004; Rosenblum, Schmuckler & Johnson, 1997; Patterson & Werker, 1999), and electrophysiologically (Kushnerenko, Teinonen, Volein & Csibra, 2008). Notably though, audio-visual speech perception may not be robust or consistent at this age due to a relative lack of experience (Desjardins & Werker, 2004). Nevertheless, infants pay attention to the mouths of speakers at critical times for language development over the first year (Lewkowicz & Hansen-Tift, 2012), during which time they may even use visual cues to help develop phonemic categories (Teinonen, Aslin, Alku & Csibra, 2008).

Contributions of temporal encodings of voicing, voicelessness, fundamental frequency, and amplitude variation to audiovisual and auditory speech perception

At least where auditory signals lack spectral structure, fundamental frequency information is well established as a source of useful information in sentence-level audio-visual speech perception. The results of experiment 1 suggest that prostheses which ensure a salient percept of F0 may also aid in the identification of consonants. That the explicit temporal representation of voiced and voiceless excitation can contribute to consonant identification may also be of practical significance for the design of hearing prostheses, since it would be likely to contribute to the audio-visual perception of low-redundancy messages. Voiceless-excited speech is of course distinct from voiced-excited speech not only in its aperiodicity but also, typically, in the presence of predominantly higher-frequency energy. For listeners to whom higher frequencies are not audible, a temporal coding of periodicity and aperiodicity is likely to be the only possible means of preserving this excitation contrast. Where sufficient frequency range and useful spectral resolution are retained, this contrast may also be accessible from spectral structure, and the significance of temporal cues to aperiodicity when spectral cues are also available merits further investigation.
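The two temporal cues at issue here, a fundamental-frequency track and a frame-wise voiced/voiceless decision, can be extracted with standard tools. The sketch below uses librosa's pYIN implementation on a stand-in speech file; the example file and pitch range are placeholders, not details from the study.

```python
# Sketch of extracting an F0 track and a voicing decision per frame.
import numpy as np
import librosa

y, sr = librosa.load(librosa.example("libri1"), sr=16000)   # stand-in speech
f0, voiced_flag, voiced_prob = librosa.pyin(
    y, fmin=librosa.note_to_hz("C2"), fmax=librosa.note_to_hz("C6"), sr=sr)

voiced_fraction = np.mean(voiced_flag.astype(float))
print(f"median F0: {np.nanmedian(f0):.1f} Hz, "
      f"voiced frames: {100 * voiced_fraction:.1f}%")
```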

Kannada Speech Recognition Using MFCC and KNN Classifier for Banking Applications

on continuous phoneme and digit recognition were performed on an unrestricted-speaker telephone database. Walker K et al. (1989) [12] presented a speaker-independent automatic speech recognition system for a small vocabulary, employing phonetically based methods. The system uses formant tracking and relative energy values to characterize each word in the vocabulary (the digits 0 to 9), and was tested on a number of speakers of both sexes with encouraging results. Brognaux S et al. (2016) [13] implemented HMM-based speech segmentation; the obvious advantage of this technique is that it is applicable to any language or speaking style and does not require manually aligned data. Receveur S et al. (2016) [14] successfully applied the turbo principle to the domain of ASR, thereby providing solutions to the information fusion problem mentioned above. On a small-vocabulary task, their proposed turbo ASR approach outperforms even the best reference system, averaged over all SNR conditions and investigated noise types, by a relative word error rate (WER) reduction of 22.4% (audio-visual task) and 18.2% (audio-only task), respectively.

On the Soft Fusion of Probability Mass Functions for Multimodal Speech Processing

4.4. Audio-Visual Speech Recognition as a Multiple Hypothesis Testing Problem. Audio-visual speech recognition (AVSR) is a technique that uses image processing capabilities such as lip reading to aid audio-based speech recognition in resolving indeterminate phones or in arbitrating between decisions with very close probabilities. In general, lip reading and audio-based speech recognition operate separately, and the information gathered from them is then fused to make a better decision. The aim of AVSR is to exploit the human perceptual principle of sensory integration (the joint use of audio and visual information) to improve the recognition of human activity (e.g., speech recognition, speech activity, speaker change), intent (e.g., speech intent), and identity (e.g., speaker recognition), particularly in the presence of acoustic degradation due to noise and channel effects, and to support the analysis and mining of multimedia content. AVSR can be viewed as a multiple-hypothesis-testing-like problem in speech processing, since there are multiple words to be recognized in a typical word-based audio-visual speech recognition system. The application of the aforementioned MHT-SB function to such a problem is discussed in the ensuing section on performance evaluation.
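As a toy illustration of treating each candidate word as a separate hypothesis and fusing soft evidence from the two modalities, the sketch below combines two word-level mass functions with Dempster's rule of combination, keeping some mass on the full frame of discernment ("unknown"). This generic evidence-combination rule is a stand-in, not the MHT-SB soft-fusion function proposed in the paper, and the words and mass values are invented.

```python
# Toy fusion of word-level evidence from audio and visual recognizers using
# Dempster's rule; each singleton set is one word hypothesis.
from itertools import product

WORDS = ("yes", "no", "stop")
UNKNOWN = frozenset(WORDS)          # mass assigned to "could be any word"

def dempster_combine(m1, m2):
    combined, conflict = {}, 0.0
    for (a, pa), (b, pb) in product(m1.items(), m2.items()):
        inter = a & b
        if inter:
            combined[inter] = combined.get(inter, 0.0) + pa * pb
        else:
            conflict += pa * pb
    return {k: v / (1.0 - conflict) for k, v in combined.items()}

m_audio = {frozenset({"yes"}): 0.45, frozenset({"no"}): 0.35, UNKNOWN: 0.20}
m_visual = {frozenset({"no"}): 0.55, frozenset({"stop"}): 0.15, UNKNOWN: 0.30}

fused = dempster_combine(m_audio, m_visual)
best = max((k for k in fused if len(k) == 1), key=fused.get)
print({tuple(sorted(k)): round(v, 3) for k, v in fused.items()},
      "->", next(iter(best)))
```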
