Audio-Visual Speech Recognition

Noise Adaptive Stream Weighting in Audio Visual Speech Recognition

Our objective was to compare a number of schemes for an adaptive combination of audio and video a posteriori probabilities estimated by an ANN for an audio-visual recognition task under different noise conditions. In a first test we looked at the effectiveness of different weight combination schemes for audio and video data. The results demonstrated that a multiplicative combination respecting class conditional independence of the streams gives the best results. Next, we compared different criteria for an adaptive estimation of the audio stream reliability using the Geometric Weighting method. The performance of both the criterion based on the entropy of the a posteriori probabilities and the one based on the ratio of the harmonic to the nonharmonic components in the speech signal was very close to the best achievable performance determined by manual adjustment. We showed that an adaptive weighting scheme based on the entropy and the voicing index can be built that yields consistent performance in various noise conditions. Finally, we investigated whether a constant weight on the audio and video streams in all noise conditions would give performance comparable to the adaptive weighting. This test showed that when the SNR is higher than 0 dB, the Unweighted Bayesian Product performs as well as Geometric Weighting, so weighting, fixed or adaptive, is unnecessary, whereas for SNR values below −3 dB the performance losses are severe if no weighting is performed. An analysis of the confusion matrices showed that the confusion of all phonemes with the silence state is the main cause of the failure of the Unweighted Bayesian Product for SNR < 0 dB. We note that this is related to the continuous speech recognition task and the problem of speech detection in noise. Therefore an algorithm (namely FCA and GW) incorporating Bayes' rule, which performs well for SNR ≥ 0 dB, and a weighting principle, which is dominant for SNR < 0 dB, seems to be optimal. The weighting globally acts as a switch between the two modalities, favoring the one having fewer confusions with the silence state. This complements Bayes' rule when this type of confusion occurs.
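
A minimal sketch of the Geometric Weighting idea described above, in Python: the audio and video posteriors are fused as a weighted geometric mean, with the audio weight derived from the entropy of the audio posteriors. The linear entropy-to-weight mapping and the toy numbers are illustrative assumptions, not the exact rule used in the paper.

```python
import numpy as np

def geometric_weighting(p_audio, p_video, lam):
    """Combine per-class posteriors p(c|audio) and p(c|video) as a weighted
    geometric mean: p(c) proportional to p_audio^lam * p_video^(1-lam)."""
    fused = (p_audio ** lam) * (p_video ** (1.0 - lam))
    return fused / fused.sum()

def entropy_based_weight(p_audio, n_classes):
    """Map the entropy of the audio posteriors to a stream weight in [0, 1].
    Low entropy (confident audio) -> weight near 1; high entropy -> near 0.
    The linear mapping is an illustrative assumption, not the paper's rule."""
    h = -np.sum(p_audio * np.log(p_audio + 1e-12))
    return 1.0 - h / np.log(n_classes)

# Toy example: audio posteriors degraded by noise, video still informative.
p_a = np.array([0.30, 0.28, 0.22, 0.20])   # nearly flat -> unreliable
p_v = np.array([0.70, 0.10, 0.10, 0.10])   # peaked -> reliable
lam = entropy_based_weight(p_a, n_classes=4)
print(lam, geometric_weighting(p_a, p_v, lam))
```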

Feature fusion based audio visual speech recognition using lip geometry features in noisy environment

Due to the wide variability in the lip movement involved in articulation, not all English digits benefit equally from audio-visual integration. For example, pronouncing the word 'six' involves only small movements of the lips, while producing the word 'seven' requires considerable lip movement. In general, the greater the lip movements required to generate a word, the better an AVSR system is likely to perform on it. Figure 8 and Figure 9 show the recognition performance of the new system in identifying the digit 'seven' when simulated under 'white' and 'babble' noise. The graphs show that the performance when using only visual information is 75%, and that combining the audio and visual information improves the performance by more than 40% relative to the audio-only results at SNRs below 0 dB.
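
As a rough illustration of feature-level fusion (the exact lip-geometry features and stream alignment used in the paper are not reproduced here), the sketch below simply concatenates frame-synchronous audio and lip-geometry feature vectors before they are passed to a recognizer.

```python
import numpy as np

def fuse_features(audio_feats, lip_feats):
    """Early (feature-level) fusion: frame-synchronous concatenation of audio
    features (e.g. MFCCs) with lip-geometry features (e.g. mouth width/height).
    Assumes both streams have already been resampled to a common frame rate."""
    n = min(len(audio_feats), len(lip_feats))
    return np.hstack([audio_feats[:n], lip_feats[:n]])

# Toy shapes: 100 frames of 13 MFCCs plus 100 frames of 4 geometry measurements.
audio = np.random.randn(100, 13)
lips = np.random.randn(100, 4)
print(fuse_features(audio, lips).shape)   # (100, 17)
```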

Dynamic Bayesian Networks for Audio Visual Speech Recognition

independently of each other, but each of the audio and visual states is conditioned jointly on the previous set of audio and visual states. The performance of the FHMM and the CHMM for speaker-dependent isolated-word AVSR was compared with existing models such as the multistream HMM, the independent HMM and the product HMM. The coupled HMM-based system outperforms all the other models at all SNR levels from 12 dB to 30 dB. The lower performance of the FHMM can be an effect of the large number of parameters required by this model and the relatively limited amount of data in our experiments. In contrast, the efficient structure of the CHMM requires a small number of parameters, comparable to the independent HMM, without reducing the flexibility of the model. The best recognition accuracy in our experiments, the low parameter space, and the ability to exploit parallel computation make the CHMM a very attractive choice for audio-visual integration. Our preliminary experimental results [30] show that the CHMM is a viable tool for speaker-independent audio-visual continuous speech recognition.
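
A small sketch of the coupling structure described above, assuming toy state counts and random transition tables: in a coupled HMM each stream's next state is conditioned on the previous pair of audio and visual states, so the joint transition factorises across streams while the modalities remain coupled.

```python
import numpy as np

rng = np.random.default_rng(0)
Na, Nv = 3, 3   # number of audio and visual states (toy sizes)

# CHMM transition structure: each stream's next state depends on the
# previous pair (a, v) of audio and visual states:
#   P(a_t | a_{t-1}, v_{t-1})  and  P(v_t | a_{t-1}, v_{t-1})
A_audio = rng.dirichlet(np.ones(Na), size=(Na, Nv))   # shape (Na, Nv, Na)
A_video = rng.dirichlet(np.ones(Nv), size=(Na, Nv))   # shape (Na, Nv, Nv)

def joint_transition(a_prev, v_prev, a_next, v_next):
    """P(a_t, v_t | a_{t-1}, v_{t-1}) factorises across streams given the
    previous pair, which keeps the parameter count close to two independent
    HMMs while still coupling the modalities."""
    return A_audio[a_prev, v_prev, a_next] * A_video[a_prev, v_prev, v_next]

print(joint_transition(0, 1, 2, 0))
```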

Audio Visual Speech Recognition Using MPEG 4 Compliant Visual Features

several years. Improving ASR performance by exploiting the visual information of the speaker's mouth region is the main objective of AVSR. The visual features, usually extracted from the mouth area, thought to be the most useful for ASR are the outer lip contour, the inner lip contour, the teeth and tongue locations, and the pixel intensities (texture) of an image of the mouth area. Choosing the visual features that contain the most useful information about the speech is of great importance, and the improvement of the ASR performance depends strongly on the accuracy of the visual feature extraction algorithms. There are three main approaches to visual feature extraction from image sequences: image-based, model-based, and combination approaches. In the image-based approach, the transformed mouth image (obtained by, for example, PCA, the Discrete Wavelet Transform, or the Discrete Cosine Transform) is used as a visual feature vector. In the model-based approach, the face features important for visual speech perception (mainly lip contours, tongue, and teeth positions) are modeled and controlled by a small set of parameters which are used as visual features [12]. In the combination approach, features obtained from the previous two methods are combined and used as a new visual feature vector. A number of researchers have developed audio-visual speechreading systems, using image-based [13, 14, 15, 16], model-based [16, 17, 18, 19, 20], or combination approaches [13, 21] to obtain visual features. The reported results show improvement over audio-only speech recognition systems. Most of the systems performed tests using a small vocabulary, while recently results showing audio-visual ASR improvement over audio-only ASR on a large vocabulary were reported [13, 14].
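
The image-based approach mentioned above can be sketched in a few lines: take a grayscale mouth ROI, apply a 2-D Discrete Cosine Transform, and keep a small block of low-frequency coefficients as the visual feature vector. The block size and the absence of any normalisation here are illustrative choices, not those of any particular cited system.

```python
import numpy as np
from scipy.fftpack import dct

def dct_mouth_features(roi, keep=6):
    """Image-based visual features: 2-D DCT of a grayscale mouth ROI,
    keeping the top-left keep x keep block of low-frequency coefficients."""
    roi = roi.astype(np.float64)
    coeffs = dct(dct(roi, axis=0, norm='ortho'), axis=1, norm='ortho')
    return coeffs[:keep, :keep].ravel()

# Toy 32x48 mouth region with random pixel intensities.
roi = np.random.randint(0, 256, size=(32, 48))
print(dct_mouth_features(roi).shape)   # (36,)
```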

Audio Visual Speech Recognition for People with Speech Disorders

Automatic speech recognition (ASR) is used in many assistive fields such as human-computer interaction and robotics. In spite of their effectiveness, speech recognition technologies still need more work for people with speech disorders. Because speech is not produced in isolation but is accompanied by visible movements of the lips, making use of visual features from the lip region can improve accuracy compared to audio-only ASR. Visual features have been studied in many recent audio-visual ASR systems for normal speakers [1, 2].

Lip-Reading Techniques: A Review

In recent trends, pattern recognition has proved to be an important topic of discussion, emphasizing the use of computers to mimic people's ideas about different items in order to convey valuable information. Compared with other recognition systems such as fingerprint, gesture or facial recognition, audio-visual speech recognition is more beneficial and robust, which makes it an important building block of the human-computer interface [22,23]. Other important areas of research related to lip-reading are pattern recognition [24,25], image processing and computer vision [26]. Nowadays, lip reading is becoming a very important technique in recognition systems, where several lip-reading techniques may be used to improve the performance of recognition models. Lip reading finds important applications in the fields of information security [27,28], speech recognition [29,30,31] and driver assistance systems [32]. The history of lip reading goes back to 1954, when Sumby [33] proposed the first work associated with lip reading. Later, Petajan [34] introduced a lip contour reading system which was popular in the 1980s, and since then there has been a substantial amount of research in the field of lip-reading. Since the audio signal is susceptible to environmental noise, a pixel-based method combined with an artificial neural network (ANN) was proposed in a recognition model developed in 1989 [35]. In 1993, Goldschen and others used Hidden Markov Models (HMMs) in their lip reading system to achieve a sentence recognition rate of 25% [36]. Chiou [37] presented a lip-reading system that used colour motion video, combining a snake model, HMMs and principal component analysis (PCA), to achieve an accuracy of about 94% for 10 words.

Audio-visual speech perception: a developmental ERP investigation

not mature until the teenage years (Gotgay et al., 2004; see Lenroot & Giedd, 2006). Recent functional imaging data mirror this late development and support the role of STS in children's audio-visual speech perception (Nath, Fava & Beauchamp, 2011). Dick and colleagues (Dick, Solodkin & Small, 2010) measured brain activity in response to auditory and audio-visual speech in adults and 8- to 11-year-old children, and found that while the same areas were involved in perception for both adults and children, the relationships between those areas differed. For example, the functional connectivity between pSTS and frontal pre-motor regions was stronger for adults given audio-visual over auditory-only speech, but weaker for children. With regard to latency, a different pattern emerged for the children, as a group, compared to the adult sample in Experiment 1. For the children, only the P2 component exhibited latency modulation in response to visual speech cues, and latency shortening was observed regardless of congruency between auditory and visual cues. Interpretations of previous adult data (Pilling, 2009; Van Wassenhove et al., 2005) have rested on the effect of congruence-dependency, with congruent visual cues suggested to allow a prediction of the upcoming auditory signal, such that the degree of latency shortening reflects the difference between expected and perceived events. The current developmental data are not sensitive to congruency, and therefore cannot be interpreted entirely with recourse to the prediction of signal content. The present and previous adult data may therefore not tell the whole story regarding latency modulation. One possibility is that visual cues are involved in predicting not just what is about to be presented, but also when it is to be presented. Certainly, using non-speech stimuli, the auditory N1 and P2 components have been shown to be sensitive to both the content and timing of stimulus presentation (Viswanathan & Jansen, 2010). In this case,

Visual speech recognition: aligning terminologies for better understanding

To conclude our mini review of machine lipreading, we summarise that there is a clear differentiation between machine lipreading and speech reading. It is important that future researchers use the correct terminology in future publications to help the community understand what data the conclusions have been drawn upon. We have also discussed the challenge of speaker independence in lipreading, showing that it can stem from the feature extraction method and the training parameters of individual speakers. Furthermore, we have reviewed many of the influences on accuracy scoring in publications from different fields and recommended a new notation to help compare results in the future.

Two and three-dimensional visual articulatory models for pronunciation training and for treatment of speech disorders

recognition rate was not significant (p = 0.994, two-sided Wilcoxon signed-rank test for matched samples). In contrast to the phoneme evaluation, the children's feature recognition rate did not increase significantly with age for either model. Furthermore, the recognition scores of any individual visual articulatory feature did not differ significantly across the two models (Fig. 5). However, the various articulatory features showed significantly different recognition scores for both models (Table 2). For both models, the visual articulatory features nasality, articulator and place do not exhibit significantly different recognition scores (two-tailed Wilcoxon signed-rank test), while the feature rounding exhibits the highest and the feature narrowness the lowest recognition scores. An ordering of the articulatory visual features with respect to the recognition rates is given in Table 2 for both models.
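
For readers unfamiliar with the matched-samples test cited above, a minimal example of a two-sided Wilcoxon signed-rank test in Python is shown below; the paired scores are invented purely to show the call and are not data from the study.

```python
from scipy.stats import wilcoxon

# Paired recognition scores (per participant) for two conditions/models.
# Values are made up purely to illustrate the test call.
scores_2d = [0.62, 0.71, 0.55, 0.68, 0.74, 0.60, 0.66, 0.70]
scores_3d = [0.60, 0.73, 0.56, 0.69, 0.72, 0.61, 0.64, 0.71]

stat, p = wilcoxon(scores_2d, scores_3d, alternative='two-sided')
print(f"W = {stat:.1f}, p = {p:.3f}")   # large p -> no significant difference
```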

Audio-Visual Speech Recognition Using Lip Information Extracted from Side-Face Images

about points located on a lip contour [6, 7], have already been used for bimodal speech recognition based on frontal-face images. However, since these features were extracted based on “oval” mouth-shape models, they are not suitable for side-face images. To effectively extract geometric information from side-face images, this paper proposes using lip-contour geometric features (LCGFs) based on a time series of estimated angles between the upper and lower lips [12]. In our previous work on audio-visual speech recognition using frontal-face images [9, 10], we used lip-motion velocity features (LMVFs) derived by optical-flow analysis. In this paper, LCGFs and LMVFs are used individually and jointly [12, 13]. (Preliminary versions of this paper have been presented at workshops [12, 13].) Since LCGFs use lip-shape information, they are expected to be effective in discriminating phonemes. On the other hand, since LMVFs are based on lip-movement information, they are expected to be effective in detecting voice activity. In order to integrate the audio and visual features, a multistream HMM technique is used.
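
The optical-flow-based lip-motion velocity idea can be sketched as follows, assuming OpenCV's dense Farnebäck flow on consecutive grayscale mouth ROIs; summarising the flow by its mean horizontal and vertical magnitudes is an illustrative simplification rather than the paper's exact LMVF definition.

```python
import cv2
import numpy as np

def lip_motion_velocity(prev_roi, curr_roi):
    """Lip-motion velocity feature from dense optical flow between two
    consecutive grayscale mouth ROIs (mean |horizontal| and |vertical| flow)."""
    flow = cv2.calcOpticalFlowFarneback(prev_roi, curr_roi, None,
                                        pyr_scale=0.5, levels=3, winsize=15,
                                        iterations=3, poly_n=5, poly_sigma=1.2,
                                        flags=0)
    return np.abs(flow[..., 0]).mean(), np.abs(flow[..., 1]).mean()

# Two random 48x64 mouth ROIs standing in for consecutive video frames.
prev_roi = np.random.randint(0, 256, (48, 64), dtype=np.uint8)
curr_roi = np.random.randint(0, 256, (48, 64), dtype=np.uint8)
print(lip_motion_velocity(prev_roi, curr_roi))
```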

An Analysis of Visual Speech Features for Recognition of Non articulatory Sounds using Machine Learning

Treatment of speech disorders requires speech therapy and substantial effort. People with such disorders need rigorous training to plan and to execute the motor acts of speech. Speech training starts with facial motor praxia activities and oral myofunctional exercises that involve the production of non-articulatory sounds, such as blowing, tongue snaps, and kisses (lip protrusion), which can be considered precursors to the production of phonemes and words (articulatory sounds). Essentially, a speech therapist supervises therapy exercises performed in therapeutic clinics. Depending on the patient's condition, the use of multimedia devices and mobile technology can make it easier for individuals to achieve their goals through speech training exercises that they can carry out in a clinic or at home. For several years, great effort in the field of multimedia processing has been devoted to the recognition of different types of sounds (speech and others) and to filtering noise [3-5]. In SR systems, therapy exercises for speech disorders start with non-articulatory sounds, which can be misclassified as noise. Besides this recognition issue, speech exercises conducted in noisy locations (clinics or homes) are recorded together with various environmental sounds such as music, birdsong, rain, street traffic and TV audio, as well as people speaking, babies crying, dogs barking, etc. SR systems treat environmental sounds as background noise, which can lead to false recognition or low performance. Some methods have been proposed to filter noise [6,7].

Survey and Comparative Analysis on Video Retrieval Techniques

Vendrig & Worring [20] propose a system that allows character identification in movies. In order to achieve this, they associate visual content with names extracted from movie scripts. Denman et al. [13] present the tools in a system for creating semantically significant summaries of broadcast snooker footage. Their system parses the video sequence, identifies the proper camera views, and tracks ball movements. A similar approach presented by Kim et al. [14] extracts semantic information from basketball videos based on audio-visual features. A semantic video retrieval approach using audio analysis is presented by Bakker and Lew [15], in which the audio can be automatically categorized into semantic categories such as explosions, music, speech, etc. A learning method using the AdaBoost algorithm and a k-nearest neighbor approach is proposed by Pickering et al. [16] for video retrieval.

Deep word embeddings for visual speech recognition

We compare our architecture with two approaches which, to our knowledge, are the two best performing approaches on LRW. The first is proposed in [7], and it deploys an encoder-decoder with a temporal attention mechanism. Although the architecture is designed to address the problem of sentence-level recognition, it has also been tested on the LRW database after fine-tuning on the LRW training set. The whole set of experimental results can be found in [7], and the results on LRW are repeated here in Table 1 (denoted by Watch-Attend-Spell). The second architecture was introduced by our team in [15], and its differences from the proposed one have been discussed above. The experimental results on LRW are given in Table 1 (denoted by ResNet-LSTM). Both experiments use the full set of words during training and evaluation (i.e. 500 words).
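
For orientation, a heavily simplified sketch of a ResNet-LSTM-style word classifier for LRW's 500 isolated words is given below; the small convolutional frontend stands in for the ResNet used in [15], and all layer sizes are illustrative assumptions rather than the published configuration.

```python
import torch
import torch.nn as nn

class VisualWordClassifier(nn.Module):
    """Simplified frontend + LSTM word classifier for LRW-style isolated-word
    lipreading (500 classes). The small conv stack stands in for a ResNet."""
    def __init__(self, num_words=500, feat_dim=256, hidden=256):
        super().__init__()
        self.frontend = nn.Sequential(              # applied per frame
            nn.Conv2d(1, 32, 5, stride=2, padding=2), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(64, feat_dim),
        )
        self.lstm = nn.LSTM(feat_dim, hidden, batch_first=True, bidirectional=True)
        self.head = nn.Linear(2 * hidden, num_words)

    def forward(self, frames):                      # frames: (B, T, 1, H, W)
        b, t = frames.shape[:2]
        feats = self.frontend(frames.flatten(0, 1)).view(b, t, -1)
        out, _ = self.lstm(feats)
        return self.head(out.mean(dim=1))           # average over time

logits = VisualWordClassifier()(torch.randn(2, 29, 1, 96, 96))
print(logits.shape)                                 # torch.Size([2, 500])
```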

Recurrent neural network language model adaptation for multi-genre broadcast speech recognition and alignment

The stages involved in using RNNLM adaptation for ASR are as follows. Voice Activity Detection (VAD) is first applied to the audio in order to identify speech segment boundaries. The input text is then converted to a monophone/triphone/senone representation and aligned to the segmented audio using a baseline ASR system. The segmented audio and aligned text are fed to a DNN-HMM system, which can be either a Hybrid or a Bottleneck system [38]. In the Hybrid system, a DNN is used to predict monophone/triphone/senone states from audio features, which in most cases are log filterbank features. This results in posteriors over these states, which are integrated as observation probabilities in a hidden Markov model (HMM) and used to predict the optimal path, also taking into account dynamical constraints arising from an underlying language model. In a Bottleneck system, the log filterbank features are fed to the DNN as input and the monophone/triphone/senone states are used as output targets. A bottleneck layer, which generally has a lower dimension than the final layer, is introduced between the final layer of the DNN and the output layer, and the activation values of that layer are extracted as bottleneck features. These bottleneck features are used as input to a standard GMM-HMM system and have been found to outperform GMM-HMM systems with MFCC or PLP features [38], due to the discriminative nature of the input features.
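
A minimal sketch of the bottleneck idea described above, with illustrative layer sizes: a DNN trained on log filterbank input with senone targets, whose low-dimensional bottleneck activations can later be extracted as features for a downstream GMM-HMM system.

```python
import torch
import torch.nn as nn

class BottleneckDNN(nn.Module):
    """Sketch of a bottleneck DNN: spliced log filterbank input, senone output
    targets, and a low-dimensional bottleneck layer whose activations are used
    as features for a GMM-HMM system. Layer sizes are illustrative."""
    def __init__(self, feat_dim=40 * 11, hidden=1024, bottleneck=40, senones=4000):
        super().__init__()
        self.body = nn.Sequential(
            nn.Linear(feat_dim, hidden), nn.Sigmoid(),
            nn.Linear(hidden, hidden), nn.Sigmoid(),
            nn.Linear(hidden, bottleneck),          # bottleneck layer
        )
        self.out = nn.Sequential(nn.Sigmoid(), nn.Linear(bottleneck, senones))

    def forward(self, x):
        return self.out(self.body(x))               # senone logits

    def extract_bottleneck(self, x):
        return self.body(x)                         # features for the GMM-HMM

net = BottleneckDNN()
frames = torch.randn(8, 40 * 11)                    # 8 spliced filterbank frames
print(net(frames).shape, net.extract_bottleneck(frames).shape)
```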

Speech Endpoint Detection Based on High Order Statistics

Despite these various methods, there is as yet no universal detection algorithm that works reliably in all possible noises and settings. Difficulties in endpoint detection arise not only from the different types of noise present in the recording, but also from the vocabulary words themselves. Some phonemes or sounds have very low energy compared to the vowel portions of the speech and, as a result, are interpreted as background noise [3].
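
As a rough illustration of how higher-order statistics can supplement an energy criterion for frame-level speech/non-speech decisions, the sketch below combines short-time energy with excess kurtosis; the thresholds and the synthetic signal are purely illustrative, and this is not the detector proposed in the paper.

```python
import numpy as np
from scipy.stats import kurtosis

def frame_decisions(signal, sr=16000, frame_ms=25, energy_thr=1e-4, kurt_thr=0.5):
    """Toy frame-level speech/non-speech decision combining short-time energy
    with excess kurtosis (a fourth-order statistic). Thresholds would need
    tuning per recording condition."""
    frame_len = int(sr * frame_ms / 1000)
    decisions = []
    for start in range(0, len(signal) - frame_len, frame_len):
        frame = signal[start:start + frame_len]
        energy = np.mean(frame ** 2)
        k = kurtosis(frame)       # Gaussian noise -> approximately 0
        decisions.append(energy > energy_thr or k > kurt_thr)
    return np.array(decisions)

# Half a second of low-level noise followed by a louder tone standing in for speech.
sig = np.concatenate([0.005 * np.random.randn(8000),
                      np.sin(2 * np.pi * 200 * np.arange(8000) / 16000)])
print(frame_decisions(sig).astype(int))
```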

A corpus of audio-visual Lombard speech with frontal and profile views

Table 1. The mean and standard deviation (M ± SD) of acoustic, phonetic and visual features of all talkers, female (F) talkers and male (M) talkers. P: plain, L: Lombard. The columns labelled t summarize the results of statistical analyses (t-tests) between the plain and Lombard conditions. Symbols: ↑ increase, ↓ decrease. All tests were significant (p < 0.001) except those marked with ⋆ (p > 0.5).

Resolution limits on visual speech recognition

We have shown that the performance of simple visual speech recognizers exhibits a threshold effect with resolution. For successful lip-reading one needs a minimum of four pixels across the closed lips. However, the surprising result is the remarkable resilience that computer lip-reading shows to reduced resolution. Given that modern experiments in lip-reading usually take place with high-resolution video ([16] and [1], for example), the disparity between the measured performance (shown here) and the assumed performance is very striking.

Design and Recording of Czech Audio-Visual Database with Impaired Conditions for Continuous Speech Recognition

able illumination during the speech. The aim of recording a database with impaired conditions is to test an existing visual parameterization (Císař et al., 2007). This parameterization consists of lip-shape information and pixel-based information about the inner part of the mouth. To stay focused on the parameterization, it is crucial to preprocess the database. The preprocessing means that the database contains information about the position of the region of interest (ROI), in this case the part of the face around the lips. The ROI information is used as an input to the visual parameterization algorithm. This paper is organized as follows. The next section lists some existing databases that have already been released. Section three describes the database specification and recording. Section four deals with database preprocessing. The last section summarizes the facts about our database.
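
Since the database stores the ROI position alongside each recording, a visual parameterization front end only needs to crop that region from each frame; a trivial sketch follows, where the (x, y, width, height) coordinate convention is an assumption for illustration.

```python
import numpy as np

def crop_roi(frame, roi):
    """Crop the lip region of interest from a video frame using stored ROI
    coordinates given as (x, y, width, height)."""
    x, y, w, h = roi
    return frame[y:y + h, x:x + w]

frame = np.zeros((576, 720, 3), dtype=np.uint8)    # one PAL-sized frame
print(crop_roi(frame, (300, 320, 160, 96)).shape)  # (96, 160, 3)
```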

Voice Recognition System Through Machine Learning

Abstract: Human voice recognition by computers has been a continually developing area since 1952. It is a challenging task for a computer to understand and act according to human voice rather than commands or programs. The reason is that no two humans have the same voice, style or pitch, and not every word is pronounced by everyone in the same fashion. Background noise and disturbances may confuse the system, and the voice or accent of the same person may change according to the user's mood, situation, time, etc. Despite all these challenges, voice recognition and speech-to-text conversion have reached a successful stage. Voice processing technology still deserves more research. As a tip of the iceberg of this research, we contribute our work in this area and propose a new method, VRSML (Voice Recognition System through Machine Learning), which mainly focuses on speech-to-text conversion and then on analyzing the text extracted from the speech, in the form of tokens, through machine learning. After analyzing the derived text, reports are created in textual as well as graphical format to represent the vocabulary levels used in that speech. As a supervised learning algorithm is employed to classify the tokens derived from the text, the reports are more accurate and are generated faster.
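
A toy sketch of the token-classification step described above, assuming the transcript has already been produced by an upstream speech-to-text stage: a small supervised classifier (here a character n-gram Naive Bayes model, chosen for illustration; the paper does not specify this particular pipeline) labels each token with a vocabulary level, and the labels are counted for a report.

```python
from collections import Counter
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Tiny labelled word list (labels invented purely for illustration).
train_words = ["go", "see", "make", "analyze", "hypothesis", "paradigm"]
train_levels = ["basic", "basic", "basic", "advanced", "advanced", "advanced"]

# Character n-grams let the model generalise to unseen tokens.
vec = CountVectorizer(analyzer="char_wb", ngram_range=(2, 4))
X = vec.fit_transform(train_words)
clf = MultinomialNB().fit(X, train_levels)

# Transcript assumed to come from an upstream speech-to-text step.
transcript = "we analyze the paradigm and then go see the results"
tokens = transcript.split()
report = Counter(clf.predict(vec.transform(tokens)))
print(dict(report))    # e.g. counts per vocabulary level
```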
