This Master's thesis was developed at the Department of Electronics and Telecommunications (Faculty of Information Technology, Mathematics and Electrical Engineering) at NTNU (Trondheim, Norway), from February 2009 to July 2009. The thesis was titled Speech Analysis for Automatic Speech Recognition and is connected to the research project SIRKUS. The aim of the SIRKUS project is to investigate structures and strategies for automatic speech recognition: the type of linguistic unit used as the basic unit (today, phonemes, which are perceptually defined, are used), the acoustic properties to look for in the speech waveform, and the classifier to use (Hidden Markov Models (HMM) are predominantly used today).
Speech understanding draws on knowledge at several linguistic levels: phonetics, lexical access, syntax, semantics and pragmatics.
4 PHASES OF ASR
An automatic speech recognition system involves two phases: a training phase and a recognition phase. A rigorous training procedure is followed to map basic speech units, such as phones or syllables, to acoustic observations. In the training phase, known speech is recorded, pre-processed and passed to the first stage, feature extraction. The next three stages are HMM creation, HMM training and HMM storage. The recognition phase starts with the acoustic analysis of an unknown speech signal. The captured signal is converted to a series of acoustic feature vectors, and the input observations are processed with an appropriate algorithm. The speech is compared against the network of HMMs, and the recognized word is displayed. An ASR system can only recognize what it has learned during the training process. However, the system can also recognize words that are not present in the training corpus, provided the sub-word units of the new word are known to the system and the new word exists in the system dictionary.
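As a rough illustration of this two-phase flow (a generic sketch, not any specific system described here), the example below trains one Gaussian HMM per word on pre-extracted MFCC feature arrays and, at recognition time, picks the word whose model scores the incoming features highest. The hmmlearn library, the five-state topology and the dictionary layout are illustrative assumptions.

```python
# Hedged sketch of the train/recognize phases described above (assumptions:
# hmmlearn and numpy are available; features are pre-extracted MFCC arrays).
import numpy as np
from hmmlearn.hmm import GaussianHMM

def train_word_models(training_data):
    """training_data: dict mapping word -> list of (n_frames, n_mfcc) arrays."""
    models = {}
    for word, utterances in training_data.items():
        X = np.vstack(utterances)                       # stack all frames
        lengths = [u.shape[0] for u in utterances]      # per-utterance lengths
        model = GaussianHMM(n_components=5, covariance_type="diag", n_iter=20)
        model.fit(X, lengths)                           # HMM training
        models[word] = model                            # HMM storage
    return models

def recognize(models, features):
    """Return the word whose HMM gives the highest log-likelihood."""
    return max(models, key=lambda w: models[w].score(features))
```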
With the advancement of speech recognition technologies, voice interfaces are increasingly adopted on mobile platforms. While developing a general-purpose Automatic Speech Recognition (ASR) engine that can understand voice commands is important, the contexts in which people interact with their mobile devices change very rapidly. Due to the high processing complexity of the ASR engine, much of the processing of trending data is carried out on cloud platforms. Changing content regarding news, music, movies and TV series shifts the focus of interaction with voice-based interfaces, so ASR engines trained on a static vocabulary may not be able to adapt to the changing contexts. The focus of this paper is first to describe the problems faced in incorporating dynamically changing vocabulary and contexts into an ASR engine. We then propose a novel solution which shows a relative improvement of 38 percent in utterance accuracy on newly added content without compromising the overall accuracy and stability of the system.
Automatic Speech Recognition (ASR) is the process of converting a speech signal into a sequence of words by means of an algorithm implemented as a computer program (Sanjivani S. Bhabad and Gajanan K. Kharate, 2013). Driven by technological curiosity to build machines that mimic humans and by the desire to automate work, research in speech recognition, as a first step toward natural human-machine communication, has attracted much enthusiasm over the past five decades. Several research efforts have therefore been oriented to this area, with computer scientists researching ways to make computers able to record, interpret and understand human speech; it has been an intensive research area for decades. An ASR system includes two phases: a training phase and a recognition phase. In the training phase, known speech is recorded, and the features (a parametric representation of the speech) are extracted and stored in the speech database. In the recognition phase, the features of the given input speech signal are extracted and compared with the reference templates (stored in the speech database) to recognize the utterance.
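The template-matching comparison mentioned above can be illustrated with dynamic time warping (DTW): the input feature sequence is aligned against each stored reference template, and the closest one wins. The following NumPy sketch is a generic illustration of that idea, not the cited paper's method.

```python
# Hedged sketch: DTW distance between an input feature sequence and stored
# reference templates (both as (n_frames, n_features) NumPy arrays).
import numpy as np

def dtw_distance(a, b):
    """Classic DTW with Euclidean frame distance."""
    n, m = len(a), len(b)
    cost = np.full((n + 1, m + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = np.linalg.norm(a[i - 1] - b[j - 1])
            cost[i, j] = d + min(cost[i - 1, j], cost[i, j - 1], cost[i - 1, j - 1])
    return cost[n, m]

def recognize_by_template(features, templates):
    """templates: dict word -> reference feature array; returns the closest word."""
    return min(templates, key=lambda w: dtw_distance(features, templates[w]))
```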
The MITRE Corporation, 7525 Colshire Dr, McLean, VA 22102
Abstract
We address and evaluate the challenges of utilizing Automatic Speech Recognition (ASR) to support the human translator. Audio transcription and translation are known to be far more time-consuming than text translation: at least 2 to 3 times longer. Furthermore, the time to translate or transcribe audio is vastly dependent on audio quality, which can be impaired by background noise, overlapping voices, and other acoustic conditions. The purpose of this paper is to explore the integration of ASR into the translation workflow and evaluate the challenges of utilizing ASR to support the human translator. We present several case studies in different settings in order to evaluate the benefits of ASR. Time is the primary factor in this evaluation.
Automatic speech recognition, once considered a concept of science fiction and hit by a number of performance-degrading factors, is now an important part of information and communication technology. Improvements in the fundamental approaches and the development of new approaches by researchers have led to the advancement of ASR, from systems that merely responded to a small set of sounds to sophisticated systems that respond to fluently spoken natural language. Using artificial neural networks (ANNs), mathematical models of the low-level circuits in the human brain, to improve speech recognition performance through a model known as the ANN-Hidden Markov Model (ANN-HMM) has shown promise for large-vocabulary speech recognition systems. Achieving higher recognition accuracy and a low word error rate, developing speech corpora suited to the nature of the language, and addressing the sources of variability through approaches such as Missing Data Techniques and Convolutive Non-negative Matrix Factorization are the major considerations for developing an efficient ASR. In this paper, an effort has been made to highlight the progress made so far for ASR in different languages and the technological perspective of automatic speech recognition in countries such as China, Russia, Portugal, Spain, Saudi Arabia, Vietnam, Japan, the UK, Sri Lanka, the Philippines, Algeria and India.
{negri,turchi,desouza,falavi}@fbk.eu
Abstract
We address the problem of estimating the quality of Automatic Speech Recognition (ASR) output at the utterance level, without recourse to manual reference transcriptions and when information about the system's confidence is not accessible. Given a source signal and its automatic transcription, we approach this problem as a regression task in which the word error rate of the transcribed utterance has to be predicted. To this aim, we explore the contribution of different feature sets and the potential of different algorithms in testing conditions of increasing complexity. Results show that our automatic quality estimates closely approximate the word error rate scores calculated over reference transcripts, outperforming a strong baseline in all testing conditions.
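A minimal sketch of this regression formulation: given per-utterance feature vectors and true word error rates for a training set, fit a regressor that predicts the WER of unseen utterances. The choice of a random forest and the scikit-learn API are assumptions made for illustration, not the paper's exact setup.

```python
# Hedged sketch: WER prediction as a regression task (scikit-learn assumed).
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error

def train_qe_model(X_train, wer_train):
    """X_train: (n_utterances, n_features); wer_train: true WER per utterance."""
    model = RandomForestRegressor(n_estimators=200, random_state=0)
    model.fit(X_train, wer_train)
    return model

def evaluate_qe_model(model, X_test, wer_test):
    """Mean absolute error between predicted and reference WER scores."""
    predicted = model.predict(X_test)
    return mean_absolute_error(wer_test, predicted)
```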
Keywords: Automatic Speech Recognition, Free Software, WaveSurfer
1. Introduction
Automatic Speech Recognition (ASR) is becoming an important part of our lives, both as a viable alternative for human-computer interaction and as a tool for linguistics and speech research. In many cases, however, it is troublesome, even in the language and speech communities, to get easy access to ASR resources. On the one hand, commercial systems are often too expensive and not flexible enough for researchers. On the other hand, free ASR software often lacks high-quality resources such as acoustic and language models for specific languages and requires expertise that linguists and speech researchers cannot afford.
{shuet, ggravier, sebillot}@irisa.fr
Abstract
Texts generated by automatic speech recognition (ASR) systems have some specificities, related to the idiosyncrasies of oral productions or to the principles of ASR systems, that make them more difficult to exploit than more conventional natural language written texts. This paper studies the interest of morphosyntactic information as a useful resource for ASR. We show the ability of automatic methods to tag the outputs of ASR systems, obtaining a tag accuracy on automatic transcriptions similar to the 95-98% usually reported for written texts such as newspapers. We also demonstrate experimentally that tagging is useful for improving the quality of transcriptions by using morphosyntactic information in a post-processing stage of speech decoding. Indeed, we obtain a significant decrease of the word error rate in experiments on French broadcast news from the ESTER corpus; we also notice an improvement of the sentence error rate and observe that a significant number of agreement errors are corrected.
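One way to picture this post-processing idea is to rerank N-best hypotheses by combining the recognizer score with a part-of-speech tag-sequence score. The sketch below assumes an external tagger and tag language model (`tagger` and `tag_lm_logprob` are hypothetical callables) and only illustrates the principle, not the actual ESTER experiments.

```python
# Hedged sketch: rescoring N-best ASR hypotheses with morphosyntactic information.
# Assumes external helpers: tagger(words) -> tags, tag_lm_logprob(tags) -> float.

def rescore_nbest(nbest, tagger, tag_lm_logprob, weight=0.3):
    """nbest: list of (words, asr_logprob). Returns the best rescored hypothesis."""
    def combined_score(hyp):
        words, asr_logprob = hyp
        tags = tagger(words)                      # morphosyntactic tagging
        return asr_logprob + weight * tag_lm_logprob(tags)
    return max(nbest, key=combined_score)
```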
Speech is a primary mode of communication among human beings, and it is natural for people to expect to be able to carry out spoken dialogue with computers. In this paper we discuss the fundamental approaches and the developments of the last several years of research in Automatic Speech Recognition (ASR). The design of a speech recognition system requires careful attention to the following issues: the type of speech class, feature extraction, the acoustic model, the pronunciation dictionary and the language model. We present the various techniques that address these problems in ASR. This paper serves as a review of the problems in ASR research across various speech recognition models.
There are two scenarios in medical dictation where ASR can remove or alleviate the problems mentioned above: Real-time ASR and ASR+post-editing.
Real-time automatic speech recognition
Speaking is faster than typing (Basapur et al., 2007). If the physician uses digital dictation augmented with real-time ASR, the secretary is not part of the documentation workflow and a resource is freed for other purposes. As a side effect, the physician is the last set of eyes on the transcription and can approve or correct it immediately, while the consultation is still fresh in memory. If integrated with an electronic medical records system, the physician can even dictate directly into the patient record, and the clinical documentation will always be up-to-date with the most recent information.
KEYWORDS: Automatic Speech Recognition (ASR), Feature Extraction, MFCCs, LPC, RASTA, PLDA and PLP.
I. INTRODUCTION
Speech is an important means of communication, and speech processing is one of the most exciting research areas within signal processing. Since the signals are generally processed in the digital domain, speech processing can be regarded as digital signal processing applied to the speech signal. Automatic Speech Recognition (ASR) is computer-based speech recognition: the process of converting a speech signal into a series of words and other linguistic units with the help of algorithms implemented as computer programs. The predominant objective of ASR is to develop techniques and systems that enable computers to identify speech signals fed as input. Speech recognition and its applications have evolved over the past few decades. In any speech recognition system, the speech signal is converted into text; this text is the output of the ASR and is nearly equivalent to the speech fed as input. Speech recognition finds application in voice search, voice dialling, robotics, etc.
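A minimal feature-extraction sketch in the spirit of the MFCC front end mentioned in the keywords, using the librosa library; the file path, sampling rate and window parameters are placeholder assumptions.

```python
# Hedged sketch: extracting MFCC features from a speech file (librosa assumed).
import librosa

def extract_mfcc(path, n_mfcc=13):
    """Load audio and return an (n_frames, n_mfcc) MFCC matrix."""
    y, sr = librosa.load(path, sr=16000)                    # resample to 16 kHz
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc,
                                n_fft=400, hop_length=160)  # 25 ms windows, 10 ms hop
    return mfcc.T                                           # frames as rows
```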
{fhamidi,mb}@cse.yorku.ca
Abstract
Although automatic speech recognition (ASR) has been used in several systems that support speech training for children, this particular design domain poses ongoing challenges: an input domain of non-standard speech and a user population for which meaningful, consistent, and well-designed automatically derived feedback is imperative. In this design analysis, we focus on and analyze the differences between the tasks of speech recognition and speech assessment, and identify the latter as a central issue for work in the speech-training domain. Our analysis is based on empirical results from fieldwork with Speech-Language Pathologists concerning the design requirements analysis for tangible toys intended for speech intervention with primary-school-aged children. This analysis leads us to advocate for the use of only rudimentary ASR feedback.
2 Universität des Saarlandes, Saarbrücken, Germany cristinae@cs.upc.edu, jose.fonollosa@upc.edu
Abstract. Automatic Speech Recognition has reached almost human performance in some controlled scenarios. However, recognition of impaired speech is a difficult task for two main reasons: data is (i) scarce and (ii) heterogeneous. In this work we train different architectures on a database of dysarthric speech. A comparison between architectures shows that, even with a small database, hybrid DNN-HMM models outperform classical GMM-HMM models according to word error rate measures. A DNN is able to improve the recognition word error rate by 13% for subjects with dysarthria with respect to the best classical architecture. This improvement is higher than that given by other deep neural networks such as CNNs, TDNNs and LSTMs. All the experiments have been done with the Kaldi toolkit for speech recognition, for which we have adapted several recipes to deal with dysarthric speech and work on the TORGO database. These recipes are publicly available.
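The hybrid DNN-HMM idea replaces GMM state likelihoods with posteriors produced by a neural network over spliced acoustic frames. The PyTorch sketch below is a generic frame-level classifier of that kind, not one of the Kaldi recipes referred to above; layer sizes and the context window are illustrative assumptions.

```python
# Hedged sketch: a frame-level DNN acoustic model as used in hybrid DNN-HMM ASR
# (PyTorch assumed; layer sizes, context window and state count are illustrative).
import torch.nn as nn

class FrameClassifier(nn.Module):
    """Classifies spliced feature frames into HMM states (senones).

    During decoding, log-posteriors are divided by state priors to obtain
    scaled likelihoods for the HMM.
    """
    def __init__(self, n_features=40, context=5, n_states=2000):
        super().__init__()
        input_dim = n_features * (2 * context + 1)    # spliced context window
        self.net = nn.Sequential(
            nn.Linear(input_dim, 512), nn.ReLU(),
            nn.Linear(512, 512), nn.ReLU(),
            nn.Linear(512, n_states),                 # one logit per HMM state
        )

    def forward(self, spliced_frames):
        return self.net(spliced_frames)               # (batch, n_states) logits
```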
Although much literature on automatic speech recognition (ASR) systems is available for almost all of the major spoken languages in the world, very little research has been performed on Bangla (also termed Bengali), which is widely spoken by people all over the world. Although the number of Bangla speakers is about 250 million today, which makes Bangla the seventh most spoken language (Banglapedia, 2013), a systematic and scientific effort for the computerization of this language has not yet been started. The Bengali alphabet is a syllabic alphabet in which all consonants have an inherent vowel with two different pronunciations, the choice of which is not always easy to determine and which is sometimes not pronounced at all. Some efforts have been made to develop a Bangla speech corpus to build a Bangla text-to-speech system (Hossain et al., 2007). However, this effort is part of developing speech databases for Indian languages, of which Bangla is one part, spoken in the eastern area of India (West Bengal). But most native speakers of Bangla (more than two thirds) reside in Bangladesh, where it is the official language. Although the written characters of standard Bangla are the same in both countries, there are some sounds
The goal of this thesis is to develop and design new feature representations that can improve automatic speech recognition (ASR) performance in clean as well as noisy conditions. One of the main shortcomings of fixed-scale (typically 20-30 ms long analysis windows) envelope-based features such as MFCCs is their poor handling of the non-stationarity of the underlying signal. In this thesis, a novel stationarity-synchronous speech spectral analysis technique is proposed that sequentially detects the largest quasi-stationary segments in the speech signal (typically of variable lengths from 20-60 ms), followed by their spectral analysis. In contrast to a fixed-scale analysis technique, the proposed technique provides better time and frequency resolution, thus leading to improved ASR performance. Moving a step forward, this thesis then outlines the development of theoretically consistent amplitude modulation and frequency modulation (AM-FM) techniques for a broad-band signal such as speech. AM-FM signals have been well defined and studied in the context of communication systems. Borrowing upon these ideas, several researchers have applied AM-FM modeling to speech signals with mixed results; these techniques have varied in their definitions and consequently in the demodulation methods used. In this thesis, we carefully define AM and FM signals in the context of ASR. We show that for a theoretically meaningful estimation of the AM signal, it is important to constrain the companion FM signal to be narrow-band. Due to the Hilbert relationships, the AM signal induces a component in the FM signal which is fully determinable from the AM signal and hence forms redundant information. We present a novel homomorphic filtering technique to extract the leftover FM signal after suppressing the redundant part of the FM signal. The estimated AM message signals are then down-sampled and their lower DCT coefficients are retained as speech features. We show that this representation is, in fact, the exact dual of the real cepstrum and hence is referred to as the fepstrum. While the fepstrum provides the amplitude modulations (AM) occurring within a single frame of 100 ms, the MFCC feature provides the static energy in the Mel bands of each frame and its variation across several frames (the deltas). Together these two features complement each other, and ASR experiments (based on hidden Markov models and Gaussian mixture models, HMM-GMM) indicate that the fepstrum feature in conjunction with the MFCC feature achieves significant ASR improvement when evaluated over several speech databases.
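For a narrow-band signal, the AM and FM components can be estimated from the analytic signal obtained via the Hilbert transform. The SciPy sketch below illustrates this standard demodulation step (applied per sub-band), not the thesis's fepstrum derivation itself.

```python
# Hedged sketch: AM-FM demodulation of a narrow-band signal via the Hilbert
# transform (NumPy/SciPy assumed; applied per sub-band, not to wideband speech).
import numpy as np
from scipy.signal import hilbert

def am_fm_demodulate(x, sample_rate):
    """Return the amplitude envelope (AM) and instantaneous frequency in Hz (FM)."""
    analytic = hilbert(x)                               # x + j * Hilbert(x)
    am = np.abs(analytic)                               # amplitude envelope
    phase = np.unwrap(np.angle(analytic))               # instantaneous phase
    fm = np.diff(phase) / (2.0 * np.pi) * sample_rate   # instantaneous frequency
    return am, fm
```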
This paper introduces a speech corpus developed for Myanmar Automatic Speech Recognition (ASR) research. ASR research is conducted by researchers around the world to improve their language technologies. Speech corpora are important in developing ASR, and the creation of corpora is especially necessary for low-resourced languages. Myanmar can be regarded as a low-resourced language because of the lack of pre-existing resources for speech processing research. In this work, a speech corpus named UCSY-SC1 (University of Computer Studies Yangon - Speech Corpus 1) is created for Myanmar ASR research. The corpus consists of two domains: news and daily conversations. The total size of the speech corpus is over 42 hours, comprising 25 hours of web news and 17 hours of recorded conversational data. The corpus was collected from 177 females and 84 males for the news data and 42 females and 4 males for the conversational domain. This corpus was used as training data for developing Myanmar ASR. Three different types of acoustic models, Gaussian Mixture Model (GMM) - Hidden Markov Model (HMM), Deep Neural Network (DNN), and Convolutional Neural Network (CNN) models, were built and their results compared. Experiments were conducted on different data sizes and evaluation was done on two test sets: TestSet1, web news, and TestSet2, recorded conversational data. The Myanmar ASR trained on this corpus gave satisfactory results on both test sets, with word error rates of 15.61% on TestSet1 and 24.43% on TestSet2.
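The word error rates quoted above are computed as the word-level edit distance between reference and hypothesis, divided by the number of reference words; a small self-contained sketch follows.

```python
# Sketch: word error rate as the Levenshtein distance over words, divided by
# the number of reference words.
def word_error_rate(reference, hypothesis):
    ref, hyp = reference.split(), hypothesis.split()
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i                                   # deletions
    for j in range(len(hyp) + 1):
        d[0][j] = j                                   # insertions
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = d[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            d[i][j] = min(sub, d[i - 1][j] + 1, d[i][j - 1] + 1)
    return d[len(ref)][len(hyp)] / max(len(ref), 1)
```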
Daniel Povey, Arnab Ghoshal, Gilles Boulianne, Lukas Burget, Ondrej Glembek, Nagendra Goel, Mirko Hannemann, Petr Motlicek, Yanmin Qian, Petr Schwarz, Jan Silovsky, Georg Stemmer, and Karel Vesely. 2011. The Kaldi speech recognition toolkit. In IEEE 2011 Workshop on Automatic Speech Recognition and Understanding. IEEE Signal Processing Society, December. IEEE Catalog No.: CFP11SRW-USB.
Contextual Error Correction in Automatic Speech Recognition
ABSTRACT
This disclosure describes techniques that leverage the context of a conversation between a user and a virtual assistant to correct errors in automatic speech recognition (ASR). Once confirmed by the user, the correction event is used to augment the training data for ASR.
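As a toy illustration of the general idea (not the disclosure's specific method), the sketch below replaces low-confidence words in an ASR hypothesis with the most similar term drawn from the conversation context, using character-level similarity from Python's difflib as a crude stand-in for phonetic similarity; the thresholds are arbitrary assumptions.

```python
# Hedged toy sketch: contextual correction of ASR output. Low-confidence words
# are replaced by the closest context term (difflib similarity as a proxy for
# phonetic similarity; thresholds are illustrative assumptions).
import difflib

def correct_with_context(words, confidences, context_terms,
                         conf_threshold=0.6, sim_threshold=0.7):
    """words: ASR hypothesis tokens; confidences: per-word scores in [0, 1]."""
    corrected = []
    for word, conf in zip(words, confidences):
        if conf < conf_threshold:
            match = difflib.get_close_matches(word, context_terms, n=1,
                                              cutoff=sim_threshold)
            corrected.append(match[0] if match else word)
        else:
            corrected.append(word)
    return corrected
```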