Design and Implementation of
Three-tier Distributed VoiceXML-based Speech System
BY
Thomas C.K. Cheng
Supervised by
Michael R. Lyu
A report submitted in partial fulfillment of the requirements for the degree of
MASTER OF SCIENCE
in
COMPUTER SCIENCE
AUTHORIZATION
I hereby declare that I am the sole author of the thesis.
I authorize the Chinese University of Hong Kong to lend this thesis to other institutions or individuals for the purpose of scholarly research.
I further authorize the Chinese University of Hong Kong to reproduce the thesis by photocopying or by other means, in total or in part, at the request of other institutions or individuals for the purpose of scholarly research.
SIGNATURE
APPROVED:
Prof. Michael R. Lyu, SUPERVISOR
TABLE OF CONTENTS
AUTHORIZATION...ii
SIGNATURE ...iii
TABLE OF CONTENTS...iv
LIST OF FIGURES...vi
ABSTRACT ...viii
Chapter 1 INTRODUCTION... 1
Chapter 2 SPEECH TECHNOLOGY... 5
2.1 Speech
Recognition...5
2.1.1 Hidden Markov Models... 6
2.1.2 Feature Extraction... 9
2.1.3 Distributed Speech Recognition ... 10
2.2 Text-to-Speech
Synthesis ... 11
2.3 Voice eXtensible Markup Language... 12
Chapter 3 COMMON OBJECT REQUEST BROKER
ARCHITECTURE ... 15
3.1 Load
Balancing ... 16
3.1.2 Replication... 17
3.2 Fault
Tolerance... 18
3.2.1 Location Transparency ... 19
3.2.2 Recoverability of the Activation Agent... 20
Chapter 4 SOFTWARE DESIGN SPECIFICATION...22
4.1 System
Architecture ... 22
4.1.1 Platform ... 22
4.1.2 Terminal ... 22
4.1.3 Speech Browser ... 26
4.1.4 HMM training by HTK ... 28
APPENDIX : PROJECT SCHEDULE...32
LIST OF FIGURES
Figure 1 Components of a typical speech recognition system ... 1
Figure 2 Description of components in various speech recognition
system... 2
Figure 3 A client-server model of speech recognition system ... 3
Figure 4 Components of a typical TTS synthesis system... 3
Figure 5 A client-server model of TTS synthesis system... 3
Figure 6 Scheme of ASR system functions ... 6
Figure 7 The Markov generation model M... 8
Figure 8 Block diagram of DSR system ... 11
Figure 9 VoiceXML architecture model... 13
Figure 10 A sample VoiceXML document and one possible
corresponding dialog... 14
Figure 11 OMG’s Object Management Architecture ...15
Figure 12 A request passing from client to object implementation
...16
Figure 14 System architecture (Terminal) ...23
Figure 15 Examples of terminals ... 23
Figure 16 A Java applet running as a terminal ... 24
Figure 17 Block diagram of the DSR front-end ...25
Figure 18 System architecture (Speech browser) ... 26
ABSTRACT
There are plenty of speech recognition researches and applications targeting on embedded systems, desktops and servers. Each of them has their own limitations. We may have a command and control application on mobile devices, a speaker-dependent dictation application on home PCs and a speaker-independent interactive voice response (IVR) application on call-center servers. Advances in speech recognition technology, wireless technology and growth of the Internet leads us to establish a universal solution to break all their limitations, i.e. a new three-tier speech system and makes us to access information more simply and more naturally.
In this project, we are designing and implementing a three-tier distributed VoiceXML-based speech system to demonstrate our model. The system is comprised of three components, namely terminal, speech browser and document server, forming a three-tier model. The terminal usually has small sizes, limited computing powers, small memories and limited bandwidth networks; for example, mobile phone, PDA with bluetooth, analog phone plus PC with telephone interface. Owing to its capabilities, it acts as a front end that uses data channel to send/receive the parameterized representation of speech to/from the speech browser. The speech browser is a cluster of powerful servers acting as a back-end that performs the core recognition, synthesis processes and the interpretation of VoiceXML files retrieved from the document server. The document server is a web site containing the VoiceXML files.
This model adds mobile devices the dialog-based speech capability to access the enormous multilingual information in the Internet without breaking the size, memory and
computation limitations. By adding the features of load balancing and fault tolerance to the speech browser, a reliable back-end service can be guaranteed. Our goal is to find a complete solution to fit into this model by applying various technologies.
Chapter 1 INTRODUCTION
The Automatic Speech Recognition (ASR) has a history of around 50 years, which can be traced back to 1952, when a first word recognizer, Audrey was built by Daveis, Biddulph and Balashek to recognize digits by approximating the formants. The progress made in ASR was very slow until in 1974, Hidden Markov Models (HMMs) was applied to speech recognition by Baker in the Dragon project. It was later developed by IBM (Baker, Jalinek, Bahl, Mercer) in 1976-1993. Now, it is the dominant technology for both isolated word and continuous speech recognition. The extraction of speech feature plays a critical role in the robustness of the speech recognizer. John Bridle proposed Mel Cepstra, which was used in the second major (D)ARPA ASR project in 1980’s. Its variant, Mel Frequency Cepstral Coefficients (MFCC) is still one of the best feature sets used in speech recognition today [1][2][3].
Figure 1 Components of a typical speech recognition system We can represent all speech recognition systems ranging from command and control, dictation to IVR applications as in Figure 1. Each of them is comprised of a
Speech Wave Recognition Preprocessing Feature Extraction Search and Match Acoustic Model Lexical Model Language Model Speech Word
preprocessor, a feature extractor and a recognizer with different algorithm and complexity (Figure 2). Applications Components Command and control Dictation IVR Preprocessing simple noise
filtering
low-complexity noise cancellation
high-complexity noise cancellation
Feature Extraction formant MFCCs MFCCs
Recognition dynamic time
warping (DTW) without models
HMM with acoustic, lexical and language
model
HMM with acoustic, and lexical model
and grammar Figure 2 Description of components in various speech recognition system We are proposing a new three-tier distributed VoiceXML-based speech system, which is modified from the typical one as above. To summarize all the requirements in such a system, we have,
1. Low computational power and small memory in terminal
2. Low communication bandwidth between terminal and speech browser
3. Easy incorporation of various recognition engines and text-to-speech (TTS) synthesizers onto speech browser
4. Provision of reliable service by speech browser
Several modifications shown in Figure 3 are done to meet the above requirements [4]. 1. Make the terminal to perform the low-complexity preprocessing and feature
extraction parts excluding the high-complexity recognition part, which is now assigned to speech browser
2. Add the compressor and decompressor in between terminal and speech browser to achieve a low-bit-rate data transmission between them.
Figure 3 A client-server model of speech recognition system Similarly, we can represent all TTS synthesis system as in Figure 4. To meet the requirements, similar modifications shown in Figure 5 are done [5].
Figure 4 Components of a typical TTS synthesis system
Figure 5 A client-server model of TTS synthesis system Speech Word Speech Wave
Terminal (Client) Speech Browser (Server) Preprocessing
and Feature Extraction
Compression Decompression Recognizer
Speech Text
Text and Linguistic Analysis
Speech Synthesis Speech Wave Text Preprocessing Accent Assignment Word Pronunciation Intonational Phrasing Segmental Durations F0 Contour Computation Phoneme to Unit Mapping Unit Concatenation Waveform Synthesis Speech Wave Speech Text
Speech Browser (Server) Terminal (Client) Text and Linguistic
Analysis
Speech Synthesis
Furthermore, load balancing and replication strategy will also be considered in speech browser
Our objectives are to:
1. Design the preprocessing and feature extraction algorithm specially for mobile devices
2. Design the compression algorithm to achieve a low-bit-rate data transmission between terminals and speech browser
3. Implement the speech browser as CORBA-based such that recognition engines and TTS synthesizers are server objects that may locate at different hosts.
4. Break down and replicate the high-complexity above-mentioned server objects so as to achieve the advantages of load balancing, fault tolerance, scalability and parallelism.
Chapter 2 SPEECH TECHNOLOGY
2.1 Speech Recognition
The thirty-year old research activity in the speech recognition area has produced a well-consolidated technology based on Hidden Markov Models (HMMs) that is also firmly supported theoretically. The HMM technology, that has the powerful capability of modeling complex (non-stationary phenomena and the availability of efficient algorithms managing very large amounts of data, is now available for Automatic Speech Recognition (ASR) tasks owing to low-cost high-performance commercial products [6][7].
In this project, we implement a speech recognition engine using a open source HMM toolkit (HTK), that is primarily designed for building HMM-based speech processing tools, in particular recognizers. Firstly, the HTK training tools are used to estimate the parameters of a set of HMMs using training utterances and their associated transcriptions. Secondly, unknown utterances are transcribed using the HTK recognition tools [8].
Our target is to build a large-vocabulary dialog-based speaker-independent speech recognizer (LVCSR). Large-vocabulary generally means that the systems have a vocabulary of roughly 5,000 to 60,000 words. The term dialog-based means that the words are run together in a restricted way defined in grammars. Speaker-independent means that the systems are able to recognize speech from people whose speech the system has never been exposed to before.
Figure 6 Scheme of ASR system functions
2.1.1 Hidden Markov Models
Let each spoken word be represented by a sequence of speech vectors or observations O, defined as T o o o O? 1, 2,..., Equation 1
where oT is the speech vector observed at time t. The recognition problem can then be regarded as that of computing
)} | ( { max arg P wi O i Equation 2
where wi is the i’th vocabulary word. This probability is not computable directly but using Bayes’ Rule gives
) ( ) ( ) | ( ) | ( O P w P w O P O w P i i i ? Equation 3
Thus, for a given set of prior probabilities P(wi), the most probable spoken word depends
Speech Data Unknown Speech Transcription Training Data Models Transcription Training Tools Recognizer
spoken words is not practicable. However, if a parametric model of word production such as a Markov model is assumed, then estimation from data is possible since the problem of estimating the class conditional observation densities P(O|wi) is replaced by
the much simpler problem of estimating the Markov model parameters.
In HMM-based speech recognition, it is assumed that the sequence of observed speech vectors corresponding to each word is generated by a Markov model as shown in Figure 7. A Markov model is a finite state machine which changes state once every time unit and each time t that a state j is entered, a speech vector ot is generated from the probability density bj(ot). Furthermore, the transition from state i to state j is also probabilistic and is governed by the discrete probability aij. Figure 7 show an example of this process where the six state model moves through the state sequence X = 1, 2, 2, 3, 4, 4, 5, 6 in order to generate the sequence o1 to o6. Notice that in HTK, the entry and exit states of a HMM are non-emitting. This is to facilitate the construction of composite models as explained in more detail later.
The joint probability that O is generated by the model M moving through the state sequence X is calculated simply as the product of the transition probabilities and the output probabilities. So for the state sequence X in Figure 7
)... ( ) ( ) ( ) | , (O X M a12b2 o1 a22b2 o2 a23b3 o3 P ? Equation 4
However, in practice, only the observation sequence O is known and the underlying state sequence X is hidden. This is why it is called a Hidden Markov Model.
Figure 7 The Markov generation model M Given that X is unknown, the required likelihood is computed by summing over all possible state sequences X = x(1), x(2), x(3),…,x(T), that is
?
?
? ? ? X T t t x t x t t x x x b o a a M O P 1 ) 1 ( ) ( ) ( ) 1 ( ) 0 ( ( ) ) | ( Equation 5where x(0) is constrained to be the model entry state and x(T+1) is constrained to be the model exit state.
Given a set of models MI corresponding to words wi, equation 2 (i.e. recognition problem) is solved by using equation 3 and assuming that
) | ( ) | (O wi P O Mi P ? Equation 6
Given a set of training examples corresponding to a particular model, the parameters of that model can be determined automatically by a robust and efficient re-estimation procedure. Thus, provided that a sufficient number of representative examples of each
o6 o5 o4 o3 o2 o1 b5(o6) b4(o5) b4(o4) b3(o3) b2(o2) a22 b2(o1) a33 a44 a55 a35 a24 a23 a34 a45 a56 a12 1 2 3 4 5 6
2.1.2 Feature Extraction
The human ear resolves frequencies non-linearly across the audio spectrum and empirical evidence suggests that designing a front-end to operate in a similar non-linear manner improves recognition performance. A popular alternative to linear prediction based analysis is therefore filterbank analysis since this provides a much more straightforward route to obtaining the desired non-linear frequency resolution. However, filterbank amplitudes are highly correlated and hence, the use of a cepstral transformation in this case is virtually mandatory if the data is to be used in a HMM based recognizer with diagonal covariances.
The filters used are triangular and they are equally spaced along the mel-scale which is defined by ) 700 1 ( log 2595 ) ( 10 f f Mel ? ? Equation 7
To implement this filterbank, the window of speech data is transformed using a Fourier transform and the magnitude is taken. The magnitude coefficients are then binned by correlating them with each triangular filter. Here binning means that each FFT magnitude coefficient is multiplied by the corresponding filter gain and the results accumulated. Thus, each bin holds a weighted sum representing the spectral magnitude in that filterbank channel.
Mel-Frequency Cepstral Coefficients (MFCCs) are calculated from the log filterbank amplitudes {mj} using the Discrete Cosine Transform
?
? ? ? ? N j j i j N i m N c 1 )) 5 . 0 ( cos( 2 ? Equation 8where N is the number of filterbank channels.
MFCCs are the parameterization of choice for many speech recognition applications. They give good discrimination and lend themselves to a number of manipulations. In particular, the effect of inserting a transmission channel on the input speech is to multiply the speech spectrum by the channel transfer function. In the log cepstral domain, this multiplication becomes a simple addition which can be removed by subtracting the cepstral mean from all input vectors. In practice, of course, the mean has to be estimated over a limited amount of speech data so the subtraction will not be perfect. Nevertheless, this simple technique is very effective in practice where it compensates for long-term spectral effects such as those caused by different microphones and audio channels.
2.1.3 Distributed Speech Recognition
In a distributed speech recognition (DSR) architecture the recognizer front-end is located in the terminal and is connected over a data network to a remote back-end recognition server. DSR provides particular benefits for applications for mobile devices such as improved recognition performance compared to using the voice channel and ubiquitous access from different networks with a guaranteed level of recognition performance. To enable all these benefits in a wide market containing a variety of players including terminal manufacturers, operators server providers and recognition vendors, a standard for front-end is needed to ensure compatibility between the terminal and the remote recognizer. The STQ-Aurora DSR Working Group within ETSI has been actively
Figure 8 Block diagram of DSR system
2.2 Text-to-Speech Synthesis
It is natural to divide the speech synthesis problem into two broad subproblems. The first of these is the conversion of text – an imperfect representation of language, as we have seen – into some form of linguistic representation, that includes information on the phonemes (sounds) to be produced, their duration, the locations of any pauses, and the F0 contour to be used. The second – the actual synthesis of speech – takes this information and converts it into a speech waveform [11]. Each of these main tasks naturally breaks down into further subtasks, some of which have been alluded to below. The first part, text and linguistic analysis, breaks down as follows:
?? Text preprocessing: including end-of sentence detection, “text normalization” (expansion of numerals and abbreviations), and limited grammatical analysis, such as grammatical part-of-speech assignment.
wireless data channel – 4.8kbit/s Terminal DSR Front-end
Parameterization Mel-Cepstrum
Compression Split VQ
Frame structure & error protection Server DSR Back-end Error detection & mitigation Decompression Recognition decoder
?? Accent assignment: the assignment of levels of prominence to various words in the sentence.
?? Word pronunciation: including the pronunciation of names and the disambiguation of homegraphs.
?? Intonational phrasing: the breaking of (usually long) stretches of text into one or more intonational units.
?? Segmental duration: the determination, on the basis of linguistic information computed thus far, of appropriate duration for phonemes in the input.
?? F0 contour computation.
Speech synthesis breaks down into three parts:
?? The selection of appropriate concatenative units given the phoneme string to be synthesized.
?? The concatenation of those units.
?? The synthesis of a speech waveform given the units, plus a model of the glottal source.
2.3 Voice eXtensible Markup Language
The VoiceXML Forum is an industry organization founded by AT&T, IBM, Lucent and Motorola. It was established to develop and promote the Voice eXtensible Markup Language (VoiceXML), a new computer language designed to make Internet content and information accessible via voice and phone [12].
Figure 9 VoiceXML architecture model A document server (e.g. a web server) processes requests from a client application, the VoiceXML Interpreter, through the VoiceXML interpreter context. The server produces VoiceXML documents in reply, which are processed by the VoiceXML Interpreter. The VoiceXML interpreter context may monitor user inputs in parallel with the VoiceXML interpreter. For example, one VoiceXML interpreter context may always listen for a special escape phrase that takes the user to a high-level personal assistant, and another may listen for escape phrases that alter user preferences like volume or text-to-speech characteristics.
The implementation platform is controlled by the VoiceXML interpreter context and by the VoiceXML interpreter. For instance, in an interactive voice response application, the VoiceXML interpreter context may be responsible for detecting an incoming call,
Document Request Document Server VoiceXML Interpreter Context VoiceXML Interpreter Implementation Platform
acquiring the initial VoiceXML document, and answering the call, while the VoiceXML interpreter conducts the dialog after answer. The implementation platform generates events in response to user actions (e.g. spoken or character input received, disconnect) and system events (e.g. timer expiration). Some of these events are acted upon by the VoiceXML interpreter itself, as specified by the VoiceXML document, while others are acted upon by the VoiceXML interpreter context.
In this project, we build a VoiceXML Interpreter which is based on a open source project ‘Open VXI VoiceXML Interpreter’ [13].
<?xml version=”1.0”?> <vxml version=”1.0”>
<form id=”tapered”> <block>
<prompt bargein=”false”>Welcome to the ice cream survey.</prompt> </block>
<field name=”flavor”>
<grammar>vanilla | chocolate | strawberry</grammar>
<prompt count=”1”>What is your favorite flavor?</prompt> <prompt count=”2”>Say chocolate, vanilla, or strawberry.</prompt> </field>
</form> </vxml>
C: Welcome to the ice cream survey. C: What is your favorite flavor? H: Pecan praline.
C: I do not understand. (the “flavor” field’s prompt counter is 1) C: What is your favorite flavor?
H: Pecan praline.
C: Say chocolate, vanilla, or strawberry. (the prompt counter is now 2) H: What if I hate those?
C: I do not understand.
Chapter 3 COMMON OBJECT REQUEST BROKER
ARCHITECTURE
Common Object Request Broker Architecture (COBRA) combines two important trends in the computer industry: object-oriented software development and client/server computing. But CORBA is more than just an object-oriented Remote Procedure Call mechanism. The Object Management Group’s (OMG) Object Management Architecture is a framework that defines different levels of abstraction. The core Object Request Brokers (ORB) provides an abstraction from the complexity of network programming. The CORBA services offer classic system-level functionality, in an object-oriented fashion. The CORBA facilities offer standardized approaches to solving domain-specific problems [14].
Figure 11 OMG’s Object Management Architecture Figure 12 shows a request passing from a client to an object implementation in the CORBA architecture. Two aspects of this architecture stand out:
Object Request Brokers Application Objects CORBAfacilities Vertical CORBAfacilites Horizontal CORBAfacilites CORBAservices
?? Both client and object implementation are isolated from the Object Request Broker (ORB) by an IDL interface. CORBA requires that every object’s interface by expressed in OMG IDL. Clients see only the object’s interface, never any implementation detail. This guarantees substitutability of the implementation behind the interface – our plug-and-play software environment. And, on the server side, it allows ORBs to be engineered for scalability.
?? The request does not pass directly from client to object implementation; instead, every request is passed to the client’s local ORB, which manages it. The standard form of the CORBA object invocation is location transparent; that is, its form is the same whether the target object is local (in the same process, or at least in the same memory space) or remote [15][16][17].
Figure 12 A request passing from client to object implementation
3.1 Load Balancing
In the CORBA world, load balancing refers mainly to the distribution of client requests over a group of servant objects that typically live in different server processes and run on
Client IDL Stub Object Implementation IDL Skeleton
Object Request Broker Request
uncharacteristically long time to complete will be more likely to benefit from a load-balanced solution.
3.1.1 Application Partitioning
Application partitioning splits an application into a number of independent service components that provide a specific subset of the overall applications functionality. The sum functionality of all the partitions will equal that of the entire application. Application functionality may be allocated to a partition either horizontally or vertically. A horizontal approach splits a system functionally. Each server only provides a subset of the system’s functionality. A vertical approach splits a system based on data. Each server provides the full functionality of the system, but only has access to a subset of the data.
3.1.2 Replication
Replication of a service means that there are multiple instances of the same server, each of which provides the same functionality and access to the same set of data objects. Since the replicas all provide the same service, a client request can be directed towards any of the servers currently available. As the name replica suggests, a single object might exist simultaneously in more than one server. Synchronization can be required to ensure that the contents of one server mirror those of all the others in the replicated group. In general, synchronization is expensive, so there is a tradeoff between the cost of synchronizing the contents of replicated servers and the performance gain expected from having multiple servers handling an application’s clients.
The server that a client makes its invocations against is chosen in accordance with a migration policy which determines how often this selection is made. For example, each new remote invocation from a client could be directed towards a different (but not necessarily newly launched) server. Alternatively, the client might address all remote invocations to the same server instance for the duration of their client session.
Replication allows an application to support a certain quality-of-service in terms of performance and throughput. As more clients start using the application, additional replicants can be added to share the load.
3.2 Fault Tolerance
In CORBA-based applications, which naturally include components distributed across a network, developers are more acutely aware of potential failure points. In a monolithic system, the whole application was either working or had failed, and corrective action could be taken at that level. In contrast, the topology and structure of a distributed application means that partial failures will be common. To be more precise, CORBA systems are made up of a number of elements, typically distributed across processes and hosts. Any one of these elements can fail at any time.
In formal terms, fault tolerance is a measure of an application’s resistance to failures caused by factors as diverse as:
?? Loss of processing resources – hardware and software or OS failures
?? Both transient and permanent failures in the application itself – faulty logic, incomplete exception handling, mismatched client and server versions, and deadlocks, to name a few.
One primary measure of fault tolerance is based upon the amount of time an application is available to clients. Availability is calculated using the mean-time-between-failures (MTBF) and the mean-time-to-repair (MTTR). Formally,
MTTR MTBF MTBF ty Availabili ? ? Equation 9
where the MTBF is the average time between failures occurring and the MTTR is defined as the total time taken to (a) identify the fault, (b) recover the servers, and (c) complete any synchronization required to restore consistent state. To improve overall availability, we strive to increase the MTBF and reduce the MTTR.
Commercially available ORBs provide their degree of fault tolerance through information stored in object references, and also provide some level of recoverability of the activation agent.
3.2.1 Location Transparency
Object references obtained by clients typically direct them not to the application server, but to the ORB activation agent on the server’s host. This has two advantages. First, the server does not have to be running all the time – the activation agent can launch the server process only when a client needs it. The client first contacts the activation agents, which launches the server and redirects the client to the application server. Second, in the case of server process termination, a client ORB runtime can fall back on the original
information in the object reference, and contact the activation agent (which can relaunch the server). This provides some level of fault tolerance, and is generally implemented so that it is transparent to the client application. That is, some level of fault tolerant behavior is achieved with no additional coding effort on behalf of the client.
However, this approach does have its limits. First, it is inherently limited to a single host. If that host were to fail, the client would not be able to recover. Also, it does not address the recoverability of any server state. Simply relaunching the server will not solve the problem if the client’s reference is to a transient object – the client application will still receive a CORBA system exception. Also, it may not be appropriate for the activation agent to launch servers, particularly if specialized load balancing solutions with concentrators or selectors are used.
3.2.2 Recoverability of the Activation Agent
Some commercial ORBs have built-in support for recoverability of the activation agent itself. As the activation agent generally represents a single point of failure and as there is usually only one per host, it is important for the agent to be able to recover from a crash or other serious failure. Some activation agents can be configured to persistently log their state, so that in the event of a failure, the state can be recovered. But what detects that the activation agent has crashed? Furthermore, what is responsible for restarting the agent? On some operating systems, it is possible to embed the agent’s launch command within a “fault tolerant” script – should the agent fail and the process exit for any reason,
write a monitor that was responsible for checking the health of the agent. In the event of a failure or slow access, the monitor can take corrective action.
Chapter 4 SOFTWARE DESIGN SPECIFICATION
4.1 System Architecture
The three-tier speech system model is formed by three components, namely terminal, speech browser and document server. To realize this model, we will make a demonstration.
Figure 13 Three-tier speech system model
4.1.1 Platform
The language, platform and tools we will use in this project are as follows. Our Java platform is Java SDK 1.3.1.
Components Language Platform Tools
Terminal Java Forte for Java CE 3.0 Java Sound API [18] VoiceXML interpreter C Visibroker ORB for C++
4.5.1
Open VXI Recognition engine Java Visibroker ORB for Java
4.5.1
HTK TTS synthesizer N/A Visibroker ORB for Java
4.5.1 IBM ViaVoice TTS SDK for Windows [19]
4.1.2 Terminal
Document Server Speech Browser http request http reply Terminal parameters for recognition parameters for synthesisfront-end distributed text-to-speech synthesizer. A Java applet, running on the top of JVM inside a web browser, makes use of Java Sound API and Java Network API.
Figure 14 System architecture (Terminal)
Figure 15 Examples of terminals In the recognition path, it performs voice capturing, feature extraction, compression and then data sending to speech browser. In the synthesis path, it performs data retrieval from speech browser, speech decoder and then voice playback.
Our configuration of speech input will be 16-bit linear quantization and 16kHz sampling rate while that of speech output will be 8-bit linear quantization and 8kHz sampling rate, which are adequate to maintain the recognition accuracy and speech intelligibility.
The feature vectors after feature extraction are 13-dimensional vectors consisting of mel-cepstral features and they are computed every 10ms, which is the typical frame rate in
G.723.1 Coefficients Speech Output Compressed MFCCs Speech Input Terminal Feature Extraction Compression Speech Decoder
PDA with wireless capability
Access Point
Mobile phone Gateway
Analog phone Bluetooth WiFi GSM 2.5G/3G PSTN PC with Telephone Interface Speech Browser MFCCs G.723.1
most speech recognition systems. Owing to the high correlation (in time) between subsequent feature vector, we take the difference between adjacent feature vectors, which is a 1-step (scalar) linear prediction such that it can be compressed very much before sending to speech browser. Further compression can be achieved by multi-stage vector quantizing these error vectors.
We cannot use the text and linguistic processing part in IBM ViaVoice TTS SDK for Windows as a separate module, which originally will be our back-end TTS. To solve it, we add a pair of 5.3kbps G.723.1 speech encoder and decoder into the whole system. In speech browser, ViaVoice TTS accompanying with the encoder, which is responsible for compressing the TTS output, will be our back-end TTS. In terminal, a corresponding decoder is responsible for recovering the TTS output.
Figure 16 A Java applet running as a terminal Figure 17 shows a block diagram of the DSR front-end
Figure 17 Block diagram of the DSR front-end 1. Preemphasis
The digitized speech signal, s(n), was put through a preemphasizer (typically a first-order FIR filter), to spectrally flatten the signal and to make it less susceptible to finite precision effects later in the signal processing. In this project, we chose 0.97 as the preemphasis coefficient.
0 . 1 9 . 0 , 1 ) (z ? ?az?1 ? a? H Equation 10 2. Frame Blocking
In this step the preemphasized speech signal, s’(n), was blocked into frames of N samples, with adjacent frames being separated by M samples. We chose N = 320 and M = 160.
3. Windowing
The next step in the preprocessing was to window each individual frame so as to minimize the signal discontinuities at the beginning and end of each frame. Hamming window is used, which has the form
1 0 ), 1 2 cos( 46 . 0 54 . 0 ) ( ? ? ? ? ? ? n N N n n w ? Equation 11 Windowing Frame Blocking Preemphasis s’(n) xt(n) N M s(n) xt’(n) FFT Mel-scale Filterbank DCT mj Xt(n) ci
4. Finally, 39 coefficients were generated for each frame. This number, 39, was computed from the length of parameterised static vectors (MFCC_0 = 13) plus the delta coefficients (+13) plus the acceleration coefficients (+13) after the standard FFT, Mel-scale filterbank and DCT, described in 2.1.2 above.
4.1.3 Speech Browser
Speech browser is a cluster of powerful servers acting as a back-end that performs the core recognition, synthesis and translation processes as well as the interpretation of VoiceXML files retrieved from the document server.
The implementation platform was based on the Sourceforge Project: Open VXI VoiceXML Interpreter V4.0, HMM recognizer and IBM ViaVoice TTS synthesizer. The speech browser was implemented in the client/server architecture. Different objects had different requirement on the platform at which they are located. Speech recognition is a computation-intensive process while TTS synthesis is memory-intensive. Borland’s Visibroker ORB serve as a middleware in our case [20][21].
XML Interface
Recognition Prompt Telephony Translation Open VXI R eco gn it ion En gine TT S Sy nt hes izer A udio F ile Call Control S emantics and Tr an sla ti o n Server Objects Resource Bus VoiceXML interpreter ORB
The first step in building a dictionary was to create a sorted list of the required words. In this project, the British English BEEP pronouncing dictionary was used. Its phone set was adopted without modification except that the stress marks was removed and a short-pause (sp) was added to the end of every pronunciation. For convenience, the training and test data was taken from the TIMIT acoustic-phonetic database. They were then parameterize into sequences of feature vectors. The HMM models for acoustic models and lexical models was then created and trained well using the above prepared data.
?? Recognition Engine
The recognition process can be described as a searching and matching process from the acoustic and lexical models. Load balancing and parallelism can be achieved by applying vertical application partitioning to those models. Each recognition server perform the same recognition process but on different sets of phoneme in parallel.
Figure 19 Load balancing strategies for recognition engine The client can connect to the correct server using a well-known name (often resolved through the CORBA naming service) as a key. Once the partition key has been resolved
Vertically partitioned application data Vertically partitioned application servers Clients Client Recognition Engine Object Recognition Engine Object Models for phoneme set 1 Models for phoneme set 2 Client Client
to the object in the server, all further remote invocations are made directly to that server, neither the CORBA activation agent nor the naming service are directly involved in further requests. Since entries in the naming service are usually sufficient as a means of locating services, the amount of context-switching between various servers is reduced. To provide the fault tolerance for recognition engine objects, instances of those objects on multiple hosts are started. If an implementation becomes unavailable, the ORB will detect the loss of the connection between the client program and the object implementation and will automatically contact the Smart Agent to establish a connection with another instance of the object implementation, depending on the effective rebind policy established by the client.
4.1.4 HMM training by HTK
Here were the steps to train the HMM models by HTK.
1. In this project, a voice dialling application was demonstrated and a suitable task grammar was
$digit = ONE | TWO | THREE | FOUR | FIVE | SIX | SEVEN | EIGHT | NINE | OH | ZERO; $name = [ THOMAS ] CHENG |
[ WENDY ] HOW;
( SENT-START ( DIAL <$digit> | ( PHONE | CALL ) $name) SENT-END )
2. Build the dictionary containing a sorted list of the required words
CALL k ao l sp CHENG ch ae ng sp DIAL d ay ax l sp EIGHT ey t sp
ONE w ah n sp PHONE f ow n sp SENT-END sil sp SENT-START sil sp SEVEN s eh v n sp SIX s ih k s sp THOMAS t oh m ax s sp THREE th r iy sp TWO t uw sp WENDY w eh n d iy sp ZERO z ia r ow sp
3. Record and label the 50 training data
4. Create monophone HMMs ~o <VecSize> 39 <MFCC_0_D_A> ~h "proto" <BeginHMM> <NumStates> 5 <State> 2 <Mean> 39
0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 <Variance> 39 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 <State> 3 <Mean> 39 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 <Variance> 39 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 <State> 4 <Mean> 39 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 <Variance> 39 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 <TransP> 5 0.0 1.0 0.0 0.0 0.0 0.0 0.6 0.4 0.0 0.0 0.0 0.0 0.6 0.4 0.0 0.0 0.0 0.0 0.7 0.3 0.0 0.0 0.0 0.0 0.0 <EndHMM>
5. A series of training was then carried out including creating flat start monophones, fixing the silence models, realigning the training data, making tied-state triphones from monophones. The recognizer was complete and its performance can be evaluated.
The line starting with SENT: indicates that of the 50 test utterances, 42 (84%) were correctly recognized. The following line starting with WORD: gives the word level statistics and indicates that of the 281 words in total, 281 (100%) were recognized correctly. There was no deletion error (D), no substitution error (S) and 16 insertion error (I). The accuracy figure (Acc) of 94.31% is lower than the percentage correct (Cor) because it takes amount of the insertion errors which the latter ignores.
Rec : recout.mlf
--- Overall Results --- SENT: %Correct=84.00 [H=42, S=8, N=50]
WORD: %Corr=100.00, Acc=94.31 [H=281, D=0, S=0, I=16, N=281]
=================================================================== Read 35 physical / 58 logical HMMs
Read lattice with 25 nodes / 50 arcs Created network with 105 nodes / 130 links READY[1]>
Please speak sentence - measuring levels Level measurement completed
SENT-START CALL CHENG SENT-END == [147 frames] -72.7333 [Ac=-10691.8 LM=0.0] ( Act=92.0)
READY[2]>
SENT-START DIAL ONE TWO THREE SENT-END == [216 frames] -54.1143 [Ac=-11688.7 L M=0.0] (Act=95.5)
READY[3]>
SENT-START PHONE WENDY HOW SENT-END == [182 frames] -51.6337 [Ac=-9397.3 LM=0. 0] (Act=94.2)
READY[4]>
SENT-START DIAL SIX FIVE OH EIGHT SENT-END == [286 frames] -56.8492 [Ac=-16258 .9 LM=0.0] (Act=97.4)
READY[5]>
SENT-START PHONE THOMAS CHENG SENT-END == [195 frames] -58.5229 [Ac=-11412.0 L M=0.0] (Act=94.7)
APPENDIX : PROJECT SCHEDULE
Period Coding Thesis
Oct, 2001 Front-end DSR – feature extraction & compression
Introduction Nov, 2001 Back-end DSR – decompression,
training & recognition
Speech recognition technology Dec, 2001 Front-end & back-end DTTS Text-to-speech synthesis technology
Jan, 2001 Protocol between terminal & speech browser and integration of back-end
DSR & DTTS to VoiceXML interpreter
VoiceXML
Feb, 2001 Move the system to CORBA-based and distributed
Design & implementation of DVXMLSS
Mar, 2001 Further training on DSR & DTTS Result & performance of DVXMLSS Apr, 2001 Extend the system to load-balancing
& fault-tolerant
Load-balancing & fault-tolerant of DVXMLSS
May, 2001 Final testing and further enhancement
Future research
Key:
DSR - Distributed Speech Recognition DTTS - Distributed Text-to-speech Synthesis
REFERENCES
1 Lawrence R. Rabiner, Ronald W. Schafer, Digital processing of speech signals, c1978
2 John R. Deller, Jr., John G. Proakis, John H.L. Hansen, Discrete-time processing of speech signals
3 Lawrence Rabiner, Biing-Hwang Juang, Fundamentals of speech recognition, Prentice Hall, c1997
4 Ganesh N. Ramaswamy, Ponani S. Gopalakrishnan, Compression of acoustic features for speech recognition in network environments, IBM Thomas J. Watson Research Center
5 Bell Labs Voice and Audio Technologies For Portable MP3 Players,
http://www.lucentssg.com/speech/MP3.pdf
6 Caludio Becchetti and Lucio Prina Ricotti, Speech recognition – theory and C++ implementation, John Wiley & Sons, c1999
7 L. Rabiner, Fundamental of speech recognition, Prentice Hall, c1993
8 Steve Young, Dan Kershaw, Julian Odell, Dave Ollason, Valtcho Valtchev, Phil Woodland, The HTK book (for HTK Version 3.0), Microsoft Corporation, c2000 9 X.D. Huang, Y. Ariki and M.A. Jack, Hidden Markov models for speech
recognition, Edinburgh University Press, c1990
10 Aurora Project: Distributed speech recognition,
http://www.etsi.org/technicalactiv/dsr.htm
12 Voice eXtensible Markup Language Version 1.00, Voice Forum, c2000
13 Sourceforge Project: Open VXI VoiceXML Interpreter V4.0,
http://sourceforge.net/projects/openvxi
14 Dirk Slama, Jason Garbis and Perry Russell, Enterprise CORBA, Prentice Hall, c1999
15 Jon Siegel, CORBA 3: fundamentals and programming, Second Edition, John Wiley & Sons, Inc
16 G. Coulouris, J. Dollimore and T. Kindberg, Distributed Systems: Concepts and Design, 2nd edition, Addison-Wesley, c1994
17 Wolfgang Emmerich, Engineering distributed objects, John Wiley & Sons, Ltd., c2000
18 Java? sound API programmer’s guide, Sun Microsystems, Inc., c2000
19 IBM ViaVoice TTS SDK for Windows, IBM Developer’s Corner, http://www-4.ibm.com/software/speech/dev/ttssdk_windows.html
20 Visibroker 4.5.1 for Java programmer’s guide, Borland Corporation
21 Michael McCaffery and Bill Scott, The offical Visibroker for Java handbook – The authoritative solution, Sams Publishing, c1999