

1.1 Brief History

Throughout history, a great deal of research and many projects and innovations have aimed to make human life easier and more comfortable. History has seen the invention of the wheel, the steam engine, radio and communication systems, the aeroplane, and other great inventions. One of these innovations is speech recognition, which allows people to interact with technology in the same way they interact with each other. Speech recognition has long featured in science fiction and modern cinema; however, this dream is slowly becoming a reality. Work on speech recognition systems has been going on since the 1950s. Speech recognition systems were developed in laboratories and universities worldwide, and speech recognition hardware was developed in Japan and the USA. These systems have reached important milestones and achieved worldwide success. For example, in the 1960s and 1970s, speech recognition systems were developed at AT&T, Carnegie Mellon, and IBM. In the 1980s, a new statistical method of modeling speech appeared, based on Hidden Markov Models (HMMs). This method became standard in almost every speech recognition system.

1.2 Project Overview

Speech Recognition, or Automatic Speech Recognition, is the process by which a machine can convert speech in the form of audible sounds into meaningful output. Speech, on the other hand, is information in the form of audible sounds produced by a speaker to be transmitted to a listener.

This project examines the well-known problem of statistical speech recognition using Hidden Markov Models. The aim is to design and implement a speech recognition system that is fast and efficient enough for real-time applications. The system has to be efficient and extensible so that it can be used in future studies and research on continuous speech recognition for the Arabic language. The two techniques used in this project are Vector Quantization and Hidden Markov Models. The LPC model is used to represent the speech signal for recognition. The parameters of the LPC model and of the HMM must be chosen carefully so that the system recognizes Arabic speech with relatively good accuracy, given that the Arabic language is morphologically rich and has a high vocabulary growth rate.

The system must be able to operate in two modes: recognition mode and training mode. It must contain a database so that the user can easily add words to the vocabulary. Thus, the aim of this project is to develop a speaker-independent, isolated-word, limited-vocabulary speech recognition system that is fast, extensible, and flexible.

1.3 Problem Definition

The problem of speech recognition is that the number of input signals, signal variations, word combinations, noise sources, and other perturbations is extremely large. This requires a complex process that can reliably determine the word or sentence that corresponds to the input signal. Many problems arise in this project, some of which are listed below:

1. Arabic is a very large language, and its features require deep study.

2. Arabic is a very difficult language because it is morphologically rich, which causes a high vocabulary growth rate. This high growth rate is problematic for language models because it produces a large number of out-of-vocabulary words.

3. Some Arabic phonemes share the same articulation region, which causes problems in the recognition process.

1.4 Objectives:

Our objective is to design and implement a speaker-independent speech recognition system that can accurately recognize Arabic speech in the form of isolated word utterances, using HMMs as the recognizer. We also aim to analyze the system so as to evaluate its performance under various conditions and with different speakers, i.e., to compute its word accuracy and speed under various operating conditions. This in turn leads us to study the basics of the Arabic language and the problems it poses.

Project block diagram:

The following figure 1.1 shows a preliminary design of a speech recognition system and its various components:

Figure 1.1: Block Diagram of Speech Recognition System

1.5 Tools and Methodologies:


1.6 Thesis Layout

Chapter Two (Theoretical Background)

This chapter discusses the methods that can be used at each stage of implementing this project.

Chapter Three (Signal Processing)

This chapter goes through the theory and analysis of the methods that were actually used for signal processing and classification in the project.

Chapter Four (Theory of Hidden Markov Model)

This chapter discusses a basic set of statistical modeling techniques for recognizing speech, analyzes the theory, and shows how some of its concepts can be used to improve performance.

Chapter Five (Design and Implementation)

This chapter summarizes the project design with a flowchart of each algorithm discussed in Chapters 3 and 4, and gives the sequence of steps used to implement the project.

Chapter Six (Results)

This chapter presents the experiments carried out on the finished system to evaluate its performance, together with their results and explanations of those results.

Chapter Seven (Conclusion & Future Work)


2.1 Introduction

The more widely used paradigm in ASR systems has been the phonetic content of the speech signal, which varies from language to language. There are no more than about 30 different phonemes if variations such as accentuation, duration, and concatenation are ignored. Concatenation includes co-articulation units such as demisyllables and triphones. When all variations are considered, the number of phonetic units increases considerably [1].

2.1.1 Arabic Language

Arabic is a Semitic language and one of the oldest languages in the world. Currently it is the second language in terms of number of speakers. Arabic is the first language in the Arab world, i.e., Saudi Arabia, Jordan, Oman, Yemen, Egypt, Syria, Lebanon, etc. The Arabic alphabet is also used in several other languages, such as Persian and Urdu. Standard Arabic has basically 34 phonemes, of which six are vowels and 28 are consonants. A phoneme is the smallest unit of speech that indicates a difference in meaning, word, or sentence. Arabic has fewer vowels than English: it has three long and three short vowels, while American English has twelve vowels [2]. More detail on this subject, including a classification of the Arabic letters according to how they are pronounced, is given in the preceding chapter.

2.1.2 Speech Recognition

Automatic Speech Recognition (ASR) is a technology that allows a computer to identify the words that a person speaks into a microphone or telephone. It has a wide range of applications: command recognition (a voice user interface to the computer), dictation, interactive voice response, and foreign language learning. ASR can also help handicapped people to interact with society. It is a very promising technology that makes life easier. Recognition is different from understanding: recognition means identifying the words that make up the input utterance, but not necessarily their meaning [3]. Some of the difficulties related to speech recognition are [4]:

2- Natural speech is variable over: global rate, local rate, pronunciation within speaker, pronunciation across speakers, and phonemes in different contexts.

3- Recorded speech is variable over room acoustics, channel characteristics, and background noise.

4- Natural speech is continuous.

The disciplines that have been applied to one or more speech recognition problems are the following:

► Signal processing: the process of extracting relevant information from the speech signal in an efficient, robust manner. Included in signal processing is the form of spectral analysis used to characterize the time-varying properties of the speech signal.

► Physics (acoustics): the science of understanding the relationship between the physical speech signal and the physiological mechanisms that produce it.

► Pattern recognition: the set of algorithms used to cluster data to create one or more prototypical patterns of a data ensemble, and to match (compare) a pair of patterns on the basis of feature measurements of the patterns.

► Communication and information theory: the procedures for estimating the parameters of statistical models.

► Linguistics: the relationships between sounds, words in a language (syntax), the meaning of spoken words (semantics), and the sense derived from meaning (pragmatics).

► Physiology: understanding of the higher-order mechanisms within the human central nervous system that account for speech production and perception in human beings.

► Computer science: the study of efficient algorithms for implementing ASR methods in software or hardware.

► Psychology: the science of understanding the factors that enable a technology to be used by human beings in practical tasks [5].

A good knowledge of the above disciplines, applied well, helps to overcome the problems faced in designing and implementing ASR systems. Successful speech recognition systems require knowledge and expertise from a wide range of disciplines, a range far larger than any single person can possess. It is therefore important for a researcher to have a good understanding of the fundamentals of speech recognition, so that a range of techniques can be applied to a variety of problems.

1- Great difficulties occur when several speakers with different dialects are to be recognized.

2- A homophone is a word that is pronounced the same as another word but differs in meaning. For example, the word لَّك, which means exhausted, and the word لَّك, which means no or never; or the word لَّ جَ, which means to drag, and the word ى لَّ جَ, which means to make something stream.

3- Arabic language is morphologically rich which causes a high vocabulary growth rate. This high growth rate is problematic for language models by causing a large number of out-of-vocabulary words [4].

These problems can be minimized by restricting the number of speakers and words and by working in good acoustic conditions, as well as by avoiding the complexities of fluent speech and working with Modern Standard Arabic to overcome differences between dialects. Different approaches (recognizers) can be used in speech recognition, such as:

1. Dynamic Time Warping (DTW)

2. Stochastic models (Hidden Markov Models)

3. Connectionist models (based on modeling of neural networks).

2.2 Signal Processing

The following subsections describe the processing that must be applied to the input signal to make it suitable for the recognition process.

2.2.1 Introduction

By means of modern digital signal processing, we can interact, not only with others, but also with machines. The importance of speech/audio signal processing lies in preserving and improving the quality of speech/audio signals. These signals are treated in a digital representation where various advanced digital-signal-processing schemes can be carried out adaptively to enhance the quality [1].

2.2.2 Sequence of Processing In the Speech Signal

... in time-frequency analysis of an audio signal concerns signal segmentation. Segmentation is needed because of the non-stationary behavior of audio signals. Roughly speaking, this means that the frequency information in an audio signal changes over time. Therefore, the frame length chosen in the segmentation stage should reflect an average duration over which the frequency information can be considered unchanged. Typically, for audio and speech signals, this duration is around 10 to 30 ms. The contents of each frame are summarized by a vector of parameters (or observation features). Sophisticated spectral analysis techniques are required in automatic speech recognition systems to obtain accurate and reliable estimates of the recognition parameters [1]. Three basic analysis techniques are used in speech recognition systems: digital filter banks, the Discrete Fourier Transform, and Linear Predictive Coding (LPC). Before speech features are extracted from the raw data, the signal passes through a speech end-point detector, which is described in the following subsection.

2.2.2.1 End-Point Detection

End-point detection specifies the start and end points of each recorded word and isolates the word from the background noise. There are three sources of background noise:

1. The environmental conditions in which the speech is produced

2. The distortion introduced by the transmission system over which the speech is sent, and the delay that originates from the speaker

3. Noise due to the pronunciation process [5]

2.2.2.2 Feature Extraction Methods

1. Digital Filter Banks

A digital filter bank is an array of band-pass filters that separates the input signal into several ...

... these differences must be used. On the other hand, less important frequencies do not have to be exact [5].

2. Discrete Fourier Transform (DFT)

The DFT is one specific form of Fourier analysis. It transforms one function into another, called the frequency-domain representation, or simply the DFT, of the original function (which is often a function in the time domain). The DFT requires an input function that is discrete and whose non-zero values have a limited (finite) duration. Such inputs are often created by sampling a continuous function, such as a person's voice.
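As a minimal sketch of the DFT of a sampled signal, the following Python code evaluates the standard definition X[k] = sum over n of x[n] exp(-2j pi k n / N) directly. The function name `dft` and the 8-point cosine input are illustrative assumptions, not part of the project; a practical system would normally use a fast Fourier transform instead of this direct sum.

```python
import cmath
import math


def dft(samples):
    """Direct DFT: X[k] = sum_n x[n] * exp(-2j*pi*k*n/N) for a finite sequence."""
    n_samples = len(samples)
    return [
        sum(x * cmath.exp(-2j * math.pi * k * n / n_samples)
            for n, x in enumerate(samples))
        for k in range(n_samples)
    ]


# Hypothetical input: one cycle of a cosine sampled at 8 points.
signal = [math.cos(2 * math.pi * n / 8) for n in range(8)]
spectrum = dft(signal)
magnitudes = [abs(value) for value in spectrum]
```

For this input the energy concentrates in bins 1 and 7 (the positive and negative frequency of the cosine), each with magnitude N/2 = 4, and the other bins are numerically zero.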

3. Linear Predictive Coding (LPC)

Linear predictive coding (LPC) is a tool used mostly in audio and speech signal processing to represent the spectral envelope of a digital speech signal in compressed form, using the information of a linear predictive model. It is one of the most powerful speech analysis techniques and one of the most useful methods for encoding good-quality speech at a low bit rate, and it provides extremely accurate estimates of speech parameters [1].

Physical LPC Model:

... (loudness) and frequency (pitch). The vocal tract (the throat and mouth) forms the tube, which is characterized by its resonances; these give rise to formants, or enhanced frequency bands in the sound produced. Hisses and pops are generated by the action of the tongue, lips, and throat during sibilants and plosives [6]. Figure 2.2 shows this mechanism.

When you speak:

Air is pushed from your lungs through your vocal tract, and speech comes out of your mouth. For certain voiced sounds, your vocal cords vibrate (open and close). The rate at which the vocal cords vibrate determines the pitch of your voice. Women and young children tend to have a high pitch (fast vibration), while adult males tend to have a low pitch (slow vibration). For certain fricative and plosive (or unvoiced) sounds, your vocal cords do not vibrate but remain constantly open. The shape of your vocal tract determines the sound that you make. As you speak, your vocal tract changes its shape, producing different sounds. The shape of the vocal tract changes relatively slowly. The amount of air coming from your lungs determines the loudness of your voice [6].

Mathematical LPC model:

The idea behind linear predictive analysis is that a speech sample can be approximated as a linear combination of past samples. By minimizing the sum of the squared differences (over a finite interval) between the actual speech samples and the linearly predicted ones, a unique set of predictor coefficients can be determined. The model shown in Figure 2.4 is often called the LPC model. It says that the digital speech signal is the output of a digital filter (called the LPC filter) whose input is either a train of impulses (voiced sounds) or a white noise sequence (consonants or unvoiced sounds). Figure 2.3 shows the shapes of the voiced and unvoiced signals.

Figure 2.3: (a) Unvoiced Signal, (b) Voiced Signal

Figure 2.4: Mathematical Model of LPC

The relationship between the physical and mathematical models is shown in Table 2.1.

Table 2.1: Relationships Between Physical and Mathematical Model

Physical model                  Mathematical model
Vocal tract                     H(z) (LPC filter)
Air                             U(n) (innovations)
Vocal cord vibration            Gv (voiced)
Vocal cord vibration period     T (pitch period)
Fricatives and plosives         Gn (unvoiced)
Air volume                      Gn (gain)

2.2.2.3 Distance (distortion) Measure

After the features of the signal have been extracted, the next step is to compare them with the reference signals to find the best-matching reference. This can be achieved by various methods, some of which are:

1. Log spectral distance

2. Cepstral distance

3. Weighted cepstral distance and liftering

4. Likelihood distortion

5. Spectral distortion using a warped frequency scale

2.3 Recognizer

The recognition process is the heart of ASR, and it can be carried out by the different methods (approaches) mentioned above. The following subsections give a brief description of DTW and a somewhat more detailed description of HMMs.

2.3.1 Dynamic Time Warping (DTW)

Dynamic time warping is an approach that was historically used for speech recognition but has now largely been displaced by the more successful HMM-based approach. Dynamic time warping is an algorithm for measuring similarity between two sequences which may vary in time or speed [7].
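To illustrate how DTW measures similarity between sequences that differ in speed, here is a minimal sketch of the classic dynamic-programming formulation. It assumes scalar sequences and an absolute-difference local cost; a recognizer would compare feature vectors with a spectral or cepstral distance instead, and the function name `dtw_distance` is illustrative only.

```python
def dtw_distance(seq_a, seq_b):
    """Accumulated cost of the best monotonic alignment between two sequences."""
    inf = float("inf")
    rows, cols = len(seq_a), len(seq_b)
    # cost[i][j] = best accumulated cost aligning seq_a[:i] with seq_b[:j]
    cost = [[inf] * (cols + 1) for _ in range(rows + 1)]
    cost[0][0] = 0.0
    for i in range(1, rows + 1):
        for j in range(1, cols + 1):
            local = abs(seq_a[i - 1] - seq_b[j - 1])
            cost[i][j] = local + min(
                cost[i - 1][j],      # seq_a advances alone (stretch seq_b)
                cost[i][j - 1],      # seq_b advances alone (stretch seq_a)
                cost[i - 1][j - 1],  # both advance together
            )
    return cost[rows][cols]


# Hypothetical example: the same contour spoken slowly and quickly.
slow = [0, 0, 1, 1, 2, 2, 1, 1, 0, 0]
fast = [0, 1, 2, 1, 0]
same_shape = dtw_distance(slow, fast)
different_peak = dtw_distance(fast, [0, 1, 3, 1, 0])
```

The slow and fast versions align with zero cost because warping absorbs the difference in speed, whereas changing the peak value from 2 to 3 leaves a residual cost of 1.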

2.3.2 Hidden Markov Models (HMMs)

Modern general-purpose speech recognition systems are generally based on Hidden Markov Models. These are statistical models which output a sequence of symbols or quantities. One possible reason why HMMs are used in speech recognition is that a speech signal can be viewed as a piecewise stationary or short-time stationary signal. That is, over a short time, in the range of 10 milliseconds, speech can be approximated as a stationary process. Speech can thus be thought of as a Markov model for many stochastic processes [1].

2.3.3 Artificial Neural Networks

3.1 Introduction

This chapter analyzes the theory behind extracting information from the speech signal as used in the project implementation, that is, creating feature vectors from the speech signal. A wide range of possibilities exists for representing the speech signal and its content parametrically, as mentioned in Chapter 2. The main steps for extracting information from the speech signal are illustrated in Figure 3.1:

Figure 3.1: Steps For Feature Extraction Process

3.2 Preprocessing

This is the first step in creating the feature vectors. The objective of preprocessing is to modify the speech signal x(n) so that it is "more suitable" for feature extraction analysis. The preprocessing operations are shown in Figure 3.2.

Figure 3.2: Preprocessing Operations

3.2.1 Speech End Point Detection

An important problem in speech processing is to detect the presence of speech in a background of noise. This problem is often referred to as the endpoint location problem. Accurate detection of a word's start and end points means that subsequent processing of the data can be kept to a minimum. In many cases the accuracy of alignment depends on the accuracy of the end point detection [8]. The method depends on the zero-crossing rate and the energy of the signal. Figure 3.3 shows the signal before and after end point detection. The equations that describe the computation are given in Appendix A. Three thresholds are computed from the signal:

1. ITU - Upper energy threshold.

2. ITL - Lower energy threshold.

3. IZCT - Zero-crossing rate threshold.

(Block labels legible in Figure 3.1: Preprocessing; Frame blocking and windowing; Feature extraction; Post-processing. Block labels legible in Figure 3.2: Analog-to-digital conversion; Speech end point ...)
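The two measurements that end point detection relies on, short-time energy and zero-crossing rate, can be sketched as follows. This is not the Appendix A procedure: the way ITU, ITL, and IZCT are computed is not reproduced here, so the sketch uses a single hypothetical `energy_threshold` supplied by the caller, non-overlapping frames, and illustrative function names.

```python
def short_time_energy(frame):
    """Sum of squared samples in one frame."""
    return sum(sample * sample for sample in frame)


def zero_crossing_rate(frame):
    """Fraction of adjacent sample pairs whose signs differ."""
    crossings = sum(
        1 for prev, cur in zip(frame, frame[1:]) if (prev >= 0) != (cur >= 0)
    )
    return crossings / (len(frame) - 1)


def find_endpoints(signal, frame_len, energy_threshold):
    """Return (first, last) indices of frames above the threshold, or None."""
    energies = [
        short_time_energy(signal[start:start + frame_len])
        for start in range(0, len(signal) - frame_len + 1, frame_len)
    ]
    active = [i for i, energy in enumerate(energies) if energy > energy_threshold]
    return (active[0], active[-1]) if active else None


# Hypothetical test signal: silence, a burst of alternating samples, silence.
signal = [0.0] * 20 + [1.0, -1.0] * 10 + [0.0] * 20
endpoints = find_endpoints(signal, frame_len=10, energy_threshold=0.5)
burst_zcr = zero_crossing_rate(signal[20:30])
```

In a fuller detector the zero-crossing rate would be used to extend the energy-based end points outward over weak unvoiced sounds; that refinement is omitted here.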

Figure 3.3: (a) 'sara' Utterance Before Speech End Point Detection, (b) The Same Utterance After Speech End Point Detection.

3.2.2 Pre-emphasis

The pre-emphasizer is used to flatten the digitized and detected signal and to make it less susceptible to finite-precision effects. It is implemented as a low-order digital system (typically a first-order FIR filter); Figure 3.4 shows the original and the flattened signal. The most widely used pre-emphasis network is the fixed first-order system

$$H(z) = 1 - a z^{-1} \tag{3.1}$$

In this case, the output of the pre-emphasis network, x'(n), is related to its input, x(n), by the difference equation

$$x'(n) = x(n) - a\, x(n-1) \tag{3.2}$$
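The first-order difference equation x'(n) = x(n) - a x(n-1) translates directly into code. In this minimal sketch the default value a = 0.95 and the convention x(-1) = 0 are assumptions for illustration; the text does not state the value of a used in the project.

```python
def pre_emphasize(signal, a=0.95):
    """Apply x'(n) = x(n) - a * x(n-1), taking x(-1) = 0 (assumed)."""
    return [
        x - a * (signal[n - 1] if n > 0 else 0.0)
        for n, x in enumerate(signal)
    ]


# A constant (low-frequency) input is strongly attenuated after the first sample.
flat = pre_emphasize([1.0, 1.0, 1.0], a=0.5)
```

With a = 0.5 the constant input [1, 1, 1] becomes [1, 0.5, 0.5], which shows the filter suppressing slowly varying content relative to rapid changes.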

Figure 3.4: (a) Speech Signal Before Pre-emphasizer, (b) Speech Signal After Pre-emphasizer

3.3 Frame Blocking

Overlapping frames are important to ensure that all signal information is included in further processing. Here N is the frame length and M is the shift between the starts of adjacent frames, both in samples. If, on the other hand, M > N, there is no overlap between adjacent frames; in fact, some of the speech signal is lost entirely, which causes errors in the recognition process.
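The effect of the frame shift can be seen in a short sketch. It assumes N is the frame length and M the shift between the starts of adjacent frames (both in samples), and it drops any incomplete final frame; the function name `frame_blocks` is illustrative.

```python
def frame_blocks(signal, frame_len, frame_shift):
    """Split a signal into frames of frame_len samples, one every frame_shift samples."""
    return [
        signal[start:start + frame_len]
        for start in range(0, len(signal) - frame_len + 1, frame_shift)
    ]


samples = list(range(10))

# M < N: adjacent frames overlap, so every sample is covered.
overlapping = frame_blocks(samples, frame_len=4, frame_shift=2)

# M > N: samples 2, 3, 6 and 7 fall between frames and are lost.
gapped = frame_blocks(samples, frame_len=2, frame_shift=4)
covered = {sample for frame in gapped for sample in frame}
```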

Figure 3.5: The Original Waveform and Its Overlapped Frames


3.4 Windowing

In general, the windows most used in spectral analysis, such as the Hanning, Hamming, and Kaiser windows (Harris, 1978), have a low-pass characteristic in the frequency domain and a Gaussian-like shape in time. The main purpose of using a tapering window is to avoid abrupt discontinuities at the frame boundaries during signal segmentation. The Hamming window is highly recommended because it minimizes the LPC error: it de-emphasizes the frame boundaries and emphasizes the middle of the frame. Figure 3.6(a) shows the Hamming window function and its frequency response, and equation 3.3 gives the Hamming window weighting function [1]:

$$w(n) = 0.54 - 0.46 \cos\!\left(\frac{2\pi n}{N}\right), \qquad 0 \le n \le N-1 \tag{3.3}$$
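Equation 3.3 can be evaluated directly, as in the sketch below. It assumes the cosine argument is 2*pi*n/N with N equal to the frame length, following the equation as written; many references divide by N - 1 instead, which makes the window exactly symmetric. The function names are illustrative.

```python
import math


def hamming_window(length):
    """w(n) = 0.54 - 0.46 * cos(2*pi*n / N) for 0 <= n <= N-1 (equation 3.3)."""
    return [
        0.54 - 0.46 * math.cos(2 * math.pi * n / length)
        for n in range(length)
    ]


def apply_window(frame):
    """Multiply each sample of a frame by the matching window weight."""
    window = hamming_window(len(frame))
    return [sample * weight for sample, weight in zip(frame, window)]


weights = hamming_window(8)
windowed = apply_window([1.0] * 8)
```

For N = 8 the weight is 0.08 at the first sample and 1.0 at the middle of the frame, which is the boundary de-emphasis described in the text.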

3.5 Feature Extraction

The next step is an important one: extracting the relevant information from the speech frames. Several methods can be applied to this task, as mentioned in Chapter 2, but LPC is a good technique for the following reasons:

1. It provides a good model of the speech signal.

2. The way in which LPC is applied to the analysis of speech signals leads to reasonable source-vocal tract separation.

3. LPC is an analytically tractable model.

4. The LPC model works well in recognition applications. Experience has shown that the performance of speech recognizers based on an LPC front end is comparable to or better than that of recognizers based on the filter-bank front end mentioned in Chapter 2.

The mathematical and physical models of LPC were introduced in Chapter 2; the following sections present the LPC analysis.

Figure 3.6: (a) Hamming Function and Its Frequency Response, (b) The Speech Signal of the Word 'sara' Before Windowing, (c) The Same Word Signal After Windowing

LPC Analysis Equations

Figure 3.7 shows the LPC model (block labels legible in the figure: U(n), G, A(z), S(n)).

The exact relation between s(n) and u(n) is

$$s(n) = \sum_{k=1}^{p} a_k\, s(n-k) + G\, u(n) \tag{3.4}$$

We consider the linear combination of past speech samples as the estimate s'(n), defined as

$$s'(n) = \sum_{k=1}^{p} a_k\, s(n-k) \tag{3.5}$$

We now form the prediction error, e(n), defined as

$$e(n) = s(n) - s'(n) = s(n) - \sum_{k=1}^{p} a_k\, s(n-k) \tag{3.6}$$

with error transfer function

$$A(z) = \frac{E(z)}{S(z)} = 1 - \sum_{k=1}^{p} a_k\, z^{-k} \tag{3.7}$$
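Equations 3.4 and 3.6 can be checked numerically with a short sketch: a signal is synthesized from known coefficients and an impulse-train excitation, and the prediction error computed with the same coefficients is compared against the scaled excitation G u(n). The coefficient values, the gain, and the convention s(n) = 0 for n < 0 are illustrative assumptions.

```python
def synthesize(excitation, coeffs, gain):
    """Equation 3.4: s(n) = sum_{k=1..p} a_k * s(n-k) + G * u(n), with s(n) = 0 for n < 0."""
    speech = []
    for n, u in enumerate(excitation):
        past = sum(
            a * speech[n - k]
            for k, a in enumerate(coeffs, start=1)
            if n - k >= 0
        )
        speech.append(past + gain * u)
    return speech


def prediction_error(speech, coeffs):
    """Equation 3.6: e(n) = s(n) - sum_{k=1..p} a_k * s(n-k)."""
    return [
        s - sum(
            a * speech[n - k]
            for k, a in enumerate(coeffs, start=1)
            if n - k >= 0
        )
        for n, s in enumerate(speech)
    ]


# Hypothetical second-order model driven by an impulse train (a "voiced" excitation).
coeffs = [0.5, -0.25]
gain = 2.0
excitation = [1.0 if n % 8 == 0 else 0.0 for n in range(32)]
speech = synthesize(excitation, coeffs, gain)
error = prediction_error(speech, coeffs)
```

When the predictor uses the same coefficients that generated the signal, the error sequence reproduces the scaled excitation sample by sample.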

Clearly, when s(n) is actually generated by a linear system of the type shown in Figure 3.7, the prediction error e(n) equals G u(n), the scaled excitation. The basic problem of linear prediction analysis is to determine the set of predictor coefficients {a_k}. The basic approach is to find the set of coefficients that minimizes the mean-square prediction error over a short segment of the speech waveform. To set up the equations that must be solved to determine the predictor coefficients, we define the short-term speech and error segments at time n as

$$s_n(m) = s(m+n) \tag{3.8a}$$

$$e_n(m) = e(m+n) \tag{3.8b}$$

and we seek to minimize the mean-square error signal at time n,

$$E_n = \sum_{m} e_n^2(m) \tag{3.9}$$

which, using the definition of e_n(m) in terms of s_n(m), can be written as

$$E_n = \sum_{m} \Big[ s_n(m) - \sum_{k=1}^{p} a_k\, s_n(m-k) \Big]^2 \tag{3.10}$$

To solve Eq. (3.10) for the predictor coefficients, we differentiate E_n with respect to each a_k and set the result to zero,

$$\frac{\partial E_n}{\partial a_k} = 0, \qquad k = 1, 2, \ldots, p \tag{3.11}$$

giving

$$\sum_{m} s_n(m-i)\, s_n(m) = \sum_{k=1}^{p} a'_k \sum_{m} s_n(m-i)\, s_n(m-k), \qquad 1 \le i \le p \tag{3.12}$$

By recognizing that terms of the form \sum_m s_n(m-i) s_n(m) are terms of the short-term covariance of s_n(m), i.e.,

$$\varphi_n(i,0) = \sum_{m} s_n(m-i)\, s_n(m) \tag{3.13}$$

and, in general, \varphi_n(i,k) = \sum_m s_n(m-i) s_n(m-k), we can express Eq. (3.12) in the form

$$\varphi_n(i,0) = \sum_{k=1}^{p} a'_k\, \varphi_n(i,k) \tag{3.14}$$

which describes a set of p equations in p unknowns. It is readily shown that the minimum mean-square error, E'_n, can be expressed as

$$E'_n = \sum_{m} s_n^2(m) - \sum_{k=1}^{p} a'_k \sum_{m} s_n(m)\, s_n(m-k) \tag{3.15}$$

$$E'_n = \varphi_n(0,0) - \sum_{k=1}^{p} a'_k\, \varphi_n(0,k) \tag{3.16}$$

Thus the minimum mean-square error consists of a fixed term \varphi_n(0,0) and terms that depend on the predictor coefficients. Two standard methods can be used to find the LPC coefficients: the covariance method and the autocorrelation method. The autocorrelation method is described in detail below.

The Autocorrelation Method

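As described earlier, the signal is segmented into frames and each frame is windowed, so the windowed frame is

$$s_n(m) = \begin{cases} s(m+n)\, w(m), & 0 \le m \le N-1 \\ 0, & \text{otherwise} \end{cases} \tag{3.14}$$

Based on this windowed signal, the mean-square error becomes

$$E_n = \sum_{m=0}^{N-1+p} e_n^2(m) \tag{3.15}$$

and \varphi_n(i,k) can be expressed as

$$\varphi_n(i,k) = \sum_{m=0}^{N-1+p} s_n(m-i)\, s_n(m-k), \qquad 1 \le i \le p,\; 0 \le k \le p \tag{3.16}$$

or, equivalently,

$$\varphi_n(i,k) = \sum_{m=0}^{N-1-(i-k)} s_n(m)\, s_n(m+i-k), \qquad 1 \le i \le p,\; 0 \le k \le p \tag{3.17}$$

Since Eq. (3.17) is a function only of i - k (rather than of the two independent variables i and k), the covariance function \varphi_n(i,k) reduces to the simple autocorrelation function

$$\varphi_n(i,k) = R_n(i-k) = \sum_{m=0}^{N-1-(i-k)} s_n(m)\, s_n(m+i-k) \tag{3.18}$$

Since the autocorrelation function is symmetric (an even function),

$$R_n(-k) = R_n(k), \qquad 0 \le k \le p \tag{3.19}$$

This is a very important property, which is useful in reducing the computation time. The LPC equations can therefore be expressed as

$$\sum_{k=1}^{p} R_n(|i-k|)\, a'_k = R_n(i), \qquad 1 \le i \le p \tag{3.20}$$

and in matrix form as

$$\begin{bmatrix} R_n(0) & R_n(1) & \cdots & R_n(p-1) \\ R_n(1) & R_n(0) & \cdots & R_n(p-2) \\ \vdots & \vdots & \ddots & \vdots \\ R_n(p-1) & R_n(p-2) & \cdots & R_n(0) \end{bmatrix} \begin{bmatrix} a'_1 \\ a'_2 \\ \vdots \\ a'_p \end{bmatrix} = \begin{bmatrix} R_n(1) \\ R_n(2) \\ \vdots \\ R_n(p) \end{bmatrix}$$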

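The formal method for converting the autocorrelation coefficients to an LPC parameter set (for the LPC autocorrelation method) is known as Durbin's method, and can be stated as the following algorithm:

$$E^{(0)} = R(0) \tag{3.21}$$

$$k_i = \frac{R(i) - \sum_{j=1}^{i-1} \alpha_j^{(i-1)}\, R(i-j)}{E^{(i-1)}}, \qquad i = 1, 2, \ldots, p \tag{3.22}$$

$$\alpha_i^{(i)} = k_i \tag{3.23}$$

$$\alpha_j^{(i)} = \alpha_j^{(i-1)} - k_i\, \alpha_{i-j}^{(i-1)}, \qquad 1 \le j \le i-1 \tag{3.24}$$

$$E^{(i)} = (1 - k_i^2)\, E^{(i-1)} \tag{3.25}$$

These equations are solved recursively for i = 1, 2, ..., p, and the final solution is

a_m = \alpha_m^{(p)} = LPC coefficients, 1 ≤ m ≤ p

k_m = PARCOR coefficients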
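The autocorrelation method and Durbin's recursion can be sketched together as follows. The function names are illustrative, the input frame is assumed to be already windowed, and the sketch does not guard against a zero prediction error (for example an all-zero frame), which a real implementation would need to handle.

```python
def autocorrelation(frame, order):
    """R(k) = sum_{m=0}^{N-1-k} s(m) * s(m+k) for k = 0..order."""
    n = len(frame)
    return [
        sum(frame[m] * frame[m + k] for m in range(n - k))
        for k in range(order + 1)
    ]


def levinson_durbin(r, order):
    """Durbin's recursion: returns (a_1..a_p, k_1..k_p, final error E^(p))."""
    alpha = [0.0] * (order + 1)  # alpha[j] holds alpha_j^(i) for the current i
    parcor = []
    error = r[0]                                  # E^(0) = R(0)
    for i in range(1, order + 1):
        acc = r[i] - sum(alpha[j] * r[i - j] for j in range(1, i))
        k_i = acc / error                         # reflection (PARCOR) coefficient
        updated = alpha[:]
        updated[i] = k_i                          # alpha_i^(i) = k_i
        for j in range(1, i):
            updated[j] = alpha[j] - k_i * alpha[i - j]
        alpha = updated
        error *= 1.0 - k_i * k_i                  # E^(i) = (1 - k_i^2) * E^(i-1)
        parcor.append(k_i)
    return alpha[1:], parcor, error


# Hypothetical autocorrelation sequence of a first-order process with a_1 = 0.5.
lpc, parcor, final_error = levinson_durbin([1.0, 0.5, 0.25], order=2)
```

For this input the recursion recovers a_1 = 0.5 and a_2 = 0, as expected for a first-order process, and the result satisfies the normal equations of the autocorrelation method.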

Converting LPC Parameter Into Cepstral Coefficients

A very important LPC parameter set, which can be derived directly from the LPC coefficient set, is the set of LPC cepstral coefficients, c(m) [5]. The recursion used is

$$c_0 = \ln \sigma^2 \tag{3.27}$$

$$c_m = a_m + \sum_{k=1}^{m-1} \frac{k}{m}\, c_k\, a_{m-k}, \qquad 1 \le m \le p \tag{3.28}$$

$$c_m = \sum_{k=1}^{m-1} \frac{k}{m}\, c_k\, a_{m-k}, \qquad m > p \tag{3.29}$$

where \sigma^2 is the gain term of the LPC model.
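The recursion of equations 3.27 to 3.29 can be sketched as below. The function name and argument names are illustrative, and the sketch treats a_j as zero for j > p so that the same sum covers both the m ≤ p and m > p cases.

```python
import math


def lpc_to_cepstrum(lpc, gain_sq, n_ceps):
    """Return [c_0, c_1, ..., c_n_ceps] from LPC coefficients a_1..a_p."""
    p = len(lpc)
    a = [0.0] + list(lpc)            # a[m] = a_m, so indexing is 1-based
    c = [math.log(gain_sq)]          # c_0 = ln(sigma^2)
    for m in range(1, n_ceps + 1):
        acc = a[m] if m <= p else 0.0
        for k in range(1, m):
            if m - k <= p:           # a_{m-k} is zero beyond the model order
                acc += (k / m) * c[k] * a[m - k]
        c.append(acc)
    return c


# Hypothetical first-order model with a_1 = 0.5 and unit gain.
cepstrum = lpc_to_cepstrum([0.5], gain_sq=1.0, n_ceps=3)
```

For a first-order model the cepstral coefficients have the closed form c_m = a_1^m / m, which gives 0.5, 0.125, and about 0.0417 here and offers a quick sanity check of the recursion.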

3.6 Post-processing (Cepstral Parameter Weighting)

Because the low-order cepstral coefficients are sensitive to the overall spectral slope and the high-order cepstral coefficients are sensitive to noise (and other forms of noise-like variability), it is important to weight the cepstral coefficients with a tapered window so as to minimize these sensitivities. A general weighting has the form

$$c'_m = w_m\, c_m \tag{3.30}$$

where an appropriate weighting is the bandpass lifter (a filter in the cepstral domain)

$$w_m = 1 + \frac{Q}{2} \sin\!\left(\frac{\pi m}{Q}\right), \qquad 1 \le m \le Q \tag{3.31}$$

This weighting function truncates the computation and de-emphasizes c_m around m = 1 and around m = Q [5].
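The bandpass lifter w_m = 1 + (Q/2) sin(pi m / Q) can be evaluated and applied as in the following sketch. The value Q = 12 is a hypothetical choice for illustration, and the input list is assumed to hold c_1 to c_Q (without c_0).

```python
import math


def lifter_weights(q):
    """w_m = 1 + (Q/2) * sin(pi*m/Q) for m = 1..Q (equation 3.31)."""
    return [1.0 + (q / 2.0) * math.sin(math.pi * m / q) for m in range(1, q + 1)]


def weight_cepstrum(cepstrum):
    """c'_m = w_m * c_m for coefficients c_1..c_Q (equation 3.30)."""
    weights = lifter_weights(len(cepstrum))
    return [w * c for w, c in zip(weights, cepstrum)]


weights = lifter_weights(12)
weighted = weight_cepstrum([1.0] * 12)
```

With Q = 12 the weight peaks at 7 for m = 6 and falls back to 1 at m = 12, so the middle coefficients are emphasized relative to those near m = 1 and m = Q.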

3.7 Vector Quantization

This step acts as an interface between the signal processing step and the recognition step. In feature extraction, a number of parameters are extracted from the pattern under test. These parameters characterize the pattern. The resulting set of numbers in turn acts as the input to a classification scheme, which compares them with stored ... class membership of the tested pattern [1]. Several techniques can be used to compare the test pattern with the reference patterns; this chapter describes the cepstral distance measure, because the project was implemented using it. To build a VQ codebook and implement a VQ analysis, we need the following [5]:

1. A large set of spectral analysis vectors v1, v2, ..., vL, which form a training set. The training set is used to create the "optimal" set of codebook vectors for representing the spectral variability observed in the training set. If we denote the size of the codebook as M = 2^B vectors (where B is the number of bits of the codebook), then we require L >> M training vectors so as to find the best set of M codebook vectors in a robust manner. In practice, L is at least 10M.

2. A measure of similarity, or distance, between vectors, used to cluster the training set vectors.

3. A centroid computation procedure.

4. A classification procedure for arbitrary speech spectral analysis vectors.

3.7.1 The VQ Training Set

To properly train the VQ codebook, the training set vectors should span the anticipated range of the following [5]:

1. Talkers, including ranges in age, accent, gender, speaking rate, and other variables

2. Speaking conditions, such as quiet rooms and noisy environments

3. Transducers and transmission systems

3.7.2 Cepstral Distance Measure


3.7.3 Codebook Building

The codebook is a clustered version of the training set, built with the K-means clustering algorithm. This algorithm was used to cluster the training vectors into a codebook of size M from a training set of size L [5]. It consists of the following steps:

1. Initialization: arbitrarily choose M vectors as the initial set of code vectors in the codebook.

2. Nearest-neighbor search: for each training vector, find the code vector in the current codebook that is closest (in terms of cepstral distance), and assign the vector to the corresponding cell.

3. Centroid update: update the code vector in each cell using the centroid of the training vectors assigned to that cell.

4. Iteration: repeat steps 2 and 3 until the average distance falls below a preset threshold or for a specified number of iterations.

More iterations generally give a better clustering, but they also increase the cost, so the number of iterations is not increased without limit; it is chosen so as to minimize the quantization error. Figure 3.8 shows an example of 4 centroids (codebook elements) built after 30 iterations on randomly generated numbers from 0 to 1.
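The four steps above can be sketched as follows. Two assumptions are made for illustration: squared Euclidean distance between vectors stands in for the cepstral distance, and the "arbitrary" initial code vectors are taken at evenly spaced positions in the training set so that the result is reproducible. The sketch runs a fixed number of iterations rather than testing a distortion threshold, and the function names are hypothetical.

```python
def squared_distance(u, v):
    """Squared Euclidean distance (assumed stand-in for the cepstral distance)."""
    return sum((x - y) ** 2 for x, y in zip(u, v))


def nearest_index(vector, codebook):
    """Index of the code vector closest to the given vector."""
    return min(range(len(codebook)),
               key=lambda i: squared_distance(vector, codebook[i]))


def train_codebook(training, size, iterations=30):
    """K-means codebook of `size` vectors built from the training set."""
    # Step 1: initialization with evenly spaced training vectors (an arbitrary choice).
    codebook = [list(training[i * len(training) // size]) for i in range(size)]
    for _ in range(iterations):                      # Step 4: iterate
        cells = [[] for _ in range(size)]
        for vector in training:                      # Step 2: nearest-neighbor search
            cells[nearest_index(vector, codebook)].append(vector)
        for i, cell in enumerate(cells):             # Step 3: centroid update
            if cell:  # an empty cell keeps its previous code vector
                codebook[i] = [sum(column) / len(cell) for column in zip(*cell)]
    return codebook


# Hypothetical two-dimensional training set with two well-separated groups.
training = [[0.0, 0.0], [0.0, 0.2], [0.2, 0.0],
            [5.0, 5.0], [5.0, 5.2], [5.2, 5.0]]
codebook = train_codebook(training, size=2)
```

Each resulting code vector is the mean of one group of training vectors, which is the centroid that minimizes the average squared distance within its cell.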


3.7.4 Classification of The Input Speech Vector

This process is illustrated by figure 3.9

The input pattern is compared with every centroid (codebook element) of the codebook in turn, using the cepstral distance measure, and the result is the index of the codebook element closest to the input pattern.
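A minimal sketch of this classification step is given below. As an assumption, squared Euclidean distance between vectors stands in for the cepstral distance, and the three-entry codebook is a made-up example; the function names are illustrative.

```python
def classify(vector, codebook):
    """Return the index of the codebook entry closest to the input vector."""
    def squared_distance(u, v):
        return sum((x - y) ** 2 for x, y in zip(u, v))

    return min(range(len(codebook)),
               key=lambda i: squared_distance(vector, codebook[i]))


def quantize(frames, codebook):
    """Map a sequence of feature vectors to a sequence of codebook indices."""
    return [classify(frame, codebook) for frame in frames]


# Hypothetical codebook with three centroids and three test vectors.
codebook = [[0.0, 0.0], [1.0, 1.0], [4.0, 4.0]]
indices = quantize([[0.1, 0.0], [3.5, 4.2], [1.0, 0.8]], codebook)
```

The resulting index sequence is a discrete observation sequence, which is the form of input a discrete HMM recognizer works with.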

Figure 3.9: VQ Classification Procedure (the cepstral coefficients of the test pattern are compared with Centroid 1 to Centroid M)

4.1 Introduction

This chapter describes a method to train on and recognize speech utterances from given observations O_t ∈ R^D, where t is a time index and D is the vector dimension. A complete sequence of observations used to describe an utterance is denoted O = (O_1, O_2, ..., O_T). The utterance may be a word, a phoneme, or, in principle, a complete sentence or paragraph. The method described here is the Hidden Markov Model, or HMM. The HMM is a stochastic approach which models the given problem as a "doubly stochastic process", in which the observed data are thought to be the result of having passed the "true" (hidden) process through a second process. Both processes are to be characterized using only the one that can be observed. The problem with this approach is that one does not know anything about the Markov chains that generate the speech. The number of states in the model is unknown, their probabilistic functions are unknown, and one cannot tell from which state an observation was produced. These properties are hidden, hence the name hidden Markov model [9]. Figure 4.1 shows the four possible levels of the HMM recognizer for solving the recognition problem.

Figure 4.1: Levels of Speech Recognition (including phoneme recognition, isolated word recognition and connected word recognition)

Definition of Discrete HMM

An HMM is a finite state machine consisting of N states. For every observation o(t) of a discrete observation sequence O, an underlying state q(t) is assumed. In each state, symbols are emitted with a corresponding probability, so $b_j(k)$ is the probability of emitting symbol k while in state j. At every time instant the state of the HMM can change; the probability of a transition from state i to state j is $a_{ij}$. If the observations are continuous, the emission probabilities become continuous densities, commonly modeled by Gaussian mixture models (GMMs).

4.2 The Elements of a Hidden Markov Model

4.2.1 Number of States in The Model (N)

Although the states are hidden, in many practical applications some physical meaning is attached to the states or to sets of states of the model. In the urn-and-ball model, each state corresponds to an urn. Generally, the states are interconnected in such a way that any state can be reached from any other state; as will be seen, other interconnection structures are also of interest, particularly in speech recognition applications. We denote the individual states as $\{S_1, S_2, S_3, \ldots, S_N\}$, and the state at time t as $q_t$ [1].

4.2.2 Number of Distinct Observation Symbols per State (M)

The observation symbols correspond to the physical output of the system being modeled. The individual symbols are denoted by $V = \{v_1, v_2, \ldots, v_M\}$.

4.2.3 The State Transition Probability Distribution (A)

The state transition probability distribution is $A = \{a_{ij}\}$, where $a_{ij}$ is defined by equation 4.1:

$$a_{ij} = P(q_{t+1} = j \mid q_t = i), \quad 1 \le i, j \le N \tag{4.1}$$

For the special case where any state can reach any other state in a single step, we have $a_{ij} > 0$ for all i, j.

4.2.4 Observation Symbol Probability Distribution B = { bj( k)}

The observation symbol probability distribution is $B = \{b_j(k)\}$, in which $b_j(k)$ is defined by equation 4.2:

$$b_j(k) = P(o_t = v_k \mid q_t = j), \quad 1 \le j \le N, \; 1 \le k \le M \tag{4.2}$$

4.2.5 Initial State Distribution (π)

The initial state distribution $\pi = \{\pi_i\}$ is defined by equation 4.3:

$$\pi_i = P(q_1 = i), \quad 1 \le i \le N \tag{4.3}$$

A complete specification of an HMM therefore requires two model parameters, the number of states (N) and the number of distinct observation symbols per state (M), the specification of the observation symbols, and the specification of the three probability measures A, B and π. For convenience, we use the compact notation:

$$\lambda = (A, B, \pi) \tag{4.4}$$

4.3 HMM Generator of Observation

Given appropriate values of N, M, A, B and π, the HMM can be used as a generator to give an observation sequence $O = (o_1, o_2, \ldots, o_T)$, in which each $o_t$ is one of the symbols from V and T is the number of observations in the sequence, as follows:

1) Choose an initial state $q_1 = i$ according to the initial state distribution π.
2) Set t = 1.
3) Choose $o_t = v_k$ according to the symbol probability distribution in state i, i.e., $b_i(k)$.
4) Transit to a new state $q_{t+1} = j$ according to the state transition probability distribution for state i, i.e., $a_{ij}$.
5) Set t = t + 1; return to step (3) if t ≤ T; otherwise terminate the procedure.

The compact notation λ indicates the complete parameter set of the model. This parameter set, of course, defines a probability measure for O, namely P(O|λ). In the development of the HMM methodology, the following problems are of particular interest.
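The generator procedure above can be sketched as follows. This is an illustrative sketch only: `sample_index` and `generate` are hypothetical helper names, symbols and states are represented by zero-based indices, and the random number generator is passed in so that results are repeatable.

```python
import random

def sample_index(probs, rng):
    """Draw an index i with probability probs[i]."""
    r = rng.random()
    cumulative = 0.0
    for i, p in enumerate(probs):
        cumulative += p
        if r < cumulative:
            return i
    return len(probs) - 1  # guard against floating-point rounding

def generate(A, B, pi, T, rng):
    """Generate T observation symbol indices and the hidden state path."""
    state = sample_index(pi, rng)               # step 1: initial state
    observations, states = [], []
    for _ in range(T):                          # steps 2 and 5: time loop
        observations.append(sample_index(B[state], rng))  # step 3: emit a symbol
        states.append(state)
        state = sample_index(A[state], rng)     # step 4: transition
    # The transition drawn after the last emission is simply unused.
    return observations, states

# Hypothetical two-state model whose behaviour is fully deterministic.
A = [[0.0, 1.0], [0.0, 1.0]]
B = [[1.0, 0.0], [0.0, 1.0]]
pi = [1.0, 0.0]
observations, states = generate(A, B, pi, 4, random.Random(0))
```

With real model parameters only the observation list would be visible to a recognizer; the state path is returned here purely for inspection.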

4.4 The Three Basic Problems for Hidden Markov Models

Given the form of HMM from the previous section, there are three basic problems that must be solved for the model. Their solutions are considered later in this chapter. The problems are the following:

Problem.1

Given the observation sequence $O = (o_1, o_2, \ldots, o_T)$ and a model λ = (A, B, π), how do we efficiently compute P(O|λ), the probability of the observation sequence given the model [1]? This is called the evaluation problem.

Problem.2

Given the observation sequence $O = (o_1, o_2, \ldots, o_T)$ and a model λ, how do we choose a corresponding state sequence $Q = (q_1, q_2, \ldots, q_T)$ that is optimal in some sense? In this problem we attempt to uncover the hidden part of the model, that is, to find the correct state sequence [1]. This is called the estimation problem (it is also commonly known as the decoding problem).

Problem.3

How do we adjust the model parameters λ = (A, B, π) to maximize P(O|λ)? In this problem we attempt to optimize the model parameters so that they best describe how an observation sequence comes about. The observation sequence is called a training sequence since it is used to train the HMM. The training problem is crucial for most HMM applications, since it allows us to adapt the model parameters optimally to the observed training data [1]. This is called the training problem.

4.5 The Solution of These Problems

4.5.1 Solution of Problem 1 - Probability Evaluation

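We wish to calculate the probability of the observation sequence $O = (o_1, o_2, \ldots, o_T)$ given the model λ, i.e., P(O|λ). The most straightforward way of doing this is to enumerate every possible state sequence of length T (the number of observations). There are $N^T$ such state sequences. Consider one fixed state sequence

$$q = (q_1, q_2, q_3, \ldots, q_T) \tag{4.5}$$

where $q_1$ is the initial state. The probability of the observation sequence O given the state sequence of Eq 4.5 is

$$P(O \mid q, \lambda) = \prod_{t=1}^{T} P(o_t \mid q_t, \lambda) \tag{4.6a}$$

where we have assumed statistical independence of the observations. Thus we get

$$P(O \mid q, \lambda) = b_{q_1}(o_1)\, b_{q_2}(o_2) \cdots b_{q_T}(o_T) \tag{4.6b}$$

The probability of such a state sequence q can be written as

$$P(q \mid \lambda) = \pi_{q_1}\, a_{q_1 q_2}\, a_{q_2 q_3} \cdots a_{q_{T-1} q_T} \tag{4.7}$$

The joint probability of O and q, i.e., the probability that O and q occur simultaneously, is the product of the two terms above:

$$P(O, q \mid \lambda) = P(O \mid q, \lambda)\, P(q \mid \lambda) \tag{4.8}$$

The probability of O (given the model) is obtained by summing this joint probability over all possible state sequences q, giving

$$P(O \mid \lambda) = \sum_{\text{all } q} P(O \mid q, \lambda)\, P(q \mid \lambda) \tag{4.9}$$

$$= \sum_{q_1, q_2, \ldots, q_T} \pi_{q_1} b_{q_1}(o_1)\, a_{q_1 q_2} b_{q_2}(o_2) \cdots a_{q_{T-1} q_T} b_{q_T}(o_T) \tag{4.10}$$

Eq 4.10 is the direct definition of P(O|λ). Its calculation involves on the order of $2T \cdot N^T$ operations, since at every t = 1, 2, …, T there are N possible states that can be reached (giving $N^T$ state sequences), and each sequence requires about 2T calculations. This is infeasible even for small values of N and T; for example, N = 5 and T = 100 give on the order of $2 \cdot 100 \cdot 5^{100} \approx 10^{72}$ operations. The problem is solved by a more efficient procedure, the forward procedure.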

4.5.1.1 The Forward Procedure

Consider a forward variable $\alpha_t(i)$ defined as:

$$\alpha_t(i) = P(o_1 o_2 \ldots o_t,\; q_t = i \mid \lambda) \tag{4.11}$$

where t represents time and i is the state. Thus $\alpha_t(i)$ is the probability of the partial observation sequence $o_1 o_2 \ldots o_t$ (until time t) together with being in state i at time t. The forward variable can be calculated inductively, see Fig. 4.2. $\alpha_{t+1}(j)$ is found by summing the forward variables of all N states at time t, each multiplied by its corresponding state transition probability $a_{ij}$, and then multiplying by the emission probability $b_j(o_{t+1})$.

This can be done with the following procedure:

1. Initialization

Set t = 1;

$$\alpha_1(i) = \pi_i\, b_i(o_1), \quad 1 \le i \le N \tag{4.12}$$

2. Induction

$$\alpha_{t+1}(j) = b_j(o_{t+1}) \sum_{i=1}^{N} \alpha_t(i)\, a_{ij}, \quad 1 \le j \le N \tag{4.13}$$

3. Update time

Set t = t + 1;

Return to step 2 if t < T;

Otherwise, terminate the algorithm (go to step 4).

4. Termination

$$P(O \mid \lambda) = \sum_{i=1}^{N} \alpha_T(i) \tag{4.14}$$

The forward algorithm needs N(N + 1)(T - 1) + N multiplications and N(N - 1)(T - 1) additions, that is, on the order of $N^2 T$ calculations. For N = 5 and T = 100 this is 2975 multiplications and 1980 additions, which is a considerable improvement over the direct method.
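A minimal Python sketch of the forward procedure is given below for illustration. It assumes a discrete HMM stored as nested lists with zero-based state and symbol indices; the function name `forward` and the small two-state example model are hypothetical and are not taken from the project code.

```python
def forward(A, B, pi, obs):
    """Return (alpha, P(O|lambda)) for a discrete HMM.

    A[i][j]: transition probability from state i to state j
    B[i][k]: probability of emitting symbol k in state i
    pi[i]:   initial state probability
    obs:     list of observation symbol indices
    """
    N = len(pi)
    alpha = [[pi[i] * B[i][obs[0]] for i in range(N)]]    # initialization
    for t in range(1, len(obs)):                           # induction
        prev = alpha[-1]
        alpha.append([
            B[j][obs[t]] * sum(prev[i] * A[i][j] for i in range(N))
            for j in range(N)
        ])
    return alpha, sum(alpha[-1])                           # termination

# Hypothetical two-state, two-symbol model.
A = [[0.7, 0.3], [0.4, 0.6]]
B = [[0.5, 0.5], [0.1, 0.9]]
pi = [0.6, 0.4]
alpha, likelihood = forward(A, B, pi, [0, 1, 0])
```

For short sequences the result can be checked against the direct enumeration over all state sequences; for long sequences the unscaled values underflow, which is why scaling is introduced later in this chapter.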

4.5.1.2 The Backward Algorithm

The recursion described in the forward algorithm can also be carried out in reverse time, by defining the backward variable $\beta_t(i)$ as:

$$\beta_t(i) = P(o_{t+1} o_{t+2} \ldots o_T \mid q_t = i, \lambda) \tag{4.15}$$

This is the probability of the partial observation sequence from t + 1 to the end, given state i at time t and the model. Notice that the forward variable is defined as a joint probability, whereas the backward variable is a conditional probability. In a similar manner to the forward algorithm, figure 4.3 illustrates the calculations involved in the backward algorithm.

Figure 4.3: Backward Procedures Inductively.

The backward algorithm includes the following steps:

1. Initialization

Set t = T - 1;

$$\beta_T(i) = 1, \quad 1 \le i \le N \tag{4.16}$$

2. Induction

$$\beta_t(i) = \sum_{j=1}^{N} a_{ij}\, b_j(o_{t+1})\, \beta_{t+1}(j), \quad 1 \le i \le N \tag{4.17}$$

3. Update time

Set t = t - 1;

Return to step 2 if t > 0;

Otherwise, terminate the algorithm.

Note that the initialization step 1 arbitrarily defines $\beta_T(i)$ to be 1 for all i. Again, the computation of $\beta_t(i)$, 1 ≤ t ≤ T, 1 ≤ i ≤ N, requires on the order of $N^2 T$ calculations.
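The backward recursion can be sketched in the same style. As before, this is an illustration under assumptions: the model is a discrete HMM stored as nested lists with zero-based indices, and `backward` and the example model are hypothetical names and values.

```python
def backward(A, B, obs):
    """Return beta, where beta[t][i] = P(o_{t+1} ... o_T | q_t = i, lambda).

    Time indices are zero-based here, so beta[T-1] is the initialization row.
    """
    N, T = len(A), len(obs)
    beta = [[1.0] * N for _ in range(T)]           # initialization: beta_T(i) = 1
    for t in range(T - 2, -1, -1):                 # induction, backward in time
        for i in range(N):
            beta[t][i] = sum(
                A[i][j] * B[j][obs[t + 1]] * beta[t + 1][j] for j in range(N)
            )
    return beta

# Hypothetical two-state, two-symbol model.
A = [[0.7, 0.3], [0.4, 0.6]]
B = [[0.5, 0.5], [0.1, 0.9]]
pi = [0.6, 0.4]
obs = [0, 1, 0]
beta = backward(A, B, obs)
# P(O|lambda) can also be recovered from the backward variable at the first time step.
likelihood = sum(pi[i] * B[i][obs[0]] * beta[0][i] for i in range(2))
```

The likelihood obtained this way should agree with the one given by the forward procedure, which is a convenient consistency check.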

4.5.1.3 Scaling the Forward and Backward Variables

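The calculation of $\alpha_t(i)$ and $\beta_t(i)$ involves multiplication of probabilities. All these probabilities are less than 1 (generally significantly less than 1), so as t grows large each term of $\alpha_t(i)$ or $\beta_t(i)$ heads exponentially to zero.

The basic scaling procedure multiplies $\alpha_t(i)$ by a scaling coefficient that depends only on the time t and is independent of the state i [5]. The scaling factor for the forward variable is denoted $c_t$ (scaling is done at every time t for all states i, 1 ≤ i ≤ N). This factor is also used for scaling the backward variable $\beta_t(i)$. Scaling $\alpha_t(i)$ and $\beta_t(i)$ with the same scale factor proves useful in problem 3 (parameter estimation). Consider the computation of the forward variable $\alpha_t(i)$. In the scaled variant of the forward algorithm some extra notation is used: $\alpha_t(i)$ denotes the unscaled forward variable, $\alpha'_t(i)$ denotes the scaled and iterated variant of $\alpha_t(i)$, $\alpha''_t(i)$ denotes the local version of $\alpha_t(i)$ before scaling, and $c_t$ represents the scaling coefficient at each time. The scaled forward algorithm includes the following steps:

1. Initialization

Set t = 2;

$$\alpha_1(i) = \pi_i\, b_i(o_1), \quad 1 \le i \le N \tag{4.18}$$

$$\alpha''_1(i) = \alpha_1(i), \quad 1 \le i \le N \tag{4.19}$$

$$c_1 = \frac{1}{\sum_{i=1}^{N} \alpha''_1(i)} \tag{4.20}$$

$$\alpha'_1(i) = c_1\, \alpha''_1(i), \quad 1 \le i \le N \tag{4.21}$$

2. Induction

$$\alpha''_t(i) = b_i(o_t) \sum_{j=1}^{N} \alpha'_{t-1}(j)\, a_{ji}, \quad 1 \le i \le N \tag{4.22}$$

$$c_t = \frac{1}{\sum_{i=1}^{N} \alpha''_t(i)} \tag{4.23}$$

$$\alpha'_t(i) = c_t\, \alpha''_t(i), \quad 1 \le i \le N \tag{4.24}$$

3. Update time

Set t = t + 1;

Return to step 2 if t ≤ T;

Otherwise, terminate the algorithm (go to step 4).

4. Termination

$$\log P(O \mid \lambda) = -\sum_{t=1}^{T} \log c_t \tag{4.25}$$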

The main difference between the scaled and the unscaled forward algorithm lies in steps two and four. In step two, (4.24) can be rewritten using (4.22) and (4.23):

$$\alpha'_t(i) = c_t\, \alpha''_t(i) = \frac{b_i(o_t) \sum_{j=1}^{N} \alpha'_{t-1}(j)\, a_{ji}}{\sum_{k=1}^{N} \left[ b_k(o_t) \sum_{j=1}^{N} \alpha'_{t-1}(j)\, a_{jk} \right]}, \quad 1 \le i \le N \tag{4.26}$$

By induction, the scaled forward variable can be expressed in terms of the unscaled one as:

$$\alpha'_{t-1}(j) = \left( \prod_{\tau=1}^{t-1} c_\tau \right) \alpha_{t-1}(j) \tag{4.27}$$

The ordinary induction step is (same as (4.13) but shifted by one time unit):

$$\alpha_t(j) = \left[ \sum_{i=1}^{N} \alpha_{t-1}(i)\, a_{ij} \right] b_j(o_t) \tag{4.28}$$

With (4.27) and (4.28) it is now possible to rewrite (4.26) as:

$$\alpha'_t(i) = \frac{b_i(o_t) \sum_{j=1}^{N} \left( \prod_{\tau=1}^{t-1} c_\tau \right) \alpha_{t-1}(j)\, a_{ji}}{\sum_{k=1}^{N} b_k(o_t) \sum_{j=1}^{N} \left( \prod_{\tau=1}^{t-1} c_\tau \right) \alpha_{t-1}(j)\, a_{jk}} = \frac{\alpha_t(i)}{\sum_{k=1}^{N} \alpha_t(k)}, \quad 1 \le i \le N \tag{4.29}$$

As (4.29) shows, each $\alpha_t(i)$ is scaled by the sum of $\alpha_t(k)$ over all states when the scaled forward algorithm is applied. The termination (step 4) of the scaled forward algorithm, the evaluation of P(O|λ), must be done in a different way, because the sum of $\alpha'_T(i)$ cannot be used: $\alpha'_T(i)$ is already scaled. However, the following properties can be used:

$$\left( \prod_{\tau=1}^{T} c_\tau \right) \sum_{i=1}^{N} \alpha_T(i) = 1 \tag{4.30}$$

$$\left( \prod_{\tau=1}^{T} c_\tau \right) P(O \mid \lambda) = 1 \tag{4.31}$$

$$P(O \mid \lambda) = \frac{1}{\prod_{\tau=1}^{T} c_\tau} \tag{4.32}$$

As (4.32) shows, P(O|λ) can be found, but if (4.32) is used directly the result will still be very small (and probably out of the dynamic range of a computer). Taking the logarithm of both sides gives:

$$\log P(O \mid \lambda) = \log \frac{1}{\prod_{\tau=1}^{T} c_\tau} = -\sum_{t=1}^{T} \log c_t \tag{4.33}$$

This is exactly what is done in the termination step of the scaled forward algorithm. The logarithm of P(O|λ) is often just as useful as P(O|λ) itself because, in most cases, this measure is used for comparison with other probabilities (for other models). The scaled backward algorithm is found more easily, since it uses the same scale factors as the forward algorithm. The notation is similar to that of the forward variables: $\beta_t(i)$ denotes the unscaled backward variable, $\beta'_t(i)$ denotes the scaled and iterated variant of $\beta_t(i)$, $\beta''_t(i)$ denotes the local version of $\beta_t(i)$ before scaling, and $c_t$ represents the scaling coefficient at each time. The following equations define the scaled backward algorithm:

1. Initialization

Set t = T - 1;

$$\beta_T(i) = 1, \quad 1 \le i \le N \tag{4.34}$$

$$\beta'_T(i) = c_T\, \beta_T(i), \quad 1 \le i \le N \tag{4.35}$$

2. Induction

$$\beta''_t(i) = \sum_{j=1}^{N} a_{ij}\, b_j(o_{t+1})\, \beta'_{t+1}(j), \quad 1 \le i \le N \tag{4.36}$$

$$\beta'_t(i) = c_t\, \beta''_t(i), \quad 1 \le i \le N \tag{4.37}$$

3. Update time

Set t = t - 1;

Return to step 2 if t > 0;

Otherwise, terminate the algorithm.
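The scaled forward pass and its logarithmic termination step can be sketched as follows. This is an illustrative sketch, not the project implementation: `scaled_forward` and the example model are hypothetical, indices are zero-based, and the code assumes every observation has non-zero probability under the model (otherwise the scale factor would require a division by zero).

```python
import math

def scaled_forward(A, B, pi, obs):
    """Return (scaled alpha, scale factors c_t, log P(O|lambda))."""
    N = len(pi)
    alpha_scaled, scales = [], []
    prev = None
    for t, o in enumerate(obs):
        if t == 0:
            local = [pi[i] * B[i][o] for i in range(N)]           # unscaled start
        else:
            local = [B[i][o] * sum(prev[j] * A[j][i] for j in range(N))
                     for i in range(N)]                           # local, before scaling
        c = 1.0 / sum(local)                                      # scale factor c_t
        prev = [c * x for x in local]                             # scaled variable
        alpha_scaled.append(prev)
        scales.append(c)
    log_likelihood = -sum(math.log(c) for c in scales)            # termination
    return alpha_scaled, scales, log_likelihood

# Hypothetical two-state, two-symbol model.
A = [[0.7, 0.3], [0.4, 0.6]]
B = [[0.5, 0.5], [0.1, 0.9]]
pi = [0.6, 0.4]
_, _, short_ll = scaled_forward(A, B, pi, [0, 1, 0])
alpha_long, _, long_ll = scaled_forward(A, B, pi, [0, 1] * 1000)
```

Each row of the scaled forward variable sums to one, so the recursion stays within floating-point range even for the 2000-sample sequence in the example.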

4.5.2 Solution to Problem 2 - "Optimal" State Sequence

The problem is to find the optimal sequence of states for a given observation sequence and model. Unlike problem one, for which an exact solution can be found, there are several possible ways of solving this problem. The difficulty lies in the definition of the optimal state sequence; that is, there are several possible optimality criteria. One criterion is to choose the states $q_t$ that are individually most likely at each time t. To find this state sequence the following probability variable is needed:

$$\gamma_t(i) = P(q_t = i \mid O, \lambda) \tag{4.38}$$

That is, the probability of being in state i at time t given the observation sequence O and the model λ. Another way to write $\gamma_t(i)$ is:

$$\gamma_t(i) = \frac{P(O, q_t = i \mid \lambda)}{P(O \mid \lambda)} = \frac{P(O, q_t = i \mid \lambda)}{\sum_{i=1}^{N} P(O, q_t = i \mid \lambda)} \tag{4.39}$$

Since $\alpha_t(i) = P(o_1 o_2 \ldots o_t, q_t = i \mid \lambda)$ and $\beta_t(i) = P(o_{t+1} o_{t+2} \ldots o_T \mid q_t = i, \lambda)$, $P(O, q_t = i \mid \lambda)$ can be found as a joint probability:

$$P(O, q_t = i \mid \lambda) = P(o_1 o_2 \ldots o_t, q_t = i \mid \lambda) \cdot P(o_{t+1} o_{t+2} \ldots o_T \mid q_t = i, \lambda) = \alpha_t(i)\, \beta_t(i) \tag{4.40}$$

With (4.40) it is now possible to rewrite (4.39) as:

$$\gamma_t(i) = \frac{\alpha_t(i)\, \beta_t(i)}{\sum_{i=1}^{N} \alpha_t(i)\, \beta_t(i)} \tag{4.41}$$

When $\gamma_t(i)$ is calculated according to (4.41), the most likely state at time t, $q_t^*$, is found by:

$$q_t^* = \arg\max_{1 \le i \le N} \left[ \gamma_t(i) \right], \quad 1 \le t \le T \tag{4.42}$$

Even though Eq (4.42) maximizes the expected number of correct states, there can be problems with the resulting state sequence, because the state transition probabilities have not been taken into account; for example, the sequence may contain a transition whose probability $a_{ij}$ is zero. This problem is solved by the Viterbi algorithm, which is discussed in the following section [5].

4.5.2.1 The Viterbi Algorithm

This algorithm is similar to the forward algorithm. The main difference is that the forward algorithm sums over previous states, whereas the Viterbi algorithm uses maximization. The aim of the Viterbi algorithm is to find the single best state sequence $q = (q_1, q_2, q_3, \ldots, q_T)$ for the given observation sequence $O = (o_1, o_2, \ldots, o_T)$ and model λ. Consider the following quantity:

$$\delta_t(i) = \max_{q_1, q_2, \ldots, q_{t-1}} P(q_1 q_2 \ldots q_{t-1},\; q_t = i,\; o_1 o_2 \ldots o_t \mid \lambda) \tag{4.43}$$

That is, the highest probability along a single path that accounts for the first t observations $o_1 o_2 \ldots o_t$ and ends in state i at time t, given the model λ. By induction, $\delta_{t+1}(j)$ can be found as:

$$\delta_{t+1}(j) = b_j(o_{t+1}) \max_{1 \le i \le N} \left[ \delta_t(i)\, a_{ij} \right] \tag{4.44}$$

To actually retrieve the state sequence, it is necessary to keep track of the argument that maximizes (4.44) for each t and j. This is done by saving the argument in an array $\varphi_t(j)$. The complete Viterbi algorithm follows:

1. Initialization

Set t = 2;

$$\delta_1(i) = \pi_i\, b_i(o_1), \quad 1 \le i \le N \tag{4.45}$$

$$\varphi_1(i) = 0, \quad 1 \le i \le N \tag{4.46}$$

2. Induction

$$\delta_t(j) = b_j(o_t) \max_{1 \le i \le N} \left[ \delta_{t-1}(i)\, a_{ij} \right], \quad 1 \le j \le N \tag{4.47}$$

$$\varphi_t(j) = \arg\max_{1 \le i \le N} \left[ \delta_{t-1}(i)\, a_{ij} \right], \quad 1 \le j \le N \tag{4.48}$$

3. Update time

Set t = t + 1;

Return to step 2 if t ≤ T;

Otherwise, terminate the algorithm (go to step 4).

4. Termination

$$P^* = \max_{1 \le i \le N} \left[ \delta_T(i) \right] \tag{4.49}$$

$$q_T^* = \arg\max_{1 \le i \le N} \left[ \delta_T(i) \right] \tag{4.50}$$

5. Path (state sequence) backtracking

(a) Initialization: set t = T - 1.

(b) Backtracking:

$$q_t^* = \varphi_{t+1}(q_{t+1}^*) \tag{4.51}$$

(c) Update time: set t = t - 1; return to step (b) if t ≥ 1; otherwise, terminate the algorithm.

The same problem as for the forward and backward algorithms occurs here: the algorithm involves multiplication of probabilities, and the precision range will be exceeded. This is why an alternative Viterbi algorithm is needed [9].

4.5.2.2 The Alternative Viterbi Algorithm

As mentioned, the original Viterbi algorithm involves multiplication of probabilities. One way to avoid this is to take the logarithm of the model parameters, so that the multiplications become additions. Figure 4.4 shows an example of a Viterbi search.
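A log-domain Viterbi search can be sketched as below. This is a minimal illustration under assumptions: the names `safe_log` and `viterbi_log` are hypothetical, states and symbols are zero-based indices, and zero probabilities are mapped to negative infinity so that forbidden transitions are never selected.

```python
import math

def safe_log(p):
    """Logarithm that maps a zero probability to negative infinity."""
    return math.log(p) if p > 0 else float("-inf")

def viterbi_log(A, B, pi, obs):
    """Return (best state path, log probability of that path)."""
    N = len(pi)
    log_a = [[safe_log(a) for a in row] for row in A]
    delta = [safe_log(pi[i]) + safe_log(B[i][obs[0]]) for i in range(N)]
    back = []                                     # back[t-1][j]: best predecessor of j
    for t in range(1, len(obs)):
        new_delta, pointers = [], []
        for j in range(N):
            best_i = max(range(N), key=lambda i: delta[i] + log_a[i][j])
            pointers.append(best_i)
            new_delta.append(delta[best_i] + log_a[best_i][j]
                             + safe_log(B[j][obs[t]]))
        delta = new_delta
        back.append(pointers)
    last = max(range(N), key=lambda i: delta[i])  # termination
    path = [last]
    for pointers in reversed(back):               # backtracking
        path.append(pointers[path[-1]])
    path.reverse()
    return path, delta[last]

# Hypothetical left-right model with deterministic emissions.
A = [[0.5, 0.5], [0.0, 1.0]]
B = [[1.0, 0.0], [0.0, 1.0]]
pi = [1.0, 0.0]
path, log_prob = viterbi_log(A, B, pi, [0, 0, 1, 1])
```

Because only sums and comparisons of logarithms are used, the search does not underflow for long observation sequences.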

4.5.3 Solution to Problem 3 - Parameter Estimation

The third problem is concerned with the estimation of the model parameters λ = (A, B, π). The problem can be formulated as follows: given an observation O, find the model λ* from all possible λ that maximizes P(O|λ). This problem is the most difficult of the three, because there is no known way to find analytically, in closed form, the model parameters that maximize the probability of the observation sequence. However, the model parameters can be chosen to locally maximize the likelihood P(O|λ).

Figure 4.4: Example of Viterbi Search

Commonly used methods for solving this problem are the Baum-Welch method (also known as the expectation-maximization method) and gradient techniques. Both methods use iterations to improve the likelihood P(O|λ); this chapter discusses the Baum-Welch method. This section derives the re-estimation equations used in the Baum-Welch method. To describe the procedure for re-estimation (iterative update and improvement) of the HMM parameters, we first define $\varepsilon_t(i, j)$ as

$$\varepsilon_t(i, j) = P(q_t = i,\; q_{t+1} = j \mid O, \lambda) \tag{4.53}$$

$$= \frac{\alpha_t(i)\, a_{ij}\, b_j(o_{t+1})\, \beta_{t+1}(j)}{\sum_{i=1}^{N} \sum_{j=1}^{N} \alpha_t(i)\, a_{ij}\, b_j(o_{t+1})\, \beta_{t+1}(j)} \tag{4.54}$$

We have previously defined $\gamma_t(i)$ as the probability of being in state i at time t, given the entire observation sequence and the model, so we can relate $\gamma_t(i)$ and $\varepsilon_t(i, j)$ by summing over j, giving

$$\gamma_t(i) = \sum_{j=1}^{N} \varepsilon_t(i, j) \tag{4.55}$$

If we sum $\gamma_t(i)$ over the time index t, we get a quantity that can be interpreted as the expected number of times that state i is visited. Similarly, summation of $\varepsilon_t(i, j)$ over t can be interpreted as the expected number of transitions from state i to state j. That is,

$$\sum_{t=1}^{T-1} \varepsilon_t(i, j) = \text{expected number of transitions from state } i \text{ to state } j \text{ in } O \tag{4.56a}$$

$$\sum_{t=1}^{T-1} \gamma_t(i) = \text{expected number of transitions from state } i \text{ in } O \tag{4.56b}$$

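Using these quantities, the re-estimation formulas for π, A and B are:

$$\bar{\pi}_i = \text{expected frequency in state } i \text{ at time } t = 1 = \gamma_1(i) \tag{4.57a}$$

$$\bar{a}_{ij} = \frac{\text{expected number of transitions from state } i \text{ to state } j}{\text{expected number of transitions from state } i} = \frac{\sum_{t=1}^{T-1} \varepsilon_t(i, j)}{\sum_{t=1}^{T-1} \gamma_t(i)} \tag{4.57b}$$

$$\bar{b}_j(k) = \frac{\text{expected number of times in state } j \text{ and observing symbol } v_k}{\text{expected number of times in state } j} = \frac{\sum_{t=1,\; o_t = v_k}^{T} \gamma_t(j)}{\sum_{t=1}^{T} \gamma_t(j)} \tag{4.57c}$$

As noted earlier, $\alpha_t(i)$ and $\beta_t(i)$ need to be scaled, so equation 4.57b can be rewritten using the scaled $\alpha'_t(i)$ and $\beta'_t(i)$ as follows:

$$\bar{a}_{ij} = \frac{\sum_{t=1}^{T-1} \alpha'_t(i)\, a_{ij}\, b_j(o_{t+1})\, \beta'_{t+1}(j)}{\sum_{t=1}^{T-1} \sum_{j=1}^{N} \alpha'_t(i)\, a_{ij}\, b_j(o_{t+1})\, \beta'_{t+1}(j)} \tag{4.58}$$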
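One Baum-Welch re-estimation step for a single short observation sequence can be sketched as follows. For brevity the sketch uses the unscaled forward and backward variables, so it is only suitable for short sequences; all function names and the example model are hypothetical, and every model parameter is assumed to be non-zero so that no denominator vanishes.

```python
def forward(A, B, pi, obs):
    """Unscaled forward variable alpha[t][i]."""
    N = len(pi)
    alpha = [[pi[i] * B[i][obs[0]] for i in range(N)]]
    for t in range(1, len(obs)):
        prev = alpha[-1]
        alpha.append([B[j][obs[t]] * sum(prev[i] * A[i][j] for i in range(N))
                      for j in range(N)])
    return alpha

def backward(A, B, obs):
    """Unscaled backward variable beta[t][i]."""
    N, T = len(A), len(obs)
    beta = [[1.0] * N for _ in range(T)]
    for t in range(T - 2, -1, -1):
        for i in range(N):
            beta[t][i] = sum(A[i][j] * B[j][obs[t + 1]] * beta[t + 1][j]
                             for j in range(N))
    return beta

def baum_welch_step(A, B, pi, obs):
    """Return re-estimated (A, B, pi) and the likelihood of the old model."""
    N, M, T = len(pi), len(B[0]), len(obs)
    alpha, beta = forward(A, B, pi, obs), backward(A, B, obs)
    likelihood = sum(alpha[-1])
    # gamma[t][i]: probability of being in state i at time t.
    gamma = [[alpha[t][i] * beta[t][i] / likelihood for i in range(N)]
             for t in range(T)]
    # xi[t][i][j]: probability of the transition i -> j at time t.
    xi = [[[alpha[t][i] * A[i][j] * B[j][obs[t + 1]] * beta[t + 1][j] / likelihood
            for j in range(N)] for i in range(N)] for t in range(T - 1)]
    new_pi = gamma[0][:]
    new_A = [[sum(xi[t][i][j] for t in range(T - 1)) /
              sum(gamma[t][i] for t in range(T - 1))
              for j in range(N)] for i in range(N)]
    new_B = [[sum(gamma[t][j] for t in range(T) if obs[t] == k) /
              sum(gamma[t][j] for t in range(T))
              for k in range(M)] for j in range(N)]
    return new_A, new_B, new_pi, likelihood

# Hypothetical starting model and training sequence.
A0 = [[0.7, 0.3], [0.4, 0.6]]
B0 = [[0.5, 0.5], [0.1, 0.9]]
pi0 = [0.6, 0.4]
obs = [0, 1, 0, 0, 1, 1, 0, 1]
A1, B1, pi1, L0 = baum_welch_step(A0, B0, pi0, obs)
_, _, _, L1 = baum_welch_step(A1, B1, pi1, obs)
```

Repeating the step until the likelihood stops improving gives a locally optimal model; a practical implementation would use the scaled variables instead.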

4.6 Types of HMM

Different kinds of structures for HMMs can be used. The structure is defined by the transition matrix A. The most general, and most complex, structure is the ergodic or fully connected HMM. In this model every state can be reached from every other state of the model. As shown in figure 4.5 for an N = 4 state model, this model has the property $0 < a_{ij} < 1$ (zero and one have to be excluded, otherwise the ergodic property is not fulfilled). The state transition matrix A for an ergodic model can be described by

$$A = \begin{pmatrix} a_{11} & a_{12} & a_{13} & a_{14} \\ a_{21} & a_{22} & a_{23} & a_{24} \\ a_{31} & a_{32} & a_{33} & a_{34} \\ a_{41} & a_{42} & a_{43} & a_{44} \end{pmatrix}$$

Figure 4.5: Example for Ergodic Structure of HMM

In speech recognition it is desirable to use a model that models the observations in a successive manner, since this is a property of speech. The model that fulfills this requirement is the left-right model or Bakis model, see figure 4.6a. The property of a left-right model is:

$$a_{ij} = 0, \quad j < i \tag{4.58a}$$

That is, no jumps can be made to a previous state. Eq 4.58b gives the initialization of π for this type of model:

$$\pi_i = \begin{cases} 0, & i \ne 1 \\ 1, & i = 1 \end{cases} \tag{4.58b}$$

Note that, for a left-right model, the state transition coefficients for the last state have the following property:

$$a_{NN} = 1, \qquad a_{Nj} = 0, \quad j < N \tag{4.59}$$

Figures 4.6a and 4.6b show the structure of two left-right models.

Figure 4.6b: Left-Right Model With Three Transitions Allowed

The following transition matrix corresponds to the model in figure 4.6b:

$$A = \begin{pmatrix} a_{11} & a_{12} & a_{13} & 0 \\ 0 & a_{22} & a_{23} & a_{24} \\ 0 & 0 & a_{33} & a_{34} \\ 0 & 0 & 0 & a_{44} \end{pmatrix} \tag{4.60}$$
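A left-right transition matrix with this zero pattern can be built as in the sketch below. The uniform initial values are an assumption made for illustration (any row-stochastic values with the same zero pattern would do), and `left_right_matrix` and `max_jump` are hypothetical names.

```python
def left_right_matrix(n_states, max_jump):
    """Build an initial left-right transition matrix.

    a_ij is zero for j < i (no backward jumps) and for j > i + max_jump;
    the allowed transitions of each row share the probability uniformly.
    """
    A = []
    for i in range(n_states):
        last = min(i + max_jump, n_states - 1)
        width = last - i + 1
        A.append([1.0 / width if i <= j <= last else 0.0
                  for j in range(n_states)])
    return A

# Four states with three allowed transitions per state (stay, +1, +2).
A = left_right_matrix(4, 2)
pi = [1.0, 0.0, 0.0, 0.0]   # the model always starts in the first state
```

Since re-estimation keeps zero entries at zero, the left-right structure chosen at initialization is preserved during training.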

4.7 Multiple Observation Sequences

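The transient nature of the states within the model allows only a small number of observations for any state, and training is a difficult problem that must be solved well to ensure system performance for any possible user input. To obtain sufficient and reliable data, multiple observation sequences are therefore recommended. The modification of the re-estimation procedure is straightforward and is as follows. Denote the set of K observation sequences as

$$O = [O^{(1)}, O^{(2)}, \ldots, O^{(K)}] \tag{4.61}$$

where $O^{(k)} = [o_1^{(k)}, o_2^{(k)}, \ldots, o_{T_k}^{(k)}]$ is the kth observation sequence. We assume each observation sequence is independent of every other observation sequence, and our goal is to adjust the parameters of the model λ to maximize

$$P(O \mid \lambda) = \prod_{k=1}^{K} P(O^{(k)} \mid \lambda) = \prod_{k=1}^{K} P_k \tag{4.62}$$

Thus the modified re-estimation formulas are

$$\bar{a}_{ij} = \frac{\sum_{k=1}^{K} \frac{1}{P_k} \sum_{t=1}^{T_k - 1} \alpha_t^k(i)\, a_{ij}\, b_j(o_{t+1}^{(k)})\, \beta_{t+1}^k(j)}{\sum_{k=1}^{K} \frac{1}{P_k} \sum_{t=1}^{T_k - 1} \alpha_t^k(i)\, \beta_t^k(i)} \tag{4.63}$$

$$\bar{b}_j(l) = \frac{\sum_{k=1}^{K} \frac{1}{P_k} \sum_{t=1,\; o_t^{(k)} = v_l}^{T_k - 1} \alpha_t^k(j)\, \beta_t^k(j)}{\sum_{k=1}^{K} \frac{1}{P_k} \sum_{t=1}^{T_k - 1} \alpha_t^k(j)\, \beta_t^k(j)} \tag{4.64}$$

$\pi_i$ is not re-estimated because it is constrained by Eq 4.58b. By analogy with equation 4.58, the scaled transition probability equation can be written as:

$$\bar{a}_{ij} = \frac{\sum_{k=1}^{K} \sum_{t=1}^{T_k - 1} \alpha'^{k}_t(i)\, a_{ij}\, b_j(o_{t+1}^{(k)})\, \beta'^{k}_{t+1}(j)}{\sum_{k=1}^{K} \sum_{t=1}^{T_k - 1} \sum_{j=1}^{N} \alpha'^{k}_t(i)\, a_{ij}\, b_j(o_{t+1}^{(k)})\, \beta'^{k}_{t+1}(j)}$$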

5.1 Introduction

This chapter presents the flowchart of each module (algorithm) used in the project. These modules were described theoretically in the previous chapters (chapter 3 and chapter 4), where the equations of each step were derived and analyzed. The implementation scenarios for the project are also introduced.

5.2 Recording Procedure

This module allows the user to enter an utterance (an isolated word) into the program. The utterance is read from the sound card in bytes (8 bits) and then converted into signed words (16 bits). The sampling frequency and the number of bits per second are set for the recording process, and recording stops after the specified period. Figure 5.1 shows the flowchart of this process.

Figure 5.1: Flowchart of Recording Procedure

5.3 Speech End Point Algorithm

1. Read the entered utterance.

2. Block this utterance into overlapping frames.

3. Compute the energy and the zero crossing rate (ZCR) for each frame.

4. Compute the upper and lower energy thresholds (ITU and ITL).

5. Set the zero crossing threshold (ZCTH).

6. Search from the beginning of the signal for the first frame with energy greater than ITU.

7. Then move backward toward the beginning of the signal, searching for the first frame with energy lower than ITL.

8. Continue backward toward the beginning of the signal, comparing ZCR with ZCTH (if ZCR > ZCTH the signal is speech, else the signal is background noise).

9. If ZCR > ZCTH three times (this means that these three frames are speech), put the starting point (real start) at the last frame with ZCR > ZCTH.

10. Else (this means that these three frames are background noise), put the starting point (real start) at the first frame which has an energy < ITL.

11. Set the starting point = real start.

Figure 5.2: Flowchart of Speech End Point Algorithm.

The same steps are also applied from the end of the signal to find the real finish point.
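The per-frame measurements and the energy-based part of the search (steps 3, 6 and 7) can be sketched as follows. This is a simplified illustration only: the thresholds ITU and ITL are passed in as parameters because their computation is defined elsewhere, the zero crossing refinement of steps 8 to 10 is omitted, and all function names are hypothetical.

```python
def frame_energy(frame):
    """Short-time energy: sum of squared samples of one frame."""
    return sum(s * s for s in frame)

def zero_crossing_rate(frame):
    """Fraction of adjacent sample pairs whose signs differ."""
    crossings = sum(1 for a, b in zip(frame, frame[1:]) if (a >= 0) != (b >= 0))
    return crossings / (len(frame) - 1)

def find_energy_start(energies, itu, itl):
    """Steps 6 and 7: find the first frame above ITU, then move backward
    to the first frame whose energy is lower than ITL.

    Returns a frame index, or None if no frame exceeds ITU.
    """
    above = next((n for n, e in enumerate(energies) if e > itu), None)
    if above is None:
        return None
    n = above
    while n > 0 and energies[n] >= itl:
        n -= 1
    return n

# Hypothetical per-frame energies and thresholds.
energies = [0.1, 0.2, 0.6, 1.5, 3.0, 2.0]
start = find_energy_start(energies, itu=2.5, itl=0.5)
```

The finish point could be found by applying the same function to the reversed list of frame energies.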

5.4 Pre-emphasizer

This module reads the detected signal and applies the first-order pre-emphasis filter described earlier to each sample. Figure 5.3 shows the flowchart of this process.

Figure 5.3: Flowchart of Pre-emphasizer
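A first-order pre-emphasis filter can be sketched as below. The sketch assumes the common form y[n] = x[n] - a·x[n-1]; the default value a = 0.95 is only a typical choice and is an assumption here, as is the function name `pre_emphasize`.

```python
def pre_emphasize(signal, a=0.95):
    """Apply y[n] = x[n] - a * x[n-1]; the first sample is passed through."""
    if not signal:
        return []
    return [signal[0]] + [signal[n] - a * signal[n - 1]
                          for n in range(1, len(signal))]

emphasized = pre_emphasize([1.0, 1.0, 1.0], a=0.5)
```

A constant input is attenuated while rapid changes are preserved, which is the high-frequency boost the filter is meant to provide.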

5.5 Frame blocking and windowing

overlapping samples to start the second frame, and so on until the signal is complete. Figure 5.4 shows this process.

Figure 5.4: Flowchart of Overlapping Frame Blocking Procedure

As shown in figure 5.5, each overlapped frame is weighted by the Hamming window coefficients so as to minimize the LPC error. Each frame is read and windowed in turn until all frames have been processed.

Figure 5.5: Flowchart of Windowing Procedure
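Frame blocking with overlap followed by Hamming windowing can be sketched as follows. The sketch assumes that trailing samples which do not fill a complete frame are dropped and that the standard Hamming coefficients 0.54 and 0.46 are used; the function names are hypothetical.

```python
import math

def frame_blocks(signal, frame_len, overlap):
    """Split a signal into frames of frame_len samples sharing `overlap` samples."""
    step = frame_len - overlap
    return [signal[s:s + frame_len]
            for s in range(0, len(signal) - frame_len + 1, step)]

def hamming(frame_len):
    """Hamming window coefficients for a frame of frame_len samples."""
    return [0.54 - 0.46 * math.cos(2 * math.pi * n / (frame_len - 1))
            for n in range(frame_len)]

def window_frames(frames):
    """Weight every frame by the Hamming window."""
    w = hamming(len(frames[0]))
    return [[s * wn for s, wn in zip(frame, w)] for frame in frames]

frames = frame_blocks(list(range(10)), frame_len=4, overlap=2)
windowed = window_frames(frames)
```

Each row of `frames` is one frame, so the number of rows is the number of frames and the row length is the number of samples per frame.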

5.6 Parameterization

In this step the signal is converted into a sequence of vectors of weighted cepstral coefficients. Figure 5.6 shows a flowchart of this process.

[1] Reference to flowchart 7 [2] Reference to flowchart 8 [3] Reference to flowchart 9

[4] Reference to equation 3.29 by using 3.30 weighting function

5.6.1 Autocorrelation Coefficient

Figure 5.6: Flowchart of Parameterization

Figure 5.7: Flowchart of Computation of Autocorrelation Coefficients

5.6.2 LPC Coefficients

mentioned in section 3.5, this process is done for each individual frame, see figure 5.8.

[1] Reference to equation 3.21

Figure 5.8: Flowchart of Levinson-Durbin Algorithm
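The autocorrelation method followed by the Levinson-Durbin recursion can be sketched for one frame as below. The sketch assumes the predictor convention x[n] ≈ Σ a_k x[n-k] and a frame with non-zero energy; `autocorrelation` and `levinson_durbin` are hypothetical names and the example values are made up.

```python
def autocorrelation(frame, order):
    """Autocorrelation values r[0..order] of one windowed frame."""
    n = len(frame)
    return [sum(frame[i] * frame[i + lag] for i in range(n - lag))
            for lag in range(order + 1)]

def levinson_durbin(r, order):
    """Solve the LPC normal equations for the predictor coefficients.

    Returns ([a_1, ..., a_p], prediction error energy).
    """
    a = [0.0] * (order + 1)          # a[0] is unused
    error = r[0]
    for m in range(1, order + 1):
        # Reflection coefficient for model order m.
        k = (r[m] - sum(a[j] * r[m - j] for j in range(1, m))) / error
        new_a = a[:]
        new_a[m] = k
        for j in range(1, m):
            new_a[j] = a[j] - k * a[m - j]
        a = new_a
        error *= (1.0 - k * k)
    return a[1:], error

# Hypothetical autocorrelation values for a second-order analysis.
r = [2.0, 1.0, 0.2]
coeffs, err = levinson_durbin(r, 2)
```

The returned coefficients satisfy the normal equations Σ a_k r[|m-k|] = r[m] for m = 1, ..., p, which gives a simple way to verify the recursion.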

5.6.3 Cepstral Coefficients

Here the cepstral coefficients are computed from the LPC coefficients of each frame. As noted earlier, the number of cepstral coefficients may be greater than the number of LPC coefficients, so the first test is on the coefficient index: if it is less than p, the cepstral coefficient is computed using equation [1]; otherwise equation [2] is used. When the coefficient index exceeds the specified number of cepstral coefficients, the algorithm terminates. Initially, $c_0 = \ln \sigma^2$ is set.

[1] Reference to equation 3.27 [2] Reference to equation 3.28


Figure 5.9: Flowchart of Cepstral Coefficients
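The LPC-to-cepstrum recursion can be sketched as follows. The sketch assumes the widely used recursion c_m = a_m + Σ (k/m) c_k a_{m-k} for m up to the LPC order p, with the a_m term dropped for m > p, and the predictor convention x[n] ≈ Σ a_k x[n-k]; whether this matches equations 3.27 and 3.28 exactly is an assumption. The function name `lpc_to_cepstrum` and the parameter `gain_sq` (standing for σ²) are hypothetical.

```python
import math

def lpc_to_cepstrum(a, n_ceps, gain_sq=1.0):
    """Convert predictor coefficients a = [a_1, ..., a_p] to cepstral
    coefficients [c_0, c_1, ..., c_n_ceps]."""
    p = len(a)
    c = [math.log(gain_sq)]                      # c_0 = ln(sigma^2)
    for m in range(1, n_ceps + 1):
        acc = a[m - 1] if m <= p else 0.0        # a_m exists only for m <= p
        for k in range(max(1, m - p), m):
            acc += (k / m) * c[k] * a[m - k - 1]
        c.append(acc)
    return c

# One-pole example: for a_1 = 0.5 the cepstrum is c_m = 0.5**m / m.
cepstrum = lpc_to_cepstrum([0.5], 3)
```

The one-pole case is convenient for checking an implementation because its cepstrum is known in closed form.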

5.7 Design of Vector Quantizer

5.7.1 Clustering Algorithm

Figure 5.10: Flowchart of K-means Algorithm

[1] Reference to equation 3.31

[2] Average distance = sum of distances of training set vector which have the same index/the number of them
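The codebook construction can be sketched with a basic K-means (Lloyd) iteration. The sketch is illustrative only: it assumes the squared Euclidean distance, initial centroids drawn at random from the training vectors, and a fixed number of iterations; `build_codebook` and the helper names are hypothetical.

```python
import random

def squared_distance(x, y):
    """Squared Euclidean distance between two vectors."""
    return sum((a - b) ** 2 for a, b in zip(x, y))

def nearest(vector, codebook):
    """Index of the centroid closest to `vector`."""
    return min(range(len(codebook)),
               key=lambda k: squared_distance(vector, codebook[k]))

def build_codebook(training, size, iterations, rng):
    """Cluster the training vectors into `size` centroids."""
    codebook = [list(v) for v in rng.sample(training, size)]
    for _ in range(iterations):
        clusters = [[] for _ in range(size)]
        for v in training:                       # assignment step
            clusters[nearest(v, codebook)].append(v)
        for k, members in enumerate(clusters):   # centroid update step
            if members:                          # an empty cluster keeps its centroid
                dim = len(members[0])
                codebook[k] = [sum(m[d] for m in members) / len(members)
                               for d in range(dim)]
    return codebook

# Hypothetical training set with two well separated groups.
training = [[0.0, 0.0], [0.0, 1.0], [1.0, 0.0],
            [10.0, 10.0], [10.0, 11.0], [11.0, 10.0]]
codebook = sorted(build_codebook(training, 2, 10, random.Random(0)))
```

In practice the iteration would stop once the average quantization error no longer decreases, rather than after a fixed count.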

5.7.2 Classification Procedure

This procedure assigns each vector an index from the codebook. The distance between the vector and every codebook element is measured, the minimum distance is found, and the index of the corresponding codebook element is assigned to the vector.

Figure 5.11: Flowchart of the Classification Procedure

5.8 Software Design

The first stage in designing the system was to develop a preliminary prototype using MATLAB. This prototype consisted of various M-file implementations of the processing blocks described in Chapter 3. After the preliminary design and testing using MATLAB, the system was designed and developed from scratch using C++ in a Linux environment (Fedora Core 8).

5.8.1 MATLAB Design

This section lists the functions used to develop the system design in MATLAB. These functions were developed for the signal processing and VQ steps only.

asr.m

This was the main speech analysis script. It performed the following:

1- It uses analoginput() to get input from the user via the microphone.

2- It performs the windowing manually without calling any external function.

myVAD.m

It performs speech endpoint detection on an input signal using an end point detection algorithm described in chapter 3, with a flowchart given in figure 5.2.

preemp.m

Contains a function that performs pre-emphasis on an input signal with a given pre-emphasis parameter (a).

frameblock.m

It performs frame blocking on the pre-emphasized signal with a given overlap. It returns a matrix whose rows are the frames and whose number of columns is the number of samples per frame.

mylpc.m

It takes an input signal, and the order of LPC analysis. It calculates the LPC coefficients using the Levinson-Durbin algorithm.

cepsco.m

It uses the output LPC coefficients to compute the cepstral coefficients, on which the distortion measure is computed.

buildcb.m

It uses the K-means Lloyd algorithm (described in chapter 3 and in figure 3.8) to build a codebook of reference vectors from a training set of vectors.

distanc.m