Dynamic Bayesian Network Based ASR System

Bayesian networks (BNs) model a set of variables V. The variables can be discrete or continuous. DBNs extend this framework by modelling these variables at every discrete time step n. DBNs are a generalization of HMMs (Zweig, 1998; Stephenson, 2003) and belong to a larger group of probabilistic models called graphical models. From a graphical viewpoint, the variables are the vertices of a directed acyclic graph, with edges between the vertices, as illustrated in Figure 3.2.

[Figure 3.2: directed graph over the vertices q1, q2, q3, a1, a2, a3, x1, x2 and x3.]

Figure 3.2. Example of a DBN. Note that the visual representations of DBNs and HMMs differ: the vertices of a DBN represent variables, whereas the vertices of an HMM represent the values of a variable.

The edges have a parent-child relationship, i.e., each edge points from the parent vertex to the child vertex; e.g., vertex q1 is the parent of vertex x1. In our work, the edges do not span back in time and they span at most one time frame. Edges from continuous variables go only to continuous variables. Let pa(v) denote the set of all parents of an arbitrary vertex v, and let P(v|pa(v)) be the local probability distribution associated with v. The joint probability distribution of V is then the product of all the local probability distributions, as shown below:

$$P(V) = \prod_{v_n^i \in V} P\left(v_n^i \mid \mathrm{pa}(v_n^i)\right)$$

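Thus, for Figure 3.2, the joint distribution P(V_1^3) is the product of the local probability distributions of its nine vertices, where V_1^3 = {q1, x1, a1, q2, x2, a2, q3, x3, a3}.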
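As a minimal sketch of this factorization, the following Python snippet evaluates a joint probability as a product of local conditional distributions for a toy network of four binary variables. The graph structure (q1 → x1, q1 → q2, q2 → x2), the names `PARENTS`, `P_ONE`, `local_prob`, `joint_prob` and `total_mass`, and all probability values are illustrative assumptions; they are not the structure of Figure 3.2 or values from this work.

```python
from itertools import product

# Hypothetical toy network: parents of each vertex (assumed structure).
PARENTS = {
    "q1": (),
    "x1": ("q1",),
    "q2": ("q1",),
    "x2": ("q2",),
}

# Local distributions P(v = 1 | pa(v)), keyed by the tuple of parent values.
# All numbers are made up for illustration.
P_ONE = {
    "q1": {(): 0.6},
    "x1": {(0,): 0.2, (1,): 0.9},
    "q2": {(0,): 0.3, (1,): 0.7},
    "x2": {(0,): 0.2, (1,): 0.9},
}


def local_prob(vertex, assignment):
    """Return P(vertex = assignment[vertex] | pa(vertex))."""
    parent_values = tuple(assignment[p] for p in PARENTS[vertex])
    p_one = P_ONE[vertex][parent_values]
    return p_one if assignment[vertex] == 1 else 1.0 - p_one


def joint_prob(assignment):
    """Product of all local distributions, i.e. P(V) for one full assignment."""
    result = 1.0
    for vertex in PARENTS:
        result *= local_prob(vertex, assignment)
    return result


def total_mass():
    """Sum of the joint over all assignments; a valid factorization sums to 1."""
    names = list(PARENTS)
    return sum(
        joint_prob(dict(zip(names, values)))
        for values in product((0, 1), repeat=len(names))
    )
```

In this toy case the factored form is specified by 1 + 2 + 2 + 2 = 7 numbers, whereas an unrestricted joint table over four binary variables needs 2^4 - 1 = 15.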

Estimating p(V_1^3) without any statistical assumptions about the dependencies between variables would require many more local probability distributions; in other words, the DBN representing it would have many more edges than the one in the figure. Thus, one of the main purposes of DBNs is a sparse factorization of the joint distribution, obtained by learning certain dependencies between the variables. Similar to the forward-backward algorithm for HMMs, probabilistic inference consists of two passes: the first computes the likelihood of the observed data given the prior distribution, and the second computes the posterior distribution of the variables given the observed data. This posterior distribution is then used in EM training as the expected counts. During recognition, the data likelihood obtained from the first pass is used to find the most likely sequence of words. In the case of HMMs, the probabilistic dependencies and inference are determined at compile time, whereas in DBNs this is done at run time. This makes DBNs more flexible: whenever we want to change the variables or the statistical dependencies, we do not have to write a new program.
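The two-pass structure can be illustrated on the HMM special case. The sketch below is a plain forward-backward implementation for a discrete-observation HMM; it is not the inference code of the DBN software, and the function name `forward_backward` and its argument layout are assumptions made for this example. It omits the scaling that long sequences would need to avoid numerical underflow.

```python
def forward_backward(init, trans, emit, obs):
    """Two-pass inference for a discrete HMM.

    init[i]     : P(q_1 = i)
    trans[i][j] : P(q_{n+1} = j | q_n = i)
    emit[i][o]  : P(x_n = o | q_n = i)
    obs         : list of observed symbols x_1..x_N
    Returns (likelihood, posteriors) with posteriors[n][i] = P(q_n = i | obs).
    """
    n_states, n_frames = len(init), len(obs)

    # First pass (forward): alpha[n][i] = P(x_1..x_n, q_n = i).
    alpha = [[init[i] * emit[i][obs[0]] for i in range(n_states)]]
    for n in range(1, n_frames):
        alpha.append([
            emit[j][obs[n]]
            * sum(alpha[n - 1][i] * trans[i][j] for i in range(n_states))
            for j in range(n_states)
        ])
    likelihood = sum(alpha[-1])  # P(x_1..x_N)

    # Second pass (backward): beta[n][i] = P(x_{n+1}..x_N | q_n = i).
    beta = [[1.0] * n_states for _ in range(n_frames)]
    for n in range(n_frames - 2, -1, -1):
        for i in range(n_states):
            beta[n][i] = sum(
                trans[i][j] * emit[j][obs[n + 1]] * beta[n + 1][j]
                for j in range(n_states)
            )

    # State posteriors; in EM these play the role of expected counts.
    posteriors = [
        [alpha[n][i] * beta[n][i] / likelihood for i in range(n_states)]
        for n in range(n_frames)
    ]
    return likelihood, posteriors
```

The forward pass alone yields the data likelihood; the backward pass is only needed when posteriors are required, as in training.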

DBNs have recently been used in ASR research (Zweig, 1998; Bilmes, 1999; Bilmes and Zweig, 2002; Zweig et al., 2002; Livescu et al., 2003; Stephenson et al., 2004; Bilmes, 2004). In this thesis, we have used the DBN software developed by Todd Stephenson in his PhD thesis (Stephenson, 2003). For further details about the implementation and the probabilistic inference process, refer to (Stephenson, 2003, Chapter 3, Chapter 5 Section 5.3 and Appendix B). The components of the DBN software used in this thesis are dbnExpect and dbnMax (used for training) and dbnVite (used for recognition). dbnExpect is the E-step of EM training; it collects the posterior values for each of the hidden discrete variables, such as the transition variable and the mixture component variable. dbnMax is the M-step of EM training: the joint distribution of each variable with its parents is maximized according to all the posterior counts, and the conditional distribution of each variable given its parents is then obtained. Before the conditional distributions are saved, the variances of the acoustic feature vector are floored to 0.1 times the global variance of the training data. dbnVite performs Viterbi decoding in the DBN framework. It uses a simple language model with equal probability of transiting from any word to any other word. Furthermore, it does not incorporate the word insertion penalty and language scaling factor that are used in standard decoders such as HDecode in HTK (Young et al., 1997).
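The variance flooring mentioned above can be sketched as follows. This is an illustration and not the dbnMax code: the helper names `global_variance` and `floor_variances` are hypothetical, and the sketch assumes per-dimension (diagonal) variances and the population variance (division by the number of frames).

```python
def global_variance(frames):
    """Per-dimension variance over all training frames.

    frames is a list of equal-length feature vectors (lists of floats).
    Assumption: population variance, i.e. division by the number of frames.
    """
    n, dim = len(frames), len(frames[0])
    means = [sum(f[d] for f in frames) / n for d in range(dim)]
    return [sum((f[d] - means[d]) ** 2 for f in frames) / n for d in range(dim)]


def floor_variances(variances, global_var, factor=0.1):
    """Raise each variance to at least factor times the global variance
    of the same dimension; larger variances are left unchanged."""
    return [max(v, factor * g) for v, g in zip(variances, global_var)]
```

Flooring of this kind keeps a Gaussian that is supported by only a few frames from collapsing to a near-zero variance.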

In this thesis, the DBN-based ASR systems are trained in the following manner:

1. Initialization: Using the segmentation of the training set (also used to train the ANNs), the acoustic feature vectors for each state are clustered into the required number of mixtures for the GMMs. The mean vector and variance vector of each cluster are computed. The variances are then floored so that they are at least 0.1 times the global variance. The GMMs of that state are then initialized with these mean vectors and covariance matrices.

2. One iteration of EM training of the DBNs is performed, i.e. dbnExpect followed by dbnMax.

3. After each iteration, the difference between the log likelihoods output by dbnMax in two successive iterations is computed. If the difference is above 0.1%, another iteration of EM training is performed; otherwise the training ends. This convergence criterion was chosen so that the DBNs are reasonably well trained while still being trained in a reasonable amount of time.
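Steps 2 and 3 above can be sketched as a simple loop. The callables `e_step` and `m_step` are hypothetical stand-ins for dbnExpect and dbnMax, whose real interfaces are not described here. The sketch also assumes that "0.1%" means the relative change in log likelihood, and it adds a `max_iters` safety cap that is not part of the procedure described in the text.

```python
def train_until_converged(e_step, m_step, model, threshold=0.001, max_iters=100):
    """Run EM iterations until the relative log-likelihood change is small.

    e_step(model) -> expected counts              (stand-in for dbnExpect)
    m_step(model, counts) -> (new_model, log_lik) (stand-in for dbnMax)
    threshold=0.001 corresponds to 0.1% (assumed to be a relative change).
    Returns (model, log_lik, number_of_iterations_run).
    """
    prev = None
    log_lik = None
    for iteration in range(1, max_iters + 1):
        counts = e_step(model)
        model, log_lik = m_step(model, counts)
        if prev is not None:
            # Guard against a zero denominator; log likelihoods are
            # normally far from zero, so this branch is rarely taken.
            denom = abs(prev) if prev != 0 else 1.0
            if abs(log_lik - prev) / denom <= threshold:
                return model, log_lik, iteration
        prev = log_lik
    return model, log_lik, max_iters
```

At least two iterations are always run, because the criterion compares two successive log likelihoods.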
