Model Parameter Estimation - Linear Trajectory SHMMs

5.4 Linear Trajectory SHMMs

5.4.2 Model Parameter Estimation

To look at the model parameter estimation for the linear trajectory SHMMs, we should start from the general PTSHMMs. For reasons of mathemati-cal tractability and of trainability, all variability in PTSHMMs is modelled with Gaussian distributions assuming diagonal covariance matrices. In a

PTSHMM the “extra segmental” variation for a state is defined by a PDF defined in the set of possible state trajectories. In (Holmes and Russell 1999) it is assumed that this PDF is Gaussian. In PTSHMMs we define the ex-pected trajectory mid point mean of state σ_i as ν_i, the mid point variance as η_i, the trajectory slope mean as µ_i and the slope variance as γ_i. Suppose we have a single segment of feature vectors Y = [y₁, y₂, ..., y_T], and the trajec-tory f has a mid point c and a slope m, the probability of the observation Y and trajectory f given state σ_i is

p(Y, f jσ_i) = d_σ_i(T )p_σ_i(f )

In a PTSHMM there is one intra-segmental probability per frame in a segment but also one extra-segmental probability per segment. Different explanations of the data may use different numbers of the two types of prob-ability, depending on the number of segments. Recognition performance is thus dependent on a suitable balance between the different numbers of prob-ability contributions, which compromises performance (Holmes and Russell 1999).

Two approximations to 5.5 are considered in (Holmes and Russell 1999).

One is the “optimal trajectory” approximation, which was proposed by Rus-sell (RusRus-sell 1993). This method is to use an approximation by considering

p(y, f ) for only one specified trajectory ˆf :

f = argmaxˆ _fp(Y, f jσ_i), (5.6)

where ˆf is the most likely trajectory. The probability of the observation sequence given state i hence can be written as:

p(Y jσ_i) = p(Y, ˆf jσ_i). (5.7)

The second alternative method is the “fixed linear trajectory” SHMM. In the “fixed linear trajectory” SHMM the means of the PDFs which describe the trajectory distribution in a PTSHMM are treated as the actual mid-point and slope values of a single fixed linear trajecotry. Using this method, the probability of the observation sequence given state i is given by:

p(Y jσ_i) = p(Y, ¯f jσ_i), (5.8)

where ¯f is the linear trajectory with mid point m_i and slope c_i.

Holmes and Russell’s research (Holmes and Russell 1999) shows that the

“optimal trajectory” approximation method to PTSHMM doesn’t work very reliably and the speech recognition performance is poor using this method.

This appears to be because of the imbalance between the two types of prob-ability, the intra- and extra- segmental probabilities. The speech recognition performance for the “fixed linear trajectory” SHMM, however, is almost as good as the PTSHMM in which p(Y ) is calculated properly using the in-tegral. Thus in SEGVit, the software we developed to train and test our

segmental HMM system, the “fixed linear trajectory” method is chosen. An-other reason to choose the “fixed linear trajectory” SHMM is because that the SEGVit software was designed to support a more complicated segmental HMM structure. This is called a multiple-layer segmental HMM (Russell and Jackson 2005). This multiple-layer model has an ‘articulatory’ intermediate layer which can be transferred into or from the acoustic domain using linear or non-linear mappings. In a single layer model, this integral in 5.5 is tractable because the PDFs are assumed to be Gaussian. This is also likely to be the case for a multiple-level model in which the “articulatory-to-acoustic” map-ping is linear. However, for a non-linear mapmap-ping this is no longer the case.

It was decided that for simplicity the “fixed linear trajectory” SHMM would be implemented in SEGVit. The “fixed linear trajectory” SHMM is the type of segmental model which is used in this thesis. In the following chapters the term SHMM actually means the “fixed linear trajectory” SHMM.

If Y^T = [y₁, y₂, . . . , y_T] is a sequence of acoustic feature vectors and a sample for state σi, the probability density of Y^T given state σi is given by:

p(Y^Tjσi) = bi(Y^T) = di(T ) N (yt; fi(t), Vi) is a D dimensional Gaussian probability density function with mean f_i(t) and covariance matrix V_i (it is assumed that V_i is diagonal). The case m_i = 0 corresponds to a constant trajectory SHMM. If, in addition, d_i is a geometric probability density function then this is functionally identical to a conventional HMM except for an upper bound τ_max on state duration.

With a trajectory structure in the model, the probability calculations in a SHMM system take more computer running time than the calculations in a conventional HMM system. To save the computational load, in this thesis the SHMM model parameters are optimized using an estimation-maximization scheme based on segmental Viterbi decoding. ˆα_t(i) is defined to be the joint probability of the acoustic sequence y₁^t = [y₁, y₂, ..., y_t] and the partial state sequence X₁^t = [x₁, x₂, ..., x_t] which maximizes the probability p(y₁^t, x^t₁) given the model parameters and given that the state at time t + 1 is not state i. It can be written as:

α_t(i) = max_dmax_jαˆ_t−d(j)a_ijb_i(y_t−d+1, y_t−d+2, ..., y_t), (5.10)

where d is the duration, aji is the transition probability from state i to state j, and bi(yt−d+1, yt−d+2, ..., yt) is the state probability of the observation sequence of duration d, emitted by state i.

Because ˆα_t(i) is decided by the maximal value of duration d and the maximal value of ˆα_t(j) at state j, the Viterbi algorithm helps the probability to be traced back to find an optimal state sequence. Once this is finished, the slope m⁰(Y^T) and mid-point value c⁰(Y^T) of the linear trajectory which provides the best fit to Y^T (in a least-squared error sense) can be calculated.

They are given by

The segmental Viterbi training procedure is applied in an iterative man-ner until the system converges and a local optimum is reached.

5.5 Summary

This Chapter has introduced the segmental HMM as a new model for speaker verification. The SHMM has a better structure than the conventional HMM which makes it well suited for speaker modelling. The model associates its states with variable-length sequences of acoustic feature vectors. By mod-elling the dynamic regions in speech, which may reflect some of the underlying mechanisms that give rise to individual differences, this model better char-acterizes the variations of a person’s voice and thus should improve speaker verification accuracy.

The theory of the original Probabilistic-Trajectory SHMM was presented as well as its two approximation methods. The “fixed linear trajectory”

SHMM is an alternative which works about as same as the original PT-SHMM in terms of the speech recognition performance. The “fixed linear trajectory” SHMM is hence applied to text-dependent and text-independent speaker verification in this thesis. For model parameter estimation, the seg-mental Viterbi decoder provides an iterative maximum likelihood estimation technique.

Our application of SHMMs focused on text-dependent speaker verifica-tion on the YOHO corpus and text-independent speaker verificaverifica-tion on the Switchboard speech database. The next two chapters examines many issues related to the training of speaker models and the performance of the SHMM

speaker verification system.

CHAPTER 6 Text-Dependent Speaker Verification

The most straightforward application of SHMMs to speaker recognition is text-dependent speaker verification (TD-SV). This is because a conventional TD-SV system typically uses phone-level HMMs, which can simply be re-placed by the corresponding phone-level SHMMs.

Suppose that a sequence of acoustic feature vectors Y = [y₁, ..., y_τ] is claimed to result from subject S speaking a text W . The decision whether to accept or reject this claim is based on the likelihood ratio:

L(S) = p(Y jS, W )

p(Y jW ) (6.1)

where p(Y jS, W ) is computed using a set of phone-level models for speaker S, configured to represent the text W , and p(Y jW ) is calculated using a set of speaker-independent background models configured to represent W . If the likelihood ratio L(S) is bigger than a preset threshold, the claim is accepted.

We built a conventional HMM system and a segmental HMM system,

both using the same set of context-sensitive triphone model labels. Both systems were trained on the TIMIT and YOHO training material and tested on the YOHO test set.

6.1 Experimental Method

In document The role of dynamic features in speaker verification (Page 83-91)