Iterative Viterbi-Segmentation - Discriminative features for GMM and i-vector based speaker diarization

The baseline system models each set of clusters using an ergodic hidden Markov model (HMM) in which each state represents one cluster. Given a set of speech segments {X1, X2, ..., Xn}, the baseline system finds the optimal number of clusters k and their corresponding acoustic models that produce the best segmentation using the following equation:

\[
\theta_k^{*},\, k^{*} = \arg\max_{\theta_k,\, k} \left\{ \Pr\left(X, p_{best} \mid \theta_k, k\right) \right\} \tag{3.4}
\]

where p_best is the Viterbi path with the highest likelihood (i.e., the sequence of states that produces the maximum likelihood given the observations). When the algorithm terminates, each remaining state is considered to represent a different speaker. The Viterbi segmentation is performed to refine the initial segmentation and improve the speaker boundaries [Tranter and Reynolds, 2006].
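
The search for the best path p_best can be illustrated with a minimal log-domain Viterbi decoder. This is only a sketch under stated assumptions: the function name `viterbi` and its three arguments are hypothetical, the per-frame log-likelihoods are assumed to be precomputed from each cluster's acoustic model, and the minimum duration constraint is not included.

```python
def viterbi(log_init, log_trans, log_obs):
    """Return (best_path, best_log_score) for an HMM, working in the log domain.

    log_init[s]     : log prior of starting in state s
    log_trans[r][s] : log weight of moving from state r to state s
    log_obs[t][s]   : log-likelihood of frame t under the model of state s
    (All three inputs are hypothetical, precomputed tables.)
    """
    n_states = len(log_init)
    # delta[s] = best log score of any path that ends in state s at the current frame
    delta = [log_init[s] + log_obs[0][s] for s in range(n_states)]
    backptr = []
    for t in range(1, len(log_obs)):
        new_delta, ptr = [], []
        for s in range(n_states):
            # Best predecessor of state s at frame t
            best_prev = max(range(n_states), key=lambda r: delta[r] + log_trans[r][s])
            ptr.append(best_prev)
            new_delta.append(delta[best_prev] + log_trans[best_prev][s] + log_obs[t][s])
        delta = new_delta
        backptr.append(ptr)
    # Backtrack from the best final state to recover the state sequence
    last = max(range(n_states), key=lambda s: delta[s])
    path = [last]
    for ptr in reversed(backptr):
        path.append(ptr[path[-1]])
    path.reverse()
    return path, delta[last]
```

Each entry of the returned path is a state index, so consecutive runs of the same index correspond to speaker segments and the changes between runs are the hypothesized speaker boundaries.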

Figure 3.3: Example of minimum duration constraint.

We want to find a set of clusters and their acoustic models that maximize the likelihood of the data based on this HMM topology. Since we do not want to consider all possible values for k, a maximum value is selected for k using the initial segmentation outlined in Section 3.2. After each iteration of merging clusters, the value of k is reduced until we find an optimal number of clusters k∗ and their acoustic models θ_k∗.

A minimum duration (MD) constraint is also imposed on the HMM topology, as shown in Figure 3.3. Each state of the HMM consists of a set of sub-states that impose a minimum duration for each model. Each sub-state has a probability density function modeled by a Gaussian mixture model (GMM), and the same GMM is tied across all sub-states of a given state. After entering a state at time n, the model moves to the following sub-state with probability 1.0 until the last sub-state is reached. In the last sub-state, it can remain in the same sub-state with transition weight α, or jump to the first sub-state of another state with weight β/K, where K is the number of active states at that time.
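
The topology described above can be written down as a table of transition weights. The following is a minimal sketch, assuming a flat indexing of sub-states (`state * n_sub + sub`); the function name `build_md_transitions` and its parameters are hypothetical and not taken from the baseline system. The number of sub-states needed for a given minimum duration depends on the frame rate, which is not stated here.

```python
def build_md_transitions(n_states, n_sub, alpha, beta):
    """Build the transition-weight matrix of a minimum-duration HMM.

    Sub-state j of state i has flat index i * n_sub + j.
    Returns a square list-of-lists with n_states * n_sub rows.
    """
    size = n_states * n_sub
    weights = [[0.0] * size for _ in range(size)]
    for i in range(n_states):
        first = i * n_sub
        last = first + n_sub - 1
        # Inside a state: forced move to the next sub-state with weight 1.0
        for idx in range(first, last):
            weights[idx][idx + 1] = 1.0
        # Last sub-state: self-loop with weight alpha
        weights[last][last] = alpha
        # Last sub-state: jump to the first sub-state of every other state,
        # with weight beta / K, where K is the number of active states
        for other in range(n_states):
            if other != i:
                weights[last][other * n_sub] = beta / n_states
    return weights
```

Because a state can only be left from its last sub-state, any path through the model stays in each state for at least `n_sub` frames, which is how the constraint is enforced during decoding.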

After two clusters are merged at each iteration, the total number of parameters in the HMM decreases. The likelihood scores therefore decrease from one iteration to the next, because the same amount of data is modeled with fewer parameters. Since the merging process decreases the likelihood in Equation 3.4, a threshold value has to be selected to stop the merging process.

3.4 Speaker Clustering

Once the speech segments have been generated by Viterbi segmentation, speaker clustering iteratively merges the segments that belong to the same speaker. Each speaker in the audio is modeled by a single cluster, which contains all the speech of that speaker.

The baseline system is based on the most widely used agglomerative hierarchical clustering (AHC) technique. The speech segments generated by Viterbi segmentation are modeled by Gaussian mixtures, fitting the probability distribution of the features with the classical expectation-maximization (EM) algorithm. Segments that belong to the same speaker are represented by a single model. The minimum duration of a speaker segment is restricted to 3 seconds, as in [Ajmera and Wooters, 2003]. The selection of 3 seconds as the minimum duration in the baseline system is also justified in [Luque, 2012] (see Figure 3.4). The figure shows that a minimum duration of 3 seconds provides the lowest DER among the minimum duration values compared.

Figure 3.4: DER results on NIST Transcription 2006 and 2007 conference evaluation data when the minimum duration is taken into account in the HMM decoding.

The figure is taken from page 167 of the thesis of [Luque, 2012] (baseline system).

The clustering technique groups acoustically similar segments based on the Bayesian information criterion (BIC) metric among Gaussian distributions. At each iteration, the two segments with the highest BIC distance are merged. The HMM decoding process is then repeated, a new mixture of Gaussians is assigned to the new set of clusters, and the similarity matrix of the cluster pairs is updated. This procedure is iterated until the stopping criterion is met, that is, until the maximum BIC distance among all cluster pairs is less than 0. Finally, the speaker diarization system outputs the hypothesis results (see Figure 3.1, block C).
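
The merge-and-stop loop can be sketched in a few lines. This is a deliberately simplified illustration, not the baseline implementation: it assumes one-dimensional features and a single Gaussian per cluster (the baseline uses GMMs), it adds an explicit BIC penalty weighted by a hypothetical parameter `lam`, and it omits the Viterbi re-segmentation between merges. The sign convention follows the text: a higher score means the pair is more likely to be the same speaker, and merging stops when the best score falls below 0.

```python
import math

def log_var(xs):
    """Log of the maximum-likelihood variance, with a small floor for stability."""
    n = len(xs)
    mean = sum(xs) / n
    var = sum((x - mean) ** 2 for x in xs) / n
    return math.log(max(var, 1e-6))

def merge_score(a, b, lam=1.0):
    """BIC of the merged model minus BIC of the two separate models (1-D Gaussians)."""
    n_a, n_b = len(a), len(b)
    n = n_a + n_b
    # Log-likelihood change caused by merging; the constant terms cancel
    delta_loglik = -0.5 * n * log_var(a + b) + 0.5 * n_a * log_var(a) + 0.5 * n_b * log_var(b)
    # Two separate Gaussians have 2 more parameters than one: penalty 0.5 * 2 * log(n)
    return delta_loglik + lam * math.log(n)

def agglomerative_clustering(segments, lam=1.0):
    """Merge the best-scoring pair until the best score is below 0."""
    clusters = [list(s) for s in segments]
    while len(clusters) > 1:
        score, i, j = max(
            (merge_score(clusters[i], clusters[j], lam), i, j)
            for i in range(len(clusters))
            for j in range(i + 1, len(clusters))
        )
        if score < 0:  # stopping criterion
            break
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters
```

Recomputing every pairwise score after each merge keeps the sketch short; a similarity matrix that is updated only for the merged cluster, as described above, avoids the redundant work.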

There are different ways of performing speaker segmentation and speaker clustering in speaker diarization. One method is to perform segmentation first and then run speaker clustering. This method lacks flexibility, since it doesn’t provide the option of correcting speaker segmentation errors. The other method is to perform speaker segmentation and speaker clustering together, iteratively, which makes it possible to correct segmentation errors. The UPC baseline system uses the second method. It uses an iterative bottom-up strategy based on HMM alignments and BIC values. Segments that belong to the same speaker are combined in a new model at each iteration. A time constraint is imposed, as in [Ajmera and Wooters, 2003], on the duration of the speaker segments through a hierarchical modeling of each state, as shown in Figure 3.3. The Viterbi decoding decisions are based on the observation probabilities estimated from the likelihoods accumulated per cluster/state over a 3-second window. This procedure is carried out iteratively until the stopping criterion is reached, that is, until the highest BIC distance score among the set of clusters is less than 0. Finally, the system outputs the speaker segmentation. Since the segmentation and clustering steps are performed iteratively in the baseline system, the errors made in the segmentation step are corrected in the clustering.
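
The accumulation of likelihoods over a window can be illustrated with a sliding sum of per-frame log-likelihoods. This is a sketch only: the function name `windowed_log_likelihood` and its inputs are hypothetical, and the window length in frames has to be derived from the frame rate, which is not given here.

```python
def windowed_log_likelihood(frame_loglik, window):
    """Sum per-frame log-likelihoods over a sliding window, per cluster.

    frame_loglik[t][k] : log-likelihood of frame t under cluster k
    window             : window length in frames
    Returns out[w][k], the accumulated score of cluster k over frames w .. w + window - 1.
    """
    n_frames = len(frame_loglik)
    if n_frames < window:
        return []
    n_clusters = len(frame_loglik[0])
    # Score of the first window for each cluster
    sums = [sum(frame_loglik[t][k] for t in range(window)) for k in range(n_clusters)]
    out = [list(sums)]
    # Slide the window: add the entering frame, remove the leaving frame
    for t in range(window, n_frames):
        for k in range(n_clusters):
            sums[k] += frame_loglik[t][k] - frame_loglik[t - window][k]
        out.append(list(sums))
    return out
```

Comparing the accumulated scores of the clusters, rather than single-frame scores, makes the decision for each window less sensitive to individual noisy frames.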
