Tied Mixture of Factor Analyzers - Subspace Gaussian Mixture Models for Language Identification

TMFA [Kenny et al., 2004; Miguel et al., 2014] is the multimodal version of TFA. It is related to MFA, but the hidden variable is tied to a set of data points. Tying the hidden variable allows us to capture the correlations among different Gaussian components: we learn how the points generated by a given Gaussian component are distributed if we know their distribution in the rest of the Gaussian components. As in TFA, the hidden variable can learn one underlying aspect of the speech, like the speaker or the channel, and be used to adapt a universal model to a specific scenario; in addition, multimodality is considered.

Unlike MFA, the sign of each subspace is important, because it is related to the signs of the rest of the subspaces. In MFA, we can change the sign of any subspace and obtain the same model, because the subspaces only have a local meaning. However, in TMFA, if we change the sign of a single subspace we will not obtain the same solution. An equivalent solution is obtained only by changing the sign of all the subspaces simultaneously.

In this section we present the exact formulation of TMFA. However, the obtained solution is infeasible for real cases due to the required computational load. This model was first introduced, with some approximations to make it tractable, under the name of JFA [Kenny and Dumouchel, 2004; Kenny, 2006; Kenny et al., 2007]. JFA was an evolution of the MAP techniques presented in Section 2.8. Initially, it contained up to three hidden variables, two of them to model the speaker and one to model the channel. Later, it was used for many other problems, like LID, with different numbers of hidden variables [Castaldo et al., 2007b; Hubeika et al., 2008; Campbell et al., 2008; Brümmer et al., 2009; Verdet et al., 2009; Jancik et al., 2010]. Recently, other approximations, more accurate than JFA, have been presented with success [Miguel et al., 2014]. Given the importance of JFA in this Thesis, it will be presented separately in Section 2.11.

In TMFA, an observation at time n generated by component k can be expressed as

$$o_n = \mu_k + W_k x + \epsilon_k. \qquad (2.28)$$

We can see the graphical model in Figure 2.12a. In Figure 2.12b we can also see its expanded version. Unlike MFA, whose graphical model is shown in Figure 2.7b, where each data point has an associated hidden variable, TMFA has a hidden variable which is common to a set of N data points.
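The generative process behind eq. (2.28) can be sketched in a few lines of Python. This is only an illustrative sketch, not the implementation used in this Thesis: the function name `sample_tmfa_file`, the array layout, and the toy parameter values are all hypothetical, and the noise covariance of each component is assumed to be diagonal. The point to notice is that x is drawn once per file, whereas the component index is drawn once per frame.

```python
import numpy as np


def sample_tmfa_file(rng, weights, means, subspaces, noise_vars, n_frames):
    """Draw one file of n_frames observations following eq. (2.28).

    weights:    (K,)      component priors
    means:      (K, D)    component means mu_k
    subspaces:  (K, D, R) subspace matrices W_k
    noise_vars: (K, D)    diagonal of each noise covariance (assumed diagonal)
    """
    n_components, dim, rank = subspaces.shape
    # One hidden vector shared by every frame of the file (the "tied" part).
    x = rng.standard_normal(rank)
    # One active Gaussian component per frame.
    z = rng.choice(n_components, size=n_frames, p=weights)
    # Component-dependent noise for each frame.
    eps = rng.standard_normal((n_frames, dim)) * np.sqrt(noise_vars[z])
    # o_n = mu_k + W_k x + eps_k, evaluated for all frames at once.
    obs = means[z] + np.einsum("ndr,r->nd", subspaces[z], x) + eps
    return obs, x, z


# Hypothetical toy parameters: K = 2 components, D = 2, one-dimensional x.
rng = np.random.default_rng(0)
toy_weights = np.array([0.5, 0.5])
toy_means = np.array([[-3.0, 0.0], [3.0, 4.0]])
toy_subspaces = np.array([[[1.0], [0.5]], [[0.8], [1.2]]])
toy_noise = np.full((2, 2), 0.1)

toy_obs, toy_x, toy_z = sample_tmfa_file(
    rng, toy_weights, toy_means, toy_subspaces, toy_noise, n_frames=10
)
```

In an MFA sampler, by contrast, `x` would be redrawn inside the loop over frames, so points from different components would share no information.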

2.10.1 Exact Calculation

Observe the difference between the graphical models of TMFA in Figure 2.12 and the graphical model of MFA in Figure 2.7b. In TMFA, the whole sequence of N observed variables, $O = o_1 \ldots o_N$, depends on the same hidden variable, x, and hence they are not independent of each other unless x is known. Thus we have to model the whole sequence of observations together and capture the correlations of the observed variables at different times n. This is achieved by modeling the concatenation of all the observed variables as a single vector $\bar{O} = [o_1; \ldots; o_N]$ [Miguel et al., 2014]. From now on, we will refer to such concatenations of vectors as supervectors.

Figure 2.12: Graphical model of TMFA in short and expanded notation. (a) Graphical model of TMFA. The hidden variable, x, is common to a set of N data points. (b) Expanded graphical model of TMFA. We see that N observations depend on the same hidden variable x.

To compute the marginal distribution of $\bar{O}$, we have to integrate out the hidden variable, which follows a standard normal distribution, and sum over the sequence of indicator variables $Z = z_1 \ldots z_N$. Then

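$$
\begin{aligned}
P(\bar{O}) &= \sum_{s_1=1}^{K} \cdots \sum_{s_N=1}^{K} \int p(\bar{O}, x, z_1 = z_{s_1}, \ldots, z_N = z_{s_N})\, dx \\
&= \sum_{s_1=1}^{K} \cdots \sum_{s_N=1}^{K} \int \left[ \prod_{n} p(o_n \mid x, z_n = z_{s_n})\, p(z_n = z_{s_n}) \right] p(x)\, dx \\
&= \sum_{s_1=1}^{K} \cdots \sum_{s_N=1}^{K} \omega_{s_1} \cdots \omega_{s_N}\, \mathcal{N}(\bar{O} \mid \bar{\mu}_s, \bar{\Sigma}_s), \qquad (2.29)
\end{aligned}
$$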

where $s$ is an integer from 1 to $K^N$ that identifies the current sequence of Gaussian indices $s_1, \ldots, s_N$; $s_n$ indicates the active Gaussian component of sequence $s$ at time $n$; $K$ is the number of Gaussian components; $p(z_n = z_{s_n})$ is the probability that Gaussian component $s_n$ is active at time $n$, also written $p(z_{s_n} = 1)$; $\bar{\mu}_s$ is the concatenation of the means of the Gaussian components indicated by $s$; and $\bar{\Sigma}_s$ is a $DN \times DN$ matrix, with $D$ the dimension of each observation, that has the following structure

$$\bar{\Sigma}_s = \bar{W}_s \bar{W}_s^\top + \bar{\Psi}_s, \qquad (2.30)$$

where $\bar{W}_s = [W_{s_1}^\top \ldots W_{s_N}^\top]^\top$ is the concatenation of the subspace matrices corresponding to the index combination $s$, and $\bar{\Psi}_s$ is a very large block-diagonal matrix whose $n$th diagonal block is $\Psi_{s_n}$, with the off-diagonal blocks set to 0. In order to keep notation uncluttered, we express it as

$$P(\bar{O}) = \sum_{s=1}^{K^N} \bar{\omega}_s\, \mathcal{N}(\bar{O}; \bar{\mu}_s, \bar{\Sigma}_s), \qquad (2.31)$$

where now $s$ goes through all $K^N$ possible combinations of Gaussian indices, and $\bar{\omega}_s$ is the product of the Gaussian weights corresponding to the combination $s$. As per eq. (2.31), the model is equivalent to a mixture of factor analyzers with $K^N$ components.

Figure 2.13: Example of TMFA (plot title: "Train Data with Tied Mixture of Factor Analysers"). TMFA modeling two clusters of data. This model is able to learn correlations among different Gaussians. See the subspaces spanned by the arrows. In the lower part of the arrows we find dark blue, red, green and black points in the two components, while on the top part we find yellow, cyan, pink, and another set of black points in the two components.
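To make the cost of eq. (2.31) concrete, the marginal can be evaluated by brute force for a very short sequence. The following is a minimal sketch under stated assumptions: the function names, the array layout, and the toy values are hypothetical, the noise covariances are assumed diagonal, and only NumPy is used. It enumerates every index sequence, builds the supervector mean and the covariance of eq. (2.30), and accumulates the weighted Gaussian densities in the log domain.

```python
import itertools

import numpy as np


def gaussian_logpdf(v, mean, cov):
    """Log density of a multivariate Gaussian evaluated at v."""
    diff = v - mean
    _, logdet = np.linalg.slogdet(cov)
    maha = diff @ np.linalg.solve(cov, diff)
    return -0.5 * (len(v) * np.log(2.0 * np.pi) + logdet + maha)


def tmfa_exact_loglik(obs, weights, means, subspaces, noise_vars):
    """Log of eq. (2.31), enumerating all K**N sequences of Gaussian indices.

    obs:        (N, D)    observations of one file
    weights:    (K,)      component weights
    means:      (K, D)    component means
    subspaces:  (K, D, R) subspace matrices W_k
    noise_vars: (K, D)    diagonal of each noise covariance (assumed diagonal)
    """
    n_frames, dim = obs.shape
    n_components = len(weights)
    o_bar = obs.reshape(-1)  # supervector [o_1; ...; o_N]
    terms = []
    for seq in itertools.product(range(n_components), repeat=n_frames):
        idx = list(seq)
        mu_bar = means[idx].reshape(-1)
        # Stack W_{s_1}, ..., W_{s_N} vertically into a (D*N, R) matrix.
        w_bar = subspaces[idx].reshape(n_frames * dim, -1)
        psi_bar = np.diag(noise_vars[idx].reshape(-1))
        sigma_bar = w_bar @ w_bar.T + psi_bar  # eq. (2.30)
        log_weight = np.sum(np.log(weights[idx]))
        terms.append(log_weight + gaussian_logpdf(o_bar, mu_bar, sigma_bar))
    return np.logaddexp.reduce(terms)


# Hypothetical toy case: K = 2 and N = 3 give 2**3 = 8 terms in the sum.
toy_weights = np.array([0.4, 0.6])
toy_means = np.array([[-1.0, 0.0], [1.5, 2.0]])
toy_subspaces = np.array([[[1.0], [0.5]], [[0.8], [1.2]]])
toy_noise = np.array([[0.5, 0.7], [0.6, 0.4]])
toy_obs = np.array([[-0.5, 0.3], [1.0, 2.5], [0.2, 0.1]])

toy_loglik = tmfa_exact_loglik(
    toy_obs, toy_weights, toy_means, toy_subspaces, toy_noise
)
```

The loop runs $K^N$ times and each iteration works with a $DN \times DN$ matrix, so the sketch is usable only for toy sizes. This is the computational load that makes the exact solution infeasible in real cases.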

The model parameters must be computed with an iterative process, like the EM algorithm presented in Section A.7.1 of Appendix A.

2.10.2 Example

In Figure 2.13, we have an example of TMFA with J = 5, N_j = 10, and K = 2, obtained after 10 iterations of the EM algorithm. We artificially created 10 data points for each file j, and the points of each file are plotted in a different color. The data are 2-dimensional, whereas the dimension of the hidden variable, x, is 1. In MFA, we obtained a subspace for each component pointing in the direction of maximum variability, and the model only captured correlations among points generated by the same Gaussian component. In TMFA, by contrast, the model learns correlations among points generated by different Gaussian components. See, for example, the blue points in the figure: they are at the bottom of the arrow in both Gaussians. The yellow points, in turn, are at the top of the arrow in both. The hidden variable for the blue points is always low, while for the yellow points it is always high. The reason is that the model has learnt the correlations among points generated by different Gaussian components. The correlation among points of Gaussian component k = 1 is given by $W_1 W_1^\top$, whereas the correlation between points of Gaussian components k = 1 and k = 2 is given by $W_1 W_2^\top$. The subspaces can no longer be considered only locally; they must be considered globally. We can see that, unlike MFA, the sign of the subspace is important. As we said at the beginning of this section, only changing the sign of all the subspaces simultaneously would give an equivalent solution.
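The role of the sign can be checked numerically on the supervector covariance of eq. (2.30). The sketch below is only an illustration with hypothetical names and values, and it assumes diagonal noise covariances. It takes two frames, the first generated by component 1 and the second by component 2, and compares the covariance before and after flipping the sign of the subspaces.

```python
import numpy as np


def supervector_cov(subspaces, noise_vars, seq):
    """Covariance of eq. (2.30) for the sequence of Gaussian indices seq."""
    idx = list(seq)
    n_frames, dim = len(idx), subspaces.shape[1]
    w_bar = subspaces[idx].reshape(n_frames * dim, -1)
    psi_bar = np.diag(noise_vars[idx].reshape(-1))
    return w_bar @ w_bar.T + psi_bar


# Hypothetical subspaces W_1 and W_2 (D = 2, one-dimensional x) and noise.
W = np.array([[[1.0], [0.5]], [[0.8], [1.2]]])
Psi = np.full((2, 2), 0.3)
seq = (0, 1)  # frame 1 from component 1, frame 2 from component 2

base = supervector_cov(W, Psi, seq)

flip_one = W.copy()
flip_one[1] *= -1.0  # change the sign of W_2 only
flip_all = -W        # change the sign of every subspace

cov_one = supervector_cov(flip_one, Psi, seq)
cov_all = supervector_cov(flip_all, Psi, seq)

# Off-diagonal block: correlation between the two frames, W_1 W_2^T.
cross_block = base[:2, 2:]
# Diagonal block of the second frame, W_2 W_2^T + Psi_2, which is all MFA sees.
within_unchanged = np.allclose(base[2:, 2:], cov_one[2:, 2:])
same_if_one_flipped = np.allclose(base, cov_one)
same_if_all_flipped = np.allclose(base, cov_all)
```

Flipping only W_2 leaves the diagonal blocks untouched, which is why the sign is irrelevant in MFA, but it negates the cross block $W_1 W_2^\top$ and therefore changes the TMFA model. Flipping all the subspaces leaves every block unchanged.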
